1. 07 May, 2007 40 commits
    • Make page->private usable in compound pages · d85f3385
      Christoph Lameter authored
      If we add a new flag so that we can distinguish between the first page and the
      tail pages then we can avoid using page->private in the first page.
      page->private == page for the first page, so there is no real information in
      there.
      
      Freeing up page->private makes the use of compound pages more transparent.
      They become more usable like real pages.  Right now we have to be careful, e.g.
      if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
      can then no longer use the private field.  This is one of the issues that
      cause us not to support debugging for page size slabs in SLAB.
      
      Having page->private available for SLUB would allow more meta information in
      the page struct.  I can probably avoid the 16 bit ints that I have in there
      right now.
      
      Also if page->private is available then a compound page may be equipped with
      buffer heads.  This may free up the way for filesystems to support larger
      blocks than page size.
      
      We add PageTail as an alias of PageReclaim.  Compound pages cannot currently
      be reclaimed.  Because of the alias one needs to check PageCompound first.
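
      The ordering requirement can be illustrated with a minimal userspace
      sketch.  The bit positions and the names page_sketch, page_compound()
      and page_tail() are assumptions made for illustration; they are not the
      kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit positions; PG_TAIL deliberately shares a bit with PG_RECLAIM. */
enum {
	PG_COMPOUND = 1u << 0,
	PG_RECLAIM  = 1u << 1,
	PG_TAIL     = PG_RECLAIM
};

struct page_sketch {
	unsigned int flags;
};

static bool page_compound(const struct page_sketch *p)
{
	return (p->flags & PG_COMPOUND) != 0;
}

/* The shared bit only means "tail" once the compound check has passed. */
static bool page_tail(const struct page_sketch *p)
{
	return page_compound(p) && (p->flags & PG_TAIL) != 0;
}
```

      A page that only has the reclaim bit set is therefore never mistaken
      for a tail page in this sketch.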
      
      The RFC for this approach was discussed at
      http://marc.info/?t=117574302800001&r=1&w=2
      
      [nacc@us.ibm.com: fix hugetlbfs]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • PowerPC: Disable SLUB for configurations in which slab page structs are modified · 30520864
      Christoph Lameter authored
      PowerPC uses the slab allocator to manage the lowest level of the page
      table.  In high cpu configurations we also use the page struct to split the
      page table lock.  Disallow the selection of SLUB for that case.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: allocate smallest object size if the user asks for 0 bytes · 614410d5
      Christoph Lameter authored
      Makes SLUB behave like SLAB in this area to avoid issues....
      
      Throw a stack dump to alert people.
      
      At some point the behavior should be switched back.  NULL is no memory as
      far as I can tell and if the user asked for 0 bytes then they should get no
      memory.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: change default alignments · 47bfdc0d
      Christoph Lameter authored
      Structures may contain u64 items on 32 bit platforms that are only able to
      address 64 bit items on 64 bit boundaries.  Change the minimum alignment of
      slabs to conform to those expectations.
      
      ARCH_KMALLOC_MINALIGN must be changed for good since a variety of structures
      are mixed in the general slabs.
      
      ARCH_SLAB_MINALIGN is changed because currently there is no consistent
      specification of object alignment.  We may have that in the future when the
      KMEM_CACHE and related macros are used to generate slabs.  These pass the
      alignment of the structure generated by the compiler to the slab.
      
      With KMEM_CACHE etc we could align structures that do not contain 64
      bit values to 32 bit boundaries potentially saving some memory.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB core · 81819f0f
      Christoph Lameter authored
      This is a new slab allocator which was motivated by the complexity of the
      existing code in mm/slab.c. It attempts to address a variety of concerns
      with the existing implementation.
      
      A. Management of object queues
      
         A particular concern was the complex management of the numerous object
         queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
         each allocating CPU and use objects from a slab directly instead of
         queueing them up.
      
      B. Storage overhead of object queues
      
         SLAB Object queues exist per node, per CPU. The alien cache queue even
         has a queue array that contains a queue for each processor on each
         node. For very large systems the number of queues and the number of
         objects that may be caught in those queues grows exponentially. On our
         systems with 1k nodes / processors we have several gigabytes just tied up
         for storing references to objects for those queues. This does not include
         the objects that could be on those queues. One fears that the whole
         memory of the machine could one day be consumed by those queues.
      
      C. SLAB meta data overhead
      
         SLAB has overhead at the beginning of each slab. This means that data
         cannot be naturally aligned at the beginning of a slab block. SLUB keeps
         all meta data in the corresponding page_struct. Objects can be naturally
         aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
         boundaries and can fit tightly into a 4k page with no bytes left over.
         SLAB cannot do this.
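
         The arithmetic behind the 128 byte example can be checked with a
         small sketch. The helper names and the 32 byte header used for
         comparison are hypothetical; the text does not state the size of
         SLAB's in-slab meta data.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Objects of obj_size bytes that fit into a slab of slab_size bytes when
 * header_size bytes are reserved at the start of the slab. A header_size
 * of 0 models the layout in which all meta data lives outside the slab.
 */
static size_t objects_per_slab(size_t slab_size, size_t header_size,
			       size_t obj_size)
{
	return (slab_size - header_size) / obj_size;
}

/* Bytes left over after packing as many objects as possible. */
static size_t wasted_bytes(size_t slab_size, size_t header_size,
			   size_t obj_size)
{
	return (slab_size - header_size) % obj_size;
}
```

         With no in-slab header, a 4096 byte page holds 32 objects of 128
         bytes with nothing left over. With the assumed 32 byte header only
         31 objects fit and 96 bytes are wasted.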
      
      D. SLAB has a complex cache reaper
      
         SLUB does not need a cache reaper for UP systems. On SMP systems
         the per CPU slab may be pushed back into partial list but that
         operation is simple and does not require an iteration over a list
         of objects. SLAB expires per CPU, shared and alien object queues
         during cache reaping which may cause strange hold offs.
      
      E. SLAB has complex NUMA policy layer support
      
         SLUB pushes NUMA policy handling into the page allocator. This means that
         allocation is coarser (SLUB does interleave on a page level) but that
         situation was also present before 2.6.13. SLAB's application of
         policies to individual slab objects allocated in SLAB is
         certainly a performance concern due to the frequent references to
         memory policies which may lead a sequence of objects to come from
         one node after another. SLUB will get a slab full of objects
         from one node and then will switch to the next.
      
      F. Reduction of the size of partial slab lists
      
         SLAB has per node partial lists. This means that over time a large
         number of partial slabs may accumulate on those lists. These can
         only be reused if allocations occur on specific nodes. SLUB has a global
         pool of partial slabs and will consume slabs from that pool to
         decrease fragmentation.
      
      G. Tunables
      
         SLAB has sophisticated tuning abilities for each slab cache. One can
         manipulate the queue sizes in detail. However, filling the queues still
         requires the use of the spin lock to check out slabs. SLUB has a global
         parameter (min_slab_order) for tuning. Increasing the minimum slab
         order can decrease the locking overhead. The bigger the slab order, the
         fewer page movements occur between the per CPU and partial lists and
         the better SLUB scales.
      
      G. Slab merging
      
         We often have slab caches with similar parameters. SLUB detects those
         on boot up and merges them into the corresponding general caches. This
         leads to more effective memory use. About 50% of all caches can
         be eliminated through slab merging. This will also decrease
         slab fragmentation because partially allocated slabs can be filled
         up again. Slab merging can be switched off by specifying
         slub_nomerge on boot up.
      
         Note that merging can expose heretofore unknown bugs in the kernel
         because corrupted objects may now be placed differently and corrupt
         differing neighboring objects. Enable sanity checks to find those.
      
      H. Diagnostics
      
         The current slab diagnostics are difficult to use and require a
         recompilation of the kernel. SLUB contains debugging code that
         is always available (but is kept out of the hot code paths).
         SLUB diagnostics can be enabled via the "slub_debug" option.
         Parameters can be specified to select a single or a group of
         slab caches for diagnostics. This means that the system is running
         with the usual performance and it is much more likely that
         race conditions can be reproduced.
      
      I. Resiliency
      
         If basic sanity checks are on then SLUB is capable of detecting
         common error conditions and recovering as well as possible to allow the
         system to continue.
      
      J. Tracing
      
         Tracing can be enabled via the slub_debug=T,<slabcache> option
         during boot. SLUB will then log all actions on that slabcache
         and dump the object contents on free.
      
      K. On demand DMA cache creation.
      
         Generally DMA caches are not needed. If a kmalloc is used with
         __GFP_DMA then just create this single slabcache that is needed.
         For systems that have no ZONE_DMA requirement the support is
         completely eliminated.
      
      L. Performance increase
      
         Some benchmarks have shown speed improvements on kernbench in the
         range of 5-10%. The locking overhead of slub is based on the
         underlying base allocation size. If we can reliably allocate
         larger order pages then it is possible to increase slub
         performance much further. The anti-fragmentation patches may
         enable further performance increases.
      
      Tested on:
      i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator
      
      SLUB Boot options
      
      slub_nomerge		Disable merging of slabs
      slub_min_order=x	Require a minimum order for slab caches. This
      			increases the managed chunk size and therefore
      			reduces meta data and locking overhead.
      slub_min_objects=x	Minimum objects per slab. Default is 8.
      slub_max_order=x	Avoid generating slabs larger than order specified.
      slub_debug		Enable all diagnostics for all caches
      slub_debug=<options>	Enable selective options for all caches
      slub_debug=<o>,<cache>	Enable selective options for a certain set of
      			caches
      
      Available Debug options
      F		Double Free checking, sanity and resiliency
      R		Red zoning
      P		Object / padding poisoning
      U		Track last free / alloc
      T		Trace all allocs / frees (only use for individual slabs).
      
      To use SLUB: Apply this patch and then select SLUB as the default slab
      allocator.
      
      [hugh@veritas.com: fix an oops-causing locking error]
      [akpm@linux-foundation.org: various stupid cleanups and small fixes]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tty_register_driver: only allocate tty instances when defined · 543691a6
      Andy Whitcroft authored
      If device->num is zero we attempt to kmalloc() zero bytes.  When SLUB is
      enabled this returns a null pointer and we take that as an allocation failure
      and fail the device register.  Check for no devices and avoid the
      allocation.
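
      The check described above might look roughly like the following
      userspace sketch.  struct driver_sketch and register_driver_sketch()
      are hypothetical stand-ins, and calloc() stands in for kzalloc(); this
      is not the tty code itself.

```c
#include <assert.h>
#include <stdlib.h>

struct driver_sketch {
	int num;	/* number of devices the driver handles */
	void **ttys;	/* per-device pointers, NULL when num == 0 */
};

/* Returns 0 on success and -1 if a needed allocation fails. */
static int register_driver_sketch(struct driver_sketch *d)
{
	d->ttys = NULL;
	if (d->num == 0)
		return 0;	/* no devices: skip the zero-byte allocation */

	d->ttys = calloc((size_t)d->num, sizeof(*d->ttys));
	return d->ttys ? 0 : -1;
}
```

      With this shape a NULL pointer is only treated as a failure when
      memory was actually requested.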
      
      [akpm: opportunistic kzalloc() conversion]
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • i386: use page allocator to allocate thread_info structure · b5637e65
      Christoph Lameter authored
      i386 uses kmalloc to allocate the threadinfo structure assuming that the
      allocations result in a page sized aligned allocation.  That has worked so
      far because SLAB exempts page sized slabs from debugging and aligns them in
      special ways that go beyond the restrictions imposed by
      ARCH_KMALLOC_MINALIGN valid for other slabs in the kmalloc array.
      
      SLUB also works fine without debugging since page sized allocations neatly
      align at page boundaries.  However, if debugging is switched on then SLUB
      will extend the slab with debug information.  The resulting slab is no
      longer of page size.  It will only be aligned following the requirements
      imposed by ARCH_KMALLOC_MINALIGN.  As a result the threadinfo structure may
      not be page aligned which makes i386 fail to boot with SLUB debug on.
      
      Replace the calls to kmalloc with calls into the page allocator.
      
      An alternate solution may be to create a custom slab cache where the
      alignment is set to PAGE_SIZE.  That would allow slub debugging to be
      applied to the threadinfo structure.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpusets: allow TIF_MEMDIE threads to allocate anywhere · c596d9f3
      David Rientjes authored
      OOM killed tasks have access to memory reserves as specified by the
      TIF_MEMDIE flag in the hopes that it will quickly exit.  If such a task has
      memory allocations constrained by cpusets, we may encounter a deadlock if a
      blocking task cannot exit because it cannot allocate the necessary memory.
      
      We allow tasks that have the TIF_MEMDIE flag to allocate memory anywhere,
      including outside their cpuset restriction, so that they can quickly die
      regardless of whether the allocation is __GFP_HARDWALL.
      
      Cc: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: mark set_up_list3s() __init · a3a02be7
      Andrew Morton authored
      It is only ever used prior to free_initmem().
      
      (It will cause a warning when we run the section checking, but that's a
      false-positive and it simply changes the source of an existing warning, which
      is also a false-positive)
      
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Do not disable interrupts when reading min_free_kbytes · 3b1d92c5
      Mel Gorman authored
      The sysctl handler for min_free_kbytes calls setup_per_zone_pages_min() on
      read or write.  This function iterates through every zone and calls
      spin_lock_irqsave() on the zone LRU lock.  When reading min_free_kbytes,
      this is a total waste of time that disables interrupts on the local
      processor.  It might even be noticeable on machines with large numbers of zones
      if a process started constantly reading min_free_kbytes.
      
      This patch calls setup_per_zone_pages_min() only on write. Tested on
      an x86 laptop and it did the right thing.
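
      The control flow can be sketched in userspace as follows.  The handler
      signature, the _sketch names and the call counter are assumptions for
      illustration; the real sysctl handler has a different interface.

```c
#include <assert.h>
#include <stdbool.h>

static int min_free_kbytes_sketch = 1024;
static int recalc_count;	/* how often the expensive walk ran */

/* Stand-in for the per-zone recalculation that takes the zone locks. */
static void setup_per_zone_pages_min_sketch(void)
{
	recalc_count++;
}

/* A write stores new_value and recalculates; a read only returns the value. */
static int min_free_kbytes_handler_sketch(bool write, int new_value)
{
	if (write) {
		min_free_kbytes_sketch = new_value;
		setup_per_zone_pages_min_sketch();
	}
	return min_free_kbytes_sketch;
}
```

      Reads in this sketch never reach the recalculation, so they never
      touch the locks it would take.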
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: NUMA kmem_cache diet · 8da3430d
      Eric Dumazet authored
      Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer
      possible nodes.  This patch dynamically sizes the 'struct kmem_cache' to
      allocate only needed space.
      
      I moved the nodelists[] field to the end of struct kmem_cache, and use the
      following computation in kmem_cache_init():
      
      cache_cache.buffer_size = offsetof(struct kmem_cache, nodelists) +
                                       nr_node_ids * sizeof(struct kmem_list3 *);
      
      On my two-node x86_64 machine, kmem_cache.obj_size is now 192 instead of 704
      (This is because on x86_64, MAX_NUMNODES is 64)
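
      The sizing trick can be tried in userspace with a simplified struct.
      The struct layout and the _sketch names below are assumptions; only the
      offsetof() computation mirrors the one quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMNODES_SKETCH 64	/* compile-time upper bound on nodes */

struct kmem_list3_sketch;	/* opaque per-node list type */

struct kmem_cache_sketch {
	unsigned int buffer_size;
	const char *name;
	/* Must stay last so that unused trailing slots need not be allocated. */
	struct kmem_list3_sketch *nodelists[MAX_NUMNODES_SKETCH];
};

/* Bytes needed when only nr_node_ids node slots are actually possible. */
static size_t kmem_cache_bytes_needed(unsigned int nr_node_ids)
{
	return offsetof(struct kmem_cache_sketch, nodelists) +
	       nr_node_ids * sizeof(struct kmem_list3_sketch *);
}
```

      For two nodes the result is smaller than sizeof(struct
      kmem_cache_sketch) by 62 pointer slots.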
      
      On bigger NUMA setups, this might reduce the gfporder of "cache_cache"
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLAB: don't allocate empty shared caches · 63109846
      Eric Dumazet authored
      We can avoid allocating empty shared caches and avoid unnecessary checks of
      cache->limit.  We save some memory.  We avoid bringing unnecessary cache
      lines into the CPU cache.
      
      All accesses to l3->shared are already checking NULL pointers so this patch is
      safe.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLAB: use num_possible_cpus() in enable_cpucache() · 364fbb29
      Eric Dumazet authored
      The existing comment in mm/slab.c is *perfect*, so I reproduce it :
      
               /*
                * CPU bound tasks (e.g. network routing) can exhibit cpu bound
                * allocation behaviour: Most allocs on one cpu, most free operations
                * on another cpu. For these cases, an efficient object passing between
                * cpus is necessary. This is provided by a shared array. The array
                * replaces Bonwick's magazine layer.
                * On uniprocessor, it's functionally equivalent (but less efficient)
                * to a larger limit. Thus disabled by default.
                */
      
      As most shipped Linux kernels are now compiled with CONFIG_SMP, there is no way
      a preprocessor #if can detect if the machine is UP or SMP. Better to use
      num_possible_cpus().
      
      This means on UP we allocate a 'size=0 shared array', to be more efficient.
      
      Another patch can later avoid the allocations of 'empty shared arrays', to
      save some memory.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: code cleanup · 6ce745ed
      Jan Kara authored
      Rename file_ra_state.prev_page to prev_index and file_ra_state.offset to
      prev_offset.  Also, the update of prev_index in do_generic_mapping_read() is now
      moved close to the update of prev_offset.
      
      [wfg@mail.ustc.edu.cn: fix it]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: improve heuristic detecting sequential reads · ec0f1637
      Jan Kara authored
      Introduce ra.offset and store in it an offset where the previous read
      ended.  This way we can detect whether reads are really sequential (and
      thus we should not mark the page as accessed repeatedly) or whether they
      are random and just happen to be in the same page (and the page should
      really be marked accessed again).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • smaps: add clear_refs file to clear reference · b813e931
      David Rientjes authored
      Adds /proc/pid/clear_refs.  When any non-zero number is written to this file,
      pte_mkold() and ClearPageReferenced() are called for each pte and its
      corresponding page, respectively, in that task's VMAs.  This file is only
      writable by the user who owns the task.
      
      It is now possible to measure _approximately_ how much memory a task is using
      by clearing the reference bits with
      
      	echo 1 > /proc/pid/clear_refs
      
      and checking the reference count for each VMA from the /proc/pid/smaps output
      at a measured time interval.  For example, to observe the approximate change
      in memory footprint for a task, write a script that clears the references
      (echo 1 > /proc/pid/clear_refs), sleeps, and then greps for Pgs_Referenced and
      extracts the size in kB.  Add the sizes for each VMA together for the total
      referenced footprint.  Moments later, repeat the process and observe the
      difference.
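
      The summing step of such a script could be sketched in C as below.
      sum_kb_fields() is a hypothetical helper that works on smaps-style text
      already read into memory; the field label is passed in because the
      exact label depends on what the kernel prints.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sum the kB values of every line in text that starts with key. */
static unsigned long sum_kb_fields(const char *text, const char *key)
{
	unsigned long total = 0;
	size_t key_len = strlen(key);
	const char *line = text;

	while (line && *line) {
		unsigned long kb;

		/* %lu skips the padding between the label and the number */
		if (strncmp(line, key, key_len) == 0 &&
		    sscanf(line + key_len, "%lu", &kb) == 1)
			total += kb;

		line = strchr(line, '\n');	/* advance to the next line */
		if (line)
			line++;
	}
	return total;
}
```

      Calling it once per sample and subtracting successive totals gives the
      change in referenced footprint between two points in time.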
      
      For example, using an efficient Mozilla:
      
      	accumulated time		referenced memory
      	----------------		-----------------
      		 0 s				 408 kB
      		 1 s				 408 kB
      		 2 s				 556 kB
      		 3 s				1028 kB
      		 4 s				 872 kB
      		 5 s				1956 kB
      		 6 s				 416 kB
      		 7 s				1560 kB
      		 8 s				2336 kB
      		 9 s				1044 kB
      		10 s				 416 kB
      
      This is a valuable tool to get an approximate measurement of the memory
      footprint for a task.
      
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      [akpm@linux-foundation.org: build fixes]
      [mpm@selenic.com: rename for_each_pmd]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • smaps: add pages referenced count to smaps · f79f177c
      David Rientjes authored
      Adds an additional unsigned long field to struct mem_size_stats called
      'referenced'.  For each pte walked in the smaps code, this field is
      incremented by PAGE_SIZE if it has pte-reference bits.
      
      An additional line was added to the /proc/pid/smaps output for each VMA to
      indicate how many pages within it are currently marked as referenced or
      accessed.
      
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • smaps: extract pmd walker from smaps code · 826fad1b
      David Rientjes authored
      Extracts the pmd walker from smaps-specific code in fs/proc/task_mmu.c.
      
      The new struct pmd_walker includes the struct vm_area_struct of the memory to
      walk over.  Iteration begins at the vma->vm_start and completes at
      vma->vm_end.  A pointer to another data structure may be stored in the private
      field such as struct mem_size_stats, which acts as the smaps accumulator.  For
      each pmd in the VMA, the action function is called with a pointer to its
      struct vm_area_struct, a pointer to the pmd_t, its start and end addresses,
      and the private field.
      
      The interface for walking pmd's in a VMA for fs/proc/task_mmu.c is now:
      
      	void for_each_pmd(struct vm_area_struct *vma,
      			  void (*action)(struct vm_area_struct *vma,
      					 pmd_t *pmd, unsigned long addr,
      					 unsigned long end,
      					 void *private),
      			  void *private);
      
      Since the pmd walker is now extracted from the smaps code, smaps_one_pmd() is
      invoked for each pmd in the VMA.  Its behavior and efficiency are identical to
      the existing implementation.
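
      How a caller might use this interface is sketched below with userspace
      mock types.  The flat address walk, PMD_SPAN_SKETCH, the simplified
      struct definitions and count_bytes() are assumptions for illustration;
      the kernel walker descends through the real page table levels instead.

```c
#include <assert.h>

#define PMD_SPAN_SKETCH 0x200000UL	/* hypothetical bytes covered by one pmd */

typedef struct { unsigned long val; } pmd_t;		/* mock entry */
struct vm_area_struct { unsigned long vm_start, vm_end; };	/* mock VMA */

/* Call action once per pmd-sized span between vm_start and vm_end. */
static void for_each_pmd(struct vm_area_struct *vma,
			 void (*action)(struct vm_area_struct *vma,
					pmd_t *pmd, unsigned long addr,
					unsigned long end,
					void *private),
			 void *private)
{
	unsigned long addr = vma->vm_start;

	while (addr < vma->vm_end) {
		/* next span boundary, clamped to the end of the VMA */
		unsigned long next = (addr + PMD_SPAN_SKETCH) &
				     ~(PMD_SPAN_SKETCH - 1);
		pmd_t pmd = { addr / PMD_SPAN_SKETCH };	/* stand-in entry */

		if (next > vma->vm_end)
			next = vma->vm_end;
		action(vma, &pmd, addr, next, private);
		addr = next;
	}
}

/* Accumulator passed through the private pointer. */
struct mem_size_stats { unsigned long bytes_walked; };

static void count_bytes(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr, unsigned long end, void *private)
{
	struct mem_size_stats *mss = private;

	(void)vma;
	(void)pmd;
	mss->bytes_walked += end - addr;
}
```

      The private pointer carries the accumulator, so the walker itself
      needs no knowledge of what the action function collects.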
      
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • i386: use pte_update_defer in ptep_test_and_clear_{dirty,young} · 0013572b
      Zachary Amsden authored
      If you actually clear the bit, you need to:
      
      +         pte_update_defer(vma->vm_mm, addr, ptep);
      
      The reason is, when updating PTEs, the hypervisor must be notified.  Using
      atomic operations to do this is fine for all hypervisors I am aware of.
      However, for hypervisors which shadow page tables, if these PTE
      modifications are not trapped, you need a post-modification call to fulfill
      the update of the shadow page table.
      Acked-by: Zachary Amsden <zach@vmware.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • i386: add ptep_test_and_clear_{dirty,young} · 10a8d6ae
      David Rientjes authored
      Add ptep_test_and_clear_{dirty,young} to i386.  They advertise that they
      have it and there is at least one place where it needs to be called without
      the page table lock: to clear the accessed bit on write to
      /proc/pid/clear_refs.
      
      ptep_clear_flush_{dirty,young} are updated to use the new functions.  The
      overall net effect to current users of ptep_clear_flush_{dirty,young} is
      that we introduce an additional branch.
      
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Add uninitialized_var() macro for suppressing gcc warnings · 94909914
      Borislav Petkov authored
      Introduce a macro for suppressing gcc from generating a warning about a
      probable uninitialized state of a variable.
      
      Example:
      
      -	spinlock_t *ptl;
      +	spinlock_t *uninitialized_var(ptl);
      
      Not a happy solution, but those warnings are obnoxious.
      
      - Using the usual pointlessly-set-it-to-zero approach wastes several
        bytes of text.
      
      - Using a macro means we can (hopefully) do something else if gcc changes
        cause the `x = x' hack to stop working
      
      - Using a macro means that people who are worried about hiding true bugs
        can easily turn it off.
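
      A minimal sketch of the `x = x' form mentioned above follows.  The
      macro body is inferred from that remark, and lookup_sketch() and
      double_or_default() are hypothetical helpers used only to show a
      variable that is always written before it is read.

```c
#include <assert.h>
#include <stdbool.h>

/* Self-assignment form; assumed from the `x = x' hack named in the text. */
#define uninitialized_var(x) x = x

/* Writes *out and returns true only for non-negative keys. */
static bool lookup_sketch(int key, int *out)
{
	if (key < 0)
		return false;
	*out = key * 2;
	return true;
}

static int double_or_default(int key, int fallback)
{
	int uninitialized_var(val);	/* expands to: int val = val; */

	if (lookup_sketch(key, &val))
		return val;		/* val was written by the callee */
	return fallback;
}
```

      Redefining the macro to expand to a plain `x' would bring the
      warnings back for anyone who wants to audit the annotated sites.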
      Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: simplify filemap_nopage · a8127717
      Nick Piggin authored
      Identical block is duplicated twice: contrary to the comment, we have been
      re-reading the page *twice* in filemap_nopage rather than once.
      
      If any retry logic or anything is needed, it belongs in lower levels anyway.
      Only retry once.  Linus agrees.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • add pfn_valid_within helper for sub-MAX_ORDER hole detection · 14e07298
      Andy Whitcroft authored
      Generally we work under the assumption that the mem_map array is
      contiguous and valid out to a MAX_ORDER_NR_PAGES block of pages, i.e. that if we
      have validated any page within this MAX_ORDER_NR_PAGES block we need not check
      any other.  This is not true when CONFIG_HOLES_IN_ZONE is set and we must
      check each and every reference we make from a pfn.
      
      Add a pfn_valid_within() helper which should be used when scanning pages
      within a MAX_ORDER_NR_PAGES block when we have already checked the validity
      of the block normally with pfn_valid().  This can then be optimised away when
      we do not have holes within a MAX_ORDER_NR_PAGES block of pages.
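
      The shape of such a helper can be sketched in userspace as follows.
      The pfn_valid() stand-in with its hole at pfns 8..11, the local
      CONFIG_HOLES_IN_ZONE definition and count_valid_pages() are assumptions
      for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define CONFIG_HOLES_IN_ZONE 1	/* remove to model configurations without holes */

/* Stand-in for the architecture's pfn_valid(): pfns 8..11 model a hole. */
static bool pfn_valid(unsigned long pfn)
{
	return pfn < 8 || pfn > 11;
}

#ifdef CONFIG_HOLES_IN_ZONE
#define pfn_valid_within(pfn) pfn_valid(pfn)
#else
#define pfn_valid_within(pfn) (1)	/* constant, so the check compiles away */
#endif

/* Count usable pages in a block that was already validated as a whole. */
static unsigned long count_valid_pages(unsigned long start, unsigned long nr)
{
	unsigned long pfn, count = 0;

	for (pfn = start; pfn < start + nr; pfn++) {
		if (!pfn_valid_within(pfn))
			continue;	/* skip pfns that fall into a hole */
		count++;
	}
	return count;
}
```

      Without the config symbol the macro is the constant 1, which lets the
      compiler drop the per-page check from the loop.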
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Bob Picco <bob.picco@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab.c: proper prototypes · ac267728
      Adrian Bunk authored
      Add proper prototypes in include/linux/slab.h.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Introduce CONFIG_HAS_DMA · 411f0f3e
      Heiko Carstens authored
      Architectures that don't support DMA can say so by adding a config NO_DMA
      to their Kconfig file.  This will prevent compilation of some DMA-specific
      driver code.  Also, dma-mapping-broken.h is no longer needed on at least
      s390.  This avoids compilation and linking of otherwise dead/broken code.
      
      Other architectures that include dma-mapping-broken.h are arm26, h8300,
      m68k, m68knommu and v850.  If these could be converted as well we could get
      rid of the header file.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      "John W. Linville" <linville@tuxdriver.com>
      Cc: Kyle McMartin <kyle@parisc-linux.org>
      Cc: <James.Bottomley@SteelEye.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Jeff Garzik <jeff@garzik.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <geert@linux-m68k.org>
      Cc: <zippel@linux-m68k.org>
      Cc: <spyro@f2s.com>
      Cc: <uclinux-v850@lsi.nec.co.jp>
      Cc: <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      411f0f3e
    • Joshua N Pritikin's avatar
      allow oom_adj of saintly processes · 9a82782f
      Joshua N Pritikin authored
      If the badness of a process is zero then oom_adj>0 has no effect.  This
      patch makes sure that the oom_adj shift actually increases badness points
      appropriately.
      Signed-off-by: Joshua N. Pritikin <jpritikin@pobox.com>
      Cc: Andrea Arcangeli <andrea@novell.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a82782f
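
      The arithmetic behind commit 9a82782f is easy to show in isolation: a
      badness of zero stays zero under any left shift, so a positive oom_adj
      cannot raise it.  The sketch below assumes a hypothetical helper name,
      apply_oom_adj(), and assumes the magnitude of oom_adj is smaller than the
      width of unsigned long; it is not the kernel's code.

```c
#include <assert.h>

/* Hypothetical helper: apply an oom_adj shift to a badness score. */
static unsigned long apply_oom_adj(unsigned long points, int oom_adj)
{
	if (oom_adj > 0) {
		/* Zero shifted left is still zero, so start from one. */
		if (points == 0)
			points = 1;
		points <<= oom_adj;
	} else if (oom_adj < 0) {
		points >>= -oom_adj;
	}
	return points;
}

/* Usage: a "saintly" process with zero badness can now be penalised. */
static void oom_adj_demo(void)
{
	assert(apply_oom_adj(0, 3) == 8);
	assert(apply_oom_adj(4, 2) == 16);
}
```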
    • Nick Piggin's avatar
      fs: buffer don't PageUptodate without page locked · 3d67f2d7
      Nick Piggin authored
      __block_write_full_page is calling SetPageUptodate without the page locked.
      This is unusual, but not incorrect, as PG_writeback is still set.
      
      However the next patch will require that SetPageUptodate always be called with
      the page locked.  Simply don't bother setting the page uptodate in this case
      (it is unusual that the write path does such a thing anyway).  Instead just
      leave it to the read side to bring the page uptodate when it notices that all
      buffers are uptodate.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d67f2d7
    • Nick Piggin's avatar
      mm: make read_cache_page synchronous · 6fe6900e
      Nick Piggin authored
      Ensure pages are uptodate after returning from read_cache_page, which allows
      us to cut out most of the filesystem-internal PageUptodate calls.
      
      I didn't have a great look down the call chains, but this appears to fix 7
      possible use-before-uptodate cases in hfs, 2 in hfsplus, 1 in jfs, a few in
      ecryptfs, 1 in jffs2, and a possible case of cleared data being overwritten
      by readpage in block2mtd.  All depend on whether the filler is async and/or
      can return with a !uptodate page.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6fe6900e
    • Pekka Enberg's avatar
      slab: ensure cache_alloc_refill terminates · 714b8171
      Pekka Enberg authored
      If slab->inuse is corrupted, cache_alloc_refill can enter an infinite
      loop as detailed by Michael Richardson in the following post:
      <http://lkml.org/lkml/2007/2/16/292>. This adds a BUG_ON to catch
      those cases.
      
      Cc: Michael Richardson <mcr@sandelman.ca>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      714b8171
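
      Why a corrupted slab->inuse can stall the refill in commit 714b8171 can
      be seen in a simplified model: if inuse is already at or beyond the
      number of objects per slab, the inner loop takes nothing and the caller
      would retry forever.  The struct, the function name, and the exact sanity
      condition below are assumptions for illustration, and the model returns
      -1 where the kernel would hit the BUG_ON.

```c
#include <assert.h>

/* Simplified model of a slab: only the in-use object count. */
struct slab_model {
	int inuse;
};

/*
 * Take up to batchcount objects from one slab holding num objects.
 * Returns the number taken, or -1 if inuse is out of range.
 */
static int refill_from_slab(struct slab_model *slabp, int num, int batchcount)
{
	int taken = 0;

	/* Stand-in for the BUG_ON: a slab picked for refill must have room. */
	if (slabp->inuse < 0 || slabp->inuse >= num)
		return -1;

	while (slabp->inuse < num && batchcount-- > 0) {
		slabp->inuse++;
		taken++;
	}
	return taken;
}

/* Usage: a healthy slab yields objects, a corrupted one is caught. */
static void refill_demo(void)
{
	struct slab_model ok = { .inuse = 2 };
	struct slab_model bad = { .inuse = 4 };

	assert(refill_from_slab(&ok, 4, 8) == 2);
	assert(refill_from_slab(&bad, 4, 8) == -1);
}
```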
    • Nick Piggin's avatar
      mm: remove gcc workaround · 5f22df00
      Nick Piggin authored
      Minimum gcc version is 3.2 now.  However, with likely profiling, even
      modern gcc versions cannot always eliminate the call.
      
      Replace the placeholder functions with the more conventional empty static
      inlines, which should be optimal for everyone.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f22df00
    • Adrian Bunk's avatar
      proper prototype for hugetlb_get_unmapped_area() · d2ba27e8
      Adrian Bunk authored
      Add a proper prototype for hugetlb_get_unmapped_area() in
      include/linux/hugetlb.h.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Acked-by: William Irwin <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2ba27e8
    • Christoph Lameter's avatar
      Use ZVC counters to establish exact size of dirtyable pages · 1b424464
      Christoph Lameter authored
      We can use the global ZVC counters to establish the exact size of the LRU
      and the free pages.  This allows a more accurate determination of the dirty
      ratio.
      
      This patch will fix the broken ratio calculations if large amounts of
      memory are allocated to huge pages or other consumers that do not put the
      pages on to the LRU.
      
      Notes:
      - I did not add NR_SLAB_RECLAIMABLE to the calculation of the
        dirtyable pages. Those may be reclaimable but they are at this
        point not dirtyable. If NR_SLAB_RECLAIMABLE would be considered
        then a huge number of reclaimable pages would stop writeback
        from occurring.
      
      - This patch used to be in mm as the last one in a series of patches.
        It was removed when Linus updated the treatment of highmem because
        there was a conflict. I updated the patch to follow Linus' approach.
        This patch is needed to fulfill the claims made in the beginning of the
        patchset that is now in Linus' tree.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b424464
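
      The effect described in commit 1b424464 can be illustrated with plain
      arithmetic.  The sketch assumes that the dirtyable size is the sum of the
      free pages and the two LRU lists, leaves out the highmem handling the
      message mentions, and uses hypothetical names and numbers throughout.

```c
#include <assert.h>

/* Hypothetical snapshot of the global counters, in pages. */
struct vm_counters {
	unsigned long free;
	unsigned long inactive;	/* inactive LRU list */
	unsigned long active;	/* active LRU list */
};

/* Dirtyable memory: free pages plus everything on the LRU. */
static unsigned long dirtyable_pages(const struct vm_counters *c)
{
	return c->free + c->inactive + c->active;
}

/* Dirty threshold as a percentage of some base size. */
static unsigned long dirty_limit(unsigned long base_pages, unsigned int ratio)
{
	return base_pages * ratio / 100;
}

/*
 * Usage: 1000 pages of RAM, 600 of them in huge pages that never reach
 * the LRU.  A 10% ratio taken from total RAM allows 100 dirty pages,
 * a quarter of what can actually be dirtied; the counter-based size
 * gives 40.
 */
static void dirtyable_demo(void)
{
	struct vm_counters c = { .free = 100, .inactive = 200, .active = 100 };

	assert(dirtyable_pages(&c) == 400);
	assert(dirty_limit(1000, 10) == 100);
	assert(dirty_limit(dirtyable_pages(&c), 10) == 40);
}
```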
    • Christoph Lameter's avatar
      Safer nr_node_ids and nr_cpu_ids determination and initial values · 476f3534
      Christoph Lameter authored
      The nr_cpu_ids value is currently only calculated in smp_init.  However, it
      may be needed before (SLUB needs it on kmem_cache_init!) and other kernel
      components may also want to allocate dynamically sized per cpu array before
      smp_init.  So move the determination of possible cpus into sched_init()
      where we already loop over all possible cpus early in boot.
      
      Also initialize both nr_node_ids and nr_cpu_ids with the highest value they
      could take.  If we have accidental users before these values are determined
      then the current value of 0 may cause too small per cpu and per node arrays
      to be allocated.  If it is set to the maximum possible then we only waste
      some memory for early boot users.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      476f3534
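
      The two ideas in commit 476f3534, a safe initial value and a count
      derived from the possible cpus, can be modelled in user space.  In this
      sketch the bitmask representation, the NR_CPUS value of 32, and the
      function name determine_nr_cpu_ids() are assumptions for illustration.

```c
#include <assert.h>

#define NR_CPUS 32	/* hypothetical compile-time maximum */

/* Start at the maximum so early users over-allocate, never under-allocate. */
static int nr_cpu_ids = NR_CPUS;

/* Set nr_cpu_ids to the highest possible cpu number plus one. */
static void determine_nr_cpu_ids(unsigned long possible_mask)
{
	int cpu, highest = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (possible_mask & (1UL << cpu))
			highest = cpu;
	nr_cpu_ids = highest + 1;
}

/* Usage: before detection the value is the maximum; cpus 0, 1, 3 give 4. */
static void cpu_ids_demo(void)
{
	assert(nr_cpu_ids == NR_CPUS);
	determine_nr_cpu_ids(0x0BUL);
	assert(nr_cpu_ids == 4);
}
```

      With an initial value of 0, an array sized by nr_cpu_ids before
      detection would have no slots at all, which is the failure the commit
      guards against.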
    • Jeremy Fitzhardinge's avatar
      Add apply_to_page_range() which applies a function to a pte range · aee16b3c
      Jeremy Fitzhardinge authored
      Add a new mm function apply_to_page_range() which applies a given function to
      every pte in a given virtual address range in a given mm structure.  This is a
      generic alternative to cut-and-pasting the Linux idiomatic pagetable walking
      code in every place that a sequence of PTEs must be accessed.
      
      Although this interface is intended to be useful in a wide range of
      situations, it is currently used specifically by several Xen subsystems, for
      example: to ensure that pagetables have been allocated for a virtual address
      range, and to construct batched special pagetable update requests to map I/O
      memory (in ioremap()).
      
      [akpm@linux-foundation.org: fix warning, unpleasantly]
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Matt Mackall <mpm@waste.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aee16b3c
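
      The callback pattern that commit aee16b3c introduces can be sketched
      outside the kernel.  The model below replaces the multi-level pagetable
      with a flat array and assumes page-aligned addresses; every name in it
      (model_pte_t, model_apply_to_range(), count_and_mark()) is hypothetical
      and it does not reproduce the real apply_to_page_range() signature.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL

typedef unsigned long model_pte_t;
typedef int (*model_pte_fn_t)(model_pte_t *pte, unsigned long addr, void *data);

/* Apply fn to each entry covering [addr, addr + size); stop on first error. */
static int model_apply_to_range(model_pte_t *table, size_t nr_entries,
				unsigned long addr, unsigned long size,
				model_pte_fn_t fn, void *data)
{
	unsigned long end = addr + size;
	int err = 0;

	for (; addr < end; addr += MODEL_PAGE_SIZE) {
		size_t idx = addr / MODEL_PAGE_SIZE;

		if (idx >= nr_entries)
			return -1;	/* range runs past the modelled table */
		err = fn(&table[idx], addr, data);
		if (err)
			break;
	}
	return err;
}

/* Example callback: set a "present" bit and count the visited entries. */
static int count_and_mark(model_pte_t *pte, unsigned long addr, void *data)
{
	(void)addr;
	*pte |= 1UL;
	(*(int *)data)++;
	return 0;
}

/* Usage: walk three pages starting at the second page of the table. */
static void apply_demo(void)
{
	model_pte_t table[8] = { 0 };
	int visited = 0;

	assert(model_apply_to_range(table, 8, MODEL_PAGE_SIZE,
				    3 * MODEL_PAGE_SIZE,
				    count_and_mark, &visited) == 0);
	assert(visited == 3);
	assert(table[0] == 0 && table[1] == 1 && table[3] == 1 && table[4] == 0);
}
```

      The caller supplies only the per-entry action, so the walking loop is
      written once instead of being copied into every user.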
    • Jiri Slaby's avatar
      Serial: serial_core, use pr_debug · eb3a1e11
      Jiri Slaby authored
      serial_core, use pr_debug
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb3a1e11
    • Dave Jiang's avatar
      MPSC serial driver tx locking · 1733310b
      Dave Jiang authored
      The MPSC serial driver assumes that interrupts are always enabled, so that
      the DMA transmit ops that were not submitted while the DMA engine was active
      get picked up.  However, when irqs are off for a period of time, such as
      during a kernel crash dump, console messages do not show up because the
      additional DMA ops are dropped.  This patch makes console writes process all
      the queued tx DMAs before submitting a new request.
      
      Also, the current locking mechanism does not protect the hardware registers
      and ring buffer when a printk is done during the serial write operations.
      The additional per-port transmit lock provides finer-grained locking and
      protects registers from being clobbered when printks are nested within UART
      writes.
      Signed-off-by: Dave Jiang <djiang@mvista.com>
      Signed-off-by: Mark A. Greer <mgreer@mvista.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1733310b
    • David Gibson's avatar
      serial: define FIXED_PORT flag for serial_core · abb4a239
      David Gibson authored
      At present, the serial core always allows setserial in userspace to change the
      port address, irq and base clock of any serial port.  That makes sense for
      legacy ISA ports, but not for (say) embedded ns16550 compatible serial ports
      at peculiar addresses.  In these cases, the kernel code configuring the ports
      must know exactly where they are, and their clocking arrangements (which can
      be unusual on embedded boards).  It doesn't make sense for userspace to change
      these settings.
      
      Therefore, this patch defines a UPF_FIXED_PORT flag for the uart_port
      structure.  If this flag is set when the serial port is configured, any
      attempts to alter the port's type, io address, irq or base clock with
      setserial are ignored.
      
      In addition this patch uses the new flag for on-chip serial ports probed in
      arch/powerpc/kernel/legacy_serial.c, and for other hard-wired serial ports
      probed by drivers/serial/of_serial.c.
      Signed-off-by: David Gibson <dwg@au1.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abb4a239
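
      The behaviour that commit abb4a239 describes, ignoring setserial changes
      on a fixed port, reduces to one flag test.  The following is a sketch
      under stated assumptions: struct model_port, model_set_info(), and the
      bit chosen for MODEL_UPF_FIXED_PORT are illustrative and do not mirror
      the real uart_port layout or flag value.

```c
#include <assert.h>

#define MODEL_UPF_FIXED_PORT (1u << 29)	/* bit position is illustrative */

/* Hypothetical, trimmed-down port description. */
struct model_port {
	unsigned int flags;
	unsigned long iobase;
	unsigned int irq;
	unsigned int base_clock;
};

/* Model of a setserial request: silently ignored on fixed ports. */
static void model_set_info(struct model_port *port, unsigned long iobase,
			   unsigned int irq, unsigned int base_clock)
{
	if (port->flags & MODEL_UPF_FIXED_PORT)
		return;	/* hard-wired port: keep the kernel's settings */

	port->iobase = iobase;
	port->irq = irq;
	port->base_clock = base_clock;
}

/* Usage: a legacy port accepts the change, an on-chip port does not. */
static void fixed_port_demo(void)
{
	struct model_port isa = { 0, 0x3f8, 4, 115200 };
	struct model_port soc = { MODEL_UPF_FIXED_PORT, 0x1000, 9, 230400 };

	model_set_info(&isa, 0x2f8, 3, 115200);
	model_set_info(&soc, 0x2f8, 3, 115200);
	assert(isa.iobase == 0x2f8 && isa.irq == 3);
	assert(soc.iobase == 0x1000 && soc.irq == 9 && soc.base_clock == 230400);
}
```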
    • Thomas Koeller's avatar
      RM9000 serial driver · bd71c182
      Thomas Koeller authored
      Add support for the integrated serial ports of the MIPS RM9122 processor
      and its relatives.
      
      The patch also does some whitespace cleanup.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Thomas Koeller <thomas.koeller@baslerweb.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd71c182
    • Marc St-Jean's avatar
      serial driver PMC MSP71xx · beab697a
      Marc St-Jean authored
      Serial driver patch for the PMC-Sierra MSP71xx devices.
      
      There are three different fixes:
      
      1 Fix for DesignWare APB THRE errata: In brief, this is a non-standard
        16550 in that the THRE interrupt will not re-assert itself simply by
        disabling and re-enabling the THRI bit in the IER, it is only re-enabled
        if a character is actually sent out.
      
        It appears that the "8250-uart-backup-timer.patch" in the "mm" tree
        also fixes it so we have dropped our initial workaround.  This patch now
        needs to be applied on top of that "mm" patch.
      
      2 Fix for Busy Detect on LCR write: The DesignWare APB UART has a feature
        which causes a new Busy Detect interrupt to be generated if it's busy
        when the LCR is written.  This fix saves the value of the LCR and
        rewrites it after clearing the interrupt.
      
      3 Workaround for interrupt/data concurrency issue: The SoC needs to
        ensure that writes that can cause interrupts to be cleared reach the UART
        before returning from the ISR.  This fix reads a non-destructive register
        on the UART so the read transaction completion ensures the previously
        queued write transaction has also completed.
      Signed-off-by: Marc St-Jean <Marc_St-Jean@pmc-sierra.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      beab697a
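
      The second fix in commit beab697a, saving the LCR value and rewriting it
      after the Busy Detect interrupt is cleared, can be modelled with a shadow
      variable.  This sketch assumes that an LCR write made while the UART is
      busy does not take effect, and it lets the handler clear the busy state
      itself as a simplification; the struct and function names are
      hypothetical and no real register access is shown.

```c
#include <assert.h>

/* Hypothetical software model of the UART state relevant to the fix. */
struct model_uart {
	unsigned char lcr;		/* value held by the hardware */
	unsigned char lcr_shadow;	/* last value software asked for */
	int busy;			/* UART currently busy */
	int busy_irq;			/* Busy Detect interrupt pending */
};

/* Write the LCR, always remembering the intended value. */
static void model_write_lcr(struct model_uart *u, unsigned char val)
{
	u->lcr_shadow = val;
	if (u->busy)
		u->busy_irq = 1;	/* assumed: write lost, interrupt raised */
	else
		u->lcr = val;
}

/* Handle Busy Detect: clear the interrupt, then replay the saved LCR. */
static void model_handle_busy_irq(struct model_uart *u)
{
	if (!u->busy_irq)
		return;
	u->busy_irq = 0;
	u->busy = 0;		/* simplification: busy ends with the clear */
	u->lcr = u->lcr_shadow;
}

/* Usage: a write made while busy is replayed by the handler. */
static void lcr_demo(void)
{
	struct model_uart u = { 0, 0, 0, 0 };

	model_write_lcr(&u, 0x03);
	assert(u.lcr == 0x03);

	u.busy = 1;
	model_write_lcr(&u, 0x83);
	assert(u.lcr == 0x03 && u.busy_irq == 1);

	model_handle_busy_irq(&u);
	assert(u.lcr == 0x83 && u.busy_irq == 0);
}
```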
    • Bernhard Walle's avatar
      add new_id to PCMCIA drivers · 6179b556
      Bernhard Walle authored
      PCI drivers have the new_id file in sysfs which allows new IDs to be added
      at runtime.  The advantage is to avoid recompiling a driver that works for
      a new device but whose ID table doesn't contain that device.  This
      mechanism is only meant for testing; after the driver has been tested
      successfully, the ID should be added in source code so that new revisions
      of the kernel automatically detect the device.
      
      The implementation follows the PCI implementation. The interface is documented
      in Documentation/pcmcia/driver.txt. Computations should be done in userspace,
      so the sysfs string contains the raw structure members for matching.
      Signed-off-by: Bernhard Walle <bwalle@suse.de>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6179b556