1. 17 Jun, 2009 40 commits
    • Joe Perches's avatar
      scripts/get_maintainer.pl: output first field only in mailing lists and after maintainers. · 290603c1
      Joe Perches authored
      Fix mailing lists that are described, but not "(subscriber-only)"
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      290603c1
    • Zygo Blaxell's avatar
      lib/genalloc.c: remove unmatched write_lock() in gen_pool_destroy · 8e8a2dea
      Zygo Blaxell authored
      There is a call to write_lock() in gen_pool_destroy which is not balanced
      by any corresponding write_unlock().  This causes problems with preemption
      because the preemption-disable counter is incremented in the write_lock()
      call, but never decremented by any call to write_unlock().  This bug is
      gen_pool_destroy, and one of them is non-x86 arch-specific code.
      Signed-off-by: Zygo Blaxell <zygo.blaxell@xandros.com>
      Cc: Jiri Kosina <trivial@kernel.org>
      Cc: Steve Wise <swise@opengridcomputing.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e8a2dea
    • Tomas Szepe's avatar
      CONFIG_FILE_LOCKING should not depend on CONFIG_BLOCK · 69050eee
      Tomas Szepe authored
      CONFIG_FILE_LOCKING should not depend on CONFIG_BLOCK.
      
      This makes it possible to run complete systems out of a CONFIG_BLOCK=n
      initramfs on current kernels again (this last worked on 2.6.27.*).
      
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69050eee
    • Florian Fainelli's avatar
      drivers: add support for the TI VLYNQ bus · 55e331cf
      Florian Fainelli authored
      Add support for the TI VLYNQ high-speed, serial and packetized bus.
      
      This bus allows external devices to be connected to the System-on-Chip and
      appear in the main system memory just like any memory mapped peripheral.
      It is widely used in TI's networking and multimedia SoC, including the AR7
      SoC.
      Signed-off-by: Eugene Konev <ejka@imfi.kspu.ru>
      Signed-off-by: Florian Fainelli <florian@openwrt.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55e331cf
    • Daniel Mack's avatar
      console: make blank timeout value a boot option · f324edc8
      Daniel Mack authored
      The console blank timer is currently hardcoded to 10*60 seconds which
      might be annoying on systems with no input devices attached to wake up the
      console again.  Especially during development, disabling the screen saver
      can be handy - for example when debugging the root fs mount mechanism or
      other scenarios where no userspace program could be started to do that at
      runtime from userspace.
      
      This patch defines a core_param for the variable in charge, which allows
      users to entirely disable the blank feature at boot time by setting it to
      0.  The value can still be overwritten at runtime using the standard ioctl
      call - this just allows the default to be changed conditionally.
      Signed-off-by: Daniel Mack <daniel@caiaq.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f324edc8
    • Figo.zhang's avatar
      Documentation/atomic_ops.txt: fix sample code · 4764e280
      Figo.zhang authored
      list_add() was missing a parameter in the sample code.
      Signed-off-by: Figo.zhang <figo1802@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4764e280
    • Maciej W. Rozycki's avatar
      eisa.ids: add Network Peripherals FDDI boards · 73d05163
      Maciej W. Rozycki authored
      Add EISA IDs for Network Peripherals FDDI boards.  Descriptions taken from
      the respective EISA configuration files.
      
      It's unlikely we'll ever support these cards, the problem being the lack
      of documentation.  Assuming the policy for the EISA ID database is the
      same as for PCI I'm sending these entries for the sake of completeness.
      Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Marc Zyngier <maz@misterjones.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73d05163
    • Masatake YAMATO's avatar
      syscalls.h: remove duplicated declarations for sys_pipe2 · cc6f2677
      Masatake YAMATO authored
      sys_pipe2 is declared twice in include/linux/syscalls.h.
      Signed-off-by: Masatake YAMATO <yamato@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc6f2677
    • Randy Dunlap's avatar
      kmap_types: make most arches use generic header file · e4c9dd0f
      Randy Dunlap authored
      Convert most arches to use asm-generic/kmap_types.h.
      
      Move the KM_FENCE_ macro additions into asm-generic/kmap_types.h,
      controlled by __WITH_KM_FENCE from each arch's kmap_types.h file.
      
      Would be nice to be able to add custom KM_types per arch, but I don't yet
      see a nice, clean way to do that.
      
      Built on x86_64, i386, mips, sparc, alpha(tonyb), powerpc(tonyb), and
      68k(tonyb).
      
      Note: avr32 should be able to remove KM_PTE2 (since it's not used) and
      then just use the generic kmap_types.h file.  Get avr32 maintainer
      approval.
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: <linux-arch@vger.kernel.org>
      Acked-by: Mike Frysinger <vapier@gentoo.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Bryan Wu <cooloney@kernel.org>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: "Luck Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4c9dd0f
    • Jaswinder Singh Rajput's avatar
      Documentation/accounting/getdelays.c: initialize the variable before using it · b8d9a865
      Jaswinder Singh Rajput authored
      Fix compilation warning:
      
      Documentation/accounting/getdelays.c: In function `main':
      Documentation/accounting/getdelays.c:249: warning: `cmd_type' may be used uninitialized in this function
      
      This is in fact a false positive.
      Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
      Acked-by: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8d9a865
    • Li Zefan's avatar
      hexdump: remove the trailing space · c67ae69b
      Li Zefan authored
      For example:
              hex_dump_to_buffer("AB", 2, 16, 1, buf, 100, 0);
              pr_info("[%s]\n", buf);
      
      I'd expect the output to be "[41 42]", but actually it's "[41 42 ]".
      
      This patch also reduces the required buffer size to the minimum.  To
      print the hex format of "AB", a buf of size 6 should be sufficient, but
      hex_dump_to_buffer() required at least 8.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c67ae69b
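The buffer arithmetic in the hexdump commit above can be checked with a small userspace sketch. `toy_hex_dump()` is a hypothetical helper written for illustration, not the kernel's hex_dump_to_buffer(); it only handles the byte-wise case and shows why two input bytes need exactly 6 output bytes once the trailing space is gone.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy byte-wise hex formatter (hypothetical, not the kernel function).
 * Writes "41 42" style output: one space between bytes, none after the
 * last byte.  Returns the number of characters written (excluding the
 * terminating NUL), or -1 if the output buffer is too small.
 */
static int toy_hex_dump(const unsigned char *src, size_t len,
                        char *out, size_t outlen)
{
    static const char digits[] = "0123456789abcdef";
    /* Each byte needs 2 digits plus either a separator or the final NUL. */
    size_t need = len ? len * 3 : 1;
    size_t pos = 0;

    assert(src != NULL && out != NULL);
    if (outlen < need)
        return -1;

    for (size_t i = 0; i < len; i++) {
        if (i)
            out[pos++] = ' ';   /* separator only between bytes */
        out[pos++] = digits[src[i] >> 4];
        out[pos++] = digits[src[i] & 0x0f];
    }
    out[pos] = '\0';
    return (int)pos;
}
```

Calling `toy_hex_dump((const unsigned char *)"AB", 2, buf, 6)` fills `buf` with "41 42" and a NUL, using all 6 bytes; a 5-byte buffer is rejected.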
    • Minchan Kim's avatar
      use printk_once() in several places · a9c56953
      Minchan Kim authored
      Several places can use printk_once() instead of open-coding the same
      logic.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9c56953
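As a rough illustration of what a print-once helper replaces, here is a userspace analogue. `toy_print_once()`, `toy_probe()` and `toy_messages_emitted` are made-up names for this sketch; the macro hides the static "already printed" flag that callers would otherwise open-code at each call site.

```c
#include <assert.h>
#include <stdio.h>

static int toy_messages_emitted;    /* counts lines actually printed */

/* Hypothetical userspace analogue of a print-once helper. */
#define toy_print_once(...)                     \
    do {                                        \
        static int toy_printed;                 \
        if (!toy_printed) {                     \
            toy_printed = 1;                    \
            toy_messages_emitted++;             \
            fprintf(stderr, __VA_ARGS__);       \
        }                                       \
    } while (0)

/* Each call site gets its own flag, so this warns only on the first call. */
static void toy_probe(int id)
{
    assert(id >= 0);
    toy_print_once("toy: legacy option used (first seen on id %d)\n", id);
}
```

Calling `toy_probe()` three times prints the message once and leaves `toy_messages_emitted` at 1.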
    • Chris Peterson's avatar
      slow-work: use round_jiffies() for thread pool's cull and OOM timers · 009789f0
      Chris Peterson authored
      Round the slow work queue's cull and OOM timeouts to whole second boundary
      with round_jiffies().  The slow work queue uses a pair of timers to cull
      idle threads and, after OOM, to delay new thread creation.
      
      This patch also extracts the mod_timer() logic for the cull timer into a
      separate helper function.
      
      By rounding non-time-critical timers such as these to whole seconds, they
      will be batched up to fire at the same time rather than being spread out.
      This allows the CPU to wake up less often, which saves power.
      Signed-off-by: Chris Peterson <cpeterso@cpeterso.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      009789f0
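The batching effect described in the slow-work commit can be sketched in userspace. This is a deliberately simplified model: `TOY_HZ` and `toy_round_up_to_second()` are assumptions for illustration, and the sketch always rounds up, whereas the kernel's round_jiffies() applies its own rounding rules.

```c
#include <assert.h>
#include <limits.h>

#define TOY_HZ 100UL    /* assumed tick rate for this sketch */

/*
 * Round an absolute tick count up to the next whole-second boundary.
 * Timers that are not time-critical then expire on the same tick and
 * can be serviced by a single wakeup.
 */
static unsigned long toy_round_up_to_second(unsigned long j)
{
    unsigned long rem = j % TOY_HZ;

    assert(j <= ULONG_MAX - TOY_HZ);    /* keep the addition from wrapping */
    return rem ? j + (TOY_HZ - rem) : j;
}
```

With `TOY_HZ` at 100, timers requested for ticks 1001 and 1099 both expire at tick 1100, so the CPU wakes once instead of twice.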
    • Huang Shijie's avatar
      lib: do code optimization for radix_tree_lookup() and radix_tree_lookup_slot() · b72b71c6
      Huang Shijie authored
      radix_tree_lookup() and radix_tree_lookup_slot() have much the
      same code except for the return value.
      
      Introduce radix_tree_lookup_element() to do the real work.
      
      /*
       * is_slot == 1 : search for the slot.
       * is_slot == 0 : search for the node.
       */
      static void * radix_tree_lookup_element(struct radix_tree_root *root,
      					unsigned long index, int is_slot);
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b72b71c6
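The refactoring pattern in the radix tree commit, where one helper takes an is_slot flag and two thin wrappers call it, can be sketched with a flat table standing in for the tree. All `toy_` names are hypothetical, and the single-level array is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 8

/* Single-level stand-in for a radix tree: one array of item pointers. */
struct toy_root {
    void *slots[TOY_SLOTS];
};

/*
 * Shared lookup body.
 * is_slot != 0: return the address of the slot holding the item.
 * is_slot == 0: return the item itself.
 */
static void *toy_lookup_element(struct toy_root *root, unsigned long index,
                                int is_slot)
{
    void **slot;

    assert(root != NULL);
    if (index >= TOY_SLOTS)
        return NULL;
    slot = &root->slots[index];
    if (*slot == NULL)
        return NULL;
    return is_slot ? (void *)slot : *slot;
}

/* The two public entry points differ only in the flag they pass. */
static void *toy_lookup(struct toy_root *root, unsigned long index)
{
    return toy_lookup_element(root, index, 0);
}

static void **toy_lookup_slot(struct toy_root *root, unsigned long index)
{
    return (void **)toy_lookup_element(root, index, 1);
}
```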
    • Alexey Dobriyan's avatar
      groups: move code to kernel/groups.c · 30639b6a
      Alexey Dobriyan authored
      Move the supplementary groups implementation to kernel/groups.c.
      kernel/sys.c has already accumulated quite a lot of random stuff.
      
      Do strictly copy/paste + add required headers to compile.  Compile-tested
      on many configs and archs.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30639b6a
    • Thomas Gleixner's avatar
      remove put_cpu_no_resched() · 8b0b1db0
      Thomas Gleixner authored
      put_cpu_no_resched() is an optimization of put_cpu() which unfortunately
      can cause high latencies.
      
      The nfs iostats code uses put_cpu_no_resched() in a code sequence where a
      reschedule request caused by an interrupt between the get_cpu() and the
      put_cpu_no_resched() can delay the reschedule for at least HZ.
      
      The other users of put_cpu_no_resched() optimize correctly in interrupt
      code, but there is no real harm in using the put_cpu() function which is
      an alias for preempt_enable().  The extra check of the preempt count is
      not as critical as the potential source of missing a reschedule.
      
      Debugged in the preempt-rt tree and verified in mainline.
      
      Impact: remove a high latency source
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b0b1db0
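The latency problem in the put_cpu_no_resched() commit can be modelled with a toy preempt counter. This is a userspace sketch under stated assumptions: `struct toy_cpu` and the `toy_` functions are invented for illustration, and "rescheduling" is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-CPU state: a preempt-disable depth and a pending request flag. */
struct toy_cpu {
    int preempt_count;
    bool need_resched;  /* set by an "interrupt" while preemption is off */
    int reschedules;    /* how many times we actually rescheduled */
};

static void toy_get_cpu(struct toy_cpu *c)
{
    c->preempt_count++;
}

/* Re-enable preemption and honour any reschedule request that arrived. */
static void toy_put_cpu(struct toy_cpu *c)
{
    assert(c->preempt_count > 0);
    if (--c->preempt_count == 0 && c->need_resched) {
        c->need_resched = false;
        c->reschedules++;
    }
}

/* Re-enable preemption but skip the check: the request stays pending. */
static void toy_put_cpu_no_resched(struct toy_cpu *c)
{
    assert(c->preempt_count > 0);
    --c->preempt_count;
}
```

If a reschedule request arrives between `toy_get_cpu()` and the put, only `toy_put_cpu()` acts on it. In the no_resched variant the request waits for some later scheduling point, which is the delay the commit message describes.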
    • Philipp Reisner's avatar
      drbd: add major number to major.h · 10fc89d0
      Philipp Reisner authored
      Since we have had a LANANA major number for years, and it is documented in
      devices.txt, I think that this first patch can go upstream.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10fc89d0
    • Andrew Morton's avatar
      headers: move module_bug_finalize()/module_bug_cleanup() definitions into module.h · 0d9c25dd
      Andrew Morton authored
      They're in linux/bug.h at present, which causes include order tangles.  In
      particular, linux/bug.h cannot be used by linux/atomic.h because,
      according to Nikanth:
      
      linux/bug.h pulls in linux/module.h => linux/spinlock.h => asm/spinlock.h
      (which uses atomic_inc) => asm/atomic.h.
      
      bug.h is a pretty low-level thing and module.h is a higher-level thing,
      IMO.
      
      Cc: Nikanth Karthikesan <knikanth@novell.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d9c25dd
    • Eric Dumazet's avatar
      poll: avoid extra wakeups in select/poll · 4938d7e0
      Eric Dumazet authored
      Now that Davide Libenzi has introduced keyed wakeups for epoll, we can
      avoid spurious wakeups in the poll()/select() code too.
      
      For example, a typical use of poll()/select() is to wait for incoming
      network frames on many sockets.  But TX completion for UDP/TCP frames
      calls sock_wfree(), which in turn schedules the thread.
      
      When scheduled, the thread does a full scan of all polled fds and can
      sleep again, because nothing is really available.  If the number of fds
      is large, this causes significant load.
      
      This patch makes select()/poll() aware of keyed wakeups so that useless
      wakeups are avoided.  This reduces the number of context switches by
      about 50% on some setups, as well as the work performed by softirq
      handlers.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4938d7e0
    • Robert P. J. Day's avatar
      ntfs: use is_power_of_2() function for clarity. · 02d5341a
      Robert P. J. Day authored
      Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02d5341a
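For readers unfamiliar with the idiom named in the ntfs commit title, a power-of-two test is usually a single bit trick. The sketch below is a userspace version; `toy_is_power_of_2()` and `toy_shift_for()` are illustrative names, and the specific checks changed in ntfs are not shown in the commit text above.

```c
#include <assert.h>
#include <stdbool.h>

/* A power of two has exactly one bit set, so n & (n - 1) clears it to 0. */
static bool toy_is_power_of_2(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Example caller: derive a shift count from a validated power-of-two size. */
static unsigned int toy_shift_for(unsigned long size)
{
    unsigned int shift = 0;

    assert(toy_is_power_of_2(size));
    while ((1UL << shift) != size)
        shift++;
    return shift;
}
```

A named helper reads more clearly at call sites than the raw `n & (n - 1)` expression, which is the clarity argument made in the commit title.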
    • Jan Blunck's avatar
      atomic: only take lock when the counter drops to zero on UP as well · 417dcdf9
      Jan Blunck authored
      _atomic_dec_and_lock() should not unconditionally take the lock before
      calling atomic_dec_and_test() in the UP case.  For consistency reasons it
      should behave exactly like in the SMP case.
      
      Besides that this works around the problem that with CONFIG_DEBUG_SPINLOCK
      this spins in __spin_lock_debug() if the lock is already taken even if the
      counter doesn't drop to 0.
      Signed-off-by: Jan Blunck <jblunck@suse.de>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Cc: Valerie Aurora <vaurora@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      417dcdf9
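The behaviour the atomic commit asks for, taking the lock only when the counter is about to reach zero, can be sketched with C11 atomics. This is a single-threaded userspace model: `struct toy_lock` is a stand-in that asserts on double acquisition (loosely mimicking a debug spinlock), and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in lock: acquiring it twice trips an assertion. */
struct toy_lock {
    bool held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
    assert(!l->held);
    l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Decrement *v unless it is 1.  Returns true if the decrement happened. */
static bool toy_dec_unless_one(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 1) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old - 1))
            return true;
    }
    return false;
}

/*
 * Returns true with the lock held if the counter dropped to zero,
 * false (lock untouched or released) otherwise.
 */
static bool toy_dec_and_lock(atomic_int *v, struct toy_lock *l)
{
    if (toy_dec_unless_one(v))
        return false;           /* fast path: lock never taken */

    toy_lock_acquire(l);        /* counter was 1: may hit zero */
    if (atomic_fetch_sub(v, 1) == 1)
        return true;
    toy_lock_release(l);
    return false;
}
```

Because the fast path never touches the lock, a caller that already holds it can still drop a reference that does not reach zero, which is the situation the commit message describes for CONFIG_DEBUG_SPINLOCK.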
    • Dan Smith's avatar
      utsname.h: make new_utsname fields use the proper length constant · a7d932af
      Dan Smith authored
      The members of the new_utsname structure are defined with magic numbers
      that *should* correspond to the constant __NEW_UTS_LEN+1.  Everywhere
      else, code assumes this and uses the constant, so this patch makes the
      structure match.
      
      Originally suggested by Serge here:
      
      https://lists.linux-foundation.org/pipermail/containers/2009-March/016258.html
      Signed-off-by: Dan Smith <danms@us.ibm.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a7d932af
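The utsname change is the familiar "name the magic number" cleanup. A minimal sketch, assuming a length constant of 64 and using made-up `toy_` names rather than the kernel's structure:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_UTS_LEN 64  /* assumed value, standing in for __NEW_UTS_LEN */

/* Field sizes are derived from the constant instead of a literal 65. */
struct toy_utsname {
    char sysname[TOY_UTS_LEN + 1];
    char nodename[TOY_UTS_LEN + 1];
    char release[TOY_UTS_LEN + 1];
};

/* Code that copies into a field uses the same constant as the declaration. */
static void toy_set_nodename(struct toy_utsname *u, const char *name)
{
    assert(u != NULL && name != NULL);
    strncpy(u->nodename, name, TOY_UTS_LEN);
    u->nodename[TOY_UTS_LEN] = '\0';
}
```

If the constant ever changes, the structure and every user of it change together, which a literal field size cannot guarantee.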
    • Roel Kluin's avatar
      uml: bad macro expansion, parameter is member · b08cd961
      Roel Kluin authored
      `ELF_CORE_COPY_REGS(x, y)' will make expansions like:
      `(y)[0] = (x)->x.gp[0]' but correct is `(y)[0] = (x)->regs.gp[0]'
      Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
      Cc: WANG Cong <amwang@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b08cd961
    • Amerigo Wang's avatar
      uml: fix a section warning · 276c974a
      Amerigo Wang authored
      When compiling uml on x86_64:
      
        MODPOST vmlinux.o
      WARNING: vmlinux.o (.__syscall_stub.2): unexpected non-allocatable section.
      Did you forget to use "ax"/"aw" in a .S file?
      Note that for example <linux/init.h> contains
      section definitions for use in .S files.
      
      This happens because modpost checks for a missing SHF_ALLOC section
      flag, so just add it.
      Signed-off-by: WANG Cong <amwang@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      276c974a
    • Thomas Gleixner's avatar
      um: remove obsolete hw_interrupt_type · 6fa851c3
      Thomas Gleixner authored
      The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) have
      been kept around for migration reasons.  After more than two years it's
      time to remove them finally.
      
      This patch cleans up one of the remaining users.  When all such patches
      hit mainline we can remove the defines and typedefs finally.
      
      Impact: cleanup
      
      Convert the last remaining users to struct irq_chip and remove the
      define.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jeff Dike <jdike@addtoit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6fa851c3
    • Alan Cox's avatar
      uml: UML net driver does not allow for vlans · 7e1cb780
      Alan Cox authored
      See ancient discussion at
      http://marc.info/?l=user-mode-linux-devel&m=101990155831279&w=2
      
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=7854
      Signed-off-by: Alan Cox <alan@linux.intel.com>
      Reported-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Roland Kletzing <devzero@web.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e1cb780
    • Thomas Gleixner's avatar
      m32r: remove obsolete hw_interrupt_type · 189e91f5
      Thomas Gleixner authored
      The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) have
      been kept around for migration reasons.  After more than two years it's
      time to remove them finally.
      
      This patch cleans up one of the remaining users.  When all such patches
      hit mainline we can remove the defines and typedefs finally.
      
      Impact: cleanup
      
      Convert the last remaining users to struct irq_chip and remove the
      define.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      189e91f5
    • Roel Kluin's avatar
      alpha: bad macro expansion, parameter is member · fb26b3e6
      Roel Kluin authored
      `for_each_mem_cluster(x, y, z)' will expand to
      `for ((x) = (y)->x ...' but correct is `for ((x) = (y)->cluster ...'
      Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb26b3e6
    • Thomas Gleixner's avatar
      alpha: remove obsolete hw_interrupt_type · 44377f62
      Thomas Gleixner authored
      The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) have
      been kept around for migration reasons.  After more than two years it's
      time to remove them finally.
      
      This patch cleans up one of the remaining users.  When all such patches
      hit mainline we can remove the defines and typedefs finally.
      
      Impact: cleanup
      
      Convert the last remaining users to struct irq_chip and remove the
      define.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44377f62
    • KAMEZAWA Hiroyuki's avatar
      mm: fix lumpy reclaim lru handling at isolate_lru_pages · ee993b13
      KAMEZAWA Hiroyuki authored
      During lumpy reclaim, a page that __isolate_lru_page() fails to take can
      be pushed back to the "src" list by list_move().  But the page may not be
      from the "src" list, so this pushes the page back to the wrong LRU.  The
      list_move() itself is also unnecessary because the page is not at the top
      of the LRU.  So leave the page where it is if __isolate_lru_page() fails.
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee993b13
    • Mel Gorman's avatar
      vmscan: count the number of times zone_reclaim() scans and fails · 24cf7251
      Mel Gorman authored
      On NUMA machines, the administrator can configure zone_reclaim_mode that
      is a more targeted form of direct reclaim.  On machines with large NUMA
      distances for example, a zone_reclaim_mode defaults to 1 meaning that
      clean unmapped pages will be reclaimed if the zone watermarks are not
      being met.
      
      There is a heuristic that determines if the scan is worthwhile but it is
      possible that the heuristic will fail and the CPU gets tied up scanning
      uselessly.  Detecting the situation requires some guesswork and
      experimentation so this patch adds a counter "zreclaim_failed" to
      /proc/vmstat.  If during high CPU utilisation this counter is increasing
      rapidly, then the resolution to the problem may be to set
      /proc/sys/vm/zone_reclaim_mode to 0.
      
      [akpm@linux-foundation.org: name things consistently]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24cf7251
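To watch a counter like the one the vmscan commit adds, a monitoring tool would read /proc/vmstat, which consists of "name value" lines. The sketch below parses such text from a string; `toy_vmstat_lookup()` is a hypothetical helper and the counter name is passed in by the caller rather than hard-coded, since the commit notes that names were later adjusted for consistency.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find a "name value" line in vmstat-style text.
 * Returns 0 and stores the value on success, -1 if the name is absent.
 */
static int toy_vmstat_lookup(const char *text, const char *name,
                             unsigned long *value)
{
    size_t name_len;

    assert(text != NULL && name != NULL && value != NULL);
    name_len = strlen(name);

    while (*text) {
        const char *eol = strchr(text, '\n');

        /* Require a full-name match followed by the separating space. */
        if (strncmp(text, name, name_len) == 0 &&
            text[name_len] == ' ' &&
            sscanf(text + name_len, "%lu", value) == 1)
            return 0;
        if (!eol)
            break;
        text = eol + 1;
    }
    return -1;
}
```

Sampling the value twice and comparing the difference against the elapsed time gives the rate of increase that the commit message suggests watching during high CPU utilisation.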
    • Mel Gorman's avatar
      vmscan: do not unconditionally treat zones that fail zone_reclaim() as full · fa5e084e
      Mel Gorman authored
      On NUMA machines, the administrator can configure zone_reclaim_mode that
      is a more targeted form of direct reclaim.  On machines with large NUMA
      distances for example, a zone_reclaim_mode defaults to 1 meaning that
      clean unmapped pages will be reclaimed if the zone watermarks are not
      being met.  The problem is that zone_reclaim() failing at all means the
      zone gets marked full.
      
      This can cause situations where a zone is usable, but is being skipped
      because it has been considered full.  Take a situation where a large tmpfs
      mount is occupying a large percentage of memory overall.  The pages do not
      get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
      and the zonelist cache considers them not worth trying in the future.
      
      This patch makes zone_reclaim() return more fine-grained information about
      what occurred when zone_reclaim() failed.  The zone only gets marked full
      if it really is unreclaimable.  If it's a case that the scan did not occur
      or if enough pages were not reclaimed with the limited reclaim_mode, then
      the zone is simply skipped.
      
      There is a side-effect to this patch.  Currently, if zone_reclaim()
      successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go
      ahead.  With this patch applied, zone watermarks are rechecked after
      zone_reclaim() does some work.
      
      This bug was introduced by commit 9276b1bc
      ("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
      zonelist_cache was introduced.  It was not intended that zone_reclaim()
      aggressively consider the zone to be full when it failed as full direct
      reclaim can still be an option.  Due to the age of the bug, it should be
      considered a -stable candidate.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa5e084e
    • Mel Gorman's avatar
      vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim · 90afa5de
      Mel Gorman authored
      A bug was brought to my attention against a distro kernel but it affects
      mainline and I believe problems like this have been reported in various
      guises on the mailing lists although I don't have specific examples at the
      moment.
      
      The reported problem was that malloc() stalled for a long time (minutes in
      some cases) if a large tmpfs mount was occupying a large percentage of
      memory overall.  The pages did not get cleaned or reclaimed by
      zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
      are uselessly scanned frequently, making the CPU spin at near 100%.
      
      This patchset intends to address that bug and bring the behaviour of
      zone_reclaim() more in line with expectations which were noticed during
      investigation.  It is based on top of mmotm and takes advantage of
      Kosaki's work with respect to zone_reclaim().
      
      Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
      	scan should go ahead. The broken heuristic is what was causing the
      	malloc() stall as it uselessly scanned the LRU constantly. Currently,
      	zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
      	could not deal with tmpfs pages at all. This fixes up the heuristic so
      	that an unnecessary scan is more likely to be correctly avoided.
      
      Patch 2 notes that zone_reclaim() returning a failure automatically means
      	the zone is marked full. This is not always true. It could have
      	failed because the GFP mask or zone_reclaim_mode were unsuitable.
      
      Patch 3 introduces a counter zreclaim_failed that will increment each
      	time the zone_reclaim scan-avoidance heuristics fail. If that
      	counter is rapidly increasing, then zone_reclaim_mode should be
      	set to 0 as a temporary resolution and a bug reported because
      	the scan-avoidance heuristic is still broken.
      
      This patch:
      
      On NUMA machines, the administrator can configure zone_reclaim_mode that
      is a more targeted form of direct reclaim.  On machines with large NUMA
      distances for example, a zone_reclaim_mode defaults to 1 meaning that
      clean unmapped pages will be reclaimed if the zone watermarks are not
      being met.
      
      There is a heuristic that determines if the scan is worthwhile but the
      problem is that the heuristic is not being properly applied and is
      basically assuming zone_reclaim_mode is 1 if it is enabled.  The lack of
      proper detection can manifest as high CPU usage as the LRU list is scanned
      uselessly.
      
      Historically, once enabled, it depended on NR_FILE_PAGES, which may
      include swapcache pages that the reclaim_mode cannot deal with.  Patch
      vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
      Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
      pages that were not file-backed such as swapcache and made a calculation
      based on the inactive, active and mapped files.  This is far superior when
      zone_reclaim_mode==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
      reasonable starting figure.
      
      This patch alters how zone_reclaim() works out how many pages it might be
      able to reclaim given the current reclaim_mode.  If RECLAIM_SWAP is set in
      the reclaim_mode, it considers NR_FILE_PAGES as potential candidates;
      otherwise it uses NR_{IN,}ACTIVE_FILE-NR_FILE_MAPPED to discount
      swapcache and other non-file-backed pages.  If RECLAIM_WRITE is not set,
      then NR_FILE_DIRTY number of pages are not candidates.  If RECLAIM_SWAP is
      not set, then NR_FILE_MAPPED are not.
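
      As a rough sketch of the estimate described above (this is not the
      patch itself: struct zone_counters and both function names are made up
      for illustration, and the counters are passed in as plain numbers
      instead of being read with zone_page_state()):

```c
#include <assert.h>

/* zone_reclaim_mode bits; values here are for illustration. */
enum {
	RECLAIM_ZONE  = 1 << 0,	/* zone reclaim is enabled */
	RECLAIM_WRITE = 1 << 1,	/* dirty pages may be written out */
	RECLAIM_SWAP  = 1 << 2,	/* mapped pages may be unmapped/swapped */
};

/* Hypothetical snapshot of the per-zone counters named in the text. */
struct zone_counters {
	unsigned long nr_file_pages;
	unsigned long nr_inactive_file;
	unsigned long nr_active_file;
	unsigned long nr_file_mapped;
	unsigned long nr_file_dirty;
};

/* File LRU pages minus mapped pages, clamped so it cannot underflow. */
static unsigned long unmapped_file_pages(const struct zone_counters *z)
{
	unsigned long file_lru = z->nr_inactive_file + z->nr_active_file;

	return file_lru > z->nr_file_mapped ? file_lru - z->nr_file_mapped : 0;
}

/* How many page cache pages look reclaimable under the given mode. */
static unsigned long pagecache_reclaimable(const struct zone_counters *z,
					   int mode)
{
	unsigned long reclaimable;
	unsigned long delta = 0;

	assert(mode & RECLAIM_ZONE);

	/* With RECLAIM_SWAP every file page is a potential candidate. */
	if (mode & RECLAIM_SWAP)
		reclaimable = z->nr_file_pages;
	else
		reclaimable = unmapped_file_pages(z);

	/* Without RECLAIM_WRITE, dirty pages cannot be cleaned: drop them. */
	if (!(mode & RECLAIM_WRITE))
		delta += z->nr_file_dirty;

	/* Guard against underflow when delta exceeds the estimate. */
	if (delta > reclaimable)
		delta = reclaimable;

	return reclaimable - delta;
}
```

      The two clamps mirror the underflow concern credited to Wu Fengguang
      in the bracketed notes below.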
      
      [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
      [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90afa5de
    • Wu Fengguang's avatar
      writeback: skip new or to-be-freed inodes · 84a89245
      Wu Fengguang authored
      1) I_FREEING tests should be coupled with I_CLEAR
      
      The two I_FREEING tests are racy because clear_inode() can set i_state to
      I_CLEAR between the clear of I_SYNC and the test of I_FREEING.
      
      2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
         races with generic_forget_inode()
      
      generic_forget_inode() sets I_WILL_FREE and calls writeback on its own, so
      generic_sync_sb_inodes() shall not try to step in and create possible races:
      
        generic_forget_inode
          inode->i_state |= I_WILL_FREE;
          spin_unlock(&inode_lock);
                                             generic_sync_sb_inodes()
                                               spin_lock(&inode_lock);
                                               __iget(inode);
                                               __writeback_single_inode
                                                 // see non zero i_count
       may WARN here ==>                         WARN_ON(inode->i_state & I_WILL_FREE);
                                               spin_unlock(&inode_lock);
       may call generic_forget_inode again ==> iput(inode);
      
      The above race and warning didn't turn up because writeback_inodes() holds
      the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
      early.  But we are not sure the UBIFS calls and future callers will
      guarantee that.  So skip I_WILL_FREE inodes for the sake of safety.
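
      A minimal sketch of the two checks, assuming a plain array of i_state
      words instead of the real superblock inode lists. The bit values and
      the helper names are illustrative only, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values; the kernel defines its own. */
enum {
	I_NEW       = 1 << 0,
	I_WILL_FREE = 1 << 1,
	I_FREEING   = 1 << 2,
	I_CLEAR     = 1 << 3,
	I_SYNC      = 1 << 4,
};

/* Point 1: I_FREEING must always be tested together with I_CLEAR. */
static bool inode_going_away(unsigned int i_state)
{
	return (i_state & (I_FREEING | I_CLEAR)) != 0;
}

/* Point 2: the sync loop leaves new and about-to-be-freed inodes alone. */
static bool sync_should_skip(unsigned int i_state)
{
	return (i_state & (I_NEW | I_WILL_FREE)) != 0;
}

/* Count the inodes a sync pass would actually try to write back. */
static size_t count_writeback_candidates(const unsigned int *states, size_t n)
{
	size_t i, count = 0;

	assert(states != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (sync_should_skip(states[i]))
			continue;
		count++;
	}
	return count;
}
```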
      
      Cc: Eric Sandeen <sandeen@sandeen.net>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Cc: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84a89245
    • David Rientjes's avatar
      oom: only oom kill exiting tasks with attached memory · 81236810
      David Rientjes authored
      When a task is chosen for oom kill and is found to be PF_EXITING,
      __oom_kill_task() is called to elevate the task's timeslice and give it
      access to memory reserves so that it may quickly exit.
      
      This privilege is unnecessary, however, if the task has already detached
      its mm.  Although it's possible for the mm to become detached later since
      task_lock() is not held, __oom_kill_task() will simply be a no-op in such
      circumstances.
      
      Subsequently, it is no longer necessary to warn about killing mm-less
      tasks since it is a no-op.
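
      Roughly, the decision becomes the following. This is a sketch with a
      made-up struct task and enum, not the kernel's task_struct or the
      actual oom_kill_process() code, and the PF_EXITING value is only
      illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PF_EXITING 0x4	/* illustrative flag value */

/* Hypothetical, stripped-down view of a task. */
struct task {
	unsigned int flags;
	bool has_mm;		/* true while the mm is still attached */
};

enum oom_action {
	OOM_BOOST_EXITING,	/* give timeslice and reserves so it exits fast */
	OOM_REGULAR_PATH,	/* fall through to the normal kill path */
};

static enum oom_action oom_action_for(const struct task *p)
{
	assert(p != NULL);

	/* Only an exiting task that still owns memory benefits from a boost. */
	if ((p->flags & PF_EXITING) && p->has_mm)
		return OOM_BOOST_EXITING;

	return OOM_REGULAR_PATH;
}
```
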
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81236810
    • Daisuke Nishimura's avatar
      vmscan: handle may_swap more strictly · 9198e96c
      Daisuke Nishimura authored
      Commit 2e2e4259 ("vmscan,memcg:
      reintroduce sc->may_swap") added the may_swap flag and handles it in
      get_scan_ratio().
      
      But the result of get_scan_ratio() is ignored when priority == 0, so anon
      lru is scanned even if may_swap == 0 or nr_swap_pages == 0.  IMHO, this is
      not an expected behavior.
      
      For memcg in particular, this behavior means a great many pages are
      swapped out in vain when oom is invoked by the mem+swap limit.
      
      This patch handles the may_swap flag more strictly.
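
      One way to picture the stricter handling, as a sketch only: the
      function name and its flat argument list are invented for
      illustration, and ratio[] stands in for whatever get_scan_ratio()
      would have returned (index 0 = anon, 1 = file).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * How many pages of one LRU list to scan at the given priority.
 * Before the fix, priority == 0 ignored the ratio unconditionally,
 * so the anon LRU was scanned even when swapping was impossible.
 */
static unsigned long nr_to_scan(unsigned long lru_size, int priority,
				bool may_swap, long nr_swap_pages,
				bool file, const unsigned int ratio[2])
{
	bool noswap = !may_swap || nr_swap_pages <= 0;
	unsigned int percent[2];

	assert(priority >= 0);

	if (noswap) {
		/* No way to swap: never scan anon, scan only file pages. */
		percent[0] = 0;
		percent[1] = 100;
	} else {
		percent[0] = ratio[0];
		percent[1] = ratio[1];
	}

	/* Apply the ratio at priority 0 too whenever swap is unavailable. */
	if (priority || noswap) {
		lru_size >>= priority;
		lru_size = lru_size * percent[file ? 1 : 0] / 100;
	}
	return lru_size;
}
```
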
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9198e96c
    • Wu Fengguang's avatar
      vmscan: merge duplicate code in shrink_active_list() · 3eb4140f
      Wu Fengguang authored
      The "move pages to active list" and "move pages to inactive list" code
      blocks are mostly identical and can be served by a function.
      
      Thanks to Andrew Morton for pointing this out.
      
      Note that buffer_heads_over_limit check will also be carried out for
      re-activated pages, which is slightly different from pre-2.6.28 kernels.
      Also, Rik's "vmscan: evict use-once pages first" patch could totally stop
      scans of active file list when memory pressure is low.  So the net effect
      could be, the number of buffer heads is now more likely to grow large.
      
      However that's fine according to Johannes' comments:
      
        I don't think that this could be harmful.  We just preserve the buffer
        mappings of what we consider the working set and with low memory
        pressure, as you say, this set is not big.
      
        As to stripping of reactivated pages: the only pages we re-activate
        for now are those VM_EXEC mapped ones.  Since we don't expect IO from
        or to these pages, removing the buffer mappings in case they grow too
        large should be okay, I guess.
      
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3eb4140f
    • Wu Fengguang's avatar
      vmscan: make mapped executable pages the first class citizen · 8cab4754
      Wu Fengguang authored
      Protect referenced PROT_EXEC mapped pages from being deactivated.
      
      PROT_EXEC (or its internal representation VM_EXEC) pages normally belong to some
      currently running executables and their linked libraries, so they should be
      cached aggressively to provide a good user experience.
      
      Thanks to Johannes Weiner for the advice to reuse the VMA walk in
      page_referenced() to get the PROT_EXEC bit.
      
      [more details]
      
      ( The consequences of this patch will have to be discussed together with
        Rik van Riel's recent patch "vmscan: evict use-once pages first". )
      
      ( Some of the good points and insights are taken into this changelog.
        Thanks to all the involved people for the great LKML discussions. )
      
      the problem
      ===========
      
      For a typical desktop, the most precious working set is composed of
      *actively accessed*
      	(1) memory mapped executables
      	(2) and their anonymous pages
      	(3) and other files
      	(4) and the dcache/icache/.. slabs
      while the least important data are
      	(5) infrequently used or use-once files
      
      For a typical desktop, one major problem is a bursty and large amount of (5)
      use-once files flushing out the working set.
      
      Inside the working set, (4) dcache/icache have already been too sticky ;-)
      So we only have to care about (2) anonymous and (1)(3) file pages.
      
      anonymous pages
      ===============
      
      Anonymous pages are effectively immune to the streaming IO attack, because we
      now have separate file/anon LRU lists. When the use-once files crowd into the
      file LRU, the list's "quality" is significantly lowered. Therefore the scan
      balance policy in get_scan_ratio() will choose to scan the (low quality) file
      LRU much more frequently than the anon LRU.
      
      file pages
      ==========
      
      Rik proposed to *not* scan the active file LRU when the inactive list grows
      larger than the active list. This guarantees that when there is use-once streaming
      IO, and the working set is not too large (so that active_size < inactive_size),
      the active file LRU will *not* be scanned at all. So the not-too-large working
      set can be well protected.
      
      But there are also situations where the file working set is a bit large so that
      (active_size >= inactive_size), or the streaming IOs are not purely use-once.
      In these cases, the active list will be scanned slowly. Because the current
      shrink_active_list() policy is to deactivate active pages regardless of their
      referenced bits, the deactivated pages become susceptible to the streaming IO
      attack: the inactive list could be scanned fast (500MB / 50MBps = 10s) so that
      the deactivated pages don't have enough time to get re-referenced, since
      users tend to switch between windows at intervals from seconds to minutes.
      
      This patch holds mapped executable pages in the active list as long as they
      are referenced during each full scan of the active list.  Because the active
      list is normally scanned much slower, they get longer grace time (eg. 100s)
      for further references, which better matches the pace of user operations.
      
      Therefore this patch greatly prolongs the in-cache time of executable code
      under moderate memory pressure.
      
      	before patch: guaranteed to be cached if reference intervals < I
      	after  patch: guaranteed to be cached if reference intervals < I+A
      		      (except when randomly reclaimed by the lumpy reclaim)
      where
      	A = time to fully scan the   active file LRU
      	I = time to fully scan the inactive file LRU
      
      Note that normally A >> I.
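
      To make the rule and the I vs I+A bound concrete, here is a small
      sketch. struct page_info, the enum and both helpers are invented for
      illustration; restricting the protection to non-anonymous pages is an
      assumption drawn from the "file pages" discussion above, and lumpy
      reclaim is ignored:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what the active list scan knows about a page. */
struct page_info {
	bool referenced;	/* referenced since the last full scan */
	bool vm_exec;		/* mapped by some VMA with VM_EXEC */
	bool anon;		/* anonymous rather than file-backed */
};

enum page_fate { KEEP_ACTIVE, DEACTIVATE };

/* Referenced, executable, file-backed pages get another trip around. */
static enum page_fate active_scan_decision(const struct page_info *p)
{
	assert(p != NULL);

	if (p->referenced && p->vm_exec && !p->anon)
		return KEEP_ACTIVE;
	return DEACTIVATE;
}

/*
 * Is a page guaranteed to stay cached when it is re-referenced every
 * `interval` seconds?  I and A are the full scan times of the inactive
 * and active file LRU, as defined in the text.
 */
static bool stays_cached(unsigned int interval, unsigned int I,
			 unsigned int A, bool exec_protected)
{
	return interval < (exec_protected ? I + A : I);
}
```

      With the example figures used above (I = 10s, A = 100s), code that is
      touched once a minute is only guaranteed to stay cached when the
      protection applies.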
      
      side effects
      ============
      
      This patch is safe in general, it restores the pre-2.6.28 mmap() behavior
      but in a much smaller and well targeted scope.
      
      One may worry about someone abusing the PROT_EXEC heuristic.  But as
      Andrew Morton stated, there are other tricks to getting that sort of boost.
      
      Another concern is the PROT_EXEC mapped pages growing large in rare cases,
      and therefore hurting reclaim efficiency. But a sane application targeted at a
      large audience will never use PROT_EXEC for data mappings. If some home made
      application tries to abuse that bit, it shall be aware of the consequences.
      If it is abused to the scale of 2/3 of total memory, it gains nothing but overheads.
      
      benchmarks
      ==========
      
      1) memory tight desktop
      
      1.1) brief summary
      
      - clock time and major faults are reduced by 50%;
      - pswpin numbers are reduced to ~1/3.
      
      That means X desktop responsiveness is doubled under high memory/swap pressure.
      
      1.2) test scenario
      
      - nfsroot gnome desktop with 512M physical memory
      - run some programs, and switch between the existing windows
        after starting each new program.
      
      1.3) progress timing (seconds)
      
        before       after    programs
          0.02        0.02    N xeyes
          0.75        0.76    N firefox
          2.02        1.88    N nautilus
          3.36        3.17    N nautilus --browser
          5.26        4.89    N gthumb
          7.12        6.47    N gedit
          9.22        8.16    N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
         13.58       12.55    N xterm
         15.87       14.57    N mlterm
         18.63       17.06    N gnome-terminal
         21.16       18.90    N urxvt
         26.24       23.48    N gnome-system-monitor
         28.72       26.52    N gnome-help
         32.15       29.65    N gnome-dictionary
         39.66       36.12    N /usr/games/sol
         43.16       39.27    N /usr/games/gnometris
         48.65       42.56    N /usr/games/gnect
         53.31       47.03    N /usr/games/gtali
         58.60       52.05    N /usr/games/iagno
         65.77       55.42    N /usr/games/gnotravex
         70.76       61.47    N /usr/games/mahjongg
         76.15       67.11    N /usr/games/gnome-sudoku
         86.32       75.15    N /usr/games/glines
         92.21       79.70    N /usr/games/glchess
        103.79       88.48    N /usr/games/gnomine
        113.84       96.51    N /usr/games/gnotski
        124.40      102.19    N /usr/games/gnibbles
        137.41      114.93    N /usr/games/gnobots2
        155.53      125.02    N /usr/games/blackjack
        179.85      135.11    N /usr/games/same-gnome
        224.49      154.50    N /usr/bin/gnome-window-properties
        248.44      162.09    N /usr/bin/gnome-default-applications-properties
        282.62      173.29    N /usr/bin/gnome-at-properties
        323.72      188.21    N /usr/bin/gnome-typing-monitor
        363.99      199.93    N /usr/bin/gnome-at-visual
        394.21      206.95    N /usr/bin/gnome-sound-properties
        435.14      224.49    N /usr/bin/gnome-at-mobility
        463.05      234.11    N /usr/bin/gnome-keybinding-properties
        503.75      248.59    N /usr/bin/gnome-about-me
        554.00      276.27    N /usr/bin/gnome-display-properties
        615.48      304.39    N /usr/bin/gnome-network-preferences
        693.03      342.01    N /usr/bin/gnome-mouse-properties
        759.90      388.58    N /usr/bin/gnome-appearance-properties
        937.90      508.47    N /usr/bin/gnome-control-center
       1109.75      587.57    N /usr/bin/gnome-keyboard-properties
       1399.05      758.16    N : oocalc
       1524.64      830.03    N : oodraw
       1684.31      900.03    N : ooimpress
       1874.04      993.91    N : oomath
       2115.12     1081.89    N : ooweb
       2369.02     1161.99    N : oowriter
      
      Note that the last ": oo*" commands are actually commented out.
      
      1.4) vmstat numbers (some relevant ones are marked with *)
      
                                  before    after
       nr_free_pages              1293      3898
       nr_inactive_anon           59956     53460
       nr_active_anon             26815     30026
       nr_inactive_file           2657      3218
       nr_active_file             2019      2806
       nr_unevictable             4         4
       nr_mlock                   4         4
       nr_anon_pages              26706     27859
      *nr_mapped                  3542      4469
       nr_file_pages              72232     67681
       nr_dirty                   1         0
       nr_writeback               123       19
       nr_slab_reclaimable        3375      3534
       nr_slab_unreclaimable      11405     10665
       nr_page_table_pages        8106      7864
       nr_unstable                0         0
       nr_bounce                  0         0
      *nr_vmscan_write            394776    230839
       nr_writeback_temp          0         0
       numa_hit                   6843353   3318676
       numa_miss                  0         0
       numa_foreign               0         0
       numa_interleave            1719      1719
       numa_local                 6843353   3318676
       numa_other                 0         0
      *pgpgin                     5954683   2057175
      *pgpgout                    1578276   922744
      *pswpin                     1486615   512238
      *pswpout                    394568    230685
       pgalloc_dma                277432    56602
       pgalloc_dma32              6769477   3310348
       pgalloc_normal             0         0
       pgalloc_movable            0         0
       pgfree                     7048396   3371118
       pgactivate                 2036343   1471492
       pgdeactivate               2189691   1612829
       pgfault                    3702176   3100702
      *pgmajfault                 452116    201343
       pgrefill_dma               12185     7127
       pgrefill_dma32             334384    653703
       pgrefill_normal            0         0
       pgrefill_movable           0         0
       pgsteal_dma                74214     22179
       pgsteal_dma32              3334164   1638029
       pgsteal_normal             0         0
       pgsteal_movable            0         0
       pgscan_kswapd_dma          1081421   1216199
       pgscan_kswapd_dma32        58979118  46002810
       pgscan_kswapd_normal       0         0
       pgscan_kswapd_movable      0         0
       pgscan_direct_dma          2015438   1086109
       pgscan_direct_dma32        55787823  36101597
       pgscan_direct_normal       0         0
       pgscan_direct_movable      0         0
       pginodesteal               3461      7281
       slabs_scanned              564864    527616
       kswapd_steal               2889797   1448082
       kswapd_inodesteal          14827     14835
       pageoutrun                 43459     21562
       allocstall                 9653      4032
       pgrotated                  384216    228631
      
      1.5) free numbers at the end of the tests
      
      before patch:
                                   total       used       free     shared    buffers     cached
                      Mem:           474        467          7          0          0        236
                      -/+ buffers/cache:        230        243
                      Swap:         1023        418        605
      
      after patch:
                                   total       used       free     shared    buffers     cached
                      Mem:           474        457         16          0          0        236
                      -/+ buffers/cache:        221        253
                      Swap:         1023        404        619
      
      2) memory flushing in a file server
      
      2.1) brief summary
      
      The number of major faults drops from 50 to 3 during 10% cache hot reads.
      
      That means this patch successfully stops major faults while the active file
      list is slowly scanned under partially cache hot streaming IO.
      
      2.2) test scenario
      
      Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the
      pages will be activated:
      
              for i in `seq 0 100 10000000`; do echo $i 110;  done > pattern-hot-10
              iotrace.rb --load pattern-hot-10 --play /b/sparse
      	vmmon  nr_mapped nr_active_file nr_inactive_file   pgmajfault pgdeactivate pgfree
      
      and monitor /proc/vmstat during the time. The test box has 2G memory.
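
      The 10% figure follows from the overlap between consecutive reads:
      each read covers 110 pages but the offset only advances by 100, so
      the last 10 pages of every read are touched again by the next one.
      A small simulation of that pattern (the helper name is made up, and
      it only counts pages, it does not issue any pread()):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Simulate `nreads` sequential reads of `len` pages whose start offset
 * advances by `stride` pages, and count the pages touched more than once.
 * Returns (size_t)-1 if the bookkeeping array cannot be allocated.
 */
static size_t pages_read_twice(size_t nreads, size_t stride, size_t len)
{
	size_t total, i, p, twice = 0;
	unsigned char *hits;

	assert(stride > 0 && len > 0);

	/* Number of distinct pages covered by the whole pattern. */
	total = nreads ? (nreads - 1) * stride + len : 0;
	hits = calloc(total ? total : 1, sizeof(*hits));
	if (!hits)
		return (size_t)-1;

	for (i = 0; i < nreads; i++)
		for (p = i * stride; p < i * stride + len; p++)
			if (hits[p] < 2)	/* saturate at "seen twice" */
				hits[p]++;

	for (p = 0; p < total; p++)
		if (hits[p] == 2)
			twice++;

	free(hits);
	return twice;
}
```

      For the 100001 reads generated by the seq loop above,
      pages_read_twice(100001, 100, 110) gives 1000000 re-read pages out
      of 10000110 distinct pages, i.e. about 10%.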
      
      I carried out tests on a freshly booted console as well as an X desktop, and
      fetched the vmstat numbers at
      
      (1) begin:     shortly after the big read IO starts;
      (2) end:       just before the big read IO stops;
      (3) restore:   the big read IO stops and the zsh working set restored
      (4) restore X: after IO, switch back and forth between the urxvt and firefox
                     windows to restore their working set.
      
      2.3) console mode results
      
              nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
      
      2.6.29 VM_EXEC protection ON:
      begin:       2481             2237             8694              630                0           574299
      end:          275           231976           233914              633           776271         20933042
      restore:      370           232154           234524              691           777183         20958453
      
      2.6.29 VM_EXEC protection ON (second run):
      begin:       2434             2237             8493              629                0           574195
      end:          284           231970           233536              632           771918         20896129
      restore:      399           232218           234789              690           774526         20957909
      
      2.6.30-rc4-mm VM_EXEC protection OFF:
      begin:       2479             2344             9659              210                0           579643
      end:          284           232010           234142              260           772776         20917184
      restore:      379           232159           234371              301           774888         20967849
      
      The above console numbers show that
      
      - The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29.
        I'd attribute that improvement to the mmap readahead improvements :-)
      
      - The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50.
        That's a huge improvement - which means with the VM_EXEC protection logic,
        active mmap pages are pretty safe even under partially cache hot streaming IO.
      
      - when the active:inactive file lru size ratio reaches 1:1, their scan rate ratio is 1:20.8
        under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree)
        That roughly means the active mmap pages get 20.8 times more chances to get
        re-referenced to stay in memory.
      
      - The absolute nr_mapped drops considerably to 1/9 during the big IO, and the
        dropped pages are mostly inactive ones. The patch has almost no impact in
        this aspect, that means it won't unnecessarily increase memory pressure.
        (In contrast, your 20% mmap protection ratio will keep them all, and
        therefore eliminate the extra 41 major faults to restore working set
        of zsh etc.)
      
      The iotrace.rb read throughput is
      	151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse
      which means the inactive list is rotated at the speed of 250MB/s,
      so a full scan of which takes about 3.5 seconds, while a full scan
      of active file list takes about 77 seconds.
      
      2.4) X mode results
      
      We can reach roughly the same conclusions for X desktop:
      
              nr_mapped   nr_active_file nr_inactive_file       pgmajfault     pgdeactivate           pgfree
      
      2.6.30-rc4-mm VM_EXEC protection ON:
      begin:       9740             8920            64075              561                0           678360
      end:          768           218254           220029              565           798953         21057006
      restore:      857           218543           220987              606           799462         21075710
      restore X:   2414           218560           225344              797           799462         21080795
      
      2.6.30-rc4-mm VM_EXEC protection OFF:
      begin:       9368             5035            26389              554                0           633391
      end:          770           218449           221230              661           646472         17832500
      restore:     1113           218466           220978              710           649881         17905235
      restore X:   2687           218650           225484              947           802700         21083584
      
      - the absolute nr_mapped drops considerably (to 1/13 of the original size)
        during the streaming IO.
      - the delta of pgmajfault is 4 vs 107 during IO (565-561 vs 661-554), or
        236 vs 393 during the whole process.
      
      Cc: Elladan <elladan@eskimo.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cab4754