1. 13 Oct, 2009 1 commit
  2. 13 Nov, 2009 1 commit
    • Lee Schermerhorn's avatar
      This patch derives a "nodes_allowed" node mask from the numa mempolicy of · 1d447c7c
      Lee Schermerhorn authored
      the task modifying the number of persistent huge pages to control the
      allocation, freeing and adjusting of surplus huge pages when the pool page
      count is modified via the new sysctl or sysfs attribute
      "nr_hugepages_mempolicy".  The nodes_allowed mask is derived as follows:
      
      * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
        is produced.  This will cause the hugetlb subsystem to use
        node_online_map as the "nodes_allowed".  This preserves the
        behavior before this patch.
      * For "preferred" mempolicy, including explicit local allocation,
        a nodemask with the single preferred node will be produced.
        "local" policy will NOT track any internode migrations of the
        task adjusting nr_hugepages.
      * For "bind" and "interleave" policy, the mempolicy's nodemask
        will be used.
      * Other than to inform the construction of the nodes_allowed node
        mask, the actual mempolicy mode is ignored.  That is, all modes
        behave like interleave over the resulting nodes_allowed mask
        with no "fallback".
      
      See the updated documentation [next patch] for more information
      about the implications of this patch.
      
      Examples:
      
      Starting with:
      
      	Node 0 HugePages_Total:     0
      	Node 1 HugePages_Total:     0
      	Node 2 HugePages_Total:     0
      	Node 3 HugePages_Total:     0
      
      Default behavior [with or without this patch] balances persistent
      hugepage allocation across nodes [with sufficient contiguous memory]:
      
      	sysctl vm.nr_hugepages[_mempolicy]=32
      
      yields:
      
      	Node 0 HugePages_Total:     8
      	Node 1 HugePages_Total:     8
      	Node 2 HugePages_Total:     8
      	Node 3 HugePages_Total:     8
      
      Of course, we only have nr_hugepages_mempolicy with the patch,
      but with default mempolicy, nr_hugepages_mempolicy behaves the
      same as nr_hugepages.
      
      Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
      '--membind' because it allows multiple nodes to be specified
      and it's easy to type]--we can allocate huge pages on
      individual nodes or sets of nodes.  So, starting from the
      condition above, with 8 huge pages per node, add 8 more to
      node 2 using:
      
      	numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
      
      This yields:
      
      	Node 0 HugePages_Total:     8
      	Node 1 HugePages_Total:     8
      	Node 2 HugePages_Total:    16
      	Node 3 HugePages_Total:     8
      
      The incremental 8 huge pages were restricted to node 2 by the
      specified mempolicy.
      
      Similarly, we can use mempolicy to free persistent huge pages
      from specified nodes:
      
      	numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
      
      yields:
      
      	Node 0 HugePages_Total:     4
      	Node 1 HugePages_Total:     4
      	Node 2 HugePages_Total:    16
      	Node 3 HugePages_Total:     8
      
      The 8 huge pages freed were balanced over nodes 0 and 1.
      
      [rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1d447c7c
  3. 13 Oct, 2009 5 commits
    • Lee Schermerhorn's avatar
      Factor init_nodemask_of_node() out of the nodemask_of_node() macro. · bb762d64
      Lee Schermerhorn authored
      This will be used to populate the huge pages "nodes_allowed" nodemask for
      a single node when basing nodes_allowed on a preferred/local mempolicy or
      when a persistent huge page pool page count is modified via a per node
      sysfs attribute.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bb762d64
    • David Rientjes's avatar
      On Thu, 8 Oct 2009, Lee Schermerhorn wrote: · 1e678f9b
      David Rientjes authored
      > @@ -1144,14 +1156,15 @@ static void __init report_hugepages(void
      >  }
      >
      >  #ifdef CONFIG_HIGHMEM
      > -static void try_to_free_low(struct hstate *h, unsigned long count)
      > +static void try_to_free_low(struct hstate *h, unsigned long count,
      > +						nodemask_t *nodes_allowed)
      >  {
      >  	int i;
      >
      >  	if (h->order >= MAX_ORDER)
      >  		return;
      >
      > -	for (i = 0; i < MAX_NUMNODES; ++i) {
      > +	for_each_node_mask(node, nodes_allowed_) {
      >  		struct page *page, *next;
      >  		struct list_head *freel = &h->hugepage_freelists[i];
      >  		list_for_each_entry_safe(page, next, freel, lru) {
      
      That's not looking good for i386, Andrew please fold the following into
      this patch when it's merged into -mm:
      
      [rientjes@google.com: fix HIGHMEM compile error]
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      1e678f9b
    • Lee Schermerhorn's avatar
      In preparation for constraining huge page allocation and freeing by the · 902c08ec
      Lee Schermerhorn authored
      controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
      to the allocate, free and surplus adjustment functions.  For now, pass
      NULL to indicate default behavior--i.e., use node_online_map.  A
      subsqeuent patch will derive a non-default mask from the controlling
      task's numa mempolicy.
      
      Note that this method of updating the global hstate nr_hugepages under the
      constraint of a nodemask simplifies keeping the global state
      consistent--especially the number of persistent and surplus pages relative
      to reservations and overcommit limits.  There are undoubtedly other ways
      to do this, but this works for both interfaces: mempolicy and per node
      attributes.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      902c08ec
    • Lee Schermerhorn's avatar
      Modify the hstate_next_node* functions to allow them to be called to · 212dc261
      Lee Schermerhorn authored
      obtain the "start_nid".  Then, whereas prior to this patch we
      unconditionally called hstate_next_node_to_{alloc|free}(), whether or not
      we successfully allocated/freed a huge page on the node, now we only call
      these functions on failure to alloc/free to advance to next allowed node.
      
      Factor out the next_node_allowed() function to handle wrap at end of
      node_online_map.  In this version, the allowed nodes include all of the
      online nodes.
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: default avatarMel Gorman <mel@csn.ul.ie>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      212dc261
    • David Rientjes's avatar
      This is a series of patches to provide control over the location of the · 398ff16a
      David Rientjes authored
      allocation and freeing of persistent huge pages on a NUMA platform. 
      Please consider for merging into mmotm.
      
      This series uses two mechanisms to constrain the nodes from which
      persistent huge pages are allocated: 1) the task NUMA mempolicy of the
      task modifying a new sysctl "nr_hugepages_mempolicy", based on a
      suggestion by Mel Gorman; and 2) a subset of the hugepages hstate sysfs
      attributes have been added [in V4] to each node system device under:
      
      	/sys/devices/node/node[0-9]*/hugepages
      
      The per node attibutes allow direct assignment of a huge page count on a
      specific node, regardless of the task's mempolicy or cpuset constraints.  
      
      
      This patch:
      
      NODEMASK_ALLOC(x, m) assumes x is a type of struct, which is unnecessary. 
      It's perfectly reasonable to use this macro to allocate a nodemask_t,
      which is anonymous, either dynamically or on the stack depending on
      NODES_SHIFT.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      398ff16a
  4. 13 Nov, 2009 1 commit
  5. 25 Sep, 2009 4 commits
  6. 14 Sep, 2009 1 commit
  7. 13 Sep, 2009 1 commit
    • Andrew Morton's avatar
      cleanuplets. · 4f50dc9d
      Andrew Morton authored
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      4f50dc9d
  8. 12 Sep, 2009 3 commits
  9. 11 Nov, 2009 4 commits
    • Vincent Li's avatar
      If we can't isolate pages from LRU list, we don't have to account page · a9e01acd
      Vincent Li authored
      movement, either.  Already, in commit 5343daceec, KOSAKI did it about
      shrink_inactive_list.
      
      This patch removes unnecessary overhead of page accounting and locking in
      shrink_active_list as follow-up work of commit 5343daceec.
      Signed-off-by: default avatarVincent Li <macli@brc.ubc.ca>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      a9e01acd
    • Andrew Morton's avatar
      ERROR: "foo * bar" should be "foo *bar" · 3a220441
      Andrew Morton authored
      #116: FILE: mm/mmap.c:1835:
      +static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
      
      ERROR: "foo * bar" should be "foo *bar"
      #138: FILE: mm/mmap.c:1888:
      +int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
      
      total: 2 errors, 0 warnings, 67 lines checked
      
      ./patches/mmap-dont-return-enomem-when-mapcount-is-temporarily-exceeded-in-munmap.patch has style problems, please review.  If any of these errors
      are false positives report them to the maintainer, see
      CHECKPATCH in MAINTAINERS.
      
      Please run checkpatch prior to sending patches
      
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3a220441
    • KOSAKI Motohiro's avatar
      On ia64, the following test program exit abnormally, because glibc thread · 06666f81
      KOSAKI Motohiro authored
      library called abort().
      
       ========================================================
       (gdb) bt
       #0  0xa000000000010620 in __kernel_syscall_via_break ()
       #1  0x20000000003208e0 in raise () from /lib/libc.so.6.1
       #2  0x2000000000324090 in abort () from /lib/libc.so.6.1
       #3  0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
       #4  0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
       #5  0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
       ========================================================
      
      The fact is, glibc call munmap() when thread exitng time for freeing
      stack, and it assume munlock() never fail.  However, munmap() often make
      vma splitting and it with many mapcount make -ENOMEM.
      
      Oh well, that's crazy, because stack unmapping never increase mapcount. 
      The maxcount exceeding is only temporary.  internal temporary exceeding
      shouldn't make ENOMEM.
      
      This patch does it.
      
       test_max_mapcount.c
       ==================================================================
        #include<stdio.h>
        #include<stdlib.h>
        #include<string.h>
        #include<pthread.h>
        #include<errno.h>
        #include<unistd.h>
      
        #define THREAD_NUM 30000
        #define MAL_SIZE (8*1024*1024)
      
       void *wait_thread(void *args)
       {
       	void *addr;
      
       	addr = malloc(MAL_SIZE);
       	sleep(10);
      
       	return NULL;
       }
      
       void *wait_thread2(void *args)
       {
       	sleep(60);
      
       	return NULL;
       }
      
       int main(int argc, char *argv[])
       {
       	int i;
       	pthread_t thread[THREAD_NUM], th;
       	int ret, count = 0;
       	pthread_attr_t attr;
      
       	ret = pthread_attr_init(&attr);
       	if(ret) {
       		perror("pthread_attr_init");
       	}
      
       	ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
       	if(ret) {
       		perror("pthread_attr_setdetachstate");
       	}
      
       	for (i = 0; i < THREAD_NUM; i++) {
       		ret = pthread_create(&th, &attr, wait_thread, NULL);
       		if(ret) {
       			fprintf(stderr, "[%d] ", count);
       			perror("pthread_create");
       		} else {
       			printf("[%d] create OK.\n", count);
       		}
       		count++;
      
       		ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
       		if(ret) {
       			fprintf(stderr, "[%d] ", count);
       			perror("pthread_create");
       		} else {
       			printf("[%d] create OK.\n", count);
       		}
       		count++;
       	}
      
       	sleep(3600);
       	return 0;
       }
       ==================================================================
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      06666f81
    • Alex Chiang's avatar
      On a system with large amount of memory (256GB), invoking page-types can · 7e6e1679
      Alex Chiang authored
      take quite a long time, which is unreasonable considering the user only
      wants a description of the flags:
      
      	# time ./page-types -d 0x10
      	0x0000000000000010	____D_____________________________	dirty
      
      	real	0m34.285s
      	user	0m1.966s
      	sys	0m32.313s
      
      This is because we still walk the entire address range.
      
      Exiting early seems like a reasonble solution:
      
      # time ./page-types -d 0x10
      	0x0000000000000010	____D_____________________________	dirty
      
      	real	0m0.007s
      	user	0m0.001s
      	sys	0m0.005s
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Haicheng Li <haicheng.li@intel.com>
      Acked-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      7e6e1679
  10. 17 Nov, 2009 1 commit
  11. 11 Nov, 2009 1 commit
    • Wu Fengguang's avatar
      Teach page-types to describe page flags directly from the command · 3af04adb
      Wu Fengguang authored
      line.
      Why is this useful? For instance, if you're using memory hotplug
      and see this in /var/log/messages:
      	kernel: removing from LRU failed 3836dd0/1/1e00000000000010
      
      It would be nice to decode those page flags without staring at
      the source.
      Example usage and output:
      
      # Documentation/vm/page-types -d 0x10
      0x0000000000000010	____D_____________________________	dirty
      
      # Documentation/vm/page-types -d anon
      0x0000000000001000	____________a_____________________	anonymous
      
      # Documentation/vm/page-types -d anon,0x10
      0x0000000000001010	____D_______a_____________________	dirty,anonymous
      
      [achiang@hp.com: documentation]
      Signed-off-by: default avatarAlex Chiang <achiang@hp.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Haicheng Li <haicheng.li@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      3af04adb
  12. 12 Oct, 2009 2 commits
  13. 11 Nov, 2009 1 commit
  14. 21 Sep, 2009 1 commit
    • Hisashi Hifumi's avatar
      I added blk_run_backing_dev on page_cache_async_readahead so readahead I/O · fd437237
      Hisashi Hifumi authored
      is unpluged to improve throughput on especially RAID environment.
      
      The normal case is, if page N become uptodate at time T(N), then T(N) <=
      T(N+1) holds.  With RAID (and NFS to some degree), there is no strict
      ordering, the data arrival time depends on runtime status of individual
      disks, which breaks that formula.  So in do_generic_file_read(), just
      after submitting the async readahead IO request, the current page may well
      be uptodate, so the page won't be locked, and the block device won't be
      implicitly unplugged:
      
                     if (PageReadahead(page))
                              page_cache_async_readahead()
                      if (!PageUptodate(page))
                                      goto page_not_up_to_date;
                      //...
      page_not_up_to_date:
                      lock_page_killable(page);
      
      Therefore explicit unplugging can help.
      
      Following is the test result with dd.
      
      #dd if=testdir/testfile of=/dev/null bs=16384
      
      -2.6.30-rc6
      1048576+0 records in
      1048576+0 records out
      17179869184 bytes (17 GB) copied, 224.182 seconds, 76.6 MB/s
      
      -2.6.30-rc6-patched
      1048576+0 records in
      1048576+0 records out
      17179869184 bytes (17 GB) copied, 206.465 seconds, 83.2 MB/s
      
      (7Disks RAID-0 Array)
      
      -2.6.30-rc6
      1054976+0 records in
      1054976+0 records out
      17284726784 bytes (17 GB) copied, 212.233 seconds, 81.4 MB/s
      
      -2.6.30-rc6-patched
      1054976+0 records out
      17284726784 bytes (17 GB) copied, 198.878 seconds, 86.9 MB/s
      
      (7Disks RAID-5 Array)
      
      The patch was found to improve performance with the SCST scsi target
      driver.  See
      http://sourceforge.net/mailarchive/forum.php?thread_name=a0272b440906030714g67eabc5k8f847fb1e538cc62%40mail.gmail.com&forum_name=scst-devel
      
      [akpm@linux-foundation.org: unbust comment layout]
      [akpm@linux-foundation.org: "fix" CONFIG_BLOCK=n]
      Signed-off-by: default avatarHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Acked-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: default avatarRonald <intercommit@gmail.com>
      Cc: Bart Van Assche <bart.vanassche@gmail.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      fd437237
  15. 09 Oct, 2009 1 commit
  16. 13 Nov, 2009 1 commit
  17. 11 Nov, 2009 1 commit
  18. 30 Sep, 2009 1 commit
  19. 13 Aug, 2009 1 commit
  20. 24 Jul, 2009 1 commit
  21. 17 Nov, 2009 1 commit
  22. 30 Sep, 2009 2 commits
    • Sage Weil's avatar
      Get rid of the goto by flipping the if (!result) over. Make the comments · 98022231
      Sage Weil authored
      a bit more descriptive.  Fix a few kernel style problems.  No functional
      changes.
      
      Cc: Ian Kent <raven@themaw.net>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger@sun.com>
      Signed-off-by: default avatarYehuda Sadeh <yehuda@newdream.net>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      98022231
    • Sage Weil's avatar
      real_lookup() is called by do_lookup() if dentry revalidation fails. If · cb21ed84
      Sage Weil authored
      the cache is re-populated while waiting for i_mutex, it may find that a
      d_lookup() subsequently succeeds (see the "Uhhuh!  Nasty case" comment).
      
      Previously, real_lookup() would drop i_mutex and do_revalidate() again. 
      If revalidate failed _again_, however, it would give up with -ENOENT.  The
      problem here that network file systems may be invalidating dentries via
      server callbacks, e.g.  due to concurrent access from another client, and
      -ENOENT is frequently the wrong answer.
      
      This problem has been seen with both Lustre and Ceph.  It seems possible
      to hit this case with NFS as well if the cache lifetime is very short.
      
      Instead, we should do_revalidate() while i_mutex is still held.  If
      revalidation fails, we can move on to a ->lookup() and ensure a correct
      result without worrying about any subsequent races.
      
      Note that do_revalidate() is called with i_mutex held elsewhere.  For
      example, do_filp_open(), lookup_create(), do_unlinkat(), do_rmdir(), and
      possibly others all take the directory i_mutex, and then
      
      -> lookup_hash
              -> __lookup_hash
                      -> cached_lookup
                              -> do_revalidate
      
      so this does not introduce any new locking rules for d_revalidate
      implementations.
      
      Yes, the goto is ugly.  A cleanup patch follows.
      
      Cc: Ian Kent <raven@themaw.net>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger@sun.com>
      Signed-off-by: default avatarYehuda Sadeh <yehuda@newdream.net>
      Signed-off-by: default avatarSage Weil <sage@newdream.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      cb21ed84
  23. 25 Sep, 2009 1 commit
    • Nick Piggin's avatar
      Invalidate sb->s_bdev on remount,ro. · d14ff77a
      Nick Piggin authored
      Fixes a problem reported by Jorge Boncompte who is seeing corruption
      trying to snapshot a minix filesystem image.  Some filesystems modify
      their metadata via a path other than the bdev buffer cache (eg.  they may
      use a private linear mapping for their metadata, or implement directories
      in pagecache, etc).  Also, file data modifications usually go to the bdev
      via their own mappings.
      
      These updates are not coherent with buffercache IO (eg.  via /dev/bdev)
      and never have been.  However there could be a reasonable expectation that
      after a mount -oremount,ro operation then the buffercache should
      subsequently be coherent with previous filesystem modifications.
      
      So invalidate the bdev mappings on a remount,ro operation to provide a
      coherency point.
      
      The problem was exposed when we switched the old rd to brd because old rd
      didn't really function like a normal block device and updates to rd via
      mappings other than the buffercache would still end up going into its
      buffercache.  But the same problem has always affected other "normal"
      block devices, including loop.
      
      [akpm@linux-foundation.org: repair comment layout]
      Reported-by: default avatar"Jorge Boncompte [DTI2]" <jorge@dti2.net>
      Tested-by: default avatar"Jorge Boncompte [DTI2]" <jorge@dti2.net>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      d14ff77a
  24. 11 Nov, 2009 3 commits
    • Nick Piggin's avatar
      Filesystems outside the regular namespace do not have to clear · 5e6bb6c5
      Nick Piggin authored
      DCACHE_UNHASHED in order to have a working /proc/$pid/fd/XXX.  Nothing in
      proc prevents the fd link from being used if its dentry is not in the
      hash.
      
      Also, it does not get put into the dcache hash if DCACHE_UNHASHED is
      clear; that depends on the filesystem calling d_add or d_rehash.
      
      So delete the misleading comments and needless code.
      Acked-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      5e6bb6c5
    • Roland Dreier's avatar
      > ============================================= · 399da0ba
      Roland Dreier authored
       >  [ INFO: possible recursive locking detected ]
       >  2.6.31-2-generic #14~rbd3
       >  ---------------------------------------------
       >  firefox-3.5/4162 is trying to acquire lock:
       >   (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >
       >  but task is already holding lock:
       >   (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >
       >  other info that might help us debug this:
       >  3 locks held by firefox-3.5/4162:
       >   #0:  (&s->s_vfs_rename_mutex){+.+.+.}, at: [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >   #1:  (&sb->s_type->i_mutex_key#11/1){+.+.+.}, at: [<ffffffff81139d5a>] lock_rename+0x6a/0xf0
       >   #2:  (&sb->s_type->i_mutex_key#11/2){+.+.+.}, at: [<ffffffff81139d6f>] lock_rename+0x7f/0xf0
       >
       >  stack backtrace:
       >  Pid: 4162, comm: firefox-3.5 Tainted: G         C 2.6.31-2-generic #14~rbd3
       >  Call Trace:
       >   [<ffffffff8108ae74>] print_deadlock_bug+0xf4/0x100
       >   [<ffffffff8108ce26>] validate_chain+0x4c6/0x750
       >   [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
       >   [<ffffffff8108d585>] lock_acquire+0xa5/0x150
       >   [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
       >   [<ffffffff815526ad>] __mutex_lock_common+0x4d/0x3d0
       >   [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
       >   [<ffffffff81139d31>] ? lock_rename+0x41/0xf0
       >   [<ffffffff8120eaf9>] ? ecryptfs_rename+0x99/0x170
       >   [<ffffffff81552b36>] mutex_lock_nested+0x46/0x60
       >   [<ffffffff81139d31>] lock_rename+0x41/0xf0
       >   [<ffffffff8120eb2a>] ecryptfs_rename+0xca/0x170
       >   [<ffffffff81139a9e>] vfs_rename_dir+0x13e/0x160
       >   [<ffffffff8113ac7e>] vfs_rename+0xee/0x290
       >   [<ffffffff8113c212>] ? __lookup_hash+0x102/0x160
       >   [<ffffffff8113d512>] sys_renameat+0x252/0x280
       >   [<ffffffff81133eb4>] ? cp_new_stat+0xe4/0x100
       >   [<ffffffff8101316a>] ? sysret_check+0x2e/0x69
       >   [<ffffffff8108c34d>] ? trace_hardirqs_on_caller+0x14d/0x190
       >   [<ffffffff8113d55b>] sys_rename+0x1b/0x20
       >   [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
      
      The trace above is totally reproducible by doing a cross-directory
      rename on an ecryptfs directory.
      
      The issue seems to be that sys_renameat() does lock_rename() then calls
      into the filesystem; if the filesystem is ecryptfs, then
      ecryptfs_rename() again does lock_rename() on the lower filesystem, and
      lockdep can't tell that the two s_vfs_rename_mutexes are different.  It
      seems an annotation like the following is sufficient to fix this (it
      does get rid of the lockdep trace in my simple tests); however I would
      like to make sure I'm not misunderstanding the locking, hence the CC
      list...
      Signed-off-by: default avatarRoland Dreier <rdreier@cisco.com>
      Cc: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
      Cc: Dustin Kirkland <kirkland@canonical.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      399da0ba
    • Tony Battersby's avatar
      Improve the description of fget_light(), which is currently incorrect · bc95777a
      Tony Battersby authored
      about needing a prior refcnt (judging by the way it is actually used).
      Signed-off-by: default avatarTony Battersby <tonyb@cybernetics.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      bc95777a