Commit b291f000 authored by Nick Piggin's avatar Nick Piggin Committed by Linus Torvalds

mlock: mlocked pages are unevictable

Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.

This is achieved through various strategies:

1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.

   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.

2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.

3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.

4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism let's pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.

Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: default avatarNick Piggin <npiggin@suse.de>

splitlru: introduce __get_user_pages():

  New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
  because current get_user_pages() can't grab PROT_NONE pages theresore it
  cause PROT_NONE pages can't munlock.

[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: default avatarRik van Riel <riel@redhat.com>
Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
parent 89e004ea
......@@ -131,6 +131,11 @@ extern unsigned int kobjsize(const void *objp);
#define VM_SequentialReadHint(v) ((v)->vm_flags & VM_SEQ_READ)
#define VM_RandomReadHint(v) ((v)->vm_flags & VM_RAND_READ)
/*
* special vmas that are non-mergable, non-mlock()able
*/
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
/*
* mapping from the currently active vm_flags protection bits (the
* low four bits) to a page protection mask..
......
......@@ -96,6 +96,7 @@ enum pageflags {
PG_swapbacked, /* Page is backed by RAM/swap */
#ifdef CONFIG_UNEVICTABLE_LRU
PG_unevictable, /* Page is "unevictable" */
PG_mlocked, /* Page is vma mlocked */
#endif
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
......@@ -232,7 +233,17 @@ PAGEFLAG_FALSE(SwapCache)
#ifdef CONFIG_UNEVICTABLE_LRU
PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable)
TESTCLEARFLAG(Unevictable, unevictable)
#define MLOCK_PAGES 1
PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked)
TESTSCFLAG(Mlocked, mlocked)
#else
#define MLOCK_PAGES 0
PAGEFLAG_FALSE(Mlocked)
SETPAGEFLAG_NOOP(Mlocked) TESTCLEARFLAG_FALSE(Mlocked)
PAGEFLAG_FALSE(Unevictable) TESTCLEARFLAG_FALSE(Unevictable)
SETPAGEFLAG_NOOP(Unevictable) CLEARPAGEFLAG_NOOP(Unevictable)
__CLEARPAGEFLAG_NOOP(Unevictable)
......@@ -354,15 +365,17 @@ static inline void __ClearPageTail(struct page *page)
#endif /* !PAGEFLAGS_EXTENDED */
#ifdef CONFIG_UNEVICTABLE_LRU
#define __PG_UNEVICTABLE (1 << PG_unevictable)
#define __PG_UNEVICTABLE (1 << PG_unevictable)
#define __PG_MLOCKED (1 << PG_mlocked)
#else
#define __PG_UNEVICTABLE 0
#define __PG_UNEVICTABLE 0
#define __PG_MLOCKED 0
#endif
#define PAGE_FLAGS (1 << PG_lru | 1 << PG_private | 1 << PG_locked | \
1 << PG_buddy | 1 << PG_writeback | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
__PG_UNEVICTABLE)
__PG_UNEVICTABLE | __PG_MLOCKED)
/*
* Flags checked in bad_page(). Pages on the free list should not have
......
......@@ -117,6 +117,19 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
*/
int page_mkclean(struct page *);
#ifdef CONFIG_UNEVICTABLE_LRU
/*
* called in munlock()/munmap() path to check for other vmas holding
* the page mlocked.
*/
int try_to_munlock(struct page *);
#else
static inline int try_to_munlock(struct page *page)
{
return 0; /* a.k.a. SWAP_SUCCESS */
}
#endif
#else /* !CONFIG_MMU */
#define anon_vma_init() do {} while (0)
......@@ -140,5 +153,6 @@ static inline int page_mkclean(struct page *page)
#define SWAP_SUCCESS 0
#define SWAP_AGAIN 1
#define SWAP_FAIL 2
#define SWAP_MLOCK 3
#endif /* _LINUX_RMAP_H */
......@@ -61,6 +61,10 @@ static inline unsigned long page_order(struct page *page)
return page_private(page);
}
extern int mlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
extern void munlock_vma_pages_all(struct vm_area_struct *vma);
#ifdef CONFIG_UNEVICTABLE_LRU
/*
* unevictable_migrate_page() called only from migrate_page_copy() to
......@@ -79,6 +83,65 @@ static inline void unevictable_migrate_page(struct page *new, struct page *old)
}
#endif
#ifdef CONFIG_UNEVICTABLE_LRU
/*
* Called only in fault path via page_evictable() for a new page
* to determine if it's being mapped into a LOCKED vma.
* If so, mark page as mlocked.
*/
static inline int is_mlocked_vma(struct vm_area_struct *vma, struct page *page)
{
VM_BUG_ON(PageLRU(page));
if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
return 0;
SetPageMlocked(page);
return 1;
}
/*
* must be called with vma's mmap_sem held for read, and page locked.
*/
extern void mlock_vma_page(struct page *page);
/*
* Clear the page's PageMlocked(). This can be useful in a situation where
* we want to unconditionally remove a page from the pagecache -- e.g.,
* on truncation or freeing.
*
* It is legal to call this function for any page, mlocked or not.
* If called for a page that is still mapped by mlocked vmas, all we do
* is revert to lazy LRU behaviour -- semantics are not broken.
*/
extern void __clear_page_mlock(struct page *page);
static inline void clear_page_mlock(struct page *page)
{
if (unlikely(TestClearPageMlocked(page)))
__clear_page_mlock(page);
}
/*
* mlock_migrate_page - called only from migrate_page_copy() to
* migrate the Mlocked page flag
*/
static inline void mlock_migrate_page(struct page *newpage, struct page *page)
{
if (TestClearPageMlocked(page))
SetPageMlocked(newpage);
}
#else /* CONFIG_UNEVICTABLE_LRU */
static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
{
return 0;
}
static inline void clear_page_mlock(struct page *page) { }
static inline void mlock_vma_page(struct page *page) { }
static inline void mlock_migrate_page(struct page *new, struct page *old) { }
#endif /* CONFIG_UNEVICTABLE_LRU */
/*
* FLATMEM and DISCONTIGMEM configurations use alloc_bootmem_node,
......@@ -148,4 +211,12 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
}
#endif /* CONFIG_SPARSEMEM */
#define GUP_FLAGS_WRITE 0x1
#define GUP_FLAGS_FORCE 0x2
#define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4
int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
struct page **pages, struct vm_area_struct **vmas);
#endif
......@@ -64,6 +64,8 @@
#include "internal.h"
#include "internal.h"
#ifndef CONFIG_NEED_MULTIPLE_NODES
/* use the per-pgdat data instead for discontigmem - mbligh */
unsigned long max_mapnr;
......@@ -1129,12 +1131,17 @@ static inline int use_zero_page(struct vm_area_struct *vma)
return !vma->vm_ops || !vma->vm_ops->fault;
}
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int write, int force,
int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
struct page **pages, struct vm_area_struct **vmas)
{
int i;
unsigned int vm_flags;
unsigned int vm_flags = 0;
int write = !!(flags & GUP_FLAGS_WRITE);
int force = !!(flags & GUP_FLAGS_FORCE);
int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
if (len <= 0)
return 0;
......@@ -1158,7 +1165,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
if (write) /* user gate pages are read-only */
/* user gate pages are read-only */
if (!ignore && write)
return i ? : -EFAULT;
if (pg > TASK_SIZE)
pgd = pgd_offset_k(pg);
......@@ -1190,8 +1199,9 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
continue;
}
if (!vma || (vma->vm_flags & (VM_IO | VM_PFNMAP))
|| !(vm_flags & vma->vm_flags))
if (!vma ||
(vma->vm_flags & (VM_IO | VM_PFNMAP)) ||
(!ignore && !(vm_flags & vma->vm_flags)))
return i ? : -EFAULT;
if (is_vm_hugetlb_page(vma)) {
......@@ -1266,6 +1276,23 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
} while (len);
return i;
}
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int write, int force,
struct page **pages, struct vm_area_struct **vmas)
{
int flags = 0;
if (write)
flags |= GUP_FLAGS_WRITE;
if (force)
flags |= GUP_FLAGS_FORCE;
return __get_user_pages(tsk, mm,
start, len, flags,
pages, vmas);
}
EXPORT_SYMBOL(get_user_pages);
pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
......@@ -1858,6 +1885,15 @@ gotten:
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
if (!new_page)
goto oom;
/*
* Don't let another task, with possibly unlocked vma,
* keep the mlocked page.
*/
if (vma->vm_flags & VM_LOCKED) {
lock_page(old_page); /* for LRU manipulation */
clear_page_mlock(old_page);
unlock_page(old_page);
}
cow_user_page(new_page, old_page, address, vma);
__SetPageUptodate(new_page);
......@@ -2325,7 +2361,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
page_add_anon_rmap(page, vma, address);
swap_free(entry);
if (vm_swap_full())
if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
remove_exclusive_swap_page(page);
unlock_page(page);
......@@ -2465,6 +2501,12 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ret = VM_FAULT_OOM;
goto out;
}
/*
* Don't let another task, with possibly unlocked vma,
* keep the mlocked page.
*/
if (vma->vm_flags & VM_LOCKED)
clear_page_mlock(vmf.page);
copy_user_highpage(page, vmf.page, address, vma);
__SetPageUptodate(page);
} else {
......
......@@ -371,6 +371,8 @@ static void migrate_page_copy(struct page *newpage, struct page *page)
__set_page_dirty_nobuffers(newpage);
}
mlock_migrate_page(newpage, page);
#ifdef CONFIG_SWAP
ClearPageSwapCache(page);
#endif
......
This diff is collapsed.
......@@ -662,8 +662,6 @@ again: remove_next = 1 + (end > next->vm_end);
* If the vma has a ->close operation then the driver probably needs to release
* per-vma resources, so we don't attempt to merge those.
*/
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)
static inline int is_mergeable_vma(struct vm_area_struct *vma,
struct file *file, unsigned long vm_flags)
{
......
......@@ -34,6 +34,8 @@
#include <asm/tlb.h>
#include <asm/tlbflush.h>
#include "internal.h"
void *high_memory;
struct page *mem_map;
unsigned long max_mapnr;
......@@ -128,20 +130,16 @@ unsigned int kobjsize(const void *objp)
return PAGE_SIZE << compound_order(page);
}
/*
* get a list of pages in an address range belonging to the specified process
* and indicate the VMA that covers each page
* - this is potentially dodgy as we may end incrementing the page count of a
* slab page or a secondary page from a compound page
* - don't permit access to VMAs that don't support it, such as I/O mappings
*/
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int write, int force,
struct page **pages, struct vm_area_struct **vmas)
int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
struct page **pages, struct vm_area_struct **vmas)
{
struct vm_area_struct *vma;
unsigned long vm_flags;
int i;
int write = !!(flags & GUP_FLAGS_WRITE);
int force = !!(flags & GUP_FLAGS_FORCE);
int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS);
/* calculate required read or write permissions.
* - if 'force' is set, we only require the "MAY" flags.
......@@ -156,7 +154,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
/* protect what we can, including chardevs */
if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
!(vm_flags & vma->vm_flags))
(!ignore && !(vm_flags & vma->vm_flags)))
goto finish_or_fault;
if (pages) {
......@@ -174,6 +172,30 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
finish_or_fault:
return i ? : -EFAULT;
}
/*
* get a list of pages in an address range belonging to the specified process
* and indicate the VMA that covers each page
* - this is potentially dodgy as we may end incrementing the page count of a
* slab page or a secondary page from a compound page
* - don't permit access to VMAs that don't support it, such as I/O mappings
*/
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int write, int force,
struct page **pages, struct vm_area_struct **vmas)
{
int flags = 0;
if (write)
flags |= GUP_FLAGS_WRITE;
if (force)
flags |= GUP_FLAGS_FORCE;
return __get_user_pages(tsk, mm,
start, len, flags,
pages, vmas);
}
EXPORT_SYMBOL(get_user_pages);
DEFINE_RWLOCK(vmlist_lock);
......
......@@ -616,7 +616,11 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_reclaim |
1 << PG_referenced | 1 << PG_arch_1 |
1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
1 << PG_owner_priv_1 | 1 << PG_mappedtodisk
#ifdef CONFIG_UNEVICTABLE_LRU
| 1 << PG_mlocked
#endif
);
set_page_private(page, 0);
set_page_refcounted(page);
......
This diff is collapsed.
......@@ -278,7 +278,7 @@ void lru_add_drain(void)
put_cpu();
}
#ifdef CONFIG_NUMA
#if defined(CONFIG_NUMA) || defined(CONFIG_UNEVICTABLE_LRU)
static void lru_add_drain_per_cpu(struct work_struct *dummy)
{
lru_add_drain();
......
......@@ -582,11 +582,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
sc->nr_scanned++;
if (unlikely(!page_evictable(page, NULL))) {
unlock_page(page);
putback_lru_page(page);
continue;
}
if (unlikely(!page_evictable(page, NULL)))
goto cull_mlocked;
if (!sc->may_swap && page_mapped(page))
goto keep_locked;
......@@ -624,9 +621,19 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
*/
if (PageAnon(page) && !PageSwapCache(page))
if (PageAnon(page) && !PageSwapCache(page)) {
switch (try_to_munlock(page)) {
case SWAP_FAIL: /* shouldn't happen */
case SWAP_AGAIN:
goto keep_locked;
case SWAP_MLOCK:
goto cull_mlocked;
case SWAP_SUCCESS:
; /* fall thru'; add to swap cache */
}
if (!add_to_swap(page, GFP_ATOMIC))
goto activate_locked;
}
#endif /* CONFIG_SWAP */
mapping = page_mapping(page);
......@@ -641,6 +648,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto activate_locked;
case SWAP_AGAIN:
goto keep_locked;
case SWAP_MLOCK:
goto cull_mlocked;
case SWAP_SUCCESS:
; /* try to free the page below */
}
......@@ -731,6 +740,11 @@ free_it:
}
continue;
cull_mlocked:
unlock_page(page);
putback_lru_page(page);
continue;
activate_locked:
/* Not a candidate for swapping, so reclaim swap space. */
if (PageSwapCache(page) && vm_swap_full())
......@@ -742,7 +756,7 @@ keep_locked:
unlock_page(page);
keep:
list_add(&page->lru, &ret_pages);
VM_BUG_ON(PageLRU(page));
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
......@@ -2329,12 +2343,13 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* @vma: the VMA in which the page is or will be mapped, may be NULL
*
* Test whether page is evictable--i.e., should be placed on active/inactive
* lists vs unevictable list.
* lists vs unevictable list. The vma argument is !NULL when called from the
* fault path to determine how to instantate a new page.
*
* Reasons page might not be evictable:
* (1) page's mapping marked unevictable
* (2) page is part of an mlocked VMA
*
* TODO - later patches
*/
int page_evictable(struct page *page, struct vm_area_struct *vma)
{
......@@ -2342,7 +2357,8 @@ int page_evictable(struct page *page, struct vm_area_struct *vma)
if (mapping_unevictable(page_mapping(page)))
return 0;
/* TODO: test page [!]evictable conditions */
if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
return 0;
return 1;
}
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment