Commit 27421e21 authored by Linus Torvalds

Manually revert "mlock: downgrade mmap sem while populating mlocked regions"

This essentially reverts commit 8edb08ca.

It downgraded our mmap semaphore to a read-lock while mlocking pages, in
order to allow other threads (and external accesses like "ps" et al) to
walk the vma lists and take page faults etc.  Which is a nice idea, but
the implementation does not work.
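
For reference, this is the primitive the idea builds on; a minimal,
self-contained sketch (the semaphore and function names here are
illustrative, not taken from the patch):

#include <linux/rwsem.h>

static DECLARE_RWSEM(example_sem);

static void downgrade_example(void)
{
	down_write(&example_sem);	/* exclusive: readers and writers blocked */
	/* ... modify the protected data ... */
	downgrade_write(&example_sem);	/* atomically drop to a read lock: readers may enter */
	/* ... long read-only work, e.g. faulting pages in ... */
	up_read(&example_sem);		/* note: there is no upgrade_read() for the reverse step */
}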

Because we cannot upgrade the lock back to a write lock without
releasing the mmap semaphore, the code had to release the lock entirely
and then re-take it as a write lock.  However, that meant that the caller
possibly lost the vma chain that it was following, since now another
thread could come in and mmap/munmap the range.

The code tried to work around that by just looking up the vma again and
erroring out if that happened, but quite frankly, that was just a buggy
hack that doesn't actually protect against anything (the other thread
could just have replaced the vma with another one instead of totally
unmapping it).
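
Condensed from the code being removed (see the diff below), the sequence
was roughly the following; the comments mark where it falls apart:

#include <linux/mm.h>
#include <linux/rwsem.h>

/* Condensed restatement of the removed code, for illustration only. */
static long mlock_downgrade_sketch(struct mm_struct *mm,
				   struct vm_area_struct *vma,
				   unsigned long start, unsigned long end)
{
	long error;

	downgrade_write(&mm->mmap_sem);
	error = __mlock_vma_pages_range(vma, start, end, 1);
	up_read(&mm->mmap_sem);

	/*
	 * mmap_sem is not held at all here: another thread can munmap
	 * the range, or munmap it and mmap something else in its place.
	 */
	down_write(&mm->mmap_sem);
	vma = find_vma(mm, start);
	/* This catches a vanished vma ... */
	if (!vma || end > vma->vm_end)
		return -ENOMEM;
	/*
	 * ... but not a replaced one: a different vma spanning the same
	 * range passes the check, and "error" is thrown away regardless.
	 */
	return 0;
}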

The only way to downgrade to a read map _reliably_ is to do it at the
end, which is likely the right thing to do: do all the 'vma' operations
with the write-lock held, then downgrade to a read after completing them
all, and then do the "populate the newly mlocked regions" while holding
just the read lock.  And then just drop the read-lock and return to user
space.
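
A hedged sketch of that ordering (the two helpers here are hypothetical
stand-ins for the existing vma-manipulation and page-population code, and
error handling is trimmed):

#include <linux/mm.h>
#include <linux/rwsem.h>

/* Hypothetical helpers, standing in for the real vma and populate paths. */
long mlock_prepare_vmas(struct mm_struct *mm, unsigned long start,
			unsigned long end);
void mlock_populate_range(struct mm_struct *mm, unsigned long start,
			  unsigned long end);

static long mlock_downgrade_at_end_sketch(struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end)
{
	long ret;

	down_write(&mm->mmap_sem);

	/* 1. All vma work (split/merge, set VM_LOCKED) under the write lock. */
	ret = mlock_prepare_vmas(mm, start, end);
	if (ret) {
		up_write(&mm->mmap_sem);
		return ret;
	}

	/* 2. Downgrade only once the vma chain will not be touched again. */
	downgrade_write(&mm->mmap_sem);

	/* 3. Populate the newly mlocked region holding just the read lock. */
	mlock_populate_range(mm, start, end);

	/* 4. Drop the read lock and return to user space. */
	up_read(&mm->mmap_sem);
	return 0;
}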

The (perhaps somewhat simpler) alternative is to just make all the
callers of mlock_vma_pages_range() know that the mmap lock got dropped,
and just re-grab the mmap semaphore if it needs to mlock more than one
vma region.
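
A hedged sketch of what such a caller could look like (the helper name and
its "nonzero means I dropped and released mmap_sem" convention are made up
for illustration; gap and error handling are omitted):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/rwsem.h>

/*
 * Hypothetical helper: mlocks one vma, may release mmap_sem internally
 * while populating, and returns nonzero if it did (in which case the
 * semaphore is no longer held on return).
 */
long mlock_one_vma_may_drop_sem(struct vm_area_struct *vma,
				unsigned long start, unsigned long end);

static long mlock_walk_sketch(struct mm_struct *mm,
			      unsigned long start, unsigned long end)
{
	unsigned long addr = start;

	down_write(&mm->mmap_sem);
	while (addr < end) {
		struct vm_area_struct *vma;
		unsigned long next;

		vma = find_vma(mm, addr);
		if (!vma || vma->vm_start >= end)
			break;
		if (addr < vma->vm_start)
			addr = vma->vm_start;
		next = min(vma->vm_end, end);

		if (mlock_one_vma_may_drop_sem(vma, addr, next)) {
			/*
			 * The helper dropped mmap_sem while populating, so
			 * any cached vma pointers are stale.  Re-grab the
			 * semaphore; find_vma() re-establishes the position
			 * on the next iteration.
			 */
			down_write(&mm->mmap_sem);
		}
		addr = next;
	}
	up_write(&mm->mmap_sem);
	return 0;
}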

So we can do this "downgrade mmap sem while populating mlocked regions"
thing right, but the way it was done here was absolutely not correct.
Thus the revert, in the expectation that we will do it all correctly
some day.

Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent 45c82b5a
mm/mlock.c

@@ -294,14 +294,10 @@ static inline int __mlock_posix_error_return(long retval)
  *
  * return number of pages [> 0] to be removed from locked_vm on success
  * of "special" vmas.
- *
- * return negative error if vma spanning @start-@range disappears while
- * mmap semaphore is dropped.  Unlikely?
  */
 long mlock_vma_pages_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	int nr_pages = (end - start) / PAGE_SIZE;
 	BUG_ON(!(vma->vm_flags & VM_LOCKED));
 
@@ -314,20 +310,8 @@ long mlock_vma_pages_range(struct vm_area_struct *vma,
 	if (!((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
 			is_vm_hugetlb_page(vma) ||
 			vma == get_gate_vma(current))) {
-		long error;
-		downgrade_write(&mm->mmap_sem);
-
-		error = __mlock_vma_pages_range(vma, start, end, 1);
-
-		up_read(&mm->mmap_sem);
-		/* vma can change or disappear */
-		down_write(&mm->mmap_sem);
-		vma = find_vma(mm, start);
-		/* non-NULL vma must contain @start, but need to check @end */
-		if (!vma || end > vma->vm_end)
-			return -ENOMEM;
-
-		return 0;	/* hide other errors from mmap(), et al */
+		return __mlock_vma_pages_range(vma, start, end, 1);
 	}
 
 	/*
@@ -438,41 +422,14 @@ success:
 	vma->vm_flags = newflags;
 
 	if (lock) {
-		/*
-		 * mmap_sem is currently held for write.  Downgrade the write
-		 * lock to a read lock so that other faults, mmap scans, ...
-		 * while we fault in all pages.
-		 */
-		downgrade_write(&mm->mmap_sem);
-
 		ret = __mlock_vma_pages_range(vma, start, end, 1);
-		/*
-		 * Need to reacquire mmap sem in write mode, as our callers
-		 * expect this.  We have no support for atomically upgrading
-		 * a sem to write, so we need to check for ranges while sem
-		 * is unlocked.
-		 */
-		up_read(&mm->mmap_sem);
-		/* vma can change or disappear */
-		down_write(&mm->mmap_sem);
-		*prev = find_vma(mm, start);
-		/* non-NULL *prev must contain @start, but need to check @end */
-		if (!(*prev) || end > (*prev)->vm_end)
-			ret = -ENOMEM;
-		else if (ret > 0) {
+
+		if (ret > 0) {
 			mm->locked_vm -= ret;
 			ret = 0;
 		} else
 			ret = __mlock_posix_error_return(ret);	/* translate if needed */
 	} else {
-		/*
-		 * TODO: for unlocking, pages will already be resident, so
-		 * we don't need to wait for allocations/reclaim/pagein, ...
-		 * However, unlocking a very large region can still take a
-		 * while.  Should we downgrade the semaphore for both lock
-		 * AND unlock ?
-		 */
 		__mlock_vma_pages_range(vma, start, end, 0);
 	}