    [PATCH] Fix soft lockup due to NTFS: VFS part and explanation · 88bd5121
    Anton Altaparmakov authored
    Something has changed in the core kernel such that we now get concurrent
    inode write outs, e.g. one via pdflush and one via sys_sync or whatever.
    This causes a nasty deadlock in ntfs.  The only clean solution
    unfortunately requires a minor vfs api extension.
    
    First the deadlock analysis:
    
    Prerequisite knowledge: NTFS has a file $MFT (inode 0) loaded at mount
    time.  The NTFS driver uses the page cache for storing the file contents as
    usual.  More interestingly this file contains the table of on-disk inodes
    as a sequence of MFT_RECORDs.  Thus the NTFS driver accesses the on-disk
    inodes by accessing the MFT_RECORDs in the page cache pages of the loaded
    inode $MFT.
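    
    As a rough illustration of the layout described above, the sketch below
    maps an on-disk inode number to the page cache page of $MFT that holds its
    MFT_RECORD.  The 1024-byte record size, the 4096-byte page size and all
    names (locate_mft_record(), same_mft_page(), TOY_*) are assumptions made
    for this example; they are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes, for illustration only; a real volume defines its own. */
#define TOY_PAGE_SHIFT        12u   /* 4096-byte page cache pages */
#define TOY_MFT_RECORD_SHIFT  10u   /* 1024-byte MFT records */

struct mft_record_pos {
	uint64_t page_index;   /* index of the page within $MFT's page cache */
	uint32_t page_offset;  /* byte offset of the record inside that page */
};

/* Map an on-disk inode number to the location of its record in $MFT. */
static struct mft_record_pos locate_mft_record(uint64_t mft_no)
{
	uint64_t byte_off = mft_no << TOY_MFT_RECORD_SHIFT;
	struct mft_record_pos pos;

	pos.page_index = byte_off >> TOY_PAGE_SHIFT;
	pos.page_offset = (uint32_t)(byte_off & ((1u << TOY_PAGE_SHIFT) - 1u));
	return pos;
}

/* Two on-disk inodes share a page when their records map to the same index. */
static int same_mft_page(uint64_t a, uint64_t b)
{
	return locate_mft_record(a).page_index ==
	       locate_mft_record(b).page_index;
}
```

    With these assumed sizes four on-disk inodes share each page, so locking
    one page of $MFT affects every inode whose record lives in it.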
    
    The situation: VFS inode X on a mounted ntfs volume is dirty.  For the same
    inode X, the ntfs_inode is dirty and thus so is the corresponding on-disk
    inode, which, as explained above, lives in a dirty page cache page
    belonging to the table of inodes ($MFT, inode 0).
    
    What happens:
    
    Process 1: sys_sync()/umount()/whatever...  calls __sync_single_inode() for
    $MFT -> do_writepages() -> writepage for the dirty page containing the
    on-disk inode X, the page is now locked -> ntfs_write_mst_block() which
    clears PageUptodate() on the page to prevent anyone else getting hold of it
    whilst it does the write out (this is necessary as the on-disk inode needs
    "fixups" applied before the write to disk which are removed again after the
    write and PageUptodate is then set again).  It then analyses the page
    looking for dirty on-disk inodes and when it finds one it calls
    ntfs_may_write_mft_record() to see if it is safe to write this on-disk
    inode.  This then calls ilookup5() to check if the corresponding VFS inode
    is in the icache.  This in turn calls ifind() which waits on the inode lock
    via wait_on_inode() whilst holding the global inode_lock.
    
    Process 2: pdflush results in a call to __sync_single_inode for the same
    VFS inode X on the ntfs volume.  This locks the inode (I_LOCK) then calls
    write_inode -> ntfs_write_inode() -> map_mft_record() -> read_cache_page()
    of the page (in page cache of table of inodes $MFT, inode 0) containing the
    on-disk inode.  This page has PageUptodate() clear because of Process 1
    (see above) so read_cache_page() blocks when it tries to take the page lock
    for the page so it can call ntfs_readpage().
    
    Thus Process 1 is holding the page lock on the page containing the on-disk
    inode X and it is waiting on the inode X to be unlocked in ifind() so it
    can write the page out and then unlock the page.
    
    And Process 2 is holding the inode lock on inode X and is waiting for the
    page to be unlocked so it can call ntfs_readpage() or discover that
    Process 1 set PageUptodate() again and use the page.
    
    Thus we have a deadlock due to ifind() waiting on the inode lock.
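    
    The cycle above can be checked with a small wait-for model.  This is only
    a userspace sketch: the toy_* names and the two-task, two-lock setup are
    invented for illustration and do not correspond to kernel data structures.
    The inode_wait argument says whether Process 1 waits on the inode lock.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NONE (-1)

struct toy_resource { const char *name; int owner; };     /* owning task or TOY_NONE */
struct toy_task     { const char *name; int waits_for; }; /* awaited resource or TOY_NONE */

/*
 * Follow task -> awaited resource -> owning task.  Return 1 if the chain
 * leads back to the starting task (a wait-for cycle), 0 otherwise.
 */
static int toy_deadlocked(const struct toy_task *tasks, size_t ntasks,
			  const struct toy_resource *res, int start)
{
	int cur = start;
	size_t steps;

	for (steps = 0; steps < ntasks; steps++) {
		int r = tasks[cur].waits_for;

		if (r == TOY_NONE)
			return 0;       /* task can run, chain ends */
		cur = res[r].owner;
		if (cur == TOY_NONE)
			return 0;       /* resource is free, chain ends */
		if (cur == start)
			return 1;       /* back where we began: deadlock */
	}
	return 0;
}

/* Build the scenario from the text and report whether it deadlocks. */
static int toy_ntfs_scenario(int inode_wait)
{
	enum { PAGE_LOCK = 0, INODE_LOCK = 1 };
	enum { PROCESS_1 = 0, PROCESS_2 = 1 };
	struct toy_resource res[2] = {
		{ "page lock on $MFT page", PROCESS_1 },
		{ "I_LOCK on inode X",      PROCESS_2 },
	};
	struct toy_task tasks[2] = {
		{ "sys_sync", inode_wait ? INODE_LOCK : TOY_NONE },
		{ "pdflush",  PAGE_LOCK },
	};

	return toy_deadlocked(tasks, 2, res, PROCESS_1);
}
```

    In this model toy_ntfs_scenario(1) reports a cycle, while
    toy_ntfs_scenario(0), where Process 1 does not wait on the inode lock,
    does not.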
    
    The only sensible solution: NTFS does not care whether the VFS inode is
    locked or not when it calls ilookup5().  It does not use the VFS inode at
    all; it only uses it to find the corresponding ntfs_inode, which is of
    course attached to the VFS inode (both are one single struct), and the
    ntfs_inode is subject to its own locking, so I_LOCK is irrelevant.  Hence
    we want a modified ilookup5_nowait() which is the same as ilookup5() but
    does not wait on the inode lock.
    
    Without such functionality I would have to keep my own ntfs_inode cache in
    the NTFS driver just so I can find ntfs_inodes independent of their VFS
    inodes, which would be slow, waste memory and CPU cycles, and be incredibly
    stupid given the icache already exists in the VFS.
    
    Below is a patch that does the ilookup5_nowait() implementation in
    fs/inode.c and exports it.
    
    ilookup5_nowait.diff:
    
    Introduce ilookup5_nowait() which is basically the same as ilookup5() but
    it does not wait on the inode's lock (i.e. it omits the wait_on_inode()
    done in ifind()).
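    
    The idea can be sketched in userspace as one shared lookup helper with a
    wait flag and two thin wrappers.  This is not the fs/inode.c code: the
    toy_* names, the flat array standing in for the icache and the
    TOY_WOULD_BLOCK status (returned where a real kernel would sleep) are all
    assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode {
	unsigned long ino;
	int locked;      /* stands in for I_LOCK */
	int refcount;
};

enum toy_status { TOY_FOUND, TOY_NOT_FOUND, TOY_WOULD_BLOCK };

/*
 * Shared helper playing the role the text gives to ifind(): find the
 * inode, take a reference, and only if 'wait' is set care about its lock.
 */
static enum toy_status toy_ifind(struct toy_inode *table, size_t n,
				 unsigned long ino, int wait,
				 struct toy_inode **out)
{
	size_t i;

	*out = NULL;
	for (i = 0; i < n; i++) {
		if (table[i].ino != ino)
			continue;
		table[i].refcount++;            /* reference for the caller */
		*out = &table[i];
		if (wait && table[i].locked)
			return TOY_WOULD_BLOCK; /* a kernel would sleep here */
		return TOY_FOUND;
	}
	return TOY_NOT_FOUND;
}

/* Waiting variant: reports when the caller would have to block. */
static enum toy_status toy_ilookup(struct toy_inode *table, size_t n,
				   unsigned long ino, struct toy_inode **out)
{
	return toy_ifind(table, n, ino, 1, out);
}

/* Non-waiting variant: returns the inode whether or not it is locked. */
static enum toy_status toy_ilookup_nowait(struct toy_inode *table, size_t n,
					  unsigned long ino,
					  struct toy_inode **out)
{
	return toy_ifind(table, n, ino, 0, out);
}
```

    A caller that only needs the attached per-filesystem structure, as NTFS
    does here, would use the non-waiting variant and rely on its own locking.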
    
    This is needed to avoid a nasty deadlock in NTFS.
    Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>