kernel-ark/fs
Greg Thelen 2e898e4c0a writeback: safer lock nesting
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org>	[v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:35 -07:00
..
9p fscache development 2018-04-07 09:08:24 -07:00
adfs
affs
afs afs: Fix server record deletion 2018-04-20 09:59:33 -07:00
autofs4 autofs4: use wait_event_killable 2018-04-11 10:28:36 -07:00
befs
bfs
btrfs for-4.17-part2-tag 2018-04-15 18:08:35 -07:00
cachefiles fscache: Pass object size in rather than calling back for it 2018-04-06 14:05:14 +01:00
ceph ceph: always update atime/mtime/ctime for new inode 2018-04-16 09:38:40 +02:00
cifs cifs: change validate_buf to validate_iov 2018-04-12 20:32:55 -05:00
coda vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
configfs
cramfs cramfs: better MTD dependency expression 2018-02-08 11:37:31 -08:00
crypto
debugfs debugfs_lookup(): switch to lookup_one_len_unlocked() 2018-03-29 15:07:47 -04:00
devpts devpts: comment devpts_mntget() 2018-03-14 13:31:23 +01:00
dlm net: make getname() functions return length rather than use int* parameter 2018-02-12 14:15:04 -05:00
ecryptfs eCryptfs: don't pass up plaintext names when using filename encryption 2018-04-16 18:51:22 +00:00
efivarfs efivarfs: Limit the rate for non-root to read files 2018-02-22 10:21:02 -08:00
efs
exofs iversion.h related cleanup for v4.16 2018-02-07 14:25:22 -08:00
exportfs ovl: do not try to reconnect a disconnected origin dentry 2018-04-12 12:04:49 +02:00
ext2 \n 2018-04-20 09:01:26 -07:00
ext4 libnvdimm for 4.17 2018-04-10 10:25:57 -07:00
f2fs page cache: use xa_lock 2018-04-11 10:28:39 -07:00
fat
freevxfs
fscache fscache: use appropriate radix tree accessors 2018-04-11 10:28:39 -07:00
fuse fuse: define the filesystem as untrusted 2018-03-23 06:31:37 -04:00
gfs2 GFS2: Minor improvements to comments and documentation 2018-04-12 10:07:51 -07:00
hfs
hfsplus hfsplus: honor setgid flag on directories 2018-02-06 18:32:45 -08:00
hostfs hostfs: rename do_rmdir() to hostfs_do_rmdir() 2018-04-02 20:15:53 +02:00
hpfs
hugetlbfs hugetlbfs: fix bug in pgoff overflow checking 2018-04-05 21:36:21 -07:00
isofs isofs: fix potential memory leak in mount option parsing 2018-04-16 09:47:41 +02:00
jbd2 jbd2: if the journal is aborted then don't allow update of the log tail 2018-02-19 12:22:53 -05:00
jffs2 jffs2_kill_sb(): deal with failed allocations 2018-04-15 23:49:05 -04:00
jfs
kernfs vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
lockd net: Drop pernet_operations::async 2018-03-27 13:18:09 -04:00
minix treewide: simplify Kconfig dependencies for removed archs 2018-03-26 15:55:57 +02:00
nfs NFS client updates for Linux 4.17 2018-04-12 12:55:50 -07:00
nfs_common net: Drop pernet_operations::async 2018-03-27 13:18:09 -04:00
nfsd nfsd: fix incorrect umasks 2018-04-03 16:27:08 -04:00
nilfs2 page cache: use xa_lock 2018-04-11 10:28:39 -07:00
nls
notify fsnotify: fix ignore mask logic in send_to_group() 2018-04-13 15:52:49 +02:00
ntfs ntfs: fix bogus __mark_inode_dirty(I_DIRTY_SYNC | I_DIRTY_DATASYNC) call 2018-03-28 01:39:02 -04:00
ocfs2 Merge branch 'akpm' (patches from Andrew) 2018-04-06 14:19:26 -07:00
omfs
openpromfs
orangefs orangefs_kill_sb(): deal with allocation failures 2018-04-15 23:49:12 -04:00
overlayfs ovl: add support for "xino" mount and config options 2018-04-12 12:04:50 +02:00
proc mm, pagemap: fix swap offset value for PMD migration entry 2018-04-20 17:18:35 -07:00
pstore pstore: fix crypto dependencies without compression 2018-04-06 15:45:33 -07:00
qnx4
qnx6
quota fs: quota: Replace GFP_ATOMIC with GFP_KERNEL in dquot_init 2018-04-09 17:48:54 +02:00
ramfs
reiserfs fs/reiserfs/journal.c: add missing resierfs_warning() arg 2018-04-11 10:28:36 -07:00
romfs
squashfs
sysfs sysfs: symlink: export sysfs_create_link_nowarn() 2018-03-19 21:14:26 -04:00
sysv
tracefs
ubifs This pull request contains updates for both UBI and UBIFS: 2018-04-11 16:39:34 -07:00
udf udf: Fix leak of UTF-16 surrogates into encoded strings 2018-04-18 16:34:55 +02:00
ufs iversion.h related cleanup for v4.16 2018-02-07 14:25:22 -08:00
xfs Changes since last update: 2018-04-12 13:28:22 -07:00
aio.c fs/aio: Use rcu_work instead of explicit rcu and work item 2018-03-19 10:12:03 -07:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c exec: introduce finalize_exec() before start_thread() 2018-04-11 10:28:37 -07:00
binfmt_elf_fdpic.c exec: introduce finalize_exec() before start_thread() 2018-04-11 10:28:37 -07:00
binfmt_elf.c elf: enforce MAP_FIXED on overlaying elf segments 2018-04-11 10:28:38 -07:00
binfmt_em86.c
binfmt_flat.c exec: introduce finalize_exec() before start_thread() 2018-04-11 10:28:37 -07:00
binfmt_misc.c fs: add ksys_close() wrapper; remove in-kernel calls to sys_close() 2018-04-02 20:16:00 +02:00
binfmt_script.c
block_dev.c libnvdimm for 4.17 2018-04-10 10:25:57 -07:00
buffer.c Merge branch 'work.thaw' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-04-12 12:28:32 -07:00
char_dev.c block, char_dev: Use correct format specifier for unsigned ints 2018-03-15 17:59:24 +01:00
compat_binfmt_elf.c
compat_ioctl.c
compat.c
coredump.c
d_path.c split d_path() and friends into a separate file 2018-03-29 15:07:46 -04:00
dax.c page cache: use xa_lock 2018-04-11 10:28:39 -07:00
dcache.c fs/dcache.c: add cond_resched() in shrink_dentry_list() 2018-04-11 10:28:38 -07:00
dcookies.c fs: add do_lookup_dcookie() helper; remove in-kernel call to syscall 2018-04-02 20:15:39 +02:00
direct-io.c Merge branch 'akpm' (patches from Andrew) 2018-04-06 14:19:26 -07:00
drop_caches.c
eventfd.c fs: add do_eventfd() helper; remove internal call to sys_eventfd() 2018-04-02 20:15:39 +02:00
eventpoll.c fs: add do_epoll_*() helpers; remove internal calls to sys_epoll_*() 2018-04-02 20:15:37 +02:00
exec.c exec: pin stack limit during exec 2018-04-11 10:28:37 -07:00
fcntl.c fs: add do_compat_fcntl64() helper; remove in-kernel call to compat syscall 2018-04-02 20:15:42 +02:00
fhandle.c
file_table.c
file.c fs: add ksys_close() wrapper; remove in-kernel calls to sys_close() 2018-04-02 20:16:00 +02:00
filesystems.c
fs_pin.c
fs_struct.c
fs-writeback.c writeback: safer lock nesting 2018-04-20 17:18:35 -07:00
inode.c page cache: use xa_lock 2018-04-11 10:28:39 -07:00
internal.h Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-04-06 11:07:08 -07:00
ioctl.c fs: add ksys_ioctl() helper; remove in-kernel calls to sys_ioctl() 2018-04-02 20:16:03 +02:00
iomap.c
Kconfig libnvdimm for 4.16 2018-02-06 10:41:33 -08:00
Kconfig.binfmt treewide: simplify Kconfig dependencies for removed archs 2018-03-26 15:55:57 +02:00
libfs.c fs, dax: prepare for dax-specific address_space_operations 2018-03-30 11:34:55 -07:00
locks.c treewide: Align function definition open/close braces 2018-03-26 11:13:09 +02:00
Makefile split d_path() and friends into a separate file 2018-03-29 15:07:46 -04:00
mbcache.c
mount.h
mpage.c
namei.c Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-04-09 12:48:05 -07:00
namespace.c vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion 2018-04-20 09:59:33 -07:00
no-block.c
nsfs.c net: Export open_related_ns() 2018-02-15 15:34:42 -05:00
open.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-04-06 11:07:08 -07:00
pipe.c fs: add do_pipe2() helper; remove internal call to sys_pipe2() 2018-04-02 20:15:35 +02:00
pnode.c
pnode.h
posix_acl.c
proc_namespace.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
read_write.c fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls 2018-04-02 20:16:09 +02:00
readdir.c fs: add ksys_getdents64() helper; remove in-kernel calls to sys_getdents64() 2018-04-02 20:16:02 +02:00
select.c fs: add do_compat_select() helper; remove in-kernel call to compat syscall 2018-04-02 20:15:42 +02:00
seq_file.c seq_file: account everything to kmemcg 2018-04-11 10:28:36 -07:00
signalfd.c fs: add do_compat_signalfd4() helper; remove in-kernel call to compat syscall 2018-04-02 20:15:43 +02:00
splice.c fs: add do_vmsplice() helper; remove in-kernel call to syscall 2018-04-02 20:15:40 +02:00
stack.c
stat.c fs: add do_readlinkat() helper; remove internal call to sys_readlinkat() 2018-04-02 20:15:34 +02:00
statfs.c
super.c mm,vmscan: Allow preallocating memory for register_shrinker(). 2018-04-16 02:06:47 -04:00
sync.c Changes for this release: 2018-04-04 12:44:02 -07:00
timerfd.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
userfaultfd.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
utimes.c fs: add do_compat_futimesat() helper; remove in-kernel call to compat syscall 2018-04-02 20:15:44 +02:00
xattr.c