linux/fs
Jason Baron b6a515c8a0 epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
In the current implementation of the EPOLLEXCLUSIVE flag (added for
4.5-rc1), if epoll waiters create different POLL* sets and register them
as exclusive against the same target fd, the current implementation will
stop waking any further waiters once it finds the first idle waiter.
This means that waiters could miss wakeups in certain cases.

For example, when we wake up a pipe for reading we do:
wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if
one epoll set or epfd is added to pipe p with POLLIN and a second set
epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the
wakeup since the current implementation will stop after it finds any
intersection of events with a waiter that is blocked in epoll_wait().

We could potentially address this by requiring all epoll waiters that
are added to p be required to pass the same set of POLL* events.  IE the
first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL*
flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
However, I think it might be somewhat confusing interface as we would
have to reference count the number of users for that set, and so
userspace would have to keep track of that count, or we would need a
more involved interface.  It also adds some shared state that we'd have
store somewhere.  I don't think anybody will want to bloat
__wait_queue_head for this.

I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE
such that it can only be specified with EPOLLIN and/or EPOLLOUT.  So
that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
once we hit the first idle waiter that specifies the EPOLLIN bit, since
any remaining waiters that only have 'POLLOUT' set wouldn't need to be
woken.  Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
bit set and not 'POLLIN'.  If both 'POLLOUT' and 'POLLIN' are set in the
wake bit set (there is at least one example of this I saw in fs/pipe.c),
then we just wake the entire exclusive list.  Having both 'POLLOUT' and
'POLLIN' both set should not be on any performance critical path, so I
think that's ok (in fs/pipe.c its in pipe_release()).  We also continue
to include EPOLLERR and EPOLLHUP by default in any exclusive set.  Thus,
the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
so.

Since epoll waiters may be interested in other events as well besides
EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
doing a 'dup' call on the target fd and adding that as one normally
would with EPOLL_CTL_ADD.  Since I think that the POLLIN and POLLOUT
events are what we are interest in balancing, I think that the 'dup'
thing could perhaps be added to only one of the waiter threads.
However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
sufficient for the majority of use-cases.

Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
among multiple epfds, where between 1 and n of the epfds may receive an
event, it does not satisfy the semantics of EPOLLONESHOT where only 1
epfd would get an event.  Thus, it is not allowed to be specified in
conjunction with EPOLLEXCLUSIVE.

EPOLL_CTL_MOD is also not allowed if the fd was previously added as
EPOLLEXCLUSIVE.  It seems with the limited number of flags to not be as
interesting, but this could be relaxed at some further point.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Tested-by: Madars Vitolins <m@silodev.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05 18:10:40 -08:00
..
9p wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
adfs fs/adfs/adfs.h: tidy up comments 2016-01-20 17:09:18 -08:00
affs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
afs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
autofs4 switch ->get_link() to delayed_call, kill ->put_link() 2015-12-30 13:01:03 -05:00
befs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
bfs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
btrfs Merge branch 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2016-01-29 15:46:49 -08:00
cachefiles wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
ceph Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2016-01-24 12:34:13 -08:00
cifs Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 2016-01-24 12:31:12 -08:00
coda Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-01-23 12:24:56 -08:00
configfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
cramfs don't put symlink bodies in pagecache into highmem 2015-12-08 22:41:36 -05:00
debugfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
devpts wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
dlm [regression] fix braino in fs/dlm/user.c 2016-01-21 17:45:15 -05:00
ecryptfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
efivarfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
efs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
exofs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
exportfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
ext2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-01-23 12:24:56 -08:00
ext4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-01-23 12:24:56 -08:00
f2fs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
fat wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
freevxfs don't put symlink bodies in pagecache into highmem 2015-12-08 22:41:36 -05:00
fscache FS-Cache: Handle a write to the page immediately beyond the EOF marker 2015-11-11 02:11:02 -05:00
fuse wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
gfs2 wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
hfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
hfsplus wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
hostfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
hpfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
hugetlbfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
isofs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
jbd2 fs: use block_device name vsprintf helper 2016-01-06 13:03:18 -05:00
jffs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-01-23 12:24:56 -08:00
jfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
kernfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
lockd lockd: constify nlmsvc_binding structure 2016-01-07 10:10:50 -05:00
logfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
minix kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
ncpfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
nfs NFS: Cleanup - rename NFS_LAYOUT_RETURN_BEFORE_CLOSE 2016-01-27 20:40:05 -05:00
nfs_common
nfsd wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
nilfs2 wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
nls
notify fsnotify: destroy marks with call_srcu instead of dedicated thread 2016-01-14 16:00:49 -08:00
ntfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
ocfs2 ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup 2016-02-05 18:10:40 -08:00
omfs
openpromfs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
overlayfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
proc proc: revert /proc/<pid>/maps [stack:TID] annotation 2016-02-03 08:28:43 -08:00
pstore wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
qnx4 kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
qnx6 kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
quota wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
ramfs don't put symlink bodies in pagecache into highmem 2015-12-08 22:41:36 -05:00
reiserfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
romfs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
squashfs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
sysfs platform/chrome: Branch for v4.4 2015-11-13 21:53:18 -08:00
sysv kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
tracefs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
ubifs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
udf Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-01-23 12:24:56 -08:00
ufs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
xfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-01-23 12:24:56 -08:00
aio.c
anon_inodes.c
attr.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
bad_inode.c fs/bad_inode.c: is_bad_inode can be boolean 2015-12-06 21:17:14 -05:00
binfmt_aout.c
binfmt_elf_fdpic.c libnvdimm for 4.4: 2015-11-10 12:07:22 -08:00
binfmt_elf.c ELF: Also pass any interpreter's file header to `arch_check_elf' 2016-01-20 00:39:20 +01:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
binfmt_script.c
block_dev.c block: fix pfn_mkwrite() DAX fault handler 2016-02-05 18:10:40 -08:00
buffer.c fs: use block_device name vsprintf helper 2016-01-06 13:03:18 -05:00
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c Bluetooth: Add missing COMPATIBLE_IOCTL for UART line discipline 2016-01-27 10:48:26 -05:00
compat.c saner calling conventions for copy_mount_options() 2016-01-04 10:28:32 -05:00
coredump.c fs/coredump: prevent "" / "." / ".." core path components 2016-01-20 17:09:18 -08:00
dax.c dax: dirty inode only if required 2016-02-05 18:10:40 -08:00
dcache.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
dcookies.c
direct-io.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
drop_caches.c
eventfd.c Documentation: filesystem: Fix typo in fs/eventfd.c 2015-12-08 14:52:03 +01:00
eventpoll.c epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT 2016-02-05 18:10:40 -08:00
exec.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
fcntl.c fcntl: allow to set O_DIRECT flag on pipe 2016-01-09 02:55:37 -05:00
fhandle.c
file_table.c
file.c kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
filesystems.c find_filesystem(): simplify comparison 2016-01-19 12:02:23 -05:00
fs_pin.c
fs_struct.c
fs-writeback.c cgroup, memcg, writeback: drop spurious rcu locking around mem_cgroup_css_from_page() 2016-01-15 17:56:32 -08:00
inode.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-01-23 12:24:56 -08:00
internal.h Merge branch 'for-linus' into work.misc 2016-01-08 21:20:11 -05:00
ioctl.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
Kconfig dax: re-enable dax pmd mappings 2016-01-15 17:56:32 -08:00
Kconfig.binfmt
libfs.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
locks.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
Makefile ext4: promote ext4 over ext2 in the default probe order 2015-10-15 10:33:21 -04:00
mbcache.c
mount.h
mpage.c mm, fs: introduce mapping_gfp_constraint() 2015-11-06 17:50:42 -08:00
namei.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
namespace.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
no-block.c
nsfs.c
open.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
pipe.c pipe: limit the per-user amount of pages allocated in pipes 2016-01-19 19:25:21 -05:00
pnode.c
pnode.h
posix_acl.c xattr handlers: Simplify list operation 2015-12-13 19:46:12 -05:00
proc_namespace.c vfs: show_vfsstat: remove redundant initialization and check of error code 2015-12-06 21:17:16 -05:00
read_write.c vfs: abort dedupe loop if fatal signals are pending 2016-01-22 20:29:55 -05:00
readdir.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
select.c poll: plug an unused argument to do_poll 2016-01-06 08:26:52 -05:00
seq_file.c fs, seqfile: always allow oom killer 2015-11-06 17:50:42 -08:00
signalfd.c
splice.c fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE 2016-01-09 02:55:35 -05:00
stack.c
stat.c fs/stat.c: drop the last new_valid_dev check 2016-01-16 11:17:23 -08:00
statfs.c
super.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2016-01-14 17:04:19 -08:00
sync.c fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE writeback 2015-11-06 17:50:42 -08:00
timerfd.c timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper 2016-01-17 11:13:55 +01:00
userfaultfd.c
utimes.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
xattr.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00