// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * Directory notifications for Linux.
 *
 * Copyright (C) 2000,2001,2002 Stephen Rothwell
 *
 * Copyright (C) 2009 Eric Paris <Red Hat Inc>
 * dnotify was largely rewritten to use the new fsnotify infrastructure
 */
#include <linux/fs.h>
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/dnotify.h>
#include <linux/init.h>
#include <linux/security.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/fdtable.h>
#include <linux/fsnotify_backend.h>

static int dir_notify_enable __read_mostly = 1;
#ifdef CONFIG_SYSCTL
static struct ctl_table dnotify_sysctls[] = {
	{
		.procname	= "dir-notify-enable",
		.data		= &dir_notify_enable,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{}
};
static void __init dnotify_sysctl_init(void)
{
	register_sysctl_init("fs", dnotify_sysctls);
}
#else
#define dnotify_sysctl_init() do { } while (0)
#endif

static struct kmem_cache *dnotify_struct_cache __read_mostly;
static struct kmem_cache *dnotify_mark_cache __read_mostly;
static struct fsnotify_group *dnotify_group __read_mostly;

/*
 * dnotify will attach one of these to each inode (i_fsnotify_marks) which
 * is being watched by dnotify.  If multiple userspace applications are watching
 * the same directory with dnotify their information is chained in dn
 */
struct dnotify_mark {
	struct fsnotify_mark fsn_mark;
	struct dnotify_struct *dn;
};

/*
 * When a process starts or stops watching an inode the set of events which
 * dnotify cares about for that inode may change.  This function runs the
 * list of everything receiving dnotify events about this directory and calculates
 * the set of all those events.  After it updates what dnotify is interested in
 * it calls the fsnotify function so it can update the set of all events relevant
 * to this inode.
 */
static void dnotify_recalc_inode_mask(struct fsnotify_mark *fsn_mark)
{
	__u32 new_mask = 0;
	struct dnotify_struct *dn;
	struct dnotify_mark *dn_mark = container_of(fsn_mark,
						    struct dnotify_mark,
						    fsn_mark);

	assert_spin_locked(&fsn_mark->lock);

	for (dn = dn_mark->dn; dn != NULL; dn = dn->dn_next)
		new_mask |= (dn->dn_mask & ~FS_DN_MULTISHOT);
	if (fsn_mark->mask == new_mask)
		return;
	fsn_mark->mask = new_mask;

	fsnotify_recalc_mask(fsn_mark->connector);
}

/*
 * Main fsnotify call where events are delivered to dnotify.
 * Find the dnotify mark on the relevant inode, run the list of dnotify structs
 * on that mark and determine which of them has expressed interest in receiving
 * events of this type.  When found, send the correct process a signal and
 * destroy the dnotify struct if it was not registered to receive multiple
 * events.
 */
static int dnotify_handle_event(struct fsnotify_mark *inode_mark, u32 mask,
				struct inode *inode, struct inode *dir,
				const struct qstr *name, u32 cookie)
{
	struct dnotify_mark *dn_mark;
	struct dnotify_struct *dn;
	struct dnotify_struct **prev;
	struct fown_struct *fown;
	__u32 test_mask = mask & ~FS_EVENT_ON_CHILD;

	/* not a dir, dnotify doesn't care */
	if (!dir && !(mask & FS_ISDIR))
		return 0;

	dn_mark = container_of(inode_mark, struct dnotify_mark, fsn_mark);

	spin_lock(&inode_mark->lock);
	prev = &dn_mark->dn;
	while ((dn = *prev) != NULL) {
		if ((dn->dn_mask & test_mask) == 0) {
			prev = &dn->dn_next;
			continue;
		}
		fown = &dn->dn_filp->f_owner;
		send_sigio(fown, dn->dn_fd, POLL_MSG);
		if (dn->dn_mask & FS_DN_MULTISHOT)
			prev = &dn->dn_next;
		else {
			*prev = dn->dn_next;
			kmem_cache_free(dnotify_struct_cache, dn);
			dnotify_recalc_inode_mask(inode_mark);
		}
	}

	spin_unlock(&inode_mark->lock);

	return 0;
}

static void dnotify_free_mark(struct fsnotify_mark *fsn_mark)
{
	struct dnotify_mark *dn_mark = container_of(fsn_mark,
						    struct dnotify_mark,
						    fsn_mark);

	BUG_ON(dn_mark->dn);

	kmem_cache_free(dnotify_mark_cache, dn_mark);
}

static const struct fsnotify_ops dnotify_fsnotify_ops = {
	.handle_inode_event = dnotify_handle_event,
	.free_mark = dnotify_free_mark,
};

/*
 * Called every time a file is closed.  Looks first for a dnotify mark on the
 * inode.  If one is found run all of the ->dn structures attached to that
 * mark for one relevant to this process closing the file and remove that
 * dnotify_struct.  If that was the last dnotify_struct also remove the
 * fsnotify_mark.
 */
void dnotify_flush(struct file *filp, fl_owner_t id)
{
	struct fsnotify_mark *fsn_mark;
	struct dnotify_mark *dn_mark;
	struct dnotify_struct *dn;
	struct dnotify_struct **prev;
	struct inode *inode;
	bool free = false;

	inode = file_inode(filp);
	if (!S_ISDIR(inode->i_mode))
		return;

	fsn_mark = fsnotify_find_mark(&inode->i_fsnotify_marks, dnotify_group);
	if (!fsn_mark)
		return;
	dn_mark = container_of(fsn_mark, struct dnotify_mark, fsn_mark);

	fsnotify_group_lock(dnotify_group);

	spin_lock(&fsn_mark->lock);
	prev = &dn_mark->dn;
	while ((dn = *prev) != NULL) {
		if ((dn->dn_owner == id) && (dn->dn_filp == filp)) {
			*prev = dn->dn_next;
			kmem_cache_free(dnotify_struct_cache, dn);
			dnotify_recalc_inode_mask(fsn_mark);
			break;
		}
		prev = &dn->dn_next;
	}

	spin_unlock(&fsn_mark->lock);

	/* nothing else could have found us thanks to the dnotify_group's
	   mark_mutex */
	if (dn_mark->dn == NULL) {
		fsnotify_detach_mark(fsn_mark);
		free = true;
	}

	fsnotify_group_unlock(dnotify_group);

	if (free)
		fsnotify_free_mark(fsn_mark);
	fsnotify_put_mark(fsn_mark);
}

/* this conversion is done only at watch creation */
static __u32 convert_arg(unsigned long arg)
{
	__u32 new_mask = FS_EVENT_ON_CHILD;

	if (arg & DN_MULTISHOT)
		new_mask |= FS_DN_MULTISHOT;
	if (arg & DN_DELETE)
		new_mask |= (FS_DELETE | FS_MOVED_FROM);
	if (arg & DN_MODIFY)
		new_mask |= FS_MODIFY;
	if (arg & DN_ACCESS)
		new_mask |= FS_ACCESS;
	if (arg & DN_ATTRIB)
		new_mask |= FS_ATTRIB;
	if (arg & DN_RENAME)
		new_mask |= FS_RENAME;
	if (arg & DN_CREATE)
		new_mask |= (FS_CREATE | FS_MOVED_TO);

	return new_mask;
}

/*
 * If multiple processes watch the same inode with dnotify there is only one
 * dnotify mark in inode->i_fsnotify_marks but we chain a dnotify_struct
 * onto that mark.  This function either attaches the new dnotify_struct onto
 * that list, or it |= the mask onto an existing dnotify_struct.
 */
static int attach_dn(struct dnotify_struct *dn, struct dnotify_mark *dn_mark,
		     fl_owner_t id, int fd, struct file *filp, __u32 mask)
{
	struct dnotify_struct *odn;

	odn = dn_mark->dn;
	while (odn != NULL) {
		/* adding more events to existing dnotify_struct? */
		if ((odn->dn_owner == id) && (odn->dn_filp == filp)) {
			odn->dn_fd = fd;
			odn->dn_mask |= mask;
			return -EEXIST;
		}
		odn = odn->dn_next;
	}

	dn->dn_mask = mask;
	dn->dn_fd = fd;
	dn->dn_filp = filp;
	dn->dn_owner = id;
	dn->dn_next = dn_mark->dn;
	dn_mark->dn = dn;

	return 0;
}

/*
 * When a process calls fcntl to attach a dnotify watch to a directory it ends
 * up here.  Allocate both a mark for fsnotify to add and a dnotify_struct to be
 * attached to the fsnotify_mark.
 */
int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
{
	struct dnotify_mark *new_dn_mark, *dn_mark;
	struct fsnotify_mark *new_fsn_mark, *fsn_mark;
	struct dnotify_struct *dn;
	struct inode *inode;
	fl_owner_t id = current->files;
	struct file *f;
	int destroy = 0, error = 0;
	__u32 mask;

	/* we use these to tell if we need to kfree */
	new_fsn_mark = NULL;
	dn = NULL;

	if (!dir_notify_enable) {
		error = -EINVAL;
		goto out_err;
	}

	/* a 0 mask means we are explicitly removing the watch */
	if ((arg & ~DN_MULTISHOT) == 0) {
		dnotify_flush(filp, id);
		error = 0;
		goto out_err;
	}

	/* dnotify only works on directories */
	inode = file_inode(filp);
	if (!S_ISDIR(inode->i_mode)) {
		error = -ENOTDIR;
		goto out_err;
	}

	/*
	 * convert the userspace DN_* "arg" to the internal FS_*
	 * defined in fsnotify
	 */
	mask = convert_arg(arg);

	error = security_path_notify(&filp->f_path, mask,
			FSNOTIFY_OBJ_TYPE_INODE);
	if (error)
		goto out_err;

	/* expect most fcntl to add new rather than augment old */
	dn = kmem_cache_alloc(dnotify_struct_cache, GFP_KERNEL);
	if (!dn) {
		error = -ENOMEM;
		goto out_err;
	}

	/* new fsnotify mark, we expect most fcntl calls to add a new mark */
	new_dn_mark = kmem_cache_alloc(dnotify_mark_cache, GFP_KERNEL);
	if (!new_dn_mark) {
		error = -ENOMEM;
		goto out_err;
	}

	/* set up the new_fsn_mark and new_dn_mark */
	new_fsn_mark = &new_dn_mark->fsn_mark;
	fsnotify_init_mark(new_fsn_mark, dnotify_group);
	new_fsn_mark->mask = mask;
	new_dn_mark->dn = NULL;

	/* this is needed to prevent the fcntl/close race described below */
	fsnotify_group_lock(dnotify_group);

	/* add the new_fsn_mark or find an old one. */
	fsn_mark = fsnotify_find_mark(&inode->i_fsnotify_marks, dnotify_group);
	if (fsn_mark) {
		dn_mark = container_of(fsn_mark, struct dnotify_mark, fsn_mark);
		spin_lock(&fsn_mark->lock);
	} else {
		error = fsnotify_add_inode_mark_locked(new_fsn_mark, inode, 0);
		if (error) {
			fsnotify_group_unlock(dnotify_group);
			goto out_err;
		}
		spin_lock(&new_fsn_mark->lock);
		fsn_mark = new_fsn_mark;
		dn_mark = new_dn_mark;
		/* we used new_fsn_mark, so don't free it */
		new_fsn_mark = NULL;
	}

	rcu_read_lock();
	f = lookup_fd_rcu(fd);
	rcu_read_unlock();

	/* if (f != filp) means that we lost a race and another task/thread
	 * actually closed the fd we are still playing with before we grabbed
	 * the dnotify_group's mark_mutex and fsn_mark->lock.  Since closing the
	 * fd is the only time we clean up the marks we need to get our mark
	 * off the list. */
	if (f != filp) {
		/* if we added ourselves, shoot ourselves, it's possible that
		 * the flush actually did shoot this fsn_mark.  That's fine too
		 * since multiple calls to destroy_mark are perfectly safe, if
		 * we found a dn_mark already attached to the inode, just sod
		 * off silently as the flush at close time dealt with it.
		 */
		if (dn_mark == new_dn_mark)
			destroy = 1;
		error = 0;
		goto out;
	}

	__f_setown(filp, task_pid(current), PIDTYPE_TGID, 0);

	error = attach_dn(dn, dn_mark, id, fd, filp, mask);
	/* !error means that we attached the dn to the dn_mark, so don't free it */
	if (!error)
		dn = NULL;
	/* -EEXIST means that we didn't add this new dn and used an old one.
	 * that isn't an error (and the unused dn should be freed) */
	else if (error == -EEXIST)
		error = 0;

	dnotify_recalc_inode_mask(fsn_mark);
out:
	spin_unlock(&fsn_mark->lock);

	if (destroy)
		fsnotify_detach_mark(fsn_mark);
	fsnotify_group_unlock(dnotify_group);
	if (destroy)
		fsnotify_free_mark(fsn_mark);
	fsnotify_put_mark(fsn_mark);
out_err:
	if (new_fsn_mark)
		fsnotify_put_mark(new_fsn_mark);
	if (dn)
		kmem_cache_free(dnotify_struct_cache, dn);
	return error;
}

static int __init dnotify_init(void)
{
	dnotify_struct_cache = KMEM_CACHE(dnotify_struct,
					  SLAB_PANIC|SLAB_ACCOUNT);
	dnotify_mark_cache = KMEM_CACHE(dnotify_mark, SLAB_PANIC|SLAB_ACCOUNT);

	dnotify_group = fsnotify_alloc_group(&dnotify_fsnotify_ops,
					     FSNOTIFY_GROUP_NOFS);
	if (IS_ERR(dnotify_group))
		panic("unable to allocate fsnotify group for dnotify\n");
	dnotify_sysctl_init();
	return 0;
}

module_init(dnotify_init)