// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * Copyright (C) 2008 Red Hat, Inc., Eric Paris <eparis@redhat.com>
 */

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/srcu.h>
#include <linux/rculist.h>
#include <linux/wait.h>
fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.
The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs. All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.
The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated. However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg. This patch series
contains two such concrete use-cases i.e. fsnotify and buffer_head.
The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener. The events
are allocated in the context of the event producer. However they should
be charged to the event consumer. Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.
To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg. In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged. For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.
This patch (of 2):
A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener. This can cause
system level memory pressure or OOMs. So, it's better to account the
fsnotify kmem caches to the memcg of the listener.
However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer. This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.
There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener. So, SLAB_ACCOUNT is enough for these caches.
The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.
The allocations from the event caches happen in the context of the event
producer. For such caches we will need to remote charge the allocations
to the listener's memcg. Thus we save the memcg reference in the
fsnotify_group structure of the listener.
This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.
[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 01:46:39 +03:00
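The scoped charging described in the commit message above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `struct memcg_model`, `current_memcg`, `active_memcg`, and every `model_*` function are hypothetical names invented for the illustration. Only the idea of a use/unuse scope around accounted allocations comes from the commit message; the real signatures of `memalloc_use_memcg()` and `memalloc_unuse_memcg()` are not shown in this file.

```c
/* User-space model of scoped ("directed") charging; NOT kernel code. */
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup: just a byte counter. */
struct memcg_model {
	size_t charged;
};

/* Hypothetical stand-in for the memcg of the running task (the producer). */
static struct memcg_model *current_memcg;
/* Override installed for the duration of a scope, or NULL outside one. */
static struct memcg_model *active_memcg;

/* Models opening a scope: later accounted allocations charge 'memcg'. */
void model_use_memcg(struct memcg_model *memcg)
{
	assert(active_memcg == NULL);	/* this sketch does not handle nesting */
	active_memcg = memcg;
}

/* Models closing the scope: charging falls back to the current task. */
void model_unuse_memcg(void)
{
	active_memcg = NULL;
}

/* Models an accounted allocation: charge the override if one is set. */
void model_charge(size_t bytes)
{
	struct memcg_model *target = active_memcg ? active_memcg : current_memcg;

	if (target)
		target->charged += bytes;
}

/* Producer allocates an event of 'bytes' on behalf of the listener. */
void model_queue_event(struct memcg_model *listener, size_t bytes)
{
	model_use_memcg(listener);
	model_charge(bytes);
	model_unuse_memcg();
}
```

In this model, a call to `model_queue_event(&listener, 64)` made while `current_memcg` points at the producer adds 64 bytes to the listener's counter, and a later plain `model_charge(16)` is charged to the producer again.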
#include <linux/memcontrol.h>

#include <linux/fsnotify_backend.h>
#include "fsnotify.h"

#include <linux/atomic.h>

/*
 * Final freeing of a group
 */
static void fsnotify_final_destroy_group(struct fsnotify_group *group)
{
	if (group->ops->free_group_priv)
		group->ops->free_group_priv(group);

	mem_cgroup_put(group->memcg);
	mutex_destroy(&group->mark_mutex);

	kfree(group);
}

/*
 * Stop queueing new events for this group. Once this function returns
 * fsnotify_add_event() will not add any new events to the group's queue.
 */
void fsnotify_group_stop_queueing(struct fsnotify_group *group)
{
	spin_lock(&group->notification_lock);
	group->shutdown = true;
	spin_unlock(&group->notification_lock);
}

/*
 * Trying to get rid of a group. Remove all marks, flush all events and release
 * the group reference.
 * Note that another thread calling fsnotify_clear_marks_by_group() may still
 * hold a ref to the group.
 */
void fsnotify_destroy_group(struct fsnotify_group *group)
{
	/*
	 * Stop queueing new events. The code below is careful enough to not
	 * require this but fanotify needs to stop queueing events even before
	 * fsnotify_destroy_group() is called and this makes the other callers
	 * of fsnotify_destroy_group() see the same behavior.
	 */
	fsnotify_group_stop_queueing(group);

	/* Clear all marks for this group and queue them for destruction */
	fsnotify_clear_marks_by_group(group, FSNOTIFY_OBJ_TYPE_ANY);

	/*
	 * Some marks can still be pinned when waiting for response from
	 * userspace. Wait for those now. fsnotify_prepare_user_wait() will
	 * not succeed now so this wait is race-free.
	 */
	wait_event(group->notification_waitq, !atomic_read(&group->user_waits));

	/*
	 * Wait until all marks get really destroyed. We could actually destroy
	 * them ourselves instead of waiting for worker to do it, however that
	 * would be racy as worker can already be processing some marks before
	 * we even entered fsnotify_destroy_group().
	 */
	fsnotify_wait_marks_destroyed();

	/*
	 * Since we have waited for fsnotify_mark_srcu in
	 * fsnotify_mark_destroy_list() there can be no outstanding event
	 * notification against this group. So clearing the notification queue
	 * of all events is reliable now.
	 */
	fsnotify_flush_notify(group);

	/*
	 * Destroy overflow event (we cannot use fsnotify_destroy_event() as
	 * that deliberately ignores overflow events).
	 */
	if (group->overflow_event)
		group->ops->free_event(group, group->overflow_event);

	fsnotify_put_group(group);
}

/*
 * Get reference to a group.
 */
void fsnotify_get_group(struct fsnotify_group *group)
{
	refcount_inc(&group->refcnt);
}

/*
 * Drop a reference to a group. Free it if it's through.
 */
void fsnotify_put_group(struct fsnotify_group *group)
{
	if (refcount_dec_and_test(&group->refcnt))
		fsnotify_final_destroy_group(group);
}
EXPORT_SYMBOL_GPL(fsnotify_put_group);

static struct fsnotify_group *__fsnotify_alloc_group(
				const struct fsnotify_ops *ops,
				int flags, gfp_t gfp)
{
	static struct lock_class_key nofs_marks_lock;
	struct fsnotify_group *group;

	group = kzalloc(sizeof(struct fsnotify_group), gfp);
	if (!group)
		return ERR_PTR(-ENOMEM);

	/* set to 0 when there are no external references to this group */
	refcount_set(&group->refcnt, 1);
	atomic_set(&group->user_waits, 0);

	spin_lock_init(&group->notification_lock);
	INIT_LIST_HEAD(&group->notification_list);
	init_waitqueue_head(&group->notification_waitq);
	group->max_events = UINT_MAX;

	mutex_init(&group->mark_mutex);
	INIT_LIST_HEAD(&group->marks_list);

	group->ops = ops;
	group->flags = flags;
	/*
	 * For most backends, eviction of inode with a mark is not expected,
	 * because marks hold a refcount on the inode against eviction.
	 *
	 * Use a different lockdep class for groups that support evictable
	 * inode marks, because with evictable marks, mark_mutex is NOT
	 * fs-reclaim safe - the mutex is taken when evicting inodes.
	 */
	if (flags & FSNOTIFY_GROUP_NOFS)
		lockdep_set_class(&group->mark_mutex, &nofs_marks_lock);

	return group;
}

/*
 * Create a new fsnotify_group and hold a reference for the group returned.
 */
struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops,
					    int flags)
{
	gfp_t gfp = (flags & FSNOTIFY_GROUP_USER) ? GFP_KERNEL_ACCOUNT :
						    GFP_KERNEL;

	return __fsnotify_alloc_group(ops, flags, gfp);
}
EXPORT_SYMBOL_GPL(fsnotify_alloc_group);

int fsnotify_fasync(int fd, struct file *file, int on)
{
	struct fsnotify_group *group = file->private_data;

	return fasync_helper(fd, file, on, &group->fsn_fa) >= 0 ? 0 : -EIO;
}
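
The reference-counting pattern behind fsnotify_get_group() and fsnotify_put_group() can be modeled in user space as follows. This is a minimal sketch and not the kernel implementation: it assumes C11 `<stdatomic.h>` in place of the kernel's `refcount_t`, and `struct group_model`, the `freed` flag, and all `model_*` functions are hypothetical names used only for illustration. As in the code above, the object starts with one reference and the put that drops the count to zero performs the final free.

```c
/* User-space model of the group get/put lifecycle; NOT kernel code. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct fsnotify_group. */
struct group_model {
	atomic_int refcnt;
	bool *freed;	/* caller-owned flag, set when the group is freed */
};

/* Allocate a group holding one reference for the caller. */
struct group_model *model_alloc_group(bool *freed)
{
	struct group_model *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;
	atomic_init(&g->refcnt, 1);
	g->freed = freed;
	*freed = false;
	return g;
}

/* Take an additional reference. */
void model_get_group(struct group_model *g)
{
	atomic_fetch_add(&g->refcnt, 1);
}

/* Drop a reference; returns true if this call freed the group. */
bool model_put_group(struct group_model *g)
{
	/* atomic_fetch_sub() returns the value held before the decrement. */
	if (atomic_fetch_sub(&g->refcnt, 1) != 1)
		return false;
	*g->freed = true;
	free(g);
	return true;
}
```

With this model, allocating a group, taking one extra reference, and then calling `model_put_group()` twice frees the group only on the second put, which matches the rule that the last holder performs the final destruction.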