License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 17:07:57 +03:00
// SPDX-License-Identifier: GPL-2.0
2007-05-07 01:49:36 +04:00
/*
* SLUB : A slab allocator that limits cache line use instead of queuing
* objects in per cpu and per node lists .
*
2011-06-01 21:25:53 +04:00
* The allocator synchronizes using per slab locks or atomic operatios
* and only uses a centralized lock to manage a pool of partial slabs .
2007-05-07 01:49:36 +04:00
*
2008-07-04 20:59:22 +04:00
* ( C ) 2007 SGI , Christoph Lameter
2011-06-01 21:25:53 +04:00
* ( C ) 2011 Linux Foundation , Christoph Lameter
2007-05-07 01:49:36 +04:00
*/
# include <linux/mm.h>
2009-05-05 13:13:44 +04:00
# include <linux/swap.h> /* struct reclaim_state */
2007-05-07 01:49:36 +04:00
# include <linux/module.h>
# include <linux/bit_spinlock.h>
# include <linux/interrupt.h>
# include <linux/bitops.h>
# include <linux/slab.h>
2012-07-07 00:25:11 +04:00
# include "slab.h"
2008-10-06 02:42:17 +04:00
# include <linux/proc_fs.h>
2007-05-07 01:49:36 +04:00
# include <linux/seq_file.h>
2015-02-14 01:39:38 +03:00
# include <linux/kasan.h>
2007-05-07 01:49:36 +04:00
# include <linux/cpu.h>
# include <linux/cpuset.h>
# include <linux/mempolicy.h>
# include <linux/ctype.h>
2008-04-30 11:55:01 +04:00
# include <linux/debugobjects.h>
2007-05-07 01:49:36 +04:00
# include <linux/kallsyms.h>
2007-10-22 03:41:37 +04:00
# include <linux/memory.h>
2008-05-01 15:34:31 +04:00
# include <linux/math64.h>
2008-12-23 13:37:01 +03:00
# include <linux/fault-inject.h>
2011-07-07 23:47:01 +04:00
# include <linux/stacktrace.h>
2012-01-31 01:53:51 +04:00
# include <linux/prefetch.h>
2012-12-19 02:22:34 +04:00
# include <linux/memcontrol.h>
2017-09-07 02:19:18 +03:00
# include <linux/random.h>
2007-05-07 01:49:36 +04:00
2010-10-21 13:29:19 +04:00
# include <trace/events/kmem.h>
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allows network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propogate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel run to completion. However, the patched
kernel completed the test 45% faster.
MICRO
3.5.0-rc2 3.5.0-rc2
vanilla swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds) 197.80 173.07
User+Sys Time Running Test (seconds) 206.96 182.03
Total Elapsed Time (seconds) 3240.70 1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:43:58 +04:00
# include "internal.h"
2007-05-07 01:49:36 +04:00
/*
* Lock order :
2012-07-07 00:25:12 +04:00
* 1. slab_mutex ( Global Mutex )
2011-06-01 21:25:53 +04:00
* 2. node - > list_lock
* 3. slab_lock ( page ) ( Only on some arches and for debugging )
2007-05-07 01:49:36 +04:00
*
2012-07-07 00:25:12 +04:00
* slab_mutex
2011-06-01 21:25:53 +04:00
*
2012-07-07 00:25:12 +04:00
* The role of the slab_mutex is to protect the list of all the slabs
2011-06-01 21:25:53 +04:00
* and to synchronize major metadata changes to slab cache structures .
*
* The slab_lock is only used for debugging and on arches that do not
2018-06-08 03:08:46 +03:00
* have the ability to do a cmpxchg_double . It only protects :
2011-06-01 21:25:53 +04:00
* A . page - > freelist - > List of object free in a page
2018-06-08 03:08:46 +03:00
* B . page - > inuse - > Number of objects in use
* C . page - > objects - > Number of objects in page
* D . page - > frozen - > frozen state
2011-06-01 21:25:53 +04:00
*
* If a slab is frozen then it is exempt from list management . It is not
2019-05-14 03:16:28 +03:00
* on any list except per cpu partial list . The processor that froze the
* slab is the one who can perform list operations on the page . Other
* processors may put objects onto the freelist but the processor that
* froze the slab is the only one that can retrieve the objects from the
* page ' s freelist .
2007-05-07 01:49:36 +04:00
*
* The list_lock protects the partial and full list on each node and
* the partial slab counter . If taken then no new slabs may be added or
* removed from the lists nor make the number of partial slabs be modified .
* ( Note that the total number of slabs is an atomic value that may be
* modified without taking the list lock ) .
*
* The list_lock is a centralized lock and thus we avoid taking it as
* much as possible . As long as SLUB does not have to handle partial
* slabs , operations can continue without any centralized lock . F . e .
* allocating a long series of objects that fill up slabs does not require
* the list lock .
* Interrupts are disabled during allocation and deallocation in order to
* make the slab allocator safe to use in the context of an irq . In addition
* interrupts are disabled to ensure that the processor does not change
* while handling per_cpu slabs , due to kernel preemption .
*
* SLUB assigns one slab for allocation to each processor .
* Allocations only occur from these slabs called cpu slabs .
*
2007-05-09 13:32:39 +04:00
* Slabs with free elements are kept on a partial list and during regular
* operations no list for full slabs is used . If an object in a full slab is
2007-05-07 01:49:36 +04:00
* freed then the slab will show up again on the partial lists .
2007-05-09 13:32:39 +04:00
* We track full slabs for debugging purposes though because otherwise we
* cannot scan all objects .
2007-05-07 01:49:36 +04:00
*
* Slabs are freed when they become empty . Teardown and setup is
* minimal so we rely on the page allocators per cpu caches for
* fast frees and allocs .
*
2019-12-01 04:49:34 +03:00
* page - > frozen The slab is frozen and exempt from list processing .
2007-05-17 09:10:53 +04:00
* This means that the slab is dedicated to a purpose
* such as satisfying allocations for a specific
* processor . Objects may be freed in the slab while
* it is frozen but slab_free will then skip the usual
* list operations . It is up to the processor holding
* the slab to integrate the slab into the slab lists
* when the slab is no longer needed .
*
* One use of this flag is to mark slabs that are
* used for allocations . Then such a slab becomes a cpu
* slab . The cpu slab may be equipped with an additional
2007-10-16 12:26:05 +04:00
* freelist that allows lockless access to
2007-05-10 14:15:16 +04:00
* free objects in addition to the regular freelist
* that requires the slab lock .
2007-05-07 01:49:36 +04:00
*
2019-12-01 04:49:34 +03:00
* SLAB_DEBUG_FLAGS Slab requires special handling due to debug
2007-05-07 01:49:36 +04:00
* options set . This moves slab handling out of
2007-05-10 14:15:16 +04:00
* the fast path and disables lockless freelists .
2007-05-07 01:49:36 +04:00
*/
2020-08-07 09:18:51 +03:00
# ifdef CONFIG_SLUB_DEBUG
# ifdef CONFIG_SLUB_DEBUG_ON
DEFINE_STATIC_KEY_TRUE ( slub_debug_enabled ) ;
# else
DEFINE_STATIC_KEY_FALSE ( slub_debug_enabled ) ;
# endif
# endif
2020-08-07 09:18:55 +03:00
static inline bool kmem_cache_debug ( struct kmem_cache * s )
{
return kmem_cache_debug_flags ( s , SLAB_DEBUG_FLAGS ) ;
2010-07-09 23:07:14 +04:00
}
2007-05-17 09:10:56 +04:00
2016-08-05 01:31:55 +03:00
void * fixup_red_left ( struct kmem_cache * s , void * p )
2016-03-16 00:55:12 +03:00
{
2020-08-07 09:18:55 +03:00
if ( kmem_cache_debug_flags ( s , SLAB_RED_ZONE ) )
2016-03-16 00:55:12 +03:00
p + = s - > red_left_pad ;
return p ;
}
2013-06-19 09:05:52 +04:00
static inline bool kmem_cache_has_cpu_partial ( struct kmem_cache * s )
{
# ifdef CONFIG_SLUB_CPU_PARTIAL
return ! kmem_cache_debug ( s ) ;
# else
return false ;
# endif
}
2007-05-07 01:49:36 +04:00
/*
* Issues still to be resolved :
*
* - Support PAGE_ALLOC_DEBUG . Should be easy to do .
*
* - Variable sizing of the per node arrays
*/
/* Enable to test recovery from slab corruption on boot */
# undef SLUB_RESILIENCY_TEST
2011-06-01 21:25:49 +04:00
/* Enable to log cmpxchg failures */
# undef SLUB_DEBUG_CMPXCHG
2007-05-07 01:49:46 +04:00
/*
* Mininum number of partial slabs . These will be left on the partial
* lists even if they are empty . kmem_cache_shrink may reclaim them .
*/
2007-12-22 01:37:37 +03:00
# define MIN_PARTIAL 5
2007-05-07 01:49:44 +04:00
2007-05-07 01:49:46 +04:00
/*
* Maximum number of desirable partial slabs .
* The existence of more partial slabs makes kmem_cache_shrink
2013-11-08 16:47:37 +04:00
* sort the partial list by the number of objects in use .
2007-05-07 01:49:46 +04:00
*/
# define MAX_PARTIAL 10
2016-03-16 00:55:06 +03:00
# define DEBUG_DEFAULT_FLAGS (SLAB_CONSISTENCY_CHECKS | SLAB_RED_ZONE | \
2007-05-07 01:49:36 +04:00
SLAB_POISON | SLAB_STORE_USER )
2007-05-09 13:32:39 +04:00
2016-03-16 00:55:09 +03:00
/*
* These debug flags cannot use CMPXCHG because there might be consistency
* issues when checking or reading debug information
*/
# define SLAB_NO_CMPXCHG (SLAB_CONSISTENCY_CHECKS | SLAB_STORE_USER | \
SLAB_TRACE )
2009-07-07 11:14:14 +04:00
/*
2009-07-28 05:30:35 +04:00
* Debugging flags that require metadata to be stored in the slab . These get
* disabled when slub_debug = O is used and a cache ' s min order increases with
* metadata .
2009-07-07 11:14:14 +04:00
*/
2009-07-28 05:30:35 +04:00
# define DEBUG_METADATA_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
2009-07-07 11:14:14 +04:00
2008-10-22 23:00:38 +04:00
# define OO_SHIFT 16
# define OO_MASK ((1 << OO_SHIFT) - 1)
2011-06-01 21:25:45 +04:00
# define MAX_OBJS_PER_PAGE 32767 /* since page.objects is u15 */
2008-10-22 23:00:38 +04:00
2007-05-07 01:49:36 +04:00
/* Internal SLUB flags */
2017-11-16 04:32:18 +03:00
/* Poison object */
2017-11-16 04:32:21 +03:00
# define __OBJECT_POISON ((slab_flags_t __force)0x80000000U)
2017-11-16 04:32:18 +03:00
/* Use cmpxchg_double */
2017-11-16 04:32:21 +03:00
# define __CMPXCHG_DOUBLE ((slab_flags_t __force)0x40000000U)
2007-05-07 01:49:36 +04:00
2007-05-09 13:32:43 +04:00
/*
* Tracking user of a slab .
*/
2011-07-07 22:36:36 +04:00
# define TRACK_ADDRS_COUNT 16
2007-05-09 13:32:43 +04:00
struct track {
2008-08-19 21:43:25 +04:00
unsigned long addr ; /* Called from address */
2011-07-07 22:36:36 +04:00
# ifdef CONFIG_STACKTRACE
unsigned long addrs [ TRACK_ADDRS_COUNT ] ; /* Called from address */
# endif
2007-05-09 13:32:43 +04:00
int cpu ; /* Was running on cpu */
int pid ; /* Pid context */
unsigned long when ; /* When did the operation occur */
} ;
enum track_item { TRACK_ALLOC , TRACK_FREE } ;
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SYSFS
2007-05-07 01:49:36 +04:00
static int sysfs_slab_add ( struct kmem_cache * ) ;
static int sysfs_slab_alias ( struct kmem_cache * , const char * ) ;
# else
2007-07-17 15:03:24 +04:00
static inline int sysfs_slab_add ( struct kmem_cache * s ) { return 0 ; }
static inline int sysfs_slab_alias ( struct kmem_cache * s , const char * p )
{ return 0 ; }
2007-05-07 01:49:36 +04:00
# endif
2011-03-22 21:35:00 +03:00
static inline void stat ( const struct kmem_cache * s , enum stat_item si )
2008-02-08 04:47:41 +03:00
{
# ifdef CONFIG_SLUB_STATS
2014-04-08 02:39:42 +04:00
/*
* The rmw is racy on a preemptible kernel but this is acceptable , so
* avoid this_cpu_add ( ) ' s irq - disable overhead .
*/
raw_cpu_inc ( s - > cpu_slab - > stat [ si ] ) ;
2008-02-08 04:47:41 +03:00
# endif
}
2007-05-07 01:49:36 +04:00
/********************************************************************
* Core slab cache functions
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
2017-09-07 02:19:18 +03:00
/*
* Returns freelist pointer ( ptr ) . With hardening , this is obfuscated
* with an XOR of the address where the pointer is held and a per - cache
* random number .
*/
static inline void * freelist_ptr ( const struct kmem_cache * s , void * ptr ,
unsigned long ptr_addr )
{
# ifdef CONFIG_SLAB_FREELIST_HARDENED
2019-02-21 09:19:32 +03:00
/*
* When CONFIG_KASAN_SW_TAGS is enabled , ptr_addr might be tagged .
* Normally , this doesn ' t cause any issues , as both set_freepointer ( )
* and get_freepointer ( ) are called with a pointer with the same tag .
* However , there are some issues with CONFIG_SLUB_DEBUG code . For
* example , when __free_slub ( ) iterates over objects in a cache , it
* passes untagged pointers to check_object ( ) . check_object ( ) in turns
* calls get_freepointer ( ) with an untagged pointer , which causes the
* freepointer to be restored incorrectly .
*/
return ( void * ) ( ( unsigned long ) ptr ^ s - > random ^
2020-04-02 07:04:23 +03:00
swab ( ( unsigned long ) kasan_reset_tag ( ( void * ) ptr_addr ) ) ) ;
2017-09-07 02:19:18 +03:00
# else
return ptr ;
# endif
}
/* Returns the freelist pointer recorded at location ptr_addr. */
static inline void * freelist_dereference ( const struct kmem_cache * s ,
void * ptr_addr )
{
return freelist_ptr ( s , ( void * ) * ( unsigned long * ) ( ptr_addr ) ,
( unsigned long ) ptr_addr ) ;
}
2007-05-09 13:32:40 +04:00
static inline void * get_freepointer ( struct kmem_cache * s , void * object )
{
2017-09-07 02:19:18 +03:00
return freelist_dereference ( s , object + s - > offset ) ;
2007-05-09 13:32:40 +04:00
}
slub: prefetch next freelist pointer in slab_alloc()
Recycling a page is a problem, since freelist link chain is hot on
cpu(s) which freed objects, and possibly very cold on cpu currently
owning slab.
Adding a prefetch of cache line containing the pointer to next object in
slab_alloc() helps a lot in many workloads, in particular on assymetric
ones (allocations done on one cpu, frees on another cpus). Added cost is
three machine instructions only.
Examples on my dual socket quad core ht machine (Intel CPU E5540
@2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
Before patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% )
28 866 491 context-switches # 0,088 M/sec ( +- 1,80% )
1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% )
127 151 page-faults # 0,000 M/sec ( +- 0,16% )
829 399 813 448 cycles # 2,532 GHz ( +- 0,64% )
580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% )
197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% )
503 548 648 975 instructions # 0,61 insns per cycle
# 1,15 stalled cycles per insn ( +- 0,46% )
95 780 068 471 branches # 292,389 M/sec ( +- 0,48% )
1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% )
20,705679994 seconds time elapsed ( +- 0,64% )
After patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% )
19 703 372 context-switches # 0,069 M/sec ( +- 4,99% )
1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% )
126 776 page-faults # 0,000 M/sec ( +- 0,12% )
724 636 593 213 cycles # 2,532 GHz ( +- 1,32% )
499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% )
156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% )
463 897 792 661 instructions # 0,64 insns per cycle
# 1,08 stalled cycles per insn ( +- 0,94% )
87 717 352 563 branches # 306,451 M/sec ( +- 0,99% )
941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% )
18,132070670 seconds time elapsed ( +- 1,30% )
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-12-16 19:25:34 +04:00
static void prefetch_freepointer ( const struct kmem_cache * s , void * object )
{
2018-08-18 01:44:44 +03:00
prefetch ( object + s - > offset ) ;
slub: prefetch next freelist pointer in slab_alloc()
Recycling a page is a problem, since freelist link chain is hot on
cpu(s) which freed objects, and possibly very cold on cpu currently
owning slab.
Adding a prefetch of cache line containing the pointer to next object in
slab_alloc() helps a lot in many workloads, in particular on assymetric
ones (allocations done on one cpu, frees on another cpus). Added cost is
three machine instructions only.
Examples on my dual socket quad core ht machine (Intel CPU E5540
@2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
Before patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% )
28 866 491 context-switches # 0,088 M/sec ( +- 1,80% )
1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% )
127 151 page-faults # 0,000 M/sec ( +- 0,16% )
829 399 813 448 cycles # 2,532 GHz ( +- 0,64% )
580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% )
197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% )
503 548 648 975 instructions # 0,61 insns per cycle
# 1,15 stalled cycles per insn ( +- 0,46% )
95 780 068 471 branches # 292,389 M/sec ( +- 0,48% )
1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% )
20,705679994 seconds time elapsed ( +- 0,64% )
After patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% )
19 703 372 context-switches # 0,069 M/sec ( +- 4,99% )
1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% )
126 776 page-faults # 0,000 M/sec ( +- 0,12% )
724 636 593 213 cycles # 2,532 GHz ( +- 1,32% )
499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% )
156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% )
463 897 792 661 instructions # 0,64 insns per cycle
# 1,08 stalled cycles per insn ( +- 0,94% )
87 717 352 563 branches # 306,451 M/sec ( +- 0,99% )
941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% )
18,132070670 seconds time elapsed ( +- 1,30% )
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-12-16 19:25:34 +04:00
}
2011-05-17 00:26:08 +04:00
static inline void * get_freepointer_safe ( struct kmem_cache * s , void * object )
{
2017-09-07 02:19:18 +03:00
unsigned long freepointer_addr ;
2011-05-17 00:26:08 +04:00
void * p ;
mm, debug_pagealloc: don't rely on static keys too early
Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
debugging") has introduced a static key to reduce overhead when
debug_pagealloc is compiled in but not enabled. It relied on the
assumption that jump_label_init() is called before parse_early_param()
as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
it is safe to enable the static key.
However, it turns out multiple architectures call parse_early_param()
earlier from their setup_arch(). x86 also calls jump_label_init() even
earlier, so no issue was found while testing the commit, but same is not
true for e.g. ppc64 and s390 where the kernel would not boot with
debug_pagealloc=on as found by our QA.
To fix this without tricky changes to init code of multiple
architectures, this patch partially reverts the static key conversion
from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
code) of debug_pagealloc_enabled() will again test a simple bool
variable. Fastpath mm code is converted to a new
debug_pagealloc_enabled_static() variant that relies on the static key,
which is enabled in a well-defined point in mm_init() where it's
guaranteed that jump_label_init() has been called, regardless of
architecture.
[sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-14 03:29:20 +03:00
if ( ! debug_pagealloc_enabled_static ( ) )
2016-03-18 00:17:53 +03:00
return get_freepointer ( s , object ) ;
2017-09-07 02:19:18 +03:00
freepointer_addr = ( unsigned long ) object + s - > offset ;
2020-06-17 10:37:53 +03:00
copy_from_kernel_nofault ( & p , ( void * * ) freepointer_addr , sizeof ( p ) ) ;
2017-09-07 02:19:18 +03:00
return freelist_ptr ( s , p , freepointer_addr ) ;
2011-05-17 00:26:08 +04:00
}
2007-05-09 13:32:40 +04:00
static inline void set_freepointer ( struct kmem_cache * s , void * object , void * fp )
{
2017-09-07 02:19:18 +03:00
unsigned long freeptr_addr = ( unsigned long ) object + s - > offset ;
2017-09-07 02:19:22 +03:00
# ifdef CONFIG_SLAB_FREELIST_HARDENED
BUG_ON ( object = = fp ) ; /* naive detection of double free or corruption */
# endif
2017-09-07 02:19:18 +03:00
* ( void * * ) freeptr_addr = freelist_ptr ( s , fp , freeptr_addr ) ;
2007-05-09 13:32:40 +04:00
}
/* Loop over all objects in a slab */
2008-04-14 20:11:31 +04:00
# define for_each_object(__p, __s, __addr, __objects) \
2016-03-16 00:55:12 +03:00
for ( __p = fixup_red_left ( __s , __addr ) ; \
__p < ( __addr ) + ( __objects ) * ( __s ) - > size ; \
__p + = ( __s ) - > size )
2007-05-09 13:32:40 +04:00
2018-06-08 03:09:10 +03:00
static inline unsigned int order_objects ( unsigned int order , unsigned int size )
2011-03-10 10:21:48 +03:00
{
2018-06-08 03:09:10 +03:00
return ( ( unsigned int ) PAGE_SIZE < < order ) / size ;
2011-03-10 10:21:48 +03:00
}
2018-04-06 02:21:39 +03:00
static inline struct kmem_cache_order_objects oo_make ( unsigned int order ,
2018-06-08 03:09:10 +03:00
unsigned int size )
2008-04-14 20:11:31 +04:00
{
struct kmem_cache_order_objects x = {
2018-06-08 03:09:10 +03:00
( order < < OO_SHIFT ) + order_objects ( order , size )
2008-04-14 20:11:31 +04:00
} ;
return x ;
}
2018-04-06 02:21:39 +03:00
static inline unsigned int oo_order ( struct kmem_cache_order_objects x )
2008-04-14 20:11:31 +04:00
{
2008-10-22 23:00:38 +04:00
return x . x > > OO_SHIFT ;
2008-04-14 20:11:31 +04:00
}
2018-04-06 02:21:39 +03:00
static inline unsigned int oo_objects ( struct kmem_cache_order_objects x )
2008-04-14 20:11:31 +04:00
{
2008-10-22 23:00:38 +04:00
return x . x & OO_MASK ;
2008-04-14 20:11:31 +04:00
}
2011-06-01 21:25:53 +04:00
/*
* Per slab locking using the pagelock
*/
static __always_inline void slab_lock ( struct page * page )
{
2016-01-16 03:51:24 +03:00
VM_BUG_ON_PAGE ( PageTail ( page ) , page ) ;
2011-06-01 21:25:53 +04:00
bit_spin_lock ( PG_locked , & page - > flags ) ;
}
static __always_inline void slab_unlock ( struct page * page )
{
2016-01-16 03:51:24 +03:00
VM_BUG_ON_PAGE ( PageTail ( page ) , page ) ;
2011-06-01 21:25:53 +04:00
__bit_spin_unlock ( PG_locked , & page - > flags ) ;
}
2011-07-14 21:49:12 +04:00
/* Interrupts must be disabled (for the fallback code to work right) */
static inline bool __cmpxchg_double_slab ( struct kmem_cache * s , struct page * page ,
void * freelist_old , unsigned long counters_old ,
void * freelist_new , unsigned long counters_new ,
const char * n )
{
VM_BUG_ON ( ! irqs_disabled ( ) ) ;
2012-01-13 05:17:33 +04:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined ( CONFIG_HAVE_ALIGNED_STRUCT_PAGE )
2011-07-14 21:49:12 +04:00
if ( s - > flags & __CMPXCHG_DOUBLE ) {
2012-01-02 21:02:18 +04:00
if ( cmpxchg_double ( & page - > freelist , & page - > counters ,
2014-08-07 03:04:48 +04:00
freelist_old , counters_old ,
freelist_new , counters_new ) )
2015-04-15 01:44:31 +03:00
return true ;
2011-07-14 21:49:12 +04:00
} else
# endif
{
slab_lock ( page ) ;
2013-07-15 05:05:29 +04:00
if ( page - > freelist = = freelist_old & &
page - > counters = = counters_old ) {
2011-07-14 21:49:12 +04:00
page - > freelist = freelist_new ;
2018-06-08 03:08:31 +03:00
page - > counters = counters_new ;
2011-07-14 21:49:12 +04:00
slab_unlock ( page ) ;
2015-04-15 01:44:31 +03:00
return true ;
2011-07-14 21:49:12 +04:00
}
slab_unlock ( page ) ;
}
cpu_relax ( ) ;
stat ( s , CMPXCHG_DOUBLE_FAIL ) ;
# ifdef SLUB_DEBUG_CMPXCHG
2014-06-05 03:06:34 +04:00
pr_info ( " %s %s: cmpxchg double redo " , n , s - > name ) ;
2011-07-14 21:49:12 +04:00
# endif
2015-04-15 01:44:31 +03:00
return false ;
2011-07-14 21:49:12 +04:00
}
2011-06-01 21:25:49 +04:00
static inline bool cmpxchg_double_slab ( struct kmem_cache * s , struct page * page ,
void * freelist_old , unsigned long counters_old ,
void * freelist_new , unsigned long counters_new ,
const char * n )
{
2012-01-13 05:17:33 +04:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined ( CONFIG_HAVE_ALIGNED_STRUCT_PAGE )
2011-06-01 21:25:49 +04:00
if ( s - > flags & __CMPXCHG_DOUBLE ) {
2012-01-02 21:02:18 +04:00
if ( cmpxchg_double ( & page - > freelist , & page - > counters ,
2014-08-07 03:04:48 +04:00
freelist_old , counters_old ,
freelist_new , counters_new ) )
2015-04-15 01:44:31 +03:00
return true ;
2011-06-01 21:25:49 +04:00
} else
# endif
{
2011-07-14 21:49:12 +04:00
unsigned long flags ;
local_irq_save ( flags ) ;
2011-06-01 21:25:53 +04:00
slab_lock ( page ) ;
2013-07-15 05:05:29 +04:00
if ( page - > freelist = = freelist_old & &
page - > counters = = counters_old ) {
2011-06-01 21:25:49 +04:00
page - > freelist = freelist_new ;
2018-06-08 03:08:31 +03:00
page - > counters = counters_new ;
2011-06-01 21:25:53 +04:00
slab_unlock ( page ) ;
2011-07-14 21:49:12 +04:00
local_irq_restore ( flags ) ;
2015-04-15 01:44:31 +03:00
return true ;
2011-06-01 21:25:49 +04:00
}
2011-06-01 21:25:53 +04:00
slab_unlock ( page ) ;
2011-07-14 21:49:12 +04:00
local_irq_restore ( flags ) ;
2011-06-01 21:25:49 +04:00
}
cpu_relax ( ) ;
stat ( s , CMPXCHG_DOUBLE_FAIL ) ;
# ifdef SLUB_DEBUG_CMPXCHG
2014-06-05 03:06:34 +04:00
pr_info ( " %s %s: cmpxchg double redo " , n , s - > name ) ;
2011-06-01 21:25:49 +04:00
# endif
2015-04-15 01:44:31 +03:00
return false ;
2011-06-01 21:25:49 +04:00
}
2007-05-09 13:32:44 +04:00
# ifdef CONFIG_SLUB_DEBUG
2020-01-31 09:11:57 +03:00
static unsigned long object_map [ BITS_TO_LONGS ( MAX_OBJS_PER_PAGE ) ] ;
static DEFINE_SPINLOCK ( object_map_lock ) ;
2011-04-15 23:48:13 +04:00
/*
* Determine a map of object in use on a page .
*
2011-06-01 21:25:53 +04:00
* Node listlock must be held to guarantee that the page does
2011-04-15 23:48:13 +04:00
* not vanish from under us .
*/
2020-01-31 09:11:57 +03:00
static unsigned long * get_map ( struct kmem_cache * s , struct page * page )
2020-04-07 06:08:15 +03:00
__acquires ( & object_map_lock )
2011-04-15 23:48:13 +04:00
{
void * p ;
void * addr = page_address ( page ) ;
2020-01-31 09:11:57 +03:00
VM_BUG_ON ( ! irqs_disabled ( ) ) ;
spin_lock ( & object_map_lock ) ;
bitmap_zero ( object_map , page - > objects ) ;
2011-04-15 23:48:13 +04:00
for ( p = page - > freelist ; p ; p = get_freepointer ( s , p ) )
2020-08-07 09:20:42 +03:00
set_bit ( __obj_to_index ( s , addr , p ) , object_map ) ;
2020-01-31 09:11:57 +03:00
return object_map ;
}
2020-04-07 06:08:18 +03:00
static void put_map ( unsigned long * map ) __releases ( & object_map_lock )
2020-01-31 09:11:57 +03:00
{
VM_BUG_ON ( map ! = object_map ) ;
spin_unlock ( & object_map_lock ) ;
2011-04-15 23:48:13 +04:00
}
2018-04-06 02:21:43 +03:00
static inline unsigned int size_from_object ( struct kmem_cache * s )
2016-03-16 00:55:12 +03:00
{
if ( s - > flags & SLAB_RED_ZONE )
return s - > size - s - > red_left_pad ;
return s - > size ;
}
static inline void * restore_red_left ( struct kmem_cache * s , void * p )
{
if ( s - > flags & SLAB_RED_ZONE )
p - = s - > red_left_pad ;
return p ;
}
2007-05-09 13:32:44 +04:00
/*
* Debug settings :
*/
2015-11-06 05:51:23 +03:00
# if defined(CONFIG_SLUB_DEBUG_ON)
2017-11-16 04:32:18 +03:00
static slab_flags_t slub_debug = DEBUG_DEFAULT_FLAGS ;
2007-07-16 10:38:14 +04:00
# else
2017-11-16 04:32:18 +03:00
static slab_flags_t slub_debug ;
2007-07-16 10:38:14 +04:00
# endif
2007-05-09 13:32:44 +04:00
2020-08-07 09:18:35 +03:00
static char * slub_debug_string ;
2009-07-07 11:14:14 +04:00
static int disable_higher_order_debug ;
2007-05-09 13:32:44 +04:00
2015-02-14 01:39:38 +03:00
/*
* slub is about to manipulate internal object metadata . This memory lies
* outside the range of the allocated object , so accessing it would normally
* be reported by kasan as a bounds error . metadata_access_enable ( ) is used
* to tell kasan that these accesses are OK .
*/
static inline void metadata_access_enable ( void )
{
kasan_disable_current ( ) ;
}
static inline void metadata_access_disable ( void )
{
kasan_enable_current ( ) ;
}
2007-05-07 01:49:36 +04:00
/*
* Object debugging
*/
2016-03-16 00:55:12 +03:00
/* Verify that a pointer has an address that is valid within a slab page */
static inline int check_valid_pointer ( struct kmem_cache * s ,
struct page * page , void * object )
{
void * base ;
if ( ! object )
return 1 ;
base = page_address ( page ) ;
2019-02-21 09:19:36 +03:00
object = kasan_reset_tag ( object ) ;
2016-03-16 00:55:12 +03:00
object = restore_red_left ( s , object ) ;
if ( object < base | | object > = base + page - > objects * s - > size | |
( object - base ) % s - > size ) {
return 0 ;
}
return 1 ;
}
2017-01-25 02:18:02 +03:00
static void print_section ( char * level , char * text , u8 * addr ,
unsigned int length )
2007-05-07 01:49:36 +04:00
{
2015-02-14 01:39:38 +03:00
metadata_access_enable ( ) ;
2017-01-25 02:18:02 +03:00
print_hex_dump ( level , text , DUMP_PREFIX_ADDRESS , 16 , 1 , addr ,
2011-07-29 16:10:20 +04:00
length , 1 ) ;
2015-02-14 01:39:38 +03:00
metadata_access_disable ( ) ;
2007-05-07 01:49:36 +04:00
}
2020-05-08 04:36:06 +03:00
/*
* See comment in calculate_sizes ( ) .
*/
static inline bool freeptr_outside_object ( struct kmem_cache * s )
{
return s - > offset > = s - > inuse ;
}
/*
* Return offset of the end of info block which is inuse + free pointer if
* not overlapping with object .
*/
static inline unsigned int get_info_end ( struct kmem_cache * s )
{
if ( freeptr_outside_object ( s ) )
return s - > inuse + sizeof ( void * ) ;
else
return s - > inuse ;
}
2007-05-07 01:49:36 +04:00
static struct track * get_track ( struct kmem_cache * s , void * object ,
enum track_item alloc )
{
struct track * p ;
2020-05-08 04:36:06 +03:00
p = object + get_info_end ( s ) ;
2007-05-07 01:49:36 +04:00
return p + alloc ;
}
static void set_track ( struct kmem_cache * s , void * object ,
2008-08-19 21:43:25 +04:00
enum track_item alloc , unsigned long addr )
2007-05-07 01:49:36 +04:00
{
2009-03-06 18:36:21 +03:00
struct track * p = get_track ( s , object , alloc ) ;
2007-05-07 01:49:36 +04:00
if ( addr ) {
2011-07-07 22:36:36 +04:00
# ifdef CONFIG_STACKTRACE
2019-04-25 12:45:00 +03:00
unsigned int nr_entries ;
2011-07-07 22:36:36 +04:00
2015-02-14 01:39:38 +03:00
metadata_access_enable ( ) ;
2019-04-25 12:45:00 +03:00
nr_entries = stack_trace_save ( p - > addrs , TRACK_ADDRS_COUNT , 3 ) ;
2015-02-14 01:39:38 +03:00
metadata_access_disable ( ) ;
2011-07-07 22:36:36 +04:00
2019-04-25 12:45:00 +03:00
if ( nr_entries < TRACK_ADDRS_COUNT )
p - > addrs [ nr_entries ] = 0 ;
2011-07-07 22:36:36 +04:00
# endif
2007-05-07 01:49:36 +04:00
p - > addr = addr ;
p - > cpu = smp_processor_id ( ) ;
2008-06-23 02:58:37 +04:00
p - > pid = current - > pid ;
2007-05-07 01:49:36 +04:00
p - > when = jiffies ;
2019-04-10 13:28:05 +03:00
} else {
2007-05-07 01:49:36 +04:00
memset ( p , 0 , sizeof ( struct track ) ) ;
2019-04-10 13:28:05 +03:00
}
2007-05-07 01:49:36 +04:00
}
static void init_tracking ( struct kmem_cache * s , void * object )
{
2007-07-17 15:03:18 +04:00
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2008-08-19 21:43:25 +04:00
set_track ( s , object , TRACK_FREE , 0UL ) ;
set_track ( s , object , TRACK_ALLOC , 0UL ) ;
2007-05-07 01:49:36 +04:00
}
2018-04-06 02:20:15 +03:00
static void print_track ( const char * s , struct track * t , unsigned long pr_time )
2007-05-07 01:49:36 +04:00
{
if ( ! t - > addr )
return ;
2014-06-05 03:06:34 +04:00
pr_err ( " INFO: %s in %pS age=%lu cpu=%u pid=%d \n " ,
2018-04-06 02:20:15 +03:00
s , ( void * ) t - > addr , pr_time - t - > when , t - > cpu , t - > pid ) ;
2011-07-07 22:36:36 +04:00
# ifdef CONFIG_STACKTRACE
{
int i ;
for ( i = 0 ; i < TRACK_ADDRS_COUNT ; i + + )
if ( t - > addrs [ i ] )
2014-06-05 03:06:34 +04:00
pr_err ( " \t %pS \n " , ( void * ) t - > addrs [ i ] ) ;
2011-07-07 22:36:36 +04:00
else
break ;
}
# endif
2007-07-17 15:03:18 +04:00
}
2020-08-07 09:19:05 +03:00
void print_tracking ( struct kmem_cache * s , void * object )
2007-07-17 15:03:18 +04:00
{
2018-04-06 02:20:15 +03:00
unsigned long pr_time = jiffies ;
2007-07-17 15:03:18 +04:00
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2018-04-06 02:20:15 +03:00
print_track ( " Allocated " , get_track ( s , object , TRACK_ALLOC ) , pr_time ) ;
print_track ( " Freed " , get_track ( s , object , TRACK_FREE ) , pr_time ) ;
2007-07-17 15:03:18 +04:00
}
static void print_page_info ( struct page * page )
{
2014-06-05 03:06:34 +04:00
pr_err ( " INFO: Slab 0x%p objects=%u used=%u fp=0x%p flags=0x%04lx \n " ,
2013-07-15 05:05:29 +04:00
page , page - > objects , page - > inuse , page - > freelist , page - > flags ) ;
2007-07-17 15:03:18 +04:00
}
static void slab_bug ( struct kmem_cache * s , char * fmt , . . . )
{
2014-06-05 03:06:35 +04:00
struct va_format vaf ;
2007-07-17 15:03:18 +04:00
va_list args ;
va_start ( args , fmt ) ;
2014-06-05 03:06:35 +04:00
vaf . fmt = fmt ;
vaf . va = & args ;
2014-06-05 03:06:34 +04:00
pr_err ( " ============================================================================= \n " ) ;
2014-06-05 03:06:35 +04:00
pr_err ( " BUG %s (%s): %pV \n " , s - > name , print_tainted ( ) , & vaf ) ;
2014-06-05 03:06:34 +04:00
pr_err ( " ----------------------------------------------------------------------------- \n \n " ) ;
2012-09-18 23:54:12 +04:00
2013-01-21 10:47:39 +04:00
add_taint ( TAINT_BAD_PAGE , LOCKDEP_NOW_UNRELIABLE ) ;
2014-06-05 03:06:35 +04:00
va_end ( args ) ;
2007-05-07 01:49:36 +04:00
}
2007-07-17 15:03:18 +04:00
static void slab_fix ( struct kmem_cache * s , char * fmt , . . . )
{
2014-06-05 03:06:35 +04:00
struct va_format vaf ;
2007-07-17 15:03:18 +04:00
va_list args ;
va_start ( args , fmt ) ;
2014-06-05 03:06:35 +04:00
vaf . fmt = fmt ;
vaf . va = & args ;
pr_err ( " FIX %s: %pV \n " , s - > name , & vaf ) ;
2007-07-17 15:03:18 +04:00
va_end ( args ) ;
}
2020-06-02 07:45:47 +03:00
static bool freelist_corrupted ( struct kmem_cache * s , struct page * page ,
void * freelist , void * nextfree )
{
if ( ( s - > flags & SLAB_CONSISTENCY_CHECKS ) & &
! check_valid_pointer ( s , page , nextfree ) ) {
object_err ( s , page , freelist , " Freechain corrupt " ) ;
freelist = NULL ;
slab_fix ( s , " Isolate corrupted freechain " ) ;
return true ;
}
return false ;
}
2007-07-17 15:03:18 +04:00
static void print_trailer ( struct kmem_cache * s , struct page * page , u8 * p )
2007-05-07 01:49:36 +04:00
{
unsigned int off ; /* Offset of last byte */
2008-03-02 00:40:44 +03:00
u8 * addr = page_address ( page ) ;
2007-07-17 15:03:18 +04:00
print_tracking ( s , p ) ;
print_page_info ( page ) ;
2014-06-05 03:06:34 +04:00
pr_err ( " INFO: Object 0x%p @offset=%tu fp=0x%p \n \n " ,
p , p - addr , get_freepointer ( s , p ) ) ;
2007-07-17 15:03:18 +04:00
2016-03-16 00:55:12 +03:00
if ( s - > flags & SLAB_RED_ZONE )
2017-01-25 02:18:02 +03:00
print_section ( KERN_ERR , " Redzone " , p - s - > red_left_pad ,
s - > red_left_pad ) ;
2016-03-16 00:55:12 +03:00
else if ( p > addr + 16 )
2017-01-25 02:18:02 +03:00
print_section ( KERN_ERR , " Bytes b4 " , p - 16 , 16 ) ;
2007-05-07 01:49:36 +04:00
2017-01-25 02:18:02 +03:00
print_section ( KERN_ERR , " Object " , p ,
2018-04-06 02:21:17 +03:00
min_t ( unsigned int , s - > object_size , PAGE_SIZE ) ) ;
2007-05-07 01:49:36 +04:00
if ( s - > flags & SLAB_RED_ZONE )
2017-01-25 02:18:02 +03:00
print_section ( KERN_ERR , " Redzone " , p + s - > object_size ,
2012-06-13 19:24:57 +04:00
s - > inuse - s - > object_size ) ;
2007-05-07 01:49:36 +04:00
2020-05-08 04:36:06 +03:00
off = get_info_end ( s ) ;
2007-05-07 01:49:36 +04:00
2007-07-17 15:03:18 +04:00
if ( s - > flags & SLAB_STORE_USER )
2007-05-07 01:49:36 +04:00
off + = 2 * sizeof ( struct track ) ;
2016-07-29 01:49:07 +03:00
off + = kasan_metadata_size ( s ) ;
2016-03-16 00:55:12 +03:00
if ( off ! = size_from_object ( s ) )
2007-05-07 01:49:36 +04:00
/* Beginning of the filler is the free pointer */
2017-01-25 02:18:02 +03:00
print_section ( KERN_ERR , " Padding " , p + off ,
size_from_object ( s ) - off ) ;
2007-07-17 15:03:18 +04:00
dump_stack ( ) ;
2007-05-07 01:49:36 +04:00
}
2015-02-14 01:39:35 +03:00
void object_err ( struct kmem_cache * s , struct page * page ,
2007-05-07 01:49:36 +04:00
u8 * object , char * reason )
{
2008-04-23 23:28:01 +04:00
slab_bug ( s , " %s " , reason ) ;
2007-07-17 15:03:18 +04:00
print_trailer ( s , page , object ) ;
2007-05-07 01:49:36 +04:00
}
2018-06-08 03:05:17 +03:00
static __printf ( 3 , 4 ) void slab_err ( struct kmem_cache * s , struct page * page ,
2013-07-15 05:05:29 +04:00
const char * fmt , . . . )
2007-05-07 01:49:36 +04:00
{
va_list args ;
char buf [ 100 ] ;
2007-07-17 15:03:18 +04:00
va_start ( args , fmt ) ;
vsnprintf ( buf , sizeof ( buf ) , fmt , args ) ;
2007-05-07 01:49:36 +04:00
va_end ( args ) ;
2008-04-23 23:28:01 +04:00
slab_bug ( s , " %s " , buf ) ;
2007-07-17 15:03:18 +04:00
print_page_info ( page ) ;
2007-05-07 01:49:36 +04:00
dump_stack ( ) ;
}
2010-09-29 16:15:01 +04:00
static void init_object ( struct kmem_cache * s , void * object , u8 val )
2007-05-07 01:49:36 +04:00
{
u8 * p = object ;
2016-03-16 00:55:12 +03:00
if ( s - > flags & SLAB_RED_ZONE )
memset ( p - s - > red_left_pad , val , s - > red_left_pad ) ;
2007-05-07 01:49:36 +04:00
if ( s - > flags & __OBJECT_POISON ) {
2012-06-13 19:24:57 +04:00
memset ( p , POISON_FREE , s - > object_size - 1 ) ;
p [ s - > object_size - 1 ] = POISON_END ;
2007-05-07 01:49:36 +04:00
}
if ( s - > flags & SLAB_RED_ZONE )
2012-06-13 19:24:57 +04:00
memset ( p + s - > object_size , val , s - > inuse - s - > object_size ) ;
2007-05-07 01:49:36 +04:00
}
2007-07-17 15:03:18 +04:00
static void restore_bytes ( struct kmem_cache * s , char * message , u8 data ,
void * from , void * to )
{
slab_fix ( s , " Restoring 0x%p-0x%p=0x%x \n " , from , to - 1 , data ) ;
memset ( from , data , to - from ) ;
}
static int check_bytes_and_report ( struct kmem_cache * s , struct page * page ,
u8 * object , char * what ,
2008-01-08 10:20:27 +03:00
u8 * start , unsigned int value , unsigned int bytes )
2007-07-17 15:03:18 +04:00
{
u8 * fault ;
u8 * end ;
2019-12-01 04:49:31 +03:00
u8 * addr = page_address ( page ) ;
2007-07-17 15:03:18 +04:00
2015-02-14 01:39:38 +03:00
metadata_access_enable ( ) ;
2011-11-01 04:08:07 +04:00
fault = memchr_inv ( start , value , bytes ) ;
2015-02-14 01:39:38 +03:00
metadata_access_disable ( ) ;
2007-07-17 15:03:18 +04:00
if ( ! fault )
return 1 ;
end = start + bytes ;
while ( end > fault & & end [ - 1 ] = = value )
end - - ;
slab_bug ( s , " %s overwritten " , what ) ;
2019-12-01 04:49:31 +03:00
pr_err ( " INFO: 0x%p-0x%p @offset=%tu. First byte 0x%x instead of 0x%x \n " ,
fault , end - 1 , fault - addr ,
fault [ 0 ] , value ) ;
2007-07-17 15:03:18 +04:00
print_trailer ( s , page , object ) ;
restore_bytes ( s , what , value , fault , end ) ;
return 0 ;
2007-05-07 01:49:36 +04:00
}
/*
* Object layout :
*
* object address
* Bytes of the object to be managed .
* If the freepointer may overlay the object then the free
2020-05-08 04:36:06 +03:00
* pointer is at the middle of the object .
2007-05-09 13:32:39 +04:00
*
2007-05-07 01:49:36 +04:00
* Poisoning uses 0x6b ( POISON_FREE ) and the last byte is
* 0xa5 ( POISON_END )
*
2012-06-13 19:24:57 +04:00
* object + s - > object_size
2007-05-07 01:49:36 +04:00
* Padding to reach word boundary . This is also used for Redzoning .
2007-05-09 13:32:39 +04:00
* Padding is extended by another word if Redzoning is enabled and
2012-06-13 19:24:57 +04:00
* object_size = = inuse .
2007-05-09 13:32:39 +04:00
*
2007-05-07 01:49:36 +04:00
* We fill with 0xbb ( RED_INACTIVE ) for inactive objects and with
* 0xcc ( RED_ACTIVE ) for objects in use .
*
* object + s - > inuse
2007-05-09 13:32:39 +04:00
* Meta data starts here .
*
2007-05-07 01:49:36 +04:00
* A . Free pointer ( if we cannot overwrite object on free )
* B . Tracking data for SLAB_STORE_USER
2007-05-09 13:32:39 +04:00
* C . Padding to reach required alignment boundary or at mininum
2008-02-16 10:45:26 +03:00
* one word if debugging is on to be able to detect writes
2007-05-09 13:32:39 +04:00
* before the word boundary .
*
* Padding is done using 0x5a ( POISON_INUSE )
2007-05-07 01:49:36 +04:00
*
* object + s - > size
2007-05-09 13:32:39 +04:00
* Nothing is used beyond s - > size .
2007-05-07 01:49:36 +04:00
*
2012-06-13 19:24:57 +04:00
* If slabcaches are merged then the object_size and inuse boundaries are mostly
2007-05-09 13:32:39 +04:00
* ignored . And therefore no slab options that rely on these boundaries
2007-05-07 01:49:36 +04:00
* may be used with merged slabcaches .
*/
static int check_pad_bytes ( struct kmem_cache * s , struct page * page , u8 * p )
{
2020-05-08 04:36:06 +03:00
unsigned long off = get_info_end ( s ) ; /* The end of info */
2007-05-07 01:49:36 +04:00
if ( s - > flags & SLAB_STORE_USER )
/* We also have user information there */
off + = 2 * sizeof ( struct track ) ;
2016-07-29 01:49:07 +03:00
off + = kasan_metadata_size ( s ) ;
2016-03-16 00:55:12 +03:00
if ( size_from_object ( s ) = = off )
2007-05-07 01:49:36 +04:00
return 1 ;
2007-07-17 15:03:18 +04:00
return check_bytes_and_report ( s , page , p , " Object padding " ,
2016-03-16 00:55:12 +03:00
p + off , POISON_INUSE , size_from_object ( s ) - off ) ;
2007-05-07 01:49:36 +04:00
}
2008-04-14 20:11:30 +04:00
/* Check the pad bytes at the end of a slab page */
2007-05-07 01:49:36 +04:00
static int slab_pad_check ( struct kmem_cache * s , struct page * page )
{
2007-07-17 15:03:18 +04:00
u8 * start ;
u8 * fault ;
u8 * end ;
2018-02-01 03:15:43 +03:00
u8 * pad ;
2007-07-17 15:03:18 +04:00
int length ;
int remainder ;
2007-05-07 01:49:36 +04:00
if ( ! ( s - > flags & SLAB_POISON ) )
return 1 ;
2008-03-02 00:40:44 +03:00
start = page_address ( page ) ;
2019-09-24 01:34:25 +03:00
length = page_size ( page ) ;
2008-04-14 20:11:30 +04:00
end = start + length ;
remainder = length % s - > size ;
2007-05-07 01:49:36 +04:00
if ( ! remainder )
return 1 ;
2018-02-01 03:15:43 +03:00
pad = end - remainder ;
2015-02-14 01:39:38 +03:00
metadata_access_enable ( ) ;
2018-02-01 03:15:43 +03:00
fault = memchr_inv ( pad , POISON_INUSE , remainder ) ;
2015-02-14 01:39:38 +03:00
metadata_access_disable ( ) ;
2007-07-17 15:03:18 +04:00
if ( ! fault )
return 1 ;
while ( end > fault & & end [ - 1 ] = = POISON_INUSE )
end - - ;
2019-12-01 04:49:31 +03:00
slab_err ( s , page , " Padding overwritten. 0x%p-0x%p @offset=%tu " ,
fault , end - 1 , fault - start ) ;
2018-02-01 03:15:43 +03:00
print_section ( KERN_ERR , " Padding " , pad , remainder ) ;
2007-07-17 15:03:18 +04:00
2018-02-01 03:15:43 +03:00
restore_bytes ( s , " slab padding " , POISON_INUSE , fault , end ) ;
2007-07-17 15:03:18 +04:00
return 0 ;
2007-05-07 01:49:36 +04:00
}
static int check_object ( struct kmem_cache * s , struct page * page ,
2010-09-29 16:15:01 +04:00
void * object , u8 val )
2007-05-07 01:49:36 +04:00
{
u8 * p = object ;
2012-06-13 19:24:57 +04:00
u8 * endobject = object + s - > object_size ;
2007-05-07 01:49:36 +04:00
if ( s - > flags & SLAB_RED_ZONE ) {
2016-03-16 00:55:12 +03:00
if ( ! check_bytes_and_report ( s , page , object , " Redzone " ,
object - s - > red_left_pad , val , s - > red_left_pad ) )
return 0 ;
2007-07-17 15:03:18 +04:00
if ( ! check_bytes_and_report ( s , page , object , " Redzone " ,
2012-06-13 19:24:57 +04:00
endobject , val , s - > inuse - s - > object_size ) )
2007-05-07 01:49:36 +04:00
return 0 ;
} else {
2012-06-13 19:24:57 +04:00
if ( ( s - > flags & SLAB_POISON ) & & s - > object_size < s - > inuse ) {
2008-02-06 04:57:39 +03:00
check_bytes_and_report ( s , page , p , " Alignment padding " ,
2013-07-15 05:05:29 +04:00
endobject , POISON_INUSE ,
s - > inuse - s - > object_size ) ;
2008-02-06 04:57:39 +03:00
}
2007-05-07 01:49:36 +04:00
}
if ( s - > flags & SLAB_POISON ) {
2010-09-29 16:15:01 +04:00
if ( val ! = SLUB_RED_ACTIVE & & ( s - > flags & __OBJECT_POISON ) & &
2007-07-17 15:03:18 +04:00
( ! check_bytes_and_report ( s , page , p , " Poison " , p ,
2012-06-13 19:24:57 +04:00
POISON_FREE , s - > object_size - 1 ) | |
2007-07-17 15:03:18 +04:00
! check_bytes_and_report ( s , page , p , " Poison " ,
2012-06-13 19:24:57 +04:00
p + s - > object_size - 1 , POISON_END , 1 ) ) )
2007-05-07 01:49:36 +04:00
return 0 ;
/*
* check_pad_bytes cleans up on its own .
*/
check_pad_bytes ( s , page , p ) ;
}
2020-05-08 04:36:06 +03:00
if ( ! freeptr_outside_object ( s ) & & val = = SLUB_RED_ACTIVE )
2007-05-07 01:49:36 +04:00
/*
* Object and freepointer overlap . Cannot check
* freepointer while object is allocated .
*/
return 1 ;
/* Check free pointer validity */
if ( ! check_valid_pointer ( s , page , get_freepointer ( s , p ) ) ) {
object_err ( s , page , p , " Freepointer corrupt " ) ;
/*
2008-12-05 06:08:08 +03:00
* No choice but to zap it and thus lose the remainder
2007-05-07 01:49:36 +04:00
* of the free objects in this slab . May cause
2007-05-09 13:32:39 +04:00
* another error because the object count is now wrong .
2007-05-07 01:49:36 +04:00
*/
2008-03-02 00:40:44 +03:00
set_freepointer ( s , p , NULL ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
return 1 ;
}
static int check_slab ( struct kmem_cache * s , struct page * page )
{
2008-04-14 20:11:30 +04:00
int maxobj ;
2007-05-07 01:49:36 +04:00
VM_BUG_ON ( ! irqs_disabled ( ) ) ;
if ( ! PageSlab ( page ) ) {
2007-07-17 15:03:18 +04:00
slab_err ( s , page , " Not a valid slab page " ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
2008-04-14 20:11:30 +04:00
2018-06-08 03:09:10 +03:00
maxobj = order_objects ( compound_order ( page ) , s - > size ) ;
2008-04-14 20:11:30 +04:00
if ( page - > objects > maxobj ) {
slab_err ( s , page , " objects %u > max %u " ,
mm: slub: fix format mismatches in slab_err() callers
Adding __printf(3, 4) to slab_err exposed following:
mm/slub.c: In function `check_slab':
mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
s->name, page->objects, maxobj);
^
mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
s->name, page->inuse, page->objects);
^
mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]
mm/slub.c: In function `on_freelist':
mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
"should be %d", page->objects, max_objects);
Fix first two warnings by removing redundant s->name.
Fix the last by changing type of max_object from unsigned long to int.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11 02:42:22 +03:00
page - > objects , maxobj ) ;
2008-04-14 20:11:30 +04:00
return 0 ;
}
if ( page - > inuse > page - > objects ) {
2007-07-17 15:03:18 +04:00
slab_err ( s , page , " inuse %u > max %u " ,
mm: slub: fix format mismatches in slab_err() callers
Adding __printf(3, 4) to slab_err exposed following:
mm/slub.c: In function `check_slab':
mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
s->name, page->objects, maxobj);
^
mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
s->name, page->inuse, page->objects);
^
mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]
mm/slub.c: In function `on_freelist':
mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
"should be %d", page->objects, max_objects);
Fix first two warnings by removing redundant s->name.
Fix the last by changing type of max_object from unsigned long to int.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11 02:42:22 +03:00
page - > inuse , page - > objects ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
/* Slab_pad_check fixes things up after itself */
slab_pad_check ( s , page ) ;
return 1 ;
}
/*
2007-05-09 13:32:39 +04:00
* Determine if a certain object on a page is on the freelist . Must hold the
* slab lock to guarantee that the chains are in a consistent state .
2007-05-07 01:49:36 +04:00
*/
static int on_freelist ( struct kmem_cache * s , struct page * page , void * search )
{
int nr = 0 ;
2011-06-01 21:25:53 +04:00
void * fp ;
2007-05-07 01:49:36 +04:00
void * object = NULL ;
mm: slub: fix format mismatches in slab_err() callers
Adding __printf(3, 4) to slab_err exposed following:
mm/slub.c: In function `check_slab':
mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
s->name, page->objects, maxobj);
^
mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
s->name, page->inuse, page->objects);
^
mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]
mm/slub.c: In function `on_freelist':
mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
"should be %d", page->objects, max_objects);
Fix first two warnings by removing redundant s->name.
Fix the last by changing type of max_object from unsigned long to int.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11 02:42:22 +03:00
int max_objects ;
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:53 +04:00
fp = page - > freelist ;
2008-04-14 20:11:30 +04:00
while ( fp & & nr < = page - > objects ) {
2007-05-07 01:49:36 +04:00
if ( fp = = search )
return 1 ;
if ( ! check_valid_pointer ( s , page , fp ) ) {
if ( object ) {
object_err ( s , page , object ,
" Freechain corrupt " ) ;
2008-03-02 00:40:44 +03:00
set_freepointer ( s , object , NULL ) ;
2007-05-07 01:49:36 +04:00
} else {
2007-07-17 15:03:18 +04:00
slab_err ( s , page , " Freepointer corrupt " ) ;
2008-03-02 00:40:44 +03:00
page - > freelist = NULL ;
2008-04-14 20:11:30 +04:00
page - > inuse = page - > objects ;
2007-07-17 15:03:18 +04:00
slab_fix ( s , " Freelist cleared " ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
break ;
}
object = fp ;
fp = get_freepointer ( s , object ) ;
nr + + ;
}
2018-06-08 03:09:10 +03:00
max_objects = order_objects ( compound_order ( page ) , s - > size ) ;
2008-10-22 23:00:38 +04:00
if ( max_objects > MAX_OBJS_PER_PAGE )
max_objects = MAX_OBJS_PER_PAGE ;
2008-04-14 20:11:31 +04:00
if ( page - > objects ! = max_objects ) {
2016-03-18 00:19:47 +03:00
slab_err ( s , page , " Wrong number of objects. Found %d but should be %d " ,
page - > objects , max_objects ) ;
2008-04-14 20:11:31 +04:00
page - > objects = max_objects ;
slab_fix ( s , " Number of objects adjusted. " ) ;
}
2008-04-14 20:11:30 +04:00
if ( page - > inuse ! = page - > objects - nr ) {
2016-03-18 00:19:47 +03:00
slab_err ( s , page , " Wrong object count. Counter is %d but counted were %d " ,
page - > inuse , page - > objects - nr ) ;
2008-04-14 20:11:30 +04:00
page - > inuse = page - > objects - nr ;
2007-07-17 15:03:18 +04:00
slab_fix ( s , " Object count adjusted. " ) ;
2007-05-07 01:49:36 +04:00
}
return search = = NULL ;
}
2008-04-30 03:11:12 +04:00
static void trace ( struct kmem_cache * s , struct page * page , void * object ,
int alloc )
2007-05-17 09:11:00 +04:00
{
if ( s - > flags & SLAB_TRACE ) {
2014-06-05 03:06:34 +04:00
pr_info ( " TRACE %s %s 0x%p inuse=%d fp=0x%p \n " ,
2007-05-17 09:11:00 +04:00
s - > name ,
alloc ? " alloc " : " free " ,
object , page - > inuse ,
page - > freelist ) ;
if ( ! alloc )
2017-01-25 02:18:02 +03:00
print_section ( KERN_INFO , " Object " , ( void * ) object ,
2013-07-15 05:05:29 +04:00
s - > object_size ) ;
2007-05-17 09:11:00 +04:00
dump_stack ( ) ;
}
}
2007-05-07 01:49:42 +04:00
/*
2007-05-09 13:32:39 +04:00
* Tracking of fully allocated slabs for debugging purposes .
2007-05-07 01:49:42 +04:00
*/
2011-06-01 21:25:50 +04:00
static void add_full ( struct kmem_cache * s ,
struct kmem_cache_node * n , struct page * page )
2007-05-07 01:49:42 +04:00
{
2011-06-01 21:25:50 +04:00
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2014-02-11 02:25:39 +04:00
lockdep_assert_held ( & n - > list_lock ) ;
2019-05-14 03:16:12 +03:00
list_add ( & page - > slab_list , & n - > full ) ;
2007-05-07 01:49:42 +04:00
}
2014-01-10 16:23:49 +04:00
static void remove_full ( struct kmem_cache * s , struct kmem_cache_node * n , struct page * page )
2007-05-07 01:49:42 +04:00
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2014-02-11 02:25:39 +04:00
lockdep_assert_held ( & n - > list_lock ) ;
2019-05-14 03:16:12 +03:00
list_del ( & page - > slab_list ) ;
2007-05-07 01:49:42 +04:00
}
2008-04-14 19:53:02 +04:00
/* Tracking of the number of slabs for debugging purposes */
static inline unsigned long slabs_node ( struct kmem_cache * s , int node )
{
struct kmem_cache_node * n = get_node ( s , node ) ;
return atomic_long_read ( & n - > nr_slabs ) ;
}
2009-06-11 14:08:48 +04:00
static inline unsigned long node_nr_slabs ( struct kmem_cache_node * n )
{
return atomic_long_read ( & n - > nr_slabs ) ;
}
2008-04-14 20:11:40 +04:00
static inline void inc_slabs_node ( struct kmem_cache * s , int node , int objects )
2008-04-14 19:53:02 +04:00
{
struct kmem_cache_node * n = get_node ( s , node ) ;
/*
* May be called early in order to allocate a slab for the
* kmem_cache_node structure . Solve the chicken - egg
* dilemma by deferring the increment of the count during
* bootstrap ( see early_kmem_cache_node_alloc ) .
*/
2013-01-21 12:01:27 +04:00
if ( likely ( n ) ) {
2008-04-14 19:53:02 +04:00
atomic_long_inc ( & n - > nr_slabs ) ;
2008-04-14 20:11:40 +04:00
atomic_long_add ( objects , & n - > total_objects ) ;
}
2008-04-14 19:53:02 +04:00
}
2008-04-14 20:11:40 +04:00
static inline void dec_slabs_node ( struct kmem_cache * s , int node , int objects )
2008-04-14 19:53:02 +04:00
{
struct kmem_cache_node * n = get_node ( s , node ) ;
atomic_long_dec ( & n - > nr_slabs ) ;
2008-04-14 20:11:40 +04:00
atomic_long_sub ( objects , & n - > total_objects ) ;
2008-04-14 19:53:02 +04:00
}
/* Object debug checks for alloc/free paths */
2007-05-17 09:11:00 +04:00
static void setup_object_debug ( struct kmem_cache * s , struct page * page ,
void * object )
{
2020-08-07 09:18:58 +03:00
if ( ! kmem_cache_debug_flags ( s , SLAB_STORE_USER | SLAB_RED_ZONE | __OBJECT_POISON ) )
2007-05-17 09:11:00 +04:00
return ;
2010-09-29 16:15:01 +04:00
init_object ( s , object , SLUB_RED_INACTIVE ) ;
2007-05-17 09:11:00 +04:00
init_tracking ( s , object ) ;
}
2019-09-24 01:34:25 +03:00
static
void setup_page_debug ( struct kmem_cache * s , struct page * page , void * addr )
2019-02-21 09:19:23 +03:00
{
2020-08-07 09:18:58 +03:00
if ( ! kmem_cache_debug_flags ( s , SLAB_POISON ) )
2019-02-21 09:19:23 +03:00
return ;
metadata_access_enable ( ) ;
2019-09-24 01:34:25 +03:00
memset ( addr , POISON_INUSE , page_size ( page ) ) ;
2019-02-21 09:19:23 +03:00
metadata_access_disable ( ) ;
}
2016-03-16 00:55:06 +03:00
static inline int alloc_consistency_checks ( struct kmem_cache * s ,
2019-03-06 02:42:10 +03:00
struct page * page , void * object )
2007-05-07 01:49:36 +04:00
{
if ( ! check_slab ( s , page ) )
2016-03-16 00:55:06 +03:00
return 0 ;
2007-05-07 01:49:36 +04:00
if ( ! check_valid_pointer ( s , page , object ) ) {
object_err ( s , page , object , " Freelist Pointer check fails " ) ;
2016-03-16 00:55:06 +03:00
return 0 ;
2007-05-07 01:49:36 +04:00
}
2010-09-29 16:15:01 +04:00
if ( ! check_object ( s , page , object , SLUB_RED_INACTIVE ) )
2016-03-16 00:55:06 +03:00
return 0 ;
return 1 ;
}
static noinline int alloc_debug_processing ( struct kmem_cache * s ,
struct page * page ,
void * object , unsigned long addr )
{
if ( s - > flags & SLAB_CONSISTENCY_CHECKS ) {
2019-03-06 02:42:10 +03:00
if ( ! alloc_consistency_checks ( s , page , object ) )
2016-03-16 00:55:06 +03:00
goto bad ;
}
2007-05-07 01:49:36 +04:00
2007-05-17 09:11:00 +04:00
/* Success perform special debug activities for allocs */
if ( s - > flags & SLAB_STORE_USER )
set_track ( s , object , TRACK_ALLOC , addr ) ;
trace ( s , page , object , 1 ) ;
2010-09-29 16:15:01 +04:00
init_object ( s , object , SLUB_RED_ACTIVE ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
2007-05-17 09:11:00 +04:00
2007-05-07 01:49:36 +04:00
bad :
if ( PageSlab ( page ) ) {
/*
* If this is a slab page then lets do the best we can
* to avoid issues in the future . Marking all objects
2007-05-09 13:32:39 +04:00
* as used avoids touching the remaining objects .
2007-05-07 01:49:36 +04:00
*/
2007-07-17 15:03:18 +04:00
slab_fix ( s , " Marking all objects used " ) ;
2008-04-14 20:11:30 +04:00
page - > inuse = page - > objects ;
2008-03-02 00:40:44 +03:00
page - > freelist = NULL ;
2007-05-07 01:49:36 +04:00
}
return 0 ;
}
2016-03-16 00:55:06 +03:00
static inline int free_consistency_checks ( struct kmem_cache * s ,
struct page * page , void * object , unsigned long addr )
2007-05-07 01:49:36 +04:00
{
if ( ! check_valid_pointer ( s , page , object ) ) {
2007-05-07 01:49:47 +04:00
slab_err ( s , page , " Invalid object pointer 0x%p " , object ) ;
2016-03-16 00:55:06 +03:00
return 0 ;
2007-05-07 01:49:36 +04:00
}
if ( on_freelist ( s , page , object ) ) {
2007-07-17 15:03:18 +04:00
object_err ( s , page , object , " Object already free " ) ;
2016-03-16 00:55:06 +03:00
return 0 ;
2007-05-07 01:49:36 +04:00
}
2010-09-29 16:15:01 +04:00
if ( ! check_object ( s , page , object , SLUB_RED_ACTIVE ) )
2016-03-16 00:55:06 +03:00
return 0 ;
2007-05-07 01:49:36 +04:00
slub: Commonize slab_cache field in struct page
Right now, slab and slub have fields in struct page to derive which
cache a page belongs to, but they do it slightly differently.
slab uses a field called slab_cache, that lives in the third double
word. slub, uses a field called "slab", living outside of the
doublewords area.
Ideally, we could use the same field for this. Since slub heavily makes
use of the doubleword region, there isn't really much room to move
slub's slab_cache field around. Since slab does not have such strict
placement restrictions, we can move it outside the doubleword area.
The naming used by slab, "slab_cache", is less confusing, and it is
preferred over slub's generic "slab".
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-22 18:05:36 +04:00
if ( unlikely ( s ! = page - > slab_cache ) ) {
2008-02-06 04:57:39 +03:00
if ( ! PageSlab ( page ) ) {
2016-03-18 00:19:47 +03:00
slab_err ( s , page , " Attempt to free object(0x%p) outside of slab " ,
object ) ;
slub: Commonize slab_cache field in struct page
Right now, slab and slub have fields in struct page to derive which
cache a page belongs to, but they do it slightly differently.
slab uses a field called slab_cache, that lives in the third double
word. slub, uses a field called "slab", living outside of the
doublewords area.
Ideally, we could use the same field for this. Since slub heavily makes
use of the doubleword region, there isn't really much room to move
slub's slab_cache field around. Since slab does not have such strict
placement restrictions, we can move it outside the doubleword area.
The naming used by slab, "slab_cache", is less confusing, and it is
preferred over slub's generic "slab".
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-22 18:05:36 +04:00
} else if ( ! page - > slab_cache ) {
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB <none>: no slab for object 0x%p. \n " ,
object ) ;
2007-05-07 01:49:47 +04:00
dump_stack ( ) ;
2008-01-08 10:20:27 +03:00
} else
2007-07-17 15:03:18 +04:00
object_err ( s , page , object ,
" page slab pointer corrupt. " ) ;
2016-03-16 00:55:06 +03:00
return 0 ;
}
return 1 ;
}
/* Supports checking bulk free of a constructed freelist */
static noinline int free_debug_processing (
struct kmem_cache * s , struct page * page ,
void * head , void * tail , int bulk_cnt ,
unsigned long addr )
{
struct kmem_cache_node * n = get_node ( s , page_to_nid ( page ) ) ;
void * object = head ;
int cnt = 0 ;
treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-03 23:09:38 +03:00
unsigned long flags ;
2016-03-16 00:55:06 +03:00
int ret = 0 ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
slab_lock ( page ) ;
if ( s - > flags & SLAB_CONSISTENCY_CHECKS ) {
if ( ! check_slab ( s , page ) )
goto out ;
}
next_object :
cnt + + ;
if ( s - > flags & SLAB_CONSISTENCY_CHECKS ) {
if ( ! free_consistency_checks ( s , page , object , addr ) )
goto out ;
2007-05-07 01:49:36 +04:00
}
2007-05-17 09:11:00 +04:00
if ( s - > flags & SLAB_STORE_USER )
set_track ( s , object , TRACK_FREE , addr ) ;
trace ( s , page , object , 0 ) ;
2015-11-21 02:57:46 +03:00
/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
2010-09-29 16:15:01 +04:00
init_object ( s , object , SLUB_RED_INACTIVE ) ;
2015-11-21 02:57:46 +03:00
/* Reached end of constructed freelist yet? */
if ( object ! = tail ) {
object = get_freepointer ( s , object ) ;
goto next_object ;
}
2016-03-16 00:55:02 +03:00
ret = 1 ;
2011-06-01 21:25:54 +04:00
out :
2015-11-21 02:57:46 +03:00
if ( cnt ! = bulk_cnt )
slab_err ( s , page , " Bulk freelist count(%d) invalid(%d) \n " ,
bulk_cnt , cnt ) ;
2011-06-01 21:25:53 +04:00
slab_unlock ( page ) ;
2016-03-16 00:54:59 +03:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2016-03-16 00:55:02 +03:00
if ( ! ret )
slab_fix ( s , " Object at 0x%p not freed " , object ) ;
return ret ;
2007-05-07 01:49:36 +04:00
}
2020-08-07 09:18:35 +03:00
/*
* Parse a block of slub_debug options . Blocks are delimited by ' ; '
*
* @ str : start of block
* @ flags : returns parsed flags , or DEBUG_DEFAULT_FLAGS if none specified
* @ slabs : return start of list of slabs , or NULL when there ' s no list
* @ init : assume this is initial parsing and not per - kmem - create parsing
*
* returns the start of next block if there ' s any , or NULL
*/
static char *
parse_slub_debug_flags ( char * str , slab_flags_t * flags , char * * slabs , bool init )
2007-05-09 13:32:44 +04:00
{
2020-08-07 09:18:35 +03:00
bool higher_order_disable = false ;
2007-07-16 10:38:14 +04:00
2020-08-07 09:18:35 +03:00
/* Skip any completely empty blocks */
while ( * str & & * str = = ' ; ' )
str + + ;
if ( * str = = ' , ' ) {
2007-07-16 10:38:14 +04:00
/*
* No options but restriction on slabs . This means full
* debugging for slabs matching a pattern .
*/
2020-08-07 09:18:35 +03:00
* flags = DEBUG_DEFAULT_FLAGS ;
2007-07-16 10:38:14 +04:00
goto check_slabs ;
2020-08-07 09:18:35 +03:00
}
* flags = 0 ;
2007-07-16 10:38:14 +04:00
2020-08-07 09:18:35 +03:00
/* Determine which debug features should be switched on */
for ( ; * str & & * str ! = ' , ' & & * str ! = ' ; ' ; str + + ) {
2007-07-16 10:38:14 +04:00
switch ( tolower ( * str ) ) {
2020-08-07 09:18:35 +03:00
case ' - ' :
* flags = 0 ;
break ;
2007-07-16 10:38:14 +04:00
case ' f ' :
2020-08-07 09:18:35 +03:00
* flags | = SLAB_CONSISTENCY_CHECKS ;
2007-07-16 10:38:14 +04:00
break ;
case ' z ' :
2020-08-07 09:18:35 +03:00
* flags | = SLAB_RED_ZONE ;
2007-07-16 10:38:14 +04:00
break ;
case ' p ' :
2020-08-07 09:18:35 +03:00
* flags | = SLAB_POISON ;
2007-07-16 10:38:14 +04:00
break ;
case ' u ' :
2020-08-07 09:18:35 +03:00
* flags | = SLAB_STORE_USER ;
2007-07-16 10:38:14 +04:00
break ;
case ' t ' :
2020-08-07 09:18:35 +03:00
* flags | = SLAB_TRACE ;
2007-07-16 10:38:14 +04:00
break ;
2010-02-26 09:36:12 +03:00
case ' a ' :
2020-08-07 09:18:35 +03:00
* flags | = SLAB_FAILSLAB ;
2010-02-26 09:36:12 +03:00
break ;
2015-04-15 01:44:25 +03:00
case ' o ' :
/*
* Avoid enabling debugging on caches if its minimum
* order would increase as a result .
*/
2020-08-07 09:18:35 +03:00
higher_order_disable = true ;
2015-04-15 01:44:25 +03:00
break ;
2007-07-16 10:38:14 +04:00
default :
2020-08-07 09:18:35 +03:00
if ( init )
pr_err ( " slub_debug option '%c' unknown. skipped \n " , * str ) ;
2007-07-16 10:38:14 +04:00
}
2007-05-09 13:32:44 +04:00
}
2007-07-16 10:38:14 +04:00
check_slabs :
2007-05-09 13:32:44 +04:00
if ( * str = = ' , ' )
2020-08-07 09:18:35 +03:00
* slabs = + + str ;
else
* slabs = NULL ;
/* Skip over the slab list */
while ( * str & & * str ! = ' ; ' )
str + + ;
/* Skip any completely empty blocks */
while ( * str & & * str = = ' ; ' )
str + + ;
if ( init & & higher_order_disable )
disable_higher_order_debug = 1 ;
if ( * str )
return str ;
else
return NULL ;
}
static int __init setup_slub_debug ( char * str )
{
slab_flags_t flags ;
char * saved_str ;
char * slab_list ;
bool global_slub_debug_changed = false ;
bool slab_list_specified = false ;
slub_debug = DEBUG_DEFAULT_FLAGS ;
if ( * str + + ! = ' = ' | | ! * str )
/*
* No options specified . Switch on full debugging .
*/
goto out ;
saved_str = str ;
while ( str ) {
str = parse_slub_debug_flags ( str , & flags , & slab_list , true ) ;
if ( ! slab_list ) {
slub_debug = flags ;
global_slub_debug_changed = true ;
} else {
slab_list_specified = true ;
}
}
/*
* For backwards compatibility , a single list of flags with list of
* slabs means debugging is only enabled for those slabs , so the global
* slub_debug should be 0. We can extended that to multiple lists as
* long as there is no option specifying flags without a slab list .
*/
if ( slab_list_specified ) {
if ( ! global_slub_debug_changed )
slub_debug = 0 ;
slub_debug_string = saved_str ;
}
2007-07-16 10:38:14 +04:00
out :
2020-08-07 09:18:51 +03:00
if ( slub_debug ! = 0 | | slub_debug_string )
static_branch_enable ( & slub_debug_enabled ) ;
mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
Patch series "add init_on_alloc/init_on_free boot options", v10.
Provide init_on_alloc and init_on_free boot options.
These are aimed at preventing possible information leaks and making the
control-flow bugs that depend on uninitialized values more deterministic.
Enabling either of the options guarantees that the memory returned by the
page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
isn't supported at the moment, as its emulation of kmem caches complicates
handling of SLAB_TYPESAFE_BY_RCU caches correctly.
Enabling init_on_free also guarantees that pages and heap objects are
initialized right after they're freed, so it won't be possible to access
stale data by using a dangling pointer.
As suggested by Michal Hocko, right now we don't let the heap users to
disable initialization for certain allocations. There's not enough
evidence that doing so can speed up real-life cases, and introducing ways
to opt-out may result in things going out of control.
This patch (of 2):
The new options are needed to prevent possible information leaks and make
control-flow bugs that depend on uninitialized values more deterministic.
This is expected to be on-by-default on Android and Chrome OS. And it
gives the opportunity for anyone else to use it under distros too via the
boot args. (The init_on_free feature is regularly requested by folks
where memory forensics is included in their threat models.)
init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes. Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.
init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion. This helps to ensure sensitive data
doesn't leak via use-after-free accesses.
Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory. The two exceptions are slab caches with
constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
zero-initialized to preserve their semantics.
Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.
If either SLUB poisoning or page poisoning is enabled, those options take
precedence over init_on_alloc and init_on_free: initialization is only
applied to unpoisoned allocations.
Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
is within the standard error.
The new features are also going to pave the way for hardware memory
tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
hooks to set the tags for heap objects. With MTE, tagging will have the
same cost as memory initialization.
Although init_on_free is rather costly, there are paranoid use-cases where
in-memory data lifetime is desired to be minimized. There are various
arguments for/against the realism of the associated threat models, but
given that we'll need the infrastructure for MTE anyway, and there are
people who want wipe-on-free behavior no matter what the performance cost,
it seems reasonable to include it in this series.
[glider@google.com: v8]
Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
[glider@google.com: v10]
Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts
Acked-by: James Morris <jamorris@linux.microsoft.com>]
Cc: Christoph Lameter <cl@linux.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 06:59:19 +03:00
if ( ( static_branch_unlikely ( & init_on_alloc ) | |
static_branch_unlikely ( & init_on_free ) ) & &
( slub_debug & SLAB_POISON ) )
pr_info ( " mem auto-init: SLAB_POISON will take precedence over init_on_alloc/init_on_free \n " ) ;
2007-05-09 13:32:44 +04:00
return 1 ;
}
__setup ( " slub_debug " , setup_slub_debug ) ;
2018-10-27 01:03:15 +03:00
/*
* kmem_cache_flags - apply debugging options to the cache
* @ object_size : the size of an object without meta data
* @ flags : flags to set
* @ name : name of the cache
* @ ctor : constructor function
*
* Debug option ( s ) are applied to @ flags . In addition to the debug
* option ( s ) , if a slab name ( or multiple ) is specified i . e .
* slub_debug = < Debug - Options > , < slab name1 > , < slab name2 > . . .
* then only the select slabs will receive the debug option ( s ) .
*/
2018-04-06 02:21:24 +03:00
slab_flags_t kmem_cache_flags ( unsigned int object_size ,
2017-11-16 04:32:18 +03:00
slab_flags_t flags , const char * name ,
2008-07-26 06:45:34 +04:00
void ( * ctor ) ( void * ) )
2007-05-09 13:32:44 +04:00
{
2018-10-27 01:03:15 +03:00
char * iter ;
size_t len ;
2020-08-07 09:18:35 +03:00
char * next_block ;
slab_flags_t block_flags ;
2018-10-27 01:03:15 +03:00
/* If slub_debug = 0, it folds into the if conditional. */
2020-08-07 09:18:35 +03:00
if ( ! slub_debug_string )
2018-10-27 01:03:15 +03:00
return flags | slub_debug ;
len = strlen ( name ) ;
2020-08-07 09:18:35 +03:00
next_block = slub_debug_string ;
/* Go through all blocks of debug options, see if any matches our slab's name */
while ( next_block ) {
next_block = parse_slub_debug_flags ( next_block , & block_flags , & iter , false ) ;
if ( ! iter )
continue ;
/* Found a block that has a slab list, search it */
while ( * iter ) {
char * end , * glob ;
size_t cmplen ;
end = strchrnul ( iter , ' , ' ) ;
if ( next_block & & next_block < end )
end = next_block - 1 ;
glob = strnchr ( iter , end - iter , ' * ' ) ;
if ( glob )
cmplen = glob - iter ;
else
cmplen = max_t ( size_t , len , ( end - iter ) ) ;
2018-10-27 01:03:15 +03:00
2020-08-07 09:18:35 +03:00
if ( ! strncmp ( name , iter , cmplen ) ) {
flags | = block_flags ;
return flags ;
}
2018-10-27 01:03:15 +03:00
2020-08-07 09:18:35 +03:00
if ( ! * end | | * end = = ' ; ' )
break ;
iter = end + 1 ;
2018-10-27 01:03:15 +03:00
}
}
2007-09-12 02:24:11 +04:00
2020-08-07 09:18:35 +03:00
return slub_debug ;
2007-05-09 13:32:44 +04:00
}
2015-11-21 02:57:41 +03:00
# else /* !CONFIG_SLUB_DEBUG */
2007-05-17 09:11:00 +04:00
static inline void setup_object_debug ( struct kmem_cache * s ,
struct page * page , void * object ) { }
2019-09-24 01:34:25 +03:00
static inline
void setup_page_debug ( struct kmem_cache * s , struct page * page , void * addr ) { }
2007-05-09 13:32:44 +04:00
2007-05-17 09:11:00 +04:00
static inline int alloc_debug_processing ( struct kmem_cache * s ,
2008-08-19 21:43:25 +04:00
struct page * page , void * object , unsigned long addr ) { return 0 ; }
2007-05-09 13:32:44 +04:00
2016-03-16 00:54:59 +03:00
static inline int free_debug_processing (
2015-11-21 02:57:46 +03:00
struct kmem_cache * s , struct page * page ,
void * head , void * tail , int bulk_cnt ,
2016-03-16 00:54:59 +03:00
unsigned long addr ) { return 0 ; }
2007-05-09 13:32:44 +04:00
static inline int slab_pad_check ( struct kmem_cache * s , struct page * page )
{ return 1 ; }
static inline int check_object ( struct kmem_cache * s , struct page * page ,
2010-09-29 16:15:01 +04:00
void * object , u8 val ) { return 1 ; }
2011-06-01 21:25:50 +04:00
static inline void add_full ( struct kmem_cache * s , struct kmem_cache_node * n ,
struct page * page ) { }
2014-01-10 16:23:49 +04:00
static inline void remove_full ( struct kmem_cache * s , struct kmem_cache_node * n ,
struct page * page ) { }
2018-04-06 02:21:24 +03:00
slab_flags_t kmem_cache_flags ( unsigned int object_size ,
2017-11-16 04:32:18 +03:00
slab_flags_t flags , const char * name ,
2008-07-26 06:45:34 +04:00
void ( * ctor ) ( void * ) )
2007-09-12 02:24:11 +04:00
{
return flags ;
}
2007-05-09 13:32:44 +04:00
# define slub_debug 0
2008-04-14 19:53:02 +04:00
2009-09-15 13:00:26 +04:00
# define disable_higher_order_debug 0
2008-04-14 19:53:02 +04:00
static inline unsigned long slabs_node ( struct kmem_cache * s , int node )
{ return 0 ; }
2009-06-11 14:08:48 +04:00
static inline unsigned long node_nr_slabs ( struct kmem_cache_node * n )
{ return 0 ; }
2008-04-14 20:11:40 +04:00
static inline void inc_slabs_node ( struct kmem_cache * s , int node ,
int objects ) { }
static inline void dec_slabs_node ( struct kmem_cache * s , int node ,
int objects ) { }
2010-08-25 23:07:16 +04:00
2020-06-02 07:45:47 +03:00
static bool freelist_corrupted ( struct kmem_cache * s , struct page * page ,
void * freelist , void * nextfree )
{
return false ;
}
2014-08-07 03:04:18 +04:00
# endif /* CONFIG_SLUB_DEBUG */
/*
* Hooks for other subsystems that check memory allocations . In a typical
* production configuration these hooks all should produce no code at all .
*/
kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v13.
This patchset adds a new software tag-based mode to KASAN [1]. (Initially
this mode was called KHWASAN, but it got renamed, see the naming rationale
at the end of this section).
The plan is to implement HWASan [2] for the kernel with the incentive,
that it's going to have comparable to KASAN performance, but in the same
time consume much less memory, trading that off for somewhat imprecise bug
detection and being supported only for arm64.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error.
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
to dereference such pointers. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which
would use hardware memory tagging support (announced by Arm [3]) instead
of compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
[2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
====== Rationale
On mobile devices generic KASAN's memory usage is significant problem.
One of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life
miserable.
[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
====== Technical details
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks. Either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has it's own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses, that fall into the
same shadow cell, as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that there's a particular type of bugs that tag-based KASAN can
detect compared to generic KASAN: use-after-free after the object has been
allocated by someone else.
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass and a kernel module that would print a bug report whenever
two pointers with different tags are being compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type) has been used. Then the kernel has been
booted in QEMU and on an Odroid C2 board and syzkaller has been run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure debug checks to
those two functions that check that the result doesn't change whether we
operate on pointers with or without untagging has been added.
A few other cases that don't look that interesting:
Comparing pointers to achieve unique sorting order of pointee objects
(e.g. sorting locks addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing way (use until fails) is
the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (the whole reserved once during boot).
3. Quaratine (grows gradually until some preset limit; the more the limit,
the more the chance to detect a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:29:37 +03:00
static inline void * kmalloc_large_node_hook ( void * ptr , size_t size , gfp_t flags )
2013-10-09 02:58:57 +04:00
{
2019-02-21 09:19:11 +03:00
ptr = kasan_kmalloc_large ( ptr , size , flags ) ;
2019-02-21 09:19:16 +03:00
/* As ptr might get tagged, call kmemleak hook after KASAN. */
2013-10-09 02:58:57 +04:00
kmemleak_alloc ( ptr , size , 1 , flags ) ;
2019-02-21 09:19:11 +03:00
return ptr ;
2013-10-09 02:58:57 +04:00
}
2018-02-07 02:36:27 +03:00
static __always_inline void kfree_hook ( void * x )
2013-10-09 02:58:57 +04:00
{
kmemleak_free ( x ) ;
2018-02-07 02:36:27 +03:00
kasan_kfree_large ( x , _RET_IP_ ) ;
2013-10-09 02:58:57 +04:00
}
2018-04-11 02:30:31 +03:00
static __always_inline bool slab_free_hook ( struct kmem_cache * s , void * x )
2013-10-09 02:58:57 +04:00
{
kmemleak_free_recursive ( x , s - > flags ) ;
2010-08-25 23:07:16 +04:00
2014-08-07 03:04:18 +04:00
/*
* Trouble is that we may no longer disable interrupts in the fast path
* So in order to make the debug calls that expect irqs to be
* disabled we need to disable interrupts temporarily .
*/
2017-11-16 04:36:02 +03:00
# ifdef CONFIG_LOCKDEP
2014-08-07 03:04:18 +04:00
{
unsigned long flags ;
local_irq_save ( flags ) ;
debug_check_no_locks_freed ( x , s - > object_size ) ;
local_irq_restore ( flags ) ;
}
# endif
if ( ! ( s - > flags & SLAB_DEBUG_OBJECTS ) )
debug_check_no_obj_freed ( x , s - > object_size ) ;
2015-02-14 01:39:42 +03:00
2020-08-07 09:19:12 +03:00
/* Use KCSAN to help debug racy use-after-free. */
if ( ! ( s - > flags & SLAB_TYPESAFE_BY_RCU ) )
__kcsan_check_access ( x , s - > object_size ,
KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT ) ;
2018-04-11 02:30:31 +03:00
/* KASAN might put x into memory quarantine, delaying its reuse */
return kasan_slab_free ( s , x , _RET_IP_ ) ;
2014-08-07 03:04:18 +04:00
}
2008-04-14 20:11:40 +04:00
2018-04-11 02:30:31 +03:00
static inline bool slab_free_freelist_hook ( struct kmem_cache * s ,
void * * head , void * * tail )
2015-11-21 02:57:46 +03:00
{
mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
Patch series "add init_on_alloc/init_on_free boot options", v10.
Provide init_on_alloc and init_on_free boot options.
These are aimed at preventing possible information leaks and making the
control-flow bugs that depend on uninitialized values more deterministic.
Enabling either of the options guarantees that the memory returned by the
page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
isn't supported at the moment, as its emulation of kmem caches complicates
handling of SLAB_TYPESAFE_BY_RCU caches correctly.
Enabling init_on_free also guarantees that pages and heap objects are
initialized right after they're freed, so it won't be possible to access
stale data by using a dangling pointer.
As suggested by Michal Hocko, right now we don't let the heap users to
disable initialization for certain allocations. There's not enough
evidence that doing so can speed up real-life cases, and introducing ways
to opt-out may result in things going out of control.
This patch (of 2):
The new options are needed to prevent possible information leaks and make
control-flow bugs that depend on uninitialized values more deterministic.
This is expected to be on-by-default on Android and Chrome OS. And it
gives the opportunity for anyone else to use it under distros too via the
boot args. (The init_on_free feature is regularly requested by folks
where memory forensics is included in their threat models.)
init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes. Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.
init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion. This helps to ensure sensitive data
doesn't leak via use-after-free accesses.
Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory. The two exceptions are slab caches with
constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
zero-initialized to preserve their semantics.
Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.
If either SLUB poisoning or page poisoning is enabled, those options take
precedence over init_on_alloc and init_on_free: initialization is only
applied to unpoisoned allocations.
Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
is within the standard error.
The new features are also going to pave the way for hardware memory
tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
hooks to set the tags for heap objects. With MTE, tagging will have the
same cost as memory initialization.
Although init_on_free is rather costly, there are paranoid use-cases where
in-memory data lifetime is desired to be minimized. There are various
arguments for/against the realism of the associated threat models, but
given that we'll need the infrastructure for MTE anyway, and there are
people who want wipe-on-free behavior no matter what the performance cost,
it seems reasonable to include it in this series.
[glider@google.com: v8]
Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
[glider@google.com: v10]
Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts
Acked-by: James Morris <jamorris@linux.microsoft.com>]
Cc: Christoph Lameter <cl@linux.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 06:59:19 +03:00
void * object ;
void * next = * head ;
void * old_tail = * tail ? * tail : * head ;
int rsize ;
2019-11-16 04:34:50 +03:00
/* Head and tail of the reconstructed freelist */
* head = NULL ;
* tail = NULL ;
2019-07-31 22:32:40 +03:00
2019-11-16 04:34:50 +03:00
do {
object = next ;
next = get_freepointer ( s , object ) ;
if ( slab_want_init_on_free ( s ) ) {
mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
Patch series "add init_on_alloc/init_on_free boot options", v10.
Provide init_on_alloc and init_on_free boot options.
These are aimed at preventing possible information leaks and making the
control-flow bugs that depend on uninitialized values more deterministic.
Enabling either of the options guarantees that the memory returned by the
page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
isn't supported at the moment, as its emulation of kmem caches complicates
handling of SLAB_TYPESAFE_BY_RCU caches correctly.
Enabling init_on_free also guarantees that pages and heap objects are
initialized right after they're freed, so it won't be possible to access
stale data by using a dangling pointer.
As suggested by Michal Hocko, right now we don't let the heap users to
disable initialization for certain allocations. There's not enough
evidence that doing so can speed up real-life cases, and introducing ways
to opt-out may result in things going out of control.
This patch (of 2):
The new options are needed to prevent possible information leaks and make
control-flow bugs that depend on uninitialized values more deterministic.
This is expected to be on-by-default on Android and Chrome OS. And it
gives the opportunity for anyone else to use it under distros too via the
boot args. (The init_on_free feature is regularly requested by folks
where memory forensics is included in their threat models.)
init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes. Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.
init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion. This helps to ensure sensitive data
doesn't leak via use-after-free accesses.
Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory. The two exceptions are slab caches with
constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
zero-initialized to preserve their semantics.
Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.
If either SLUB poisoning or page poisoning is enabled, those options take
precedence over init_on_alloc and init_on_free: initialization is only
applied to unpoisoned allocations.
Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
is within the standard error.
The new features are also going to pave the way for hardware memory
tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
hooks to set the tags for heap objects. With MTE, tagging will have the
same cost as memory initialization.
Although init_on_free is rather costly, there are paranoid use-cases where
in-memory data lifetime is desired to be minimized. There are various
arguments for/against the realism of the associated threat models, but
given that we'll need the infrastructure for MTE anyway, and there are
people who want wipe-on-free behavior no matter what the performance cost,
it seems reasonable to include it in this series.
[glider@google.com: v8]
Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
[glider@google.com: v10]
Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts
Acked-by: James Morris <jamorris@linux.microsoft.com>]
Cc: Christoph Lameter <cl@linux.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 06:59:19 +03:00
/*
* Clear the object and the metadata , but don ' t touch
* the redzone .
*/
memset ( object , 0 , s - > object_size ) ;
rsize = ( s - > flags & SLAB_RED_ZONE ) ? s - > red_left_pad
: 0 ;
memset ( ( char * ) object + s - > inuse , 0 ,
s - > size - s - > inuse - rsize ) ;
2015-11-21 02:57:46 +03:00
2019-11-16 04:34:50 +03:00
}
2018-04-11 02:30:31 +03:00
/* If object's reuse doesn't have to be delayed */
if ( ! slab_free_hook ( s , object ) ) {
/* Move object to the new freelist */
set_freepointer ( s , object , * head ) ;
* head = object ;
if ( ! * tail )
* tail = object ;
}
} while ( object ! = old_tail ) ;
if ( * head = = * tail )
* tail = NULL ;
return * head ! = NULL ;
2015-11-21 02:57:46 +03:00
}
2018-12-28 11:30:23 +03:00
static void * setup_object ( struct kmem_cache * s , struct page * page ,
2015-09-05 01:45:48 +03:00
void * object )
{
setup_object_debug ( s , page , object ) ;
2018-12-28 11:30:23 +03:00
object = kasan_init_slab_obj ( s , object ) ;
2015-09-05 01:45:48 +03:00
if ( unlikely ( s - > ctor ) ) {
kasan_unpoison_object_data ( s , object ) ;
s - > ctor ( object ) ;
kasan_poison_object_data ( s , object ) ;
}
2018-12-28 11:30:23 +03:00
return object ;
2015-09-05 01:45:48 +03:00
}
2007-05-07 01:49:36 +04:00
/*
* Slab allocation and freeing
*/
2014-06-05 03:06:38 +04:00
static inline struct page * alloc_slab_page ( struct kmem_cache * s ,
gfp_t flags , int node , struct kmem_cache_order_objects oo )
2008-04-14 20:11:40 +04:00
{
2014-06-05 03:06:38 +04:00
struct page * page ;
2018-04-06 02:21:39 +03:00
unsigned int order = oo_order ( oo ) ;
2008-04-14 20:11:40 +04:00
2010-07-09 23:07:10 +04:00
if ( node = = NUMA_NO_NODE )
2014-06-05 03:06:38 +04:00
page = alloc_pages ( flags , order ) ;
2008-04-14 20:11:40 +04:00
else
mm: rename alloc_pages_exact_node() to __alloc_pages_node()
alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.
The misleading name has lead to mistakes in the past, see for example
commits 5265047ac301 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f8e ("mm, mempolicy:
migrate_to_node should only migrate to node").
Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.
To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.
It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.
Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.
Both differences will be rectified by the next patch.
To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09 01:03:50 +03:00
page = __alloc_pages_node ( node , flags , order ) ;
2014-06-05 03:06:38 +04:00
2020-08-07 09:21:27 +03:00
if ( page )
2020-08-07 09:21:44 +03:00
account_slab_page ( page , order , s ) ;
2014-06-05 03:06:38 +04:00
return page ;
2008-04-14 20:11:40 +04:00
}
2016-07-27 01:21:59 +03:00
# ifdef CONFIG_SLAB_FREELIST_RANDOM
/* Pre-initialize the random sequence cache */
static int init_cache_random_seq ( struct kmem_cache * s )
{
2018-04-06 02:21:39 +03:00
unsigned int count = oo_objects ( s - > oo ) ;
2016-07-27 01:21:59 +03:00
int err ;
2017-02-09 01:30:59 +03:00
/* Bailout if already initialised */
if ( s - > random_seq )
return 0 ;
2016-07-27 01:21:59 +03:00
err = cache_random_seq_create ( s , count , GFP_KERNEL ) ;
if ( err ) {
pr_err ( " SLUB: Unable to initialize free list for %s \n " ,
s - > name ) ;
return err ;
}
/* Transform to an offset on the set of pages */
if ( s - > random_seq ) {
2018-04-06 02:21:39 +03:00
unsigned int i ;
2016-07-27 01:21:59 +03:00
for ( i = 0 ; i < count ; i + + )
s - > random_seq [ i ] * = s - > size ;
}
return 0 ;
}
/* Initialize each random sequence freelist per cache */
static void __init init_freelist_randomization ( void )
{
struct kmem_cache * s ;
mutex_lock ( & slab_mutex ) ;
list_for_each_entry ( s , & slab_caches , list )
init_cache_random_seq ( s ) ;
mutex_unlock ( & slab_mutex ) ;
}
/* Get the next entry on the pre-computed freelist randomized */
static void * next_freelist_entry ( struct kmem_cache * s , struct page * page ,
unsigned long * pos , void * start ,
unsigned long page_limit ,
unsigned long freelist_count )
{
unsigned int idx ;
/*
* If the target page allocation failed , the number of objects on the
* page might be smaller than the usual size defined by the cache .
*/
do {
idx = s - > random_seq [ * pos ] ;
* pos + = 1 ;
if ( * pos > = freelist_count )
* pos = 0 ;
} while ( unlikely ( idx > = page_limit ) ) ;
return ( char * ) start + idx ;
}
/* Shuffle the single linked freelist based on a random pre-computed sequence */
static bool shuffle_freelist ( struct kmem_cache * s , struct page * page )
{
void * start ;
void * cur ;
void * next ;
unsigned long idx , pos , page_limit , freelist_count ;
if ( page - > objects < 2 | | ! s - > random_seq )
return false ;
freelist_count = oo_objects ( s - > oo ) ;
pos = get_random_int ( ) % freelist_count ;
page_limit = page - > objects * s - > size ;
start = fixup_red_left ( s , page_address ( page ) ) ;
/* First entry is used as the base of the freelist */
cur = next_freelist_entry ( s , page , & pos , start , page_limit ,
freelist_count ) ;
2018-12-28 11:30:23 +03:00
cur = setup_object ( s , page , cur ) ;
2016-07-27 01:21:59 +03:00
page - > freelist = cur ;
for ( idx = 1 ; idx < page - > objects ; idx + + ) {
next = next_freelist_entry ( s , page , & pos , start , page_limit ,
freelist_count ) ;
2018-12-28 11:30:23 +03:00
next = setup_object ( s , page , next ) ;
2016-07-27 01:21:59 +03:00
set_freepointer ( s , cur , next ) ;
cur = next ;
}
set_freepointer ( s , cur , NULL ) ;
return true ;
}
# else
static inline int init_cache_random_seq ( struct kmem_cache * s )
{
return 0 ;
}
static inline void init_freelist_randomization ( void ) { }
static inline bool shuffle_freelist ( struct kmem_cache * s , struct page * page )
{
return false ;
}
# endif /* CONFIG_SLAB_FREELIST_RANDOM */
2007-05-07 01:49:36 +04:00
static struct page * allocate_slab ( struct kmem_cache * s , gfp_t flags , int node )
{
2008-01-08 10:20:27 +03:00
struct page * page ;
2008-04-14 20:11:31 +04:00
struct kmem_cache_order_objects oo = s - > oo ;
2009-06-24 22:59:51 +04:00
gfp_t alloc_gfp ;
2018-12-28 11:30:23 +03:00
void * start , * p , * next ;
2019-09-24 01:34:25 +03:00
int idx ;
2016-07-27 01:21:59 +03:00
bool shuffle ;
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:44 +04:00
flags & = gfp_allowed_mask ;
2015-11-07 03:28:21 +03:00
if ( gfpflags_allow_blocking ( flags ) )
2011-06-01 21:25:44 +04:00
local_irq_enable ( ) ;
2008-02-15 01:21:32 +03:00
flags | = s - > allocflags ;
2007-10-16 12:25:52 +04:00
2009-06-24 22:59:51 +04:00
/*
* Let the initial higher - order allocation fail under memory pressure
* so we fall - back to the minimum order allocation .
*/
alloc_gfp = ( flags | __GFP_NOWARN | __GFP_NORETRY ) & ~ __GFP_NOFAIL ;
2015-11-07 03:28:21 +03:00
if ( ( alloc_gfp & __GFP_DIRECT_RECLAIM ) & & oo_order ( oo ) > oo_order ( s - > min ) )
mm: thp: set THP defrag by default to madvise and add a stall-free defrag option
THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure. The problem is that
THP allocation requests potentially enter reclaim/compaction. This
potentially incurs a severe stall that is not guaranteed to be offset by
reduced TLB misses. While there has been considerable effort to reduce
the impact of reclaim/compaction, it is still a high cost and workloads
that should fit in memory fail to do so. Specifically, a simple
anon/file streaming workload will enter direct reclaim on NUMA at least
even though the working set size is 80% of RAM. It's been years and
it's time to throw in the towel.
First, this patch defines THP defrag as follows;
madvise: A failed allocation will direct reclaim/compact if the application requests it
never: Neither reclaim/compact nor wake kswapd
defer: A failed allocation will wake kswapd/kcompactd
always: A failed allocation will direct reclaim/compact (historical behaviour)
khugepaged defrag will enter direct/reclaim but not wake kswapd.
Next it sets the default defrag option to be "madvise" to only enter
direct reclaim/compaction for applications that specifically requested
it.
Lastly, it removes a check from the page allocator slowpath that is
related to __GFP_THISNODE to allow "defer" to work. The callers that
really cares are slub/slab and they are updated accordingly. The slab
one may be surprising because it also corrects a comment as kswapd was
never woken up by that path.
This means that a THP fault will no longer stall for most applications
by default and the ideal for most users that get THP if they are
immediately available. There are still options for users that prefer a
stall at startup of a new application by either restoring historical
behaviour with "always" or pick a half-way point with "defer" where
kswapd does some of the work in the background and wakes kcompactd if
necessary. THP defrag for khugepaged remains enabled and will enter
direct/reclaim but no wakeup kswapd or kcompactd.
After this patch a THP allocation failure will quickly fallback and rely
on khugepaged to recover the situation at some time in the future. In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win where as a stall to reclaim/compaction
is definitely measurable and can be painful.
The first test for this is using "usemem" to read a large file and write
a large anonymous mapping (to avoid the zero page) multiple times. The
total size of the mappings is 80% of RAM and the benchmark simply
measures how long it takes to complete. It uses multiple threads to see
if that is a factor. On UMA, the performance is almost identical so is
not reported but on NUMA, we see this
usemem
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)
For a single thread, the benchmark completes 43.23% faster with this
patch applied with smaller benefits as the thread increases. Similar,
notice the large reduction in most cases in system CPU usage. The
overall CPU time is
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
User 10357.65 10438.33
System 3988.88 3543.94
Elapsed 2203.01 1634.41
Which is substantial. Now, the reclaim figures
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 128458477 278352931
Major Faults 2174976 225
Swap Ins 16904701 0
Swap Outs 17359627 0
Allocation stalls 43611 0
DMA allocs 0 0
DMA32 allocs 19832646 19448017
Normal allocs 614488453 580941839
Movable allocs 0 0
Direct pages scanned 24163800 0
Kswapd pages scanned 0 0
Kswapd pages reclaimed 0 0
Direct pages reclaimed 20691346 0
Compaction stalls 42263 0
Compaction success 938 0
Compaction failures 41325 0
This patch eliminates almost all swapping and direct reclaim activity.
There is still overhead but it's from NUMA balancing which does not
identify that it's pointless trying to do anything with this workload.
I also tried the thpscale benchmark which forces a corner case where
compaction can be used heavily and measures the latency of whether base
or huge pages were used
thpscale Fault Latencies
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)
The average time to fault pages is substantially reduced in the majority
of caseds but with the obvious caveat that fewer THPs are actually used
in this adverse workload
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 37429143 47564000
Major Faults 1916 1558
Swap Ins 1466 1079
Swap Outs 2936863 149626
Allocation stalls 62510 3
DMA allocs 0 0
DMA32 allocs 6566458 6401314
Normal allocs 216361697 216538171
Movable allocs 0 0
Direct pages scanned 25977580 17998
Kswapd pages scanned 0 3638931
Kswapd pages reclaimed 0 207236
Direct pages reclaimed 8833714 88
Compaction stalls 103349 5
Compaction success 270 4
Compaction failures 103079 1
Note again that while this does swap as it's an aggressive workload, the
direct relcim activity and allocation stalls is substantially reduced.
There is some kswapd activity but ftrace showed that the kswapd activity
was due to normal wakeups from 4K pages being allocated.
Compaction-related stalls and activity are almost eliminated.
I also tried the stutter benchmark. For this, I do not have figures for
NUMA but it's something that does impact UMA so I'll report what is
available
stutter
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)
This benchmark is trying to fault an anonymous mapping while there is a
heavy IO load -- a scenario that desktop users used to complain about
frequently. This shows a mix because the ideal case of mapping with THP
is not hit as often. However, note that 99% of the mappings complete
13.79% faster. The CPU usage here is particularly interesting
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
User 67.50 0.99
System 1327.88 91.30
Elapsed 2079.00 2128.98
And once again we look at the reclaim figures
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 335241922 1314582827
Major Faults 715 819
Swap Ins 0 0
Swap Outs 0 0
Allocation stalls 532723 0
DMA allocs 0 0
DMA32 allocs 1822364341 1177950222
Normal allocs 1815640808 1517844854
Movable allocs 0 0
Direct pages scanned 21892772 0
Kswapd pages scanned 20015890 41879484
Kswapd pages reclaimed 19961986 41822072
Direct pages reclaimed 21892741 0
Compaction stalls 1065755 0
Compaction success 514 0
Compaction failures 1065241 0
Allocation stalls and all direct reclaim activity is eliminated as well
as compaction-related stalls.
THP gives impressive gains in some cases but only if they are quickly
available. We're not going to reach the point where they are completely
free so lets take the costs out of the fast paths finally and defer the
cost to kswapd, kcompactd and khugepaged where it belongs.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 00:19:23 +03:00
alloc_gfp = ( alloc_gfp | __GFP_NOMEMALLOC ) & ~ ( __GFP_RECLAIM | __GFP_NOFAIL ) ;
2009-06-24 22:59:51 +04:00
2014-06-05 03:06:38 +04:00
page = alloc_slab_page ( s , alloc_gfp , node , oo ) ;
2008-04-14 20:11:40 +04:00
if ( unlikely ( ! page ) ) {
oo = s - > min ;
2014-03-12 12:26:20 +04:00
alloc_gfp = flags ;
2008-04-14 20:11:40 +04:00
/*
* Allocation may have failed due to fragmentation .
* Try a lower order alloc if possible
*/
2014-06-05 03:06:38 +04:00
page = alloc_slab_page ( s , alloc_gfp , node , oo ) ;
2015-09-05 01:45:48 +03:00
if ( unlikely ( ! page ) )
goto out ;
stat ( s , ORDER_FALLBACK ) ;
2008-04-14 20:11:40 +04:00
}
2008-04-04 02:54:48 +04:00
2008-04-14 20:11:31 +04:00
page - > objects = oo_objects ( oo ) ;
2007-05-07 01:49:36 +04:00
slub: Commonize slab_cache field in struct page
Right now, slab and slub have fields in struct page to derive which
cache a page belongs to, but they do it slightly differently.
slab uses a field called slab_cache, that lives in the third double
word. slub, uses a field called "slab", living outside of the
doublewords area.
Ideally, we could use the same field for this. Since slub heavily makes
use of the doubleword region, there isn't really much room to move
slub's slab_cache field around. Since slab does not have such strict
placement restrictions, we can move it outside the doubleword area.
The naming used by slab, "slab_cache", is less confusing, and it is
preferred over slub's generic "slab".
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-22 18:05:36 +04:00
page - > slab_cache = s ;
2012-05-17 19:47:47 +04:00
__SetPageSlab ( page ) ;
2015-08-22 00:11:51 +03:00
if ( page_is_pfmemalloc ( page ) )
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allows network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propogate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel run to completion. However, the patched
kernel completed the test 45% faster.
MICRO
3.5.0-rc2 3.5.0-rc2
vanilla swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds) 197.80 173.07
User+Sys Time Running Test (seconds) 206.96 182.03
Total Elapsed Time (seconds) 3240.70 1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:43:58 +04:00
SetPageSlabPfmemalloc ( page ) ;
2007-05-07 01:49:36 +04:00
2019-02-21 09:19:23 +03:00
kasan_poison_slab ( page ) ;
2007-05-07 01:49:36 +04:00
2019-02-21 09:19:23 +03:00
start = page_address ( page ) ;
2007-05-07 01:49:36 +04:00
2019-09-24 01:34:25 +03:00
setup_page_debug ( s , page , start ) ;
2015-02-14 01:39:42 +03:00
2016-07-27 01:21:59 +03:00
shuffle = shuffle_freelist ( s , page ) ;
if ( ! shuffle ) {
2018-12-28 11:30:23 +03:00
start = fixup_red_left ( s , start ) ;
start = setup_object ( s , page , start ) ;
page - > freelist = start ;
2019-02-21 09:19:28 +03:00
for ( idx = 0 , p = start ; idx < page - > objects - 1 ; idx + + ) {
next = p + s - > size ;
next = setup_object ( s , page , next ) ;
set_freepointer ( s , p , next ) ;
p = next ;
}
set_freepointer ( s , p , NULL ) ;
2007-05-07 01:49:36 +04:00
}
2011-08-10 01:12:24 +04:00
page - > inuse = page - > objects ;
2011-06-01 21:25:46 +04:00
page - > frozen = 1 ;
2015-09-05 01:45:48 +03:00
2007-05-07 01:49:36 +04:00
out :
2015-11-07 03:28:21 +03:00
if ( gfpflags_allow_blocking ( flags ) )
2015-09-05 01:45:48 +03:00
local_irq_disable ( ) ;
if ( ! page )
return NULL ;
inc_slabs_node ( s , page_to_nid ( page ) , page - > objects ) ;
2007-05-07 01:49:36 +04:00
return page ;
}
2015-09-05 01:45:48 +03:00
static struct page * new_slab ( struct kmem_cache * s , gfp_t flags , int node )
{
2020-08-07 09:18:28 +03:00
if ( unlikely ( flags & GFP_SLAB_BUG_MASK ) )
flags = kmalloc_fix_flags ( flags ) ;
2015-09-05 01:45:48 +03:00
return allocate_slab ( s ,
flags & ( GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK ) , node ) ;
}
2007-05-07 01:49:36 +04:00
static void __free_slab ( struct kmem_cache * s , struct page * page )
{
2008-04-14 20:11:31 +04:00
int order = compound_order ( page ) ;
int pages = 1 < < order ;
2007-05-07 01:49:36 +04:00
2020-08-07 09:18:58 +03:00
if ( kmem_cache_debug_flags ( s , SLAB_CONSISTENCY_CHECKS ) ) {
2007-05-07 01:49:36 +04:00
void * p ;
slab_pad_check ( s , page ) ;
2008-04-14 20:11:31 +04:00
for_each_object ( p , s , page_address ( page ) ,
page - > objects )
2010-09-29 16:15:01 +04:00
check_object ( s , page , p , SLUB_RED_INACTIVE ) ;
2007-05-07 01:49:36 +04:00
}
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allows network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propogate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel run to completion. However, the patched
kernel completed the test 45% faster.
MICRO
3.5.0-rc2 3.5.0-rc2
vanilla swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds) 197.80 173.07
User+Sys Time Running Test (seconds) 206.96 182.03
Total Elapsed Time (seconds) 3240.70 1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:43:58 +04:00
__ClearPageSlabPfmemalloc ( page ) ;
2008-04-14 19:52:18 +04:00
__ClearPageSlab ( page ) ;
2012-12-19 02:22:50 +04:00
2018-06-08 03:08:26 +03:00
page - > mapping = NULL ;
2009-05-05 13:13:44 +04:00
if ( current - > reclaim_state )
current - > reclaim_state - > reclaimed_slab + = pages ;
2020-08-07 09:21:44 +03:00
unaccount_slab_page ( page , order , s ) ;
2016-03-18 00:17:35 +03:00
__free_pages ( page , order ) ;
2007-05-07 01:49:36 +04:00
}
static void rcu_free_slab ( struct rcu_head * h )
{
2018-06-08 03:09:05 +03:00
struct page * page = container_of ( h , struct page , rcu_head ) ;
2011-03-10 10:22:00 +03:00
slub: Commonize slab_cache field in struct page
Right now, slab and slub have fields in struct page to derive which
cache a page belongs to, but they do it slightly differently.
slab uses a field called slab_cache, that lives in the third double
word. slub, uses a field called "slab", living outside of the
doublewords area.
Ideally, we could use the same field for this. Since slub heavily makes
use of the doubleword region, there isn't really much room to move
slub's slab_cache field around. Since slab does not have such strict
placement restrictions, we can move it outside the doubleword area.
The naming used by slab, "slab_cache", is less confusing, and it is
preferred over slub's generic "slab".
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-22 18:05:36 +04:00
__free_slab ( page - > slab_cache , page ) ;
2007-05-07 01:49:36 +04:00
}
static void free_slab ( struct kmem_cache * s , struct page * page )
{
2017-01-18 13:53:44 +03:00
if ( unlikely ( s - > flags & SLAB_TYPESAFE_BY_RCU ) ) {
2018-06-08 03:09:05 +03:00
call_rcu ( & page - > rcu_head , rcu_free_slab ) ;
2007-05-07 01:49:36 +04:00
} else
__free_slab ( s , page ) ;
}
static void discard_slab ( struct kmem_cache * s , struct page * page )
{
2008-04-14 20:11:40 +04:00
dec_slabs_node ( s , page_to_nid ( page ) , page - > objects ) ;
2007-05-07 01:49:36 +04:00
free_slab ( s , page ) ;
}
/*
2011-06-01 21:25:50 +04:00
* Management of partially allocated slabs .
2007-05-07 01:49:36 +04:00
*/
2014-02-11 02:25:46 +04:00
static inline void
__add_partial ( struct kmem_cache_node * n , struct page * page , int tail )
2007-05-07 01:49:36 +04:00
{
2007-05-07 01:49:44 +04:00
n - > nr_partial + + ;
2011-08-24 04:57:52 +04:00
if ( tail = = DEACTIVATE_TO_TAIL )
2019-05-14 03:16:12 +03:00
list_add_tail ( & page - > slab_list , & n - > partial ) ;
2008-01-08 10:20:27 +03:00
else
2019-05-14 03:16:12 +03:00
list_add ( & page - > slab_list , & n - > partial ) ;
2007-05-07 01:49:36 +04:00
}
2014-02-11 02:25:46 +04:00
static inline void add_partial ( struct kmem_cache_node * n ,
struct page * page , int tail )
2010-09-28 17:10:28 +04:00
{
2014-01-10 16:23:49 +04:00
lockdep_assert_held ( & n - > list_lock ) ;
2014-02-11 02:25:46 +04:00
__add_partial ( n , page , tail ) ;
}
2014-01-10 16:23:49 +04:00
2014-02-11 02:25:46 +04:00
static inline void remove_partial ( struct kmem_cache_node * n ,
struct page * page )
{
lockdep_assert_held ( & n - > list_lock ) ;
2019-05-14 03:16:12 +03:00
list_del ( & page - > slab_list ) ;
2016-02-18 00:11:37 +03:00
n - > nr_partial - - ;
2014-02-11 02:25:46 +04:00
}
2007-05-07 01:49:36 +04:00
/*
2012-05-09 19:09:53 +04:00
* Remove slab from the partial list , freeze it and
* return the pointer to the freelist .
2007-05-07 01:49:36 +04:00
*
2011-08-10 01:12:26 +04:00
* Returns a list of objects or NULL if it fails .
2007-05-07 01:49:36 +04:00
*/
2011-08-10 01:12:26 +04:00
static inline void * acquire_slab ( struct kmem_cache * s ,
2011-08-10 01:12:25 +04:00
struct kmem_cache_node * n , struct page * page ,
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating a number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse",
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! availabe is always 0 !!!
...
Therfore, "available > s->cpu_partial / 2" is always false and
we always go to second iteration.
This patch correct this problem.
After that, we don't need return value of put_cpu_partial().
So remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
int mode , int * objects )
2007-05-07 01:49:36 +04:00
{
2011-06-01 21:25:52 +04:00
void * freelist ;
unsigned long counters ;
struct page new ;
2014-01-10 16:23:49 +04:00
lockdep_assert_held ( & n - > list_lock ) ;
2011-06-01 21:25:52 +04:00
/*
* Zap the freelist and set the frozen bit .
* The old freelist is the list of objects for the
* per cpu allocation list .
*/
2012-05-09 19:09:53 +04:00
freelist = page - > freelist ;
counters = page - > counters ;
new . counters = counters ;
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating a number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse",
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! availabe is always 0 !!!
...
Therfore, "available > s->cpu_partial / 2" is always false and
we always go to second iteration.
This patch correct this problem.
After that, we don't need return value of put_cpu_partial().
So remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
* objects = new . objects - new . inuse ;
2012-06-04 11:14:58 +04:00
if ( mode ) {
2012-05-09 19:09:53 +04:00
new . inuse = page - > objects ;
2012-06-04 11:14:58 +04:00
new . freelist = NULL ;
} else {
new . freelist = freelist ;
}
2011-06-01 21:25:52 +04:00
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( new . frozen ) ;
2012-05-09 19:09:53 +04:00
new . frozen = 1 ;
2011-06-01 21:25:52 +04:00
2012-05-09 19:09:53 +04:00
if ( ! __cmpxchg_double_slab ( s , page ,
2011-06-01 21:25:52 +04:00
freelist , counters ,
2012-05-16 19:13:02 +04:00
new . freelist , new . counters ,
2012-05-09 19:09:53 +04:00
" acquire_slab " ) )
return NULL ;
2011-06-01 21:25:52 +04:00
remove_partial ( n , page ) ;
2012-05-09 19:09:53 +04:00
WARN_ON ( ! freelist ) ;
2011-08-10 01:12:27 +04:00
return freelist ;
2007-05-07 01:49:36 +04:00
}
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating a number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse",
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! availabe is always 0 !!!
...
Therfore, "available > s->cpu_partial / 2" is always false and
we always go to second iteration.
This patch correct this problem.
After that, we don't need return value of put_cpu_partial().
So remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
static void put_cpu_partial ( struct kmem_cache * s , struct page * page , int drain ) ;
2012-09-18 01:09:09 +04:00
static inline bool pfmemalloc_match ( struct page * page , gfp_t gfpflags ) ;
2011-08-10 01:12:27 +04:00
2007-05-07 01:49:36 +04:00
/*
2007-05-09 13:32:39 +04:00
* Try to allocate a partial slab from a specific node .
2007-05-07 01:49:36 +04:00
*/
2012-09-18 01:09:09 +04:00
static void * get_partial_node ( struct kmem_cache * s , struct kmem_cache_node * n ,
struct kmem_cache_cpu * c , gfp_t flags )
2007-05-07 01:49:36 +04:00
{
2011-08-10 01:12:27 +04:00
struct page * page , * page2 ;
void * object = NULL ;
2018-04-06 02:21:10 +03:00
unsigned int available = 0 ;
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating a number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse",
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! availabe is always 0 !!!
...
Therfore, "available > s->cpu_partial / 2" is always false and
we always go to second iteration.
This patch correct this problem.
After that, we don't need return value of put_cpu_partial().
So remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
int objects ;
2007-05-07 01:49:36 +04:00
/*
* Racy check . If we mistakenly see no partial slabs then we
* just allocate an empty slab . If we mistakenly try to get a
2007-05-09 13:32:39 +04:00
* partial slab and there is none available then get_partials ( )
* will return NULL .
2007-05-07 01:49:36 +04:00
*/
if ( ! n | | ! n - > nr_partial )
return NULL ;
spin_lock ( & n - > list_lock ) ;
2019-05-14 03:16:12 +03:00
list_for_each_entry_safe ( page , page2 , & n - > partial , slab_list ) {
2012-09-18 01:09:09 +04:00
void * t ;
2011-08-10 01:12:27 +04:00
2012-09-18 01:09:09 +04:00
if ( ! pfmemalloc_match ( page , flags ) )
continue ;
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating a number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse",
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! availabe is always 0 !!!
...
Therfore, "available > s->cpu_partial / 2" is always false and
we always go to second iteration.
This patch correct this problem.
After that, we don't need return value of put_cpu_partial().
So remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
t = acquire_slab ( s , n , page , object = = NULL , & objects ) ;
2011-08-10 01:12:27 +04:00
if ( ! t )
break ;
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating a number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse",
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! availabe is always 0 !!!
...
Therfore, "available > s->cpu_partial / 2" is always false and
we always go to second iteration.
This patch correct this problem.
After that, we don't need return value of put_cpu_partial().
So remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
available + = objects ;
2011-09-07 06:26:36 +04:00
if ( ! object ) {
2011-08-10 01:12:27 +04:00
c - > page = page ;
stat ( s , ALLOC_FROM_PARTIAL ) ;
object = t ;
} else {
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating a number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse",
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! availabe is always 0 !!!
...
Therfore, "available > s->cpu_partial / 2" is always false and
we always go to second iteration.
This patch correct this problem.
After that, we don't need return value of put_cpu_partial().
So remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
put_cpu_partial ( s , page , 0 ) ;
2012-02-03 19:34:56 +04:00
stat ( s , CPU_PARTIAL_NODE ) ;
2011-08-10 01:12:27 +04:00
}
2013-06-19 09:05:52 +04:00
if ( ! kmem_cache_has_cpu_partial ( s )
2017-07-07 01:36:34 +03:00
| | available > slub_cpu_partial ( s ) / 2 )
2011-08-10 01:12:27 +04:00
break ;
2011-08-10 01:12:26 +04:00
}
2007-05-07 01:49:36 +04:00
spin_unlock ( & n - > list_lock ) ;
2011-08-10 01:12:26 +04:00
return object ;
2007-05-07 01:49:36 +04:00
}
/*
2007-05-09 13:32:39 +04:00
* Get a page from somewhere . Search in increasing NUMA distances .
2007-05-07 01:49:36 +04:00
*/
2012-01-27 12:12:23 +04:00
static void * get_any_partial ( struct kmem_cache * s , gfp_t flags ,
2011-08-10 01:12:25 +04:00
struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
# ifdef CONFIG_NUMA
struct zonelist * zonelist ;
2008-04-28 13:12:17 +04:00
struct zoneref * z ;
2008-04-28 13:12:16 +04:00
struct zone * zone ;
2020-06-04 01:59:01 +03:00
enum zone_type highest_zoneidx = gfp_zone ( flags ) ;
2011-08-10 01:12:26 +04:00
void * object ;
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 03:34:11 +04:00
unsigned int cpuset_mems_cookie ;
2007-05-07 01:49:36 +04:00
/*
2007-05-09 13:32:39 +04:00
* The defrag ratio allows a configuration of the tradeoffs between
* inter node defragmentation and node local allocations . A lower
* defrag_ratio increases the tendency to do local allocations
* instead of attempting to obtain partial slabs from other nodes .
2007-05-07 01:49:36 +04:00
*
2007-05-09 13:32:39 +04:00
* If the defrag_ratio is set to 0 then kmalloc ( ) always
* returns node local objects . If the ratio is higher then kmalloc ( )
* may return off node objects because partial slabs are obtained
* from other nodes and filled up .
2007-05-07 01:49:36 +04:00
*
2016-05-20 03:10:43 +03:00
* If / sys / kernel / slab / xx / remote_node_defrag_ratio is set to 100
* ( which makes defrag_ratio = 1000 ) then every ( well almost )
* allocation will first attempt to defrag slab caches on other nodes .
* This means scanning over all nodes to look for partial slabs which
* may be expensive if we do it every time we are trying to find a slab
2007-05-09 13:32:39 +04:00
* with available objects .
2007-05-07 01:49:36 +04:00
*/
2008-01-08 10:20:26 +03:00
if ( ! s - > remote_node_defrag_ratio | |
get_cycles ( ) % 1024 > s - > remote_node_defrag_ratio )
2007-05-07 01:49:36 +04:00
return NULL ;
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 03:34:11 +04:00
do {
2014-04-04 01:47:24 +04:00
cpuset_mems_cookie = read_mems_allowed_begin ( ) ;
2014-04-08 02:37:29 +04:00
zonelist = node_zonelist ( mempolicy_slab_node ( ) , flags ) ;
2020-06-04 01:59:01 +03:00
for_each_zone_zonelist ( zone , z , zonelist , highest_zoneidx ) {
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 03:34:11 +04:00
struct kmem_cache_node * n ;
n = get_node ( s , zone_to_nid ( zone ) ) ;
2014-12-13 03:58:28 +03:00
if ( n & & cpuset_zone_allowed ( zone , flags ) & &
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 03:34:11 +04:00
n - > nr_partial > s - > min_partial ) {
2012-09-18 01:09:09 +04:00
object = get_partial_node ( s , n , c , flags ) ;
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 03:34:11 +04:00
if ( object ) {
/*
2014-04-04 01:47:24 +04:00
* Don ' t check read_mems_allowed_retry ( )
* here - if mems_allowed was updated in
* parallel , that was a harmless race
* between allocation and the cpuset
* update
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 03:34:11 +04:00
*/
return object ;
}
2010-05-25 01:32:08 +04:00
}
2007-05-07 01:49:36 +04:00
}
2014-04-04 01:47:24 +04:00
} while ( read_mems_allowed_retry ( cpuset_mems_cookie ) ) ;
2019-05-14 03:16:09 +03:00
# endif /* CONFIG_NUMA */
2007-05-07 01:49:36 +04:00
return NULL ;
}
/*
* Get a partial page , lock it and return it .
*/
2011-08-10 01:12:26 +04:00
static void * get_partial ( struct kmem_cache * s , gfp_t flags , int node ,
2011-08-10 01:12:25 +04:00
struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
2011-08-10 01:12:26 +04:00
void * object ;
2014-10-10 02:26:15 +04:00
int searchnode = node ;
if ( node = = NUMA_NO_NODE )
searchnode = numa_mem_id ( ) ;
2007-05-07 01:49:36 +04:00
2012-09-18 01:09:09 +04:00
object = get_partial_node ( s , get_node ( s , searchnode ) , c , flags ) ;
2011-08-10 01:12:26 +04:00
if ( object | | node ! = NUMA_NO_NODE )
return object ;
2007-05-07 01:49:36 +04:00
2011-08-10 01:12:25 +04:00
return get_any_partial ( s , flags , c ) ;
2007-05-07 01:49:36 +04:00
}
2019-10-15 22:18:12 +03:00
# ifdef CONFIG_PREEMPTION
2011-02-25 20:38:54 +03:00
/*
2020-06-05 02:49:34 +03:00
* Calculate the next globally unique transaction for disambiguation
2011-02-25 20:38:54 +03:00
* during cmpxchg . The transactions start with the cpu number and are then
* incremented by CONFIG_NR_CPUS .
*/
# define TID_STEP roundup_pow_of_two(CONFIG_NR_CPUS)
# else
/*
* No preemption supported therefore also no need to check for
* different cpus .
*/
# define TID_STEP 1
# endif
static inline unsigned long next_tid ( unsigned long tid )
{
return tid + TID_STEP ;
}
2019-09-24 01:33:52 +03:00
# ifdef SLUB_DEBUG_CMPXCHG
2011-02-25 20:38:54 +03:00
static inline unsigned int tid_to_cpu ( unsigned long tid )
{
return tid % TID_STEP ;
}
static inline unsigned long tid_to_event ( unsigned long tid )
{
return tid / TID_STEP ;
}
2019-09-24 01:33:52 +03:00
# endif
2011-02-25 20:38:54 +03:00
static inline unsigned int init_tid ( int cpu )
{
return cpu ;
}
static inline void note_cmpxchg_failure ( const char * n ,
const struct kmem_cache * s , unsigned long tid )
{
# ifdef SLUB_DEBUG_CMPXCHG
unsigned long actual_tid = __this_cpu_read ( s - > cpu_slab - > tid ) ;
2014-06-05 03:06:34 +04:00
pr_info ( " %s %s: cmpxchg redo " , n , s - > name ) ;
2011-02-25 20:38:54 +03:00
2019-10-15 22:18:12 +03:00
# ifdef CONFIG_PREEMPTION
2011-02-25 20:38:54 +03:00
if ( tid_to_cpu ( tid ) ! = tid_to_cpu ( actual_tid ) )
2014-06-05 03:06:34 +04:00
pr_warn ( " due to cpu change %d -> %d \n " ,
2011-02-25 20:38:54 +03:00
tid_to_cpu ( tid ) , tid_to_cpu ( actual_tid ) ) ;
else
# endif
if ( tid_to_event ( tid ) ! = tid_to_event ( actual_tid ) )
2014-06-05 03:06:34 +04:00
pr_warn ( " due to cpu running other code. Event %ld->%ld \n " ,
2011-02-25 20:38:54 +03:00
tid_to_event ( tid ) , tid_to_event ( actual_tid ) ) ;
else
2014-06-05 03:06:34 +04:00
pr_warn ( " for unknown reason: actual=%lx was=%lx target=%lx \n " ,
2011-02-25 20:38:54 +03:00
actual_tid , tid , next_tid ( tid ) ) ;
# endif
2011-03-22 21:35:00 +03:00
stat ( s , CMPXCHG_DOUBLE_CPU_FAIL ) ;
2011-02-25 20:38:54 +03:00
}
2012-09-28 12:34:05 +04:00
static void init_kmem_cache_cpus ( struct kmem_cache * s )
2011-02-25 20:38:54 +03:00
{
int cpu ;
for_each_possible_cpu ( cpu )
per_cpu_ptr ( s - > cpu_slab , cpu ) - > tid = init_tid ( cpu ) ;
}
2011-06-01 21:25:52 +04:00
2007-05-07 01:49:36 +04:00
/*
* Remove the cpu slab
*/
2013-07-15 05:05:29 +04:00
static void deactivate_slab ( struct kmem_cache * s , struct page * page ,
2017-07-07 01:36:25 +03:00
void * freelist , struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
2011-06-01 21:25:52 +04:00
enum slab_modes { M_NONE , M_PARTIAL , M_FULL , M_FREE } ;
struct kmem_cache_node * n = get_node ( s , page_to_nid ( page ) ) ;
int lock = 0 ;
enum slab_modes l = M_NONE , m = M_NONE ;
void * nextfree ;
2011-08-24 04:57:52 +04:00
int tail = DEACTIVATE_TO_HEAD ;
2011-06-01 21:25:52 +04:00
struct page new ;
struct page old ;
if ( page - > freelist ) {
2009-12-19 01:26:23 +03:00
stat ( s , DEACTIVATE_REMOTE_FREES ) ;
2011-08-24 04:57:52 +04:00
tail = DEACTIVATE_TO_TAIL ;
2011-06-01 21:25:52 +04:00
}
2007-05-10 14:15:16 +04:00
/*
2011-06-01 21:25:52 +04:00
* Stage one : Free all available per cpu objects back
* to the page freelist while it is still frozen . Leave the
* last one .
*
* There is no need to take the list - > lock because the page
* is still frozen .
*/
while ( freelist & & ( nextfree = get_freepointer ( s , freelist ) ) ) {
void * prior ;
unsigned long counters ;
2020-06-02 07:45:47 +03:00
/*
* If ' nextfree ' is invalid , it is possible that the object at
* ' freelist ' is already corrupted . So isolate all objects
* starting at ' freelist ' .
*/
if ( freelist_corrupted ( s , page , freelist , nextfree ) )
break ;
2011-06-01 21:25:52 +04:00
do {
prior = page - > freelist ;
counters = page - > counters ;
set_freepointer ( s , freelist , prior ) ;
new . counters = counters ;
new . inuse - - ;
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! new . frozen ) ;
2011-06-01 21:25:52 +04:00
2011-07-14 21:49:12 +04:00
} while ( ! __cmpxchg_double_slab ( s , page ,
2011-06-01 21:25:52 +04:00
prior , counters ,
freelist , new . counters ,
" drain percpu freelist " ) ) ;
freelist = nextfree ;
}
2007-05-10 14:15:16 +04:00
/*
2011-06-01 21:25:52 +04:00
* Stage two : Ensure that the page is unfrozen while the
* list presence reflects the actual number of objects
* during unfreeze .
*
* We setup the list membership and then perform a cmpxchg
* with the count . If there is a mismatch then the page
* is not unfrozen but the page is on the wrong list .
*
* Then we restart the process which may have to remove
* the page from the list that we just put it on again
* because the number of objects in the slab may have
* changed .
2007-05-10 14:15:16 +04:00
*/
2011-06-01 21:25:52 +04:00
redo :
2007-05-10 14:15:16 +04:00
2011-06-01 21:25:52 +04:00
old . freelist = page - > freelist ;
old . counters = page - > counters ;
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! old . frozen ) ;
2008-01-08 10:20:27 +03:00
2011-06-01 21:25:52 +04:00
/* Determine target state of the slab */
new . counters = old . counters ;
if ( freelist ) {
new . inuse - - ;
set_freepointer ( s , freelist , old . freelist ) ;
new . freelist = freelist ;
} else
new . freelist = old . freelist ;
new . frozen = 0 ;
2014-07-03 02:22:35 +04:00
if ( ! new . inuse & & n - > nr_partial > = s - > min_partial )
2011-06-01 21:25:52 +04:00
m = M_FREE ;
else if ( new . freelist ) {
m = M_PARTIAL ;
if ( ! lock ) {
lock = 1 ;
/*
2019-03-06 02:46:22 +03:00
* Taking the spinlock removes the possibility
2011-06-01 21:25:52 +04:00
* that acquire_slab ( ) will see a slab page that
* is frozen
*/
spin_lock ( & n - > list_lock ) ;
}
} else {
m = M_FULL ;
if ( kmem_cache_debug ( s ) & & ! lock ) {
lock = 1 ;
/*
* This also ensures that the scanning of full
* slabs from diagnostic functions will not see
* any frozen slabs .
*/
spin_lock ( & n - > list_lock ) ;
}
}
if ( l ! = m ) {
if ( l = = M_PARTIAL )
remove_partial ( n , page ) ;
else if ( l = = M_FULL )
2014-01-10 16:23:49 +04:00
remove_full ( s , n , page ) ;
2011-06-01 21:25:52 +04:00
2018-12-28 11:33:13 +03:00
if ( m = = M_PARTIAL )
2011-06-01 21:25:52 +04:00
add_partial ( n , page , tail ) ;
2018-12-28 11:33:13 +03:00
else if ( m = = M_FULL )
2011-06-01 21:25:52 +04:00
add_full ( s , n , page ) ;
}
l = m ;
2011-07-14 21:49:12 +04:00
if ( ! __cmpxchg_double_slab ( s , page ,
2011-06-01 21:25:52 +04:00
old . freelist , old . counters ,
new . freelist , new . counters ,
" unfreezing slab " ) )
goto redo ;
if ( lock )
spin_unlock ( & n - > list_lock ) ;
2018-12-28 11:33:13 +03:00
if ( m = = M_PARTIAL )
stat ( s , tail ) ;
else if ( m = = M_FULL )
stat ( s , DEACTIVATE_FULL ) ;
else if ( m = = M_FREE ) {
2011-06-01 21:25:52 +04:00
stat ( s , DEACTIVATE_EMPTY ) ;
discard_slab ( s , page ) ;
stat ( s , FREE_SLAB ) ;
2007-05-10 14:15:16 +04:00
}
2017-07-07 01:36:25 +03:00
c - > page = NULL ;
c - > freelist = NULL ;
2007-05-07 01:49:36 +04:00
}
2012-05-18 17:01:17 +04:00
/*
* Unfreeze all the cpu partial slabs .
*
2012-11-28 20:23:00 +04:00
* This function must be called with interrupts disabled
* for the cpu using c ( or some other guarantee must be there
* to guarantee no concurrent accesses ) .
2012-05-18 17:01:17 +04:00
*/
2012-11-28 20:23:00 +04:00
static void unfreeze_partials ( struct kmem_cache * s ,
struct kmem_cache_cpu * c )
2011-08-10 01:12:27 +04:00
{
2013-06-19 09:05:52 +04:00
# ifdef CONFIG_SLUB_CPU_PARTIAL
slub: refactoring unfreeze_partials()
Current implementation of unfreeze_partials() is so complicated,
but benefit from it is insignificant. In addition many code in
do {} while loop have a bad influence to a fail rate of cmpxchg_double_slab.
Under current implementation which test status of cpu partial slab
and acquire list_lock in do {} while loop,
we don't need to acquire a list_lock and gain a little benefit
when front of the cpu partial slab is to be discarded, but this is a rare case.
In case that add_partial is performed and cmpxchg_double_slab is failed,
remove_partial should be called case by case.
I think that these are disadvantages of current implementation,
so I do refactoring unfreeze_partials().
Minimizing code in do {} while loop introduce a reduced fail rate
of cmpxchg_double_slab. Below is output of 'slabinfo -r kmalloc-256'
when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is done.
** before **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 182685
Unlocked Cmpxchg Double redos 0
** after **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 177995
Unlocked Cmpxchg Double redos 1
We can see cmpxchg_double_slab fail rate is improved slightly.
Bolow is output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.
** before **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% )
2,919,550 context-switches # 0.027 M/sec ( +- 3.07% )
100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% )
124,201 page-faults # 0.001 M/sec ( +- 0.15% )
401,500,234,387 cycles # 3.700 GHz ( +- 0.24% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% )
45,934,956,860 branches # 423.297 M/sec ( +- 0.14% )
188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% )
13.691837307 seconds time elapsed ( +- 0.24% )
** after **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% )
2,834,781 context-switches # 0.026 M/sec ( +- 2.33% )
93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% )
123,967 page-faults # 0.001 M/sec ( +- 0.15% )
398,781,421,836 cycles # 3.700 GHz ( +- 0.22% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% )
45,855,370,128 branches # 425.436 M/sec ( +- 0.10% )
169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% )
13.596272341 seconds time elapsed ( +- 0.22% )
No regression is found, but rather we can see slightly better result.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-08 21:23:16 +04:00
struct kmem_cache_node * n = NULL , * n2 = NULL ;
2011-11-14 09:34:13 +04:00
struct page * page , * discard_page = NULL ;
2011-08-10 01:12:27 +04:00
2020-04-02 07:04:16 +03:00
while ( ( page = slub_percpu_partial ( c ) ) ) {
2011-08-10 01:12:27 +04:00
struct page new ;
struct page old ;
2020-04-02 07:04:16 +03:00
slub_set_percpu_partial ( c , page ) ;
slub: refactoring unfreeze_partials()
Current implementation of unfreeze_partials() is so complicated,
but benefit from it is insignificant. In addition many code in
do {} while loop have a bad influence to a fail rate of cmpxchg_double_slab.
Under current implementation which test status of cpu partial slab
and acquire list_lock in do {} while loop,
we don't need to acquire a list_lock and gain a little benefit
when front of the cpu partial slab is to be discarded, but this is a rare case.
In case that add_partial is performed and cmpxchg_double_slab is failed,
remove_partial should be called case by case.
I think that these are disadvantages of current implementation,
so I do refactoring unfreeze_partials().
Minimizing code in do {} while loop introduce a reduced fail rate
of cmpxchg_double_slab. Below is output of 'slabinfo -r kmalloc-256'
when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is done.
** before **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 182685
Unlocked Cmpxchg Double redos 0
** after **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 177995
Unlocked Cmpxchg Double redos 1
We can see cmpxchg_double_slab fail rate is improved slightly.
Bolow is output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.
** before **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% )
2,919,550 context-switches # 0.027 M/sec ( +- 3.07% )
100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% )
124,201 page-faults # 0.001 M/sec ( +- 0.15% )
401,500,234,387 cycles # 3.700 GHz ( +- 0.24% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% )
45,934,956,860 branches # 423.297 M/sec ( +- 0.14% )
188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% )
13.691837307 seconds time elapsed ( +- 0.24% )
** after **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% )
2,834,781 context-switches # 0.026 M/sec ( +- 2.33% )
93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% )
123,967 page-faults # 0.001 M/sec ( +- 0.15% )
398,781,421,836 cycles # 3.700 GHz ( +- 0.22% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% )
45,855,370,128 branches # 425.436 M/sec ( +- 0.10% )
169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% )
13.596272341 seconds time elapsed ( +- 0.22% )
No regression is found, but rather we can see slightly better result.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-08 21:23:16 +04:00
n2 = get_node ( s , page_to_nid ( page ) ) ;
if ( n ! = n2 ) {
if ( n )
spin_unlock ( & n - > list_lock ) ;
n = n2 ;
spin_lock ( & n - > list_lock ) ;
}
2011-08-10 01:12:27 +04:00
do {
old . freelist = page - > freelist ;
old . counters = page - > counters ;
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! old . frozen ) ;
2011-08-10 01:12:27 +04:00
new . counters = old . counters ;
new . freelist = old . freelist ;
new . frozen = 0 ;
2012-05-18 17:01:17 +04:00
} while ( ! __cmpxchg_double_slab ( s , page ,
2011-08-10 01:12:27 +04:00
old . freelist , old . counters ,
new . freelist , new . counters ,
" unfreezing slab " ) ) ;
2014-07-03 02:22:35 +04:00
if ( unlikely ( ! new . inuse & & n - > nr_partial > = s - > min_partial ) ) {
2011-11-14 09:34:13 +04:00
page - > next = discard_page ;
discard_page = page ;
slub: refactoring unfreeze_partials()
Current implementation of unfreeze_partials() is so complicated,
but benefit from it is insignificant. In addition many code in
do {} while loop have a bad influence to a fail rate of cmpxchg_double_slab.
Under current implementation which test status of cpu partial slab
and acquire list_lock in do {} while loop,
we don't need to acquire a list_lock and gain a little benefit
when front of the cpu partial slab is to be discarded, but this is a rare case.
In case that add_partial is performed and cmpxchg_double_slab is failed,
remove_partial should be called case by case.
I think that these are disadvantages of current implementation,
so I do refactoring unfreeze_partials().
Minimizing code in do {} while loop introduce a reduced fail rate
of cmpxchg_double_slab. Below is output of 'slabinfo -r kmalloc-256'
when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is done.
** before **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 182685
Unlocked Cmpxchg Double redos 0
** after **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 177995
Unlocked Cmpxchg Double redos 1
We can see cmpxchg_double_slab fail rate is improved slightly.
Bolow is output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.
** before **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% )
2,919,550 context-switches # 0.027 M/sec ( +- 3.07% )
100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% )
124,201 page-faults # 0.001 M/sec ( +- 0.15% )
401,500,234,387 cycles # 3.700 GHz ( +- 0.24% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% )
45,934,956,860 branches # 423.297 M/sec ( +- 0.14% )
188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% )
13.691837307 seconds time elapsed ( +- 0.24% )
** after **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% )
2,834,781 context-switches # 0.026 M/sec ( +- 2.33% )
93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% )
123,967 page-faults # 0.001 M/sec ( +- 0.15% )
398,781,421,836 cycles # 3.700 GHz ( +- 0.22% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% )
45,855,370,128 branches # 425.436 M/sec ( +- 0.10% )
169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% )
13.596272341 seconds time elapsed ( +- 0.22% )
No regression is found, but rather we can see slightly better result.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-08 21:23:16 +04:00
} else {
add_partial ( n , page , DEACTIVATE_TO_TAIL ) ;
stat ( s , FREE_ADD_PARTIAL ) ;
2011-08-10 01:12:27 +04:00
}
}
if ( n )
spin_unlock ( & n - > list_lock ) ;
2011-11-14 09:34:13 +04:00
while ( discard_page ) {
page = discard_page ;
discard_page = discard_page - > next ;
stat ( s , DEACTIVATE_EMPTY ) ;
discard_slab ( s , page ) ;
stat ( s , FREE_SLAB ) ;
}
2019-05-14 03:16:09 +03:00
# endif /* CONFIG_SLUB_CPU_PARTIAL */
2011-08-10 01:12:27 +04:00
}
/*
2019-03-06 02:43:10 +03:00
* Put a page that was just frozen ( in __slab_free | get_partial_node ) into a
* partial page slot if available .
2011-08-10 01:12:27 +04:00
*
* If we did not find a slot then simply move all the partials to the
* per node partial list .
*/
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating a number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse",
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! availabe is always 0 !!!
...
Therfore, "available > s->cpu_partial / 2" is always false and
we always go to second iteration.
This patch correct this problem.
After that, we don't need return value of put_cpu_partial().
So remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
static void put_cpu_partial ( struct kmem_cache * s , struct page * page , int drain )
2011-08-10 01:12:27 +04:00
{
2013-06-19 09:05:52 +04:00
# ifdef CONFIG_SLUB_CPU_PARTIAL
2011-08-10 01:12:27 +04:00
struct page * oldpage ;
int pages ;
int pobjects ;
slub: make dead caches discard free slabs immediately
To speed up further allocations SLUB may store empty slabs in per cpu/node
partial lists instead of freeing them immediately. This prevents per
memcg caches destruction, because kmem caches created for a memory cgroup
are only destroyed after the last page charged to the cgroup is freed.
To fix this issue, this patch resurrects approach first proposed in [1].
It forbids SLUB to cache empty slabs after the memory cgroup that the
cache belongs to was destroyed. It is achieved by setting kmem_cache's
cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
that it would drop frozen empty slabs immediately if cpu_partial = 0.
The runtime overhead is minimal. From all the hot functions, we only
touch relatively cold put_cpu_partial(): we make it call
unfreeze_partials() after freezing a slab that belongs to an offline
memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
partial list on free/alloc, and there can't be allocations from dead
caches, it shouldn't cause any overhead. We do have to disable preemption
for put_cpu_partial() to achieve that though.
The original patch was accepted well and even merged to the mm tree.
However, I decided to withdraw it due to changes happening to the memcg
core at that time. I had an idea of introducing per-memcg shrinkers for
kmem caches, but now, as memcg has finally settled down, I do not see it
as an option, because SLUB shrinker would be too costly to call since SLUB
does not keep free slabs on a separate list. Besides, we currently do not
even call per-memcg shrinkers for offline memcgs. Overall, it would
introduce much more complexity to both SLUB and memcg than this small
patch.
Regarding to SLAB, there's no problem with it, because it shrinks
per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
longer keep entries for offline cgroups in per-memcg arrays (such as
memcg_cache_params->memcg_caches), so we do not have to bother if a
per-memcg cache will be shrunk a bit later than it could be.
[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 01:59:47 +03:00
preempt_disable ( ) ;
2011-08-10 01:12:27 +04:00
do {
pages = 0 ;
pobjects = 0 ;
oldpage = this_cpu_read ( s - > cpu_slab - > partial ) ;
if ( oldpage ) {
pobjects = oldpage - > pobjects ;
pages = oldpage - > pages ;
2020-04-02 07:04:19 +03:00
if ( drain & & pobjects > slub_cpu_partial ( s ) ) {
2011-08-10 01:12:27 +04:00
unsigned long flags ;
/*
* partial array is full . Move the existing
* set to the per node partial list .
*/
local_irq_save ( flags ) ;
2012-11-28 20:23:00 +04:00
unfreeze_partials ( s , this_cpu_ptr ( s - > cpu_slab ) ) ;
2011-08-10 01:12:27 +04:00
local_irq_restore ( flags ) ;
2012-06-22 22:22:38 +04:00
oldpage = NULL ;
2011-08-10 01:12:27 +04:00
pobjects = 0 ;
pages = 0 ;
2012-02-03 19:34:56 +04:00
stat ( s , CPU_PARTIAL_DRAIN ) ;
2011-08-10 01:12:27 +04:00
}
}
pages + + ;
pobjects + = page - > objects - page - > inuse ;
page - > pages = pages ;
page - > pobjects = pobjects ;
page - > next = oldpage ;
2013-07-15 05:05:29 +04:00
} while ( this_cpu_cmpxchg ( s - > cpu_slab - > partial , oldpage , page )
! = oldpage ) ;
2020-04-02 07:04:19 +03:00
if ( unlikely ( ! slub_cpu_partial ( s ) ) ) {
slub: make dead caches discard free slabs immediately
To speed up further allocations SLUB may store empty slabs in per cpu/node
partial lists instead of freeing them immediately. This prevents per
memcg caches destruction, because kmem caches created for a memory cgroup
are only destroyed after the last page charged to the cgroup is freed.
To fix this issue, this patch resurrects approach first proposed in [1].
It forbids SLUB to cache empty slabs after the memory cgroup that the
cache belongs to was destroyed. It is achieved by setting kmem_cache's
cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
that it would drop frozen empty slabs immediately if cpu_partial = 0.
The runtime overhead is minimal. From all the hot functions, we only
touch relatively cold put_cpu_partial(): we make it call
unfreeze_partials() after freezing a slab that belongs to an offline
memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
partial list on free/alloc, and there can't be allocations from dead
caches, it shouldn't cause any overhead. We do have to disable preemption
for put_cpu_partial() to achieve that though.
The original patch was accepted well and even merged to the mm tree.
However, I decided to withdraw it due to changes happening to the memcg
core at that time. I had an idea of introducing per-memcg shrinkers for
kmem caches, but now, as memcg has finally settled down, I do not see it
as an option, because SLUB shrinker would be too costly to call since SLUB
does not keep free slabs on a separate list. Besides, we currently do not
even call per-memcg shrinkers for offline memcgs. Overall, it would
introduce much more complexity to both SLUB and memcg than this small
patch.
Regarding to SLAB, there's no problem with it, because it shrinks
per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
longer keep entries for offline cgroups in per-memcg arrays (such as
memcg_cache_params->memcg_caches), so we do not have to bother if a
per-memcg cache will be shrunk a bit later than it could be.
[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 01:59:47 +03:00
unsigned long flags ;
local_irq_save ( flags ) ;
unfreeze_partials ( s , this_cpu_ptr ( s - > cpu_slab ) ) ;
local_irq_restore ( flags ) ;
}
preempt_enable ( ) ;
2019-05-14 03:16:09 +03:00
# endif /* CONFIG_SLUB_CPU_PARTIAL */
2011-08-10 01:12:27 +04:00
}
2007-10-16 12:26:05 +04:00
static inline void flush_slab ( struct kmem_cache * s , struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
2009-12-19 01:26:23 +03:00
stat ( s , CPUSLAB_FLUSH ) ;
2017-07-07 01:36:25 +03:00
deactivate_slab ( s , c - > page , c - > freelist , c ) ;
2012-05-09 19:09:57 +04:00
c - > tid = next_tid ( c - > tid ) ;
2007-05-07 01:49:36 +04:00
}
/*
* Flush cpu slab .
2008-02-16 10:45:26 +03:00
*
2007-05-07 01:49:36 +04:00
* Called from IPI handler with interrupts disabled .
*/
2007-07-17 15:03:24 +04:00
static inline void __flush_cpu_slab ( struct kmem_cache * s , int cpu )
2007-05-07 01:49:36 +04:00
{
2009-12-19 01:26:20 +03:00
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab , cpu ) ;
2007-05-07 01:49:36 +04:00
2018-12-28 11:33:06 +03:00
if ( c - > page )
flush_slab ( s , c ) ;
2011-08-10 01:12:27 +04:00
2018-12-28 11:33:06 +03:00
unfreeze_partials ( s , c ) ;
2007-05-07 01:49:36 +04:00
}
static void flush_cpu_slab ( void * d )
{
struct kmem_cache * s = d ;
2007-10-16 12:26:05 +04:00
__flush_cpu_slab ( s , smp_processor_id ( ) ) ;
2007-05-07 01:49:36 +04:00
}
2012-03-29 01:42:44 +04:00
static bool has_cpu_slab ( int cpu , void * info )
{
struct kmem_cache * s = info ;
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab , cpu ) ;
2017-07-07 01:36:31 +03:00
return c - > page | | slub_percpu_partial ( c ) ;
2012-03-29 01:42:44 +04:00
}
2007-05-07 01:49:36 +04:00
static void flush_all ( struct kmem_cache * s )
{
2020-01-17 12:01:37 +03:00
on_each_cpu_cond ( has_cpu_slab , flush_cpu_slab , s , 1 ) ;
2007-05-07 01:49:36 +04:00
}
2016-08-18 15:57:19 +03:00
/*
* Use the cpu notifier to insure that the cpu slabs are flushed when
* necessary .
*/
static int slub_cpu_dead ( unsigned int cpu )
{
struct kmem_cache * s ;
unsigned long flags ;
mutex_lock ( & slab_mutex ) ;
list_for_each_entry ( s , & slab_caches , list ) {
local_irq_save ( flags ) ;
__flush_cpu_slab ( s , cpu ) ;
local_irq_restore ( flags ) ;
}
mutex_unlock ( & slab_mutex ) ;
return 0 ;
}
2007-10-16 12:26:05 +04:00
/*
* Check if the objects in a per cpu structure fit numa
* locality expectations .
*/
2012-05-09 19:09:59 +04:00
static inline int node_match ( struct page * page , int node )
2007-10-16 12:26:05 +04:00
{
# ifdef CONFIG_NUMA
2018-12-28 11:33:09 +03:00
if ( node ! = NUMA_NO_NODE & & page_to_nid ( page ) ! = node )
2007-10-16 12:26:05 +04:00
return 0 ;
# endif
return 1 ;
}
mm, slab: suppress out of memory warning unless debug is enabled
When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as current
number of slabs, number of objects, active objects, etc. This is always
coupled with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.
Suppress this out of memory warning if the allocator is configured
without debug supported. The page allocation failure warning will
indicate it is a failed slab allocation, the order, and the gfp mask, so
this is only useful to diagnose allocator issues.
Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch. If debug is
disabled, however, the warnings are now suppressed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:36 +04:00
# ifdef CONFIG_SLUB_DEBUG
2009-06-10 19:50:32 +04:00
static int count_free ( struct page * page )
{
return page - > objects - page - > inuse ;
}
mm, slab: suppress out of memory warning unless debug is enabled
When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as current
number of slabs, number of objects, active objects, etc. This is always
coupled with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.
Suppress this out of memory warning if the allocator is configured
without debug supported. The page allocation failure warning will
indicate it is a failed slab allocation, the order, and the gfp mask, so
this is only useful to diagnose allocator issues.
Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch. If debug is
disabled, however, the warnings are now suppressed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:36 +04:00
static inline unsigned long node_nr_objs ( struct kmem_cache_node * n )
{
return atomic_long_read ( & n - > total_objects ) ;
}
# endif /* CONFIG_SLUB_DEBUG */
# if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
2009-06-10 19:50:32 +04:00
static unsigned long count_partial ( struct kmem_cache_node * n ,
int ( * get_count ) ( struct page * ) )
{
unsigned long flags ;
unsigned long x = 0 ;
struct page * page ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
2019-05-14 03:16:12 +03:00
list_for_each_entry ( page , & n - > partial , slab_list )
2009-06-10 19:50:32 +04:00
x + = get_count ( page ) ;
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
return x ;
}
mm, slab: suppress out of memory warning unless debug is enabled
When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as current
number of slabs, number of objects, active objects, etc. This is always
coupled with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.
Suppress this out of memory warning if the allocator is configured
without debug supported. The page allocation failure warning will
indicate it is a failed slab allocation, the order, and the gfp mask, so
this is only useful to diagnose allocator issues.
Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch. If debug is
disabled, however, the warnings are now suppressed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:36 +04:00
# endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
2009-06-11 14:08:48 +04:00
2009-06-10 19:50:32 +04:00
static noinline void
slab_out_of_memory ( struct kmem_cache * s , gfp_t gfpflags , int nid )
{
mm, slab: suppress out of memory warning unless debug is enabled
When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as current
number of slabs, number of objects, active objects, etc. This is always
coupled with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.
Suppress this out of memory warning if the allocator is configured
without debug supported. The page allocation failure warning will
indicate it is a failed slab allocation, the order, and the gfp mask, so
this is only useful to diagnose allocator issues.
Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch. If debug is
disabled, however, the warnings are now suppressed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:36 +04:00
# ifdef CONFIG_SLUB_DEBUG
static DEFINE_RATELIMIT_STATE ( slub_oom_rs , DEFAULT_RATELIMIT_INTERVAL ,
DEFAULT_RATELIMIT_BURST ) ;
2009-06-10 19:50:32 +04:00
int node ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2009-06-10 19:50:32 +04:00
mm, slab: suppress out of memory warning unless debug is enabled
When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as current
number of slabs, number of objects, active objects, etc. This is always
coupled with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.
Suppress this out of memory warning if the allocator is configured
without debug supported. The page allocation failure warning will
indicate it is a failed slab allocation, the order, and the gfp mask, so
this is only useful to diagnose allocator issues.
Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch. If debug is
disabled, however, the warnings are now suppressed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:36 +04:00
if ( ( gfpflags & __GFP_NOWARN ) | | ! __ratelimit ( & slub_oom_rs ) )
return ;
2016-03-16 00:56:33 +03:00
pr_warn ( " SLUB: Unable to allocate memory on node %d, gfp=%#x(%pGg) \n " ,
nid , gfpflags , & gfpflags ) ;
2018-04-06 02:21:39 +03:00
pr_warn ( " cache: %s, object size: %u, buffer size: %u, default order: %u, min order: %u \n " ,
2014-06-05 03:06:34 +04:00
s - > name , s - > object_size , s - > size , oo_order ( s - > oo ) ,
oo_order ( s - > min ) ) ;
2009-06-10 19:50:32 +04:00
2012-06-13 19:24:57 +04:00
if ( oo_order ( s - > min ) > get_order ( s - > object_size ) )
2014-06-05 03:06:34 +04:00
pr_warn ( " %s debugging increased min order, use slub_debug=O to disable. \n " ,
s - > name ) ;
2009-07-07 11:14:14 +04:00
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2009-06-10 19:50:32 +04:00
unsigned long nr_slabs ;
unsigned long nr_objs ;
unsigned long nr_free ;
2009-06-11 14:08:48 +04:00
nr_free = count_partial ( n , count_free ) ;
nr_slabs = node_nr_slabs ( n ) ;
nr_objs = node_nr_objs ( n ) ;
2009-06-10 19:50:32 +04:00
2014-06-05 03:06:34 +04:00
pr_warn ( " node %d: slabs: %ld, objs: %ld, free: %ld \n " ,
2009-06-10 19:50:32 +04:00
node , nr_slabs , nr_objs , nr_free ) ;
}
mm, slab: suppress out of memory warning unless debug is enabled
When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as current
number of slabs, number of objects, active objects, etc. This is always
coupled with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.
Suppress this out of memory warning if the allocator is configured
without debug supported. The page allocation failure warning will
indicate it is a failed slab allocation, the order, and the gfp mask, so
this is only useful to diagnose allocator issues.
Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch. If debug is
disabled, however, the warnings are now suppressed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:36 +04:00
# endif
2009-06-10 19:50:32 +04:00
}
2011-08-10 01:12:26 +04:00
static inline void * new_slab_objects ( struct kmem_cache * s , gfp_t flags ,
int node , struct kmem_cache_cpu * * pc )
{
2012-05-09 19:09:51 +04:00
void * freelist ;
2012-05-09 19:09:55 +04:00
struct kmem_cache_cpu * c = * pc ;
struct page * page ;
2011-08-10 01:12:26 +04:00
2018-06-08 03:05:13 +03:00
WARN_ON_ONCE ( s - > ctor & & ( flags & __GFP_ZERO ) ) ;
2012-05-09 19:09:55 +04:00
freelist = get_partial ( s , flags , node , c ) ;
2011-08-10 01:12:26 +04:00
2012-05-09 19:09:55 +04:00
if ( freelist )
return freelist ;
page = new_slab ( s , flags , node ) ;
2011-08-10 01:12:26 +04:00
if ( page ) {
2014-06-05 03:07:56 +04:00
c = raw_cpu_ptr ( s - > cpu_slab ) ;
2011-08-10 01:12:26 +04:00
if ( c - > page )
flush_slab ( s , c ) ;
/*
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
*/
2012-05-09 19:09:51 +04:00
freelist = page - > freelist ;
2011-08-10 01:12:26 +04:00
page - > freelist = NULL ;
stat ( s , ALLOC_SLAB ) ;
c - > page = page ;
* pc = c ;
2019-03-06 02:42:00 +03:00
}
2011-08-10 01:12:26 +04:00
2012-05-09 19:09:51 +04:00
return freelist ;
2011-08-10 01:12:26 +04:00
}
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allows network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propogate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel run to completion. However, the patched
kernel completed the test 45% faster.
MICRO
3.5.0-rc2 3.5.0-rc2
vanilla swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds) 197.80 173.07
User+Sys Time Running Test (seconds) 206.96 182.03
Total Elapsed Time (seconds) 3240.70 1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:43:58 +04:00
static inline bool pfmemalloc_match ( struct page * page , gfp_t gfpflags )
{
if ( unlikely ( PageSlabPfmemalloc ( page ) ) )
return gfp_pfmemalloc_allowed ( gfpflags ) ;
return true ;
}
2011-11-12 00:07:14 +04:00
/*
2013-07-15 05:05:29 +04:00
* Check the page - > freelist of a page and either transfer the freelist to the
* per cpu freelist or deactivate the page .
2011-11-12 00:07:14 +04:00
*
* The page is still frozen if the return value is not NULL .
*
* If this function returns NULL then the page has been unfrozen .
2012-05-18 17:01:17 +04:00
*
* This function must be called with interrupt disabled .
2011-11-12 00:07:14 +04:00
*/
static inline void * get_freelist ( struct kmem_cache * s , struct page * page )
{
struct page new ;
unsigned long counters ;
void * freelist ;
do {
freelist = page - > freelist ;
counters = page - > counters ;
2012-05-09 19:09:51 +04:00
2011-11-12 00:07:14 +04:00
new . counters = counters ;
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! new . frozen ) ;
2011-11-12 00:07:14 +04:00
new . inuse = page - > objects ;
new . frozen = freelist ! = NULL ;
2012-05-18 17:01:17 +04:00
} while ( ! __cmpxchg_double_slab ( s , page ,
2011-11-12 00:07:14 +04:00
freelist , counters ,
NULL , new . counters ,
" get_freelist " ) ) ;
return freelist ;
}
2007-05-07 01:49:36 +04:00
/*
2007-05-10 14:15:16 +04:00
* Slow path . The lockless freelist is empty or we need to perform
* debugging duties .
*
* Processing is still very fast if new objects have been freed to the
* regular freelist . In that case we simply take over the regular freelist
* as the lockless freelist and zap the regular freelist .
2007-05-07 01:49:36 +04:00
*
2007-05-10 14:15:16 +04:00
* If that is not working then we fall back to the partial lists . We take the
* first element of the freelist as the object to allocate now and move the
* rest of the freelist to the lockless freelist .
2007-05-07 01:49:36 +04:00
*
2007-05-10 14:15:16 +04:00
* And if we were unable to get a new slab from the partial slab lists then
2008-02-16 10:45:26 +03:00
* we need to allocate a new slab . This is the slowest path since it involves
* a call to the page allocator and the setup of a new slab .
2015-11-21 02:57:35 +03:00
*
* Version of __slab_alloc to use when we know that interrupts are
* already disabled ( which is the case for bulk allocation ) .
2007-05-07 01:49:36 +04:00
*/
2015-11-21 02:57:35 +03:00
static void * ___slab_alloc ( struct kmem_cache * s , gfp_t gfpflags , int node ,
2008-08-19 21:43:25 +04:00
unsigned long addr , struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
2012-05-09 19:09:51 +04:00
void * freelist ;
2012-05-09 19:09:58 +04:00
struct page * page ;
2007-05-07 01:49:36 +04:00
2012-05-09 19:09:58 +04:00
page = c - > page ;
mm, slub: prevent kmalloc_node crashes and memory leaks
Sachin reports [1] a crash in SLUB __slab_alloc():
BUG: Kernel NULL pointer dereference on read at 0x000073b0
Faulting instruction address: 0xc0000000003d55f4
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004844 XER: 00000000
CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
NIP ___slab_alloc+0x1f4/0x760
LR __slab_alloc+0x34/0x60
Call Trace:
___slab_alloc+0x334/0x760 (unreliable)
__slab_alloc+0x34/0x60
__kmalloc_node+0x110/0x490
kvmalloc_node+0x58/0x110
mem_cgroup_css_online+0x108/0x270
online_css+0x48/0xd0
cgroup_apply_control_enable+0x2ec/0x4d0
cgroup_mkdir+0x228/0x5f0
kernfs_iop_mkdir+0x90/0xf0
vfs_mkdir+0x110/0x230
do_mkdirat+0xb0/0x1a0
system_call+0x5c/0x68
This is a PowerPC platform with following NUMA topology:
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node 0 1
0: 10 40
1: 40 10
possible numa nodes: 0-31
This only happens with a mmotm patch "mm/memcontrol.c: allocate
shrinker_map on appropriate NUMA node" [2] which effectively calls
kmalloc_node for each possible node. SLUB however only allocates
kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on
node_to_mem_node to return such valid node for other nodes since commit
a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating
on memoryless node"). This is however not true in this configuration
where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31,
thus it contains zeroes and get_partial() ends up accessing
non-allocated kmem_cache_node.
A related issue was reported by Bharata (originally by Ramachandran) [3]
where a similar PowerPC configuration, but with mainline kernel without
patch [2] ends up allocating large amounts of pages by kmalloc-1k
kmalloc-512. This seems to have the same underlying issue with
node_to_mem_node() not behaving as expected, and might probably also
lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4].
This patch should fix both issues by not relying on node_to_mem_node()
anymore and instead simply falling back to NUMA_NO_NODE, when
kmalloc_node(node) is attempted for a node that's not online, or has no
usable memory. The "usable memory" condition is also changed from
node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly
the condition that SLUB uses to allocate kmem_cache_node structures.
The check in get_partial() is removed completely, as the checks in
___slab_alloc() are now sufficient to prevent get_partial() being
reached with an invalid node.
[1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
[2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/
[3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
[4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/
Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz
Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-22 04:22:37 +03:00
if ( ! page ) {
/*
* if the node is not online or has no normal memory , just
* ignore the node constraint
*/
if ( unlikely ( node ! = NUMA_NO_NODE & &
! node_state ( node , N_NORMAL_MEMORY ) ) )
node = NUMA_NO_NODE ;
2007-05-07 01:49:36 +04:00
goto new_slab ;
mm, slub: prevent kmalloc_node crashes and memory leaks
Sachin reports [1] a crash in SLUB __slab_alloc():
BUG: Kernel NULL pointer dereference on read at 0x000073b0
Faulting instruction address: 0xc0000000003d55f4
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004844 XER: 00000000
CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
NIP ___slab_alloc+0x1f4/0x760
LR __slab_alloc+0x34/0x60
Call Trace:
___slab_alloc+0x334/0x760 (unreliable)
__slab_alloc+0x34/0x60
__kmalloc_node+0x110/0x490
kvmalloc_node+0x58/0x110
mem_cgroup_css_online+0x108/0x270
online_css+0x48/0xd0
cgroup_apply_control_enable+0x2ec/0x4d0
cgroup_mkdir+0x228/0x5f0
kernfs_iop_mkdir+0x90/0xf0
vfs_mkdir+0x110/0x230
do_mkdirat+0xb0/0x1a0
system_call+0x5c/0x68
This is a PowerPC platform with following NUMA topology:
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node 0 1
0: 10 40
1: 40 10
possible numa nodes: 0-31
This only happens with a mmotm patch "mm/memcontrol.c: allocate
shrinker_map on appropriate NUMA node" [2] which effectively calls
kmalloc_node for each possible node. SLUB however only allocates
kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on
node_to_mem_node to return such valid node for other nodes since commit
a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating
on memoryless node"). This is however not true in this configuration
where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31,
thus it contains zeroes and get_partial() ends up accessing
non-allocated kmem_cache_node.
A related issue was reported by Bharata (originally by Ramachandran) [3]
where a similar PowerPC configuration, but with mainline kernel without
patch [2] ends up allocating large amounts of pages by kmalloc-1k
kmalloc-512. This seems to have the same underlying issue with
node_to_mem_node() not behaving as expected, and might probably also
lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4].
This patch should fix both issues by not relying on node_to_mem_node()
anymore and instead simply falling back to NUMA_NO_NODE, when
kmalloc_node(node) is attempted for a node that's not online, or has no
usable memory. The "usable memory" condition is also changed from
node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly
the condition that SLUB uses to allocate kmem_cache_node structures.
The check in get_partial() is removed completely, as the checks in
___slab_alloc() are now sufficient to prevent get_partial() being
reached with an invalid node.
[1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
[2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/
[3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
[4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/
Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz
Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-22 04:22:37 +03:00
}
2011-08-10 01:12:27 +04:00
redo :
2012-05-09 19:09:51 +04:00
2012-05-09 19:09:59 +04:00
if ( unlikely ( ! node_match ( page , node ) ) ) {
mm, slub: prevent kmalloc_node crashes and memory leaks
Sachin reports [1] a crash in SLUB __slab_alloc():
BUG: Kernel NULL pointer dereference on read at 0x000073b0
Faulting instruction address: 0xc0000000003d55f4
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004844 XER: 00000000
CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
NIP ___slab_alloc+0x1f4/0x760
LR __slab_alloc+0x34/0x60
Call Trace:
___slab_alloc+0x334/0x760 (unreliable)
__slab_alloc+0x34/0x60
__kmalloc_node+0x110/0x490
kvmalloc_node+0x58/0x110
mem_cgroup_css_online+0x108/0x270
online_css+0x48/0xd0
cgroup_apply_control_enable+0x2ec/0x4d0
cgroup_mkdir+0x228/0x5f0
kernfs_iop_mkdir+0x90/0xf0
vfs_mkdir+0x110/0x230
do_mkdirat+0xb0/0x1a0
system_call+0x5c/0x68
This is a PowerPC platform with following NUMA topology:
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node 0 1
0: 10 40
1: 40 10
possible numa nodes: 0-31
This only happens with a mmotm patch "mm/memcontrol.c: allocate
shrinker_map on appropriate NUMA node" [2] which effectively calls
kmalloc_node for each possible node. SLUB however only allocates
kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on
node_to_mem_node to return such valid node for other nodes since commit
a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating
on memoryless node"). This is however not true in this configuration
where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31,
thus it contains zeroes and get_partial() ends up accessing
non-allocated kmem_cache_node.
A related issue was reported by Bharata (originally by Ramachandran) [3]
where a similar PowerPC configuration, but with mainline kernel without
patch [2] ends up allocating large amounts of pages by kmalloc-1k
kmalloc-512. This seems to have the same underlying issue with
node_to_mem_node() not behaving as expected, and might probably also
lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4].
This patch should fix both issues by not relying on node_to_mem_node()
anymore and instead simply falling back to NUMA_NO_NODE, when
kmalloc_node(node) is attempted for a node that's not online, or has no
usable memory. The "usable memory" condition is also changed from
node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly
the condition that SLUB uses to allocate kmem_cache_node structures.
The check in get_partial() is removed completely, as the checks in
___slab_alloc() are now sufficient to prevent get_partial() being
reached with an invalid node.
[1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
[2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/
[3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
[4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/
Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz
Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-22 04:22:37 +03:00
/*
* same as above but node_match ( ) being false already
* implies node ! = NUMA_NO_NODE
*/
if ( ! node_state ( node , N_NORMAL_MEMORY ) ) {
node = NUMA_NO_NODE ;
goto redo ;
} else {
2014-10-10 02:26:15 +04:00
stat ( s , ALLOC_NODE_MISMATCH ) ;
2017-07-07 01:36:25 +03:00
deactivate_slab ( s , page , c - > freelist , c ) ;
2014-10-10 02:26:15 +04:00
goto new_slab ;
}
2011-06-01 21:25:56 +04:00
}
2008-02-16 10:45:26 +03:00
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allows network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propogate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel run to completion. However, the patched
kernel completed the test 45% faster.
MICRO
3.5.0-rc2 3.5.0-rc2
vanilla swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds) 197.80 173.07
User+Sys Time Running Test (seconds) 206.96 182.03
Total Elapsed Time (seconds) 3240.70 1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:43:58 +04:00
/*
* By rights , we should be searching for a slab page that was
* PFMEMALLOC but right now , we are losing the pfmemalloc
* information when the page leaves the per - cpu allocator
*/
if ( unlikely ( ! pfmemalloc_match ( page , gfpflags ) ) ) {
2017-07-07 01:36:25 +03:00
deactivate_slab ( s , page , c - > freelist , c ) ;
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allows network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propogate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel run to completion. However, the patched
kernel completed the test 45% faster.
MICRO
3.5.0-rc2 3.5.0-rc2
vanilla swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds) 197.80 173.07
User+Sys Time Running Test (seconds) 206.96 182.03
Total Elapsed Time (seconds) 3240.70 1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:43:58 +04:00
goto new_slab ;
}
2011-12-13 07:57:06 +04:00
/* must check again c->freelist in case of cpu migration or IRQ */
2012-05-09 19:09:51 +04:00
freelist = c - > freelist ;
if ( freelist )
2011-12-13 07:57:06 +04:00
goto load_freelist ;
2011-06-01 21:25:58 +04:00
2012-05-09 19:09:58 +04:00
freelist = get_freelist ( s , page ) ;
2008-02-16 10:45:26 +03:00
2012-05-09 19:09:51 +04:00
if ( ! freelist ) {
2011-06-01 21:25:58 +04:00
c - > page = NULL ;
stat ( s , DEACTIVATE_BYPASS ) ;
2011-06-01 21:25:56 +04:00
goto new_slab ;
2011-06-01 21:25:58 +04:00
}
2008-02-16 10:45:26 +03:00
2009-12-19 01:26:23 +03:00
stat ( s , ALLOC_REFILL ) ;
2008-02-16 10:45:26 +03:00
2007-05-10 14:15:16 +04:00
load_freelist :
2012-05-09 19:09:52 +04:00
/*
* freelist is pointing to the list of objects to be used .
* page is pointing to the page from which the objects are obtained .
* That page must be frozen for per cpu allocations to work .
*/
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! c - > page - > frozen ) ;
2012-05-09 19:09:51 +04:00
c - > freelist = get_freepointer ( s , freelist ) ;
2011-02-25 20:38:54 +03:00
c - > tid = next_tid ( c - > tid ) ;
2012-05-09 19:09:51 +04:00
return freelist ;
2007-05-07 01:49:36 +04:00
new_slab :
2011-06-01 21:25:52 +04:00
2017-07-07 01:36:31 +03:00
if ( slub_percpu_partial ( c ) ) {
page = c - > page = slub_percpu_partial ( c ) ;
slub_set_percpu_partial ( c , page ) ;
2011-08-10 01:12:27 +04:00
stat ( s , CPU_PARTIAL_ALLOC ) ;
goto redo ;
2007-05-07 01:49:36 +04:00
}
2012-05-09 19:09:55 +04:00
freelist = new_slab_objects ( s , gfpflags , node , & c ) ;
2011-04-15 23:48:14 +04:00
2012-05-09 19:09:54 +04:00
if ( unlikely ( ! freelist ) ) {
mm, slab: suppress out of memory warning unless debug is enabled
When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as current
number of slabs, number of objects, active objects, etc. This is always
coupled with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.
Suppress this out of memory warning if the allocator is configured
without debug supported. The page allocation failure warning will
indicate it is a failed slab allocation, the order, and the gfp mask, so
this is only useful to diagnose allocator issues.
Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch. If debug is
disabled, however, the warnings are now suppressed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:36 +04:00
slab_out_of_memory ( s , gfpflags , node ) ;
2012-05-09 19:09:54 +04:00
return NULL ;
2007-05-07 01:49:36 +04:00
}
2011-06-01 21:25:52 +04:00
2012-05-09 19:09:58 +04:00
page = c - > page ;
2012-08-01 03:44:00 +04:00
if ( likely ( ! kmem_cache_debug ( s ) & & pfmemalloc_match ( page , gfpflags ) ) )
2007-05-17 09:10:53 +04:00
goto load_freelist ;
2011-06-01 21:25:52 +04:00
2011-08-10 01:12:26 +04:00
/* Only entered in the debug case */
2013-07-15 05:05:29 +04:00
if ( kmem_cache_debug ( s ) & &
! alloc_debug_processing ( s , page , freelist , addr ) )
2011-08-10 01:12:26 +04:00
goto new_slab ; /* Slab failed checks. Next slab needed */
2007-05-10 14:15:16 +04:00
2017-07-07 01:36:25 +03:00
deactivate_slab ( s , page , get_freepointer ( s , freelist ) , c ) ;
2012-05-09 19:09:51 +04:00
return freelist ;
2007-05-10 14:15:16 +04:00
}
2015-11-21 02:57:35 +03:00
/*
* Another one that disabled interrupt and compensates for possible
* cpu changes by refetching the per cpu area pointer .
*/
static void * __slab_alloc ( struct kmem_cache * s , gfp_t gfpflags , int node ,
unsigned long addr , struct kmem_cache_cpu * c )
{
void * p ;
unsigned long flags ;
local_irq_save ( flags ) ;
2019-10-15 22:18:12 +03:00
# ifdef CONFIG_PREEMPTION
2015-11-21 02:57:35 +03:00
/*
* We may have been preempted and rescheduled on a different
* cpu before disabling interrupts . Need to reload cpu area
* pointer .
*/
c = this_cpu_ptr ( s - > cpu_slab ) ;
# endif
p = ___slab_alloc ( s , gfpflags , node , addr , c ) ;
local_irq_restore ( flags ) ;
return p ;
}
2019-10-15 00:11:57 +03:00
/*
* If the object has been wiped upon free , make sure it ' s fully initialized by
* zeroing out freelist pointer .
*/
static __always_inline void maybe_wipe_obj_freeptr ( struct kmem_cache * s ,
void * obj )
{
if ( unlikely ( slab_want_init_on_free ( s ) ) & & obj )
memset ( ( void * ) ( ( char * ) obj + s - > offset ) , 0 , sizeof ( void * ) ) ;
}
2007-05-10 14:15:16 +04:00
/*
* Inlined fastpath so that allocation functions ( kmalloc , kmem_cache_alloc )
* have the fastpath folded into their functions . So no function call
* overhead for requests that can be satisfied on the fastpath .
*
* The fastpath works by first checking if the lockless freelist can be used .
* If not then __slab_alloc is called for slow processing .
*
* Otherwise we can simply pick the next object from the lockless free list .
*/
2012-09-09 00:47:58 +04:00
static __always_inline void * slab_alloc_node ( struct kmem_cache * s ,
2008-08-19 21:43:25 +04:00
gfp_t gfpflags , int node , unsigned long addr )
2007-05-10 14:15:16 +04:00
{
2015-11-21 02:57:52 +03:00
void * object ;
2007-10-16 12:26:05 +04:00
struct kmem_cache_cpu * c ;
2012-05-09 19:09:59 +04:00
struct page * page ;
2011-02-25 20:38:54 +03:00
unsigned long tid ;
2020-08-07 09:20:56 +03:00
struct obj_cgroup * objcg = NULL ;
2008-01-08 10:20:30 +03:00
2020-08-07 09:20:56 +03:00
s = slab_pre_alloc_hook ( s , & objcg , 1 , gfpflags ) ;
memcg: fix possible use-after-free in memcg_kmem_get_cache()
Suppose task @t that belongs to a memory cgroup @memcg is going to
allocate an object from a kmem cache @c. The copy of @c corresponding to
@memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory
cgroup destruction we can access the memory cgroup's copy of the cache
after it was destroyed:
CPU0 CPU1
---- ----
[ current=@t
@mc->memcg_params->nr_pages=0 ]
kmem_cache_alloc(@c):
call memcg_kmem_get_cache(@c);
proceed to allocation from @mc:
alloc a page for @mc:
...
move @t from @memcg
destroy @memcg:
mem_cgroup_css_offline(@memcg):
memcg_unregister_all_caches(@memcg):
kmem_cache_destroy(@mc)
add page to @mc
We could fix this issue by taking a reference to a per-memcg cache, but
that would require adding a per-cpu reference counter to per-memcg caches,
which would look cumbersome.
Instead, let's take a reference to a memory cgroup, which already has a
per-cpu reference counter, in the beginning of kmem_cache_alloc to be
dropped in the end, and move per memcg caches destruction from css offline
to css free. As a side effect, per-memcg caches will be destroyed not one
by one, but all at once when the last page accounted to the memory cgroup
is freed. This doesn't sound as a high price for code readability though.
Note, this patch does add some overhead to the kmem_cache_alloc hot path,
but it is pretty negligible - it's just a function call plus a per cpu
counter decrement, which is comparable to what we already have in
memcg_kmem_get_cache. Besides, it's only relevant if there are memory
cgroups with kmem accounting enabled. I don't think we can find a way to
handle this race w/o it, because alloc_page called from kmem_cache_alloc
may sleep so we can't flush all pending kmallocs w/o reference counting.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 03:56:38 +03:00
if ( ! s )
2008-12-23 13:37:01 +03:00
return NULL ;
2011-02-25 20:38:54 +03:00
redo :
/*
* Must read kmem_cache cpu data via this cpu ptr . Preemption is
* enabled . We may switch back and forth between cpus while
* reading from one cpu area . That does not matter as long
* as we end up on the original cpu again when doing the cmpxchg .
2013-01-24 01:45:48 +04:00
*
2015-02-11 01:09:32 +03:00
* We should guarantee that tid and kmem_cache are retrieved on
2019-10-15 22:18:12 +03:00
* the same cpu . It could be different if CONFIG_PREEMPTION so we need
2015-02-11 01:09:32 +03:00
* to check if it is matched or not .
2011-02-25 20:38:54 +03:00
*/
2015-02-11 01:09:32 +03:00
do {
tid = this_cpu_read ( s - > cpu_slab - > tid ) ;
c = raw_cpu_ptr ( s - > cpu_slab ) ;
2019-10-15 22:18:12 +03:00
} while ( IS_ENABLED ( CONFIG_PREEMPTION ) & &
2015-03-26 01:55:23 +03:00
unlikely ( tid ! = READ_ONCE ( c - > tid ) ) ) ;
2015-02-11 01:09:32 +03:00
/*
* Irqless object alloc / free algorithm used here depends on sequence
* of fetching cpu_slab ' s data . tid should be fetched before anything
* on c to guarantee that object and page associated with previous tid
* won ' t be used with current tid . If we fetch tid first , object and
* page could be one associated with next tid and our alloc / free
* request will be failed . In this case , we will retry . So , no problem .
*/
barrier ( ) ;
2011-02-25 20:38:54 +03:00
/*
* The transaction ids are globally unique per cpu and per operation on
* a per cpu queue . Thus they can be guarantee that the cmpxchg_double
* occurs on the right processor and that there was no operation on the
* linked list in between .
*/
2009-12-19 01:26:20 +03:00
object = c - > freelist ;
2012-05-09 19:09:59 +04:00
page = c - > page ;
mm: slub: fix ALLOC_SLOWPATH stat
There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
got bumped in that exit path. Now there are two, and a bunch of gotos.
ALLOC_SLOWPATH can now get set more than once during a single call to
__slab_alloc() which is pretty bogus. Here's the sequence:
1. Enter __slab_alloc(), fall through all the way to the
stat(s, ALLOC_SLOWPATH);
2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
new_slab (goto #1)
3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
(goto #2)
4. Fall through in the same path we did before all the way to
stat(s, ALLOC_SLOWPATH)
5. bump ALLOC_REFILL stat, then return
Doing this is obviously bogus. It keeps us from being able to
accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means
that the total number of allocs always exceeds the total number of
frees.
This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same
place that __slab_alloc() is. This makes it much less likely that
ALLOC_SLOWPATH will get botched again in the spaghetti-code inside
__slab_alloc().
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:37 +04:00
if ( unlikely ( ! object | | ! node_match ( page , node ) ) ) {
2007-10-16 12:26:05 +04:00
object = __slab_alloc ( s , gfpflags , node , addr , c ) ;
mm: slub: fix ALLOC_SLOWPATH stat
There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
got bumped in that exit path. Now there are two, and a bunch of gotos.
ALLOC_SLOWPATH can now get set more than once during a single call to
__slab_alloc() which is pretty bogus. Here's the sequence:
1. Enter __slab_alloc(), fall through all the way to the
stat(s, ALLOC_SLOWPATH);
2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
new_slab (goto #1)
3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
(goto #2)
4. Fall through in the same path we did before all the way to
stat(s, ALLOC_SLOWPATH)
5. bump ALLOC_REFILL stat, then return
Doing this is obviously bogus. It keeps us from being able to
accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means
that the total number of allocs always exceeds the total number of
frees.
This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same
place that __slab_alloc() is. This makes it much less likely that
ALLOC_SLOWPATH will get botched again in the spaghetti-code inside
__slab_alloc().
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:37 +04:00
stat ( s , ALLOC_SLOWPATH ) ;
} else {
slub: prefetch next freelist pointer in slab_alloc()
Recycling a page is a problem, since freelist link chain is hot on
cpu(s) which freed objects, and possibly very cold on cpu currently
owning slab.
Adding a prefetch of cache line containing the pointer to next object in
slab_alloc() helps a lot in many workloads, in particular on assymetric
ones (allocations done on one cpu, frees on another cpus). Added cost is
three machine instructions only.
Examples on my dual socket quad core ht machine (Intel CPU E5540
@2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
Before patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% )
28 866 491 context-switches # 0,088 M/sec ( +- 1,80% )
1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% )
127 151 page-faults # 0,000 M/sec ( +- 0,16% )
829 399 813 448 cycles # 2,532 GHz ( +- 0,64% )
580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% )
197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% )
503 548 648 975 instructions # 0,61 insns per cycle
# 1,15 stalled cycles per insn ( +- 0,46% )
95 780 068 471 branches # 292,389 M/sec ( +- 0,48% )
1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% )
20,705679994 seconds time elapsed ( +- 0,64% )
After patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% )
19 703 372 context-switches # 0,069 M/sec ( +- 4,99% )
1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% )
126 776 page-faults # 0,000 M/sec ( +- 0,12% )
724 636 593 213 cycles # 2,532 GHz ( +- 1,32% )
499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% )
156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% )
463 897 792 661 instructions # 0,64 insns per cycle
# 1,08 stalled cycles per insn ( +- 0,94% )
87 717 352 563 branches # 306,451 M/sec ( +- 0,99% )
941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% )
18,132070670 seconds time elapsed ( +- 1,30% )
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-12-16 19:25:34 +04:00
void * next_object = get_freepointer_safe ( s , object ) ;
2011-02-25 20:38:54 +03:00
/*
2011-03-31 05:57:33 +04:00
* The cmpxchg will only match if there was no additional
2011-02-25 20:38:54 +03:00
* operation and if we are on the right processor .
*
2013-07-15 05:05:29 +04:00
* The cmpxchg does the following atomically ( without lock
* semantics ! )
2011-02-25 20:38:54 +03:00
* 1. Relocate first pointer to the current per cpu area .
* 2. Verify that tid and freelist have not been changed
* 3. If they were not changed replace tid and freelist
*
2013-07-15 05:05:29 +04:00
* Since this is without lock semantics the protection is only
* against code executing on this cpu * not * from access by
* other cpus .
2011-02-25 20:38:54 +03:00
*/
2011-12-22 21:58:51 +04:00
if ( unlikely ( ! this_cpu_cmpxchg_double (
2011-02-25 20:38:54 +03:00
s - > cpu_slab - > freelist , s - > cpu_slab - > tid ,
object , tid ,
slub: prefetch next freelist pointer in slab_alloc()
Recycling a page is a problem, since freelist link chain is hot on
cpu(s) which freed objects, and possibly very cold on cpu currently
owning slab.
Adding a prefetch of cache line containing the pointer to next object in
slab_alloc() helps a lot in many workloads, in particular on assymetric
ones (allocations done on one cpu, frees on another cpus). Added cost is
three machine instructions only.
Examples on my dual socket quad core ht machine (Intel CPU E5540
@2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
Before patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% )
28 866 491 context-switches # 0,088 M/sec ( +- 1,80% )
1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% )
127 151 page-faults # 0,000 M/sec ( +- 0,16% )
829 399 813 448 cycles # 2,532 GHz ( +- 0,64% )
580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% )
197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% )
503 548 648 975 instructions # 0,61 insns per cycle
# 1,15 stalled cycles per insn ( +- 0,46% )
95 780 068 471 branches # 292,389 M/sec ( +- 0,48% )
1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% )
20,705679994 seconds time elapsed ( +- 0,64% )
After patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% )
19 703 372 context-switches # 0,069 M/sec ( +- 4,99% )
1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% )
126 776 page-faults # 0,000 M/sec ( +- 0,12% )
724 636 593 213 cycles # 2,532 GHz ( +- 1,32% )
499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% )
156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% )
463 897 792 661 instructions # 0,64 insns per cycle
# 1,08 stalled cycles per insn ( +- 0,94% )
87 717 352 563 branches # 306,451 M/sec ( +- 0,99% )
941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% )
18,132070670 seconds time elapsed ( +- 1,30% )
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-12-16 19:25:34 +04:00
next_object , next_tid ( tid ) ) ) ) {
2011-02-25 20:38:54 +03:00
note_cmpxchg_failure ( " slab_alloc " , s , tid ) ;
goto redo ;
}
slub: prefetch next freelist pointer in slab_alloc()
Recycling a page is a problem, since freelist link chain is hot on
cpu(s) which freed objects, and possibly very cold on cpu currently
owning slab.
Adding a prefetch of cache line containing the pointer to next object in
slab_alloc() helps a lot in many workloads, in particular on assymetric
ones (allocations done on one cpu, frees on another cpus). Added cost is
three machine instructions only.
Examples on my dual socket quad core ht machine (Intel CPU E5540
@2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
Before patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% )
28 866 491 context-switches # 0,088 M/sec ( +- 1,80% )
1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% )
127 151 page-faults # 0,000 M/sec ( +- 0,16% )
829 399 813 448 cycles # 2,532 GHz ( +- 0,64% )
580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% )
197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% )
503 548 648 975 instructions # 0,61 insns per cycle
# 1,15 stalled cycles per insn ( +- 0,46% )
95 780 068 471 branches # 292,389 M/sec ( +- 0,48% )
1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% )
20,705679994 seconds time elapsed ( +- 0,64% )
After patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% )
19 703 372 context-switches # 0,069 M/sec ( +- 4,99% )
1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% )
126 776 page-faults # 0,000 M/sec ( +- 0,12% )
724 636 593 213 cycles # 2,532 GHz ( +- 1,32% )
499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% )
156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% )
463 897 792 661 instructions # 0,64 insns per cycle
# 1,08 stalled cycles per insn ( +- 0,94% )
87 717 352 563 branches # 306,451 M/sec ( +- 0,99% )
941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% )
18,132070670 seconds time elapsed ( +- 1,30% )
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-12-16 19:25:34 +04:00
prefetch_freepointer ( s , next_object ) ;
2009-12-19 01:26:23 +03:00
stat ( s , ALLOC_FASTPATH ) ;
2007-05-10 14:15:16 +04:00
}
2019-10-15 00:11:57 +03:00
maybe_wipe_obj_freeptr ( s , object ) ;
2011-02-25 20:38:54 +03:00
mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
Patch series "add init_on_alloc/init_on_free boot options", v10.
Provide init_on_alloc and init_on_free boot options.
These are aimed at preventing possible information leaks and making the
control-flow bugs that depend on uninitialized values more deterministic.
Enabling either of the options guarantees that the memory returned by the
page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
isn't supported at the moment, as its emulation of kmem caches complicates
handling of SLAB_TYPESAFE_BY_RCU caches correctly.
Enabling init_on_free also guarantees that pages and heap objects are
initialized right after they're freed, so it won't be possible to access
stale data by using a dangling pointer.
As suggested by Michal Hocko, right now we don't let the heap users to
disable initialization for certain allocations. There's not enough
evidence that doing so can speed up real-life cases, and introducing ways
to opt-out may result in things going out of control.
This patch (of 2):
The new options are needed to prevent possible information leaks and make
control-flow bugs that depend on uninitialized values more deterministic.
This is expected to be on-by-default on Android and Chrome OS. And it
gives the opportunity for anyone else to use it under distros too via the
boot args. (The init_on_free feature is regularly requested by folks
where memory forensics is included in their threat models.)
init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes. Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.
init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion. This helps to ensure sensitive data
doesn't leak via use-after-free accesses.
Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory. The two exceptions are slab caches with
constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
zero-initialized to preserve their semantics.
Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.
If either SLUB poisoning or page poisoning is enabled, those options take
precedence over init_on_alloc and init_on_free: initialization is only
applied to unpoisoned allocations.
Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
is within the standard error.
The new features are also going to pave the way for hardware memory
tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
hooks to set the tags for heap objects. With MTE, tagging will have the
same cost as memory initialization.
Although init_on_free is rather costly, there are paranoid use-cases where
in-memory data lifetime is desired to be minimized. There are various
arguments for/against the realism of the associated threat models, but
given that we'll need the infrastructure for MTE anyway, and there are
people who want wipe-on-free behavior no matter what the performance cost,
it seems reasonable to include it in this series.
[glider@google.com: v8]
Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
[glider@google.com: v10]
Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts
Acked-by: James Morris <jamorris@linux.microsoft.com>]
Cc: Christoph Lameter <cl@linux.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 06:59:19 +03:00
if ( unlikely ( slab_want_init_on_alloc ( gfpflags , s ) ) & & object )
2012-06-13 19:24:57 +04:00
memset ( object , 0 , s - > object_size ) ;
2007-07-17 15:03:23 +04:00
2020-08-07 09:20:56 +03:00
slab_post_alloc_hook ( s , objcg , gfpflags , 1 , & object ) ;
2008-04-04 02:54:48 +04:00
2007-05-10 14:15:16 +04:00
return object ;
2007-05-07 01:49:36 +04:00
}
2012-09-09 00:47:58 +04:00
static __always_inline void * slab_alloc ( struct kmem_cache * s ,
gfp_t gfpflags , unsigned long addr )
{
return slab_alloc_node ( s , gfpflags , NUMA_NO_NODE , addr ) ;
}
2007-05-07 01:49:36 +04:00
void * kmem_cache_alloc ( struct kmem_cache * s , gfp_t gfpflags )
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc ( s , gfpflags , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2013-07-15 05:05:29 +04:00
trace_kmem_cache_alloc ( _RET_IP_ , ret , s - > object_size ,
s - > size , gfpflags ) ;
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kmem_cache_alloc ) ;
2009-12-11 10:45:30 +03:00
# ifdef CONFIG_TRACING
2010-10-21 13:29:19 +04:00
void * kmem_cache_alloc_trace ( struct kmem_cache * s , gfp_t gfpflags , size_t size )
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc ( s , gfpflags , _RET_IP_ ) ;
2010-10-21 13:29:19 +04:00
trace_kmalloc ( _RET_IP_ , ret , size , s - > size , gfpflags ) ;
kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v13.
This patchset adds a new software tag-based mode to KASAN [1]. (Initially
this mode was called KHWASAN, but it got renamed, see the naming rationale
at the end of this section).
The plan is to implement HWASan [2] for the kernel with the incentive,
that it's going to have comparable to KASAN performance, but in the same
time consume much less memory, trading that off for somewhat imprecise bug
detection and being supported only for arm64.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error.
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
to dereference such pointers. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which
would use hardware memory tagging support (announced by Arm [3]) instead
of compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
[2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
====== Rationale
On mobile devices generic KASAN's memory usage is significant problem.
One of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life
miserable.
[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
====== Technical details
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks. Either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has it's own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses, that fall into the
same shadow cell, as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that there's a particular type of bugs that tag-based KASAN can
detect compared to generic KASAN: use-after-free after the object has been
allocated by someone else.
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass and a kernel module that would print a bug report whenever
two pointers with different tags are being compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type) has been used. Then the kernel has been
booted in QEMU and on an Odroid C2 board and syzkaller has been run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure debug checks to
those two functions that check that the result doesn't change whether we
operate on pointers with or without untagging has been added.
A few other cases that don't look that interesting:
Comparing pointers to achieve unique sorting order of pointee objects
(e.g. sorting locks addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing way (use until fails) is
the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (the whole reserved once during boot).
3. Quaratine (grows gradually until some preset limit; the more the limit,
the more the chance to detect a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:29:37 +03:00
ret = kasan_kmalloc ( s , ret , size , gfpflags ) ;
2010-10-21 13:29:19 +04:00
return ret ;
}
EXPORT_SYMBOL ( kmem_cache_alloc_trace ) ;
2008-08-19 21:43:26 +04:00
# endif
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_NUMA
void * kmem_cache_alloc_node ( struct kmem_cache * s , gfp_t gfpflags , int node )
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc_node ( s , gfpflags , node , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2009-03-23 16:12:24 +03:00
trace_kmem_cache_alloc_node ( _RET_IP_ , ret ,
2012-06-13 19:24:57 +04:00
s - > object_size , s - > size , gfpflags , node ) ;
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kmem_cache_alloc_node ) ;
2009-12-11 10:45:30 +03:00
# ifdef CONFIG_TRACING
2010-10-21 13:29:19 +04:00
void * kmem_cache_alloc_node_trace ( struct kmem_cache * s ,
2008-08-19 21:43:26 +04:00
gfp_t gfpflags ,
2010-10-21 13:29:19 +04:00
int node , size_t size )
2008-08-19 21:43:26 +04:00
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc_node ( s , gfpflags , node , _RET_IP_ ) ;
2010-10-21 13:29:19 +04:00
trace_kmalloc_node ( _RET_IP_ , ret ,
size , s - > size , gfpflags , node ) ;
2015-02-14 01:39:42 +03:00
kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v13.
This patchset adds a new software tag-based mode to KASAN [1]. (Initially
this mode was called KHWASAN, but it got renamed, see the naming rationale
at the end of this section).
The plan is to implement HWASan [2] for the kernel with the incentive,
that it's going to have comparable to KASAN performance, but in the same
time consume much less memory, trading that off for somewhat imprecise bug
detection and being supported only for arm64.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error.
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
to dereference such pointers. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which
would use hardware memory tagging support (announced by Arm [3]) instead
of compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
[2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
====== Rationale
On mobile devices generic KASAN's memory usage is significant problem.
One of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life
miserable.
[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
====== Technical details
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks. Either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has it's own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses, that fall into the
same shadow cell, as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that there's a particular type of bugs that tag-based KASAN can
detect compared to generic KASAN: use-after-free after the object has been
allocated by someone else.
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass and a kernel module that would print a bug report whenever
two pointers with different tags are being compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type) has been used. Then the kernel has been
booted in QEMU and on an Odroid C2 board and syzkaller has been run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure debug checks to
those two functions that check that the result doesn't change whether we
operate on pointers with or without untagging has been added.
A few other cases that don't look that interesting:
Comparing pointers to achieve unique sorting order of pointee objects
(e.g. sorting locks addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing way (use until fails) is
the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (the whole reserved once during boot).
3. Quaratine (grows gradually until some preset limit; the more the limit,
the more the chance to detect a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:29:37 +03:00
ret = kasan_kmalloc ( s , ret , size , gfpflags ) ;
2010-10-21 13:29:19 +04:00
return ret ;
2008-08-19 21:43:26 +04:00
}
2010-10-21 13:29:19 +04:00
EXPORT_SYMBOL ( kmem_cache_alloc_node_trace ) ;
2008-08-19 21:43:26 +04:00
# endif
2019-05-14 03:16:09 +03:00
# endif /* CONFIG_NUMA */
2008-08-19 21:43:26 +04:00
2007-05-07 01:49:36 +04:00
/*
2015-02-11 01:09:37 +03:00
* Slow path handling . This may still be called frequently since objects
2007-05-10 14:15:16 +04:00
* have a longer lifetime than the cpu slabs in most processing loads .
2007-05-07 01:49:36 +04:00
*
2007-05-10 14:15:16 +04:00
* So we still attempt to reduce cache line usage . Just take the slab
* lock and free the item . If there is no additional partial page
* handling required then we can return immediately .
2007-05-07 01:49:36 +04:00
*/
2007-05-10 14:15:16 +04:00
static void __slab_free ( struct kmem_cache * s , struct page * page ,
2015-11-21 02:57:46 +03:00
void * head , void * tail , int cnt ,
unsigned long addr )
2007-05-07 01:49:36 +04:00
{
void * prior ;
2011-06-01 21:25:52 +04:00
int was_frozen ;
struct page new ;
unsigned long counters ;
struct kmem_cache_node * n = NULL ;
treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-03 23:09:38 +03:00
unsigned long flags ;
2007-05-07 01:49:36 +04:00
2011-02-25 20:38:54 +03:00
stat ( s , FREE_SLOWPATH ) ;
2007-05-07 01:49:36 +04:00
2012-05-30 21:54:46 +04:00
if ( kmem_cache_debug ( s ) & &
2016-03-16 00:54:59 +03:00
! free_debug_processing ( s , page , head , tail , cnt , addr ) )
2011-06-01 21:25:55 +04:00
return ;
2008-02-16 10:45:26 +03:00
2011-06-01 21:25:52 +04:00
do {
slub: remove one code path and reduce lock contention in __slab_free()
When we try to free object, there is some of case that we need
to take a node lock. This is the necessary step for preventing a race.
After taking a lock, then we try to cmpxchg_double_slab().
But, there is a possible scenario that cmpxchg_double_slab() is failed
with taking a lock. Following example explains it.
CPU A CPU B
need lock
... need lock
... lock!!
lock..but spin free success
spin... unlock
lock!!
free fail
In this case, retry with taking a lock is occured in CPU A.
I think that in this case for CPU A,
"release a lock first, and re-take a lock if necessary" is preferable way.
There are two reasons for this.
First, this makes __slab_free()'s logic somehow simple.
With this patch, 'was_frozen = 1' is "always" handled without taking a lock.
So we can remove one code path.
Second, it may reduce lock contention.
When we do retrying, status of slab is already changed,
so we don't need a lock anymore in almost every case.
"release a lock first, and re-take a lock if necessary" policy is
helpful to this.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-08-15 19:02:40 +04:00
if ( unlikely ( n ) ) {
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
n = NULL ;
}
2011-06-01 21:25:52 +04:00
prior = page - > freelist ;
counters = page - > counters ;
2015-11-21 02:57:46 +03:00
set_freepointer ( s , tail , prior ) ;
2011-06-01 21:25:52 +04:00
new . counters = counters ;
was_frozen = new . frozen ;
2015-11-21 02:57:46 +03:00
new . inuse - = cnt ;
slub: remove one code path and reduce lock contention in __slab_free()
When we try to free object, there is some of case that we need
to take a node lock. This is the necessary step for preventing a race.
After taking a lock, then we try to cmpxchg_double_slab().
But, there is a possible scenario that cmpxchg_double_slab() is failed
with taking a lock. Following example explains it.
CPU A CPU B
need lock
... need lock
... lock!!
lock..but spin free success
spin... unlock
lock!!
free fail
In this case, retry with taking a lock is occured in CPU A.
I think that in this case for CPU A,
"release a lock first, and re-take a lock if necessary" is preferable way.
There are two reasons for this.
First, this makes __slab_free()'s logic somehow simple.
With this patch, 'was_frozen = 1' is "always" handled without taking a lock.
So we can remove one code path.
Second, it may reduce lock contention.
When we do retrying, status of slab is already changed,
so we don't need a lock anymore in almost every case.
"release a lock first, and re-take a lock if necessary" policy is
helpful to this.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-08-15 19:02:40 +04:00
if ( ( ! new . inuse | | ! prior ) & & ! was_frozen ) {
2011-08-10 01:12:27 +04:00
2014-01-10 16:23:49 +04:00
if ( kmem_cache_has_cpu_partial ( s ) & & ! prior ) {
2011-08-10 01:12:27 +04:00
/*
2013-07-15 05:05:29 +04:00
* Slab was on no list before and will be
* partially empty
* We can defer the list move and instead
* freeze it .
2011-08-10 01:12:27 +04:00
*/
new . frozen = 1 ;
2014-01-10 16:23:49 +04:00
} else { /* Needs to be taken off a list */
2011-08-10 01:12:27 +04:00
2014-12-11 02:42:13 +03:00
n = get_node ( s , page_to_nid ( page ) ) ;
2011-08-10 01:12:27 +04:00
/*
* Speculatively acquire the list_lock .
* If the cmpxchg does not succeed then we may
* drop the list_lock without any processing .
*
* Otherwise the list_lock will synchronize with
* other processors updating the list of slabs .
*/
spin_lock_irqsave ( & n - > list_lock , flags ) ;
}
2011-06-01 21:25:52 +04:00
}
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:52 +04:00
} while ( ! cmpxchg_double_slab ( s , page ,
prior , counters ,
2015-11-21 02:57:46 +03:00
head , new . counters ,
2011-06-01 21:25:52 +04:00
" __slab_free " ) ) ;
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:52 +04:00
if ( likely ( ! n ) ) {
2011-08-10 01:12:27 +04:00
/*
* If we just froze the page then put it onto the
* per cpu partial list .
*/
2012-02-03 19:34:56 +04:00
if ( new . frozen & & ! was_frozen ) {
2011-08-10 01:12:27 +04:00
put_cpu_partial ( s , page , 1 ) ;
2012-02-03 19:34:56 +04:00
stat ( s , CPU_PARTIAL_FREE ) ;
}
2011-08-10 01:12:27 +04:00
/*
2011-06-01 21:25:52 +04:00
* The list lock was not taken therefore no list
* activity can be necessary .
*/
2014-12-11 02:42:13 +03:00
if ( was_frozen )
stat ( s , FREE_FROZEN ) ;
return ;
}
2007-05-07 01:49:36 +04:00
2014-07-03 02:22:35 +04:00
if ( unlikely ( ! new . inuse & & n - > nr_partial > = s - > min_partial ) )
slub: remove one code path and reduce lock contention in __slab_free()
When we try to free object, there is some of case that we need
to take a node lock. This is the necessary step for preventing a race.
After taking a lock, then we try to cmpxchg_double_slab().
But, there is a possible scenario that cmpxchg_double_slab() is failed
with taking a lock. Following example explains it.
CPU A CPU B
need lock
... need lock
... lock!!
lock..but spin free success
spin... unlock
lock!!
free fail
In this case, retry with taking a lock is occured in CPU A.
I think that in this case for CPU A,
"release a lock first, and re-take a lock if necessary" is preferable way.
There are two reasons for this.
First, this makes __slab_free()'s logic somehow simple.
With this patch, 'was_frozen = 1' is "always" handled without taking a lock.
So we can remove one code path.
Second, it may reduce lock contention.
When we do retrying, status of slab is already changed,
so we don't need a lock anymore in almost every case.
"release a lock first, and re-take a lock if necessary" policy is
helpful to this.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-08-15 19:02:40 +04:00
goto slab_empty ;
2007-05-07 01:49:36 +04:00
/*
slub: remove one code path and reduce lock contention in __slab_free()
When we try to free object, there is some of case that we need
to take a node lock. This is the necessary step for preventing a race.
After taking a lock, then we try to cmpxchg_double_slab().
But, there is a possible scenario that cmpxchg_double_slab() is failed
with taking a lock. Following example explains it.
CPU A CPU B
need lock
... need lock
... lock!!
lock..but spin free success
spin... unlock
lock!!
free fail
In this case, retry with taking a lock is occured in CPU A.
I think that in this case for CPU A,
"release a lock first, and re-take a lock if necessary" is preferable way.
There are two reasons for this.
First, this makes __slab_free()'s logic somehow simple.
With this patch, 'was_frozen = 1' is "always" handled without taking a lock.
So we can remove one code path.
Second, it may reduce lock contention.
When we do retrying, status of slab is already changed,
so we don't need a lock anymore in almost every case.
"release a lock first, and re-take a lock if necessary" policy is
helpful to this.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-08-15 19:02:40 +04:00
* Objects left in the slab . If it was not on the partial list before
* then add it .
2007-05-07 01:49:36 +04:00
*/
2013-06-19 09:05:52 +04:00
if ( ! kmem_cache_has_cpu_partial ( s ) & & unlikely ( ! prior ) ) {
2019-05-14 03:16:22 +03:00
remove_full ( s , n , page ) ;
slub: remove one code path and reduce lock contention in __slab_free()
When we try to free object, there is some of case that we need
to take a node lock. This is the necessary step for preventing a race.
After taking a lock, then we try to cmpxchg_double_slab().
But, there is a possible scenario that cmpxchg_double_slab() is failed
with taking a lock. Following example explains it.
CPU A CPU B
need lock
... need lock
... lock!!
lock..but spin free success
spin... unlock
lock!!
free fail
In this case, retry with taking a lock is occured in CPU A.
I think that in this case for CPU A,
"release a lock first, and re-take a lock if necessary" is preferable way.
There are two reasons for this.
First, this makes __slab_free()'s logic somehow simple.
With this patch, 'was_frozen = 1' is "always" handled without taking a lock.
So we can remove one code path.
Second, it may reduce lock contention.
When we do retrying, status of slab is already changed,
so we don't need a lock anymore in almost every case.
"release a lock first, and re-take a lock if necessary" policy is
helpful to this.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-08-15 19:02:40 +04:00
add_partial ( n , page , DEACTIVATE_TO_TAIL ) ;
stat ( s , FREE_ADD_PARTIAL ) ;
2008-02-08 04:47:41 +03:00
}
2011-06-01 21:25:55 +04:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2007-05-07 01:49:36 +04:00
return ;
slab_empty :
2008-03-02 00:40:44 +03:00
if ( prior ) {
2007-05-07 01:49:36 +04:00
/*
2011-08-08 20:16:56 +04:00
* Slab on the partial list .
2007-05-07 01:49:36 +04:00
*/
2011-06-01 21:25:50 +04:00
remove_partial ( n , page ) ;
2009-12-19 01:26:23 +03:00
stat ( s , FREE_REMOVE_PARTIAL ) ;
2014-01-10 16:23:49 +04:00
} else {
2011-08-08 20:16:56 +04:00
/* Slab must be on the full list */
2014-01-10 16:23:49 +04:00
remove_full ( s , n , page ) ;
}
2011-06-01 21:25:52 +04:00
2011-06-01 21:25:55 +04:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2009-12-19 01:26:23 +03:00
stat ( s , FREE_SLAB ) ;
2007-05-07 01:49:36 +04:00
discard_slab ( s , page ) ;
}
2007-05-10 14:15:16 +04:00
/*
* Fastpath with forced inlining to produce a kfree and kmem_cache_free that
* can perform fastpath freeing without additional function calls .
*
* The fastpath is only possible if we are freeing to the current cpu slab
* of this processor . This typically the case if we have just allocated
* the item before .
*
* If fastpath is not possible then fall back to __slab_free where we deal
* with all sorts of special processing .
2015-11-21 02:57:46 +03:00
*
* Bulk free of a freelist with several objects ( all pointing to the
* same page ) possible by specifying head and tail ptr , plus objects
* count ( cnt ) . Bulk free indicated by tail pointer being set .
2007-05-10 14:15:16 +04:00
*/
2016-07-29 01:49:07 +03:00
static __always_inline void do_slab_free ( struct kmem_cache * s ,
struct page * page , void * head , void * tail ,
int cnt , unsigned long addr )
2007-05-10 14:15:16 +04:00
{
2015-11-21 02:57:46 +03:00
void * tail_obj = tail ? : head ;
2007-10-16 12:26:05 +04:00
struct kmem_cache_cpu * c ;
2011-02-25 20:38:54 +03:00
unsigned long tid ;
2020-08-07 09:20:56 +03:00
memcg_slab_free_hook ( s , page , head ) ;
2011-02-25 20:38:54 +03:00
redo :
/*
* Determine the currently cpus per cpu slab .
* The cpu may change afterward . However that does not matter since
* data is retrieved via this pointer . If we are on the same cpu
2015-09-05 01:45:31 +03:00
* during the cmpxchg then the free will succeed .
2011-02-25 20:38:54 +03:00
*/
2015-02-11 01:09:32 +03:00
do {
tid = this_cpu_read ( s - > cpu_slab - > tid ) ;
c = raw_cpu_ptr ( s - > cpu_slab ) ;
2019-10-15 22:18:12 +03:00
} while ( IS_ENABLED ( CONFIG_PREEMPTION ) & &
2015-03-26 01:55:23 +03:00
unlikely ( tid ! = READ_ONCE ( c - > tid ) ) ) ;
2010-08-20 21:37:16 +04:00
2015-02-11 01:09:32 +03:00
/* Same with comment on barrier() in slab_alloc_node() */
barrier ( ) ;
2010-08-20 21:37:16 +04:00
2011-05-18 01:29:31 +04:00
if ( likely ( page = = c - > page ) ) {
2020-03-17 21:04:09 +03:00
void * * freelist = READ_ONCE ( c - > freelist ) ;
set_freepointer ( s , tail_obj , freelist ) ;
2011-02-25 20:38:54 +03:00
2011-12-22 21:58:51 +04:00
if ( unlikely ( ! this_cpu_cmpxchg_double (
2011-02-25 20:38:54 +03:00
s - > cpu_slab - > freelist , s - > cpu_slab - > tid ,
2020-03-17 21:04:09 +03:00
freelist , tid ,
2015-11-21 02:57:46 +03:00
head , next_tid ( tid ) ) ) ) {
2011-02-25 20:38:54 +03:00
note_cmpxchg_failure ( " slab_free " , s , tid ) ;
goto redo ;
}
2009-12-19 01:26:23 +03:00
stat ( s , FREE_FASTPATH ) ;
2007-05-10 14:15:16 +04:00
} else
2015-11-21 02:57:46 +03:00
__slab_free ( s , page , head , tail_obj , cnt , addr ) ;
2007-05-10 14:15:16 +04:00
}
2016-07-29 01:49:07 +03:00
static __always_inline void slab_free ( struct kmem_cache * s , struct page * page ,
void * head , void * tail , int cnt ,
unsigned long addr )
{
/*
2018-04-11 02:30:31 +03:00
* With KASAN enabled slab_free_freelist_hook modifies the freelist
* to remove objects , whose reuse must be delayed .
2016-07-29 01:49:07 +03:00
*/
2018-04-11 02:30:31 +03:00
if ( slab_free_freelist_hook ( s , & head , & tail ) )
do_slab_free ( s , page , head , tail , cnt , addr ) ;
2016-07-29 01:49:07 +03:00
}
2018-12-28 11:29:53 +03:00
# ifdef CONFIG_KASAN_GENERIC
2016-07-29 01:49:07 +03:00
void ___cache_free ( struct kmem_cache * cache , void * x , unsigned long addr )
{
do_slab_free ( cache , virt_to_head_page ( x ) , x , NULL , 1 , addr ) ;
}
# endif
2007-05-07 01:49:36 +04:00
void kmem_cache_free ( struct kmem_cache * s , void * x )
{
2012-12-19 02:22:46 +04:00
s = cache_from_obj ( s , x ) ;
if ( ! s )
2012-09-05 03:06:14 +04:00
return ;
2015-11-21 02:57:46 +03:00
slab_free ( s , virt_to_head_page ( x ) , x , NULL , 1 , _RET_IP_ ) ;
2009-03-23 16:12:24 +03:00
trace_kmem_cache_free ( _RET_IP_ , x ) ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kmem_cache_free ) ;
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
struct detached_freelist {
2015-09-05 01:45:43 +03:00
struct page * page ;
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
void * tail ;
void * freelist ;
int cnt ;
2016-03-16 00:53:32 +03:00
struct kmem_cache * s ;
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
} ;
2015-09-05 01:45:43 +03:00
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
/*
* This function progressively scans the array with free objects ( with
* a limited look ahead ) and extract objects belonging to the same
* page . It builds a detached freelist directly within the given
* page / objects . This can happen without any need for
* synchronization , because the objects are owned by running process .
* The freelist is build up as a single linked list in the objects .
* The idea is , that this detached freelist can then be bulk
* transferred to the real freelist ( s ) , but only requiring a single
* synchronization primitive . Look ahead in the array is limited due
* to performance reasons .
*/
2016-03-16 00:53:32 +03:00
static inline
int build_detached_freelist ( struct kmem_cache * s , size_t size ,
void * * p , struct detached_freelist * df )
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
{
size_t first_skipped_index = 0 ;
int lookahead = 3 ;
void * object ;
2016-03-16 00:54:00 +03:00
struct page * page ;
2015-09-05 01:45:43 +03:00
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
/* Always re-init detached_freelist */
df - > page = NULL ;
2015-09-05 01:45:43 +03:00
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
do {
object = p [ - - size ] ;
2016-03-16 00:54:00 +03:00
/* Do we need !ZERO_OR_NULL_PTR(object) here? (for kfree) */
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
} while ( ! object & & size ) ;
2015-09-05 01:45:45 +03:00
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
if ( ! object )
return 0 ;
2015-09-05 01:45:43 +03:00
2016-03-16 00:54:00 +03:00
page = virt_to_head_page ( object ) ;
if ( ! s ) {
/* Handle kalloc'ed objects */
if ( unlikely ( ! PageSlab ( page ) ) ) {
BUG_ON ( ! PageCompound ( page ) ) ;
kfree_hook ( object ) ;
mm: charge/uncharge kmemcg from generic page allocator paths
Currently, to charge a non-slab allocation to kmemcg one has to use
alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
this helper should finally be freed using free_kmem_pages, otherwise it
won't be uncharged.
This API suits its current users fine, but it turns out to be impossible
to use along with page reference counting, i.e. when an allocation is
supposed to be freed with put_page, as it is the case with pipe or unix
socket buffers.
To overcome this limitation, this patch moves charging/uncharging to
generic page allocator paths, i.e. to __alloc_pages_nodemask and
free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
one can use any of the available page allocation functions to get the
allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
just like in case of kmalloc and friends. A charged page will be
automatically uncharged on free.
To make it possible, we need to mark pages charged to kmemcg somehow.
To avoid introducing a new page flag, we make use of page->_mapcount for
marking such pages. Since pages charged to kmemcg are not supposed to
be mapped to userspace, it should work just fine. There are other
(ab)users of page->_mapcount - buddy and balloon pages - but we don't
conflict with them.
In case kmemcg is compiled out or not used at runtime, this patch
introduces no overhead to generic page allocator paths. If kmemcg is
used, it will be plus one gfp flags check on alloc and plus one
page->_mapcount check on free, which shouldn't hurt performance, because
the data accessed are hot.
Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-27 01:24:24 +03:00
__free_pages ( page , compound_order ( page ) ) ;
2016-03-16 00:54:00 +03:00
p [ size ] = NULL ; /* mark object processed */
return size ;
}
/* Derive kmem_cache from object */
df - > s = page - > slab_cache ;
} else {
df - > s = cache_from_obj ( s , object ) ; /* Support for memcg */
}
2016-03-16 00:53:32 +03:00
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
/* Start new detached freelist */
2016-03-16 00:54:00 +03:00
df - > page = page ;
2016-03-16 00:53:32 +03:00
set_freepointer ( df - > s , object , NULL ) ;
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
df - > tail = object ;
df - > freelist = object ;
p [ size ] = NULL ; /* mark object processed */
df - > cnt = 1 ;
while ( size ) {
object = p [ - - size ] ;
if ( ! object )
continue ; /* Skip processed objects */
/* df->page is always set at this point */
if ( df - > page = = virt_to_head_page ( object ) ) {
/* Opportunity build freelist */
2016-03-16 00:53:32 +03:00
set_freepointer ( df - > s , object , df - > freelist ) ;
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
df - > freelist = object ;
df - > cnt + + ;
p [ size ] = NULL ; /* mark object processed */
continue ;
2015-09-05 01:45:43 +03:00
}
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
/* Limit look ahead search */
if ( ! - - lookahead )
break ;
if ( ! first_skipped_index )
first_skipped_index = size + 1 ;
2015-09-05 01:45:43 +03:00
}
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
return first_skipped_index ;
}
/* Note that interrupts must be enabled when calling this function. */
2016-03-16 00:53:32 +03:00
void kmem_cache_free_bulk ( struct kmem_cache * s , size_t size , void * * p )
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
{
if ( WARN_ON ( ! size ) )
return ;
do {
struct detached_freelist df ;
size = build_detached_freelist ( s , size , p , & df ) ;
2016-12-13 03:41:35 +03:00
if ( ! df . page )
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
continue ;
2016-03-16 00:53:32 +03:00
slab_free ( df . s , df . page , df . freelist , df . tail , df . cnt , _RET_IP_ ) ;
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
} while ( likely ( size ) ) ;
2015-09-05 01:45:34 +03:00
}
EXPORT_SYMBOL ( kmem_cache_free_bulk ) ;
2015-09-05 01:45:37 +03:00
/* Note that interrupts must be enabled when calling this function. */
2015-11-21 02:57:58 +03:00
int kmem_cache_alloc_bulk ( struct kmem_cache * s , gfp_t flags , size_t size ,
void * * p )
2015-09-05 01:45:34 +03:00
{
2015-09-05 01:45:37 +03:00
struct kmem_cache_cpu * c ;
int i ;
2020-08-07 09:20:56 +03:00
struct obj_cgroup * objcg = NULL ;
2015-09-05 01:45:37 +03:00
2015-11-21 02:57:52 +03:00
/* memcg and kmem_cache debug support */
2020-08-07 09:20:56 +03:00
s = slab_pre_alloc_hook ( s , & objcg , size , flags ) ;
2015-11-21 02:57:52 +03:00
if ( unlikely ( ! s ) )
return false ;
2015-09-05 01:45:37 +03:00
/*
* Drain objects in the per cpu slab , while disabling local
* IRQs , which protects against PREEMPT and interrupts
* handlers invoking normal fastpath .
*/
local_irq_disable ( ) ;
c = this_cpu_ptr ( s - > cpu_slab ) ;
for ( i = 0 ; i < size ; i + + ) {
void * object = c - > freelist ;
2015-09-05 01:45:40 +03:00
if ( unlikely ( ! object ) ) {
2020-03-17 03:28:45 +03:00
/*
* We may have removed an object from c - > freelist using
* the fastpath in the previous iteration ; in that case ,
* c - > tid has not been bumped yet .
* Since ___slab_alloc ( ) may reenable interrupts while
* allocating memory , we should bump c - > tid now .
*/
c - > tid = next_tid ( c - > tid ) ;
2015-09-05 01:45:40 +03:00
/*
* Invoking slow path likely have side - effect
* of re - populating per CPU c - > freelist
*/
2015-11-21 02:57:38 +03:00
p [ i ] = ___slab_alloc ( s , flags , NUMA_NO_NODE ,
2015-09-05 01:45:40 +03:00
_RET_IP_ , c ) ;
2015-11-21 02:57:38 +03:00
if ( unlikely ( ! p [ i ] ) )
goto error ;
2015-09-05 01:45:40 +03:00
c = this_cpu_ptr ( s - > cpu_slab ) ;
2019-10-15 00:11:57 +03:00
maybe_wipe_obj_freeptr ( s , p [ i ] ) ;
2015-09-05 01:45:40 +03:00
continue ; /* goto for-loop */
}
2015-09-05 01:45:37 +03:00
c - > freelist = get_freepointer ( s , object ) ;
p [ i ] = object ;
2019-10-15 00:11:57 +03:00
maybe_wipe_obj_freeptr ( s , p [ i ] ) ;
2015-09-05 01:45:37 +03:00
}
c - > tid = next_tid ( c - > tid ) ;
local_irq_enable ( ) ;
/* Clear memory outside IRQ disabled fastpath loop */
mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
Patch series "add init_on_alloc/init_on_free boot options", v10.
Provide init_on_alloc and init_on_free boot options.
These are aimed at preventing possible information leaks and making the
control-flow bugs that depend on uninitialized values more deterministic.
Enabling either of the options guarantees that the memory returned by the
page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
isn't supported at the moment, as its emulation of kmem caches complicates
handling of SLAB_TYPESAFE_BY_RCU caches correctly.
Enabling init_on_free also guarantees that pages and heap objects are
initialized right after they're freed, so it won't be possible to access
stale data by using a dangling pointer.
As suggested by Michal Hocko, right now we don't let the heap users to
disable initialization for certain allocations. There's not enough
evidence that doing so can speed up real-life cases, and introducing ways
to opt-out may result in things going out of control.
This patch (of 2):
The new options are needed to prevent possible information leaks and make
control-flow bugs that depend on uninitialized values more deterministic.
This is expected to be on-by-default on Android and Chrome OS. And it
gives the opportunity for anyone else to use it under distros too via the
boot args. (The init_on_free feature is regularly requested by folks
where memory forensics is included in their threat models.)
init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes. Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.
init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion. This helps to ensure sensitive data
doesn't leak via use-after-free accesses.
Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory. The two exceptions are slab caches with
constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
zero-initialized to preserve their semantics.
Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.
If either SLUB poisoning or page poisoning is enabled, those options take
precedence over init_on_alloc and init_on_free: initialization is only
applied to unpoisoned allocations.
Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
is within the standard error.
The new features are also going to pave the way for hardware memory
tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
hooks to set the tags for heap objects. With MTE, tagging will have the
same cost as memory initialization.
Although init_on_free is rather costly, there are paranoid use-cases where
in-memory data lifetime is desired to be minimized. There are various
arguments for/against the realism of the associated threat models, but
given that we'll need the infrastructure for MTE anyway, and there are
people who want wipe-on-free behavior no matter what the performance cost,
it seems reasonable to include it in this series.
[glider@google.com: v8]
Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
[glider@google.com: v10]
Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts
Acked-by: James Morris <jamorris@linux.microsoft.com>]
Cc: Christoph Lameter <cl@linux.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 06:59:19 +03:00
if ( unlikely ( slab_want_init_on_alloc ( flags , s ) ) ) {
2015-09-05 01:45:37 +03:00
int j ;
for ( j = 0 ; j < i ; j + + )
memset ( p [ j ] , 0 , s - > object_size ) ;
}
2015-11-21 02:57:52 +03:00
/* memcg and kmem_cache debug support */
2020-08-07 09:20:56 +03:00
slab_post_alloc_hook ( s , objcg , flags , size , p ) ;
2015-11-21 02:57:58 +03:00
return i ;
2015-11-21 02:57:38 +03:00
error :
local_irq_enable ( ) ;
2020-08-07 09:20:56 +03:00
slab_post_alloc_hook ( s , objcg , flags , i , p ) ;
2015-11-21 02:57:52 +03:00
__kmem_cache_free_bulk ( s , i , p ) ;
2015-11-21 02:57:58 +03:00
return 0 ;
2015-09-05 01:45:34 +03:00
}
EXPORT_SYMBOL ( kmem_cache_alloc_bulk ) ;
2007-05-07 01:49:36 +04:00
/*
2007-05-09 13:32:39 +04:00
* Object placement in a slab is made very easy because we always start at
* offset 0. If we tune the size of the object to the alignment then we can
* get the required alignment by putting one properly sized object after
* another .
2007-05-07 01:49:36 +04:00
*
* Notice that the allocation order determines the sizes of the per cpu
* caches . Each processor has always one slab available for allocations .
* Increasing the allocation order reduces the number of times that slabs
2007-05-09 13:32:39 +04:00
* must be moved on and off the partial lists and is therefore a factor in
2007-05-07 01:49:36 +04:00
* locking overhead .
*/
/*
* Mininum / Maximum order of slab pages . This influences locking overhead
* and slab fragmentation . A higher order reduces the number of partial slabs
* and increases the number of allocations possible without having to
* take the list_lock .
*/
2018-04-06 02:21:39 +03:00
static unsigned int slub_min_order ;
static unsigned int slub_max_order = PAGE_ALLOC_COSTLY_ORDER ;
static unsigned int slub_min_objects ;
2007-05-07 01:49:36 +04:00
/*
* Calculate the order of allocation given an slab object size .
*
2007-05-09 13:32:39 +04:00
* The order of allocation has significant impact on performance and other
* system components . Generally order 0 allocations should be preferred since
* order 0 does not cause fragmentation in the page allocator . Larger objects
* be problematic to put into order 0 slabs because there may be too much
2008-04-14 20:13:29 +04:00
* unused space left . We go to a higher order if more than 1 / 16 th of the slab
2007-05-09 13:32:39 +04:00
* would be wasted .
*
* In order to reach satisfactory performance we must ensure that a minimum
* number of objects is in one slab . Otherwise we may generate too much
* activity on the partial lists which requires taking the list_lock . This is
* less a concern for large slabs though which are rarely used .
2007-05-07 01:49:36 +04:00
*
2007-05-09 13:32:39 +04:00
* slub_max_order specifies the order where we begin to stop considering the
* number of objects in a slab as critical . If we reach slub_max_order then
* we try to keep the page order as low as possible . So we accept more waste
* of space in favor of a small page order .
2007-05-07 01:49:36 +04:00
*
2007-05-09 13:32:39 +04:00
* Higher order allocations also allow the placement of more objects in a
* slab and thereby reduce object handling overhead . If the user has
* requested a higher mininum order then we start with that one instead of
* the smallest order which will fit the object .
2007-05-07 01:49:36 +04:00
*/
2018-04-06 02:21:39 +03:00
static inline unsigned int slab_order ( unsigned int size ,
unsigned int min_objects , unsigned int max_order ,
2018-06-08 03:09:10 +03:00
unsigned int fract_leftover )
2007-05-07 01:49:36 +04:00
{
2018-04-06 02:21:39 +03:00
unsigned int min_order = slub_min_order ;
unsigned int order ;
2007-05-07 01:49:36 +04:00
2018-06-08 03:09:10 +03:00
if ( order_objects ( min_order , size ) > MAX_OBJS_PER_PAGE )
2008-10-22 23:00:38 +04:00
return get_order ( size * MAX_OBJS_PER_PAGE ) - 1 ;
2008-04-14 20:11:30 +04:00
2018-06-08 03:09:10 +03:00
for ( order = max ( min_order , ( unsigned int ) get_order ( min_objects * size ) ) ;
2007-05-09 13:32:46 +04:00
order < = max_order ; order + + ) {
2007-05-07 01:49:36 +04:00
2018-04-06 02:21:39 +03:00
unsigned int slab_size = ( unsigned int ) PAGE_SIZE < < order ;
unsigned int rem ;
2007-05-07 01:49:36 +04:00
2018-06-08 03:09:10 +03:00
rem = slab_size % size ;
2007-05-07 01:49:36 +04:00
2007-05-09 13:32:46 +04:00
if ( rem < = slab_size / fract_leftover )
2007-05-07 01:49:36 +04:00
break ;
}
2007-05-09 13:32:39 +04:00
2007-05-07 01:49:36 +04:00
return order ;
}
2018-06-08 03:09:10 +03:00
static inline int calculate_order ( unsigned int size )
2007-05-09 13:32:46 +04:00
{
2018-04-06 02:21:39 +03:00
unsigned int order ;
unsigned int min_objects ;
unsigned int max_objects ;
2007-05-09 13:32:46 +04:00
/*
* Attempt to find best configuration for a slab . This
* works by first attempting to generate a layout with
* the best configuration and backing off gradually .
*
2015-11-06 05:45:46 +03:00
* First we increase the acceptable waste in a slab . Then
2007-05-09 13:32:46 +04:00
* we reduce the minimum objects required in a slab .
*/
min_objects = slub_min_objects ;
2008-04-14 20:11:41 +04:00
if ( ! min_objects )
min_objects = 4 * ( fls ( nr_cpu_ids ) + 1 ) ;
2018-06-08 03:09:10 +03:00
max_objects = order_objects ( slub_max_order , size ) ;
2009-02-12 19:00:17 +03:00
min_objects = min ( min_objects , max_objects ) ;
2007-05-09 13:32:46 +04:00
while ( min_objects > 1 ) {
2018-04-06 02:21:39 +03:00
unsigned int fraction ;
2008-04-14 20:13:29 +04:00
fraction = 16 ;
2007-05-09 13:32:46 +04:00
while ( fraction > = 4 ) {
order = slab_order ( size , min_objects ,
2018-06-08 03:09:10 +03:00
slub_max_order , fraction ) ;
2007-05-09 13:32:46 +04:00
if ( order < = slub_max_order )
return order ;
fraction / = 2 ;
}
2009-08-19 22:44:13 +04:00
min_objects - - ;
2007-05-09 13:32:46 +04:00
}
/*
* We were unable to place multiple objects in a slab . Now
* lets see if we can place a single object there .
*/
2018-06-08 03:09:10 +03:00
order = slab_order ( size , 1 , slub_max_order , 1 ) ;
2007-05-09 13:32:46 +04:00
if ( order < = slub_max_order )
return order ;
/*
* Doh this slab cannot be placed using slub_max_order .
*/
2018-06-08 03:09:10 +03:00
order = slab_order ( size , 1 , MAX_ORDER , 1 ) ;
2009-04-23 10:58:22 +04:00
if ( order < MAX_ORDER )
2007-05-09 13:32:46 +04:00
return order ;
return - ENOSYS ;
}
2008-08-05 10:28:47 +04:00
static void
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( struct kmem_cache_node * n )
2007-05-07 01:49:36 +04:00
{
n - > nr_partial = 0 ;
spin_lock_init ( & n - > list_lock ) ;
INIT_LIST_HEAD ( & n - > partial ) ;
2007-07-17 15:03:32 +04:00
# ifdef CONFIG_SLUB_DEBUG
2008-04-14 19:53:02 +04:00
atomic_long_set ( & n - > nr_slabs , 0 ) ;
2008-09-11 23:25:41 +04:00
atomic_long_set ( & n - > total_objects , 0 ) ;
2007-05-07 01:49:42 +04:00
INIT_LIST_HEAD ( & n - > full ) ;
2007-07-17 15:03:32 +04:00
# endif
2007-05-07 01:49:36 +04:00
}
2010-08-20 21:37:13 +04:00
static inline int alloc_kmem_cache_cpus ( struct kmem_cache * s )
2007-10-16 12:26:08 +04:00
{
2010-08-20 21:37:14 +04:00
BUILD_BUG_ON ( PERCPU_DYNAMIC_EARLY_SIZE <
2013-01-10 23:14:19 +04:00
KMALLOC_SHIFT_HIGH * sizeof ( struct kmem_cache_cpu ) ) ;
2007-10-16 12:26:08 +04:00
2011-02-25 20:38:54 +03:00
/*
2011-06-02 18:19:41 +04:00
* Must align to double word boundary for the double cmpxchg
* instructions to work ; see __pcpu_double_call_return_bool ( ) .
2011-02-25 20:38:54 +03:00
*/
2011-06-02 18:19:41 +04:00
s - > cpu_slab = __alloc_percpu ( sizeof ( struct kmem_cache_cpu ) ,
2 * sizeof ( void * ) ) ;
2011-02-25 20:38:54 +03:00
if ( ! s - > cpu_slab )
return 0 ;
init_kmem_cache_cpus ( s ) ;
2007-10-16 12:26:08 +04:00
2011-02-25 20:38:54 +03:00
return 1 ;
2007-10-16 12:26:08 +04:00
}
2010-08-20 21:37:15 +04:00
static struct kmem_cache * kmem_cache_node ;
2007-05-07 01:49:36 +04:00
/*
* No kmalloc_node yet so do it by hand . We know that this is the first
* slab on the node for this slabcache . There are no concurrent accesses
* possible .
*
2013-11-08 16:47:37 +04:00
* Note that this function only works on the kmem_cache_node
* when allocating for the kmem_cache_node . This is used for bootstrapping
2007-10-16 12:26:08 +04:00
* memory on a fresh node that has no slab structures yet .
2007-05-07 01:49:36 +04:00
*/
2010-08-20 21:37:13 +04:00
static void early_kmem_cache_node_alloc ( int node )
2007-05-07 01:49:36 +04:00
{
struct page * page ;
struct kmem_cache_node * n ;
2010-08-20 21:37:15 +04:00
BUG_ON ( kmem_cache_node - > size < sizeof ( struct kmem_cache_node ) ) ;
2007-05-07 01:49:36 +04:00
2010-08-20 21:37:15 +04:00
page = new_slab ( kmem_cache_node , GFP_NOWAIT , node ) ;
2007-05-07 01:49:36 +04:00
BUG_ON ( ! page ) ;
2007-08-23 01:01:57 +04:00
if ( page_to_nid ( page ) ! = node ) {
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB: Unable to allocate memory from node %d \n " , node ) ;
pr_err ( " SLUB: Allocating a useless per node structure in order to be able to continue \n " ) ;
2007-08-23 01:01:57 +04:00
}
2007-05-07 01:49:36 +04:00
n = page - > freelist ;
BUG_ON ( ! n ) ;
2007-07-17 15:03:32 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-09-29 16:15:01 +04:00
init_object ( kmem_cache_node , n , SLUB_RED_ACTIVE ) ;
2010-08-20 21:37:15 +04:00
init_tracking ( kmem_cache_node , n ) ;
2007-07-17 15:03:32 +04:00
# endif
2018-12-28 11:29:41 +03:00
n = kasan_kmalloc ( kmem_cache_node , n , sizeof ( struct kmem_cache_node ) ,
2016-03-26 00:22:02 +03:00
GFP_KERNEL ) ;
2018-12-28 11:29:41 +03:00
page - > freelist = get_freepointer ( kmem_cache_node , n ) ;
page - > inuse = 1 ;
page - > frozen = 0 ;
kmem_cache_node - > node [ node ] = n ;
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( n ) ;
2010-08-20 21:37:15 +04:00
inc_slabs_node ( kmem_cache_node , node , page - > objects ) ;
2008-02-16 10:45:26 +03:00
2014-01-24 19:20:23 +04:00
/*
2014-02-11 02:25:46 +04:00
* No locks need to be taken here as it has just been
* initialized and there is no concurrent access .
2014-01-24 19:20:23 +04:00
*/
2014-02-11 02:25:46 +04:00
__add_partial ( n , page , DEACTIVATE_TO_HEAD ) ;
2007-05-07 01:49:36 +04:00
}
static void free_kmem_cache_nodes ( struct kmem_cache * s )
{
int node ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2007-05-07 01:49:36 +04:00
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2007-05-07 01:49:36 +04:00
s - > node [ node ] = NULL ;
2017-09-07 02:19:15 +03:00
kmem_cache_free ( kmem_cache_node , n ) ;
2007-05-07 01:49:36 +04:00
}
}
2016-02-18 00:11:37 +03:00
void __kmem_cache_release ( struct kmem_cache * s )
{
2016-07-27 01:21:59 +03:00
cache_random_seq_destroy ( s ) ;
2016-02-18 00:11:37 +03:00
free_percpu ( s - > cpu_slab ) ;
free_kmem_cache_nodes ( s ) ;
}
2010-08-20 21:37:13 +04:00
static int init_kmem_cache_nodes ( struct kmem_cache * s )
2007-05-07 01:49:36 +04:00
{
int node ;
2007-10-16 12:25:33 +04:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-07 01:49:36 +04:00
struct kmem_cache_node * n ;
2010-05-22 01:41:35 +04:00
if ( slab_state = = DOWN ) {
2010-08-20 21:37:13 +04:00
early_kmem_cache_node_alloc ( node ) ;
2010-05-22 01:41:35 +04:00
continue ;
}
2010-08-20 21:37:15 +04:00
n = kmem_cache_alloc_node ( kmem_cache_node ,
2010-08-20 21:37:13 +04:00
GFP_KERNEL , node ) ;
2007-05-07 01:49:36 +04:00
2010-05-22 01:41:35 +04:00
if ( ! n ) {
free_kmem_cache_nodes ( s ) ;
return 0 ;
2007-05-07 01:49:36 +04:00
}
2010-05-22 01:41:35 +04:00
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( n ) ;
2017-09-07 02:19:15 +03:00
s - > node [ node ] = n ;
2007-05-07 01:49:36 +04:00
}
return 1 ;
}
2009-02-25 10:16:35 +03:00
static void set_min_partial ( struct kmem_cache * s , unsigned long min )
2009-02-23 04:40:07 +03:00
{
if ( min < MIN_PARTIAL )
min = MIN_PARTIAL ;
else if ( min > MAX_PARTIAL )
min = MAX_PARTIAL ;
s - > min_partial = min ;
}
2017-07-07 01:36:34 +03:00
static void set_cpu_partial ( struct kmem_cache * s )
{
# ifdef CONFIG_SLUB_CPU_PARTIAL
/*
* cpu_partial determined the maximum number of objects kept in the
* per cpu partial lists of a processor .
*
* Per cpu partial lists mainly contain slabs that just have one
* object freed . If they are used for allocation then they can be
* filled up again with minimal effort . The slab will never hit the
* per node partial lists and therefore no locking will be required .
*
* This setting also determines
*
* A ) The number of objects from per cpu partial slabs dumped to the
* per node list when we reach the limit .
* B ) The number of objects in cpu partial slabs to extract from the
* per node list when we run out of per cpu objects . We only fetch
* 50 % to keep some capacity around for frees .
*/
if ( ! kmem_cache_has_cpu_partial ( s ) )
2020-04-02 07:04:19 +03:00
slub_set_cpu_partial ( s , 0 ) ;
2017-07-07 01:36:34 +03:00
else if ( s - > size > = PAGE_SIZE )
2020-04-02 07:04:19 +03:00
slub_set_cpu_partial ( s , 2 ) ;
2017-07-07 01:36:34 +03:00
else if ( s - > size > = 1024 )
2020-04-02 07:04:19 +03:00
slub_set_cpu_partial ( s , 6 ) ;
2017-07-07 01:36:34 +03:00
else if ( s - > size > = 256 )
2020-04-02 07:04:19 +03:00
slub_set_cpu_partial ( s , 13 ) ;
2017-07-07 01:36:34 +03:00
else
2020-04-02 07:04:19 +03:00
slub_set_cpu_partial ( s , 30 ) ;
2017-07-07 01:36:34 +03:00
# endif
}
2007-05-07 01:49:36 +04:00
/*
* calculate_sizes ( ) determines the order and the distribution of data within
* a slab object .
*/
2008-04-14 20:11:41 +04:00
static int calculate_sizes ( struct kmem_cache * s , int forced_order )
2007-05-07 01:49:36 +04:00
{
2017-11-16 04:32:18 +03:00
slab_flags_t flags = s - > flags ;
2018-04-06 02:21:28 +03:00
unsigned int size = s - > object_size ;
2020-04-21 04:13:42 +03:00
unsigned int freepointer_area ;
2018-04-06 02:21:39 +03:00
unsigned int order ;
2007-05-07 01:49:36 +04:00
2008-02-16 10:45:25 +03:00
/*
* Round up object size to the next word boundary . We can only
* place the free pointer at word boundaries and this determines
* the possible location of the free pointer .
*/
size = ALIGN ( size , sizeof ( void * ) ) ;
2020-04-21 04:13:42 +03:00
/*
* This is the area of the object where a freepointer can be
* safely written . If redzoning adds more to the inuse size , we
* can ' t use that portion for writing the freepointer , so
* s - > offset must be limited within this for the general case .
*/
freepointer_area = size ;
2008-02-16 10:45:25 +03:00
# ifdef CONFIG_SLUB_DEBUG
2007-05-07 01:49:36 +04:00
/*
* Determine if we can poison the object itself . If the user of
* the slab may touch the object after free or before allocation
* then we should never poison the object itself .
*/
2017-01-18 13:53:44 +03:00
if ( ( flags & SLAB_POISON ) & & ! ( flags & SLAB_TYPESAFE_BY_RCU ) & &
2007-05-17 09:10:50 +04:00
! s - > ctor )
2007-05-07 01:49:36 +04:00
s - > flags | = __OBJECT_POISON ;
else
s - > flags & = ~ __OBJECT_POISON ;
/*
2007-05-09 13:32:39 +04:00
* If we are Redzoning then check if there is some space between the
2007-05-07 01:49:36 +04:00
* end of the object and the free pointer . If not then add an
2007-05-09 13:32:39 +04:00
* additional word to have some bytes to store Redzone information .
2007-05-07 01:49:36 +04:00
*/
2012-06-13 19:24:57 +04:00
if ( ( flags & SLAB_RED_ZONE ) & & size = = s - > object_size )
2007-05-07 01:49:36 +04:00
size + = sizeof ( void * ) ;
2007-05-09 13:32:44 +04:00
# endif
2007-05-07 01:49:36 +04:00
/*
2007-05-09 13:32:39 +04:00
* With that we have determined the number of bytes in actual use
* by the object . This is the potential offset to the free pointer .
2007-05-07 01:49:36 +04:00
*/
s - > inuse = size ;
2017-01-18 13:53:44 +03:00
if ( ( ( flags & ( SLAB_TYPESAFE_BY_RCU | SLAB_POISON ) ) | |
2007-05-17 09:10:50 +04:00
s - > ctor ) ) {
2007-05-07 01:49:36 +04:00
/*
* Relocate free pointer after the object if it is not
* permitted to overwrite the first word of the object on
* kmem_cache_free .
*
* This is the case if we do RCU , have a constructor or
* destructor or are poisoning the objects .
2020-05-08 04:36:06 +03:00
*
* The assumption that s - > offset > = s - > inuse means free
* pointer is outside of the object is used in the
* freeptr_outside_object ( ) function . If that is no
* longer true , the function needs to be modified .
2007-05-07 01:49:36 +04:00
*/
s - > offset = size ;
size + = sizeof ( void * ) ;
2020-04-21 04:13:42 +03:00
} else if ( freepointer_area > sizeof ( void * ) ) {
slub: relocate freelist pointer to middle of object
In a recent discussion[1] with Vitaly Nikolenko and Silvio Cesare, it
became clear that moving the freelist pointer away from the edge of
allocations would likely improve the overall defensive posture of the
inline freelist pointer. My benchmarks show no meaningful change to
performance (they seem to show it being faster), so this looks like a
reasonable change to make.
Instead of having the freelist pointer at the very beginning of an
allocation (offset 0) or at the very end of an allocation (effectively
offset -sizeof(void *) from the next allocation), move it away from the
edges of the allocation and into the middle. This provides some
protection against small-sized neighboring overflows (or underflows), for
which the freelist pointer is commonly the target. (Large or well
controlled overwrites are much more likely to attack live object contents,
instead of attempting freelist corruption.)
The vaunted kernel build benchmark, across 5 runs. Before:
Mean: 250.05
Std Dev: 1.85
and after, which appears mysteriously faster:
Mean: 247.13
Std Dev: 0.76
Attempts at running "sysbench --test=memory" show the change to be well in
the noise (sysbench seems to be pretty unstable here -- it's not really
measuring allocation).
Hackbench is more allocation-heavy, and while the std dev is above the
difference, it looks like may manifest as an improvement as well:
20 runs of "hackbench -g 20 -l 1000", before:
Mean: 36.322
Std Dev: 0.577
and after:
Mean: 36.056
Std Dev: 0.598
[1] https://twitter.com/vnik5287/status/1235113523098685440
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Vitaly Nikolenko <vnik@duasynt.com>
Cc: Silvio Cesare <silvio.cesare@gmail.com>
Cc: Christoph Lameter <cl@linux.com>Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/202003051624.AAAC9AECC@keescook
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 07:04:27 +03:00
/*
* Store freelist pointer near middle of object to keep
* it away from the edges of the object to avoid small
* sized over / underflows from neighboring allocations .
*/
2020-04-21 04:13:42 +03:00
s - > offset = ALIGN ( freepointer_area / 2 , sizeof ( void * ) ) ;
2007-05-07 01:49:36 +04:00
}
2007-05-24 00:57:31 +04:00
# ifdef CONFIG_SLUB_DEBUG
2007-05-07 01:49:36 +04:00
if ( flags & SLAB_STORE_USER )
/*
* Need to store information about allocs and frees after
* the object .
*/
size + = 2 * sizeof ( struct track ) ;
2016-07-29 01:49:07 +03:00
# endif
2007-05-07 01:49:36 +04:00
2016-07-29 01:49:07 +03:00
kasan_cache_create ( s , & size , & s - > flags ) ;
# ifdef CONFIG_SLUB_DEBUG
2016-03-16 00:55:12 +03:00
if ( flags & SLAB_RED_ZONE ) {
2007-05-07 01:49:36 +04:00
/*
* Add some empty padding so that we can catch
* overwrites from earlier objects rather than let
* tracking information or the free pointer be
2008-12-30 00:14:56 +03:00
* corrupted if a user writes before the start
2007-05-07 01:49:36 +04:00
* of the object .
*/
size + = sizeof ( void * ) ;
2016-03-16 00:55:12 +03:00
s - > red_left_pad = sizeof ( void * ) ;
s - > red_left_pad = ALIGN ( s - > red_left_pad , s - > align ) ;
size + = s - > red_left_pad ;
}
2007-05-09 13:32:44 +04:00
# endif
2007-05-09 13:32:39 +04:00
2007-05-07 01:49:36 +04:00
/*
* SLUB stores one object immediately after another beginning from
* offset 0. In order to align the objects we have to simply size
* each object to conform to the alignment .
*/
2012-11-28 20:23:16 +04:00
size = ALIGN ( size , s - > align ) ;
2007-05-07 01:49:36 +04:00
s - > size = size ;
2020-08-07 09:20:42 +03:00
s - > reciprocal_size = reciprocal_value ( size ) ;
2008-04-14 20:11:41 +04:00
if ( forced_order > = 0 )
order = forced_order ;
else
2018-06-08 03:09:10 +03:00
order = calculate_order ( size ) ;
2007-05-07 01:49:36 +04:00
2018-04-06 02:21:39 +03:00
if ( ( int ) order < 0 )
2007-05-07 01:49:36 +04:00
return 0 ;
2008-02-15 01:21:32 +03:00
s - > allocflags = 0 ;
2008-04-14 20:11:31 +04:00
if ( order )
2008-02-15 01:21:32 +03:00
s - > allocflags | = __GFP_COMP ;
if ( s - > flags & SLAB_CACHE_DMA )
2013-01-10 23:14:19 +04:00
s - > allocflags | = GFP_DMA ;
2008-02-15 01:21:32 +03:00
mm: add support for kmem caches in DMA32 zone
Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
v6.
This is a followup to the discussion in [1], [2].
IOMMUs using ARMv7 short-descriptor format require page tables (level 1
and 2) to be allocated within the first 4GB of RAM, even on 64-bit
systems.
For L1 tables that are bigger than a page, we can just use
__get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
use GFP_DMA).
For L2 tables that only take 1KB, it would be a waste to allocate a full
page, so we considered 3 approaches:
1. This series, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2 page
tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable to reuse
freed fragments until the whole page is freed. [3]
This series is the most memory-efficient approach.
stable@ note:
We confirmed that this is a regression, and IOMMU errors happen on 4.19
and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
platforms (and maybe others?).
[1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
[2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
[3] https://patchwork.codeaurora.org/patch/671639/
This patch (of 3):
IOMMUs using ARMv7 short-descriptor format require page tables to be
allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
this is done by passing GFP_DMA32 flag to memory allocation functions.
For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
a full page using get_free_pages, so we considered 3 approaches:
1. This patch, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2
page tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable
to reuse freed fragments until the whole page is freed.
This change makes it possible to create a custom cache in DMA32 zone using
kmem_cache_create, then allocate memory using kmem_cache_alloc.
We do not create a DMA32 kmalloc cache array, as there are currently no
users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.
This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
unnecessary).
Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Sasha Levin <Alexander.Levin@microsoft.com>
Cc: Huaisheng Ye <yehs1@lenovo.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Yong Wu <yong.wu@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Tomasz Figa <tfiga@google.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 06:43:42 +03:00
if ( s - > flags & SLAB_CACHE_DMA32 )
s - > allocflags | = GFP_DMA32 ;
2008-02-15 01:21:32 +03:00
if ( s - > flags & SLAB_RECLAIM_ACCOUNT )
s - > allocflags | = __GFP_RECLAIMABLE ;
2007-05-07 01:49:36 +04:00
/*
* Determine the number of objects per slab
*/
2018-06-08 03:09:10 +03:00
s - > oo = oo_make ( order , size ) ;
s - > min = oo_make ( get_order ( size ) , size ) ;
2008-04-14 20:11:40 +04:00
if ( oo_objects ( s - > oo ) > oo_objects ( s - > max ) )
s - > max = s - > oo ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:31 +04:00
return ! ! oo_objects ( s - > oo ) ;
2007-05-07 01:49:36 +04:00
}
2017-11-16 04:32:18 +03:00
static int kmem_cache_open ( struct kmem_cache * s , slab_flags_t flags )
2007-05-07 01:49:36 +04:00
{
2012-09-05 03:18:33 +04:00
s - > flags = kmem_cache_flags ( s - > size , flags , s - > name , s - > ctor ) ;
2017-09-07 02:19:18 +03:00
# ifdef CONFIG_SLAB_FREELIST_HARDENED
s - > random = get_random_long ( ) ;
# endif
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:41 +04:00
if ( ! calculate_sizes ( s , - 1 ) )
2007-05-07 01:49:36 +04:00
goto error ;
2009-07-28 05:30:35 +04:00
if ( disable_higher_order_debug ) {
/*
* Disable debugging flags that store metadata if the min slab
* order increased .
*/
2012-06-13 19:24:57 +04:00
if ( get_order ( s - > size ) > get_order ( s - > object_size ) ) {
2009-07-28 05:30:35 +04:00
s - > flags & = ~ DEBUG_METADATA_FLAGS ;
s - > offset = 0 ;
if ( ! calculate_sizes ( s , - 1 ) )
goto error ;
}
}
2007-05-07 01:49:36 +04:00
2012-01-13 05:17:33 +04:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined ( CONFIG_HAVE_ALIGNED_STRUCT_PAGE )
2016-03-16 00:55:09 +03:00
if ( system_has_cmpxchg_double ( ) & & ( s - > flags & SLAB_NO_CMPXCHG ) = = 0 )
2011-06-01 21:25:49 +04:00
/* Enable fast mode */
s - > flags | = __CMPXCHG_DOUBLE ;
# endif
2009-02-23 04:40:07 +03:00
/*
* The larger the object size is , the more pages we want on the partial
* list to avoid pounding the page allocator excessively .
*/
2011-08-10 01:12:27 +04:00
set_min_partial ( s , ilog2 ( s - > size ) / 2 ) ;
2017-07-07 01:36:34 +03:00
set_cpu_partial ( s ) ;
2011-08-10 01:12:27 +04:00
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_NUMA
2008-08-19 17:51:22 +04:00
s - > remote_node_defrag_ratio = 1000 ;
2007-05-07 01:49:36 +04:00
# endif
2016-07-27 01:21:59 +03:00
/* Initialize the pre-computed randomized freelist if slab is up */
if ( slab_state > = UP ) {
if ( init_cache_random_seq ( s ) )
goto error ;
}
2010-08-20 21:37:13 +04:00
if ( ! init_kmem_cache_nodes ( s ) )
2007-10-16 12:26:05 +04:00
goto error ;
2007-05-07 01:49:36 +04:00
2010-08-20 21:37:13 +04:00
if ( alloc_kmem_cache_cpus ( s ) )
2012-09-05 04:20:34 +04:00
return 0 ;
2009-12-19 01:26:22 +03:00
2007-10-16 12:26:08 +04:00
free_kmem_cache_nodes ( s ) ;
2007-05-07 01:49:36 +04:00
error :
2012-09-05 04:20:34 +04:00
return - EINVAL ;
2007-05-07 01:49:36 +04:00
}
2008-04-25 23:22:43 +04:00
static void list_slab_objects ( struct kmem_cache * s , struct page * page ,
2020-06-26 06:29:55 +03:00
const char * text )
2008-04-25 23:22:43 +04:00
{
# ifdef CONFIG_SLUB_DEBUG
void * addr = page_address ( page ) ;
2020-06-26 06:29:55 +03:00
unsigned long * map ;
2008-04-25 23:22:43 +04:00
void * p ;
2020-06-02 07:45:53 +03:00
2012-09-05 03:18:33 +04:00
slab_err ( s , page , text , s - > name ) ;
2008-04-25 23:22:43 +04:00
slab_lock ( page ) ;
2020-01-31 09:11:57 +03:00
map = get_map ( s , page ) ;
2008-04-25 23:22:43 +04:00
for_each_object ( p , s , addr , page - > objects ) {
2020-08-07 09:20:42 +03:00
if ( ! test_bit ( __obj_to_index ( s , addr , p ) , map ) ) {
2014-06-05 03:06:34 +04:00
pr_err ( " INFO: Object 0x%p @offset=%tu \n " , p , p - addr ) ;
2008-04-25 23:22:43 +04:00
print_tracking ( s , p ) ;
}
}
2020-06-26 06:29:55 +03:00
put_map ( map ) ;
2008-04-25 23:22:43 +04:00
slab_unlock ( page ) ;
# endif
}
2007-05-07 01:49:36 +04:00
/*
2008-04-23 23:36:52 +04:00
* Attempt to free all partial slabs on a node .
2016-02-18 00:11:37 +03:00
* This is called from __kmem_cache_shutdown ( ) . We must take list_lock
* because sysfs file might still access partial list after the shutdowning .
2007-05-07 01:49:36 +04:00
*/
2008-04-23 23:36:52 +04:00
static void free_partial ( struct kmem_cache * s , struct kmem_cache_node * n )
2007-05-07 01:49:36 +04:00
{
mm/slub.c: run free_partial() outside of the kmem_cache_node->list_lock
With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a
kmem_cache_node is destroyed the call_rcu() may trigger a slab
allocation to fill the debug object pool (__debug_object_init:fill_pool).
Everywhere but during kmem_cache_destroy(), discard_slab() is performed
outside of the kmem_cache_node->list_lock and avoids a lockdep warning
about potential recursion:
=============================================
[ INFO: possible recursive locking detected ]
4.8.0-rc1-gfxbench+ #1 Tainted: G U
---------------------------------------------
rmmod/8895 is trying to acquire lock:
(&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811c80d7>] get_partial_node.isra.63+0x47/0x430
but task is already holding lock:
(&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811cbda4>] __kmem_cache_shutdown+0x54/0x320
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&n->list_lock)->rlock);
lock(&(&n->list_lock)->rlock);
*** DEADLOCK ***
May be due to missing lock nesting notation
5 locks held by rmmod/8895:
#0: (&dev->mutex){......}, at: driver_detach+0x42/0xc0
#1: (&dev->mutex){......}, at: driver_detach+0x50/0xc0
#2: (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80
#3: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220
#4: (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320
stack backtrace:
CPU: 6 PID: 8895 Comm: rmmod Tainted: G U 4.8.0-rc1-gfxbench+ #1
Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
Call Trace:
__lock_acquire+0x1646/0x1ad0
lock_acquire+0xb2/0x200
_raw_spin_lock+0x36/0x50
get_partial_node.isra.63+0x47/0x430
___slab_alloc.constprop.67+0x1a7/0x3b0
__slab_alloc.isra.64.constprop.66+0x43/0x80
kmem_cache_alloc+0x236/0x2d0
__debug_object_init+0x2de/0x400
debug_object_activate+0x109/0x1e0
__call_rcu.constprop.63+0x32/0x2f0
call_rcu+0x12/0x20
discard_slab+0x3d/0x40
__kmem_cache_shutdown+0xdb/0x320
shutdown_cache+0x19/0x60
kmem_cache_destroy+0x1ae/0x220
i915_gem_load_cleanup+0x14/0x40 [i915]
i915_driver_unload+0x151/0x180 [i915]
i915_pci_remove+0x14/0x20 [i915]
pci_device_remove+0x34/0xb0
__device_release_driver+0x95/0x140
driver_detach+0xb6/0xc0
bus_remove_driver+0x53/0xd0
driver_unregister+0x27/0x50
pci_unregister_driver+0x25/0x70
i915_exit+0x1a/0x1e2 [i915]
SyS_delete_module+0x193/0x1f0
entry_SYSCALL_64_fastpath+0x1c/0xac
Fixes: 52b4b950b507 ("mm: slab: free kmem_cache_node after destroy sysfs file")
Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk
Reported-by: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Gordon <david.s.gordon@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11 02:27:58 +03:00
LIST_HEAD ( discard ) ;
2007-05-07 01:49:36 +04:00
struct page * page , * h ;
2016-02-18 00:11:37 +03:00
BUG_ON ( irqs_disabled ( ) ) ;
spin_lock_irq ( & n - > list_lock ) ;
2019-05-14 03:16:12 +03:00
list_for_each_entry_safe ( page , h , & n - > partial , slab_list ) {
2007-05-07 01:49:36 +04:00
if ( ! page - > inuse ) {
2016-02-18 00:11:37 +03:00
remove_partial ( n , page ) ;
2019-05-14 03:16:12 +03:00
list_add ( & page - > slab_list , & discard ) ;
2008-04-25 23:22:43 +04:00
} else {
list_slab_objects ( s , page ,
2020-06-26 06:29:55 +03:00
" Objects remaining in %s on __kmem_cache_shutdown() " ) ;
2008-04-23 23:36:52 +04:00
}
2008-04-25 23:22:43 +04:00
}
2016-02-18 00:11:37 +03:00
spin_unlock_irq ( & n - > list_lock ) ;
mm/slub.c: run free_partial() outside of the kmem_cache_node->list_lock
With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a
kmem_cache_node is destroyed the call_rcu() may trigger a slab
allocation to fill the debug object pool (__debug_object_init:fill_pool).
Everywhere but during kmem_cache_destroy(), discard_slab() is performed
outside of the kmem_cache_node->list_lock and avoids a lockdep warning
about potential recursion:
=============================================
[ INFO: possible recursive locking detected ]
4.8.0-rc1-gfxbench+ #1 Tainted: G U
---------------------------------------------
rmmod/8895 is trying to acquire lock:
(&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811c80d7>] get_partial_node.isra.63+0x47/0x430
but task is already holding lock:
(&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811cbda4>] __kmem_cache_shutdown+0x54/0x320
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&n->list_lock)->rlock);
lock(&(&n->list_lock)->rlock);
*** DEADLOCK ***
May be due to missing lock nesting notation
5 locks held by rmmod/8895:
#0: (&dev->mutex){......}, at: driver_detach+0x42/0xc0
#1: (&dev->mutex){......}, at: driver_detach+0x50/0xc0
#2: (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80
#3: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220
#4: (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320
stack backtrace:
CPU: 6 PID: 8895 Comm: rmmod Tainted: G U 4.8.0-rc1-gfxbench+ #1
Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
Call Trace:
__lock_acquire+0x1646/0x1ad0
lock_acquire+0xb2/0x200
_raw_spin_lock+0x36/0x50
get_partial_node.isra.63+0x47/0x430
___slab_alloc.constprop.67+0x1a7/0x3b0
__slab_alloc.isra.64.constprop.66+0x43/0x80
kmem_cache_alloc+0x236/0x2d0
__debug_object_init+0x2de/0x400
debug_object_activate+0x109/0x1e0
__call_rcu.constprop.63+0x32/0x2f0
call_rcu+0x12/0x20
discard_slab+0x3d/0x40
__kmem_cache_shutdown+0xdb/0x320
shutdown_cache+0x19/0x60
kmem_cache_destroy+0x1ae/0x220
i915_gem_load_cleanup+0x14/0x40 [i915]
i915_driver_unload+0x151/0x180 [i915]
i915_pci_remove+0x14/0x20 [i915]
pci_device_remove+0x34/0xb0
__device_release_driver+0x95/0x140
driver_detach+0xb6/0xc0
bus_remove_driver+0x53/0xd0
driver_unregister+0x27/0x50
pci_unregister_driver+0x25/0x70
i915_exit+0x1a/0x1e2 [i915]
SyS_delete_module+0x193/0x1f0
entry_SYSCALL_64_fastpath+0x1c/0xac
Fixes: 52b4b950b507 ("mm: slab: free kmem_cache_node after destroy sysfs file")
Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk
Reported-by: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Gordon <david.s.gordon@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11 02:27:58 +03:00
2019-05-14 03:16:12 +03:00
list_for_each_entry_safe ( page , h , & discard , slab_list )
mm/slub.c: run free_partial() outside of the kmem_cache_node->list_lock
With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a
kmem_cache_node is destroyed the call_rcu() may trigger a slab
allocation to fill the debug object pool (__debug_object_init:fill_pool).
Everywhere but during kmem_cache_destroy(), discard_slab() is performed
outside of the kmem_cache_node->list_lock and avoids a lockdep warning
about potential recursion:
=============================================
[ INFO: possible recursive locking detected ]
4.8.0-rc1-gfxbench+ #1 Tainted: G U
---------------------------------------------
rmmod/8895 is trying to acquire lock:
(&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811c80d7>] get_partial_node.isra.63+0x47/0x430
but task is already holding lock:
(&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811cbda4>] __kmem_cache_shutdown+0x54/0x320
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&n->list_lock)->rlock);
lock(&(&n->list_lock)->rlock);
*** DEADLOCK ***
May be due to missing lock nesting notation
5 locks held by rmmod/8895:
#0: (&dev->mutex){......}, at: driver_detach+0x42/0xc0
#1: (&dev->mutex){......}, at: driver_detach+0x50/0xc0
#2: (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80
#3: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220
#4: (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320
stack backtrace:
CPU: 6 PID: 8895 Comm: rmmod Tainted: G U 4.8.0-rc1-gfxbench+ #1
Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
Call Trace:
__lock_acquire+0x1646/0x1ad0
lock_acquire+0xb2/0x200
_raw_spin_lock+0x36/0x50
get_partial_node.isra.63+0x47/0x430
___slab_alloc.constprop.67+0x1a7/0x3b0
__slab_alloc.isra.64.constprop.66+0x43/0x80
kmem_cache_alloc+0x236/0x2d0
__debug_object_init+0x2de/0x400
debug_object_activate+0x109/0x1e0
__call_rcu.constprop.63+0x32/0x2f0
call_rcu+0x12/0x20
discard_slab+0x3d/0x40
__kmem_cache_shutdown+0xdb/0x320
shutdown_cache+0x19/0x60
kmem_cache_destroy+0x1ae/0x220
i915_gem_load_cleanup+0x14/0x40 [i915]
i915_driver_unload+0x151/0x180 [i915]
i915_pci_remove+0x14/0x20 [i915]
pci_device_remove+0x34/0xb0
__device_release_driver+0x95/0x140
driver_detach+0xb6/0xc0
bus_remove_driver+0x53/0xd0
driver_unregister+0x27/0x50
pci_unregister_driver+0x25/0x70
i915_exit+0x1a/0x1e2 [i915]
SyS_delete_module+0x193/0x1f0
entry_SYSCALL_64_fastpath+0x1c/0xac
Fixes: 52b4b950b507 ("mm: slab: free kmem_cache_node after destroy sysfs file")
Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk
Reported-by: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Gordon <david.s.gordon@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11 02:27:58 +03:00
discard_slab ( s , page ) ;
2007-05-07 01:49:36 +04:00
}
2018-04-06 02:21:57 +03:00
bool __kmem_cache_empty ( struct kmem_cache * s )
{
int node ;
struct kmem_cache_node * n ;
for_each_kmem_cache_node ( s , node , n )
if ( n - > nr_partial | | slabs_node ( s , node ) )
return false ;
return true ;
}
2007-05-07 01:49:36 +04:00
/*
2007-05-09 13:32:39 +04:00
* Release all resources used by a slab cache .
2007-05-07 01:49:36 +04:00
*/
2016-02-18 00:11:37 +03:00
int __kmem_cache_shutdown ( struct kmem_cache * s )
2007-05-07 01:49:36 +04:00
{
int node ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2007-05-07 01:49:36 +04:00
flush_all ( s ) ;
/* Attempt to free all objects */
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2008-04-23 23:36:52 +04:00
free_partial ( s , n ) ;
if ( n - > nr_partial | | slabs_node ( s , node ) )
2007-05-07 01:49:36 +04:00
return 1 ;
}
return 0 ;
}
/********************************************************************
* Kmalloc subsystem
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
static int __init setup_slub_min_order ( char * str )
{
2018-04-06 02:21:39 +03:00
get_option ( & str , ( int * ) & slub_min_order ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
}
__setup ( " slub_min_order= " , setup_slub_min_order ) ;
static int __init setup_slub_max_order ( char * str )
{
2018-04-06 02:21:39 +03:00
get_option ( & str , ( int * ) & slub_max_order ) ;
slub_max_order = min ( slub_max_order , ( unsigned int ) MAX_ORDER - 1 ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
}
__setup ( " slub_max_order= " , setup_slub_max_order ) ;
static int __init setup_slub_min_objects ( char * str )
{
2018-04-06 02:21:39 +03:00
get_option ( & str , ( int * ) & slub_min_objects ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
}
__setup ( " slub_min_objects= " , setup_slub_min_objects ) ;
void * __kmalloc ( size_t size , gfp_t flags )
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-19 21:43:26 +04:00
void * ret ;
2007-05-07 01:49:36 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) )
2008-02-11 23:47:46 +03:00
return kmalloc_large ( size , flags ) ;
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , flags ) ;
2007-10-16 12:24:38 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2012-09-09 00:47:58 +04:00
ret = slab_alloc ( s , flags , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2009-03-23 16:12:24 +03:00
trace_kmalloc ( _RET_IP_ , ret , size , s - > size , flags ) ;
2008-08-19 21:43:26 +04:00
kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v13.
This patchset adds a new software tag-based mode to KASAN [1]. (Initially
this mode was called KHWASAN, but it got renamed, see the naming rationale
at the end of this section).
The plan is to implement HWASan [2] for the kernel with the incentive,
that it's going to have comparable to KASAN performance, but in the same
time consume much less memory, trading that off for somewhat imprecise bug
detection and being supported only for arm64.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error.
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
to dereference such pointers. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which
would use hardware memory tagging support (announced by Arm [3]) instead
of compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
[2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
====== Rationale
On mobile devices generic KASAN's memory usage is significant problem.
One of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life
miserable.
[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
====== Technical details
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks. Either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has it's own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses, that fall into the
same shadow cell, as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that there's a particular type of bugs that tag-based KASAN can
detect compared to generic KASAN: use-after-free after the object has been
allocated by someone else.
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass and a kernel module that would print a bug report whenever
two pointers with different tags are being compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type) has been used. Then the kernel has been
booted in QEMU and on an Odroid C2 board and syzkaller has been run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure debug checks to
those two functions that check that the result doesn't change whether we
operate on pointers with or without untagging has been added.
A few other cases that don't look that interesting:
Comparing pointers to achieve unique sorting order of pointee objects
(e.g. sorting locks addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing way (use until fails) is
the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (the whole reserved once during boot).
3. Quaratine (grows gradually until some preset limit; the more the limit,
the more the chance to detect a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:29:37 +03:00
ret = kasan_kmalloc ( s , ret , size , flags ) ;
2015-02-14 01:39:42 +03:00
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( __kmalloc ) ;
2010-09-29 16:02:15 +04:00
# ifdef CONFIG_NUMA
2008-03-02 00:56:40 +03:00
static void * kmalloc_large_node ( size_t size , gfp_t flags , int node )
{
2008-11-25 18:55:53 +03:00
struct page * page ;
2009-07-07 13:32:59 +04:00
void * ptr = NULL ;
mm, sl[ou]b: improve memory accounting
Patch series "guarantee natural alignment for kmalloc()", v2.
This patch (of 2):
SLOB currently doesn't account its pages at all, so in /proc/meminfo the
Slab field shows zero. Modifying a counter on page allocation and
freeing should be acceptable even for the small system scenarios SLOB is
intended for. Since reclaimable caches are not separated in SLOB,
account everything as unreclaimable.
SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
larger than order-1 page, that are passed directly to the page
allocator. As they also don't appear in /proc/slabinfo, it might look
like a memory leak. For consistency, account them as well. (SLAB
doesn't actually use page allocator directly, so no change there).
Ideally SLOB and SLUB would be handled in separate patches, but due to
the shared kmalloc_order() function and different kfree()
implementations, it's easier to patch both at once to prevent
inconsistencies.
Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 03:58:42 +03:00
unsigned int order = get_order ( size ) ;
2008-03-02 00:56:40 +03:00
2017-11-16 04:35:54 +03:00
flags | = __GFP_COMP ;
mm, sl[ou]b: improve memory accounting
Patch series "guarantee natural alignment for kmalloc()", v2.
This patch (of 2):
SLOB currently doesn't account its pages at all, so in /proc/meminfo the
Slab field shows zero. Modifying a counter on page allocation and
freeing should be acceptable even for the small system scenarios SLOB is
intended for. Since reclaimable caches are not separated in SLOB,
account everything as unreclaimable.
SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
larger than order-1 page, that are passed directly to the page
allocator. As they also don't appear in /proc/slabinfo, it might look
like a memory leak. For consistency, account them as well. (SLAB
doesn't actually use page allocator directly, so no change there).
Ideally SLOB and SLUB would be handled in separate patches, but due to
the shared kmalloc_order() function and different kfree()
implementations, it's easier to patch both at once to prevent
inconsistencies.
Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 03:58:42 +03:00
page = alloc_pages_node ( node , flags , order ) ;
if ( page ) {
2009-07-07 13:32:59 +04:00
ptr = page_address ( page ) ;
2020-08-07 09:20:39 +03:00
mod_node_page_state ( page_pgdat ( page ) , NR_SLAB_UNRECLAIMABLE_B ,
PAGE_SIZE < < order ) ;
mm, sl[ou]b: improve memory accounting
Patch series "guarantee natural alignment for kmalloc()", v2.
This patch (of 2):
SLOB currently doesn't account its pages at all, so in /proc/meminfo the
Slab field shows zero. Modifying a counter on page allocation and
freeing should be acceptable even for the small system scenarios SLOB is
intended for. Since reclaimable caches are not separated in SLOB,
account everything as unreclaimable.
SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
larger than order-1 page, that are passed directly to the page
allocator. As they also don't appear in /proc/slabinfo, it might look
like a memory leak. For consistency, account them as well. (SLAB
doesn't actually use page allocator directly, so no change there).
Ideally SLOB and SLUB would be handled in separate patches, but due to
the shared kmalloc_order() function and different kfree()
implementations, it's easier to patch both at once to prevent
inconsistencies.
Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 03:58:42 +03:00
}
2009-07-07 13:32:59 +04:00
kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v13.
This patchset adds a new software tag-based mode to KASAN [1]. (Initially
this mode was called KHWASAN, but it got renamed, see the naming rationale
at the end of this section).
The plan is to implement HWASan [2] for the kernel with the incentive,
that it's going to have comparable to KASAN performance, but in the same
time consume much less memory, trading that off for somewhat imprecise bug
detection and being supported only for arm64.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error.
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
to dereference such pointers. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which
would use hardware memory tagging support (announced by Arm [3]) instead
of compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
[2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
====== Rationale
On mobile devices generic KASAN's memory usage is significant problem.
One of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life
miserable.
[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
====== Technical details
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks. Either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has it's own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses, that fall into the
same shadow cell, as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that there's a particular type of bugs that tag-based KASAN can
detect compared to generic KASAN: use-after-free after the object has been
allocated by someone else.
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass and a kernel module that would print a bug report whenever
two pointers with different tags are being compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type) has been used. Then the kernel has been
booted in QEMU and on an Odroid C2 board and syzkaller has been run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure debug checks to
those two functions that check that the result doesn't change whether we
operate on pointers with or without untagging has been added.
A few other cases that don't look that interesting:
Comparing pointers to achieve unique sorting order of pointee objects
(e.g. sorting locks addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing way (use until fails) is
the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (the whole reserved once during boot).
3. Quaratine (grows gradually until some preset limit; the more the limit,
the more the chance to detect a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:29:37 +03:00
return kmalloc_large_node_hook ( ptr , size , flags ) ;
2008-03-02 00:56:40 +03:00
}
2007-05-07 01:49:36 +04:00
void * __kmalloc_node ( size_t size , gfp_t flags , int node )
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-19 21:43:26 +04:00
void * ret ;
2007-05-07 01:49:36 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) ) {
2008-08-19 21:43:26 +04:00
ret = kmalloc_large_node ( size , flags , node ) ;
2009-03-23 16:12:24 +03:00
trace_kmalloc_node ( _RET_IP_ , ret ,
size , PAGE_SIZE < < get_order ( size ) ,
flags , node ) ;
2008-08-19 21:43:26 +04:00
return ret ;
}
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , flags ) ;
2007-10-16 12:24:38 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2012-09-09 00:47:58 +04:00
ret = slab_alloc_node ( s , flags , node , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2009-03-23 16:12:24 +03:00
trace_kmalloc_node ( _RET_IP_ , ret , size , s - > size , flags , node ) ;
2008-08-19 21:43:26 +04:00
kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v13.
This patchset adds a new software tag-based mode to KASAN [1]. (Initially
this mode was called KHWASAN, but it got renamed, see the naming rationale
at the end of this section).
The plan is to implement HWASan [2] for the kernel with the incentive,
that it's going to have comparable to KASAN performance, but in the same
time consume much less memory, trading that off for somewhat imprecise bug
detection and being supported only for arm64.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error.
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
to dereference such pointers. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which
would use hardware memory tagging support (announced by Arm [3]) instead
of compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
[1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
[2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a
====== Rationale
On mobile devices generic KASAN's memory usage is significant problem.
One of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life
miserable.
[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
====== Technical details
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks. Either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has it's own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses, that fall into the
same shadow cell, as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that there's a particular type of bugs that tag-based KASAN can
detect compared to generic KASAN: use-after-free after the object has been
allocated by someone else.
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass and a kernel module that would print a bug report whenever
two pointers with different tags are being compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type) has been used. Then the kernel has been
booted in QEMU and on an Odroid C2 board and syzkaller has been run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure debug checks to
those two functions that check that the result doesn't change whether we
operate on pointers with or without untagging has been added.
A few other cases that don't look that interesting:
Comparing pointers to achieve unique sorting order of pointee objects
(e.g. sorting locks addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing way (use until fails) is
the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (the whole reserved once during boot).
3. Quaratine (grows gradually until some preset limit; the more the limit,
the more the chance to detect a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 11:29:37 +03:00
ret = kasan_kmalloc ( s , ret , size , flags ) ;
2015-02-14 01:39:42 +03:00
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( __kmalloc_node ) ;
2019-05-14 03:16:09 +03:00
# endif /* CONFIG_NUMA */
2007-05-07 01:49:36 +04:00
2016-06-24 01:24:05 +03:00
# ifdef CONFIG_HARDENED_USERCOPY
/*
2018-01-11 02:17:01 +03:00
* Rejects incorrectly sized objects and objects that are to be copied
* to / from userspace but do not fall entirely within the containing slab
* cache ' s usercopy region .
2016-06-24 01:24:05 +03:00
*
* Returns NULL if check passes , otherwise const char * to name of cache
* to indicate an error .
*/
2018-01-11 01:48:22 +03:00
void __check_heap_object ( const void * ptr , unsigned long n , struct page * page ,
bool to_user )
2016-06-24 01:24:05 +03:00
{
struct kmem_cache * s ;
2018-04-06 02:21:20 +03:00
unsigned int offset ;
2016-06-24 01:24:05 +03:00
size_t object_size ;
2019-01-09 02:23:15 +03:00
ptr = kasan_reset_tag ( ptr ) ;
2016-06-24 01:24:05 +03:00
/* Find object and usable object size. */
s = page - > slab_cache ;
/* Reject impossible pointers. */
if ( ptr < page_address ( page ) )
2018-01-11 01:48:22 +03:00
usercopy_abort ( " SLUB object not in SLUB page?! " , NULL ,
to_user , 0 , n ) ;
2016-06-24 01:24:05 +03:00
/* Find offset within object. */
offset = ( ptr - page_address ( page ) ) % s - > size ;
/* Adjust for redzone and reject if within the redzone. */
2020-08-07 09:18:55 +03:00
if ( kmem_cache_debug_flags ( s , SLAB_RED_ZONE ) ) {
2016-06-24 01:24:05 +03:00
if ( offset < s - > red_left_pad )
2018-01-11 01:48:22 +03:00
usercopy_abort ( " SLUB object in left red zone " ,
s - > name , to_user , offset , n ) ;
2016-06-24 01:24:05 +03:00
offset - = s - > red_left_pad ;
}
2018-01-11 02:17:01 +03:00
/* Allow address range falling entirely within usercopy region. */
if ( offset > = s - > useroffset & &
offset - s - > useroffset < = s - > usersize & &
n < = s - > useroffset - offset + s - > usersize )
2018-01-11 01:48:22 +03:00
return ;
2016-06-24 01:24:05 +03:00
2018-01-11 02:17:01 +03:00
/*
* If the copy is still within the allocated object , produce
* a warning instead of rejecting the copy . This is intended
* to be a temporary method to find any missing usercopy
* whitelists .
*/
object_size = slab_ksize ( s ) ;
2017-12-01 00:04:32 +03:00
if ( usercopy_fallback & &
offset < = object_size & & n < = object_size - offset ) {
2018-01-11 02:17:01 +03:00
usercopy_warn ( " SLUB object " , s - > name , to_user , offset , n ) ;
return ;
}
2016-06-24 01:24:05 +03:00
2018-01-11 01:48:22 +03:00
usercopy_abort ( " SLUB object " , s - > name , to_user , offset , n ) ;
2016-06-24 01:24:05 +03:00
}
# endif /* CONFIG_HARDENED_USERCOPY */
2019-07-12 06:54:14 +03:00
size_t __ksize ( const void * object )
2007-05-07 01:49:36 +04:00
{
2007-06-09 00:46:49 +04:00
struct page * page ;
2007-05-07 01:49:36 +04:00
2007-10-16 12:24:46 +04:00
if ( unlikely ( object = = ZERO_SIZE_PTR ) )
2007-06-09 00:46:49 +04:00
return 0 ;
2007-12-05 10:45:30 +03:00
page = virt_to_head_page ( object ) ;
2008-05-22 20:22:25 +04:00
if ( unlikely ( ! PageSlab ( page ) ) ) {
WARN_ON ( ! PageCompound ( page ) ) ;
2019-09-24 01:34:25 +03:00
return page_size ( page ) ;
2008-05-22 20:22:25 +04:00
}
2007-05-07 01:49:36 +04:00
slub: Commonize slab_cache field in struct page
Right now, slab and slub have fields in struct page to derive which
cache a page belongs to, but they do it slightly differently.
slab uses a field called slab_cache, that lives in the third double
word. slub, uses a field called "slab", living outside of the
doublewords area.
Ideally, we could use the same field for this. Since slub heavily makes
use of the doubleword region, there isn't really much room to move
slub's slab_cache field around. Since slab does not have such strict
placement restrictions, we can move it outside the doubleword area.
The naming used by slab, "slab_cache", is less confusing, and it is
preferred over slub's generic "slab".
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-22 18:05:36 +04:00
return slab_ksize ( page - > slab_cache ) ;
2007-05-07 01:49:36 +04:00
}
2019-07-12 06:54:14 +03:00
EXPORT_SYMBOL ( __ksize ) ;
2007-05-07 01:49:36 +04:00
void kfree ( const void * x )
{
struct page * page ;
2008-02-08 04:47:41 +03:00
void * object = ( void * ) x ;
2007-05-07 01:49:36 +04:00
2009-03-25 12:05:57 +03:00
trace_kfree ( _RET_IP_ , x ) ;
2007-10-16 12:24:44 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( x ) ) )
2007-05-07 01:49:36 +04:00
return ;
2007-05-07 01:49:41 +04:00
page = virt_to_head_page ( x ) ;
2007-10-16 12:24:38 +04:00
if ( unlikely ( ! PageSlab ( page ) ) ) {
mm, sl[ou]b: improve memory accounting
Patch series "guarantee natural alignment for kmalloc()", v2.
This patch (of 2):
SLOB currently doesn't account its pages at all, so in /proc/meminfo the
Slab field shows zero. Modifying a counter on page allocation and
freeing should be acceptable even for the small system scenarios SLOB is
intended for. Since reclaimable caches are not separated in SLOB,
account everything as unreclaimable.
SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
larger than order-1 page, that are passed directly to the page
allocator. As they also don't appear in /proc/slabinfo, it might look
like a memory leak. For consistency, account them as well. (SLAB
doesn't actually use page allocator directly, so no change there).
Ideally SLOB and SLUB would be handled in separate patches, but due to
the shared kmalloc_order() function and different kfree()
implementations, it's easier to patch both at once to prevent
inconsistencies.
Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 03:58:42 +03:00
unsigned int order = compound_order ( page ) ;
2008-05-28 21:32:22 +04:00
BUG_ON ( ! PageCompound ( page ) ) ;
2018-02-07 02:36:23 +03:00
kfree_hook ( object ) ;
2020-08-07 09:20:39 +03:00
mod_node_page_state ( page_pgdat ( page ) , NR_SLAB_UNRECLAIMABLE_B ,
- ( PAGE_SIZE < < order ) ) ;
mm, sl[ou]b: improve memory accounting
Patch series "guarantee natural alignment for kmalloc()", v2.
This patch (of 2):
SLOB currently doesn't account its pages at all, so in /proc/meminfo the
Slab field shows zero. Modifying a counter on page allocation and
freeing should be acceptable even for the small system scenarios SLOB is
intended for. Since reclaimable caches are not separated in SLOB,
account everything as unreclaimable.
SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
larger than order-1 page, that are passed directly to the page
allocator. As they also don't appear in /proc/slabinfo, it might look
like a memory leak. For consistency, account them as well. (SLAB
doesn't actually use page allocator directly, so no change there).
Ideally SLOB and SLUB would be handled in separate patches, but due to
the shared kmalloc_order() function and different kfree()
implementations, it's easier to patch both at once to prevent
inconsistencies.
Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 03:58:42 +03:00
__free_pages ( page , order ) ;
2007-10-16 12:24:38 +04:00
return ;
}
2015-11-21 02:57:46 +03:00
slab_free ( page - > slab_cache , page , object , NULL , 1 , _RET_IP_ ) ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kfree ) ;
2015-02-13 01:59:41 +03:00
# define SHRINK_PROMOTE_MAX 32
2007-05-07 01:49:46 +04:00
/*
2015-02-13 01:59:41 +03:00
* kmem_cache_shrink discards empty slabs and promotes the slabs filled
* up most to the head of the partial lists . New allocations will then
* fill those up and thus they can be removed from the partial lists .
2007-05-09 13:32:39 +04:00
*
* The slabs with the least items are placed last . This results in them
* being allocated from last increasing the chance that the last objects
* are freed in them .
2007-05-07 01:49:46 +04:00
*/
2017-02-23 02:41:27 +03:00
int __kmem_cache_shrink ( struct kmem_cache * s )
2007-05-07 01:49:46 +04:00
{
int node ;
int i ;
struct kmem_cache_node * n ;
struct page * page ;
struct page * t ;
2015-02-13 01:59:41 +03:00
struct list_head discard ;
struct list_head promote [ SHRINK_PROMOTE_MAX ] ;
2007-05-07 01:49:46 +04:00
unsigned long flags ;
2015-02-13 01:59:44 +03:00
int ret = 0 ;
2007-05-07 01:49:46 +04:00
flush_all ( s ) ;
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2015-02-13 01:59:41 +03:00
INIT_LIST_HEAD ( & discard ) ;
for ( i = 0 ; i < SHRINK_PROMOTE_MAX ; i + + )
INIT_LIST_HEAD ( promote + i ) ;
2007-05-07 01:49:46 +04:00
spin_lock_irqsave ( & n - > list_lock , flags ) ;
/*
2015-02-13 01:59:41 +03:00
* Build lists of slabs to discard or promote .
2007-05-07 01:49:46 +04:00
*
2007-05-09 13:32:39 +04:00
* Note that concurrent frees may occur while we hold the
* list_lock . page - > inuse here is the upper limit .
2007-05-07 01:49:46 +04:00
*/
2019-05-14 03:16:12 +03:00
list_for_each_entry_safe ( page , t , & n - > partial , slab_list ) {
2015-02-13 01:59:41 +03:00
int free = page - > objects - page - > inuse ;
/* Do not reread page->inuse */
barrier ( ) ;
/* We do not keep full slabs on the list */
BUG_ON ( free < = 0 ) ;
if ( free = = page - > objects ) {
2019-05-14 03:16:12 +03:00
list_move ( & page - > slab_list , & discard ) ;
2011-08-10 01:12:22 +04:00
n - > nr_partial - - ;
2015-02-13 01:59:41 +03:00
} else if ( free < = SHRINK_PROMOTE_MAX )
2019-05-14 03:16:12 +03:00
list_move ( & page - > slab_list , promote + free - 1 ) ;
2007-05-07 01:49:46 +04:00
}
/*
2015-02-13 01:59:41 +03:00
* Promote the slabs filled up most to the head of the
* partial list .
2007-05-07 01:49:46 +04:00
*/
2015-02-13 01:59:41 +03:00
for ( i = SHRINK_PROMOTE_MAX - 1 ; i > = 0 ; i - - )
list_splice ( promote + i , & n - > partial ) ;
2007-05-07 01:49:46 +04:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2011-08-10 01:12:22 +04:00
/* Release empty slabs */
2019-05-14 03:16:12 +03:00
list_for_each_entry_safe ( page , t , & discard , slab_list )
2011-08-10 01:12:22 +04:00
discard_slab ( s , page ) ;
2015-02-13 01:59:44 +03:00
if ( slabs_node ( s , node ) )
ret = 1 ;
2007-05-07 01:49:46 +04:00
}
2015-02-13 01:59:44 +03:00
return ret ;
2007-05-07 01:49:46 +04:00
}
2007-10-22 03:41:37 +04:00
static int slab_mem_going_offline_callback ( void * arg )
{
struct kmem_cache * s ;
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
list_for_each_entry ( s , & slab_caches , list )
2017-02-23 02:41:27 +03:00
__kmem_cache_shrink ( s ) ;
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
return 0 ;
}
static void slab_mem_offline_callback ( void * arg )
{
struct kmem_cache_node * n ;
struct kmem_cache * s ;
struct memory_notify * marg = arg ;
int offline_node ;
2012-12-12 04:01:05 +04:00
offline_node = marg - > status_change_nid_normal ;
2007-10-22 03:41:37 +04:00
/*
* If the node still has available memory . we need kmem_cache_node
* for it yet .
*/
if ( offline_node < 0 )
return ;
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
n = get_node ( s , offline_node ) ;
if ( n ) {
/*
* if n - > nr_slabs > 0 , slabs still exist on the node
* that is going down . We were unable to free them ,
2009-12-18 23:40:42 +03:00
* and offline_pages ( ) function shouldn ' t call this
2007-10-22 03:41:37 +04:00
* callback . So , we must fail .
*/
2008-04-14 19:53:02 +04:00
BUG_ON ( slabs_node ( s , offline_node ) ) ;
2007-10-22 03:41:37 +04:00
s - > node [ offline_node ] = NULL ;
2010-08-25 23:51:14 +04:00
kmem_cache_free ( kmem_cache_node , n ) ;
2007-10-22 03:41:37 +04:00
}
}
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
}
static int slab_mem_going_online_callback ( void * arg )
{
struct kmem_cache_node * n ;
struct kmem_cache * s ;
struct memory_notify * marg = arg ;
2012-12-12 04:01:05 +04:00
int nid = marg - > status_change_nid_normal ;
2007-10-22 03:41:37 +04:00
int ret = 0 ;
/*
* If the node ' s memory is already available , then kmem_cache_node is
* already created . Nothing to do .
*/
if ( nid < 0 )
return 0 ;
/*
2008-04-30 03:11:12 +04:00
* We are bringing a node online . No memory is available yet . We must
2007-10-22 03:41:37 +04:00
* allocate a kmem_cache_node structure in order to bring the node
* online .
*/
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
/*
* XXX : kmem_cache_alloc_node will fallback to other nodes
* since memory is not yet available from the node that
* is brought up .
*/
2010-08-25 23:51:14 +04:00
n = kmem_cache_alloc ( kmem_cache_node , GFP_KERNEL ) ;
2007-10-22 03:41:37 +04:00
if ( ! n ) {
ret = - ENOMEM ;
goto out ;
}
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( n ) ;
2007-10-22 03:41:37 +04:00
s - > node [ nid ] = n ;
}
out :
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
return ret ;
}
static int slab_memory_callback ( struct notifier_block * self ,
unsigned long action , void * arg )
{
int ret = 0 ;
switch ( action ) {
case MEM_GOING_ONLINE :
ret = slab_mem_going_online_callback ( arg ) ;
break ;
case MEM_GOING_OFFLINE :
ret = slab_mem_going_offline_callback ( arg ) ;
break ;
case MEM_OFFLINE :
case MEM_CANCEL_ONLINE :
slab_mem_offline_callback ( arg ) ;
break ;
case MEM_ONLINE :
case MEM_CANCEL_OFFLINE :
break ;
}
2008-12-02 00:13:48 +03:00
if ( ret )
ret = notifier_from_errno ( ret ) ;
else
ret = NOTIFY_OK ;
2007-10-22 03:41:37 +04:00
return ret ;
}
2013-04-30 02:08:06 +04:00
static struct notifier_block slab_memory_callback_nb = {
. notifier_call = slab_memory_callback ,
. priority = SLAB_CALLBACK_PRI ,
} ;
2007-10-22 03:41:37 +04:00
2007-05-07 01:49:36 +04:00
/********************************************************************
* Basic setup of slabs
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
2010-08-20 21:37:15 +04:00
/*
* Used for early kmem_cache structures that were allocated using
2012-11-28 20:23:07 +04:00
* the page allocator . Allocate them properly then fix up the pointers
* that may be pointing to the wrong kmem_cache structure .
2010-08-20 21:37:15 +04:00
*/
2012-11-28 20:23:07 +04:00
static struct kmem_cache * __init bootstrap ( struct kmem_cache * static_cache )
2010-08-20 21:37:15 +04:00
{
int node ;
2012-11-28 20:23:07 +04:00
struct kmem_cache * s = kmem_cache_zalloc ( kmem_cache , GFP_NOWAIT ) ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2010-08-20 21:37:15 +04:00
2012-11-28 20:23:07 +04:00
memcpy ( s , static_cache , kmem_cache - > object_size ) ;
2010-08-20 21:37:15 +04:00
2013-02-22 20:20:00 +04:00
/*
* This runs very early , and only the boot processor is supposed to be
* up . Even if it weren ' t true , IRQs are not up so we couldn ' t fire
* IPIs around .
*/
__flush_cpu_slab ( s , smp_processor_id ( ) ) ;
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2010-08-20 21:37:15 +04:00
struct page * p ;
2019-05-14 03:16:12 +03:00
list_for_each_entry ( p , & n - > partial , slab_list )
2014-08-07 03:04:09 +04:00
p - > slab_cache = s ;
2010-08-20 21:37:15 +04:00
2011-04-12 11:22:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2019-05-14 03:16:12 +03:00
list_for_each_entry ( p , & n - > full , slab_list )
2014-08-07 03:04:09 +04:00
p - > slab_cache = s ;
2010-08-20 21:37:15 +04:00
# endif
}
2012-11-28 20:23:07 +04:00
list_add ( & s - > list , & slab_caches ) ;
return s ;
2010-08-20 21:37:15 +04:00
}
2007-05-07 01:49:36 +04:00
void __init kmem_cache_init ( void )
{
2012-11-28 20:23:07 +04:00
static __initdata struct kmem_cache boot_kmem_cache ,
boot_kmem_cache_node ;
2010-08-20 21:37:15 +04:00
2012-01-11 03:07:32 +04:00
if ( debug_guardpage_minorder ( ) )
slub_max_order = 0 ;
2012-11-28 20:23:07 +04:00
kmem_cache_node = & boot_kmem_cache_node ;
kmem_cache = & boot_kmem_cache ;
2010-08-20 21:37:15 +04:00
2012-11-28 20:23:07 +04:00
create_boot_cache ( kmem_cache_node , " kmem_cache_node " ,
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 05:50:28 +03:00
sizeof ( struct kmem_cache_node ) , SLAB_HWCACHE_ALIGN , 0 , 0 ) ;
2007-10-22 03:41:37 +04:00
2013-04-30 02:08:06 +04:00
register_hotmemory_notifier ( & slab_memory_callback_nb ) ;
2007-05-07 01:49:36 +04:00
/* Able to allocate the per node structures */
slab_state = PARTIAL ;
2012-11-28 20:23:07 +04:00
create_boot_cache ( kmem_cache , " kmem_cache " ,
offsetof ( struct kmem_cache , node ) +
nr_node_ids * sizeof ( struct kmem_cache_node * ) ,
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 05:50:28 +03:00
SLAB_HWCACHE_ALIGN , 0 , 0 ) ;
2012-09-05 03:18:33 +04:00
2012-11-28 20:23:07 +04:00
kmem_cache = bootstrap ( & boot_kmem_cache ) ;
kmem_cache_node = bootstrap ( & boot_kmem_cache_node ) ;
2010-08-20 21:37:15 +04:00
/* Now we can use the kmem_cache to allocate kmalloc slabs */
2015-06-25 02:55:57 +03:00
setup_kmalloc_cache_index_table ( ) ;
2013-01-10 23:12:17 +04:00
create_kmalloc_caches ( 0 ) ;
2007-05-07 01:49:36 +04:00
2016-07-27 01:21:59 +03:00
/* Setup random freelists for each cache */
init_freelist_randomization ( ) ;
2016-08-18 15:57:19 +03:00
cpuhp_setup_state_nocalls ( CPUHP_SLUB_DEAD , " slub:dead " , NULL ,
slub_cpu_dead ) ;
2007-05-07 01:49:36 +04:00
2019-03-06 02:48:26 +03:00
pr_info ( " SLUB: HWalign=%d, Order=%u-%u, MinObjects=%u, CPUs=%u, Nodes=%u \n " ,
2013-01-10 23:12:17 +04:00
cache_line_size ( ) ,
2007-05-07 01:49:36 +04:00
slub_min_order , slub_max_order , slub_min_objects ,
nr_cpu_ids , nr_node_ids ) ;
}
2009-06-12 15:03:06 +04:00
void __init kmem_cache_init_late ( void )
{
}
2012-12-19 02:22:34 +04:00
struct kmem_cache *
2018-04-06 02:20:37 +03:00
__kmem_cache_alias ( const char * name , unsigned int size , unsigned int align ,
2017-11-16 04:32:18 +03:00
slab_flags_t flags , void ( * ctor ) ( void * ) )
2007-05-07 01:49:36 +04:00
{
2020-08-07 09:21:27 +03:00
struct kmem_cache * s ;
2007-05-07 01:49:36 +04:00
memcg, slab: never try to merge memcg caches
When a kmem cache is created (kmem_cache_create_memcg()), we first try to
find a compatible cache that already exists and can handle requests from
the new cache, i.e. has the same object size, alignment, ctor, etc. If
there is such a cache, we do not create any new caches, instead we simply
increment the refcount of the cache found and return it.
Currently we do this procedure not only when creating root caches, but
also for memcg caches. However, there is no point in that, because, as
every memcg cache has exactly the same parameters as its parent and cache
merging cannot be turned off in runtime (only on boot by passing
"slub_nomerge"), the root caches of any two potentially mergeable memcg
caches should be merged already, i.e. it must be the same root cache, and
therefore we couldn't even get to the memcg cache creation, because it
already exists.
The only exception is boot caches - they are explicitly forbidden to be
merged by setting their refcount to -1. There are currently only two of
them - kmem_cache and kmem_cache_node, which are used in slab internals (I
do not count kmalloc caches as their refcount is set to 1 immediately
after creation). Since they are prevented from merging preliminary I
guess we should avoid to merge their children too.
So let's remove the useless code responsible for merging memcg caches.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 02:39:23 +04:00
s = find_mergeable ( size , align , flags , name , ctor ) ;
2007-05-07 01:49:36 +04:00
if ( s ) {
s - > refcount + + ;
2014-04-08 02:39:29 +04:00
2007-05-07 01:49:36 +04:00
/*
* Adjust the object sizes so that we clear
* the complete object on kzalloc .
*/
2018-04-06 02:21:17 +03:00
s - > object_size = max ( s - > object_size , size ) ;
2018-04-06 02:21:06 +03:00
s - > inuse = max ( s - > inuse , ALIGN ( size , sizeof ( void * ) ) ) ;
2008-02-16 10:45:26 +03:00
2008-12-18 09:09:46 +03:00
if ( sysfs_slab_alias ( s , name ) ) {
s - > refcount - - ;
2012-09-05 04:18:32 +04:00
s = NULL ;
2008-12-18 09:09:46 +03:00
}
2007-07-17 15:03:31 +04:00
}
2008-02-16 10:45:26 +03:00
2012-09-05 04:18:32 +04:00
return s ;
}
2010-09-15 00:21:12 +04:00
2017-11-16 04:32:18 +03:00
int __kmem_cache_create ( struct kmem_cache * s , slab_flags_t flags )
2012-09-05 04:18:32 +04:00
{
2012-09-05 13:07:44 +04:00
int err ;
err = kmem_cache_open ( s , flags ) ;
if ( err )
return err ;
2012-07-07 00:25:13 +04:00
2012-11-28 20:23:07 +04:00
/* Mutex is not taken during early boot */
if ( slab_state < = UP )
return 0 ;
2012-09-05 13:07:44 +04:00
err = sysfs_slab_add ( s ) ;
if ( err )
2016-02-18 00:11:37 +03:00
__kmem_cache_release ( s ) ;
2012-07-07 00:25:13 +04:00
2012-09-05 13:07:44 +04:00
return err ;
2007-05-07 01:49:36 +04:00
}
2008-08-19 21:43:25 +04:00
void * __kmalloc_track_caller ( size_t size , gfp_t gfpflags , unsigned long caller )
2007-05-07 01:49:36 +04:00
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-24 21:49:35 +04:00
void * ret ;
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) )
2008-02-11 23:47:46 +03:00
return kmalloc_large ( size , gfpflags ) ;
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , gfpflags ) ;
2007-05-07 01:49:36 +04:00
2007-10-16 12:24:44 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2007-05-07 01:49:36 +04:00
2012-09-09 00:47:58 +04:00
ret = slab_alloc ( s , gfpflags , caller ) ;
2008-08-24 21:49:35 +04:00
2011-03-31 05:57:33 +04:00
/* Honor the call site pointer we received. */
2009-03-23 16:12:24 +03:00
trace_kmalloc ( caller , ret , size , s - > size , gfpflags ) ;
2008-08-24 21:49:35 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
2020-03-23 17:49:00 +03:00
EXPORT_SYMBOL ( __kmalloc_track_caller ) ;
2007-05-07 01:49:36 +04:00
2010-09-29 16:02:15 +04:00
# ifdef CONFIG_NUMA
2007-05-07 01:49:36 +04:00
void * __kmalloc_node_track_caller ( size_t size , gfp_t gfpflags ,
2008-08-19 21:43:25 +04:00
int node , unsigned long caller )
2007-05-07 01:49:36 +04:00
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-24 21:49:35 +04:00
void * ret ;
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) ) {
2010-04-08 13:26:44 +04:00
ret = kmalloc_large_node ( size , gfpflags , node ) ;
trace_kmalloc_node ( caller , ret ,
size , PAGE_SIZE < < get_order ( size ) ,
gfpflags , node ) ;
return ret ;
}
2008-02-11 23:47:46 +03:00
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , gfpflags ) ;
2007-05-07 01:49:36 +04:00
2007-10-16 12:24:44 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2007-05-07 01:49:36 +04:00
2012-09-09 00:47:58 +04:00
ret = slab_alloc_node ( s , gfpflags , node , caller ) ;
2008-08-24 21:49:35 +04:00
2011-03-31 05:57:33 +04:00
/* Honor the call site pointer we received. */
2009-03-23 16:12:24 +03:00
trace_kmalloc_node ( caller , ret , size , s - > size , gfpflags , node ) ;
2008-08-24 21:49:35 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
2020-03-23 17:49:00 +03:00
EXPORT_SYMBOL ( __kmalloc_node_track_caller ) ;
2010-09-29 16:02:15 +04:00
# endif
2007-05-07 01:49:36 +04:00
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SYSFS
2008-04-14 20:11:40 +04:00
static int count_inuse ( struct page * page )
{
return page - > inuse ;
}
static int count_total ( struct page * page )
{
return page - > objects ;
}
2010-10-05 22:57:26 +04:00
# endif
2008-04-14 20:11:40 +04:00
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2020-01-31 09:11:57 +03:00
static void validate_slab ( struct kmem_cache * s , struct page * page )
2007-05-07 01:49:43 +04:00
{
void * p ;
2008-03-02 00:40:44 +03:00
void * addr = page_address ( page ) ;
2020-01-31 09:11:57 +03:00
unsigned long * map ;
slab_lock ( page ) ;
2007-05-07 01:49:43 +04:00
2019-12-01 04:49:37 +03:00
if ( ! check_slab ( s , page ) | | ! on_freelist ( s , page , NULL ) )
2020-01-31 09:11:57 +03:00
goto unlock ;
2007-05-07 01:49:43 +04:00
/* Now we know that a valid freelist exists */
2020-01-31 09:11:57 +03:00
map = get_map ( s , page ) ;
2011-04-15 23:48:13 +04:00
for_each_object ( p , s , addr , page - > objects ) {
2020-08-07 09:20:42 +03:00
u8 val = test_bit ( __obj_to_index ( s , addr , p ) , map ) ?
2019-12-01 04:49:37 +03:00
SLUB_RED_INACTIVE : SLUB_RED_ACTIVE ;
2007-05-07 01:49:43 +04:00
2019-12-01 04:49:37 +03:00
if ( ! check_object ( s , page , p , val ) )
break ;
}
2020-01-31 09:11:57 +03:00
put_map ( map ) ;
unlock :
2011-06-01 21:25:53 +04:00
slab_unlock ( page ) ;
2007-05-07 01:49:43 +04:00
}
2007-07-17 15:03:30 +04:00
static int validate_slab_node ( struct kmem_cache * s ,
2020-01-31 09:11:57 +03:00
struct kmem_cache_node * n )
2007-05-07 01:49:43 +04:00
{
unsigned long count = 0 ;
struct page * page ;
unsigned long flags ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
2019-05-14 03:16:12 +03:00
list_for_each_entry ( page , & n - > partial , slab_list ) {
2020-01-31 09:11:57 +03:00
validate_slab ( s , page ) ;
2007-05-07 01:49:43 +04:00
count + + ;
}
if ( count ! = n - > nr_partial )
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB %s: %ld partial slabs counted but counter=%ld \n " ,
s - > name , count , n - > nr_partial ) ;
2007-05-07 01:49:43 +04:00
if ( ! ( s - > flags & SLAB_STORE_USER ) )
goto out ;
2019-05-14 03:16:12 +03:00
list_for_each_entry ( page , & n - > full , slab_list ) {
2020-01-31 09:11:57 +03:00
validate_slab ( s , page ) ;
2007-05-07 01:49:43 +04:00
count + + ;
}
if ( count ! = atomic_long_read ( & n - > nr_slabs ) )
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB: %s %ld slabs counted but counter=%ld \n " ,
s - > name , count , atomic_long_read ( & n - > nr_slabs ) ) ;
2007-05-07 01:49:43 +04:00
out :
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
return count ;
}
2007-07-17 15:03:30 +04:00
static long validate_slab_cache ( struct kmem_cache * s )
2007-05-07 01:49:43 +04:00
{
int node ;
unsigned long count = 0 ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2007-05-07 01:49:43 +04:00
flush_all ( s ) ;
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n )
2020-01-31 09:11:57 +03:00
count + = validate_slab_node ( s , n ) ;
2007-05-07 01:49:43 +04:00
return count ;
}
2007-05-07 01:49:45 +04:00
/*
2007-05-09 13:32:39 +04:00
* Generate lists of code addresses where slabcache objects are allocated
2007-05-07 01:49:45 +04:00
* and freed .
*/
struct location {
unsigned long count ;
2008-08-19 21:43:25 +04:00
unsigned long addr ;
2007-05-09 13:32:45 +04:00
long long sum_time ;
long min_time ;
long max_time ;
long min_pid ;
long max_pid ;
2009-01-01 02:42:29 +03:00
DECLARE_BITMAP ( cpus , NR_CPUS ) ;
2007-05-09 13:32:45 +04:00
nodemask_t nodes ;
2007-05-07 01:49:45 +04:00
} ;
struct loc_track {
unsigned long max ;
unsigned long count ;
struct location * loc ;
} ;
static void free_loc_track ( struct loc_track * t )
{
if ( t - > max )
free_pages ( ( unsigned long ) t - > loc ,
get_order ( sizeof ( struct location ) * t - > max ) ) ;
}
2007-07-17 15:03:20 +04:00
static int alloc_loc_track ( struct loc_track * t , unsigned long max , gfp_t flags )
2007-05-07 01:49:45 +04:00
{
struct location * l ;
int order ;
order = get_order ( sizeof ( struct location ) * max ) ;
2007-07-17 15:03:20 +04:00
l = ( void * ) __get_free_pages ( flags , order ) ;
2007-05-07 01:49:45 +04:00
if ( ! l )
return 0 ;
if ( t - > count ) {
memcpy ( l , t - > loc , sizeof ( struct location ) * t - > count ) ;
free_loc_track ( t ) ;
}
t - > max = max ;
t - > loc = l ;
return 1 ;
}
static int add_location ( struct loc_track * t , struct kmem_cache * s ,
2007-05-09 13:32:45 +04:00
const struct track * track )
2007-05-07 01:49:45 +04:00
{
long start , end , pos ;
struct location * l ;
2008-08-19 21:43:25 +04:00
unsigned long caddr ;
2007-05-09 13:32:45 +04:00
unsigned long age = jiffies - track - > when ;
2007-05-07 01:49:45 +04:00
start = - 1 ;
end = t - > count ;
for ( ; ; ) {
pos = start + ( end - start + 1 ) / 2 ;
/*
* There is nothing at " end " . If we end up there
* we need to add something to before end .
*/
if ( pos = = end )
break ;
caddr = t - > loc [ pos ] . addr ;
2007-05-09 13:32:45 +04:00
if ( track - > addr = = caddr ) {
l = & t - > loc [ pos ] ;
l - > count + + ;
if ( track - > when ) {
l - > sum_time + = age ;
if ( age < l - > min_time )
l - > min_time = age ;
if ( age > l - > max_time )
l - > max_time = age ;
if ( track - > pid < l - > min_pid )
l - > min_pid = track - > pid ;
if ( track - > pid > l - > max_pid )
l - > max_pid = track - > pid ;
2009-01-01 02:42:29 +03:00
cpumask_set_cpu ( track - > cpu ,
to_cpumask ( l - > cpus ) ) ;
2007-05-09 13:32:45 +04:00
}
node_set ( page_to_nid ( virt_to_page ( track ) ) , l - > nodes ) ;
2007-05-07 01:49:45 +04:00
return 1 ;
}
2007-05-09 13:32:45 +04:00
if ( track - > addr < caddr )
2007-05-07 01:49:45 +04:00
end = pos ;
else
start = pos ;
}
/*
2007-05-09 13:32:39 +04:00
* Not found . Insert new tracking element .
2007-05-07 01:49:45 +04:00
*/
2007-07-17 15:03:20 +04:00
if ( t - > count > = t - > max & & ! alloc_loc_track ( t , 2 * t - > max , GFP_ATOMIC ) )
2007-05-07 01:49:45 +04:00
return 0 ;
l = t - > loc + pos ;
if ( pos < t - > count )
memmove ( l + 1 , l ,
( t - > count - pos ) * sizeof ( struct location ) ) ;
t - > count + + ;
l - > count = 1 ;
2007-05-09 13:32:45 +04:00
l - > addr = track - > addr ;
l - > sum_time = age ;
l - > min_time = age ;
l - > max_time = age ;
l - > min_pid = track - > pid ;
l - > max_pid = track - > pid ;
2009-01-01 02:42:29 +03:00
cpumask_clear ( to_cpumask ( l - > cpus ) ) ;
cpumask_set_cpu ( track - > cpu , to_cpumask ( l - > cpus ) ) ;
2007-05-09 13:32:45 +04:00
nodes_clear ( l - > nodes ) ;
node_set ( page_to_nid ( virt_to_page ( track ) ) , l - > nodes ) ;
2007-05-07 01:49:45 +04:00
return 1 ;
}
static void process_slab ( struct loc_track * t , struct kmem_cache * s ,
2020-01-31 09:11:57 +03:00
struct page * page , enum track_item alloc )
2007-05-07 01:49:45 +04:00
{
2008-03-02 00:40:44 +03:00
void * addr = page_address ( page ) ;
2007-05-07 01:49:45 +04:00
void * p ;
2020-01-31 09:11:57 +03:00
unsigned long * map ;
2007-05-07 01:49:45 +04:00
2020-01-31 09:11:57 +03:00
map = get_map ( s , page ) ;
2008-04-14 20:11:31 +04:00
for_each_object ( p , s , addr , page - > objects )
2020-08-07 09:20:42 +03:00
if ( ! test_bit ( __obj_to_index ( s , addr , p ) , map ) )
2007-05-09 13:32:45 +04:00
add_location ( t , s , get_track ( s , p , alloc ) ) ;
2020-01-31 09:11:57 +03:00
put_map ( map ) ;
2007-05-07 01:49:45 +04:00
}
static int list_locations ( struct kmem_cache * s , char * buf ,
enum track_item alloc )
{
2008-02-01 02:20:50 +03:00
int len = 0 ;
2007-05-07 01:49:45 +04:00
unsigned long i ;
2007-07-17 15:03:20 +04:00
struct loc_track t = { 0 , 0 , NULL } ;
2007-05-07 01:49:45 +04:00
int node ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2007-05-07 01:49:45 +04:00
2020-01-31 09:11:57 +03:00
if ( ! alloc_loc_track ( & t , PAGE_SIZE / sizeof ( struct location ) ,
GFP_KERNEL ) ) {
2007-07-17 15:03:20 +04:00
return sprintf ( buf , " Out of memory \n " ) ;
2010-03-25 00:25:47 +03:00
}
2007-05-07 01:49:45 +04:00
/* Push back cpu slabs */
flush_all ( s ) ;
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2007-05-07 01:49:45 +04:00
unsigned long flags ;
struct page * page ;
2007-08-23 01:01:56 +04:00
if ( ! atomic_long_read ( & n - > nr_slabs ) )
2007-05-07 01:49:45 +04:00
continue ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
2019-05-14 03:16:12 +03:00
list_for_each_entry ( page , & n - > partial , slab_list )
2020-01-31 09:11:57 +03:00
process_slab ( & t , s , page , alloc ) ;
2019-05-14 03:16:12 +03:00
list_for_each_entry ( page , & n - > full , slab_list )
2020-01-31 09:11:57 +03:00
process_slab ( & t , s , page , alloc ) ;
2007-05-07 01:49:45 +04:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
}
for ( i = 0 ; i < t . count ; i + + ) {
2007-05-09 13:32:45 +04:00
struct location * l = & t . loc [ i ] ;
2007-05-07 01:49:45 +04:00
2008-12-10 00:14:27 +03:00
if ( len > PAGE_SIZE - KSYM_SYMBOL_LEN - 100 )
2007-05-07 01:49:45 +04:00
break ;
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " %7ld " , l - > count ) ;
2007-05-09 13:32:45 +04:00
if ( l - > addr )
2011-01-14 02:45:52 +03:00
len + = sprintf ( buf + len , " %pS " , ( void * ) l - > addr ) ;
2007-05-07 01:49:45 +04:00
else
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " <not-available> " ) ;
2007-05-09 13:32:45 +04:00
if ( l - > sum_time ! = l - > min_time ) {
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " age=%ld/%ld/%ld " ,
2008-05-01 15:34:31 +04:00
l - > min_time ,
( long ) div_u64 ( l - > sum_time , l - > count ) ,
l - > max_time ) ;
2007-05-09 13:32:45 +04:00
} else
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " age=%ld " ,
2007-05-09 13:32:45 +04:00
l - > min_time ) ;
if ( l - > min_pid ! = l - > max_pid )
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " pid=%ld-%ld " ,
2007-05-09 13:32:45 +04:00
l - > min_pid , l - > max_pid ) ;
else
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " pid=%ld " ,
2007-05-09 13:32:45 +04:00
l - > min_pid ) ;
2009-01-01 02:42:29 +03:00
if ( num_online_cpus ( ) > 1 & &
! cpumask_empty ( to_cpumask ( l - > cpus ) ) & &
2015-02-14 01:37:59 +03:00
len < PAGE_SIZE - 60 )
len + = scnprintf ( buf + len , PAGE_SIZE - len - 50 ,
" cpus=%*pbl " ,
cpumask_pr_args ( to_cpumask ( l - > cpus ) ) ) ;
2007-05-09 13:32:45 +04:00
2009-06-17 02:32:15 +04:00
if ( nr_online_nodes > 1 & & ! nodes_empty ( l - > nodes ) & &
2015-02-14 01:37:59 +03:00
len < PAGE_SIZE - 60 )
len + = scnprintf ( buf + len , PAGE_SIZE - len - 50 ,
" nodes=%*pbl " ,
nodemask_pr_args ( & l - > nodes ) ) ;
2007-05-09 13:32:45 +04:00
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " \n " ) ;
2007-05-07 01:49:45 +04:00
}
free_loc_track ( & t ) ;
if ( ! t . count )
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf , " No data \n " ) ;
return len ;
2007-05-07 01:49:45 +04:00
}
2019-05-14 03:16:09 +03:00
# endif /* CONFIG_SLUB_DEBUG */
2007-05-07 01:49:45 +04:00
2010-10-05 22:57:27 +04:00
# ifdef SLUB_RESILIENCY_TEST
2014-08-07 03:04:16 +04:00
static void __init resiliency_test ( void )
2010-10-05 22:57:27 +04:00
{
u8 * p ;
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64byte regular allocations with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming after all dcache names, and the process can
thus "steal" the whole page with single 64byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is how /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently mainain separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow to add kmalloc-reclaimable caches as a third type.
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:34 +03:00
int type = KMALLOC_NORMAL ;
2010-10-05 22:57:27 +04:00
2013-01-10 23:14:19 +04:00
BUILD_BUG_ON ( KMALLOC_MIN_SIZE > 16 | | KMALLOC_SHIFT_HIGH < 10 ) ;
2010-10-05 22:57:27 +04:00
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB resiliency testing \n " ) ;
pr_err ( " ----------------------- \n " ) ;
pr_err ( " A. Corruption after allocation \n " ) ;
2010-10-05 22:57:27 +04:00
p = kzalloc ( 16 , GFP_KERNEL ) ;
p [ 16 ] = 0x12 ;
2014-06-05 03:06:34 +04:00
pr_err ( " \n 1. kmalloc-16: Clobber Redzone/next pointer 0x12->0x%p \n \n " ,
p + 16 ) ;
2010-10-05 22:57:27 +04:00
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64byte regular allocations with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming after all dcache names, and the process can
thus "steal" the whole page with single 64byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is how /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently mainain separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow to add kmalloc-reclaimable caches as a third type.
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:34 +03:00
validate_slab_cache ( kmalloc_caches [ type ] [ 4 ] ) ;
2010-10-05 22:57:27 +04:00
/* Hmmm... The next two are dangerous */
p = kzalloc ( 32 , GFP_KERNEL ) ;
p [ 32 + sizeof ( void * ) ] = 0x34 ;
2014-06-05 03:06:34 +04:00
pr_err ( " \n 2. kmalloc-32: Clobber next pointer/next slab 0x34 -> -0x%p \n " ,
p ) ;
pr_err ( " If allocated object is overwritten then not detectable \n \n " ) ;
2010-10-05 22:57:27 +04:00
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64byte regular allocations with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming after all dcache names, and the process can
thus "steal" the whole page with single 64byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is how /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently mainain separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow to add kmalloc-reclaimable caches as a third type.
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:34 +03:00
validate_slab_cache ( kmalloc_caches [ type ] [ 5 ] ) ;
2010-10-05 22:57:27 +04:00
p = kzalloc ( 64 , GFP_KERNEL ) ;
p + = 64 + ( get_cycles ( ) & 0xff ) * sizeof ( void * ) ;
* p = 0x56 ;
2014-06-05 03:06:34 +04:00
pr_err ( " \n 3. kmalloc-64: corrupting random byte 0x56->0x%p \n " ,
p ) ;
pr_err ( " If allocated object is overwritten then not detectable \n \n " ) ;
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64byte regular allocations with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming after all dcache names, and the process can
thus "steal" the whole page with single 64byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is how /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently mainain separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow to add kmalloc-reclaimable caches as a third type.
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:34 +03:00
validate_slab_cache ( kmalloc_caches [ type ] [ 6 ] ) ;
2010-10-05 22:57:27 +04:00
2014-06-05 03:06:34 +04:00
pr_err ( " \n B. Corruption after free \n " ) ;
2010-10-05 22:57:27 +04:00
p = kzalloc ( 128 , GFP_KERNEL ) ;
kfree ( p ) ;
* p = 0x78 ;
2014-06-05 03:06:34 +04:00
pr_err ( " 1. kmalloc-128: Clobber first word 0x78->0x%p \n \n " , p ) ;
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64byte regular allocations with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming after all dcache names, and the process can
thus "steal" the whole page with single 64byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is how /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently mainain separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow to add kmalloc-reclaimable caches as a third type.
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:34 +03:00
validate_slab_cache ( kmalloc_caches [ type ] [ 7 ] ) ;
2010-10-05 22:57:27 +04:00
p = kzalloc ( 256 , GFP_KERNEL ) ;
kfree ( p ) ;
p [ 50 ] = 0x9a ;
2014-06-05 03:06:34 +04:00
pr_err ( " \n 2. kmalloc-256: Clobber 50th byte 0x9a->0x%p \n \n " , p ) ;
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64byte regular allocations with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming after all dcache names, and the process can
thus "steal" the whole page with single 64byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is how /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently mainain separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow to add kmalloc-reclaimable caches as a third type.
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:34 +03:00
validate_slab_cache ( kmalloc_caches [ type ] [ 8 ] ) ;
2010-10-05 22:57:27 +04:00
p = kzalloc ( 512 , GFP_KERNEL ) ;
kfree ( p ) ;
p [ 512 ] = 0xab ;
2014-06-05 03:06:34 +04:00
pr_err ( " \n 3. kmalloc-512: Clobber redzone 0xab->0x%p \n \n " , p ) ;
mm, slab: combine kmalloc_caches and kmalloc_dma_caches
Patch series "kmalloc-reclaimable caches", v4.
As discussed at LSF/MM [1] here's a patchset that introduces
kmalloc-reclaimable caches (more details in the second patch) and uses
them for dcache external names. That allows us to repurpose the
NR_INDIRECTLY_RECLAIMABLE_BYTES counter later in the series.
With patch 3/6, dcache external names are allocated from kmalloc-rcl-*
caches, eliminating the need for manual accounting. More importantly, it
also ensures the reclaimable kmalloc allocations are grouped in pages
separate from the regular kmalloc allocations. The need for proper
accounting of dcache external names has shown it's easy for misbehaving
process to allocate lots of them, causing premature OOMs. Without the
added grouping, it's likely that a similar workload can interleave the
dcache external names allocations with regular kmalloc allocations (note:
I haven't searched myself for an example of such regular kmalloc
allocation, but I would be very surprised if there wasn't some). A
pathological case would be e.g. one 64byte regular allocations with 63
external dcache names in a page (64x64=4096), which means the page is not
freed even after reclaiming after all dcache names, and the process can
thus "steal" the whole page with single 64byte allocation.
If other kmalloc users similar to dcache external names become identified,
they can also benefit from the new functionality simply by adding
__GFP_RECLAIMABLE to the kmalloc calls.
Side benefits of the patchset (that could be also merged separately)
include removed branch for detecting __GFP_DMA kmalloc(), and shortening
kmalloc cache names in /proc/slabinfo output. The latter is potentially
an ABI break in case there are tools parsing the names and expecting the
values to be in bytes.
This is how /proc/slabinfo looks like after booting in virtme:
...
kmalloc-rcl-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
...
kmalloc-rcl-96 7 32 128 32 1 : tunables 120 60 8 : slabdata 1 1 0
kmalloc-rcl-64 25 128 64 64 1 : tunables 120 60 8 : slabdata 2 2 0
kmalloc-rcl-32 0 0 32 124 1 : tunables 120 60 8 : slabdata 0 0 0
kmalloc-4M 0 0 4194304 1 1024 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-2M 0 0 2097152 1 512 : tunables 1 1 0 : slabdata 0 0 0
kmalloc-1M 0 0 1048576 1 256 : tunables 1 1 0 : slabdata 0 0 0
...
/proc/vmstat with renamed nr_indirectly_reclaimable_bytes counter:
...
nr_slab_reclaimable 2817
nr_slab_unreclaimable 1781
...
nr_kernel_misc_reclaimable 0
...
/proc/meminfo with new KReclaimable counter:
...
Shmem: 564 kB
KReclaimable: 11260 kB
Slab: 18368 kB
SReclaimable: 11260 kB
SUnreclaim: 7108 kB
KernelStack: 1248 kB
...
This patch (of 6):
The kmalloc caches currently mainain separate (optional) array
kmalloc_dma_caches for __GFP_DMA allocations. There are tests for
__GFP_DMA in the allocation hotpaths. We can avoid the branches by
combining kmalloc_caches and kmalloc_dma_caches into a single
two-dimensional array where the outer dimension is cache "type". This
will also allow to add kmalloc-reclaimable caches as a third type.
Link: http://lkml.kernel.org/r/20180731090649.16028-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-27 01:05:34 +03:00
validate_slab_cache ( kmalloc_caches [ type ] [ 9 ] ) ;
2010-10-05 22:57:27 +04:00
}
# else
# ifdef CONFIG_SYSFS
static void resiliency_test ( void ) { } ;
# endif
2019-05-14 03:16:09 +03:00
# endif /* SLUB_RESILIENCY_TEST */
2010-10-05 22:57:27 +04:00
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SYSFS
2007-05-07 01:49:36 +04:00
enum slab_stat_type {
2008-04-14 20:11:40 +04:00
SL_ALL , /* All slabs */
SL_PARTIAL , /* Only partially allocated slabs */
SL_CPU , /* Only slabs used for cpu caches */
SL_OBJECTS , /* Determine allocated objects not slabs */
SL_TOTAL /* Determine object capacity not slabs */
2007-05-07 01:49:36 +04:00
} ;
2008-04-14 20:11:40 +04:00
# define SO_ALL (1 << SL_ALL)
2007-05-07 01:49:36 +04:00
# define SO_PARTIAL (1 << SL_PARTIAL)
# define SO_CPU (1 << SL_CPU)
# define SO_OBJECTS (1 << SL_OBJECTS)
2008-04-14 20:11:40 +04:00
# define SO_TOTAL (1 << SL_TOTAL)
2007-05-07 01:49:36 +04:00
2017-02-23 02:41:39 +03:00
# ifdef CONFIG_MEMCG
static bool memcg_sysfs_enabled = IS_ENABLED ( CONFIG_SLUB_MEMCG_SYSFS_ON ) ;
static int __init setup_slub_memcg_sysfs ( char * str )
{
int v ;
if ( get_option ( & str , & v ) > 0 )
memcg_sysfs_enabled = v ;
return 1 ;
}
__setup ( " slub_memcg_sysfs= " , setup_slub_memcg_sysfs ) ;
# endif
2008-03-02 23:28:24 +03:00
static ssize_t show_slab_objects ( struct kmem_cache * s ,
char * buf , unsigned long flags )
2007-05-07 01:49:36 +04:00
{
unsigned long total = 0 ;
int node ;
int x ;
unsigned long * nodes ;
treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:
kzalloc(a * b, gfp)
with:
kcalloc(a * b, gfp)
as well as handling cases of:
kzalloc(a * b * c, gfp)
with:
kzalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kzalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kzalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kzalloc
+ kcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kzalloc(sizeof(THING) * C2, ...)
|
kzalloc(sizeof(TYPE) * C2, ...)
|
kzalloc(C1 * C2 * C3, ...)
|
kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc
+ kcalloc
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-13 00:03:40 +03:00
nodes = kcalloc ( nr_node_ids , sizeof ( unsigned long ) , GFP_KERNEL ) ;
2008-03-02 23:28:24 +03:00
if ( ! nodes )
return - ENOMEM ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:40 +04:00
if ( flags & SO_CPU ) {
int cpu ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:40 +04:00
for_each_possible_cpu ( cpu ) {
2013-07-15 05:05:29 +04:00
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab ,
cpu ) ;
2012-05-09 19:09:56 +04:00
int node ;
2011-08-10 01:12:27 +04:00
struct page * page ;
2007-10-16 12:26:05 +04:00
2015-04-16 02:14:08 +03:00
page = READ_ONCE ( c - > page ) ;
2012-05-09 19:09:56 +04:00
if ( ! page )
continue ;
2008-04-14 20:11:40 +04:00
2012-05-09 19:09:56 +04:00
node = page_to_nid ( page ) ;
if ( flags & SO_TOTAL )
x = page - > objects ;
else if ( flags & SO_OBJECTS )
x = page - > inuse ;
else
x = 1 ;
2011-08-10 01:12:27 +04:00
2012-05-09 19:09:56 +04:00
total + = x ;
nodes [ node ] + = x ;
2017-07-07 01:36:31 +03:00
page = slub_percpu_partial_read_once ( c ) ;
2011-08-10 01:12:27 +04:00
if ( page ) {
2013-09-10 07:43:37 +04:00
node = page_to_nid ( page ) ;
if ( flags & SO_TOTAL )
WARN_ON_ONCE ( 1 ) ;
else if ( flags & SO_OBJECTS )
WARN_ON_ONCE ( 1 ) ;
else
x = page - > pages ;
2011-11-22 19:02:02 +04:00
total + = x ;
nodes [ node ] + = x ;
2011-08-10 01:12:27 +04:00
}
2007-05-07 01:49:36 +04:00
}
}
mm/slub: fix a deadlock in show_slab_objects()
A long time ago we fixed a similar deadlock in show_slab_objects() [1].
However, it is apparently due to the commits like 01fb58bcba63 ("slab:
remove synchronous synchronize_sched() from memcg cache deactivation
path") and 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back by
just reading files in /sys/kernel/slab which will generate a lockdep
splat below.
Since the "mem_hotplug_lock" here is only to obtain a stable online node
mask while racing with NUMA node hotplug, in the worst case, the results
may me miscalculated while doing NUMA node hotplug, but they shall be
corrected by later reads of the same files.
WARNING: possible circular locking dependency detected
------------------------------------------------------
cat/5224 is trying to acquire lock:
ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
show_slab_objects+0x94/0x3a8
but task is already holding lock:
b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (kn->count#45){++++}:
lock_acquire+0x31c/0x360
__kernfs_remove+0x290/0x490
kernfs_remove+0x30/0x44
sysfs_remove_dir+0x70/0x88
kobject_del+0x50/0xb0
sysfs_slab_unlink+0x2c/0x38
shutdown_cache+0xa0/0xf0
kmemcg_cache_shutdown_fn+0x1c/0x34
kmemcg_workfn+0x44/0x64
process_one_work+0x4f4/0x950
worker_thread+0x390/0x4bc
kthread+0x1cc/0x1e8
ret_from_fork+0x10/0x18
-> #1 (slab_mutex){+.+.}:
lock_acquire+0x31c/0x360
__mutex_lock_common+0x16c/0xf78
mutex_lock_nested+0x40/0x50
memcg_create_kmem_cache+0x38/0x16c
memcg_kmem_cache_create_func+0x3c/0x70
process_one_work+0x4f4/0x950
worker_thread+0x390/0x4bc
kthread+0x1cc/0x1e8
ret_from_fork+0x10/0x18
-> #0 (mem_hotplug_lock.rw_sem){++++}:
validate_chain+0xd10/0x2bcc
__lock_acquire+0x7f4/0xb8c
lock_acquire+0x31c/0x360
get_online_mems+0x54/0x150
show_slab_objects+0x94/0x3a8
total_objects_show+0x28/0x34
slab_attr_show+0x38/0x54
sysfs_kf_seq_show+0x198/0x2d4
kernfs_seq_show+0xa4/0xcc
seq_read+0x30c/0x8a8
kernfs_fop_read+0xa8/0x314
__vfs_read+0x88/0x20c
vfs_read+0xd8/0x10c
ksys_read+0xb0/0x120
__arm64_sys_read+0x54/0x88
el0_svc_handler+0x170/0x240
el0_svc+0x8/0xc
other info that might help us debug this:
Chain exists of:
mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(kn->count#45);
lock(slab_mutex);
lock(kn->count#45);
lock(mem_hotplug_lock.rw_sem);
*** DEADLOCK ***
3 locks held by cat/5224:
#0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8
#1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0
#2: b8ff009693eee398 (kn->count#45){++++}, at:
kernfs_seq_start+0x44/0xf0
stack backtrace:
Call trace:
dump_backtrace+0x0/0x248
show_stack+0x20/0x2c
dump_stack+0xd0/0x140
print_circular_bug+0x368/0x380
check_noncircular+0x248/0x250
validate_chain+0xd10/0x2bcc
__lock_acquire+0x7f4/0xb8c
lock_acquire+0x31c/0x360
get_online_mems+0x54/0x150
show_slab_objects+0x94/0x3a8
total_objects_show+0x28/0x34
slab_attr_show+0x38/0x54
sysfs_kf_seq_show+0x198/0x2d4
kernfs_seq_show+0xa4/0xcc
seq_read+0x30c/0x8a8
kernfs_fop_read+0xa8/0x314
__vfs_read+0x88/0x20c
vfs_read+0xd8/0x10c
ksys_read+0xb0/0x120
__arm64_sys_read+0x54/0x88
el0_svc_handler+0x170/0x240
el0_svc+0x8/0xc
I think it is important to mention that this doesn't expose the
show_slab_objects to use-after-free. There is only a single path that
might really race here and that is the slab hotplug notifier callback
__kmem_cache_shrink (via slab_mem_going_offline_callback) but that path
doesn't really destroy kmem_cache_node data structures.
[1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html
[akpm@linux-foundation.org: add comment explaining why we don't need mem_hotplug_lock]
Link: http://lkml.kernel.org/r/1570192309-10132-1-git-send-email-cai@lca.pw
Fixes: 01fb58bcba63 ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path")
Fixes: 03afc0e25f7f ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-15 00:11:51 +03:00
/*
* It is impossible to take " mem_hotplug_lock " here with " kernfs_mutex "
* already held which will conflict with an existing lock order :
*
* mem_hotplug_lock - > slab_mutex - > kernfs_mutex
*
* We don ' t really need mem_hotplug_lock ( to hold off
* slab_mem_going_offline_callback ) here because slab ' s memory hot
* unplug code doesn ' t destroy the kmem_cache - > node [ ] data .
*/
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2008-04-14 20:11:40 +04:00
if ( flags & SO_ALL ) {
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
for_each_kmem_cache_node ( s , node , n ) {
2008-04-14 20:11:40 +04:00
2013-07-15 05:05:29 +04:00
if ( flags & SO_TOTAL )
x = atomic_long_read ( & n - > total_objects ) ;
else if ( flags & SO_OBJECTS )
x = atomic_long_read ( & n - > total_objects ) -
count_partial ( n , count_free ) ;
2007-05-07 01:49:36 +04:00
else
2008-04-14 20:11:40 +04:00
x = atomic_long_read ( & n - > nr_slabs ) ;
2007-05-07 01:49:36 +04:00
total + = x ;
nodes [ node ] + = x ;
}
2010-10-05 22:57:26 +04:00
} else
# endif
if ( flags & SO_PARTIAL ) {
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2007-05-07 01:49:36 +04:00
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2008-04-14 20:11:40 +04:00
if ( flags & SO_TOTAL )
x = count_partial ( n , count_total ) ;
else if ( flags & SO_OBJECTS )
x = count_partial ( n , count_inuse ) ;
2007-05-07 01:49:36 +04:00
else
2008-04-14 20:11:40 +04:00
x = n - > nr_partial ;
2007-05-07 01:49:36 +04:00
total + = x ;
nodes [ node ] + = x ;
}
}
x = sprintf ( buf , " %lu " , total ) ;
# ifdef CONFIG_NUMA
2014-08-07 03:04:09 +04:00
for ( node = 0 ; node < nr_node_ids ; node + + )
2007-05-07 01:49:36 +04:00
if ( nodes [ node ] )
x + = sprintf ( buf + x , " N%d=%lu " ,
node , nodes [ node ] ) ;
# endif
kfree ( nodes ) ;
return x + sprintf ( buf + x , " \n " ) ;
}
# define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
2011-07-14 16:07:13 +04:00
# define to_slab(n) container_of(n, struct kmem_cache, kobj)
2007-05-07 01:49:36 +04:00
struct slab_attribute {
struct attribute attr ;
ssize_t ( * show ) ( struct kmem_cache * s , char * buf ) ;
ssize_t ( * store ) ( struct kmem_cache * s , const char * x , size_t count ) ;
} ;
# define SLAB_ATTR_RO(_name) \
mm: restrict access to slab files under procfs and sysfs
Historically /proc/slabinfo and files under /sys/kernel/slab/* have
world read permissions and are accessible to the world. slabinfo
contains rather private information related both to the kernel and
userspace tasks. Depending on the situation, it might reveal either
private information per se or information useful to make another
targeted attack. Some examples of what can be learned by
reading/watching for /proc/slabinfo entries:
1) dentry (and different *inode*) number might reveal other processes fs
activity. The number of dentry "active objects" doesn't strictly show
file count opened/touched by a process, however, there is a good
correlation between them. The patch "proc: force dcache drop on
unauthorized access" relies on the privacy of dentry count.
2) different inode entries might reveal the same information as (1), but
these are more fine granted counters. If a filesystem is mounted in a
private mount point (or even a private namespace) and fs type differs from
other mounted fs types, fs activity in this mount point/namespace is
revealed. If there is a single ecryptfs mount point, the whole fs
activity of a single user is revealed. Number of files in ecryptfs
mount point is a private information per se.
3) fuse_* reveals number of files / fs activity of a user in a user
private mount point. It is approx. the same severity as ecryptfs
infoleak in (2).
4) sysfs_dir_cache similar to (2) reveals devices' addition/removal,
which can be otherwise hidden by "chmod 0700 /sys/". With 0444 slabinfo
the precise number of sysfs files is known to the world.
5) buffer_head might reveal some kernel activity. With other
information leaks an attacker might identify what specific kernel
routines generate buffer_head activity.
6) *kmalloc* infoleaks are very situational. Attacker should watch for
the specific kmalloc size entry and filter the noise related to the unrelated
kernel activity. If an attacker has relatively silent victim system, he
might get rather precise counters.
Additional information sources might significantly increase the slabinfo
infoleak benefits. E.g. if an attacker knows that the processes
activity on the system is very low (only core daemons like syslog and
cron), he may run setxid binaries / trigger local daemon activity /
trigger network services activity / await sporadic cron jobs activity
/ etc. and get rather precise counters for fs and network activity of
these privileged tasks, which is unknown otherwise.
Also hiding slabinfo and /sys/kernel/slab/* is a one step to complicate
exploitation of kernel heap overflows (and possibly, other bugs). The
related discussion:
http://thread.gmane.org/gmane.linux.kernel/1108378
To keep compatibility with old permission model where non-root
monitoring daemon could watch for kernel memleaks though slabinfo one
should do:
groupadd slabinfo
usermod -a -G slabinfo $MONITOR_USER
And add the following commands to init scripts (to mountall.conf in
Ubuntu's upstart case):
chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Reviewed-by: Kees Cook <kees@ubuntu.com>
Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: David Rientjes <rientjes@google.com>
CC: Valdis.Kletnieks@vt.edu
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Alan Cox <alan@linux.intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-27 21:54:53 +04:00
static struct slab_attribute _name # # _attr = \
__ATTR ( _name , 0400 , _name # # _show , NULL )
2007-05-07 01:49:36 +04:00
# define SLAB_ATTR(_name) \
static struct slab_attribute _name # # _attr = \
mm: restrict access to slab files under procfs and sysfs
Historically /proc/slabinfo and files under /sys/kernel/slab/* have
world read permissions and are accessible to the world. slabinfo
contains rather private information related both to the kernel and
userspace tasks. Depending on the situation, it might reveal either
private information per se or information useful to make another
targeted attack. Some examples of what can be learned by
reading/watching for /proc/slabinfo entries:
1) dentry (and different *inode*) number might reveal other processes fs
activity. The number of dentry "active objects" doesn't strictly show
file count opened/touched by a process, however, there is a good
correlation between them. The patch "proc: force dcache drop on
unauthorized access" relies on the privacy of dentry count.
2) different inode entries might reveal the same information as (1), but
these are more fine granted counters. If a filesystem is mounted in a
private mount point (or even a private namespace) and fs type differs from
other mounted fs types, fs activity in this mount point/namespace is
revealed. If there is a single ecryptfs mount point, the whole fs
activity of a single user is revealed. Number of files in ecryptfs
mount point is a private information per se.
3) fuse_* reveals number of files / fs activity of a user in a user
private mount point. It is approx. the same severity as ecryptfs
infoleak in (2).
4) sysfs_dir_cache similar to (2) reveals devices' addition/removal,
which can be otherwise hidden by "chmod 0700 /sys/". With 0444 slabinfo
the precise number of sysfs files is known to the world.
5) buffer_head might reveal some kernel activity. With other
information leaks an attacker might identify what specific kernel
routines generate buffer_head activity.
6) *kmalloc* infoleaks are very situational. Attacker should watch for
the specific kmalloc size entry and filter the noise related to the unrelated
kernel activity. If an attacker has relatively silent victim system, he
might get rather precise counters.
Additional information sources might significantly increase the slabinfo
infoleak benefits. E.g. if an attacker knows that the processes
activity on the system is very low (only core daemons like syslog and
cron), he may run setxid binaries / trigger local daemon activity /
trigger network services activity / await sporadic cron jobs activity
/ etc. and get rather precise counters for fs and network activity of
these privileged tasks, which is unknown otherwise.
Also hiding slabinfo and /sys/kernel/slab/* is a one step to complicate
exploitation of kernel heap overflows (and possibly, other bugs). The
related discussion:
http://thread.gmane.org/gmane.linux.kernel/1108378
To keep compatibility with old permission model where non-root
monitoring daemon could watch for kernel memleaks though slabinfo one
should do:
groupadd slabinfo
usermod -a -G slabinfo $MONITOR_USER
And add the following commands to init scripts (to mountall.conf in
Ubuntu's upstart case):
chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Reviewed-by: Kees Cook <kees@ubuntu.com>
Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: David Rientjes <rientjes@google.com>
CC: Valdis.Kletnieks@vt.edu
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Alan Cox <alan@linux.intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-27 21:54:53 +04:00
__ATTR ( _name , 0600 , _name # # _show , _name # # _store )
2007-05-07 01:49:36 +04:00
static ssize_t slab_size_show ( struct kmem_cache * s , char * buf )
{
2018-04-06 02:21:20 +03:00
return sprintf ( buf , " %u \n " , s - > size ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( slab_size ) ;
static ssize_t align_show ( struct kmem_cache * s , char * buf )
{
2018-04-06 02:21:02 +03:00
return sprintf ( buf , " %u \n " , s - > align ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( align ) ;
static ssize_t object_size_show ( struct kmem_cache * s , char * buf )
{
2018-04-06 02:21:17 +03:00
return sprintf ( buf , " %u \n " , s - > object_size ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( object_size ) ;
static ssize_t objs_per_slab_show ( struct kmem_cache * s , char * buf )
{
2018-04-06 02:21:39 +03:00
return sprintf ( buf , " %u \n " , oo_objects ( s - > oo ) ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( objs_per_slab ) ;
static ssize_t order_show ( struct kmem_cache * s , char * buf )
{
2018-04-06 02:21:39 +03:00
return sprintf ( buf , " %u \n " , oo_order ( s - > oo ) ) ;
2007-05-07 01:49:36 +04:00
}
2020-08-07 09:18:41 +03:00
SLAB_ATTR_RO ( order ) ;
2007-05-07 01:49:36 +04:00
slub: add min_partial sysfs tunable
Now that a cache's min_partial has been moved to struct kmem_cache, it's
possible to easily tune it from userspace by adding a sysfs attribute.
It may not be desirable to keep a large number of partial slabs around
if a cache is used infrequently and memory, especially when constrained
by a cgroup, is scarce. It's better to allow userspace to set the
minimum policy per cache instead of relying explicitly on
kmem_cache_shrink().
The memory savings from simply moving min_partial from struct
kmem_cache_node to struct kmem_cache is obviously not significant
(unless maybe you're from SGI or something), at the largest it's
# allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)
The true savings occurs when userspace reduces the number of partial
slabs that would otherwise be wasted, especially on machines with a
large number of nodes (ia64 with CONFIG_NODES_SHIFT at 10 for default?).
As well as the kernel estimates ideal values for n->min_partial and
ensures it's within a sane range, userspace has no other input other
than writing to /sys/kernel/slab/cache/shrink.
There simply isn't any better heuristic to add when calculating the
partial values for a better estimate that works for all possible caches.
And since it's currently a static value, the user really has no way of
reclaiming that wasted space, which can be significant when constrained
by a cgroup (either cpusets or, later, memory controller slab limits)
without shrinking it entirely.
This also allows the user to specify that increased fragmentation and
more partial slabs are actually desired to avoid the cost of allocating
new slabs at runtime for specific caches.
There's also no reason why this should be a per-struct kmem_cache_node
value in the first place. You could argue that a machine would have
such node size asymmetries that it should be specified on a per-node
basis, but we know nobody is doing that right now since it's a purely
static value at the moment and there's no convenient way to tune that
via slub's sysfs interface.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-02-23 04:40:09 +03:00
static ssize_t min_partial_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %lu \n " , s - > min_partial ) ;
}
static ssize_t min_partial_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
unsigned long min ;
int err ;
2013-09-12 01:20:25 +04:00
err = kstrtoul ( buf , 10 , & min ) ;
slub: add min_partial sysfs tunable
Now that a cache's min_partial has been moved to struct kmem_cache, it's
possible to easily tune it from userspace by adding a sysfs attribute.
It may not be desirable to keep a large number of partial slabs around
if a cache is used infrequently and memory, especially when constrained
by a cgroup, is scarce. It's better to allow userspace to set the
minimum policy per cache instead of relying explicitly on
kmem_cache_shrink().
The memory savings from simply moving min_partial from struct
kmem_cache_node to struct kmem_cache is obviously not significant
(unless maybe you're from SGI or something), at the largest it's
# allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)
The true savings occurs when userspace reduces the number of partial
slabs that would otherwise be wasted, especially on machines with a
large number of nodes (ia64 with CONFIG_NODES_SHIFT at 10 for default?).
As well as the kernel estimates ideal values for n->min_partial and
ensures it's within a sane range, userspace has no other input other
than writing to /sys/kernel/slab/cache/shrink.
There simply isn't any better heuristic to add when calculating the
partial values for a better estimate that works for all possible caches.
And since it's currently a static value, the user really has no way of
reclaiming that wasted space, which can be significant when constrained
by a cgroup (either cpusets or, later, memory controller slab limits)
without shrinking it entirely.
This also allows the user to specify that increased fragmentation and
more partial slabs are actually desired to avoid the cost of allocating
new slabs at runtime for specific caches.
There's also no reason why this should be a per-struct kmem_cache_node
value in the first place. You could argue that a machine would have
such node size asymmetries that it should be specified on a per-node
basis, but we know nobody is doing that right now since it's a purely
static value at the moment and there's no convenient way to tune that
via slub's sysfs interface.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-02-23 04:40:09 +03:00
if ( err )
return err ;
2009-02-25 10:16:35 +03:00
set_min_partial ( s , min ) ;
slub: add min_partial sysfs tunable
Now that a cache's min_partial has been moved to struct kmem_cache, it's
possible to easily tune it from userspace by adding a sysfs attribute.
It may not be desirable to keep a large number of partial slabs around
if a cache is used infrequently and memory, especially when constrained
by a cgroup, is scarce. It's better to allow userspace to set the
minimum policy per cache instead of relying explicitly on
kmem_cache_shrink().
The memory savings from simply moving min_partial from struct
kmem_cache_node to struct kmem_cache is obviously not significant
(unless maybe you're from SGI or something), at the largest it's
# allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)
The true savings occurs when userspace reduces the number of partial
slabs that would otherwise be wasted, especially on machines with a
large number of nodes (ia64 with CONFIG_NODES_SHIFT at 10 for default?).
As well as the kernel estimates ideal values for n->min_partial and
ensures it's within a sane range, userspace has no other input other
than writing to /sys/kernel/slab/cache/shrink.
There simply isn't any better heuristic to add when calculating the
partial values for a better estimate that works for all possible caches.
And since it's currently a static value, the user really has no way of
reclaiming that wasted space, which can be significant when constrained
by a cgroup (either cpusets or, later, memory controller slab limits)
without shrinking it entirely.
This also allows the user to specify that increased fragmentation and
more partial slabs are actually desired to avoid the cost of allocating
new slabs at runtime for specific caches.
There's also no reason why this should be a per-struct kmem_cache_node
value in the first place. You could argue that a machine would have
such node size asymmetries that it should be specified on a per-node
basis, but we know nobody is doing that right now since it's a purely
static value at the moment and there's no convenient way to tune that
via slub's sysfs interface.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-02-23 04:40:09 +03:00
return length ;
}
SLAB_ATTR ( min_partial ) ;
2011-08-10 01:12:27 +04:00
static ssize_t cpu_partial_show ( struct kmem_cache * s , char * buf )
{
2017-07-07 01:36:34 +03:00
return sprintf ( buf , " %u \n " , slub_cpu_partial ( s ) ) ;
2011-08-10 01:12:27 +04:00
}
static ssize_t cpu_partial_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
2018-04-06 02:21:10 +03:00
unsigned int objects ;
2011-08-10 01:12:27 +04:00
int err ;
2018-04-06 02:21:10 +03:00
err = kstrtouint ( buf , 10 , & objects ) ;
2011-08-10 01:12:27 +04:00
if ( err )
return err ;
2013-06-19 09:05:52 +04:00
if ( objects & & ! kmem_cache_has_cpu_partial ( s ) )
2012-01-10 01:19:45 +04:00
return - EINVAL ;
2011-08-10 01:12:27 +04:00
2017-07-07 01:36:34 +03:00
slub_set_cpu_partial ( s , objects ) ;
2011-08-10 01:12:27 +04:00
flush_all ( s ) ;
return length ;
}
SLAB_ATTR ( cpu_partial ) ;
2007-05-07 01:49:36 +04:00
static ssize_t ctor_show ( struct kmem_cache * s , char * buf )
{
2011-01-14 02:45:52 +03:00
if ( ! s - > ctor )
return 0 ;
return sprintf ( buf , " %pS \n " , s - > ctor ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( ctor ) ;
static ssize_t aliases_show ( struct kmem_cache * s , char * buf )
{
2014-08-07 03:04:51 +04:00
return sprintf ( buf , " %d \n " , s - > refcount < 0 ? 0 : s - > refcount - 1 ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( aliases ) ;
static ssize_t partial_show ( struct kmem_cache * s , char * buf )
{
2008-02-16 02:22:21 +03:00
return show_slab_objects ( s , buf , SO_PARTIAL ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( partial ) ;
static ssize_t cpu_slabs_show ( struct kmem_cache * s , char * buf )
{
2008-02-16 02:22:21 +03:00
return show_slab_objects ( s , buf , SO_CPU ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( cpu_slabs ) ;
static ssize_t objects_show ( struct kmem_cache * s , char * buf )
{
2008-04-14 20:11:40 +04:00
return show_slab_objects ( s , buf , SO_ALL | SO_OBJECTS ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( objects ) ;
2008-04-14 20:11:40 +04:00
static ssize_t objects_partial_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_PARTIAL | SO_OBJECTS ) ;
}
SLAB_ATTR_RO ( objects_partial ) ;
2011-08-10 01:12:27 +04:00
static ssize_t slabs_cpu_partial_show ( struct kmem_cache * s , char * buf )
{
int objects = 0 ;
int pages = 0 ;
int cpu ;
int len ;
for_each_online_cpu ( cpu ) {
2017-07-07 01:36:31 +03:00
struct page * page ;
page = slub_percpu_partial ( per_cpu_ptr ( s - > cpu_slab , cpu ) ) ;
2011-08-10 01:12:27 +04:00
if ( page ) {
pages + = page - > pages ;
objects + = page - > pobjects ;
}
}
len = sprintf ( buf , " %d(%d) " , objects , pages ) ;
# ifdef CONFIG_SMP
for_each_online_cpu ( cpu ) {
2017-07-07 01:36:31 +03:00
struct page * page ;
page = slub_percpu_partial ( per_cpu_ptr ( s - > cpu_slab , cpu ) ) ;
2011-08-10 01:12:27 +04:00
if ( page & & len < PAGE_SIZE - 20 )
len + = sprintf ( buf + len , " C%d=%d(%d) " , cpu ,
page - > pobjects , page - > pages ) ;
}
# endif
return len + sprintf ( buf + len , " \n " ) ;
}
SLAB_ATTR_RO ( slabs_cpu_partial ) ;
2010-10-05 22:57:27 +04:00
static ssize_t reclaim_account_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_RECLAIM_ACCOUNT ) ) ;
}
2020-08-07 09:18:48 +03:00
SLAB_ATTR_RO ( reclaim_account ) ;
2010-10-05 22:57:27 +04:00
static ssize_t hwcache_align_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_HWCACHE_ALIGN ) ) ;
}
SLAB_ATTR_RO ( hwcache_align ) ;
# ifdef CONFIG_ZONE_DMA
static ssize_t cache_dma_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_CACHE_DMA ) ) ;
}
SLAB_ATTR_RO ( cache_dma ) ;
# endif
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 05:50:28 +03:00
static ssize_t usersize_show ( struct kmem_cache * s , char * buf )
{
2018-04-06 02:21:31 +03:00
return sprintf ( buf , " %u \n " , s - > usersize ) ;
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 05:50:28 +03:00
}
SLAB_ATTR_RO ( usersize ) ;
2010-10-05 22:57:27 +04:00
static ssize_t destroy_by_rcu_show ( struct kmem_cache * s , char * buf )
{
2017-01-18 13:53:44 +03:00
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_TYPESAFE_BY_RCU ) ) ;
2010-10-05 22:57:27 +04:00
}
SLAB_ATTR_RO ( destroy_by_rcu ) ;
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-10-05 22:57:27 +04:00
static ssize_t slabs_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_ALL ) ;
}
SLAB_ATTR_RO ( slabs ) ;
2008-04-14 20:11:40 +04:00
static ssize_t total_objects_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_ALL | SO_TOTAL ) ;
}
SLAB_ATTR_RO ( total_objects ) ;
2007-05-07 01:49:36 +04:00
static ssize_t sanity_checks_show ( struct kmem_cache * s , char * buf )
{
2016-03-16 00:55:06 +03:00
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_CONSISTENCY_CHECKS ) ) ;
2007-05-07 01:49:36 +04:00
}
2020-08-07 09:18:45 +03:00
SLAB_ATTR_RO ( sanity_checks ) ;
2007-05-07 01:49:36 +04:00
static ssize_t trace_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_TRACE ) ) ;
}
2020-08-07 09:18:45 +03:00
SLAB_ATTR_RO ( trace ) ;
2007-05-07 01:49:36 +04:00
static ssize_t red_zone_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_RED_ZONE ) ) ;
}
mm, slub: make some slub_debug related attributes read-only
SLUB_DEBUG creates several files under /sys/kernel/slab/<cache>/ that can
be read to check if the respective debugging options are enabled for given
cache. The options can be also toggled at runtime by writing into the
files. Some of those, namely red_zone, poison, and store_user can be
toggled only when no objects yet exist in the cache.
Vijayanand reports [1] that there is a problem with freelist randomization
if changing the debugging option's state results in different number of
objects per page, and the random sequence cache needs thus needs to be
recomputed.
However, another problem is that the check for "no objects yet exist in
the cache" is racy, as noted by Jann [2] and fixing that would add
overhead or otherwise complicate the allocation/freeing paths. Thus it
would be much simpler just to remove the runtime toggling support. The
documentation describes it's "In case you forgot to enable debugging on
the kernel command line", but the neccessity of having no objects limits
its usefulness anyway for many caches.
Vijayanand describes an use case [3] where debugging is enabled for all
but zram caches for memory overhead reasons, and using the runtime toggles
was the only way to achieve such configuration. After the previous patch
it's now possible to do that directly from the kernel boot option, so we
can remove the dangerous runtime toggles by making the /sys attribute
files read-only.
While updating it, also improve the documentation of the debugging /sys files.
[1] https://lkml.kernel.org/r/1580379523-32272-1-git-send-email-vjitta@codeaurora.org
[2] https://lore.kernel.org/r/CAG48ez31PP--h6_FzVyfJ4H86QYczAFPdxtJHUEEan+7VJETAQ@mail.gmail.com
[3] https://lore.kernel.org/r/1383cd32-1ddc-4dac-b5f8-9c42282fa81c@codeaurora.org
Reported-by: Vijayanand Jitta <vjitta@codeaurora.org>
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/20200610163135.17364-3-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 09:18:38 +03:00
SLAB_ATTR_RO ( red_zone ) ;
2007-05-07 01:49:36 +04:00
static ssize_t poison_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_POISON ) ) ;
}
mm, slub: make some slub_debug related attributes read-only
SLUB_DEBUG creates several files under /sys/kernel/slab/<cache>/ that can
be read to check if the respective debugging options are enabled for given
cache. The options can be also toggled at runtime by writing into the
files. Some of those, namely red_zone, poison, and store_user can be
toggled only when no objects yet exist in the cache.
Vijayanand reports [1] that there is a problem with freelist randomization
if changing the debugging option's state results in different number of
objects per page, and the random sequence cache needs thus needs to be
recomputed.
However, another problem is that the check for "no objects yet exist in
the cache" is racy, as noted by Jann [2] and fixing that would add
overhead or otherwise complicate the allocation/freeing paths. Thus it
would be much simpler just to remove the runtime toggling support. The
documentation describes it's "In case you forgot to enable debugging on
the kernel command line", but the neccessity of having no objects limits
its usefulness anyway for many caches.
Vijayanand describes an use case [3] where debugging is enabled for all
but zram caches for memory overhead reasons, and using the runtime toggles
was the only way to achieve such configuration. After the previous patch
it's now possible to do that directly from the kernel boot option, so we
can remove the dangerous runtime toggles by making the /sys attribute
files read-only.
While updating it, also improve the documentation of the debugging /sys files.
[1] https://lkml.kernel.org/r/1580379523-32272-1-git-send-email-vjitta@codeaurora.org
[2] https://lore.kernel.org/r/CAG48ez31PP--h6_FzVyfJ4H86QYczAFPdxtJHUEEan+7VJETAQ@mail.gmail.com
[3] https://lore.kernel.org/r/1383cd32-1ddc-4dac-b5f8-9c42282fa81c@codeaurora.org
Reported-by: Vijayanand Jitta <vjitta@codeaurora.org>
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/20200610163135.17364-3-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 09:18:38 +03:00
SLAB_ATTR_RO ( poison ) ;
2007-05-07 01:49:36 +04:00
static ssize_t store_user_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_STORE_USER ) ) ;
}
mm, slub: make some slub_debug related attributes read-only
SLUB_DEBUG creates several files under /sys/kernel/slab/<cache>/ that can
be read to check if the respective debugging options are enabled for given
cache. The options can be also toggled at runtime by writing into the
files. Some of those, namely red_zone, poison, and store_user can be
toggled only when no objects yet exist in the cache.
Vijayanand reports [1] that there is a problem with freelist randomization
if changing the debugging option's state results in different number of
objects per page, and the random sequence cache needs thus needs to be
recomputed.
However, another problem is that the check for "no objects yet exist in
the cache" is racy, as noted by Jann [2] and fixing that would add
overhead or otherwise complicate the allocation/freeing paths. Thus it
would be much simpler just to remove the runtime toggling support. The
documentation describes it's "In case you forgot to enable debugging on
the kernel command line", but the neccessity of having no objects limits
its usefulness anyway for many caches.
Vijayanand describes an use case [3] where debugging is enabled for all
but zram caches for memory overhead reasons, and using the runtime toggles
was the only way to achieve such configuration. After the previous patch
it's now possible to do that directly from the kernel boot option, so we
can remove the dangerous runtime toggles by making the /sys attribute
files read-only.
While updating it, also improve the documentation of the debugging /sys files.
[1] https://lkml.kernel.org/r/1580379523-32272-1-git-send-email-vjitta@codeaurora.org
[2] https://lore.kernel.org/r/CAG48ez31PP--h6_FzVyfJ4H86QYczAFPdxtJHUEEan+7VJETAQ@mail.gmail.com
[3] https://lore.kernel.org/r/1383cd32-1ddc-4dac-b5f8-9c42282fa81c@codeaurora.org
Reported-by: Vijayanand Jitta <vjitta@codeaurora.org>
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/20200610163135.17364-3-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 09:18:38 +03:00
SLAB_ATTR_RO ( store_user ) ;
2007-05-07 01:49:36 +04:00
2007-05-07 01:49:43 +04:00
static ssize_t validate_show ( struct kmem_cache * s , char * buf )
{
return 0 ;
}
static ssize_t validate_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
2007-07-17 15:03:30 +04:00
int ret = - EINVAL ;
if ( buf [ 0 ] = = ' 1 ' ) {
ret = validate_slab_cache ( s ) ;
if ( ret > = 0 )
ret = length ;
}
return ret ;
2007-05-07 01:49:43 +04:00
}
SLAB_ATTR ( validate ) ;
2010-10-05 22:57:27 +04:00
static ssize_t alloc_calls_show ( struct kmem_cache * s , char * buf )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return - ENOSYS ;
return list_locations ( s , buf , TRACK_ALLOC ) ;
}
SLAB_ATTR_RO ( alloc_calls ) ;
static ssize_t free_calls_show ( struct kmem_cache * s , char * buf )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return - ENOSYS ;
return list_locations ( s , buf , TRACK_FREE ) ;
}
SLAB_ATTR_RO ( free_calls ) ;
# endif /* CONFIG_SLUB_DEBUG */
# ifdef CONFIG_FAILSLAB
static ssize_t failslab_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_FAILSLAB ) ) ;
}
2020-08-07 09:18:45 +03:00
SLAB_ATTR_RO ( failslab ) ;
2010-10-05 22:57:26 +04:00
# endif
2007-05-07 01:49:43 +04:00
2007-05-07 01:49:46 +04:00
static ssize_t shrink_show ( struct kmem_cache * s , char * buf )
{
return 0 ;
}
static ssize_t shrink_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
2015-02-13 01:59:41 +03:00
if ( buf [ 0 ] = = ' 1 ' )
2020-08-07 09:21:27 +03:00
kmem_cache_shrink ( s ) ;
2015-02-13 01:59:41 +03:00
else
2007-05-07 01:49:46 +04:00
return - EINVAL ;
return length ;
}
SLAB_ATTR ( shrink ) ;
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_NUMA
2008-01-08 10:20:26 +03:00
static ssize_t remote_node_defrag_ratio_show ( struct kmem_cache * s , char * buf )
2007-05-07 01:49:36 +04:00
{
2018-04-06 02:20:48 +03:00
return sprintf ( buf , " %u \n " , s - > remote_node_defrag_ratio / 10 ) ;
2007-05-07 01:49:36 +04:00
}
2008-01-08 10:20:26 +03:00
static ssize_t remote_node_defrag_ratio_store ( struct kmem_cache * s ,
2007-05-07 01:49:36 +04:00
const char * buf , size_t length )
{
2018-04-06 02:20:48 +03:00
unsigned int ratio ;
2008-04-30 03:11:12 +04:00
int err ;
2018-04-06 02:20:48 +03:00
err = kstrtouint ( buf , 10 , & ratio ) ;
2008-04-30 03:11:12 +04:00
if ( err )
return err ;
2018-04-06 02:20:48 +03:00
if ( ratio > 100 )
return - ERANGE ;
2008-04-30 03:11:12 +04:00
2018-04-06 02:20:48 +03:00
s - > remote_node_defrag_ratio = ratio * 10 ;
2007-05-07 01:49:36 +04:00
return length ;
}
2008-01-08 10:20:26 +03:00
SLAB_ATTR ( remote_node_defrag_ratio ) ;
2007-05-07 01:49:36 +04:00
# endif
2008-02-08 04:47:41 +03:00
# ifdef CONFIG_SLUB_STATS
static int show_stat ( struct kmem_cache * s , char * buf , enum stat_item si )
{
unsigned long sum = 0 ;
int cpu ;
int len ;
treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:
kmalloc(a * b, gfp)
with:
kmalloc_array(a * b, gfp)
as well as handling cases of:
kmalloc(a * b * c, gfp)
with:
kmalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kmalloc_array(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kmalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kmalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kmalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kmalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kmalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kmalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kmalloc
+ kmalloc_array
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kmalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kmalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kmalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kmalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kmalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kmalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kmalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kmalloc(sizeof(THING) * C2, ...)
|
kmalloc(sizeof(TYPE) * C2, ...)
|
kmalloc(C1 * C2 * C3, ...)
|
kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * E2
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kmalloc
+ kmalloc_array
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 23:55:00 +03:00
int * data = kmalloc_array ( nr_cpu_ids , sizeof ( int ) , GFP_KERNEL ) ;
2008-02-08 04:47:41 +03:00
if ( ! data )
return - ENOMEM ;
for_each_online_cpu ( cpu ) {
2009-12-19 01:26:20 +03:00
unsigned x = per_cpu_ptr ( s - > cpu_slab , cpu ) - > stat [ si ] ;
2008-02-08 04:47:41 +03:00
data [ cpu ] = x ;
sum + = x ;
}
len = sprintf ( buf , " %lu " , sum ) ;
2008-04-14 19:52:05 +04:00
# ifdef CONFIG_SMP
2008-02-08 04:47:41 +03:00
for_each_online_cpu ( cpu ) {
if ( data [ cpu ] & & len < PAGE_SIZE - 20 )
2008-04-14 19:52:05 +04:00
len + = sprintf ( buf + len , " C%d=%u " , cpu , data [ cpu ] ) ;
2008-02-08 04:47:41 +03:00
}
2008-04-14 19:52:05 +04:00
# endif
2008-02-08 04:47:41 +03:00
kfree ( data ) ;
return len + sprintf ( buf + len , " \n " ) ;
}
2009-10-15 13:20:22 +04:00
static void clear_stat ( struct kmem_cache * s , enum stat_item si )
{
int cpu ;
for_each_online_cpu ( cpu )
2009-12-19 01:26:20 +03:00
per_cpu_ptr ( s - > cpu_slab , cpu ) - > stat [ si ] = 0 ;
2009-10-15 13:20:22 +04:00
}
2008-02-08 04:47:41 +03:00
# define STAT_ATTR(si, text) \
static ssize_t text # # _show ( struct kmem_cache * s , char * buf ) \
{ \
return show_stat ( s , buf , si ) ; \
} \
2009-10-15 13:20:22 +04:00
static ssize_t text # # _store ( struct kmem_cache * s , \
const char * buf , size_t length ) \
{ \
if ( buf [ 0 ] ! = ' 0 ' ) \
return - EINVAL ; \
clear_stat ( s , si ) ; \
return length ; \
} \
SLAB_ATTR ( text ) ; \
2008-02-08 04:47:41 +03:00
STAT_ATTR ( ALLOC_FASTPATH , alloc_fastpath ) ;
STAT_ATTR ( ALLOC_SLOWPATH , alloc_slowpath ) ;
STAT_ATTR ( FREE_FASTPATH , free_fastpath ) ;
STAT_ATTR ( FREE_SLOWPATH , free_slowpath ) ;
STAT_ATTR ( FREE_FROZEN , free_frozen ) ;
STAT_ATTR ( FREE_ADD_PARTIAL , free_add_partial ) ;
STAT_ATTR ( FREE_REMOVE_PARTIAL , free_remove_partial ) ;
STAT_ATTR ( ALLOC_FROM_PARTIAL , alloc_from_partial ) ;
STAT_ATTR ( ALLOC_SLAB , alloc_slab ) ;
STAT_ATTR ( ALLOC_REFILL , alloc_refill ) ;
2011-06-01 21:25:57 +04:00
STAT_ATTR ( ALLOC_NODE_MISMATCH , alloc_node_mismatch ) ;
2008-02-08 04:47:41 +03:00
STAT_ATTR ( FREE_SLAB , free_slab ) ;
STAT_ATTR ( CPUSLAB_FLUSH , cpuslab_flush ) ;
STAT_ATTR ( DEACTIVATE_FULL , deactivate_full ) ;
STAT_ATTR ( DEACTIVATE_EMPTY , deactivate_empty ) ;
STAT_ATTR ( DEACTIVATE_TO_HEAD , deactivate_to_head ) ;
STAT_ATTR ( DEACTIVATE_TO_TAIL , deactivate_to_tail ) ;
STAT_ATTR ( DEACTIVATE_REMOTE_FREES , deactivate_remote_frees ) ;
2011-06-01 21:25:58 +04:00
STAT_ATTR ( DEACTIVATE_BYPASS , deactivate_bypass ) ;
2008-04-14 20:11:40 +04:00
STAT_ATTR ( ORDER_FALLBACK , order_fallback ) ;
2011-06-01 21:25:49 +04:00
STAT_ATTR ( CMPXCHG_DOUBLE_CPU_FAIL , cmpxchg_double_cpu_fail ) ;
STAT_ATTR ( CMPXCHG_DOUBLE_FAIL , cmpxchg_double_fail ) ;
2011-08-10 01:12:27 +04:00
STAT_ATTR ( CPU_PARTIAL_ALLOC , cpu_partial_alloc ) ;
STAT_ATTR ( CPU_PARTIAL_FREE , cpu_partial_free ) ;
2012-02-03 19:34:56 +04:00
STAT_ATTR ( CPU_PARTIAL_NODE , cpu_partial_node ) ;
STAT_ATTR ( CPU_PARTIAL_DRAIN , cpu_partial_drain ) ;
2019-05-14 03:16:09 +03:00
# endif /* CONFIG_SLUB_STATS */
2008-02-08 04:47:41 +03:00
2008-01-08 10:20:27 +03:00
static struct attribute * slab_attrs [ ] = {
2007-05-07 01:49:36 +04:00
& slab_size_attr . attr ,
& object_size_attr . attr ,
& objs_per_slab_attr . attr ,
& order_attr . attr ,
slub: add min_partial sysfs tunable
Now that a cache's min_partial has been moved to struct kmem_cache, it's
possible to easily tune it from userspace by adding a sysfs attribute.
It may not be desirable to keep a large number of partial slabs around
if a cache is used infrequently and memory, especially when constrained
by a cgroup, is scarce. It's better to allow userspace to set the
minimum policy per cache instead of relying explicitly on
kmem_cache_shrink().
The memory savings from simply moving min_partial from struct
kmem_cache_node to struct kmem_cache is obviously not significant
(unless maybe you're from SGI or something), at the largest it's
# allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)
The true savings occurs when userspace reduces the number of partial
slabs that would otherwise be wasted, especially on machines with a
large number of nodes (ia64 with CONFIG_NODES_SHIFT at 10 for default?).
As well as the kernel estimates ideal values for n->min_partial and
ensures it's within a sane range, userspace has no other input other
than writing to /sys/kernel/slab/cache/shrink.
There simply isn't any better heuristic to add when calculating the
partial values for a better estimate that works for all possible caches.
And since it's currently a static value, the user really has no way of
reclaiming that wasted space, which can be significant when constrained
by a cgroup (either cpusets or, later, memory controller slab limits)
without shrinking it entirely.
This also allows the user to specify that increased fragmentation and
more partial slabs are actually desired to avoid the cost of allocating
new slabs at runtime for specific caches.
There's also no reason why this should be a per-struct kmem_cache_node
value in the first place. You could argue that a machine would have
such node size asymmetries that it should be specified on a per-node
basis, but we know nobody is doing that right now since it's a purely
static value at the moment and there's no convenient way to tune that
via slub's sysfs interface.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-02-23 04:40:09 +03:00
& min_partial_attr . attr ,
2011-08-10 01:12:27 +04:00
& cpu_partial_attr . attr ,
2007-05-07 01:49:36 +04:00
& objects_attr . attr ,
2008-04-14 20:11:40 +04:00
& objects_partial_attr . attr ,
2007-05-07 01:49:36 +04:00
& partial_attr . attr ,
& cpu_slabs_attr . attr ,
& ctor_attr . attr ,
& aliases_attr . attr ,
& align_attr . attr ,
& hwcache_align_attr . attr ,
& reclaim_account_attr . attr ,
& destroy_by_rcu_attr . attr ,
2010-10-05 22:57:27 +04:00
& shrink_attr . attr ,
2011-08-10 01:12:27 +04:00
& slabs_cpu_partial_attr . attr ,
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-10-05 22:57:27 +04:00
& total_objects_attr . attr ,
& slabs_attr . attr ,
& sanity_checks_attr . attr ,
& trace_attr . attr ,
2007-05-07 01:49:36 +04:00
& red_zone_attr . attr ,
& poison_attr . attr ,
& store_user_attr . attr ,
2007-05-07 01:49:43 +04:00
& validate_attr . attr ,
2007-05-07 01:49:45 +04:00
& alloc_calls_attr . attr ,
& free_calls_attr . attr ,
2010-10-05 22:57:26 +04:00
# endif
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_ZONE_DMA
& cache_dma_attr . attr ,
# endif
# ifdef CONFIG_NUMA
2008-01-08 10:20:26 +03:00
& remote_node_defrag_ratio_attr . attr ,
2008-02-08 04:47:41 +03:00
# endif
# ifdef CONFIG_SLUB_STATS
& alloc_fastpath_attr . attr ,
& alloc_slowpath_attr . attr ,
& free_fastpath_attr . attr ,
& free_slowpath_attr . attr ,
& free_frozen_attr . attr ,
& free_add_partial_attr . attr ,
& free_remove_partial_attr . attr ,
& alloc_from_partial_attr . attr ,
& alloc_slab_attr . attr ,
& alloc_refill_attr . attr ,
2011-06-01 21:25:57 +04:00
& alloc_node_mismatch_attr . attr ,
2008-02-08 04:47:41 +03:00
& free_slab_attr . attr ,
& cpuslab_flush_attr . attr ,
& deactivate_full_attr . attr ,
& deactivate_empty_attr . attr ,
& deactivate_to_head_attr . attr ,
& deactivate_to_tail_attr . attr ,
& deactivate_remote_frees_attr . attr ,
2011-06-01 21:25:58 +04:00
& deactivate_bypass_attr . attr ,
2008-04-14 20:11:40 +04:00
& order_fallback_attr . attr ,
2011-06-01 21:25:49 +04:00
& cmpxchg_double_fail_attr . attr ,
& cmpxchg_double_cpu_fail_attr . attr ,
2011-08-10 01:12:27 +04:00
& cpu_partial_alloc_attr . attr ,
& cpu_partial_free_attr . attr ,
2012-02-03 19:34:56 +04:00
& cpu_partial_node_attr . attr ,
& cpu_partial_drain_attr . attr ,
2007-05-07 01:49:36 +04:00
# endif
2010-02-26 09:36:12 +03:00
# ifdef CONFIG_FAILSLAB
& failslab_attr . attr ,
# endif
usercopy: Prepare for usercopy whitelisting
This patch prepares the slab allocator to handle caches having annotations
(useroffset and usersize) defining usercopy regions.
This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on
my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code.
Currently, hardened usercopy performs dynamic bounds checking on slab
cache objects. This is good, but still leaves a lot of kernel memory
available to be copied to/from userspace in the face of bugs. To further
restrict what memory is available for copying, this creates a way to
whitelist specific areas of a given slab cache object for copying to/from
userspace, allowing much finer granularity of access control. Slab caches
that are never exposed to userspace can declare no whitelist for their
objects, thereby keeping them unavailable to userspace via dynamic copy
operations. (Note, an implicit form of whitelisting is the use of constant
sizes in usercopy operations and get_user()/put_user(); these bypass
hardened usercopy checks since these sizes cannot change at runtime.)
To support this whitelist annotation, usercopy region offset and size
members are added to struct kmem_cache. The slab allocator receives a
new function, kmem_cache_create_usercopy(), that creates a new cache
with a usercopy region defined, suitable for declaring spans of fields
within the objects that get copied to/from userspace.
In this patch, the default kmem_cache_create() marks the entire allocation
as whitelisted, leaving it semantically unchanged. Once all fine-grained
whitelists have been added (in subsequent patches), this will be changed
to a usersize of 0, making caches created with kmem_cache_create() not
copyable to/from userspace.
After the entire usercopy whitelist series is applied, less than 15%
of the slab cache memory remains exposed to potential usercopy bugs
after a fresh boot:
Total Slab Memory: 48074720
Usercopyable Memory: 6367532 13.2%
task_struct 0.2% 4480/1630720
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 269760/8740224
dentry 11.1% 585984/5273856
mm_struct 29.1% 54912/188448
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 81920/81920
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 167936/167936
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 455616/455616
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 812032/812032
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1310720/1310720
After some kernel build workloads, the percentage (mainly driven by
dentry and inode caches expanding) drops under 10%:
Total Slab Memory: 95516184
Usercopyable Memory: 8497452 8.8%
task_struct 0.2% 4000/1456000
RAW 0.3% 300/96000
RAWv6 2.1% 1408/64768
ext4_inode_cache 3.0% 1217280/39439872
dentry 11.1% 1623200/14608800
mm_struct 29.1% 73216/251264
kmalloc-8 100.0% 24576/24576
kmalloc-16 100.0% 28672/28672
kmalloc-32 100.0% 94208/94208
kmalloc-192 100.0% 96768/96768
kmalloc-128 100.0% 143360/143360
names_cache 100.0% 163840/163840
kmalloc-64 100.0% 245760/245760
kmalloc-256 100.0% 339968/339968
kmalloc-512 100.0% 350720/350720
kmalloc-96 100.0% 563520/563520
kmalloc-8192 100.0% 655360/655360
kmalloc-1024 100.0% 794624/794624
kmalloc-4096 100.0% 819200/819200
kmalloc-2048 100.0% 1257472/1257472
Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, split out a few extra kmalloc hunks]
[kees: add field names to function declarations]
[kees: convert BUGs to WARNs and fail closed]
[kees: add attack surface reduction analysis to commit log]
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christoph Lameter <cl@linux.com>
2017-06-11 05:50:28 +03:00
& usersize_attr . attr ,
2010-02-26 09:36:12 +03:00
2007-05-07 01:49:36 +04:00
NULL
} ;
2017-09-07 02:21:56 +03:00
static const struct attribute_group slab_attr_group = {
2007-05-07 01:49:36 +04:00
. attrs = slab_attrs ,
} ;
static ssize_t slab_attr_show ( struct kobject * kobj ,
struct attribute * attr ,
char * buf )
{
struct slab_attribute * attribute ;
struct kmem_cache * s ;
int err ;
attribute = to_slab_attr ( attr ) ;
s = to_slab ( kobj ) ;
if ( ! attribute - > show )
return - EIO ;
err = attribute - > show ( s , buf ) ;
return err ;
}
static ssize_t slab_attr_store ( struct kobject * kobj ,
struct attribute * attr ,
const char * buf , size_t len )
{
struct slab_attribute * attribute ;
struct kmem_cache * s ;
int err ;
attribute = to_slab_attr ( attr ) ;
s = to_slab ( kobj ) ;
if ( ! attribute - > store )
return - EIO ;
err = attribute - > store ( s , buf , len ) ;
return err ;
}
2014-05-06 23:50:08 +04:00
static void kmem_cache_release ( struct kobject * k )
{
slab_kmem_cache_release ( to_slab ( k ) ) ;
}
2010-01-19 04:58:23 +03:00
static const struct sysfs_ops slab_sysfs_ops = {
2007-05-07 01:49:36 +04:00
. show = slab_attr_show ,
. store = slab_attr_store ,
} ;
static struct kobj_type slab_ktype = {
. sysfs_ops = & slab_sysfs_ops ,
2014-05-06 23:50:08 +04:00
. release = kmem_cache_release ,
2007-05-07 01:49:36 +04:00
} ;
2007-11-01 18:29:06 +03:00
static struct kset * slab_kset ;
2007-05-07 01:49:36 +04:00
2014-04-08 02:39:31 +04:00
static inline struct kset * cache_kset ( struct kmem_cache * s )
{
return slab_kset ;
}
2007-05-07 01:49:36 +04:00
# define ID_STR_LENGTH 64
/* Create a unique string id for a slab cache:
2008-02-16 10:45:26 +03:00
*
* Format : [ flags - ] size
2007-05-07 01:49:36 +04:00
*/
static char * create_unique_id ( struct kmem_cache * s )
{
char * name = kmalloc ( ID_STR_LENGTH , GFP_KERNEL ) ;
char * p = name ;
BUG_ON ( ! name ) ;
* p + + = ' : ' ;
/*
* First flags affecting slabcache operations . We will only
* get here for aliasable slabs so we do not need to support
* too many flags . The flags here must cover all flags that
* are matched during merging to guarantee that the id is
* unique .
*/
if ( s - > flags & SLAB_CACHE_DMA )
* p + + = ' d ' ;
mm: add support for kmem caches in DMA32 zone
Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
v6.
This is a followup to the discussion in [1], [2].
IOMMUs using ARMv7 short-descriptor format require page tables (level 1
and 2) to be allocated within the first 4GB of RAM, even on 64-bit
systems.
For L1 tables that are bigger than a page, we can just use
__get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
use GFP_DMA).
For L2 tables that only take 1KB, it would be a waste to allocate a full
page, so we considered 3 approaches:
1. This series, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2 page
tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable to reuse
freed fragments until the whole page is freed. [3]
This series is the most memory-efficient approach.
stable@ note:
We confirmed that this is a regression, and IOMMU errors happen on 4.19
and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
platforms (and maybe others?).
[1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
[2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
[3] https://patchwork.codeaurora.org/patch/671639/
This patch (of 3):
IOMMUs using ARMv7 short-descriptor format require page tables to be
allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
this is done by passing GFP_DMA32 flag to memory allocation functions.
For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
a full page using get_free_pages, so we considered 3 approaches:
1. This patch, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2
page tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable
to reuse freed fragments until the whole page is freed.
This change makes it possible to create a custom cache in DMA32 zone using
kmem_cache_create, then allocate memory using kmem_cache_alloc.
We do not create a DMA32 kmalloc cache array, as there are currently no
users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.
This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
unnecessary).
Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Sasha Levin <Alexander.Levin@microsoft.com>
Cc: Huaisheng Ye <yehs1@lenovo.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Yong Wu <yong.wu@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Tomasz Figa <tfiga@google.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 06:43:42 +03:00
if ( s - > flags & SLAB_CACHE_DMA32 )
* p + + = ' D ' ;
2007-05-07 01:49:36 +04:00
if ( s - > flags & SLAB_RECLAIM_ACCOUNT )
* p + + = ' a ' ;
2016-03-16 00:55:06 +03:00
if ( s - > flags & SLAB_CONSISTENCY_CHECKS )
2007-05-07 01:49:36 +04:00
* p + + = ' F ' ;
2016-01-15 02:18:15 +03:00
if ( s - > flags & SLAB_ACCOUNT )
* p + + = ' A ' ;
2007-05-07 01:49:36 +04:00
if ( p ! = name + 1 )
* p + + = ' - ' ;
2018-04-06 02:21:20 +03:00
p + = sprintf ( p , " %07u " , s - > size ) ;
2012-12-19 02:22:34 +04:00
2007-05-07 01:49:36 +04:00
BUG_ON ( p > name + ID_STR_LENGTH - 1 ) ;
return name ;
}
static int sysfs_slab_add ( struct kmem_cache * s )
{
int err ;
const char * name ;
2017-02-23 02:41:39 +03:00
struct kset * kset = cache_kset ( s ) ;
2012-11-28 20:23:07 +04:00
int unmergeable = slab_unmergeable ( s ) ;
2007-05-07 01:49:36 +04:00
2017-02-23 02:41:39 +03:00
if ( ! kset ) {
kobject_init ( & s - > kobj , & slab_ktype ) ;
return 0 ;
}
2017-11-16 04:32:25 +03:00
if ( ! unmergeable & & disable_higher_order_debug & &
( slub_debug & DEBUG_METADATA_FLAGS ) )
unmergeable = 1 ;
2007-05-07 01:49:36 +04:00
if ( unmergeable ) {
/*
* Slabcache can never be merged so we can use the name proper .
* This is typically the case for debug situations . In that
* case we can catch duplicate names easily .
*/
2007-11-01 18:29:06 +03:00
sysfs_remove_link ( & slab_kset - > kobj , s - > name ) ;
2007-05-07 01:49:36 +04:00
name = s - > name ;
} else {
/*
* Create a unique name for the slab as a target
* for the symlinks .
*/
name = create_unique_id ( s ) ;
}
2017-02-23 02:41:39 +03:00
s - > kobj . kset = kset ;
2014-01-04 11:32:31 +04:00
err = kobject_init_and_add ( & s - > kobj , & slab_ktype , NULL , " %s " , name ) ;
2020-06-04 01:56:21 +03:00
if ( err ) {
kobject_put ( & s - > kobj ) ;
2015-09-05 01:45:51 +03:00
goto out ;
2020-06-04 01:56:21 +03:00
}
2007-05-07 01:49:36 +04:00
err = sysfs_create_group ( & s - > kobj , & slab_attr_group ) ;
2014-04-08 02:39:32 +04:00
if ( err )
goto out_del_kobj ;
2014-04-08 02:39:31 +04:00
2007-05-07 01:49:36 +04:00
if ( ! unmergeable ) {
/* Setup first alias */
sysfs_slab_alias ( s , s - > name ) ;
}
2014-04-08 02:39:32 +04:00
out :
if ( ! unmergeable )
kfree ( name ) ;
return err ;
out_del_kobj :
kobject_del ( & s - > kobj ) ;
goto out ;
2007-05-07 01:49:36 +04:00
}
2018-06-28 09:26:09 +03:00
void sysfs_slab_unlink ( struct kmem_cache * s )
{
if ( slab_state > = FULL )
kobject_del ( & s - > kobj ) ;
}
2017-02-23 02:41:11 +03:00
void sysfs_slab_release ( struct kmem_cache * s )
{
if ( slab_state > = FULL )
kobject_put ( & s - > kobj ) ;
2007-05-07 01:49:36 +04:00
}
/*
* Need to buffer aliases during bootup until sysfs becomes
2008-12-05 06:08:08 +03:00
* available lest we lose that information .
2007-05-07 01:49:36 +04:00
*/
struct saved_alias {
struct kmem_cache * s ;
const char * name ;
struct saved_alias * next ;
} ;
2007-07-17 15:03:27 +04:00
static struct saved_alias * alias_list ;
2007-05-07 01:49:36 +04:00
static int sysfs_slab_alias ( struct kmem_cache * s , const char * name )
{
struct saved_alias * al ;
2012-07-07 00:25:11 +04:00
if ( slab_state = = FULL ) {
2007-05-07 01:49:36 +04:00
/*
* If we have a leftover link then remove it .
*/
2007-11-01 18:29:06 +03:00
sysfs_remove_link ( & slab_kset - > kobj , name ) ;
return sysfs_create_link ( & slab_kset - > kobj , & s - > kobj , name ) ;
2007-05-07 01:49:36 +04:00
}
al = kmalloc ( sizeof ( struct saved_alias ) , GFP_KERNEL ) ;
if ( ! al )
return - ENOMEM ;
al - > s = s ;
al - > name = name ;
al - > next = alias_list ;
alias_list = al ;
return 0 ;
}
static int __init slab_sysfs_init ( void )
{
2007-07-17 15:03:19 +04:00
struct kmem_cache * s ;
2007-05-07 01:49:36 +04:00
int err ;
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2010-07-19 20:39:11 +04:00
2020-06-02 07:45:50 +03:00
slab_kset = kset_create_and_add ( " slab " , NULL , kernel_kobj ) ;
2007-11-01 18:29:06 +03:00
if ( ! slab_kset ) {
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2014-06-05 03:06:34 +04:00
pr_err ( " Cannot register slab subsystem. \n " ) ;
2007-05-07 01:49:36 +04:00
return - ENOSYS ;
}
2012-07-07 00:25:11 +04:00
slab_state = FULL ;
2007-05-09 13:32:39 +04:00
2007-07-17 15:03:19 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
2007-05-09 13:32:39 +04:00
err = sysfs_slab_add ( s ) ;
2007-08-31 10:56:26 +04:00
if ( err )
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB: Unable to add boot slab %s to sysfs \n " ,
s - > name ) ;
2007-05-09 13:32:39 +04:00
}
2007-05-07 01:49:36 +04:00
while ( alias_list ) {
struct saved_alias * al = alias_list ;
alias_list = alias_list - > next ;
err = sysfs_slab_alias ( al - > s , al - > name ) ;
2007-08-31 10:56:26 +04:00
if ( err )
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB: Unable to add boot slab alias %s to sysfs \n " ,
al - > name ) ;
2007-05-07 01:49:36 +04:00
kfree ( al ) ;
}
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-05-07 01:49:36 +04:00
resiliency_test ( ) ;
return 0 ;
}
__initcall ( slab_sysfs_init ) ;
2010-10-05 22:57:26 +04:00
# endif /* CONFIG_SYSFS */
2008-01-01 19:23:28 +03:00
/*
* The / proc / slabinfo ABI
*/
2017-11-16 04:32:03 +03:00
# ifdef CONFIG_SLUB_DEBUG
2012-10-19 18:20:27 +04:00
void get_slabinfo ( struct kmem_cache * s , struct slabinfo * sinfo )
2008-01-01 19:23:28 +03:00
{
unsigned long nr_slabs = 0 ;
2008-04-14 20:11:40 +04:00
unsigned long nr_objs = 0 ;
unsigned long nr_free = 0 ;
2008-01-01 19:23:28 +03:00
int node ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2008-01-01 19:23:28 +03:00
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2013-07-04 04:33:26 +04:00
nr_slabs + = node_nr_slabs ( n ) ;
nr_objs + = node_nr_objs ( n ) ;
2008-04-14 20:11:40 +04:00
nr_free + = count_partial ( n , count_free ) ;
2008-01-01 19:23:28 +03:00
}
2012-10-19 18:20:27 +04:00
sinfo - > active_objs = nr_objs - nr_free ;
sinfo - > num_objs = nr_objs ;
sinfo - > active_slabs = nr_slabs ;
sinfo - > num_slabs = nr_slabs ;
sinfo - > objects_per_slab = oo_objects ( s - > oo ) ;
sinfo - > cache_order = oo_order ( s - > oo ) ;
2008-01-01 19:23:28 +03:00
}
2012-10-19 18:20:27 +04:00
void slabinfo_show_stats ( struct seq_file * m , struct kmem_cache * s )
2008-10-06 02:42:17 +04:00
{
}
2012-10-19 18:20:25 +04:00
ssize_t slabinfo_write ( struct file * file , const char __user * buffer ,
size_t count , loff_t * ppos )
2008-10-06 02:42:17 +04:00
{
2012-10-19 18:20:25 +04:00
return - EIO ;
2008-10-06 02:42:17 +04:00
}
2017-11-16 04:32:03 +03:00
# endif /* CONFIG_SLUB_DEBUG */