2007-05-07 01:49:36 +04:00
/*
 * SLUB: A slab allocator that limits cache line use instead of queuing
 * objects in per cpu and per node lists.
 *
2011-06-01 21:25:53 +04:00
 * The allocator synchronizes using per slab locks or atomic operations
 * and only uses a centralized lock to manage a pool of partial slabs.
2007-05-07 01:49:36 +04:00
*
2008-07-04 20:59:22 +04:00
 * (C) 2007 SGI, Christoph Lameter
2011-06-01 21:25:53 +04:00
 * (C) 2011 Linux Foundation, Christoph Lameter
2007-05-07 01:49:36 +04:00
*/
# include <linux/mm.h>
2009-05-05 13:13:44 +04:00
# include <linux/swap.h> /* struct reclaim_state */
2007-05-07 01:49:36 +04:00
# include <linux/module.h>
# include <linux/bit_spinlock.h>
# include <linux/interrupt.h>
# include <linux/bitops.h>
# include <linux/slab.h>
2012-07-07 00:25:11 +04:00
# include "slab.h"
2008-10-06 02:42:17 +04:00
# include <linux/proc_fs.h>
2013-04-30 02:08:06 +04:00
# include <linux/notifier.h>
2007-05-07 01:49:36 +04:00
# include <linux/seq_file.h>
2015-02-14 01:39:38 +03:00
# include <linux/kasan.h>
2008-04-04 02:54:48 +04:00
# include <linux/kmemcheck.h>
2007-05-07 01:49:36 +04:00
# include <linux/cpu.h>
# include <linux/cpuset.h>
# include <linux/mempolicy.h>
# include <linux/ctype.h>
2008-04-30 11:55:01 +04:00
# include <linux/debugobjects.h>
2007-05-07 01:49:36 +04:00
# include <linux/kallsyms.h>
2007-10-22 03:41:37 +04:00
# include <linux/memory.h>
2008-05-01 15:34:31 +04:00
# include <linux/math64.h>
2008-12-23 13:37:01 +03:00
# include <linux/fault-inject.h>
2011-07-07 23:47:01 +04:00
# include <linux/stacktrace.h>
2012-01-31 01:53:51 +04:00
# include <linux/prefetch.h>
2012-12-19 02:22:34 +04:00
# include <linux/memcontrol.h>
2007-05-07 01:49:36 +04:00
2010-10-21 13:29:19 +04:00
# include <trace/events/kmem.h>
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allow network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propagate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel ran to completion. However, the patched
kernel completed the test 45% faster.
MICRO
                                        3.5.0-rc2   3.5.0-rc2
                                          vanilla     swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds)            197.80      173.07
User+Sys Time Running Test (seconds)       206.96      182.03
Total Elapsed Time (seconds)              3240.70     1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
allocated from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to be able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:43:58 +04:00
# include "internal.h"
2007-05-07 01:49:36 +04:00
/*
 * Lock order:
2012-07-07 00:25:12 +04:00
 *   1. slab_mutex (Global Mutex)
2011-06-01 21:25:53 +04:00
 *   2. node->list_lock
 *   3. slab_lock(page) (Only on some arches and for debugging)
2007-05-07 01:49:36 +04:00
 *
2012-07-07 00:25:12 +04:00
 *   slab_mutex
2011-06-01 21:25:53 +04:00
 *
2012-07-07 00:25:12 +04:00
 *   The role of the slab_mutex is to protect the list of all the slabs
2011-06-01 21:25:53 +04:00
 *   and to synchronize major metadata changes to slab cache structures.
 *
 *   The slab_lock is only used for debugging and on arches that do not
 *   have the ability to do a cmpxchg_double. It only protects the second
 *   double word in the page struct. Meaning
 *	A. page->freelist	-> List of object free in a page
 *	B. page->counters	-> Counters of objects
 *	C. page->frozen		-> frozen state
 *
 *   If a slab is frozen then it is exempt from list management. It is not
 *   on any list. The processor that froze the slab is the one who can
 *   perform list operations on the page. Other processors may put objects
 *   onto the freelist but the processor that froze the slab is the only
 *   one that can retrieve the objects from the page's freelist.
2007-05-07 01:49:36 +04:00
*
 *   The list_lock protects the partial and full list on each node and
 *   the partial slab counter. If taken then no new slabs may be added or
 *   removed from the lists nor make the number of partial slabs be modified.
 *   (Note that the total number of slabs is an atomic value that may be
 *   modified without taking the list lock).
 *
 *   The list_lock is a centralized lock and thus we avoid taking it as
 *   much as possible. As long as SLUB does not have to handle partial
 *   slabs, operations can continue without any centralized lock. F.e.
 *   allocating a long series of objects that fill up slabs does not require
 *   the list lock.
 *   Interrupts are disabled during allocation and deallocation in order to
 *   make the slab allocator safe to use in the context of an irq. In addition
 *   interrupts are disabled to ensure that the processor does not change
 *   while handling per_cpu slabs, due to kernel preemption.
*
 * SLUB assigns one slab for allocation to each processor.
 * Allocations only occur from these slabs called cpu slabs.
 *
2007-05-09 13:32:39 +04:00
 * Slabs with free elements are kept on a partial list and during regular
 * operations no list for full slabs is used. If an object in a full slab is
2007-05-07 01:49:36 +04:00
 * freed then the slab will show up again on the partial lists.
2007-05-09 13:32:39 +04:00
 * We track full slabs for debugging purposes though because otherwise we
 * cannot scan all objects.
2007-05-07 01:49:36 +04:00
 *
 * Slabs are freed when they become empty. Teardown and setup is
 * minimal so we rely on the page allocators per cpu caches for
 * fast frees and allocs.
*
 * Overloading of page flags that are otherwise used for LRU management.
 *
2007-05-17 09:10:53 +04:00
 * PageActive		The slab is frozen and exempt from list processing.
 *			This means that the slab is dedicated to a purpose
 *			such as satisfying allocations for a specific
 *			processor. Objects may be freed in the slab while
 *			it is frozen but slab_free will then skip the usual
 *			list operations. It is up to the processor holding
 *			the slab to integrate the slab into the slab lists
 *			when the slab is no longer needed.
 *
 *			One use of this flag is to mark slabs that are
 *			used for allocations. Then such a slab becomes a cpu
 *			slab. The cpu slab may be equipped with an additional
2007-10-16 12:26:05 +04:00
 *			freelist that allows lockless access to
2007-05-10 14:15:16 +04:00
 *			free objects in addition to the regular freelist
 *			that requires the slab lock.
2007-05-07 01:49:36 +04:00
 *
 * PageError		Slab requires special handling due to debug
 *			options set. This moves slab handling out of
2007-05-10 14:15:16 +04:00
 *			the fast path and disables lockless freelists.
2007-05-07 01:49:36 +04:00
*/
2010-07-09 23:07:14 +04:00
static inline int kmem_cache_debug(struct kmem_cache *s)
{
2007-05-17 09:10:56 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-07-09 23:07:14 +04:00
	return unlikely(s->flags & SLAB_DEBUG_FLAGS);
2007-05-17 09:10:56 +04:00
# else
2010-07-09 23:07:14 +04:00
	return 0;
2007-05-17 09:10:56 +04:00
# endif
2010-07-09 23:07:14 +04:00
}
2007-05-17 09:10:56 +04:00
2013-06-19 09:05:52 +04:00
static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
{
# ifdef CONFIG_SLUB_CPU_PARTIAL
	return !kmem_cache_debug(s);
# else
	return false;
# endif
}
2007-05-07 01:49:36 +04:00
/*
 * Issues still to be resolved:
 *
 * - Support PAGE_ALLOC_DEBUG. Should be easy to do.
 *
 * - Variable sizing of the per node arrays
*/
/* Enable to test recovery from slab corruption on boot */
# undef SLUB_RESILIENCY_TEST
2011-06-01 21:25:49 +04:00
/* Enable to log cmpxchg failures */
# undef SLUB_DEBUG_CMPXCHG
2007-05-07 01:49:46 +04:00
/*
 * Minimum number of partial slabs. These will be left on the partial
 * lists even if they are empty. kmem_cache_shrink may reclaim them.
*/
2007-12-22 01:37:37 +03:00
# define MIN_PARTIAL 5
2007-05-07 01:49:44 +04:00
2007-05-07 01:49:46 +04:00
/*
 * Maximum number of desirable partial slabs.
 * The existence of more partial slabs makes kmem_cache_shrink
2013-11-08 16:47:37 +04:00
 * sort the partial list by the number of objects in use.
2007-05-07 01:49:46 +04:00
*/
# define MAX_PARTIAL 10
2007-05-07 01:49:36 +04:00
# define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER )
2007-05-09 13:32:39 +04:00
2009-07-07 11:14:14 +04:00
/*
2009-07-28 05:30:35 +04:00
 * Debugging flags that require metadata to be stored in the slab. These get
 * disabled when slub_debug=O is used and a cache's min order increases with
 * metadata.
2009-07-07 11:14:14 +04:00
*/
2009-07-28 05:30:35 +04:00
# define DEBUG_METADATA_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
2009-07-07 11:14:14 +04:00
2008-10-22 23:00:38 +04:00
# define OO_SHIFT 16
# define OO_MASK ((1 << OO_SHIFT) - 1)
2011-06-01 21:25:45 +04:00
# define MAX_OBJS_PER_PAGE 32767 /* since page.objects is u15 */
2008-10-22 23:00:38 +04:00
2007-05-07 01:49:36 +04:00
/* Internal SLUB flags */
2010-07-09 23:07:11 +04:00
# define __OBJECT_POISON 0x80000000UL /* Poison object */
2011-06-01 21:25:49 +04:00
# define __CMPXCHG_DOUBLE 0x40000000UL /* Use cmpxchg_double */
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_SMP
static struct notifier_block slab_notifier ;
# endif
2007-05-09 13:32:43 +04:00
/*
* Tracking user of a slab .
*/
2011-07-07 22:36:36 +04:00
# define TRACK_ADDRS_COUNT 16
2007-05-09 13:32:43 +04:00
struct track {
2008-08-19 21:43:25 +04:00
	unsigned long addr;	/* Called from address */
2011-07-07 22:36:36 +04:00
# ifdef CONFIG_STACKTRACE
	unsigned long addrs[TRACK_ADDRS_COUNT];	/* Called from address */
# endif
2007-05-09 13:32:43 +04:00
	int cpu;		/* Was running on cpu */
	int pid;		/* Pid context */
	unsigned long when;	/* When did the operation occur */
};
enum track_item { TRACK_ALLOC, TRACK_FREE };
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SYSFS
2007-05-07 01:49:36 +04:00
static int sysfs_slab_add(struct kmem_cache *);
static int sysfs_slab_alias(struct kmem_cache *, const char *);
slub: slub-specific propagation changes
SLUB allows us to tune a particular cache behavior with sysfs-based
tunables. When creating a new memcg cache copy, we'd like to preserve any
tunables the parent cache already had.
This can be done by tapping into the store attribute function provided by
the allocator. We of course don't need to mess with read-only fields.
Since the attributes can have multiple types and are stored internally by
sysfs, the best strategy is to issue a ->show() in the root cache, and
then ->store() in the memcg cache.
The drawback of that, is that sysfs can allocate up to a page in buffering
for show(), that we are likely not to need, but also can't guarantee. To
avoid always allocating a page for that, we can update the caches at store
time with the maximum attribute size ever stored to the root cache. We
will then get a buffer big enough to hold it. The corollary to this, is
that if no stores happened, nothing will be propagated.
It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.
[akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 02:23:05 +04:00
static void memcg_propagate_slab_attrs(struct kmem_cache *s);
2007-05-07 01:49:36 +04:00
# else
2007-07-17 15:03:24 +04:00
static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
							{ return 0; }
2012-12-19 02:23:05 +04:00
static inline void memcg_propagate_slab_attrs(struct kmem_cache *s) { }
2007-05-07 01:49:36 +04:00
# endif
2011-03-22 21:35:00 +03:00
static inline void stat(const struct kmem_cache *s, enum stat_item si)
2008-02-08 04:47:41 +03:00
{
# ifdef CONFIG_SLUB_STATS
2014-04-08 02:39:42 +04:00
	/*
	 * The rmw is racy on a preemptible kernel but this is acceptable, so
	 * avoid this_cpu_add()'s irq-disable overhead.
	 */
	raw_cpu_inc(s->cpu_slab->stat[si]);
2008-02-08 04:47:41 +03:00
# endif
}
2007-05-07 01:49:36 +04:00
/********************************************************************
 *			Core slab cache functions
 *******************************************************************/
2008-02-16 10:45:26 +03:00
/* Verify that a pointer has an address that is valid within a slab page */
2007-05-09 13:32:43 +04:00
static inline int check_valid_pointer(struct kmem_cache *s,
				struct page *page, const void *object)
{
	void *base;
2008-03-02 00:40:44 +03:00
	if (!object)
2007-05-09 13:32:43 +04:00
		return 1;
2008-03-02 00:40:44 +03:00
	base = page_address(page);
2008-04-14 20:11:30 +04:00
	if (object < base || object >= base + page->objects * s->size ||
2007-05-09 13:32:43 +04:00
		(object - base) % s->size) {
		return 0;
	}
	return 1;
}
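The bounds and alignment test in check_valid_pointer() can be tried outside the kernel. What follows is only a userspace sketch: `toy_cache`, `toy_slab` and `toy_valid_pointer` are hypothetical stand-ins for `struct kmem_cache`, `struct page` and the kernel helper, and the slab is assumed to be an ordinary buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct kmem_cache and struct page. */
struct toy_cache { size_t size; };			/* object stride in bytes */
struct toy_slab  { char *base; unsigned objects; };	/* slab start, object count */

/* Returns 1 if object is NULL or points at the start of an object slot. */
static int toy_valid_pointer(const struct toy_cache *s,
			     const struct toy_slab *slab, const void *object)
{
	uintptr_t p = (uintptr_t)object;
	uintptr_t base = (uintptr_t)slab->base;

	if (!object)
		return 1;	/* NULL ends a freelist, so it is accepted */
	if (p < base || p >= base + slab->objects * s->size)
		return 0;	/* outside the slab */
	if ((p - base) % s->size)
		return 0;	/* inside the slab, but not on an object boundary */
	return 1;
}
```

With a 32-byte stride and four objects, offsets 0, 32, 64 and 96 pass, while offset 33 (misaligned) and offset 128 (one past the last object) are rejected.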
2007-05-09 13:32:40 +04:00
static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
	return *(void **)(object + s->offset);
}
slub: prefetch next freelist pointer in slab_alloc()
Recycling a page is a problem, since freelist link chain is hot on
cpu(s) which freed objects, and possibly very cold on cpu currently
owning slab.
Adding a prefetch of cache line containing the pointer to next object in
slab_alloc() helps a lot in many workloads, in particular on asymmetric
ones (allocations done on one cpu, frees on other cpus). Added cost is
three machine instructions only.
Examples on my dual socket quad core ht machine (Intel CPU E5540
@2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
Before patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
 Performance counter stats for 'hackbench 50 process 4000' (32 runs):
     327577,471718 task-clock                #   15,821 CPUs utilized            ( +-  0,64% )
        28 866 491 context-switches          #    0,088 M/sec                    ( +-  1,80% )
         1 506 929 CPU-migrations            #    0,005 M/sec                    ( +-  3,24% )
           127 151 page-faults               #    0,000 M/sec                    ( +-  0,16% )
   829 399 813 448 cycles                    #    2,532 GHz                      ( +-  0,64% )
   580 664 691 740 stalled-cycles-frontend   #   70,01% frontend cycles idle     ( +-  0,71% )
   197 431 700 448 stalled-cycles-backend    #   23,80% backend  cycles idle     ( +-  1,03% )
   503 548 648 975 instructions              #    0,61  insns per cycle
                                             #    1,15  stalled cycles per insn  ( +-  0,46% )
    95 780 068 471 branches                  #  292,389 M/sec                    ( +-  0,48% )
     1 426 407 916 branch-misses             #    1,49% of all branches          ( +-  1,35% )
      20,705679994 seconds time elapsed                                          ( +-  0,64% )
After patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
 Performance counter stats for 'hackbench 50 process 4000' (32 runs):
     286236,542804 task-clock                #   15,786 CPUs utilized            ( +-  1,32% )
        19 703 372 context-switches          #    0,069 M/sec                    ( +-  4,99% )
         1 658 249 CPU-migrations            #    0,006 M/sec                    ( +-  6,62% )
           126 776 page-faults               #    0,000 M/sec                    ( +-  0,12% )
   724 636 593 213 cycles                    #    2,532 GHz                      ( +-  1,32% )
   499 320 714 837 stalled-cycles-frontend   #   68,91% frontend cycles idle     ( +-  1,47% )
   156 555 126 809 stalled-cycles-backend    #   21,60% backend  cycles idle     ( +-  2,22% )
   463 897 792 661 instructions              #    0,64  insns per cycle
                                             #    1,08  stalled cycles per insn  ( +-  0,94% )
    87 717 352 563 branches                  #  306,451 M/sec                    ( +-  0,99% )
       941 738 280 branch-misses             #    1,07% of all branches          ( +-  3,35% )
      18,132070670 seconds time elapsed                                          ( +-  1,30% )
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-12-16 19:25:34 +04:00
static void prefetch_freepointer(const struct kmem_cache *s, void *object)
{
	prefetch(object + s->offset);
}
2011-05-17 00:26:08 +04:00
static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
{
	void *p;
# ifdef CONFIG_DEBUG_PAGEALLOC
	probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
# else
	p = get_freepointer(s, object);
# endif
	return p;
}
2007-05-09 13:32:40 +04:00
static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
{
	*(void **)(object + s->offset) = fp;
}
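The free pointer helpers above store the link to the next free object inside the free object itself, at byte offset s->offset. A minimal userspace sketch of that idea follows; `toy_cache`, `toy_build_freelist` and `toy_alloc` are hypothetical names that do not exist in the kernel, and memcpy() is used instead of a cast so the sketch does not depend on the alignment of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cache descriptor: object stride and free pointer offset. */
struct toy_cache { size_t size; size_t offset; };

static void *toy_get_freepointer(const struct toy_cache *s, void *object)
{
	void *fp;

	memcpy(&fp, (char *)object + s->offset, sizeof(fp));
	return fp;
}

static void toy_set_freepointer(const struct toy_cache *s, void *object, void *fp)
{
	memcpy((char *)object + s->offset, &fp, sizeof(fp));
}

/* Link all objects starting at addr into one list; returns the list head. */
static void *toy_build_freelist(const struct toy_cache *s, void *addr,
				unsigned objects)
{
	char *p = addr;
	unsigned i;

	for (i = 0; i < objects; i++) {
		/* The last object terminates the list with NULL. */
		void *next = (i + 1 < objects) ? (void *)(p + s->size) : NULL;

		toy_set_freepointer(s, p, next);
		p += s->size;
	}
	return objects ? addr : NULL;
}

/* Pop the first free object and advance the head to its stored successor. */
static void *toy_alloc(const struct toy_cache *s, void **freelist)
{
	void *object = *freelist;

	if (object)
		*freelist = toy_get_freepointer(s, object);
	return object;
}
```

Because the link lives in memory that is otherwise unused while the object is free, no separate bookkeeping array is needed; the cost is that a stray write to a freed object can corrupt the list.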
/* Loop over all objects in a slab */
2008-04-14 20:11:31 +04:00
# define for_each_object(__p, __s, __addr, __objects) \
	for (__p = (__addr); __p < (__addr) + (__objects) * (__s)->size;\
2007-05-09 13:32:40 +04:00
			__p += (__s)->size)
2014-08-07 03:04:42 +04:00
# define for_each_object_idx(__p, __idx, __s, __addr, __objects) \
	for (__p = (__addr), __idx = 1; __idx <= __objects;\
			__p += (__s)->size, __idx++)
2007-05-09 13:32:40 +04:00
/* Determine object index from a given position */
static inline int slab_index(void *p, struct kmem_cache *s, void *addr)
{
	return (p - addr) / s->size;
}
2011-02-26 22:10:26 +03:00
static inline size_t slab_ksize(const struct kmem_cache *s)
{
# ifdef CONFIG_SLUB_DEBUG
	/*
	 * Debugging requires use of the padding between object
	 * and whatever may come after it.
	 */
	if (s->flags & (SLAB_RED_ZONE | SLAB_POISON))
2012-06-13 19:24:57 +04:00
		return s->object_size;
2011-02-26 22:10:26 +03:00
# endif
	/*
	 * If we have the need to store the freelist pointer
	 * back there or track user information then we can
	 * only use the space before that information.
	 */
	if (s->flags & (SLAB_DESTROY_BY_RCU | SLAB_STORE_USER))
		return s->inuse;
	/*
	 * Else we can use all the padding etc for the allocation
	 */
	return s->size;
}
2011-03-10 10:21:48 +03:00
static inline int order_objects(int order, unsigned long size, int reserved)
{
	return ((PAGE_SIZE << order) - reserved) / size;
}
2008-04-14 20:11:31 +04:00
static inline struct kmem_cache_order_objects oo_make(int order,
2011-03-10 10:21:48 +03:00
		unsigned long size, int reserved)
2008-04-14 20:11:31 +04:00
{
	struct kmem_cache_order_objects x = {
2011-03-10 10:21:48 +03:00
		(order << OO_SHIFT) + order_objects(order, size, reserved)
2008-04-14 20:11:31 +04:00
	};
	return x;
}
static inline int oo_order(struct kmem_cache_order_objects x)
{
2008-10-22 23:00:38 +04:00
	return x.x >> OO_SHIFT;
2008-04-14 20:11:31 +04:00
}
static inline int oo_objects(struct kmem_cache_order_objects x)
{
2008-10-22 23:00:38 +04:00
	return x.x & OO_MASK;
2008-04-14 20:11:31 +04:00
}
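The oo_* helpers pack the page order into the bits above OO_SHIFT and the object count into the bits below it. The arithmetic can be checked with a small userspace sketch; the `toy_` names are hypothetical, and a page size of 4096 bytes is an assumption made only for the example.

```c
#include <assert.h>

#define TOY_PAGE_SIZE	4096UL	/* assumed page size, not taken from the source */
#define TOY_OO_SHIFT	16
#define TOY_OO_MASK	((1UL << TOY_OO_SHIFT) - 1)

struct toy_oo { unsigned long x; };

/* Objects of 'size' bytes that fit in 2^order pages minus 'reserved' bytes. */
static unsigned long toy_objects_per_slab(int order, unsigned long size,
					  int reserved)
{
	return ((TOY_PAGE_SIZE << order) - reserved) / size;
}

/* Pack the order (high bits) and the object count (low 16 bits) in one word. */
static struct toy_oo toy_oo_make(int order, unsigned long size, int reserved)
{
	struct toy_oo x = {
		((unsigned long)order << TOY_OO_SHIFT) +
			toy_objects_per_slab(order, size, reserved)
	};
	return x;
}

static int toy_oo_order(struct toy_oo x)
{
	return (int)(x.x >> TOY_OO_SHIFT);
}

static int toy_oo_objects(struct toy_oo x)
{
	return (int)(x.x & TOY_OO_MASK);
}
```

For order 1 and 256-byte objects the slab spans 8192 bytes and holds 32 objects, so the packed word is (1 << 16) + 32. The packing only round-trips while the object count fits below OO_SHIFT, which MAX_OBJS_PER_PAGE (32767) guarantees.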
2011-06-01 21:25:53 +04:00
/*
* Per slab locking using the pagelock
*/
static __always_inline void slab_lock(struct page *page)
{
	bit_spin_lock(PG_locked, &page->flags);
}
static __always_inline void slab_unlock(struct page *page)
{
	__bit_spin_unlock(PG_locked, &page->flags);
}
mm/slub.c: fix page->_count corruption (again)
Commit abca7c496584 ("mm: fix slab->page _count corruption when using
slub") notes that we can not _set_ a page->counters directly, except
when using a real double-cmpxchg. Doing so can lose updates to
->_count.
That is an absolute rule:
You may not *set* page->counters except via a cmpxchg.
Commit abca7c496584 fixed this for the folks who have the slub
cmpxchg_double code turned off at compile time, but it left the bad case
alone. It can still be reached, and the same bug triggered in two
cases:
1. Turning on slub debugging at runtime, which is available on
the distro kernels that I looked at.
2. On 64-bit CPUs with no CMPXCHG16B (some early AMD x86-64
cpus, evidently)
There are at least 3 ways we could fix this:
1. Take all of the existing calls to cmpxchg_double_slab() and
__cmpxchg_double_slab() and convert them to take an old, new
and target 'struct page'.
2. Do (1), but with the newly-introduced 'slub_data'.
3. Do some magic inside the two cmpxchg...slab() functions to
pull the counters out of new_counters and only set those
fields in page->{inuse,frozen,objects}.
I've done (2) as well, but it's a bunch more code. This patch is an
attempt at (3). This was the most straightforward and foolproof way
that I could think to do this.
This would also technically allow us to get rid of the ugly
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
in 'struct page', but leaving it alone has the added benefit that
'counters' stays 'unsigned' instead of 'unsigned long', so all the
copies that the slub code does stay a bit smaller.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-31 03:46:09 +04:00
static inline void set_page_slub_counters(struct page *page, unsigned long counters_new)
{
	struct page tmp;
	tmp.counters = counters_new;
	/*
	 * page->counters can cover frozen/inuse/objects as well
	 * as page->_count. If we assign to ->counters directly
	 * we run the risk of losing updates to page->_count, so
	 * be careful and only assign to the fields we need.
	 */
	page->frozen = tmp.frozen;
	page->inuse = tmp.inuse;
	page->objects = tmp.objects;
}
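The reason set_page_slub_counters() copies three fields instead of assigning ->counters can be seen in a small model. The layout below is hypothetical and much simpler than the real struct page: it only assumes that, on a target where unsigned long is 8 bytes, the `counters` word overlaps both the slab bitfields and a neighbouring reference count (called `refcount` here, playing the role of page->_count).

```c
#include <assert.h>

/* Hypothetical miniature of the overlapping fields in struct page. */
struct toy_page {
	union {
		unsigned long counters;		/* whole word */
		struct {
			unsigned inuse:16;
			unsigned objects:15;
			unsigned frozen:1;
			int refcount;		/* stands in for page->_count */
		} f;
	} u;
};

/* Copy only the slab fields out of counters_new; refcount is not written. */
static void toy_set_counters(struct toy_page *page, unsigned long counters_new)
{
	struct toy_page tmp = { { 0 } };

	tmp.u.counters = counters_new;
	page->u.f.frozen = tmp.u.f.frozen;
	page->u.f.inuse = tmp.u.f.inuse;
	page->u.f.objects = tmp.u.f.objects;
}
```

A plain `page->u.counters = counters_new` would also overwrite `refcount` with whatever stale value was captured in `counters_new`, which is the lost update the commit message describes.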
2011-07-14 21:49:12 +04:00
/* Interrupts must be disabled (for the fallback code to work right) */
static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
		void *freelist_old, unsigned long counters_old,
		void *freelist_new, unsigned long counters_new,
		const char *n)
{
	VM_BUG_ON(!irqs_disabled());
2012-01-13 05:17:33 +04:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
	defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
2011-07-14 21:49:12 +04:00
	if (s->flags & __CMPXCHG_DOUBLE) {
2012-01-02 21:02:18 +04:00
		if (cmpxchg_double(&page->freelist, &page->counters,
2014-08-07 03:04:48 +04:00
				   freelist_old, counters_old,
				   freelist_new, counters_new))
2015-04-15 01:44:31 +03:00
			return true;
2011-07-14 21:49:12 +04:00
	} else
# endif
	{
		slab_lock(page);
2013-07-15 05:05:29 +04:00
		if (page->freelist == freelist_old &&
					page->counters == counters_old) {
2011-07-14 21:49:12 +04:00
			page->freelist = freelist_new;
2014-01-31 03:46:09 +04:00
set_page_slub_counters ( page , counters_new ) ;
2011-07-14 21:49:12 +04:00
slab_unlock ( page ) ;
2015-04-15 01:44:31 +03:00
return true ;
2011-07-14 21:49:12 +04:00
}
slab_unlock ( page ) ;
}
cpu_relax ( ) ;
stat ( s , CMPXCHG_DOUBLE_FAIL ) ;
# ifdef SLUB_DEBUG_CMPXCHG
2014-06-05 03:06:34 +04:00
pr_info ( " %s %s: cmpxchg double redo " , n , s - > name ) ;
2011-07-14 21:49:12 +04:00
# endif
2015-04-15 01:44:31 +03:00
return false ;
2011-07-14 21:49:12 +04:00
}
2011-06-01 21:25:49 +04:00
static inline bool cmpxchg_double_slab ( struct kmem_cache * s , struct page * page ,
void * freelist_old , unsigned long counters_old ,
void * freelist_new , unsigned long counters_new ,
const char * n )
{
2012-01-13 05:17:33 +04:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined ( CONFIG_HAVE_ALIGNED_STRUCT_PAGE )
2011-06-01 21:25:49 +04:00
if ( s - > flags & __CMPXCHG_DOUBLE ) {
2012-01-02 21:02:18 +04:00
if ( cmpxchg_double ( & page - > freelist , & page - > counters ,
2014-08-07 03:04:48 +04:00
freelist_old , counters_old ,
freelist_new , counters_new ) )
2015-04-15 01:44:31 +03:00
return true ;
2011-06-01 21:25:49 +04:00
} else
# endif
{
2011-07-14 21:49:12 +04:00
unsigned long flags ;
local_irq_save ( flags ) ;
2011-06-01 21:25:53 +04:00
slab_lock ( page ) ;
2013-07-15 05:05:29 +04:00
if ( page - > freelist = = freelist_old & &
page - > counters = = counters_old ) {
2011-06-01 21:25:49 +04:00
page - > freelist = freelist_new ;
2014-01-31 03:46:09 +04:00
set_page_slub_counters ( page , counters_new ) ;
2011-06-01 21:25:53 +04:00
slab_unlock ( page ) ;
2011-07-14 21:49:12 +04:00
local_irq_restore ( flags ) ;
2015-04-15 01:44:31 +03:00
return true ;
2011-06-01 21:25:49 +04:00
}
2011-06-01 21:25:53 +04:00
slab_unlock ( page ) ;
2011-07-14 21:49:12 +04:00
local_irq_restore ( flags ) ;
2011-06-01 21:25:49 +04:00
}
cpu_relax ( ) ;
stat ( s , CMPXCHG_DOUBLE_FAIL ) ;
# ifdef SLUB_DEBUG_CMPXCHG
2014-06-05 03:06:34 +04:00
pr_info ( " %s %s: cmpxchg double redo " , n , s - > name ) ;
2011-06-01 21:25:49 +04:00
# endif
2015-04-15 01:44:31 +03:00
return false ;
2011-06-01 21:25:49 +04:00
}
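The rule quoted in the commit message above (never store the whole page->counters word outside a real cmpxchg, because the word overlays ->_count) can be modelled in user space. The following is a minimal sketch, not the kernel's code: `struct fake_page`, its field widths, `make_counters()` and `set_slub_counter_fields()` are hypothetical names chosen for illustration. It follows option (3): unpack the new counters word into a temporary and copy only the inuse, objects and frozen fields.

```c
#include <assert.h>

/* Hypothetical stand-in for the part of struct page discussed above. */
struct fake_page {
	union {
		unsigned long counters;	/* whole word, as callers pass it around */
		struct {
			unsigned inuse:16;
			unsigned objects:15;
			unsigned frozen:1;
			int refcount;	/* models ->_count overlapping the word on 64-bit */
		};
	};
};

/* Pack the three slab fields into a counters word; other bits stay zero. */
static unsigned long make_counters(unsigned inuse, unsigned objects,
				   unsigned frozen)
{
	struct fake_page tmp;

	assert(inuse < (1u << 16) && objects < (1u << 15) && frozen <= 1);
	tmp.counters = 0;
	tmp.inuse = inuse;
	tmp.objects = objects;
	tmp.frozen = frozen;
	return tmp.counters;
}

/* Copy only the slab fields out of counters_new; refcount is never written. */
static void set_slub_counter_fields(struct fake_page *page,
				    unsigned long counters_new)
{
	struct fake_page tmp;

	tmp.counters = counters_new;
	page->frozen = tmp.frozen;
	page->inuse = tmp.inuse;
	page->objects = tmp.objects;
}
```

In this model a plain `page->counters = counters_new` would also overwrite `refcount` on a 64-bit build, which is the lost-update hazard the commit message describes.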
2007-05-09 13:32:44 +04:00
# ifdef CONFIG_SLUB_DEBUG
2011-04-15 23:48:13 +04:00
/*
* Determine a map of objects in use on a page .
*
2011-06-01 21:25:53 +04:00
* Node listlock must be held to guarantee that the page does
2011-04-15 23:48:13 +04:00
* not vanish from under us .
*/
static void get_map ( struct kmem_cache * s , struct page * page , unsigned long * map )
{
void * p ;
void * addr = page_address ( page ) ;
for ( p = page - > freelist ; p ; p = get_freepointer ( s , p ) )
set_bit ( slab_index ( p , s , addr ) , map ) ;
}
2007-05-09 13:32:44 +04:00
/*
* Debug settings :
*/
2015-11-06 05:51:23 +03:00
# if defined(CONFIG_SLUB_DEBUG_ON)
2007-07-16 10:38:14 +04:00
static int slub_debug = DEBUG_DEFAULT_FLAGS ;
2015-11-06 05:51:23 +03:00
# elif defined(CONFIG_KASAN)
static int slub_debug = SLAB_STORE_USER ;
2007-07-16 10:38:14 +04:00
# else
2007-05-09 13:32:44 +04:00
static int slub_debug ;
2007-07-16 10:38:14 +04:00
# endif
2007-05-09 13:32:44 +04:00
static char * slub_debug_slabs ;
2009-07-07 11:14:14 +04:00
static int disable_higher_order_debug ;
2007-05-09 13:32:44 +04:00
2015-02-14 01:39:38 +03:00
/*
* slub is about to manipulate internal object metadata . This memory lies
* outside the range of the allocated object , so accessing it would normally
* be reported by kasan as a bounds error . metadata_access_enable ( ) is used
* to tell kasan that these accesses are OK .
*/
static inline void metadata_access_enable ( void )
{
kasan_disable_current ( ) ;
}
static inline void metadata_access_disable ( void )
{
kasan_enable_current ( ) ;
}
2007-05-07 01:49:36 +04:00
/*
* Object debugging
*/
static void print_section ( char * text , u8 * addr , unsigned int length )
{
2015-02-14 01:39:38 +03:00
metadata_access_enable ( ) ;
2011-07-29 16:10:20 +04:00
print_hex_dump ( KERN_ERR , text , DUMP_PREFIX_ADDRESS , 16 , 1 , addr ,
length , 1 ) ;
2015-02-14 01:39:38 +03:00
metadata_access_disable ( ) ;
2007-05-07 01:49:36 +04:00
}
static struct track * get_track ( struct kmem_cache * s , void * object ,
enum track_item alloc )
{
struct track * p ;
if ( s - > offset )
p = object + s - > offset + sizeof ( void * ) ;
else
p = object + s - > inuse ;
return p + alloc ;
}
static void set_track ( struct kmem_cache * s , void * object ,
2008-08-19 21:43:25 +04:00
enum track_item alloc , unsigned long addr )
2007-05-07 01:49:36 +04:00
{
2009-03-06 18:36:21 +03:00
struct track * p = get_track ( s , object , alloc ) ;
2007-05-07 01:49:36 +04:00
if ( addr ) {
2011-07-07 22:36:36 +04:00
# ifdef CONFIG_STACKTRACE
struct stack_trace trace ;
int i ;
trace . nr_entries = 0 ;
trace . max_entries = TRACK_ADDRS_COUNT ;
trace . entries = p - > addrs ;
trace . skip = 3 ;
2015-02-14 01:39:38 +03:00
metadata_access_enable ( ) ;
2011-07-07 22:36:36 +04:00
save_stack_trace ( & trace ) ;
2015-02-14 01:39:38 +03:00
metadata_access_disable ( ) ;
2011-07-07 22:36:36 +04:00
/* See rant in lockdep.c */
if ( trace . nr_entries ! = 0 & &
trace . entries [ trace . nr_entries - 1 ] = = ULONG_MAX )
trace . nr_entries - - ;
for ( i = trace . nr_entries ; i < TRACK_ADDRS_COUNT ; i + + )
p - > addrs [ i ] = 0 ;
# endif
2007-05-07 01:49:36 +04:00
p - > addr = addr ;
p - > cpu = smp_processor_id ( ) ;
2008-06-23 02:58:37 +04:00
p - > pid = current - > pid ;
2007-05-07 01:49:36 +04:00
p - > when = jiffies ;
} else
memset ( p , 0 , sizeof ( struct track ) ) ;
}
static void init_tracking ( struct kmem_cache * s , void * object )
{
2007-07-17 15:03:18 +04:00
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2008-08-19 21:43:25 +04:00
set_track ( s , object , TRACK_FREE , 0UL ) ;
set_track ( s , object , TRACK_ALLOC , 0UL ) ;
2007-05-07 01:49:36 +04:00
}
static void print_track ( const char * s , struct track * t )
{
if ( ! t - > addr )
return ;
2014-06-05 03:06:34 +04:00
pr_err ( " INFO: %s in %pS age=%lu cpu=%u pid=%d \n " ,
s , ( void * ) t - > addr , jiffies - t - > when , t - > cpu , t - > pid ) ;
2011-07-07 22:36:36 +04:00
# ifdef CONFIG_STACKTRACE
{
int i ;
for ( i = 0 ; i < TRACK_ADDRS_COUNT ; i + + )
if ( t - > addrs [ i ] )
2014-06-05 03:06:34 +04:00
pr_err ( " \t %pS \n " , ( void * ) t - > addrs [ i ] ) ;
2011-07-07 22:36:36 +04:00
else
break ;
}
# endif
2007-07-17 15:03:18 +04:00
}
static void print_tracking ( struct kmem_cache * s , void * object )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
print_track ( " Allocated " , get_track ( s , object , TRACK_ALLOC ) ) ;
print_track ( " Freed " , get_track ( s , object , TRACK_FREE ) ) ;
}
static void print_page_info ( struct page * page )
{
2014-06-05 03:06:34 +04:00
pr_err ( " INFO: Slab 0x%p objects=%u used=%u fp=0x%p flags=0x%04lx \n " ,
2013-07-15 05:05:29 +04:00
page , page - > objects , page - > inuse , page - > freelist , page - > flags ) ;
2007-07-17 15:03:18 +04:00
}
static void slab_bug ( struct kmem_cache * s , char * fmt , . . . )
{
2014-06-05 03:06:35 +04:00
struct va_format vaf ;
2007-07-17 15:03:18 +04:00
va_list args ;
va_start ( args , fmt ) ;
2014-06-05 03:06:35 +04:00
vaf . fmt = fmt ;
vaf . va = & args ;
2014-06-05 03:06:34 +04:00
pr_err ( " ============================================================================= \n " ) ;
2014-06-05 03:06:35 +04:00
pr_err ( " BUG %s (%s): %pV \n " , s - > name , print_tainted ( ) , & vaf ) ;
2014-06-05 03:06:34 +04:00
pr_err ( " ----------------------------------------------------------------------------- \n \n " ) ;
2012-09-18 23:54:12 +04:00
2013-01-21 10:47:39 +04:00
add_taint ( TAINT_BAD_PAGE , LOCKDEP_NOW_UNRELIABLE ) ;
2014-06-05 03:06:35 +04:00
va_end ( args ) ;
2007-05-07 01:49:36 +04:00
}
2007-07-17 15:03:18 +04:00
static void slab_fix ( struct kmem_cache * s , char * fmt , . . . )
{
2014-06-05 03:06:35 +04:00
struct va_format vaf ;
2007-07-17 15:03:18 +04:00
va_list args ;
va_start ( args , fmt ) ;
2014-06-05 03:06:35 +04:00
vaf . fmt = fmt ;
vaf . va = & args ;
pr_err ( " FIX %s: %pV \n " , s - > name , & vaf ) ;
2007-07-17 15:03:18 +04:00
va_end ( args ) ;
}
static void print_trailer ( struct kmem_cache * s , struct page * page , u8 * p )
2007-05-07 01:49:36 +04:00
{
unsigned int off ; /* Offset of last byte */
2008-03-02 00:40:44 +03:00
u8 * addr = page_address ( page ) ;
2007-07-17 15:03:18 +04:00
print_tracking ( s , p ) ;
print_page_info ( page ) ;
2014-06-05 03:06:34 +04:00
pr_err ( " INFO: Object 0x%p @offset=%tu fp=0x%p \n \n " ,
p , p - addr , get_freepointer ( s , p ) ) ;
2007-07-17 15:03:18 +04:00
if ( p > addr + 16 )
2011-07-29 16:10:20 +04:00
print_section ( " Bytes b4 " , p - 16 , 16 ) ;
2007-05-07 01:49:36 +04:00
2012-06-13 19:24:57 +04:00
print_section ( " Object " , p , min_t ( unsigned long , s - > object_size ,
2011-07-29 16:10:20 +04:00
PAGE_SIZE ) ) ;
2007-05-07 01:49:36 +04:00
if ( s - > flags & SLAB_RED_ZONE )
2012-06-13 19:24:57 +04:00
print_section ( " Redzone " , p + s - > object_size ,
s - > inuse - s - > object_size ) ;
2007-05-07 01:49:36 +04:00
if ( s - > offset )
off = s - > offset + sizeof ( void * ) ;
else
off = s - > inuse ;
2007-07-17 15:03:18 +04:00
if ( s - > flags & SLAB_STORE_USER )
2007-05-07 01:49:36 +04:00
off + = 2 * sizeof ( struct track ) ;
if ( off ! = s - > size )
/* Beginning of the filler is the free pointer */
2011-07-29 16:10:20 +04:00
print_section ( " Padding " , p + off , s - > size - off ) ;
2007-07-17 15:03:18 +04:00
dump_stack ( ) ;
2007-05-07 01:49:36 +04:00
}
2015-02-14 01:39:35 +03:00
void object_err ( struct kmem_cache * s , struct page * page ,
2007-05-07 01:49:36 +04:00
u8 * object , char * reason )
{
2008-04-23 23:28:01 +04:00
slab_bug ( s , " %s " , reason ) ;
2007-07-17 15:03:18 +04:00
print_trailer ( s , page , object ) ;
2007-05-07 01:49:36 +04:00
}
2013-07-15 05:05:29 +04:00
static void slab_err ( struct kmem_cache * s , struct page * page ,
const char * fmt , . . . )
2007-05-07 01:49:36 +04:00
{
va_list args ;
char buf [ 100 ] ;
2007-07-17 15:03:18 +04:00
va_start ( args , fmt ) ;
vsnprintf ( buf , sizeof ( buf ) , fmt , args ) ;
2007-05-07 01:49:36 +04:00
va_end ( args ) ;
2008-04-23 23:28:01 +04:00
slab_bug ( s , " %s " , buf ) ;
2007-07-17 15:03:18 +04:00
print_page_info ( page ) ;
2007-05-07 01:49:36 +04:00
dump_stack ( ) ;
}
2010-09-29 16:15:01 +04:00
static void init_object ( struct kmem_cache * s , void * object , u8 val )
2007-05-07 01:49:36 +04:00
{
u8 * p = object ;
if ( s - > flags & __OBJECT_POISON ) {
2012-06-13 19:24:57 +04:00
memset ( p , POISON_FREE , s - > object_size - 1 ) ;
p [ s - > object_size - 1 ] = POISON_END ;
2007-05-07 01:49:36 +04:00
}
if ( s - > flags & SLAB_RED_ZONE )
2012-06-13 19:24:57 +04:00
memset ( p + s - > object_size , val , s - > inuse - s - > object_size ) ;
2007-05-07 01:49:36 +04:00
}
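The byte patterns written by init_object() can be tried out in user space. This is a hedged sketch only: `poison_object()` is a hypothetical helper, the pattern values are the ones quoted in this file's object layout comment, and both fills are done unconditionally instead of depending on cache flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pattern bytes as quoted in the object layout comment of this file. */
#define POISON_FREE	0x6b
#define POISON_END	0xa5
#define RED_INACTIVE	0xbb
#define RED_ACTIVE	0xcc

/*
 * Hypothetical helper: poison the first object_size bytes and fill the
 * red zone that spans [object_size, inuse) with val.
 */
static void poison_object(unsigned char *p, size_t object_size, size_t inuse,
			  unsigned char val)
{
	assert(object_size >= 1 && inuse >= object_size);
	memset(p, POISON_FREE, object_size - 1);
	p[object_size - 1] = POISON_END;
	memset(p + object_size, val, inuse - object_size);
}
```

For a 10-byte object with inuse equal to 16, bytes 0 to 8 hold 0x6b, byte 9 holds 0xa5, and bytes 10 to 15 hold the red zone value.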
2007-07-17 15:03:18 +04:00
static void restore_bytes ( struct kmem_cache * s , char * message , u8 data ,
void * from , void * to )
{
slab_fix ( s , " Restoring 0x%p-0x%p=0x%x \n " , from , to - 1 , data ) ;
memset ( from , data , to - from ) ;
}
static int check_bytes_and_report ( struct kmem_cache * s , struct page * page ,
u8 * object , char * what ,
2008-01-08 10:20:27 +03:00
u8 * start , unsigned int value , unsigned int bytes )
2007-07-17 15:03:18 +04:00
{
u8 * fault ;
u8 * end ;
2015-02-14 01:39:38 +03:00
metadata_access_enable ( ) ;
2011-11-01 04:08:07 +04:00
fault = memchr_inv ( start , value , bytes ) ;
2015-02-14 01:39:38 +03:00
metadata_access_disable ( ) ;
2007-07-17 15:03:18 +04:00
if ( ! fault )
return 1 ;
end = start + bytes ;
while ( end > fault & & end [ - 1 ] = = value )
end - - ;
slab_bug ( s , " %s overwritten " , what ) ;
2014-06-05 03:06:34 +04:00
pr_err ( " INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x \n " ,
2007-07-17 15:03:18 +04:00
fault , end - 1 , fault [ 0 ] , value ) ;
print_trailer ( s , page , object ) ;
restore_bytes ( s , what , value , fault , end ) ;
return 0 ;
2007-05-07 01:49:36 +04:00
}
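The scan in check_bytes_and_report() has two steps: find the first byte that differs from the expected value, then trim matching bytes off the end so the report covers only the damaged range. A user-space sketch of that logic follows. `struct bad_range` and `find_overwritten()` are hypothetical, and a plain loop stands in for memchr_inv().

```c
#include <assert.h>
#include <stddef.h>

struct bad_range {
	const unsigned char *first;	/* first byte that differs, or NULL */
	const unsigned char *end;	/* one past the last byte that differs */
};

/* Scan [start, start + bytes) for bytes that are not equal to value. */
static struct bad_range find_overwritten(const unsigned char *start,
					 unsigned char value, size_t bytes)
{
	struct bad_range r = { NULL, NULL };
	const unsigned char *end = start + bytes;
	size_t i;

	assert(start != NULL);
	for (i = 0; i < bytes; i++) {
		if (start[i] != value) {
			r.first = start + i;
			break;
		}
	}
	if (!r.first)
		return r;

	/* Trim trailing bytes that still hold the expected value. */
	while (end > r.first && end[-1] == value)
		end--;
	r.end = end;
	return r;
}
```

Bytes inside the reported range that still hold the expected value are left in it, which matches the original: restore_bytes() rewrites the whole range from the first fault to the trimmed end.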
/*
* Object layout :
*
* object address
* Bytes of the object to be managed .
* If the freepointer may overlay the object then the free
* pointer is the first word of the object .
2007-05-09 13:32:39 +04:00
*
2007-05-07 01:49:36 +04:00
* Poisoning uses 0x6b ( POISON_FREE ) and the last byte is
* 0xa5 ( POISON_END )
*
2012-06-13 19:24:57 +04:00
* object + s - > object_size
2007-05-07 01:49:36 +04:00
* Padding to reach word boundary . This is also used for Redzoning .
2007-05-09 13:32:39 +04:00
* Padding is extended by another word if Redzoning is enabled and
2012-06-13 19:24:57 +04:00
* object_size = = inuse .
2007-05-09 13:32:39 +04:00
*
2007-05-07 01:49:36 +04:00
* We fill with 0xbb ( RED_INACTIVE ) for inactive objects and with
* 0xcc ( RED_ACTIVE ) for objects in use .
*
* object + s - > inuse
2007-05-09 13:32:39 +04:00
* Meta data starts here .
*
2007-05-07 01:49:36 +04:00
* A . Free pointer ( if we cannot overwrite object on free )
* B . Tracking data for SLAB_STORE_USER
2007-05-09 13:32:39 +04:00
* C . Padding to reach required alignment boundary or at minimum
2008-02-16 10:45:26 +03:00
* one word if debugging is on to be able to detect writes
2007-05-09 13:32:39 +04:00
* before the word boundary .
*
* Padding is done using 0x5a ( POISON_INUSE )
2007-05-07 01:49:36 +04:00
*
* object + s - > size
2007-05-09 13:32:39 +04:00
* Nothing is used beyond s - > size .
2007-05-07 01:49:36 +04:00
*
2012-06-13 19:24:57 +04:00
* If slabcaches are merged then the object_size and inuse boundaries are mostly
2007-05-09 13:32:39 +04:00
* ignored . And therefore no slab options that rely on these boundaries
2007-05-07 01:49:36 +04:00
* may be used with merged slabcaches .
*/
static int check_pad_bytes ( struct kmem_cache * s , struct page * page , u8 * p )
{
unsigned long off = s - > inuse ; /* The end of info */
if ( s - > offset )
/* Freepointer is placed after the object. */
off + = sizeof ( void * ) ;
if ( s - > flags & SLAB_STORE_USER )
/* We also have user information there */
off + = 2 * sizeof ( struct track ) ;
if ( s - > size = = off )
return 1 ;
2007-07-17 15:03:18 +04:00
return check_bytes_and_report ( s , page , p , " Object padding " ,
p + off , POISON_INUSE , s - > size - off ) ;
2007-05-07 01:49:36 +04:00
}
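The offset arithmetic in check_pad_bytes() follows directly from the object layout comment: padding starts after inuse, after the free pointer if it is stored outside the object, and after two tracking records if user tracking is on. The sketch below restates that arithmetic in user space. `struct layout` and `padding_start()` are hypothetical, and `track_size` stands in for sizeof(struct track), which is not defined in this section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of the cache geometry used by the check. */
struct layout {
	size_t inuse;		/* end of object plus red zone */
	size_t offset;		/* non-zero if the free pointer follows the object */
	size_t size;		/* full per-object stride */
	size_t track_size;	/* stands in for sizeof(struct track) */
	bool store_user;	/* models SLAB_STORE_USER */
};

/* Offset, from the object start, at which trailing padding begins. */
static size_t padding_start(const struct layout *l)
{
	size_t off = l->inuse;

	if (l->offset)
		off += sizeof(void *);		/* free pointer after the object */
	if (l->store_user)
		off += 2 * l->track_size;	/* alloc and free records */
	assert(off <= l->size);
	return off;
}
```

The number of padding bytes to verify is then `size - padding_start()`, and a result of zero means there is nothing to check.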
2008-04-14 20:11:30 +04:00
/* Check the pad bytes at the end of a slab page */
2007-05-07 01:49:36 +04:00
static int slab_pad_check ( struct kmem_cache * s , struct page * page )
{
2007-07-17 15:03:18 +04:00
u8 * start ;
u8 * fault ;
u8 * end ;
int length ;
int remainder ;
2007-05-07 01:49:36 +04:00
if ( ! ( s - > flags & SLAB_POISON ) )
return 1 ;
2008-03-02 00:40:44 +03:00
start = page_address ( page ) ;
2011-03-10 10:21:48 +03:00
length = ( PAGE_SIZE < < compound_order ( page ) ) - s - > reserved ;
2008-04-14 20:11:30 +04:00
end = start + length ;
remainder = length % s - > size ;
2007-05-07 01:49:36 +04:00
if ( ! remainder )
return 1 ;
2015-02-14 01:39:38 +03:00
metadata_access_enable ( ) ;
2011-11-01 04:08:07 +04:00
fault = memchr_inv ( end - remainder , POISON_INUSE , remainder ) ;
2015-02-14 01:39:38 +03:00
metadata_access_disable ( ) ;
2007-07-17 15:03:18 +04:00
if ( ! fault )
return 1 ;
while ( end > fault & & end [ - 1 ] = = POISON_INUSE )
end - - ;
slab_err ( s , page , " Padding overwritten. 0x%p-0x%p " , fault , end - 1 ) ;
2011-07-29 16:10:20 +04:00
print_section ( " Padding " , end - remainder , remainder ) ;
2007-07-17 15:03:18 +04:00
2009-09-03 18:08:06 +04:00
restore_bytes ( s , " slab padding " , POISON_INUSE , end - remainder , end ) ;
2007-07-17 15:03:18 +04:00
return 0 ;
2007-05-07 01:49:36 +04:00
}
static int check_object ( struct kmem_cache * s , struct page * page ,
2010-09-29 16:15:01 +04:00
void * object , u8 val )
2007-05-07 01:49:36 +04:00
{
u8 * p = object ;
2012-06-13 19:24:57 +04:00
u8 * endobject = object + s - > object_size ;
2007-05-07 01:49:36 +04:00
if ( s - > flags & SLAB_RED_ZONE ) {
2007-07-17 15:03:18 +04:00
if ( ! check_bytes_and_report ( s , page , object , " Redzone " ,
2012-06-13 19:24:57 +04:00
endobject , val , s - > inuse - s - > object_size ) )
2007-05-07 01:49:36 +04:00
return 0 ;
} else {
2012-06-13 19:24:57 +04:00
if ( ( s - > flags & SLAB_POISON ) & & s - > object_size < s - > inuse ) {
2008-02-06 04:57:39 +03:00
check_bytes_and_report ( s , page , p , " Alignment padding " ,
2013-07-15 05:05:29 +04:00
endobject , POISON_INUSE ,
s - > inuse - s - > object_size ) ;
2008-02-06 04:57:39 +03:00
}
2007-05-07 01:49:36 +04:00
}
if ( s - > flags & SLAB_POISON ) {
2010-09-29 16:15:01 +04:00
if ( val ! = SLUB_RED_ACTIVE & & ( s - > flags & __OBJECT_POISON ) & &
2007-07-17 15:03:18 +04:00
( ! check_bytes_and_report ( s , page , p , " Poison " , p ,
2012-06-13 19:24:57 +04:00
POISON_FREE , s - > object_size - 1 ) | |
2007-07-17 15:03:18 +04:00
! check_bytes_and_report ( s , page , p , " Poison " ,
2012-06-13 19:24:57 +04:00
p + s - > object_size - 1 , POISON_END , 1 ) ) )
2007-05-07 01:49:36 +04:00
return 0 ;
/*
* check_pad_bytes cleans up on its own .
*/
check_pad_bytes ( s , page , p ) ;
}
2010-09-29 16:15:01 +04:00
if ( ! s - > offset & & val = = SLUB_RED_ACTIVE )
2007-05-07 01:49:36 +04:00
/*
* Object and freepointer overlap . Cannot check
* freepointer while object is allocated .
*/
return 1 ;
/* Check free pointer validity */
if ( ! check_valid_pointer ( s , page , get_freepointer ( s , p ) ) ) {
object_err ( s , page , p , " Freepointer corrupt " ) ;
/*
2008-12-05 06:08:08 +03:00
* No choice but to zap it and thus lose the remainder
2007-05-07 01:49:36 +04:00
* of the free objects in this slab . May cause
2007-05-09 13:32:39 +04:00
* another error because the object count is now wrong .
2007-05-07 01:49:36 +04:00
*/
2008-03-02 00:40:44 +03:00
set_freepointer ( s , p , NULL ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
return 1 ;
}
static int check_slab ( struct kmem_cache * s , struct page * page )
{
2008-04-14 20:11:30 +04:00
int maxobj ;
2007-05-07 01:49:36 +04:00
VM_BUG_ON ( ! irqs_disabled ( ) ) ;
if ( ! PageSlab ( page ) ) {
2007-07-17 15:03:18 +04:00
slab_err ( s , page , " Not a valid slab page " ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
2008-04-14 20:11:30 +04:00
2011-03-10 10:21:48 +03:00
maxobj = order_objects ( compound_order ( page ) , s - > size , s - > reserved ) ;
2008-04-14 20:11:30 +04:00
if ( page - > objects > maxobj ) {
slab_err ( s , page , " objects %u > max %u " ,
mm: slub: fix format mismatches in slab_err() callers
Adding __printf(3, 4) to slab_err exposed following:
mm/slub.c: In function `check_slab':
mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
s->name, page->objects, maxobj);
^
mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
s->name, page->inuse, page->objects);
^
mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]
mm/slub.c: In function `on_freelist':
mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
"should be %d", page->objects, max_objects);
Fix first two warnings by removing redundant s->name.
Fix the last by changing type of max_object from unsigned long to int.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11 02:42:22 +03:00
page - > objects , maxobj ) ;
2008-04-14 20:11:30 +04:00
return 0 ;
}
if ( page - > inuse > page - > objects ) {
2007-07-17 15:03:18 +04:00
slab_err ( s , page , " inuse %u > max %u " ,
2014-12-11 02:42:22 +03:00
page - > inuse , page - > objects ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
/* Slab_pad_check fixes things up after itself */
slab_pad_check ( s , page ) ;
return 1 ;
}
/*
2007-05-09 13:32:39 +04:00
* Determine if a certain object on a page is on the freelist . Must hold the
* slab lock to guarantee that the chains are in a consistent state .
2007-05-07 01:49:36 +04:00
*/
static int on_freelist ( struct kmem_cache * s , struct page * page , void * search )
{
int nr = 0 ;
2011-06-01 21:25:53 +04:00
void * fp ;
2007-05-07 01:49:36 +04:00
void * object = NULL ;
2014-12-11 02:42:22 +03:00
int max_objects ;
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:53 +04:00
fp = page - > freelist ;
2008-04-14 20:11:30 +04:00
while ( fp & & nr < = page - > objects ) {
2007-05-07 01:49:36 +04:00
if ( fp = = search )
return 1 ;
if ( ! check_valid_pointer ( s , page , fp ) ) {
if ( object ) {
object_err ( s , page , object ,
" Freechain corrupt " ) ;
2008-03-02 00:40:44 +03:00
set_freepointer ( s , object , NULL ) ;
2007-05-07 01:49:36 +04:00
} else {
2007-07-17 15:03:18 +04:00
slab_err ( s , page , " Freepointer corrupt " ) ;
2008-03-02 00:40:44 +03:00
page - > freelist = NULL ;
2008-04-14 20:11:30 +04:00
page - > inuse = page - > objects ;
2007-07-17 15:03:18 +04:00
slab_fix ( s , " Freelist cleared " ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
break ;
}
object = fp ;
fp = get_freepointer ( s , object ) ;
nr + + ;
}
2011-03-10 10:21:48 +03:00
max_objects = order_objects ( compound_order ( page ) , s - > size , s - > reserved ) ;
2008-10-22 23:00:38 +04:00
if ( max_objects > MAX_OBJS_PER_PAGE )
max_objects = MAX_OBJS_PER_PAGE ;
2008-04-14 20:11:31 +04:00
if ( page - > objects ! = max_objects ) {
slab_err ( s , page , " Wrong number of objects. Found %d but "
" should be %d " , page - > objects , max_objects ) ;
page - > objects = max_objects ;
slab_fix ( s , " Number of objects adjusted. " ) ;
}
2008-04-14 20:11:30 +04:00
if ( page - > inuse ! = page - > objects - nr ) {
2007-05-07 01:49:47 +04:00
slab_err ( s , page , " Wrong object count. Counter is %d but "
2008-04-14 20:11:30 +04:00
" counted were %d " , page - > inuse , page - > objects - nr ) ;
page - > inuse = page - > objects - nr ;
2007-07-17 15:03:18 +04:00
slab_fix ( s , " Object count adjusted. " ) ;
2007-05-07 01:49:36 +04:00
}
return search = = NULL ;
}
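The walk in on_freelist() is bounded by the object count, so a corrupted, cyclic chain cannot loop forever. A reduced user-space sketch of that bound follows. `struct free_obj` and `list_contains()` are hypothetical, and the pointer validation and repair steps of the original are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free object: the free pointer is stored in its first word. */
struct free_obj {
	struct free_obj *next;
};

/*
 * Return 1 if search is on the list, 0 otherwise. The number of objects
 * stepped over is stored in *counted. At most max_objects + 1 objects are
 * visited, so a cyclic list terminates.
 */
static int list_contains(struct free_obj *head, const struct free_obj *search,
			 unsigned max_objects, unsigned *counted)
{
	struct free_obj *fp = head;
	unsigned nr = 0;

	assert(counted != NULL);
	while (fp && nr <= max_objects) {
		if (fp == search) {
			*counted = nr;
			return 1;
		}
		fp = fp->next;
		nr++;
	}
	*counted = nr;
	return 0;
}
```

A count larger than `max_objects` signals that the chain is longer than the slab can hold, which is the kind of mismatch the original reports as a wrong object count.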
2008-04-30 03:11:12 +04:00
static void trace ( struct kmem_cache * s , struct page * page , void * object ,
int alloc )
2007-05-17 09:11:00 +04:00
{
if ( s - > flags & SLAB_TRACE ) {
2014-06-05 03:06:34 +04:00
pr_info ( " TRACE %s %s 0x%p inuse=%d fp=0x%p \n " ,
2007-05-17 09:11:00 +04:00
s - > name ,
alloc ? " alloc " : " free " ,
object , page - > inuse ,
page - > freelist ) ;
if ( ! alloc )
2013-07-15 05:05:29 +04:00
print_section ( " Object " , ( void * ) object ,
s - > object_size ) ;
2007-05-17 09:11:00 +04:00
dump_stack ( ) ;
}
}
2007-05-07 01:49:42 +04:00
/*
2007-05-09 13:32:39 +04:00
* Tracking of fully allocated slabs for debugging purposes .
2007-05-07 01:49:42 +04:00
*/
2011-06-01 21:25:50 +04:00
static void add_full ( struct kmem_cache * s ,
struct kmem_cache_node * n , struct page * page )
2007-05-07 01:49:42 +04:00
{
2011-06-01 21:25:50 +04:00
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2014-02-11 02:25:39 +04:00
lockdep_assert_held ( & n - > list_lock ) ;
2007-05-07 01:49:42 +04:00
list_add ( & page - > lru , & n - > full ) ;
}
2014-01-10 16:23:49 +04:00
static void remove_full ( struct kmem_cache * s , struct kmem_cache_node * n , struct page * page )
2007-05-07 01:49:42 +04:00
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2014-02-11 02:25:39 +04:00
lockdep_assert_held ( & n - > list_lock ) ;
2007-05-07 01:49:42 +04:00
list_del ( & page - > lru ) ;
}
2008-04-14 19:53:02 +04:00
/* Tracking of the number of slabs for debugging purposes */
static inline unsigned long slabs_node ( struct kmem_cache * s , int node )
{
struct kmem_cache_node * n = get_node ( s , node ) ;
return atomic_long_read ( & n - > nr_slabs ) ;
}
2009-06-11 14:08:48 +04:00
static inline unsigned long node_nr_slabs ( struct kmem_cache_node * n )
{
return atomic_long_read ( & n - > nr_slabs ) ;
}
2008-04-14 20:11:40 +04:00
static inline void inc_slabs_node ( struct kmem_cache * s , int node , int objects )
2008-04-14 19:53:02 +04:00
{
struct kmem_cache_node * n = get_node ( s , node ) ;
/*
* May be called early in order to allocate a slab for the
* kmem_cache_node structure . Solve the chicken - egg
* dilemma by deferring the increment of the count during
* bootstrap ( see early_kmem_cache_node_alloc ) .
*/
2013-01-21 12:01:27 +04:00
if ( likely ( n ) ) {
2008-04-14 19:53:02 +04:00
atomic_long_inc ( & n - > nr_slabs ) ;
2008-04-14 20:11:40 +04:00
atomic_long_add ( objects , & n - > total_objects ) ;
}
2008-04-14 19:53:02 +04:00
}
2008-04-14 20:11:40 +04:00
static inline void dec_slabs_node ( struct kmem_cache * s , int node , int objects )
2008-04-14 19:53:02 +04:00
{
struct kmem_cache_node * n = get_node ( s , node ) ;
atomic_long_dec ( & n - > nr_slabs ) ;
2008-04-14 20:11:40 +04:00
atomic_long_sub ( objects , & n - > total_objects ) ;
2008-04-14 19:53:02 +04:00
}
/* Object debug checks for alloc/free paths */
2007-05-17 09:11:00 +04:00
static void setup_object_debug ( struct kmem_cache * s , struct page * page ,
void * object )
{
if ( ! ( s - > flags & ( SLAB_STORE_USER | SLAB_RED_ZONE | __OBJECT_POISON ) ) )
return ;
2010-09-29 16:15:01 +04:00
init_object ( s , object , SLUB_RED_INACTIVE ) ;
2007-05-17 09:11:00 +04:00
init_tracking ( s , object ) ;
}
2013-07-15 05:05:29 +04:00
static noinline int alloc_debug_processing ( struct kmem_cache * s ,
struct page * page ,
2008-08-19 21:43:25 +04:00
void * object , unsigned long addr )
2007-05-07 01:49:36 +04:00
{
if ( ! check_slab ( s , page ) )
goto bad ;
if ( ! check_valid_pointer ( s , page , object ) ) {
object_err ( s , page , object , " Freelist Pointer check fails " ) ;
2007-05-07 01:49:47 +04:00
goto bad ;
2007-05-07 01:49:36 +04:00
}
2010-09-29 16:15:01 +04:00
if ( ! check_object ( s , page , object , SLUB_RED_INACTIVE ) )
2007-05-07 01:49:36 +04:00
goto bad ;
2007-05-17 09:11:00 +04:00
/* Success: perform special debug activities for allocs */
if ( s - > flags & SLAB_STORE_USER )
set_track ( s , object , TRACK_ALLOC , addr ) ;
trace ( s , page , object , 1 ) ;
2010-09-29 16:15:01 +04:00
init_object ( s , object , SLUB_RED_ACTIVE ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
2007-05-17 09:11:00 +04:00
2007-05-07 01:49:36 +04:00
bad :
if ( PageSlab ( page ) ) {
/*
* If this is a slab page then let's do the best we can
* to avoid issues in the future . Marking all objects
2007-05-09 13:32:39 +04:00
* as used avoids touching the remaining objects .
2007-05-07 01:49:36 +04:00
*/
2007-07-17 15:03:18 +04:00
slab_fix ( s , " Marking all objects used " ) ;
2008-04-14 20:11:30 +04:00
page - > inuse = page - > objects ;
2008-03-02 00:40:44 +03:00
page - > freelist = NULL ;
2007-05-07 01:49:36 +04:00
}
return 0 ;
}
2015-11-21 02:57:46 +03:00
/* Supports checking bulk free of a constructed freelist */
2012-05-30 21:54:46 +04:00
static noinline struct kmem_cache_node * free_debug_processing (
2015-11-21 02:57:46 +03:00
struct kmem_cache * s , struct page * page ,
void * head , void * tail , int bulk_cnt ,
2012-05-30 21:54:46 +04:00
unsigned long addr , unsigned long * flags )
2007-05-07 01:49:36 +04:00
{
2012-05-30 21:54:46 +04:00
struct kmem_cache_node * n = get_node ( s , page_to_nid ( page ) ) ;
2015-11-21 02:57:46 +03:00
void * object = head ;
int cnt = 0 ;
2011-06-01 21:25:54 +04:00
2012-05-30 21:54:46 +04:00
spin_lock_irqsave ( & n - > list_lock , * flags ) ;
2011-06-01 21:25:53 +04:00
slab_lock ( page ) ;
2007-05-07 01:49:36 +04:00
if ( ! check_slab ( s , page ) )
goto fail ;
2015-11-21 02:57:46 +03:00
next_object :
cnt + + ;
2007-05-07 01:49:36 +04:00
if ( ! check_valid_pointer ( s , page , object ) ) {
2007-05-07 01:49:47 +04:00
slab_err ( s , page , " Invalid object pointer 0x%p " , object ) ;
2007-05-07 01:49:36 +04:00
goto fail ;
}
if ( on_freelist ( s , page , object ) ) {
2007-07-17 15:03:18 +04:00
object_err ( s , page , object , " Object already free " ) ;
2007-05-07 01:49:36 +04:00
goto fail ;
}
2010-09-29 16:15:01 +04:00
if ( ! check_object ( s , page , object , SLUB_RED_ACTIVE ) )
2011-06-01 21:25:54 +04:00
goto out ;
2007-05-07 01:49:36 +04:00
slub: Commonize slab_cache field in struct page
Right now, slab and slub have fields in struct page to derive which
cache a page belongs to, but they do it slightly differently.
slab uses a field called slab_cache, that lives in the third double
word. slub, uses a field called "slab", living outside of the
doublewords area.
Ideally, we could use the same field for this. Since slub heavily makes
use of the doubleword region, there isn't really much room to move
slub's slab_cache field around. Since slab does not have such strict
placement restrictions, we can move it outside the doubleword area.
The naming used by slab, "slab_cache", is less confusing, and it is
preferred over slub's generic "slab".
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-22 18:05:36 +04:00
if ( unlikely ( s ! = page - > slab_cache ) ) {
2008-02-06 04:57:39 +03:00
if ( ! PageSlab ( page ) ) {
2007-05-07 01:49:47 +04:00
slab_err ( s , page , " Attempt to free object(0x%p) "
" outside of slab " , object ) ;
2012-10-22 18:05:36 +04:00
} else if ( ! page - > slab_cache ) {
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB <none>: no slab for object 0x%p. \n " ,
object ) ;
2007-05-07 01:49:47 +04:00
dump_stack ( ) ;
2008-01-08 10:20:27 +03:00
} else
2007-07-17 15:03:18 +04:00
object_err ( s , page , object ,
" page slab pointer corrupt. " ) ;
2007-05-07 01:49:36 +04:00
goto fail ;
}
2007-05-17 09:11:00 +04:00
if ( s - > flags & SLAB_STORE_USER )
set_track ( s , object , TRACK_FREE , addr ) ;
trace ( s , page , object , 0 ) ;
2015-11-21 02:57:46 +03:00
/* Freepointer not overwritten by init_object(), SLAB_POISON moved it */
2010-09-29 16:15:01 +04:00
init_object ( s , object , SLUB_RED_INACTIVE ) ;
2015-11-21 02:57:46 +03:00
/* Reached end of constructed freelist yet? */
if ( object ! = tail ) {
object = get_freepointer ( s , object ) ;
goto next_object ;
}
2011-06-01 21:25:54 +04:00
out :
2015-11-21 02:57:46 +03:00
if ( cnt ! = bulk_cnt )
slab_err ( s , page , " Bulk freelist count(%d) invalid(%d) \n " ,
bulk_cnt , cnt ) ;
2011-06-01 21:25:53 +04:00
slab_unlock ( page ) ;
2012-05-30 21:54:46 +04:00
/*
* Keep the node list_lock held to preserve integrity
* until the object is actually freed
*/
return n ;
2007-05-17 09:11:00 +04:00
2007-05-07 01:49:36 +04:00
fail :
2012-05-30 21:54:46 +04:00
slab_unlock ( page ) ;
spin_unlock_irqrestore ( & n - > list_lock , * flags ) ;
2007-07-17 15:03:18 +04:00
slab_fix ( s , " Object at 0x%p not freed " , object ) ;
2012-05-30 21:54:46 +04:00
return NULL ;
2007-05-07 01:49:36 +04:00
}
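The bulk count check in free_debug_processing() compares the number of objects reached between `head` and `tail` with `bulk_cnt`. A minimal userspace sketch of that idea follows. `struct model_object`, its `next` field (standing in for the free pointer), and both function names are hypothetical. Unlike the kernel loop, this sketch also stops at a NULL link or after `limit` steps, which is an added safeguard and not something the source does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object layout: "next" stands in for the free pointer. */
struct model_object {
	struct model_object *next;
};

/*
 * Walk from head to tail (inclusive) and return how many objects were
 * visited. Returns -1 if the chain ends or exceeds limit before tail
 * is reached.
 */
static int model_count_freelist(struct model_object *head,
				struct model_object *tail, int limit)
{
	struct model_object *object = head;
	int cnt = 0;

	assert(limit >= 0);
	while (object != NULL && cnt < limit) {
		cnt++;
		if (object == tail)
			return cnt;
		object = object->next;
	}
	return -1;
}

/* Nonzero when the constructed freelist holds exactly bulk_cnt objects. */
static int model_bulk_count_ok(struct model_object *head,
			       struct model_object *tail, int bulk_cnt)
{
	return model_count_freelist(head, tail, bulk_cnt) == bulk_cnt;
}
```

A caller that claims three objects but passes a two-object or four-object chain is rejected, which corresponds to the "Bulk freelist count invalid" report above.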
2007-05-09 13:32:44 +04:00
static int __init setup_slub_debug ( char * str )
{
2007-07-16 10:38:14 +04:00
slub_debug = DEBUG_DEFAULT_FLAGS ;
if ( * str + + ! = ' = ' | | ! * str )
/*
* No options specified . Switch on full debugging .
*/
goto out ;
if ( * str = = ' , ' )
/*
* No options but restriction on slabs . This means full
* debugging for slabs matching a pattern .
*/
goto check_slabs ;
slub_debug = 0 ;
if ( * str = = ' - ' )
/*
* Switch off all debugging measures .
*/
goto out ;
/*
* Determine which debug features should be switched on
*/
2008-01-08 10:20:27 +03:00
for ( ; * str & & * str ! = ' , ' ; str + + ) {
2007-07-16 10:38:14 +04:00
switch ( tolower ( * str ) ) {
case ' f ' :
slub_debug | = SLAB_DEBUG_FREE ;
break ;
case ' z ' :
slub_debug | = SLAB_RED_ZONE ;
break ;
case ' p ' :
slub_debug | = SLAB_POISON ;
break ;
case ' u ' :
slub_debug | = SLAB_STORE_USER ;
break ;
case ' t ' :
slub_debug | = SLAB_TRACE ;
break ;
2010-02-26 09:36:12 +03:00
case ' a ' :
slub_debug | = SLAB_FAILSLAB ;
break ;
2015-04-15 01:44:25 +03:00
case ' o ' :
/*
* Avoid enabling debugging on a cache if its minimum
* order would increase as a result .
*/
disable_higher_order_debug = 1 ;
break ;
2007-07-16 10:38:14 +04:00
default :
2014-06-05 03:06:34 +04:00
pr_err ( " slub_debug option '%c' unknown. skipped \n " ,
* str ) ;
2007-07-16 10:38:14 +04:00
}
2007-05-09 13:32:44 +04:00
}
2007-07-16 10:38:14 +04:00
check_slabs :
2007-05-09 13:32:44 +04:00
if ( * str = = ' , ' )
slub_debug_slabs = str + 1 ;
2007-07-16 10:38:14 +04:00
out :
2007-05-09 13:32:44 +04:00
return 1 ;
}
__setup ( " slub_debug " , setup_slub_debug ) ;
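The option grammar handled by setup_slub_debug() can be exercised outside the kernel. The sketch below follows the same control flow but is only an illustration: the `MODEL_*` bit values, `struct model_debug_opts`, and `model_parse_slub_debug()` are hypothetical, and the composition of `MODEL_DEFAULT` (free checks, red zoning, poisoning, user tracking) is an assumption about DEBUG_DEFAULT_FLAGS, which is defined outside this section.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical bit values; the kernel SLAB_* constants differ. */
enum {
	MODEL_DEBUG_FREE = 1 << 0,
	MODEL_RED_ZONE   = 1 << 1,
	MODEL_POISON     = 1 << 2,
	MODEL_STORE_USER = 1 << 3,
	MODEL_TRACE      = 1 << 4,
	MODEL_FAILSLAB   = 1 << 5,
	/* Assumed composition of the default debug set. */
	MODEL_DEFAULT    = MODEL_DEBUG_FREE | MODEL_RED_ZONE |
			   MODEL_POISON | MODEL_STORE_USER
};

struct model_debug_opts {
	unsigned int flags;    /* enabled debug features */
	const char *slabs;     /* cache name prefix, or NULL for all */
	int no_higher_order;   /* set by the 'o' option */
};

/* str is the text that follows "slub_debug", e.g. "=FZ,dentry". */
static struct model_debug_opts model_parse_slub_debug(const char *str)
{
	struct model_debug_opts o = { MODEL_DEFAULT, NULL, 0 };

	assert(str != NULL);
	if (*str++ != '=' || !*str)
		return o;               /* no options: full debugging */
	if (*str != ',') {
		o.flags = 0;
		if (*str == '-')
			return o;       /* all debugging switched off */
		for (; *str && *str != ','; str++) {
			switch (tolower((unsigned char)*str)) {
			case 'f': o.flags |= MODEL_DEBUG_FREE; break;
			case 'z': o.flags |= MODEL_RED_ZONE; break;
			case 'p': o.flags |= MODEL_POISON; break;
			case 'u': o.flags |= MODEL_STORE_USER; break;
			case 't': o.flags |= MODEL_TRACE; break;
			case 'a': o.flags |= MODEL_FAILSLAB; break;
			case 'o': o.no_higher_order = 1; break;
			default: break;  /* unknown letters are skipped */
			}
		}
	}
	if (*str == ',')
		o.slabs = str + 1;      /* restrict to matching caches */
	return o;
}
```

Returning a struct instead of writing globals keeps the sketch testable; the kernel function stores its results in `slub_debug`, `slub_debug_slabs` and `disable_higher_order_debug`.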
2014-10-10 02:26:22 +04:00
unsigned long kmem_cache_flags ( unsigned long object_size ,
2007-09-12 02:24:11 +04:00
unsigned long flags , const char * name ,
2008-07-26 06:45:34 +04:00
void ( * ctor ) ( void * ) )
2007-05-09 13:32:44 +04:00
{
/*
2008-02-16 10:45:24 +03:00
* Enable debugging if selected on the kernel commandline .
2007-05-09 13:32:44 +04:00
*/
2013-11-07 20:29:15 +04:00
if ( slub_debug & & ( ! slub_debug_slabs | | ( name & &
! strncmp ( slub_debug_slabs , name , strlen ( slub_debug_slabs ) ) ) ) )
2009-07-28 05:30:35 +04:00
flags | = slub_debug ;
2007-09-12 02:24:11 +04:00
return flags ;
2007-05-09 13:32:44 +04:00
}
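The name test in kmem_cache_flags() is a prefix comparison: the length passed to strncmp() is the length of the pattern, not of the cache name. A minimal sketch, assuming a hypothetical helper `model_debug_applies()` that isolates only that test:

```c
#include <string.h>

/*
 * Nonzero when debugging selected by "pattern" applies to cache "name".
 * A NULL pattern means no restriction was given on the command line.
 */
static int model_debug_applies(const char *pattern, const char *name)
{
	if (pattern == NULL)
		return 1;
	if (name == NULL)
		return 0;
	/* Compare only the first strlen(pattern) characters. */
	return strncmp(pattern, name, strlen(pattern)) == 0;
}
```

Under this rule a pattern of "kmalloc" selects "kmalloc-64" as well as every other cache whose name starts with "kmalloc", while a pattern longer than the cache name never matches.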
2015-11-21 02:57:41 +03:00
# else /* !CONFIG_SLUB_DEBUG */
2007-05-17 09:11:00 +04:00
static inline void setup_object_debug ( struct kmem_cache * s ,
struct page * page , void * object ) { }
2007-05-09 13:32:44 +04:00
2007-05-17 09:11:00 +04:00
static inline int alloc_debug_processing ( struct kmem_cache * s ,
2008-08-19 21:43:25 +04:00
struct page * page , void * object , unsigned long addr ) { return 0 ; }
2007-05-09 13:32:44 +04:00
2012-05-30 21:54:46 +04:00
static inline struct kmem_cache_node * free_debug_processing (
2015-11-21 02:57:46 +03:00
struct kmem_cache * s , struct page * page ,
void * head , void * tail , int bulk_cnt ,
2012-05-30 21:54:46 +04:00
unsigned long addr , unsigned long * flags ) { return NULL ; }
2007-05-09 13:32:44 +04:00
static inline int slab_pad_check ( struct kmem_cache * s , struct page * page )
{ return 1 ; }
static inline int check_object ( struct kmem_cache * s , struct page * page ,
2010-09-29 16:15:01 +04:00
void * object , u8 val ) { return 1 ; }
2011-06-01 21:25:50 +04:00
static inline void add_full ( struct kmem_cache * s , struct kmem_cache_node * n ,
struct page * page ) { }
2014-01-10 16:23:49 +04:00
static inline void remove_full ( struct kmem_cache * s , struct kmem_cache_node * n ,
struct page * page ) { }
2014-10-10 02:26:22 +04:00
unsigned long kmem_cache_flags ( unsigned long object_size ,
2007-09-12 02:24:11 +04:00
unsigned long flags , const char * name ,
2008-07-26 06:45:34 +04:00
void ( * ctor ) ( void * ) )
2007-09-12 02:24:11 +04:00
{
return flags ;
}
2007-05-09 13:32:44 +04:00
# define slub_debug 0
2008-04-14 19:53:02 +04:00
2009-09-15 13:00:26 +04:00
# define disable_higher_order_debug 0
2008-04-14 19:53:02 +04:00
static inline unsigned long slabs_node ( struct kmem_cache * s , int node )
{ return 0 ; }
2009-06-11 14:08:48 +04:00
static inline unsigned long node_nr_slabs ( struct kmem_cache_node * n )
{ return 0 ; }
2008-04-14 20:11:40 +04:00
static inline void inc_slabs_node ( struct kmem_cache * s , int node ,
int objects ) { }
static inline void dec_slabs_node ( struct kmem_cache * s , int node ,
int objects ) { }
2010-08-25 23:07:16 +04:00
2014-08-07 03:04:18 +04:00
# endif /* CONFIG_SLUB_DEBUG */
/*
* Hooks for other subsystems that check memory allocations . In a typical
* production configuration these hooks all should produce no code at all .
*/
2013-10-09 02:58:57 +04:00
static inline void kmalloc_large_node_hook ( void * ptr , size_t size , gfp_t flags )
{
kmemleak_alloc ( ptr , size , 1 , flags ) ;
2015-02-14 01:39:42 +03:00
kasan_kmalloc_large ( ptr , size ) ;
2013-10-09 02:58:57 +04:00
}
static inline void kfree_hook ( const void * x )
{
kmemleak_free ( x ) ;
2015-02-14 01:39:42 +03:00
kasan_kfree_large ( x ) ;
2013-10-09 02:58:57 +04:00
}
memcg: fix possible use-after-free in memcg_kmem_get_cache()
Suppose task @t that belongs to a memory cgroup @memcg is going to
allocate an object from a kmem cache @c. The copy of @c corresponding to
@memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory
cgroup destruction we can access the memory cgroup's copy of the cache
after it was destroyed:
CPU0 CPU1
---- ----
[ current=@t
@mc->memcg_params->nr_pages=0 ]
kmem_cache_alloc(@c):
call memcg_kmem_get_cache(@c);
proceed to allocation from @mc:
alloc a page for @mc:
...
move @t from @memcg
destroy @memcg:
mem_cgroup_css_offline(@memcg):
memcg_unregister_all_caches(@memcg):
kmem_cache_destroy(@mc)
add page to @mc
We could fix this issue by taking a reference to a per-memcg cache, but
that would require adding a per-cpu reference counter to per-memcg caches,
which would look cumbersome.
Instead, let's take a reference to a memory cgroup, which already has a
per-cpu reference counter, in the beginning of kmem_cache_alloc to be
dropped in the end, and move per memcg caches destruction from css offline
to css free. As a side effect, per-memcg caches will be destroyed not one
by one, but all at once when the last page accounted to the memory cgroup
is freed. This doesn't sound as a high price for code readability though.
Note, this patch does add some overhead to the kmem_cache_alloc hot path,
but it is pretty negligible - it's just a function call plus a per cpu
counter decrement, which is comparable to what we already have in
memcg_kmem_get_cache. Besides, it's only relevant if there are memory
cgroups with kmem accounting enabled. I don't think we can find a way to
handle this race w/o it, because alloc_page called from kmem_cache_alloc
may sleep so we can't flush all pending kmallocs w/o reference counting.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 03:56:38 +03:00
static inline struct kmem_cache * slab_pre_alloc_hook ( struct kmem_cache * s ,
gfp_t flags )
2014-08-07 03:04:18 +04:00
{
flags & = gfp_allowed_mask ;
lockdep_trace_alloc ( flags ) ;
2015-11-07 03:28:21 +03:00
might_sleep_if ( gfpflags_allow_blocking ( flags ) ) ;
2010-08-25 23:07:16 +04:00
2014-12-13 03:56:38 +03:00
if ( should_failslab ( s - > object_size , flags , s - > flags ) )
return NULL ;
return memcg_kmem_get_cache ( s , flags ) ;
2014-08-07 03:04:18 +04:00
}
2015-11-21 02:57:52 +03:00
static inline void slab_post_alloc_hook ( struct kmem_cache * s , gfp_t flags ,
size_t size , void * * p )
2013-10-09 02:58:57 +04:00
{
2015-11-21 02:57:52 +03:00
size_t i ;
2014-08-07 03:04:18 +04:00
flags & = gfp_allowed_mask ;
2015-11-21 02:57:52 +03:00
for ( i = 0 ; i < size ; i + + ) {
void * object = p [ i ] ;
kmemcheck_slab_alloc ( s , flags , object , slab_ksize ( s ) ) ;
kmemleak_alloc_recursive ( object , s - > object_size , 1 ,
s - > flags , flags ) ;
kasan_slab_alloc ( s , object ) ;
}
2014-12-13 03:56:38 +03:00
memcg_kmem_put_cache ( s ) ;
2013-10-09 02:58:57 +04:00
}
2010-08-25 23:07:16 +04:00
2013-10-09 02:58:57 +04:00
static inline void slab_free_hook ( struct kmem_cache * s , void * x )
{
kmemleak_free_recursive ( x , s - > flags ) ;
2010-08-25 23:07:16 +04:00
2014-08-07 03:04:18 +04:00
/*
* We may no longer disable interrupts in the fast path , so in order
* to make the debug calls that expect irqs to be disabled we need
* to disable interrupts temporarily .
*/
# if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP)
{
unsigned long flags ;
local_irq_save ( flags ) ;
kmemcheck_slab_free ( s , x , s - > object_size ) ;
debug_check_no_locks_freed ( x , s - > object_size ) ;
local_irq_restore ( flags ) ;
}
# endif
if ( ! ( s - > flags & SLAB_DEBUG_OBJECTS ) )
debug_check_no_obj_freed ( x , s - > object_size ) ;
2015-02-14 01:39:42 +03:00
kasan_slab_free ( s , x ) ;
2014-08-07 03:04:18 +04:00
}
2008-04-14 20:11:40 +04:00
2015-11-21 02:57:46 +03:00
static inline void slab_free_freelist_hook ( struct kmem_cache * s ,
void * head , void * tail )
{
/*
* The compiler cannot detect that this function can be removed if
* slab_free_hook ( ) evaluates to nothing . Thus , catch all relevant
* config debug options here .
*/
# if defined(CONFIG_KMEMCHECK) || \
defined ( CONFIG_LOCKDEP ) | | \
defined ( CONFIG_DEBUG_KMEMLEAK ) | | \
defined ( CONFIG_DEBUG_OBJECTS_FREE ) | | \
defined ( CONFIG_KASAN )
void * object = head ;
void * tail_obj = tail ? : head ;
do {
slab_free_hook ( s , object ) ;
} while ( ( object ! = tail_obj ) & &
( object = get_freepointer ( s , object ) ) ) ;
# endif
}
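The do/while loop in slab_free_freelist_hook() runs the hook once per object, and a NULL `tail` means the list consists of `head` alone. The userspace sketch below illustrates that traversal under stated assumptions: `struct model_object`, its `hooked` marker, and `model_free_freelist_hook()` are hypothetical, and the GNU `tail ? : head` shorthand is spelled out as portable C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object: "next" is the free pointer, "hooked" marks a visit. */
struct model_object {
	struct model_object *next;
	int hooked;
};

/* Visit head..tail inclusive; returns the number of hook invocations. */
static int model_free_freelist_hook(struct model_object *head,
				    struct model_object *tail)
{
	struct model_object *object = head;
	/* Portable spelling of the GNU "tail ?: head" shorthand. */
	struct model_object *tail_obj = tail ? tail : head;
	int calls = 0;

	assert(head != NULL);
	do {
		object->hooked = 1;     /* stands in for slab_free_hook() */
		calls++;
	} while (object != tail_obj && (object = object->next) != NULL);
	return calls;
}
```

Because the body runs before the condition is tested, the single-object case (tail == NULL) still triggers exactly one hook call, and objects linked after `tail` are left untouched.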
2015-09-05 01:45:48 +03:00
static void setup_object ( struct kmem_cache * s , struct page * page ,
void * object )
{
setup_object_debug ( s , page , object ) ;
if ( unlikely ( s - > ctor ) ) {
kasan_unpoison_object_data ( s , object ) ;
s - > ctor ( object ) ;
kasan_poison_object_data ( s , object ) ;
}
}
2007-05-07 01:49:36 +04:00
/*
* Slab allocation and freeing
*/
2014-06-05 03:06:38 +04:00
static inline struct page * alloc_slab_page ( struct kmem_cache * s ,
gfp_t flags , int node , struct kmem_cache_order_objects oo )
2008-04-14 20:11:40 +04:00
{
2014-06-05 03:06:38 +04:00
struct page * page ;
2008-04-14 20:11:40 +04:00
int order = oo_order ( oo ) ;
2008-11-25 18:55:53 +03:00
flags | = __GFP_NOTRACK ;
2010-07-09 23:07:10 +04:00
if ( node = = NUMA_NO_NODE )
2014-06-05 03:06:38 +04:00
page = alloc_pages ( flags , order ) ;
2008-04-14 20:11:40 +04:00
else
mm: rename alloc_pages_exact_node() to __alloc_pages_node()
alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise. In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.
The misleading name has lead to mistakes in the past, see for example
commits 5265047ac301 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f8e ("mm, mempolicy:
migrate_to_node should only migrate to node").
Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.
To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage. Both functions get described in comments.
It has been also considered to really provide a convenience function for
allocations restricted to a node, but the major opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly. The number of users would be small
anyway.
Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead. This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would be previously
exposed.
Both differences will be rectified by the next patch.
To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers. Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Robin Holt <robinmholt@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cliff Whickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09 01:03:50 +03:00
page = __alloc_pages_node ( node , flags , order ) ;
2014-06-05 03:06:38 +04:00
memcg: unify slab and other kmem pages charging
We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
uncharging kmem pages to memcg, but currently they are not used for
charging slab pages (i.e. they are only used for charging pages allocated
with alloc_kmem_pages). The only reason why the slab subsystem uses
special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
to the memcg that the current task belongs to.
To remove this diversity, this patch adds an extra argument to
__memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
not NULL, the function tries to charge to the memcg it points to,
otherwise it charge to the current context. Next, it makes the slab
subsystem use this function to charge slab pages.
Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
__memcg_kmem_charge stores a pointer to the memcg in the page struct, we
don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
Besides, one can now detect which memcg a slab page belongs to by reading
/proc/kpagecgroup.
Note, this patch switches slab to charge-after-alloc design. Since this
design is already used for all other memcg charges, it should not make any
difference.
[hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 05:49:01 +03:00
if ( page & & memcg_charge_slab ( page , flags , order , s ) ) {
__free_pages ( page , order ) ;
page = NULL ;
}
2014-06-05 03:06:38 +04:00
return page ;
2008-04-14 20:11:40 +04:00
}
2007-05-07 01:49:36 +04:00
static struct page * allocate_slab ( struct kmem_cache * s , gfp_t flags , int node )
{
2008-01-08 10:20:27 +03:00
struct page * page ;
2008-04-14 20:11:31 +04:00
struct kmem_cache_order_objects oo = s - > oo ;
2009-06-24 22:59:51 +04:00
gfp_t alloc_gfp ;
2015-09-05 01:45:48 +03:00
void * start , * p ;
int idx , order ;
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:44 +04:00
flags & = gfp_allowed_mask ;
2015-11-07 03:28:21 +03:00
if ( gfpflags_allow_blocking ( flags ) )
2011-06-01 21:25:44 +04:00
local_irq_enable ( ) ;
2008-02-15 01:21:32 +03:00
flags | = s - > allocflags ;
2007-10-16 12:25:52 +04:00
2009-06-24 22:59:51 +04:00
/*
* Let the initial higher - order allocation fail under memory pressure
* so we fall - back to the minimum order allocation .
*/
alloc_gfp = ( flags | __GFP_NOWARN | __GFP_NORETRY ) & ~ __GFP_NOFAIL ;
2015-11-07 03:28:21 +03:00
if ( ( alloc_gfp & __GFP_DIRECT_RECLAIM ) & & oo_order ( oo ) > oo_order ( s - > min ) )
alloc_gfp = ( alloc_gfp | __GFP_NOMEMALLOC ) & ~ __GFP_DIRECT_RECLAIM ;
2009-06-24 22:59:51 +04:00
2014-06-05 03:06:38 +04:00
page = alloc_slab_page ( s , alloc_gfp , node , oo ) ;
2008-04-14 20:11:40 +04:00
if ( unlikely ( ! page ) ) {
oo = s - > min ;
2014-03-12 12:26:20 +04:00
alloc_gfp = flags ;
2008-04-14 20:11:40 +04:00
/*
* Allocation may have failed due to fragmentation .
* Try a lower order alloc if possible .
*/
2014-06-05 03:06:38 +04:00
page = alloc_slab_page ( s , alloc_gfp , node , oo ) ;
2015-09-05 01:45:48 +03:00
if ( unlikely ( ! page ) )
goto out ;
stat ( s , ORDER_FALLBACK ) ;
2008-04-14 20:11:40 +04:00
}
2008-04-04 02:54:48 +04:00
2015-09-05 01:45:48 +03:00
if ( kmemcheck_enabled & &
! ( s - > flags & ( SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS ) ) ) {
2008-11-25 18:55:53 +03:00
int pages = 1 < < oo_order ( oo ) ;
2014-03-12 12:26:20 +04:00
kmemcheck_alloc_shadow ( page , oo_order ( oo ) , alloc_gfp , node ) ;
2008-11-25 18:55:53 +03:00
/*
* Objects from caches that have a constructor don ' t get
* cleared when they ' re allocated , so we need to do it here .
*/
if ( s - > ctor )
kmemcheck_mark_uninitialized_pages ( page , pages ) ;
else
kmemcheck_mark_unallocated_pages ( page , pages ) ;
2008-04-04 02:54:48 +04:00
}
2008-04-14 20:11:31 +04:00
page - > objects = oo_objects ( oo ) ;
2007-05-07 01:49:36 +04:00
2012-12-19 02:22:50 +04:00
order = compound_order ( page ) ;
2012-10-22 18:05:36 +04:00
page - > slab_cache = s ;
2012-05-17 19:47:47 +04:00
__SetPageSlab ( page ) ;
2015-08-22 00:11:51 +03:00
if ( page_is_pfmemalloc ( page ) )
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allows network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propogate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes;
with them applied, it runs to completion. With SLAB, the story is
different as an unpatched kernel also runs to completion. However, the
patched kernel completed the test 45% faster.
MICRO
                                         3.5.0-rc2  3.5.0-rc2
                                           vanilla    swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds)             197.80     173.07
User+Sys Time Running Test (seconds)        206.96     182.03
Total Elapsed Time (seconds)               3240.70    1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
allocated from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to be able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
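The access rule described in the message above can be modelled roughly as follows. This is a simplified userspace sketch, not the kernel implementation: the struct, field and function names are hypothetical, `task_memalloc` stands in for PF_MEMALLOC or TIF_MEMDIE, and treating the GFP-style flag as sufficient on its own is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a slab page; not the kernel's struct page. */
struct slab_page_sketch {
	bool from_reserve;	/* page was allocated from the PFMEMALLOC reserves */
};

/* Hypothetical description of the allocating context. */
struct alloc_ctx_sketch {
	bool gfp_memalloc;	/* assumed GFP-style flag granting reserve access */
	bool task_memalloc;	/* stands in for PF_MEMALLOC or TIF_MEMDIE */
	bool in_interrupt;	/* caller is processing an interrupt */
};

/* May this caller use the reserves at all? */
static bool may_use_reserves_sketch(const struct alloc_ctx_sketch *ctx)
{
	if (ctx->gfp_memalloc)
		return true;
	return ctx->task_memalloc && !ctx->in_interrupt;
}

/*
 * A slab backed by reserve pages is only handed to callers that are
 * entitled to the reserves; any other slab is usable by everyone.
 */
static bool slab_matches_caller_sketch(const struct slab_page_sketch *page,
				       const struct alloc_ctx_sketch *ctx)
{
	if (!page->from_reserve)
		return true;
	return may_use_reserves_sketch(ctx);
}
```

A caller that fails this check would move on to another partial slab or try to allocate a new page, which is why an allocation can fail even though free objects exist.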
2012-08-01 03:43:58 +04:00
SetPageSlabPfmemalloc ( page ) ;
2007-05-07 01:49:36 +04:00
start = page_address ( page ) ;
if ( unlikely ( s - > flags & SLAB_POISON ) )
2012-12-19 02:22:50 +04:00
memset ( start , POISON_INUSE , PAGE_SIZE < < order ) ;
2007-05-07 01:49:36 +04:00
2015-02-14 01:39:42 +03:00
kasan_poison_slab ( page ) ;
2014-08-07 03:04:42 +04:00
for_each_object_idx ( p , idx , s , start , page - > objects ) {
setup_object ( s , page , p ) ;
if ( likely ( idx < page - > objects ) )
set_freepointer ( s , p , p + s - > size ) ;
else
set_freepointer ( s , p , NULL ) ;
2007-05-07 01:49:36 +04:00
}
page - > freelist = start ;
2011-08-10 01:12:24 +04:00
page - > inuse = page - > objects ;
2011-06-01 21:25:46 +04:00
page - > frozen = 1 ;
2015-09-05 01:45:48 +03:00
2007-05-07 01:49:36 +04:00
out :
2015-11-07 03:28:21 +03:00
if ( gfpflags_allow_blocking ( flags ) )
2015-09-05 01:45:48 +03:00
local_irq_disable ( ) ;
if ( ! page )
return NULL ;
mod_zone_page_state ( page_zone ( page ) ,
( s - > flags & SLAB_RECLAIM_ACCOUNT ) ?
NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE ,
1 < < oo_order ( oo ) ) ;
inc_slabs_node ( s , page_to_nid ( page ) , page - > objects ) ;
2007-05-07 01:49:36 +04:00
return page ;
}
2015-09-05 01:45:48 +03:00
static struct page * new_slab ( struct kmem_cache * s , gfp_t flags , int node )
{
if ( unlikely ( flags & GFP_SLAB_BUG_MASK ) ) {
pr_emerg ( " gfp: %u \n " , flags & GFP_SLAB_BUG_MASK ) ;
BUG ( ) ;
}
return allocate_slab ( s ,
flags & ( GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK ) , node ) ;
}
2007-05-07 01:49:36 +04:00
static void __free_slab ( struct kmem_cache * s , struct page * page )
{
2008-04-14 20:11:31 +04:00
int order = compound_order ( page ) ;
int pages = 1 < < order ;
2007-05-07 01:49:36 +04:00
2010-07-09 23:07:14 +04:00
if ( kmem_cache_debug ( s ) ) {
2007-05-07 01:49:36 +04:00
void * p ;
slab_pad_check ( s , page ) ;
2008-04-14 20:11:31 +04:00
for_each_object ( p , s , page_address ( page ) ,
page - > objects )
2010-09-29 16:15:01 +04:00
check_object ( s , page , p , SLUB_RED_INACTIVE ) ;
2007-05-07 01:49:36 +04:00
}
2008-11-25 18:55:53 +03:00
kmemcheck_free_shadow ( page , compound_order ( page ) ) ;
2008-04-04 02:54:48 +04:00
2007-05-07 01:49:36 +04:00
mod_zone_page_state ( page_zone ( page ) ,
( s - > flags & SLAB_RECLAIM_ACCOUNT ) ?
NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE ,
2008-01-08 10:20:27 +03:00
- pages ) ;
2007-05-07 01:49:36 +04:00
2012-08-01 03:43:58 +04:00
__ClearPageSlabPfmemalloc ( page ) ;
2008-04-14 19:52:18 +04:00
__ClearPageSlab ( page ) ;
2012-12-19 02:22:50 +04:00
2013-02-23 04:34:59 +04:00
page_mapcount_reset ( page ) ;
2009-05-05 13:13:44 +04:00
if ( current - > reclaim_state )
current - > reclaim_state - > reclaimed_slab + = pages ;
memcg: unify slab and other kmem pages charging
We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
uncharging kmem pages to memcg, but currently they are not used for
charging slab pages (i.e. they are only used for charging pages allocated
with alloc_kmem_pages). The only reason why the slab subsystem uses
special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
to the memcg that the current task belongs to.
To remove this diversity, this patch adds an extra argument to
__memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
not NULL, the function tries to charge to the memcg it points to,
otherwise it charges to the current context. Next, it makes the slab
subsystem use this function to charge slab pages.
Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
__memcg_kmem_charge stores a pointer to the memcg in the page struct, we
don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
Besides, one can now detect which memcg a slab page belongs to by reading
/proc/kpagecgroup.
Note, this patch switches slab to charge-after-alloc design. Since this
design is already used for all other memcg charges, it should not make any
difference.
[hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
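The charging scheme in the message above can be illustrated with a small userspace model. It follows the description in the message body (an optional memcg argument, with NULL meaning "charge the current context"); the bracketed note suggests the merged code may have used an outer function instead, so treat this only as a sketch. All names here are hypothetical and none of them are kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memcg with a simple page counter. */
struct memcg_sketch {
	long charged_pages;
};

/* Hypothetical page that remembers which memcg it was charged to. */
struct page_sketch {
	struct memcg_sketch *memcg;
};

/* Stand-in for "the memcg that the current task belongs to". */
static struct memcg_sketch current_task_memcg_sketch;

/*
 * Charge 2^order pages. A non-NULL memcg is charged directly (the slab
 * case, where the cache's memcg is known); NULL falls back to the
 * current context. The target is stored in the page.
 */
static void charge_page_sketch(struct page_sketch *page, unsigned int order,
			       struct memcg_sketch *memcg)
{
	struct memcg_sketch *target = memcg ? memcg : &current_task_memcg_sketch;

	target->charged_pages += 1L << order;
	page->memcg = target;
}

/* Uncharge using only the pointer stored in the page. */
static void uncharge_page_sketch(struct page_sketch *page, unsigned int order)
{
	assert(page->memcg != NULL);
	page->memcg->charged_pages -= 1L << order;
	page->memcg = NULL;
}
```

Because the page records its memcg at charge time, the uncharge path needs no slab-specific helper, which matches the message's point about dropping memcg_uncharge_slab.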
2015-11-06 05:49:01 +03:00
__free_kmem_pages ( page , order ) ;
2007-05-07 01:49:36 +04:00
}
2011-03-10 10:22:00 +03:00
# define need_reserve_slab_rcu \
( sizeof ( ( ( struct page * ) NULL ) - > lru ) < sizeof ( struct rcu_head ) )
2007-05-07 01:49:36 +04:00
static void rcu_free_slab ( struct rcu_head * h )
{
struct page * page ;
2011-03-10 10:22:00 +03:00
if ( need_reserve_slab_rcu )
page = virt_to_head_page ( h ) ;
else
page = container_of ( ( struct list_head * ) h , struct page , lru ) ;
slub: Commonize slab_cache field in struct page
Right now, slab and slub have fields in struct page to derive which
cache a page belongs to, but they do it slightly differently.
slab uses a field called slab_cache, that lives in the third double
word. slub uses a field called "slab", living outside of the
doublewords area.
Ideally, we could use the same field for this. Since slub heavily makes
use of the doubleword region, there isn't really much room to move
slub's slab_cache field around. Since slab does not have such strict
placement restrictions, we can move it outside the doubleword area.
The naming used by slab, "slab_cache", is less confusing, and it is
preferred over slub's generic "slab".
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-22 18:05:36 +04:00
__free_slab ( page - > slab_cache , page ) ;
2007-05-07 01:49:36 +04:00
}
static void free_slab ( struct kmem_cache * s , struct page * page )
{
if ( unlikely ( s - > flags & SLAB_DESTROY_BY_RCU ) ) {
2011-03-10 10:22:00 +03:00
struct rcu_head * head ;
if ( need_reserve_slab_rcu ) {
int order = compound_order ( page ) ;
int offset = ( PAGE_SIZE < < order ) - s - > reserved ;
VM_BUG_ON ( s - > reserved ! = sizeof ( * head ) ) ;
head = page_address ( page ) + offset ;
} else {
2015-11-07 03:29:44 +03:00
head = & page - > rcu_head ;
2011-03-10 10:22:00 +03:00
}
2007-05-07 01:49:36 +04:00
call_rcu ( head , rcu_free_slab ) ;
} else
__free_slab ( s , page ) ;
}
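When the RCU head cannot be overlaid on the page's list field, free_slab() above places it in space reserved at the very end of the slab. The offset arithmetic can be sketched in isolation as below; `SKETCH_PAGE_SIZE` and the function name are hypothetical, and a 4096-byte page is assumed purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/*
 * Byte offset, from the start of a slab of 2^order pages, at which a
 * tail reservation of `reserved` bytes begins.
 */
static size_t tail_reserve_offset_sketch(unsigned int order, size_t reserved)
{
	size_t slab_bytes = (size_t)SKETCH_PAGE_SIZE << order;

	assert(reserved <= slab_bytes);
	return slab_bytes - reserved;
}
```

For example, with a 16-byte reservation an order-0 slab would place the head at offset 4080 and an order-1 slab at offset 8176.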
static void discard_slab ( struct kmem_cache * s , struct page * page )
{
2008-04-14 20:11:40 +04:00
dec_slabs_node ( s , page_to_nid ( page ) , page - > objects ) ;
2007-05-07 01:49:36 +04:00
free_slab ( s , page ) ;
}
/*
2011-06-01 21:25:50 +04:00
* Management of partially allocated slabs .
2007-05-07 01:49:36 +04:00
*/
2014-02-11 02:25:46 +04:00
static inline void
__add_partial ( struct kmem_cache_node * n , struct page * page , int tail )
2007-05-07 01:49:36 +04:00
{
2007-05-07 01:49:44 +04:00
n - > nr_partial + + ;
2011-08-24 04:57:52 +04:00
if ( tail = = DEACTIVATE_TO_TAIL )
2008-01-08 10:20:27 +03:00
list_add_tail ( & page - > lru , & n - > partial ) ;
else
list_add ( & page - > lru , & n - > partial ) ;
2007-05-07 01:49:36 +04:00
}
2014-02-11 02:25:46 +04:00
static inline void add_partial ( struct kmem_cache_node * n ,
struct page * page , int tail )
2010-09-28 17:10:28 +04:00
{
2014-01-10 16:23:49 +04:00
lockdep_assert_held ( & n - > list_lock ) ;
2014-02-11 02:25:46 +04:00
__add_partial ( n , page , tail ) ;
}
2014-01-10 16:23:49 +04:00
2014-02-11 02:25:46 +04:00
static inline void
__remove_partial ( struct kmem_cache_node * n , struct page * page )
{
2010-09-28 17:10:28 +04:00
list_del ( & page - > lru ) ;
n - > nr_partial - - ;
}
2014-02-11 02:25:46 +04:00
static inline void remove_partial ( struct kmem_cache_node * n ,
struct page * page )
{
lockdep_assert_held ( & n - > list_lock ) ;
__remove_partial ( n , page ) ;
}
2007-05-07 01:49:36 +04:00
/*
2012-05-09 19:09:53 +04:00
* Remove slab from the partial list , freeze it and
* return the pointer to the freelist .
2007-05-07 01:49:36 +04:00
*
2011-08-10 01:12:26 +04:00
* Returns a list of objects or NULL if it fails .
2007-05-07 01:49:36 +04:00
*/
2011-08-10 01:12:26 +04:00
static inline void * acquire_slab ( struct kmem_cache * s ,
2011-08-10 01:12:25 +04:00
struct kmem_cache_node * n , struct page * page ,
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating the number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse"
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
    acquire_slab(s, n, page, object == NULL);
    if (!object) {
        c->page = page;
        stat(s, ALLOC_FROM_PARTIAL);
        object = t;
        available = page->objects - page->inuse;
        !!! available is always 0 !!!
        ...
Therefore, "available > s->cpu_partial / 2" is always false and
we always go to the second iteration.
This patch corrects the problem.
After that, we no longer need the return value of put_cpu_partial(),
so remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
int mode , int * objects )
2007-05-07 01:49:36 +04:00
{
2011-06-01 21:25:52 +04:00
void * freelist ;
unsigned long counters ;
struct page new ;
2014-01-10 16:23:49 +04:00
lockdep_assert_held ( & n - > list_lock ) ;
2011-06-01 21:25:52 +04:00
/*
* Zap the freelist and set the frozen bit .
* The old freelist is the list of objects for the
* per cpu allocation list .
*/
2012-05-09 19:09:53 +04:00
freelist = page - > freelist ;
counters = page - > counters ;
new . counters = counters ;
2013-01-21 12:01:25 +04:00
* objects = new . objects - new . inuse ;
2012-06-04 11:14:58 +04:00
if ( mode ) {
2012-05-09 19:09:53 +04:00
new . inuse = page - > objects ;
2012-06-04 11:14:58 +04:00
new . freelist = NULL ;
} else {
new . freelist = freelist ;
}
2011-06-01 21:25:52 +04:00
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( new . frozen ) ;
2012-05-09 19:09:53 +04:00
new . frozen = 1 ;
2011-06-01 21:25:52 +04:00
2012-05-09 19:09:53 +04:00
if ( ! __cmpxchg_double_slab ( s , page ,
2011-06-01 21:25:52 +04:00
freelist , counters ,
2012-05-16 19:13:02 +04:00
new . freelist , new . counters ,
2012-05-09 19:09:53 +04:00
" acquire_slab " ) )
return NULL ;
2011-06-01 21:25:52 +04:00
remove_partial ( n , page ) ;
2012-05-09 19:09:53 +04:00
WARN_ON ( ! freelist ) ;
2011-08-10 01:12:27 +04:00
return freelist ;
2007-05-07 01:49:36 +04:00
}
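The ordering that acquire_slab() relies on, reading the free-object count before the in-use counter is overwritten, can be shown with a small standalone model. The struct and function names are hypothetical and the model ignores the atomic compare-and-exchange that the real code uses.

```c
#include <assert.h>

/* Hypothetical, non-atomic model of a slab's counters. */
struct counters_sketch {
	unsigned int objects;	/* total objects in the slab */
	unsigned int inuse;	/* objects currently handed out */
	unsigned int frozen;	/* slab is owned by one CPU */
};

/*
 * Freeze the slab and report how many objects were free at that point.
 * The count is taken first: once take_all marks every object as in
 * use, "objects - inuse" would always be 0.
 */
static unsigned int freeze_and_count_sketch(struct counters_sketch *c,
					    int take_all)
{
	unsigned int free_objects = c->objects - c->inuse;

	assert(!c->frozen);
	if (take_all)
		c->inuse = c->objects;
	c->frozen = 1;
	return free_objects;
}
```

Computing the difference after the call instead would reproduce the bug the fix addresses: the caller would see zero available objects for every slab taken with mode set.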
2013-01-21 12:01:25 +04:00
static void put_cpu_partial ( struct kmem_cache * s , struct page * page , int drain ) ;
2012-09-18 01:09:09 +04:00
static inline bool pfmemalloc_match ( struct page * page , gfp_t gfpflags ) ;
2011-08-10 01:12:27 +04:00
2007-05-07 01:49:36 +04:00
/*
2007-05-09 13:32:39 +04:00
* Try to allocate a partial slab from a specific node .
2007-05-07 01:49:36 +04:00
*/
2012-09-18 01:09:09 +04:00
static void * get_partial_node ( struct kmem_cache * s , struct kmem_cache_node * n ,
struct kmem_cache_cpu * c , gfp_t flags )
2007-05-07 01:49:36 +04:00
{
2011-08-10 01:12:27 +04:00
struct page * page , * page2 ;
void * object = NULL ;
2013-01-21 12:01:25 +04:00
int available = 0 ;
int objects ;
2007-05-07 01:49:36 +04:00
/*
* Racy check . If we mistakenly see no partial slabs then we
* just allocate an empty slab . If we mistakenly try to get a
2007-05-09 13:32:39 +04:00
* partial slab and there is none available then get_partials ( )
* will return NULL .
2007-05-07 01:49:36 +04:00
*/
if ( ! n | | ! n - > nr_partial )
return NULL ;
spin_lock ( & n - > list_lock ) ;
2011-08-10 01:12:27 +04:00
list_for_each_entry_safe ( page , page2 , & n - > partial , lru ) {
2012-09-18 01:09:09 +04:00
void * t ;
2011-08-10 01:12:27 +04:00
2012-09-18 01:09:09 +04:00
if ( ! pfmemalloc_match ( page , flags ) )
continue ;
2013-01-21 12:01:25 +04:00
t = acquire_slab ( s , n , page , object = = NULL , & objects ) ;
2011-08-10 01:12:27 +04:00
if ( ! t )
break ;
2013-01-21 12:01:25 +04:00
available + = objects ;
2011-09-07 06:26:36 +04:00
if ( ! object ) {
2011-08-10 01:12:27 +04:00
c - > page = page ;
stat ( s , ALLOC_FROM_PARTIAL ) ;
object = t ;
} else {
2013-01-21 12:01:25 +04:00
put_cpu_partial ( s , page , 0 ) ;
2012-02-03 19:34:56 +04:00
stat ( s , CPU_PARTIAL_NODE ) ;
2011-08-10 01:12:27 +04:00
}
2013-06-19 09:05:52 +04:00
if ( ! kmem_cache_has_cpu_partial ( s )
| | available > s - > cpu_partial / 2 )
2011-08-10 01:12:27 +04:00
break ;
2011-08-10 01:12:26 +04:00
}
2007-05-07 01:49:36 +04:00
spin_unlock ( & n - > list_lock ) ;
2011-08-10 01:12:26 +04:00
return object ;
2007-05-07 01:49:36 +04:00
}
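The stopping rule in the loop of get_partial_node() above can be modelled on a plain array of per-slab free counts. This is only a sketch: the function name and parameters are hypothetical, and locking, pfmemalloc matching and acquisition failures are left out.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return how many slabs would be taken from the partial list before the
 * loop stops. free_per_slab[i] is the number of free objects reported
 * for slab i; cpu_partial is the per-CPU partial limit.
 */
static size_t slabs_taken_sketch(const unsigned int *free_per_slab, size_t n,
				 unsigned int cpu_partial, int has_cpu_partial)
{
	unsigned int available = 0;
	size_t taken = 0;

	for (size_t i = 0; i < n; i++) {
		available += free_per_slab[i];
		taken++;
		/* Stop once enough objects are cached for this CPU. */
		if (!has_cpu_partial || available > cpu_partial / 2)
			break;
	}
	return taken;
}
```

With free counts {3, 4, 10, 2} and a limit of 30, the running total passes 15 at the third slab, so three slabs are taken. If every count were reported as 0, the threshold would never be reached and the loop would walk the whole list.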
/*
2007-05-09 13:32:39 +04:00
* Get a page from somewhere . Search in increasing NUMA distances .
2007-05-07 01:49:36 +04:00
*/
2012-01-27 12:12:23 +04:00
static void * get_any_partial ( struct kmem_cache * s , gfp_t flags ,
2011-08-10 01:12:25 +04:00
struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
# ifdef CONFIG_NUMA
struct zonelist * zonelist ;
2008-04-28 13:12:17 +04:00
struct zoneref * z ;
2008-04-28 13:12:16 +04:00
struct zone * zone ;
enum zone_type high_zoneidx = gfp_zone ( flags ) ;
2011-08-10 01:12:26 +04:00
void * object ;
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating a large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
                                  3.3.0-rc3              3.3.0-rc3
                                rc3-vanilla         nobarrier-v2r1
Clients 1 UserTime           0.07 (  0.00%)         0.08 (-14.19%)
Clients 2 UserTime           0.07 (  0.00%)         0.07 (  2.72%)
Clients 4 UserTime           0.08 (  0.00%)         0.07 (  3.29%)
Clients 1 SysTime            0.70 (  0.00%)         0.65 (  6.65%)
Clients 2 SysTime            0.85 (  0.00%)         0.82 (  3.65%)
Clients 4 SysTime            1.41 (  0.00%)         1.41 (  0.32%)
Clients 1 WallTime           0.77 (  0.00%)         0.74 (  4.19%)
Clients 2 WallTime           0.47 (  0.00%)         0.45 (  3.73%)
Clients 4 WallTime           0.38 (  0.00%)         0.37 (  1.58%)
Clients 1 Flt/sec/cpu   497620.28 (  0.00%)    520294.53 (  4.56%)
Clients 2 Flt/sec/cpu   414639.05 (  0.00%)    429882.01 (  3.68%)
Clients 4 Flt/sec/cpu   257959.16 (  0.00%)    258761.48 (  0.31%)
Clients 1 Flt/sec       495161.39 (  0.00%)    517292.87 (  4.47%)
Clients 2 Flt/sec       820325.95 (  0.00%)    850289.77 (  3.65%)
Clients 4 Flt/sec      1020068.93 (  0.00%)   1022674.06 (  0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds)             135.68     132.17
User+Sys Time Running Test (seconds)         164.2     160.13
Total Elapsed Time (seconds)                123.46     120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 03:34:11 +04:00
	unsigned int cpuset_mems_cookie;
2007-05-07 01:49:36 +04:00
	/*
2007-05-09 13:32:39 +04:00
	 * The defrag ratio allows a configuration of the tradeoffs between
	 * inter node defragmentation and node local allocations. A lower
	 * defrag_ratio increases the tendency to do local allocations
	 * instead of attempting to obtain partial slabs from other nodes.
2007-05-07 01:49:36 +04:00
	 *
2007-05-09 13:32:39 +04:00
	 * If the defrag_ratio is set to 0 then kmalloc() always
	 * returns node local objects. If the ratio is higher then kmalloc()
	 * may return off node objects because partial slabs are obtained
	 * from other nodes and filled up.
2007-05-07 01:49:36 +04:00
	 *
2008-02-16 10:45:26 +03:00
	 * If /sys/kernel/slab/xx/defrag_ratio is set to 100 (which makes
2007-05-09 13:32:39 +04:00
	 * defrag_ratio = 1000) then every (well almost) allocation will
	 * first attempt to defrag slab caches on other nodes. This means
	 * scanning over all nodes to look for partial slabs which may be
	 * expensive if we do it every time we are trying to find a slab
	 * with available objects.
2007-05-07 01:49:36 +04:00
	 */
2008-01-08 10:20:26 +03:00
	if (!s->remote_node_defrag_ratio ||
			get_cycles() % 1024 > s->remote_node_defrag_ratio)
2007-05-07 01:49:36 +04:00
		return NULL;
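The gate above can be read as a cheap probabilistic filter: a remote-node search is attempted only when a pseudo-random value in the range 0..1023 does not exceed the ratio. The following is a minimal userspace sketch of that test, not the kernel code itself. The helper names `should_search_remote_nodes` and `passing_remainders` are hypothetical, and the `sample` parameter stands in for the value the kernel reads from `get_cycles()`.

```c
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the gate in the source:
 * `sample` plays the role of get_cycles(), `ratio` the role of
 * s->remote_node_defrag_ratio.
 */
static bool should_search_remote_nodes(unsigned int ratio, unsigned long sample)
{
	if (!ratio)
		return false;		/* ratio 0: always stay node local */
	return (sample % 1024) <= ratio;
}

/* Hypothetical helper: how many of the 1024 possible remainders pass. */
static unsigned int passing_remainders(unsigned int ratio)
{
	if (!ratio)
		return 0;
	return (ratio >= 1023 ? 1023 : ratio) + 1;
}
```

With a ratio of 1000, 1001 of the 1024 possible remainders pass, which matches the "well almost" wording in the comment.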
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating a large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
                                  3.3.0-rc3             3.3.0-rc3
                                rc3-vanilla        nobarrier-v2r1
Clients 1 UserTime           0.07 (  0.00%)        0.08 (-14.19%)
Clients 2 UserTime           0.07 (  0.00%)        0.07 (  2.72%)
Clients 4 UserTime           0.08 (  0.00%)        0.07 (  3.29%)
Clients 1 SysTime            0.70 (  0.00%)        0.65 (  6.65%)
Clients 2 SysTime            0.85 (  0.00%)        0.82 (  3.65%)
Clients 4 SysTime            1.41 (  0.00%)        1.41 (  0.32%)
Clients 1 WallTime           0.77 (  0.00%)        0.74 (  4.19%)
Clients 2 WallTime           0.47 (  0.00%)        0.45 (  3.73%)
Clients 4 WallTime           0.38 (  0.00%)        0.37 (  1.58%)
Clients 1 Flt/sec/cpu   497620.28 (  0.00%)   520294.53 (  4.56%)
Clients 2 Flt/sec/cpu   414639.05 (  0.00%)   429882.01 (  3.68%)
Clients 4 Flt/sec/cpu   257959.16 (  0.00%)   258761.48 (  0.31%)
Clients 1 Flt/sec       495161.39 (  0.00%)   517292.87 (  4.47%)
Clients 2 Flt/sec       820325.95 (  0.00%)   850289.77 (  3.65%)
Clients 4 Flt/sec      1020068.93 (  0.00%)  1022674.06 (  0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds)         135.68    132.17
User+Sys Time Running Test (seconds)     164.2    160.13
Total Elapsed Time (seconds)            123.46    120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 03:34:11 +04:00
	do {
2014-04-04 01:47:24 +04:00
		cpuset_mems_cookie = read_mems_allowed_begin();
2014-04-08 02:37:29 +04:00
		zonelist = node_zonelist(mempolicy_slab_node(), flags);
2012-03-22 03:34:11 +04:00
		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
			struct kmem_cache_node *n;
			n = get_node(s, zone_to_nid(zone));
2014-12-13 03:58:28 +03:00
			if (n && cpuset_zone_allowed(zone, flags) &&
2012-03-22 03:34:11 +04:00
					n->nr_partial > s->min_partial) {
2012-09-18 01:09:09 +04:00
				object = get_partial_node(s, n, c, flags);
2012-03-22 03:34:11 +04:00
				if (object) {
					/*
2014-04-04 01:47:24 +04:00
					 * Don't check read_mems_allowed_retry()
					 * here - if mems_allowed was updated in
					 * parallel, that was a harmless race
					 * between allocation and the cpuset
					 * update
2012-03-22 03:34:11 +04:00
					 */
					return object;
				}
2010-05-25 01:32:08 +04:00
			}
2007-05-07 01:49:36 +04:00
		}
2014-04-04 01:47:24 +04:00
	} while (read_mems_allowed_retry(cpuset_mems_cookie));
2007-05-07 01:49:36 +04:00
#endif
	return NULL;
}
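The `read_mems_allowed_begin()` / `read_mems_allowed_retry()` loop above follows the sequence-counter pattern described in the commit message: readers take a cookie, do their work, and repeat only if they failed and the counter moved in the meantime. Below is a minimal userspace sketch of that pattern using C11 atomics. It assumes a single writer and uses sequentially consistent atomics for simplicity, whereas the kernel uses cheaper read barriers. All names (`struct seq_mask`, `seq_read_begin`, `seq_read_retry`, `seq_write`, `pick_node`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the cpuset nodemask plus its sequence counter. */
struct seq_mask {
	atomic_uint seq;	/* even: stable, odd: update in progress */
	atomic_ulong nodemask;	/* protected data, one bit per node */
};

/* Reader side: wait out an in-flight update, then return the cookie. */
static unsigned int seq_read_begin(struct seq_mask *m)
{
	unsigned int s;

	while ((s = atomic_load(&m->seq)) & 1)
		;		/* a writer is active, spin briefly */
	return s;
}

/* Reader side: true if the mask may have changed since the cookie was taken. */
static bool seq_read_retry(struct seq_mask *m, unsigned int cookie)
{
	return atomic_load(&m->seq) != cookie;
}

/* Writer side (single writer assumed): bump, update, bump again. */
static void seq_write(struct seq_mask *m, unsigned long new_mask)
{
	atomic_fetch_add(&m->seq, 1);
	atomic_store(&m->nodemask, new_mask);
	atomic_fetch_add(&m->seq, 1);
}

static int lowest_set_bit(unsigned long mask)
{
	int bit = 0;

	assert(mask != 0);
	while (!(mask & 1UL)) {
		mask >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Return the lowest allowed node, or -1. As in the loop above, a success
 * returns immediately without checking the cookie; only a failure is
 * retried, because it might be a false failure caused by a racing update.
 */
static int pick_node(struct seq_mask *m)
{
	unsigned int cookie;
	unsigned long mask;

	do {
		cookie = seq_read_begin(m);
		mask = atomic_load(&m->nodemask);
		if (mask)
			return lowest_set_bit(mask);
	} while (seq_read_retry(m, cookie));
	return -1;
}
```

Skipping the retry check on success is the design choice the in-code comment calls a harmless race: an object that was obtained under the old mask is still a valid object.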
/*
 * Get a partial page, lock it and return it.
 */
2011-08-10 01:12:26 +04:00
static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
2011-08-10 01:12:25 +04:00
		struct kmem_cache_cpu *c)
2007-05-07 01:49:36 +04:00
{
2011-08-10 01:12:26 +04:00
	void *object;
2014-10-10 02:26:15 +04:00
	int searchnode = node;
	if (node == NUMA_NO_NODE)
		searchnode = numa_mem_id();
	else if (!node_present_pages(node))
		searchnode = node_to_mem_node(node);
2007-05-07 01:49:36 +04:00
2012-09-18 01:09:09 +04:00
	object = get_partial_node(s, get_node(s, searchnode), c, flags);
2011-08-10 01:12:26 +04:00
	if (object || node != NUMA_NO_NODE)
		return object;
2007-05-07 01:49:36 +04:00
2011-08-10 01:12:25 +04:00
	return get_any_partial(s, flags, c);
2007-05-07 01:49:36 +04:00
}
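The node selection at the top of `get_partial()` can be sketched in isolation. This is an illustrative userspace model only: `NO_NODE`, `struct node_info` and `pick_search_node` are hypothetical names, and the table of nodes stands in for what the kernel obtains from `numa_mem_id()`, `node_present_pages()` and `node_to_mem_node()`.

```c
#include <assert.h>

#define NO_NODE (-1)	/* hypothetical stand-in for NUMA_NO_NODE */

/* Hypothetical per-node description used only by this sketch. */
struct node_info {
	unsigned long present_pages;	/* 0 means a memoryless node */
	int nearest_mem_node;		/* fallback node that has memory */
};

/*
 * Choose the node whose partial list is searched first:
 * no preference -> the local memory node,
 * memoryless node -> its nearest node with memory,
 * otherwise -> the requested node itself.
 */
static int pick_search_node(int node, int local_mem_node,
			    const struct node_info *nodes)
{
	assert(node == NO_NODE || node >= 0);

	if (node == NO_NODE)
		return local_mem_node;
	if (!nodes[node].present_pages)
		return nodes[node].nearest_mem_node;
	return node;
}
```

Note that in the source the wider fallback to `get_any_partial()` happens only when the caller passed `NUMA_NO_NODE`; an explicit node request that finds nothing returns NULL.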
2011-02-25 20:38:54 +03:00
#ifdef CONFIG_PREEMPT
/*
 * Calculate the next globally unique transaction for disambiguation
 * during cmpxchg. The transactions start with the cpu number and are then
 * incremented by CONFIG_NR_CPUS.
 */
#define TID_STEP  roundup_pow_of_two(CONFIG_NR_CPUS)
#else
/*
 * No preemption supported therefore also no need to check for
 * different cpus.
 */
#define TID_STEP 1
#endif
static inline unsigned long next_tid(unsigned long tid)
{
	return tid + TID_STEP;
}
static inline unsigned int tid_to_cpu(unsigned long tid)
{
	return tid % TID_STEP;
}
static inline unsigned long tid_to_event(unsigned long tid)
{
	return tid / TID_STEP;
}
static inline unsigned int init_tid(int cpu)
{
	return cpu;
}
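The transaction id helpers above pack two values into one word: the low part identifies the cpu and the high part counts events on that cpu. The following userspace sketch walks through the arithmetic, assuming a hypothetical machine with 6 cpus so that the step rounds up to 8. The constant is hard-coded here in place of `roundup_pow_of_two(CONFIG_NR_CPUS)`, and the `assert` in `init_tid` is an addition for the sketch.

```c
#include <assert.h>

/* Assumption for this sketch: 6 cpus, so the step rounds up to 8. */
#define TID_STEP 8UL

static unsigned long next_tid(unsigned long tid)
{
	return tid + TID_STEP;	/* advance the event count, keep the cpu */
}

static unsigned int tid_to_cpu(unsigned long tid)
{
	return tid % TID_STEP;	/* low part: cpu number */
}

static unsigned long tid_to_event(unsigned long tid)
{
	return tid / TID_STEP;	/* high part: event count */
}

static unsigned int init_tid(int cpu)
{
	assert(cpu >= 0 && (unsigned long)cpu < TID_STEP);
	return cpu;		/* event count starts at zero */
}
```

For cpu 3 the sequence of ids is 3, 11, 19 and so on. Because every cpu starts from a different remainder and advances by the same step, two cpus never produce the same id, which is what lets a cmpxchg detect that a task migrated between reading the id and committing.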
static inline void note_cmpxchg_failure(const char *n,
		const struct kmem_cache *s, unsigned long tid)
{
#ifdef SLUB_DEBUG_CMPXCHG
	unsigned long actual_tid = __this_cpu_read(s->cpu_slab->tid);
2014-06-05 03:06:34 +04:00
	pr_info("%s %s: cmpxchg redo ", n, s->name);
2011-02-25 20:38:54 +03:00
#ifdef CONFIG_PREEMPT
	if (tid_to_cpu(tid) != tid_to_cpu(actual_tid))
2014-06-05 03:06:34 +04:00
		pr_warn("due to cpu change %d -> %d\n",
2011-02-25 20:38:54 +03:00
			tid_to_cpu(tid), tid_to_cpu(actual_tid));
	else
#endif
	if (tid_to_event(tid) != tid_to_event(actual_tid))
2014-06-05 03:06:34 +04:00
		pr_warn("due to cpu running other code. Event %ld->%ld\n",
2011-02-25 20:38:54 +03:00
			tid_to_event(tid), tid_to_event(actual_tid));
	else
2014-06-05 03:06:34 +04:00
		pr_warn("for unknown reason: actual=%lx was=%lx target=%lx\n",
2011-02-25 20:38:54 +03:00
			actual_tid, tid, next_tid(tid));
#endif
2011-03-22 21:35:00 +03:00
	stat(s, CMPXCHG_DOUBLE_CPU_FAIL);
2011-02-25 20:38:54 +03:00
}
2012-09-28 12:34:05 +04:00
static void init_kmem_cache_cpus(struct kmem_cache *s)
2011-02-25 20:38:54 +03:00
{
	int cpu;
	for_each_possible_cpu(cpu)
		per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
}
2011-06-01 21:25:52 +04:00
2007-05-07 01:49:36 +04:00
/*
 * Remove the cpu slab
 */
2013-07-15 05:05:29 +04:00
static void deactivate_slab(struct kmem_cache *s, struct page *page,
				void *freelist)
2007-05-07 01:49:36 +04:00
{
2011-06-01 21:25:52 +04:00
	enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
	int lock = 0;
	enum slab_modes l = M_NONE, m = M_NONE;
	void *nextfree;
2011-08-24 04:57:52 +04:00
	int tail = DEACTIVATE_TO_HEAD;
2011-06-01 21:25:52 +04:00
	struct page new;
	struct page old;
	if (page->freelist) {
2009-12-19 01:26:23 +03:00
		stat(s, DEACTIVATE_REMOTE_FREES);
2011-08-24 04:57:52 +04:00
		tail = DEACTIVATE_TO_TAIL;
2011-06-01 21:25:52 +04:00
	}
2007-05-10 14:15:16 +04:00
	/*
2011-06-01 21:25:52 +04:00
	 * Stage one: Free all available per cpu objects back
	 * to the page freelist while it is still frozen. Leave the
	 * last one.
	 *
	 * There is no need to take the list->lock because the page
	 * is still frozen.
	 */
	while (freelist && (nextfree = get_freepointer(s, freelist))) {
		void *prior;
		unsigned long counters;
		do {
			prior = page->freelist;
			counters = page->counters;
			set_freepointer(s, freelist, prior);
			new.counters = counters;
			new.inuse--;
2014-01-30 02:05:50 +04:00
			VM_BUG_ON(!new.frozen);
2011-06-01 21:25:52 +04:00
2011-07-14 21:49:12 +04:00
		} while (!__cmpxchg_double_slab(s, page,
2011-06-01 21:25:52 +04:00
			prior, counters,
			freelist, new.counters,
			"drain percpu freelist"));
		freelist = nextfree;
	}
2007-05-10 14:15:16 +04:00
	/*
2011-06-01 21:25:52 +04:00
	 * Stage two: Ensure that the page is unfrozen while the
	 * list presence reflects the actual number of objects
	 * during unfreeze.
	 *
	 * We setup the list membership and then perform a cmpxchg
	 * with the count. If there is a mismatch then the page
	 * is not unfrozen but the page is on the wrong list.
	 *
	 * Then we restart the process which may have to remove
	 * the page from the list that we just put it on again
	 * because the number of objects in the slab may have
	 * changed.
2007-05-10 14:15:16 +04:00
	 */
2011-06-01 21:25:52 +04:00
redo:
2007-05-10 14:15:16 +04:00
2011-06-01 21:25:52 +04:00
	old.freelist = page->freelist;
	old.counters = page->counters;
2014-01-30 02:05:50 +04:00
	VM_BUG_ON(!old.frozen);
2008-01-08 10:20:27 +03:00
2011-06-01 21:25:52 +04:00
	/* Determine target state of the slab */
	new.counters = old.counters;
	if (freelist) {
		new.inuse--;
		set_freepointer(s, freelist, old.freelist);
		new.freelist = freelist;
	} else
		new.freelist = old.freelist;
	new.frozen = 0;
2014-07-03 02:22:35 +04:00
	if (!new.inuse && n->nr_partial >= s->min_partial)
2011-06-01 21:25:52 +04:00
		m = M_FREE;
	else if (new.freelist) {
		m = M_PARTIAL;
		if (!lock) {
			lock = 1;
			/*
			 * Taking the spinlock removes the possibility
			 * that acquire_slab() will see a slab page that
			 * is frozen
			 */
			spin_lock(&n->list_lock);
		}
	} else {
		m = M_FULL;
		if (kmem_cache_debug(s) && !lock) {
			lock = 1;
			/*
			 * This also ensures that the scanning of full
			 * slabs from diagnostic functions will not see
			 * any frozen slabs.
			 */
			spin_lock(&n->list_lock);
		}
	}
	if (l != m) {
		if (l == M_PARTIAL)
			remove_partial(n, page);
		else if (l == M_FULL)
2007-05-10 14:15:16 +04:00
2014-01-10 16:23:49 +04:00
			remove_full(s, n, page);
2011-06-01 21:25:52 +04:00
		if (m == M_PARTIAL) {
			add_partial(n, page, tail);
2011-08-24 04:57:52 +04:00
			stat(s, tail);
2011-06-01 21:25:52 +04:00
		} else if (m == M_FULL) {
2007-05-10 14:15:16 +04:00
2011-06-01 21:25:52 +04:00
			stat(s, DEACTIVATE_FULL);
			add_full(s, n, page);
		}
	}
	l = m;
2011-07-14 21:49:12 +04:00
	if (!__cmpxchg_double_slab(s, page,
2011-06-01 21:25:52 +04:00
				old.freelist, old.counters,
				new.freelist, new.counters,
				"unfreezing slab"))
		goto redo;
	if (lock)
		spin_unlock(&n->list_lock);
	if (m == M_FREE) {
		stat(s, DEACTIVATE_EMPTY);
		discard_slab(s, page);
		stat(s, FREE_SLAB);
2007-05-10 14:15:16 +04:00
	}
2007-05-07 01:49:36 +04:00
}
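Stage one of `deactivate_slab()` is a compare-and-swap retry loop that pushes each per cpu free object onto the page freelist and leaves the last object for stage two. The sketch below models only that loop in userspace with C11 atomics. It is a simplification under stated assumptions: it swaps a single pointer, whereas the kernel's `__cmpxchg_double_slab()` swaps the freelist together with the counters word (inuse, frozen), and the names `struct object`, `struct slab_sketch` and `drain_all_but_last` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* In this sketch the free pointer lives at the start of each free object. */
struct object {
	struct object *next;
};

/* Hypothetical stand-in for the page: only the shared freelist head. */
struct slab_sketch {
	_Atomic(struct object *) freelist;
};

/*
 * Push every object of the private list except the last onto the shared
 * freelist, retrying each push until the compare-and-swap succeeds.
 * Returns the last object (or NULL if the private list was empty).
 */
static struct object *drain_all_but_last(struct slab_sketch *slab,
					 struct object *percpu)
{
	struct object *nextfree;

	assert(slab != NULL);

	/* nextfree is read before percpu->next is overwritten below */
	while (percpu && (nextfree = percpu->next)) {
		struct object *prior = atomic_load(&slab->freelist);

		do {
			/* link the object in front of the current head */
			percpu->next = prior;
			/* on failure, prior is reloaded with the new head */
		} while (!atomic_compare_exchange_weak(&slab->freelist,
						       &prior, percpu));
		percpu = nextfree;
	}
	return percpu;
}
```

Keeping the last object back mirrors the source: stage two folds it into the same atomic update that clears the frozen bit, so the final free and the unfreeze happen as one step.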
2012-05-18 17:01:17 +04:00
/*
 * Unfreeze all the cpu partial slabs.
 *
2012-11-28 20:23:00 +04:00
 * This function must be called with interrupts disabled
 * for the cpu using c (or some other guarantee must be there
 * to guarantee no concurrent accesses).
2012-05-18 17:01:17 +04:00
 */
2012-11-28 20:23:00 +04:00
static void unfreeze_partials(struct kmem_cache *s,
		struct kmem_cache_cpu *c)
2011-08-10 01:12:27 +04:00
{
2013-06-19 09:05:52 +04:00
#ifdef CONFIG_SLUB_CPU_PARTIAL
slub: refactoring unfreeze_partials()
The current implementation of unfreeze_partials() is complicated,
but the benefit from it is insignificant. In addition, the amount of code
in the do {} while loop raises the failure rate of cmpxchg_double_slab.
The current implementation tests the status of the cpu partial slab
and acquires list_lock inside the do {} while loop, so
we avoid acquiring list_lock and gain a little benefit
when the front of the cpu partial slab is to be discarded, but this is a rare case.
If add_partial has been performed and cmpxchg_double_slab then fails,
remove_partial has to be called case by case.
I think these are disadvantages of the current implementation,
so I refactor unfreeze_partials().
Minimizing the code in the do {} while loop reduces the failure rate
of cmpxchg_double_slab. Below is the output of 'slabinfo -r kmalloc-256'
when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is done.
** before **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 182685
Unlocked Cmpxchg Double redos 0
** after **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 177995
Unlocked Cmpxchg Double redos 1
We can see that the cmpxchg_double_slab failure rate is improved slightly.
Below is the output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.
** before **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% )
2,919,550 context-switches # 0.027 M/sec ( +- 3.07% )
100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% )
124,201 page-faults # 0.001 M/sec ( +- 0.15% )
401,500,234,387 cycles # 3.700 GHz ( +- 0.24% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% )
45,934,956,860 branches # 423.297 M/sec ( +- 0.14% )
188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% )
13.691837307 seconds time elapsed ( +- 0.24% )
** after **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% )
2,834,781 context-switches # 0.026 M/sec ( +- 2.33% )
93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% )
123,967 page-faults # 0.001 M/sec ( +- 0.15% )
398,781,421,836 cycles # 3.700 GHz ( +- 0.22% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% )
45,855,370,128 branches # 425.436 M/sec ( +- 0.10% )
169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% )
13.596272341 seconds time elapsed ( +- 0.22% )
No regression is found; rather, we see a slightly better result.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-08 21:23:16 +04:00
	struct kmem_cache_node *n = NULL, *n2 = NULL;
2011-11-14 09:34:13 +04:00
	struct page *page, *discard_page = NULL;
2011-08-10 01:12:27 +04:00
	while ((page = c->partial)) {
		struct page new;
		struct page old;
		c->partial = page->next;
2012-06-08 21:23:16 +04:00
		n2 = get_node(s, page_to_nid(page));
		if (n != n2) {
			if (n)
				spin_unlock(&n->list_lock);
			n = n2;
			spin_lock(&n->list_lock);
		}
2011-08-10 01:12:27 +04:00
do {
old . freelist = page - > freelist ;
old . counters = page - > counters ;
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! old . frozen ) ;
2011-08-10 01:12:27 +04:00
new . counters = old . counters ;
new . freelist = old . freelist ;
new . frozen = 0 ;
2012-05-18 17:01:17 +04:00
} while ( ! __cmpxchg_double_slab ( s , page ,
2011-08-10 01:12:27 +04:00
old . freelist , old . counters ,
new . freelist , new . counters ,
" unfreezing slab " ) ) ;
2014-07-03 02:22:35 +04:00
if ( unlikely ( ! new . inuse & & n - > nr_partial > = s - > min_partial ) ) {
2011-11-14 09:34:13 +04:00
page - > next = discard_page ;
discard_page = page ;
2012-06-08 21:23:16 +04:00
} else {
add_partial ( n , page , DEACTIVATE_TO_TAIL ) ;
stat ( s , FREE_ADD_PARTIAL ) ;
2011-08-10 01:12:27 +04:00
}
}
if ( n )
spin_unlock ( & n - > list_lock ) ;
2011-11-14 09:34:13 +04:00
	while (discard_page) {
		page = discard_page;
		discard_page = discard_page->next;
		stat(s, DEACTIVATE_EMPTY);
		discard_slab(s, page);
		stat(s, FREE_SLAB);
	}
2013-06-19 09:05:52 +04:00
# endif
2011-08-10 01:12:27 +04:00
}
/*
 * Put a page that was just frozen (in __slab_free) into a partial page
 * slot if available. This is done without interrupts disabled and without
 * preemption disabled. The cmpxchg is racy and may put the partial page
 * onto a random cpu's partial slot.
 *
 * If we did not find a slot then simply move all the partials to the
 * per node partial list.
 */
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating the number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse"
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! available is always 0 !!!
...
Therefore, "available > s->cpu_partial / 2" is always false and
we always go to the second iteration.
This patch corrects this problem.
After that, we don't need the return value of put_cpu_partial(),
so remove it.
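The ordering problem described above can be reproduced with a small user-space model. This is a sketch under stated assumptions, not the patch itself: struct model_page and the three helper functions are hypothetical names, and the fixed variant simply has the acquire step report how many objects it took, which is one way to avoid reading the counters after they were overwritten.

```c
#include <assert.h>

/* Hypothetical stand-in for the two struct page counters involved. */
struct model_page {
	unsigned int objects;	/* objects the slab can hold */
	unsigned int inuse;	/* objects currently handed out */
};

/*
 * Model of acquire_slab() with mode = 1: the whole freelist is taken,
 * so inuse is set to objects. Returns the number of objects acquired.
 */
static unsigned int model_acquire_all(struct model_page *page)
{
	unsigned int acquired = page->objects - page->inuse;

	page->inuse = page->objects;
	return acquired;
}

/* Buggy order: compute the free count after the slab was acquired. */
static unsigned int available_after_acquire(struct model_page *page)
{
	model_acquire_all(page);
	return page->objects - page->inuse;	/* always 0 */
}

/* Fixed order: use the count captured before inuse was overwritten. */
static unsigned int available_from_acquire(struct model_page *page)
{
	return model_acquire_all(page);
}
```

For a slab with 16 objects of which 5 are in use, the buggy order reports 0 available objects while the fixed order reports 11.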
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
static void put_cpu_partial ( struct kmem_cache * s , struct page * page , int drain )
2011-08-10 01:12:27 +04:00
{
2013-06-19 09:05:52 +04:00
# ifdef CONFIG_SLUB_CPU_PARTIAL
2011-08-10 01:12:27 +04:00
struct page * oldpage ;
int pages ;
int pobjects ;
slub: make dead caches discard free slabs immediately
To speed up further allocations SLUB may store empty slabs in per cpu/node
partial lists instead of freeing them immediately. This prevents per
memcg caches destruction, because kmem caches created for a memory cgroup
are only destroyed after the last page charged to the cgroup is freed.
To fix this issue, this patch resurrects the approach first proposed in [1].
It forbids SLUB to cache empty slabs after the memory cgroup that the
cache belongs to was destroyed. It is achieved by setting kmem_cache's
cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
that it would drop frozen empty slabs immediately if cpu_partial = 0.
The runtime overhead is minimal. From all the hot functions, we only
touch relatively cold put_cpu_partial(): we make it call
unfreeze_partials() after freezing a slab that belongs to an offline
memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
partial list on free/alloc, and there can't be allocations from dead
caches, it shouldn't cause any overhead. We do have to disable preemption
for put_cpu_partial() to achieve that though.
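The behaviour described above can be sketched as a single-CPU user-space model. It is an illustration under stated assumptions, not the kernel implementation: the names model_slab, model_cache, model_drain and model_put_cpu_partial are hypothetical, the free-object count is kept in the cache instead of in the head page, and locking, interrupts and the cmpxchg loop are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slab descriptor: only what the model needs. */
struct model_slab {
	struct model_slab *next;
	unsigned int free_objects;
};

/* Hypothetical cache with one per-cpu partial list. */
struct model_cache {
	unsigned int cpu_partial;	/* 0 means: keep nothing cached */
	struct model_slab *partial;	/* head of the per-cpu partial list */
	unsigned int pobjects;		/* free objects on that list */
	unsigned int drained;		/* slabs moved off the list so far */
};

/* Plays the role of unfreeze_partials(): empty the per-cpu list. */
static void model_drain(struct model_cache *s)
{
	while (s->partial) {
		struct model_slab *slab = s->partial;

		s->partial = slab->next;
		slab->next = NULL;
		s->drained++;
	}
	s->pobjects = 0;
}

static void model_put_cpu_partial(struct model_cache *s, struct model_slab *slab)
{
	/* List already over the limit: flush it before adding the new slab. */
	if (s->partial && s->pobjects > s->cpu_partial)
		model_drain(s);

	slab->next = s->partial;
	s->partial = slab;
	s->pobjects += slab->free_objects;

	/* Dead cache (cpu_partial set to 0): drop the slab immediately. */
	if (s->cpu_partial == 0)
		model_drain(s);
}
```

A cache with a non-zero limit keeps the slab on its list, while a cache whose limit was set to 0 ends every call with an empty list.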
The original patch was accepted well and even merged to the mm tree.
However, I decided to withdraw it due to changes happening to the memcg
core at that time. I had an idea of introducing per-memcg shrinkers for
kmem caches, but now, as memcg has finally settled down, I do not see it
as an option, because SLUB shrinker would be too costly to call since SLUB
does not keep free slabs on a separate list. Besides, we currently do not
even call per-memcg shrinkers for offline memcgs. Overall, it would
introduce much more complexity to both SLUB and memcg than this small
patch.
Regarding SLAB, there's no problem with it, because it shrinks
per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
longer keep entries for offline cgroups in per-memcg arrays (such as
memcg_cache_params->memcg_caches), so we do not have to bother if a
per-memcg cache will be shrunk a bit later than it could be.
[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 01:59:47 +03:00
preempt_disable ( ) ;
2011-08-10 01:12:27 +04:00
do {
pages = 0 ;
pobjects = 0 ;
oldpage = this_cpu_read ( s - > cpu_slab - > partial ) ;
if ( oldpage ) {
pobjects = oldpage - > pobjects ;
pages = oldpage - > pages ;
if ( drain & & pobjects > s - > cpu_partial ) {
unsigned long flags ;
/*
* partial array is full . Move the existing
* set to the per node partial list .
*/
local_irq_save ( flags ) ;
2012-11-28 20:23:00 +04:00
unfreeze_partials ( s , this_cpu_ptr ( s - > cpu_slab ) ) ;
2011-08-10 01:12:27 +04:00
local_irq_restore ( flags ) ;
2012-06-22 22:22:38 +04:00
oldpage = NULL ;
2011-08-10 01:12:27 +04:00
pobjects = 0 ;
pages = 0 ;
2012-02-03 19:34:56 +04:00
stat ( s , CPU_PARTIAL_DRAIN ) ;
2011-08-10 01:12:27 +04:00
}
}
pages + + ;
pobjects + = page - > objects - page - > inuse ;
page - > pages = pages ;
page - > pobjects = pobjects ;
page - > next = oldpage ;
2013-07-15 05:05:29 +04:00
} while ( this_cpu_cmpxchg ( s - > cpu_slab - > partial , oldpage , page )
! = oldpage ) ;
2015-02-13 01:59:47 +03:00
	if (unlikely(!s->cpu_partial)) {
		unsigned long flags;
		local_irq_save(flags);
		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
		local_irq_restore(flags);
	}
	preempt_enable();
2013-06-19 09:05:52 +04:00
# endif
2011-08-10 01:12:27 +04:00
}
2007-10-16 12:26:05 +04:00
static inline void flush_slab ( struct kmem_cache * s , struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
2009-12-19 01:26:23 +03:00
stat ( s , CPUSLAB_FLUSH ) ;
2012-05-09 19:09:57 +04:00
deactivate_slab ( s , c - > page , c - > freelist ) ;
c - > tid = next_tid ( c - > tid ) ;
c - > page = NULL ;
c - > freelist = NULL ;
2007-05-07 01:49:36 +04:00
}
/*
* Flush cpu slab .
2008-02-16 10:45:26 +03:00
*
2007-05-07 01:49:36 +04:00
* Called from IPI handler with interrupts disabled .
*/
2007-07-17 15:03:24 +04:00
static inline void __flush_cpu_slab ( struct kmem_cache * s , int cpu )
2007-05-07 01:49:36 +04:00
{
2009-12-19 01:26:20 +03:00
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab , cpu ) ;
2007-05-07 01:49:36 +04:00
2011-08-10 01:12:27 +04:00
if ( likely ( c ) ) {
if ( c - > page )
flush_slab ( s , c ) ;
2012-11-28 20:23:00 +04:00
unfreeze_partials ( s , c ) ;
2011-08-10 01:12:27 +04:00
}
2007-05-07 01:49:36 +04:00
}
static void flush_cpu_slab ( void * d )
{
struct kmem_cache * s = d ;
2007-10-16 12:26:05 +04:00
__flush_cpu_slab ( s , smp_processor_id ( ) ) ;
2007-05-07 01:49:36 +04:00
}
2012-03-29 01:42:44 +04:00
static bool has_cpu_slab ( int cpu , void * info )
{
struct kmem_cache * s = info ;
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab , cpu ) ;
2012-05-18 04:03:26 +04:00
return c - > page | | c - > partial ;
2012-03-29 01:42:44 +04:00
}
2007-05-07 01:49:36 +04:00
static void flush_all ( struct kmem_cache * s )
{
2012-03-29 01:42:44 +04:00
on_each_cpu_cond ( has_cpu_slab , flush_cpu_slab , s , 1 , GFP_ATOMIC ) ;
2007-05-07 01:49:36 +04:00
}
2007-10-16 12:26:05 +04:00
/*
* Check if the objects in a per cpu structure fit numa
* locality expectations .
*/
2012-05-09 19:09:59 +04:00
static inline int node_match ( struct page * page , int node )
2007-10-16 12:26:05 +04:00
{
# ifdef CONFIG_NUMA
2013-01-24 01:45:47 +04:00
if ( ! page | | ( node ! = NUMA_NO_NODE & & page_to_nid ( page ) ! = node ) )
2007-10-16 12:26:05 +04:00
return 0 ;
# endif
return 1 ;
}
mm, slab: suppress out of memory warning unless debug is enabled
When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as current
number of slabs, number of objects, active objects, etc. This is always
coupled with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.
Suppress this out of memory warning if the allocator is configured
without debug support. The page allocation failure warning will
indicate it is a failed slab allocation, the order, and the gfp mask, so
this is only useful to diagnose allocator issues.
Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch. If debug is
disabled, however, the warnings are now suppressed.
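The decision of whether to print the warning can be modelled as a small predicate. This is a sketch under stated assumptions: MODEL_GFP_NOWARN, struct model_ratelimit and both functions are hypothetical names, the debug configuration is passed as a run-time flag instead of being a compile-time option, and the rate limiter only counts a burst without the time window that a real one would reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value standing in for __GFP_NOWARN. */
#define MODEL_GFP_NOWARN 0x1u

/* Hypothetical burst-only rate limiter (no time interval). */
struct model_ratelimit {
	unsigned int burst;	/* messages allowed */
	unsigned int printed;	/* messages emitted so far */
};

static bool model_ratelimit_ok(struct model_ratelimit *rs)
{
	if (rs->printed >= rs->burst)
		return false;
	rs->printed++;
	return true;
}

/*
 * Warn only if debug support is built in, the caller did not ask for
 * silence, and the rate limit has not been exhausted.
 */
static bool model_should_warn(unsigned int gfpflags, bool debug_enabled,
			      struct model_ratelimit *rs)
{
	if (!debug_enabled)
		return false;
	if (gfpflags & MODEL_GFP_NOWARN)
		return false;
	return model_ratelimit_ok(rs);
}
```

The checks are ordered so that a suppressed warning never consumes part of the burst budget.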
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:36 +04:00
# ifdef CONFIG_SLUB_DEBUG
2009-06-10 19:50:32 +04:00
static int count_free(struct page *page)
{
	return page->objects - page->inuse;
}
2014-06-05 03:06:36 +04:00
static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
{
	return atomic_long_read(&n->total_objects);
}
# endif /* CONFIG_SLUB_DEBUG */
# if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SYSFS)
2009-06-10 19:50:32 +04:00
static unsigned long count_partial(struct kmem_cache_node *n,
					int (*get_count)(struct page *))
{
	unsigned long flags;
	unsigned long x = 0;
	struct page *page;
	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry(page, &n->partial, lru)
		x += get_count(page);
	spin_unlock_irqrestore(&n->list_lock, flags);
	return x;
}
2014-06-05 03:06:36 +04:00
# endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
2009-06-11 14:08:48 +04:00
2009-06-10 19:50:32 +04:00
static noinline void
slab_out_of_memory ( struct kmem_cache * s , gfp_t gfpflags , int nid )
{
2014-06-05 03:06:36 +04:00
# ifdef CONFIG_SLUB_DEBUG
static DEFINE_RATELIMIT_STATE ( slub_oom_rs , DEFAULT_RATELIMIT_INTERVAL ,
DEFAULT_RATELIMIT_BURST ) ;
2009-06-10 19:50:32 +04:00
int node ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2009-06-10 19:50:32 +04:00
2014-06-05 03:06:36 +04:00
if ( ( gfpflags & __GFP_NOWARN ) | | ! __ratelimit ( & slub_oom_rs ) )
return ;
2014-06-05 03:06:34 +04:00
pr_warn ( " SLUB: Unable to allocate memory on node %d (gfp=0x%x) \n " ,
2009-06-10 19:50:32 +04:00
nid , gfpflags ) ;
2014-06-05 03:06:34 +04:00
pr_warn ( " cache: %s, object size: %d, buffer size: %d, default order: %d, min order: %d \n " ,
s - > name , s - > object_size , s - > size , oo_order ( s - > oo ) ,
oo_order ( s - > min ) ) ;
2009-06-10 19:50:32 +04:00
2012-06-13 19:24:57 +04:00
if ( oo_order ( s - > min ) > get_order ( s - > object_size ) )
2014-06-05 03:06:34 +04:00
pr_warn ( " %s debugging increased min order, use slub_debug=O to disable. \n " ,
s - > name ) ;
2009-07-07 11:14:14 +04:00
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2009-06-10 19:50:32 +04:00
unsigned long nr_slabs ;
unsigned long nr_objs ;
unsigned long nr_free ;
2009-06-11 14:08:48 +04:00
nr_free = count_partial ( n , count_free ) ;
nr_slabs = node_nr_slabs ( n ) ;
nr_objs = node_nr_objs ( n ) ;
2009-06-10 19:50:32 +04:00
2014-06-05 03:06:34 +04:00
pr_warn ( " node %d: slabs: %ld, objs: %ld, free: %ld \n " ,
2009-06-10 19:50:32 +04:00
node , nr_slabs , nr_objs , nr_free ) ;
}
2014-06-05 03:06:36 +04:00
# endif
2009-06-10 19:50:32 +04:00
}
2011-08-10 01:12:26 +04:00
static inline void * new_slab_objects ( struct kmem_cache * s , gfp_t flags ,
int node , struct kmem_cache_cpu * * pc )
{
2012-05-09 19:09:51 +04:00
void * freelist ;
2012-05-09 19:09:55 +04:00
struct kmem_cache_cpu * c = * pc ;
struct page * page ;
2011-08-10 01:12:26 +04:00
2012-05-09 19:09:55 +04:00
freelist = get_partial ( s , flags , node , c ) ;
2011-08-10 01:12:26 +04:00
2012-05-09 19:09:55 +04:00
if ( freelist )
return freelist ;
page = new_slab ( s , flags , node ) ;
2011-08-10 01:12:26 +04:00
if ( page ) {
2014-06-05 03:07:56 +04:00
c = raw_cpu_ptr ( s - > cpu_slab ) ;
2011-08-10 01:12:26 +04:00
if ( c - > page )
flush_slab ( s , c ) ;
/*
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
*/
2012-05-09 19:09:51 +04:00
freelist = page - > freelist ;
2011-08-10 01:12:26 +04:00
page - > freelist = NULL ;
stat ( s , ALLOC_SLAB ) ;
c - > page = page ;
* pc = c ;
} else
2012-05-09 19:09:51 +04:00
freelist = NULL ;
2011-08-10 01:12:26 +04:00
2012-05-09 19:09:51 +04:00
return freelist ;
2011-08-10 01:12:26 +04:00
}
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allows network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propogate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel run to completion. However, the patched
kernel completed the test 45% faster.
MICRO
3.5.0-rc2 3.5.0-rc2
vanilla swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds) 197.80 173.07
User+Sys Time Running Test (seconds) 206.96 182.03
Total Elapsed Time (seconds) 3240.70 1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to be able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
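The gating rule described above can be modelled outside the kernel. What follows is a minimal userspace sketch, not the kernel implementation: `struct toy_slab`, `TOY_GFP_MEMALLOC`, `toy_pfmemalloc_match()` and `toy_pick_slab()` are hypothetical names invented for illustration. Only the rule itself comes from the text: a slab backed by the reserves may be used only by callers entitled to the reserves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag standing in for "caller may use the reserves". */
#define TOY_GFP_MEMALLOC 0x1u

/* Hypothetical stand-in for a slab page. */
struct toy_slab {
	bool from_reserve;	/* models page->pfmemalloc */
	int free_objects;	/* objects still available in this slab */
};

/* A reserve-backed slab only matches callers entitled to the reserves. */
static bool toy_pfmemalloc_match(const struct toy_slab *slab,
				 unsigned int flags)
{
	if (slab->from_reserve)
		return (flags & TOY_GFP_MEMALLOC) != 0;
	return true;
}

/*
 * Return the first slab the caller may take an object from, or NULL.
 * NULL can be returned even though free objects exist, which models
 * the allocation failure the text describes for !PFMEMALLOC callers.
 */
static struct toy_slab *toy_pick_slab(struct toy_slab *slabs, size_t n,
				      unsigned int flags)
{
	for (size_t i = 0; i < n; i++) {
		if (slabs[i].free_objects > 0 &&
		    toy_pfmemalloc_match(&slabs[i], flags))
			return &slabs[i];
	}
	return NULL;
}
```

With one reserve-backed slab holding free objects and one empty ordinary slab, a caller without the flag gets NULL while a caller with the flag gets the reserve-backed slab.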
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:43:58 +04:00
static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
{
	if (unlikely(PageSlabPfmemalloc(page)))
		return gfp_pfmemalloc_allowed(gfpflags);

	return true;
}
2011-11-12 00:07:14 +04:00
/*
2013-07-15 05:05:29 +04:00
* Check the page - > freelist of a page and either transfer the freelist to the
* per cpu freelist or deactivate the page .
2011-11-12 00:07:14 +04:00
*
* The page is still frozen if the return value is not NULL .
*
* If this function returns NULL then the page has been unfrozen .
2012-05-18 17:01:17 +04:00
*
* This function must be called with interrupts disabled.
2011-11-12 00:07:14 +04:00
*/
static inline void * get_freelist ( struct kmem_cache * s , struct page * page )
{
struct page new ;
unsigned long counters ;
void * freelist ;
do {
freelist = page - > freelist ;
counters = page - > counters ;
2012-05-09 19:09:51 +04:00
2011-11-12 00:07:14 +04:00
new . counters = counters ;
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! new . frozen ) ;
2011-11-12 00:07:14 +04:00
new . inuse = page - > objects ;
new . frozen = freelist ! = NULL ;
2012-05-18 17:01:17 +04:00
} while ( ! __cmpxchg_double_slab ( s , page ,
2011-11-12 00:07:14 +04:00
freelist , counters ,
NULL , new . counters ,
" get_freelist " ) ) ;
return freelist ;
}
2007-05-07 01:49:36 +04:00
/*
2007-05-10 14:15:16 +04:00
* Slow path . The lockless freelist is empty or we need to perform
* debugging duties .
*
* Processing is still very fast if new objects have been freed to the
* regular freelist . In that case we simply take over the regular freelist
* as the lockless freelist and zap the regular freelist .
2007-05-07 01:49:36 +04:00
*
2007-05-10 14:15:16 +04:00
* If that is not working then we fall back to the partial lists . We take the
* first element of the freelist as the object to allocate now and move the
* rest of the freelist to the lockless freelist .
2007-05-07 01:49:36 +04:00
*
2007-05-10 14:15:16 +04:00
* And if we were unable to get a new slab from the partial slab lists then
2008-02-16 10:45:26 +03:00
* we need to allocate a new slab . This is the slowest path since it involves
* a call to the page allocator and the setup of a new slab .
2015-11-21 02:57:35 +03:00
*
* Version of __slab_alloc to use when we know that interrupts are
* already disabled ( which is the case for bulk allocation ) .
2007-05-07 01:49:36 +04:00
*/
2015-11-21 02:57:35 +03:00
static void * ___slab_alloc ( struct kmem_cache * s , gfp_t gfpflags , int node ,
2008-08-19 21:43:25 +04:00
unsigned long addr , struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
2012-05-09 19:09:51 +04:00
void * freelist ;
2012-05-09 19:09:58 +04:00
struct page * page ;
2007-05-07 01:49:36 +04:00
2012-05-09 19:09:58 +04:00
page = c - > page ;
if ( ! page )
2007-05-07 01:49:36 +04:00
goto new_slab ;
2011-08-10 01:12:27 +04:00
redo :
2012-05-09 19:09:51 +04:00
2012-05-09 19:09:59 +04:00
if ( unlikely ( ! node_match ( page , node ) ) ) {
2014-10-10 02:26:15 +04:00
int searchnode = node ;
if ( node ! = NUMA_NO_NODE & & ! node_present_pages ( node ) )
searchnode = node_to_mem_node ( node ) ;
if ( unlikely ( ! node_match ( page , searchnode ) ) ) {
stat ( s , ALLOC_NODE_MISMATCH ) ;
deactivate_slab ( s , page , c - > freelist ) ;
c - > page = NULL ;
c - > freelist = NULL ;
goto new_slab ;
}
2011-06-01 21:25:56 +04:00
}
2008-02-16 10:45:26 +03:00
2012-08-01 03:43:58 +04:00
/*
* By rights , we should be searching for a slab page that was
* PFMEMALLOC but right now , we are losing the pfmemalloc
* information when the page leaves the per - cpu allocator
*/
if ( unlikely ( ! pfmemalloc_match ( page , gfpflags ) ) ) {
deactivate_slab ( s , page , c - > freelist ) ;
c - > page = NULL ;
c - > freelist = NULL ;
goto new_slab ;
}
2011-12-13 07:57:06 +04:00
/* must check again c->freelist in case of cpu migration or IRQ */
2012-05-09 19:09:51 +04:00
freelist = c - > freelist ;
if ( freelist )
2011-12-13 07:57:06 +04:00
goto load_freelist ;
2011-06-01 21:25:58 +04:00
2012-05-09 19:09:58 +04:00
freelist = get_freelist ( s , page ) ;
2008-02-16 10:45:26 +03:00
2012-05-09 19:09:51 +04:00
if ( ! freelist ) {
2011-06-01 21:25:58 +04:00
c - > page = NULL ;
stat ( s , DEACTIVATE_BYPASS ) ;
2011-06-01 21:25:56 +04:00
goto new_slab ;
2011-06-01 21:25:58 +04:00
}
2008-02-16 10:45:26 +03:00
2009-12-19 01:26:23 +03:00
stat ( s , ALLOC_REFILL ) ;
2008-02-16 10:45:26 +03:00
2007-05-10 14:15:16 +04:00
load_freelist :
2012-05-09 19:09:52 +04:00
/*
* freelist is pointing to the list of objects to be used .
* page is pointing to the page from which the objects are obtained .
* That page must be frozen for per cpu allocations to work .
*/
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! c - > page - > frozen ) ;
2012-05-09 19:09:51 +04:00
c - > freelist = get_freepointer ( s , freelist ) ;
2011-02-25 20:38:54 +03:00
c - > tid = next_tid ( c - > tid ) ;
2012-05-09 19:09:51 +04:00
return freelist ;
2007-05-07 01:49:36 +04:00
new_slab :
2011-06-01 21:25:52 +04:00
2011-08-10 01:12:27 +04:00
if ( c - > partial ) {
2012-05-09 19:09:58 +04:00
page = c - > page = c - > partial ;
c - > partial = page - > next ;
2011-08-10 01:12:27 +04:00
stat ( s , CPU_PARTIAL_ALLOC ) ;
c - > freelist = NULL ;
goto redo ;
2007-05-07 01:49:36 +04:00
}
2012-05-09 19:09:55 +04:00
freelist = new_slab_objects ( s , gfpflags , node , & c ) ;
2011-04-15 23:48:14 +04:00
2012-05-09 19:09:54 +04:00
if ( unlikely ( ! freelist ) ) {
mm, slab: suppress out of memory warning unless debug is enabled
When the slab or slub allocators cannot allocate additional slab pages,
they emit diagnostic information to the kernel log such as current
number of slabs, number of objects, active objects, etc. This is always
coupled with a page allocation failure warning since it is controlled by
!__GFP_NOWARN.
Suppress this out of memory warning if the allocator is built
without debug support. The page allocation failure warning will
indicate it is a failed slab allocation, the order, and the gfp mask, so
this is only useful to diagnose allocator issues.
Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
allocator, there is no functional change with this patch. If debug is
disabled, however, the warnings are now suppressed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:36 +04:00
slab_out_of_memory ( s , gfpflags , node ) ;
2012-05-09 19:09:54 +04:00
return NULL ;
2007-05-07 01:49:36 +04:00
}
2011-06-01 21:25:52 +04:00
2012-05-09 19:09:58 +04:00
page = c - > page ;
2012-08-01 03:44:00 +04:00
if ( likely ( ! kmem_cache_debug ( s ) & & pfmemalloc_match ( page , gfpflags ) ) )
2007-05-17 09:10:53 +04:00
goto load_freelist ;
2011-06-01 21:25:52 +04:00
2011-08-10 01:12:26 +04:00
/* Only entered in the debug case */
2013-07-15 05:05:29 +04:00
if ( kmem_cache_debug ( s ) & &
! alloc_debug_processing ( s , page , freelist , addr ) )
2011-08-10 01:12:26 +04:00
goto new_slab ; /* Slab failed checks. Next slab needed */
2007-05-10 14:15:16 +04:00
2012-05-09 19:09:58 +04:00
deactivate_slab ( s , page , get_freepointer ( s , freelist ) ) ;
2012-05-09 19:09:57 +04:00
c - > page = NULL ;
c - > freelist = NULL ;
2012-05-09 19:09:51 +04:00
return freelist ;
2007-05-10 14:15:16 +04:00
}
2015-11-21 02:57:35 +03:00
/*
 * Wrapper around ___slab_alloc that disables interrupts and compensates
 * for possible cpu changes by refetching the per cpu area pointer.
 */
static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
			  unsigned long addr, struct kmem_cache_cpu *c)
{
	void *p;
	unsigned long flags;

	local_irq_save(flags);
#ifdef CONFIG_PREEMPT
	/*
	 * We may have been preempted and rescheduled on a different
	 * cpu before disabling interrupts. Need to reload cpu area
	 * pointer.
	 */
	c = this_cpu_ptr(s->cpu_slab);
#endif

	p = ___slab_alloc(s, gfpflags, node, addr, c);
	local_irq_restore(flags);
	return p;
}
2007-05-10 14:15:16 +04:00
/*
* Inlined fastpath so that allocation functions ( kmalloc , kmem_cache_alloc )
* have the fastpath folded into their functions . So no function call
* overhead for requests that can be satisfied on the fastpath .
*
* The fastpath works by first checking if the lockless freelist can be used .
* If not then __slab_alloc is called for slow processing .
*
* Otherwise we can simply pick the next object from the lockless free list .
*/
2012-09-09 00:47:58 +04:00
static __always_inline void * slab_alloc_node ( struct kmem_cache * s ,
2008-08-19 21:43:25 +04:00
gfp_t gfpflags , int node , unsigned long addr )
2007-05-10 14:15:16 +04:00
{
2015-11-21 02:57:52 +03:00
void * object ;
2007-10-16 12:26:05 +04:00
struct kmem_cache_cpu * c ;
2012-05-09 19:09:59 +04:00
struct page * page ;
2011-02-25 20:38:54 +03:00
unsigned long tid ;
2008-01-08 10:20:30 +03:00
memcg: fix possible use-after-free in memcg_kmem_get_cache()
Suppose task @t that belongs to a memory cgroup @memcg is going to
allocate an object from a kmem cache @c. The copy of @c corresponding to
@memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory
cgroup destruction we can access the memory cgroup's copy of the cache
after it was destroyed:
    CPU0                                    CPU1
    ----                                    ----
    [ current=@t
      @mc->memcg_params->nr_pages=0 ]

    kmem_cache_alloc(@c):
      call memcg_kmem_get_cache(@c);
      proceed to allocation from @mc:
        alloc a page for @mc:
          ...

                                            move @t from @memcg
                                            destroy @memcg:
                                              mem_cgroup_css_offline(@memcg):
                                                memcg_unregister_all_caches(@memcg):
                                                  kmem_cache_destroy(@mc)

    add page to @mc
We could fix this issue by taking a reference to a per-memcg cache, but
that would require adding a per-cpu reference counter to per-memcg caches,
which would look cumbersome.
Instead, let's take a reference to a memory cgroup, which already has a
per-cpu reference counter, in the beginning of kmem_cache_alloc to be
dropped in the end, and move per memcg caches destruction from css offline
to css free. As a side effect, per-memcg caches will be destroyed not one
by one, but all at once when the last page accounted to the memory cgroup
is freed. This doesn't sound like a high price for code readability though.
Note, this patch does add some overhead to the kmem_cache_alloc hot path,
but it is pretty negligible - it's just a function call plus a per cpu
counter decrement, which is comparable to what we already have in
memcg_kmem_get_cache. Besides, it's only relevant if there are memory
cgroups with kmem accounting enabled. I don't think we can find a way to
handle this race w/o it, because alloc_page called from kmem_cache_alloc
may sleep so we can't flush all pending kmallocs w/o reference counting.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 03:56:38 +03:00
s = slab_pre_alloc_hook ( s , gfpflags ) ;
if ( ! s )
2008-12-23 13:37:01 +03:00
return NULL ;
2011-02-25 20:38:54 +03:00
redo :
/*
* Must read kmem_cache cpu data via this cpu ptr . Preemption is
* enabled . We may switch back and forth between cpus while
* reading from one cpu area . That does not matter as long
* as we end up on the original cpu again when doing the cmpxchg .
2013-01-24 01:45:48 +04:00
*
2015-02-11 01:09:32 +03:00
* We must guarantee that tid and kmem_cache_cpu are retrieved on
* the same cpu. They could differ if CONFIG_PREEMPT is enabled, so we
* need to check that they match.
2011-02-25 20:38:54 +03:00
*/
2015-02-11 01:09:32 +03:00
do {
tid = this_cpu_read ( s - > cpu_slab - > tid ) ;
c = raw_cpu_ptr ( s - > cpu_slab ) ;
2015-03-26 01:55:23 +03:00
} while ( IS_ENABLED ( CONFIG_PREEMPT ) & &
unlikely ( tid ! = READ_ONCE ( c - > tid ) ) ) ;
2015-02-11 01:09:32 +03:00
/*
* The irqless object alloc/free algorithm used here depends on the order
* in which cpu_slab's data is fetched. tid should be fetched before
* anything on c to guarantee that an object and page associated with the
* previous tid won't be used with the current tid. If we fetch tid first,
* the object and page could be ones associated with the next tid and our
* alloc/free request will fail. In this case, we will retry. So, no problem.
*/
barrier ( ) ;
2011-02-25 20:38:54 +03:00
/*
* The transaction ids are globally unique per cpu and per operation on
* a per cpu queue. Thus they can guarantee that the cmpxchg_double
* occurs on the right processor and that there was no operation on the
* linked list in between.
*/
2009-12-19 01:26:20 +03:00
object = c - > freelist ;
2012-05-09 19:09:59 +04:00
page = c - > page ;
mm: slub: fix ALLOC_SLOWPATH stat
There used to be only one path out of __slab_alloc(), and ALLOC_SLOWPATH
got bumped in that exit path. Now there are two, and a bunch of gotos.
ALLOC_SLOWPATH can now get set more than once during a single call to
__slab_alloc() which is pretty bogus. Here's the sequence:
1. Enter __slab_alloc(), fall through all the way to the
stat(s, ALLOC_SLOWPATH);
2. hit 'if (!freelist)', and bump DEACTIVATE_BYPASS, jump to
new_slab (goto #1)
3. Hit 'if (c->partial)', bump CPU_PARTIAL_ALLOC, goto redo
(goto #2)
4. Fall through in the same path we did before all the way to
stat(s, ALLOC_SLOWPATH)
5. bump ALLOC_REFILL stat, then return
Doing this is obviously bogus. It keeps us from being able to
accurately compare ALLOC_SLOWPATH vs. ALLOC_FASTPATH. It also means
that the total number of allocs always exceeds the total number of
frees.
This patch moves stat(s, ALLOC_SLOWPATH) to be called from the same
place that __slab_alloc() is. This makes it much less likely that
ALLOC_SLOWPATH will get botched again in the spaghetti-code inside
__slab_alloc().
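The fix described above, counting the slow path at its single call site rather than inside a function that may loop through its own labels, can be illustrated with a small userspace sketch. All names here (`toy_stat()`, `toy_slow_alloc()`, `toy_alloc()`, the `TOY_*` counters) are hypothetical and only loosely modelled on the SLUB stat items; the `passes` argument is an artificial stand-in for the number of extra trips through the internal retry label.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters, loosely modelled on the SLUB stat items. */
enum toy_stat_item { TOY_ALLOC_FASTPATH, TOY_ALLOC_SLOWPATH, TOY_NR_STATS };

static unsigned long toy_stats[TOY_NR_STATS];

static void toy_stat(enum toy_stat_item item)
{
	toy_stats[item]++;
}

/*
 * Hypothetical slow path that may pass through its retry label several
 * times. It deliberately does no slow-path accounting itself, so the
 * number of internal passes cannot inflate the counter.
 */
static void *toy_slow_alloc(int passes)
{
	static int object;

retry:
	if (passes-- > 0)
		goto retry;
	return &object;
}

/* The caller bumps exactly one of the two counters per allocation. */
static void *toy_alloc(void *fast_object, int slow_passes)
{
	void *object = fast_object;

	if (!object) {
		object = toy_slow_alloc(slow_passes);
		toy_stat(TOY_ALLOC_SLOWPATH);	/* once per call */
	} else {
		toy_stat(TOY_ALLOC_FASTPATH);
	}
	return object;
}
```

Because each call increments exactly one counter, the sum of the two counters equals the number of allocation calls, which is what makes the fast path and slow path figures comparable.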
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:06:37 +04:00
if ( unlikely ( ! object | | ! node_match ( page , node ) ) ) {
2007-10-16 12:26:05 +04:00
object = __slab_alloc ( s , gfpflags , node , addr , c ) ;
2014-06-05 03:06:37 +04:00
stat ( s , ALLOC_SLOWPATH ) ;
} else {
slub: prefetch next freelist pointer in slab_alloc()
Recycling a page is a problem, since freelist link chain is hot on
cpu(s) which freed objects, and possibly very cold on cpu currently
owning slab.
Adding a prefetch of the cache line containing the pointer to the next
object in slab_alloc() helps a lot in many workloads, in particular on
asymmetric ones (allocations done on one cpu, frees on other cpus). Added
cost is three machine instructions only.
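As a rough illustration of the idea, here is a userspace sketch of a freelist pop that prefetches the next object's link word. It is not the kernel code: `struct toy_object`, `struct toy_freelist` and `toy_pop()` are hypothetical, the link is assumed to live at offset 0 of each free object, and `__builtin_prefetch` is the GCC/Clang builtin used here in place of the kernel's own prefetch helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free object: the link to the next free object is at offset 0. */
struct toy_object {
	struct toy_object *next;
};

/* Hypothetical freelist owned by a single thread. */
struct toy_freelist {
	struct toy_object *head;
};

/* Pop the first free object, or return NULL if the list is empty. */
static void *toy_pop(struct toy_freelist *fl)
{
	struct toy_object *object = fl->head;
	struct toy_object *next;

	if (!object)
		return NULL;

	next = object->next;
	fl->head = next;

	/*
	 * The following pop will read next->next. If another cpu freed
	 * that object, its cache line is likely cold here, so issue a
	 * read prefetch now. This is only a hint and changes no state.
	 */
	if (next)
		__builtin_prefetch(next, 0, 3);

	return object;
}
```

The prefetch does not change what `toy_pop()` returns; any benefit shows up only as reduced stall time on the following pop, which is why the commit message measures it with perf rather than with a functional test.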
Examples on my dual socket quad core ht machine (Intel CPU E5540
@2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
Before patch :

# perf stat -r 32 hackbench 50 process 4000 >/dev/null

 Performance counter stats for 'hackbench 50 process 4000' (32 runs):

     327577,471718 task-clock                #   15,821 CPUs utilized            ( +-  0,64% )
        28 866 491 context-switches          #    0,088 M/sec                    ( +-  1,80% )
         1 506 929 CPU-migrations            #    0,005 M/sec                    ( +-  3,24% )
           127 151 page-faults               #    0,000 M/sec                    ( +-  0,16% )
   829 399 813 448 cycles                    #    2,532 GHz                      ( +-  0,64% )
   580 664 691 740 stalled-cycles-frontend   #   70,01% frontend cycles idle     ( +-  0,71% )
   197 431 700 448 stalled-cycles-backend    #   23,80% backend cycles idle      ( +-  1,03% )
   503 548 648 975 instructions              #    0,61  insns per cycle
                                             #    1,15  stalled cycles per insn  ( +-  0,46% )
    95 780 068 471 branches                  #  292,389 M/sec                    ( +-  0,48% )
     1 426 407 916 branch-misses             #    1,49% of all branches          ( +-  1,35% )

      20,705679994 seconds time elapsed                                          ( +-  0,64% )

After patch :

# perf stat -r 32 hackbench 50 process 4000 >/dev/null

 Performance counter stats for 'hackbench 50 process 4000' (32 runs):

     286236,542804 task-clock                #   15,786 CPUs utilized            ( +-  1,32% )
        19 703 372 context-switches          #    0,069 M/sec                    ( +-  4,99% )
         1 658 249 CPU-migrations            #    0,006 M/sec                    ( +-  6,62% )
           126 776 page-faults               #    0,000 M/sec                    ( +-  0,12% )
   724 636 593 213 cycles                    #    2,532 GHz                      ( +-  1,32% )
   499 320 714 837 stalled-cycles-frontend   #   68,91% frontend cycles idle     ( +-  1,47% )
   156 555 126 809 stalled-cycles-backend    #   21,60% backend cycles idle      ( +-  2,22% )
   463 897 792 661 instructions              #    0,64  insns per cycle
                                             #    1,08  stalled cycles per insn  ( +-  0,94% )
    87 717 352 563 branches                  #  306,451 M/sec                    ( +-  0,99% )
       941 738 280 branch-misses             #    1,07% of all branches          ( +-  3,35% )

      18,132070670 seconds time elapsed                                          ( +-  1,30% )
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-12-16 19:25:34 +04:00
void * next_object = get_freepointer_safe ( s , object ) ;
2011-02-25 20:38:54 +03:00
/*
2011-03-31 05:57:33 +04:00
* The cmpxchg will only match if there was no additional
2011-02-25 20:38:54 +03:00
* operation and if we are on the right processor .
*
2013-07-15 05:05:29 +04:00
* The cmpxchg does the following atomically ( without lock
* semantics ! )
2011-02-25 20:38:54 +03:00
* 1. Relocate first pointer to the current per cpu area .
* 2. Verify that tid and freelist have not been changed
* 3. If they were not changed replace tid and freelist
*
2013-07-15 05:05:29 +04:00
* Since this is without lock semantics the protection is only
* against code executing on this cpu * not * from access by
* other cpus .
2011-02-25 20:38:54 +03:00
*/
2011-12-22 21:58:51 +04:00
if ( unlikely ( ! this_cpu_cmpxchg_double (
2011-02-25 20:38:54 +03:00
s - > cpu_slab - > freelist , s - > cpu_slab - > tid ,
object , tid ,
2011-12-16 19:25:34 +04:00
next_object , next_tid ( tid ) ) ) ) {
2011-02-25 20:38:54 +03:00
note_cmpxchg_failure ( " slab_alloc " , s , tid ) ;
goto redo ;
}
2011-12-16 19:25:34 +04:00
prefetch_freepointer ( s , next_object ) ;
2009-12-19 01:26:23 +03:00
stat ( s , ALLOC_FASTPATH ) ;
2007-05-10 14:15:16 +04:00
}
2011-02-25 20:38:54 +03:00
2009-11-25 21:14:48 +03:00
if ( unlikely ( gfpflags & __GFP_ZERO ) & & object )
2012-06-13 19:24:57 +04:00
memset ( object , 0 , s - > object_size ) ;
2007-07-17 15:03:23 +04:00
2015-11-21 02:57:52 +03:00
slab_post_alloc_hook ( s , gfpflags , 1 , & object ) ;
2008-04-04 02:54:48 +04:00
2007-05-10 14:15:16 +04:00
return object ;
2007-05-07 01:49:36 +04:00
}
2012-09-09 00:47:58 +04:00
static __always_inline void * slab_alloc ( struct kmem_cache * s ,
gfp_t gfpflags , unsigned long addr )
{
return slab_alloc_node ( s , gfpflags , NUMA_NO_NODE , addr ) ;
}
2007-05-07 01:49:36 +04:00
void * kmem_cache_alloc ( struct kmem_cache * s , gfp_t gfpflags )
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc ( s , gfpflags , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2013-07-15 05:05:29 +04:00
trace_kmem_cache_alloc ( _RET_IP_ , ret , s - > object_size ,
s - > size , gfpflags ) ;
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kmem_cache_alloc ) ;
2009-12-11 10:45:30 +03:00
# ifdef CONFIG_TRACING
2010-10-21 13:29:19 +04:00
void * kmem_cache_alloc_trace ( struct kmem_cache * s , gfp_t gfpflags , size_t size )
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc ( s , gfpflags , _RET_IP_ ) ;
2010-10-21 13:29:19 +04:00
trace_kmalloc ( _RET_IP_ , ret , size , s - > size , gfpflags ) ;
2015-02-14 01:39:42 +03:00
kasan_kmalloc ( s , ret , size ) ;
2010-10-21 13:29:19 +04:00
return ret ;
}
EXPORT_SYMBOL ( kmem_cache_alloc_trace ) ;
2008-08-19 21:43:26 +04:00
# endif
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_NUMA
void * kmem_cache_alloc_node ( struct kmem_cache * s , gfp_t gfpflags , int node )
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc_node ( s , gfpflags , node , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2009-03-23 16:12:24 +03:00
trace_kmem_cache_alloc_node ( _RET_IP_ , ret ,
2012-06-13 19:24:57 +04:00
s - > object_size , s - > size , gfpflags , node ) ;
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kmem_cache_alloc_node ) ;
2009-12-11 10:45:30 +03:00
# ifdef CONFIG_TRACING
2010-10-21 13:29:19 +04:00
void * kmem_cache_alloc_node_trace ( struct kmem_cache * s ,
2008-08-19 21:43:26 +04:00
gfp_t gfpflags ,
2010-10-21 13:29:19 +04:00
int node , size_t size )
2008-08-19 21:43:26 +04:00
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc_node ( s , gfpflags , node , _RET_IP_ ) ;
2010-10-21 13:29:19 +04:00
trace_kmalloc_node ( _RET_IP_ , ret ,
size , s - > size , gfpflags , node ) ;
2015-02-14 01:39:42 +03:00
kasan_kmalloc ( s , ret , size ) ;
2010-10-21 13:29:19 +04:00
return ret ;
2008-08-19 21:43:26 +04:00
}
2010-10-21 13:29:19 +04:00
EXPORT_SYMBOL ( kmem_cache_alloc_node_trace ) ;
2008-08-19 21:43:26 +04:00
# endif
2010-09-29 16:02:15 +04:00
# endif
2008-08-19 21:43:26 +04:00
2007-05-07 01:49:36 +04:00
/*
2015-02-11 01:09:37 +03:00
* Slow path handling . This may still be called frequently since objects
2007-05-10 14:15:16 +04:00
* have a longer lifetime than the cpu slabs in most processing loads .
2007-05-07 01:49:36 +04:00
*
2007-05-10 14:15:16 +04:00
* So we still attempt to reduce cache line usage . Just take the slab
* lock and free the item . If there is no additional partial page
* handling required then we can return immediately .
2007-05-07 01:49:36 +04:00
*/
2007-05-10 14:15:16 +04:00
static void __slab_free ( struct kmem_cache * s , struct page * page ,
2015-11-21 02:57:46 +03:00
void * head , void * tail , int cnt ,
unsigned long addr )
2007-05-07 01:49:36 +04:00
{
void * prior ;
2011-06-01 21:25:52 +04:00
int was_frozen ;
struct page new ;
unsigned long counters ;
struct kmem_cache_node * n = NULL ;
2011-06-01 21:25:51 +04:00
unsigned long uninitialized_var ( flags ) ;
2007-05-07 01:49:36 +04:00
2011-02-25 20:38:54 +03:00
stat ( s , FREE_SLOWPATH ) ;
2007-05-07 01:49:36 +04:00
2012-05-30 21:54:46 +04:00
if ( kmem_cache_debug ( s ) & &
2015-11-21 02:57:46 +03:00
! ( n = free_debug_processing ( s , page , head , tail , cnt ,
addr , & flags ) ) )
2011-06-01 21:25:55 +04:00
return ;
2008-02-16 10:45:26 +03:00
2011-06-01 21:25:52 +04:00
do {
slub: remove one code path and reduce lock contention in __slab_free()
When we try to free an object, there are some cases in which we need
to take a node lock. This is a necessary step for preventing a race.
After taking the lock, we then try cmpxchg_double_slab().
But there is a possible scenario in which cmpxchg_double_slab() fails
while the lock is held. The following example explains it.
CPU A                   CPU B
need lock
...                     need lock
...                     lock!!
lock..but spin          free success
spin...                 unlock
lock!!
free fail
In this case, CPU A retries while still holding the lock.
I think that in this case, for CPU A,
"release the lock first, and re-take it if necessary" is the preferable way.
There are two reasons for this.
First, this makes __slab_free()'s logic somewhat simpler.
With this patch, 'was_frozen = 1' is "always" handled without taking a lock.
So we can remove one code path.
Second, it may reduce lock contention.
When we retry, the status of the slab has already changed,
so we don't need the lock anymore in almost every case.
The "release the lock first, and re-take it if necessary" policy
helps with this.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-08-15 19:02:40 +04:00
if ( unlikely ( n ) ) {
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
n = NULL ;
}
2011-06-01 21:25:52 +04:00
prior = page - > freelist ;
counters = page - > counters ;
2015-11-21 02:57:46 +03:00
set_freepointer ( s , tail , prior ) ;
2011-06-01 21:25:52 +04:00
new . counters = counters ;
was_frozen = new . frozen ;
2015-11-21 02:57:46 +03:00
new . inuse - = cnt ;
2012-08-15 19:02:40 +04:00
if ( ( ! new . inuse | | ! prior ) & & ! was_frozen ) {
2011-08-10 01:12:27 +04:00
2014-01-10 16:23:49 +04:00
if ( kmem_cache_has_cpu_partial ( s ) & & ! prior ) {
2011-08-10 01:12:27 +04:00
/*
2013-07-15 05:05:29 +04:00
* Slab was on no list before and will be
* partially empty .
* We can defer the list move and instead
* freeze it .
2011-08-10 01:12:27 +04:00
*/
new . frozen = 1 ;
2014-01-10 16:23:49 +04:00
} else { /* Needs to be taken off a list */
2011-08-10 01:12:27 +04:00
2014-12-11 02:42:13 +03:00
n = get_node ( s , page_to_nid ( page ) ) ;
2011-08-10 01:12:27 +04:00
/*
* Speculatively acquire the list_lock .
* If the cmpxchg does not succeed then we may
* drop the list_lock without any processing .
*
* Otherwise the list_lock will synchronize with
* other processors updating the list of slabs .
*/
spin_lock_irqsave ( & n - > list_lock , flags ) ;
}
2011-06-01 21:25:52 +04:00
}
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:52 +04:00
} while ( ! cmpxchg_double_slab ( s , page ,
prior , counters ,
2015-11-21 02:57:46 +03:00
head , new . counters ,
2011-06-01 21:25:52 +04:00
" __slab_free " ) ) ;
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:52 +04:00
if ( likely ( ! n ) ) {
2011-08-10 01:12:27 +04:00
/*
* If we just froze the page then put it onto the
* per cpu partial list .
*/
2012-02-03 19:34:56 +04:00
if ( new . frozen & & ! was_frozen ) {
2011-08-10 01:12:27 +04:00
put_cpu_partial ( s , page , 1 ) ;
2012-02-03 19:34:56 +04:00
stat ( s , CPU_PARTIAL_FREE ) ;
}
2011-08-10 01:12:27 +04:00
/*
2011-06-01 21:25:52 +04:00
* The list lock was not taken therefore no list
* activity can be necessary .
*/
2014-12-11 02:42:13 +03:00
if ( was_frozen )
stat ( s , FREE_FROZEN ) ;
return ;
}
2007-05-07 01:49:36 +04:00
2014-07-03 02:22:35 +04:00
if ( unlikely ( ! new . inuse & & n - > nr_partial > = s - > min_partial ) )
2012-08-15 19:02:40 +04:00
goto slab_empty ;
2007-05-07 01:49:36 +04:00
/*
2012-08-15 19:02:40 +04:00
* Objects left in the slab . If it was not on the partial list before
* then add it .
2007-05-07 01:49:36 +04:00
*/
2013-06-19 09:05:52 +04:00
if ( ! kmem_cache_has_cpu_partial ( s ) & & unlikely ( ! prior ) ) {
if ( kmem_cache_debug ( s ) )
2014-01-10 16:23:49 +04:00
remove_full ( s , n , page ) ;
2012-08-15 19:02:40 +04:00
add_partial ( n , page , DEACTIVATE_TO_TAIL ) ;
stat ( s , FREE_ADD_PARTIAL ) ;
2008-02-08 04:47:41 +03:00
}
2011-06-01 21:25:55 +04:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2007-05-07 01:49:36 +04:00
return ;
slab_empty :
2008-03-02 00:40:44 +03:00
if ( prior ) {
2007-05-07 01:49:36 +04:00
/*
2011-08-08 20:16:56 +04:00
* Slab on the partial list .
2007-05-07 01:49:36 +04:00
*/
2011-06-01 21:25:50 +04:00
remove_partial ( n , page ) ;
2009-12-19 01:26:23 +03:00
stat ( s , FREE_REMOVE_PARTIAL ) ;
2014-01-10 16:23:49 +04:00
} else {
2011-08-08 20:16:56 +04:00
/* Slab must be on the full list */
2014-01-10 16:23:49 +04:00
remove_full ( s , n , page ) ;
}
2011-06-01 21:25:52 +04:00
2011-06-01 21:25:55 +04:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2009-12-19 01:26:23 +03:00
stat ( s , FREE_SLAB ) ;
2007-05-07 01:49:36 +04:00
discard_slab ( s , page ) ;
}
2007-05-10 14:15:16 +04:00
/*
* Fastpath with forced inlining to produce a kfree and kmem_cache_free that
* can perform fastpath freeing without additional function calls .
*
* The fastpath is only possible if we are freeing to the current cpu slab
* of this processor . This is typically the case if we have just allocated
* the item before .
*
* If fastpath is not possible then fall back to __slab_free where we deal
* with all sorts of special processing .
2015-11-21 02:57:46 +03:00
*
* Bulk free of a freelist with several objects ( all pointing to the
* same page ) is possible by specifying head and tail ptr , plus objects
* count ( cnt ) . Bulk free is indicated by the tail pointer being set .
2007-05-10 14:15:16 +04:00
*/
2015-11-21 02:57:46 +03:00
static __always_inline void slab_free ( struct kmem_cache * s , struct page * page ,
void * head , void * tail , int cnt ,
unsigned long addr )
2007-05-10 14:15:16 +04:00
{
2015-11-21 02:57:46 +03:00
void * tail_obj = tail ? : head ;
2007-10-16 12:26:05 +04:00
struct kmem_cache_cpu * c ;
2011-02-25 20:38:54 +03:00
unsigned long tid ;
2008-01-08 10:20:30 +03:00
2015-11-21 02:57:46 +03:00
slab_free_freelist_hook ( s , head , tail ) ;
2010-08-20 21:37:16 +04:00
2011-02-25 20:38:54 +03:00
redo :
/*
* Determine the current cpu's per cpu slab .
* The cpu may change afterward . However that does not matter since
* data is retrieved via this pointer . If we are on the same cpu
2015-09-05 01:45:31 +03:00
* during the cmpxchg then the free will succeed .
2011-02-25 20:38:54 +03:00
*/
2015-02-11 01:09:32 +03:00
do {
tid = this_cpu_read ( s - > cpu_slab - > tid ) ;
c = raw_cpu_ptr ( s - > cpu_slab ) ;
2015-03-26 01:55:23 +03:00
} while ( IS_ENABLED ( CONFIG_PREEMPT ) & &
unlikely ( tid ! = READ_ONCE ( c - > tid ) ) ) ;
2010-08-20 21:37:16 +04:00
2015-02-11 01:09:32 +03:00
/* Same with comment on barrier() in slab_alloc_node() */
barrier ( ) ;
2010-08-20 21:37:16 +04:00
2011-05-18 01:29:31 +04:00
if ( likely ( page = = c - > page ) ) {
2015-11-21 02:57:46 +03:00
set_freepointer ( s , tail_obj , c - > freelist ) ;
2011-02-25 20:38:54 +03:00
2011-12-22 21:58:51 +04:00
if ( unlikely ( ! this_cpu_cmpxchg_double (
2011-02-25 20:38:54 +03:00
s - > cpu_slab - > freelist , s - > cpu_slab - > tid ,
c - > freelist , tid ,
2015-11-21 02:57:46 +03:00
head , next_tid ( tid ) ) ) ) {
2011-02-25 20:38:54 +03:00
note_cmpxchg_failure ( " slab_free " , s , tid ) ;
goto redo ;
}
2009-12-19 01:26:23 +03:00
stat ( s , FREE_FASTPATH ) ;
2007-05-10 14:15:16 +04:00
} else
2015-11-21 02:57:46 +03:00
__slab_free ( s , page , head , tail_obj , cnt , addr ) ;
2007-05-10 14:15:16 +04:00
}
2007-05-07 01:49:36 +04:00
void kmem_cache_free ( struct kmem_cache * s , void * x )
{
2012-12-19 02:22:46 +04:00
s = cache_from_obj ( s , x ) ;
if ( ! s )
2012-09-05 03:06:14 +04:00
return ;
2015-11-21 02:57:46 +03:00
slab_free ( s , virt_to_head_page ( x ) , x , NULL , 1 , _RET_IP_ ) ;
2009-03-23 16:12:24 +03:00
trace_kmem_cache_free ( _RET_IP_ , x ) ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kmem_cache_free ) ;
slub: optimize bulk slowpath free by detached freelist
This change focuses on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortizes the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelists can co-exist side-by-side in the same slab-page,
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects. The detached
freelist is allocated on the stack of the function call
kmem_cache_free_bulk. Thus, the freelist head pointer is not visible
to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned in
a progressive manner with a limited look-ahead facility.
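The construction described above can be sketched in userspace C. This is a simplified model, not the kernel routine: `struct detached_list`, `page_of()` and `build_detached()` are hypothetical names, the "page" of an object is assumed to be its address rounded down to a 4096-byte boundary, the link is stored in the first word of each object, and the look-ahead limit mentioned above is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: objects never straddle a page */

/* Lives on the caller's stack, so the head is invisible to other CPUs. */
struct detached_list {
	uintptr_t page;	/* page base shared by all linked objects */
	void *head;	/* most recently linked object */
	void *tail;	/* first linked object (end of the chain) */
	int cnt;	/* number of linked objects */
};

static uintptr_t page_of(const void *obj)
{
	return (uintptr_t)obj & ~(uintptr_t)(MODEL_PAGE_SIZE - 1);
}

/* Scan p[0..size) from the end. The last non-NULL entry seeds the list;
 * every other entry on the same page is linked in front of it and its
 * slot is set to NULL. Returns how many leading entries of p[] still
 * need another pass (0 when nothing is left). */
static size_t build_detached(size_t size, void **p, struct detached_list *df)
{
	size_t remaining = 0;
	void *obj = NULL;

	assert(p != NULL && df != NULL);
	df->head = df->tail = NULL;
	df->cnt = 0;

	while (size && !(obj = p[size - 1]))
		size--;			/* skip already processed slots */
	if (!obj)
		return 0;

	size--;
	*(void **)obj = NULL;		/* the link lives inside the object */
	df->page = page_of(obj);
	df->head = df->tail = obj;
	df->cnt = 1;
	p[size] = NULL;

	while (size) {
		obj = p[--size];
		if (!obj)
			continue;
		if (page_of(obj) == df->page) {
			*(void **)obj = df->head;	/* no atomics needed */
			df->head = obj;
			df->cnt++;
			p[size] = NULL;
		} else if (!remaining) {
			remaining = size + 1;	/* highest leftover index + 1 */
		}
	}
	return remaining;
}
```

A caller would loop, passing the returned value back in as the new `size`, and hand each completed list (head, tail, cnt) to a single synchronized free operation.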
Kmem debug support is handled in the call to slab_free().
Notice that kmem_cache_free_bulk no longer needs to disable IRQs. This
only slowed down single-object bulk free by approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel has been booted with
"slab_nomerge" (this also improves performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared against current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
Performance with normal SLUB merging is significantly slower for
larger bulk sizes. The slab_nomerge advantage is believed to be
(primarily) an effect of not having to share the per-CPU data
structures, as tuning the per-CPU size can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
struct detached_freelist {
2015-09-05 01:45:43 +03:00
struct page * page ;
2015-11-21 02:57:49 +03:00
void * tail ;
void * freelist ;
int cnt ;
} ;
2015-09-05 01:45:43 +03:00
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared against current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
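The percentages in both tables appear to be the relative change in cycles, (before - after) / before, computed from the rounded cycle counts. A small sketch of that arithmetic, with `improvement_pct()` as a hypothetical helper name:

```c
#include <assert.h>

/* Relative improvement in percent; a negative result means a slowdown. */
static double improvement_pct(double before_cycles, double after_cycles)
{
	assert(before_cycles > 0.0);
	return (before_cycles - after_cycles) / before_cycles * 100.0;
}
```

For example, bulk size 1 against fallback bulking gives (62 - 49) / 62, about 21.0%, while against the current in-kernel bulking it gives (46 - 49) / 46, about -6.5%, i.e. a 3 cycle regression.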
Performance with normal SLUB merging is significantly slower for
larger bulk sizes. This is believed to be (primarily) because merged
caches have to share the per-CPU data structures; tuning the per-CPU
size can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
/*
 * This function progressively scans the array of objects to free (with
 * a limited look-ahead) and extracts objects belonging to the same
 * page. It builds a detached freelist directly within the given
 * page/objects. This can happen without any need for
 * synchronization, because the objects are owned by the running process.
 * The freelist is built up as a singly linked list in the objects.
 * The idea is that this detached freelist can then be bulk
 * transferred to the real freelist(s), requiring only a single
 * synchronization primitive. Look-ahead in the array is limited for
 * performance reasons.
 */
static int build_detached_freelist(struct kmem_cache *s, size_t size,
				   void **p, struct detached_freelist *df)
{
	size_t first_skipped_index = 0;
	int lookahead = 3;
	void *object;
2015-09-05 01:45:43 +03:00
2015-11-21 02:57:49 +03:00
	/* Always re-init detached_freelist */
	df->page = NULL;
2015-09-05 01:45:43 +03:00
2015-11-21 02:57:49 +03:00
	do {
		object = p[--size];
	} while (!object && size);
2015-09-05 01:45:45 +03:00
2015-11-21 02:57:49 +03:00
	if (!object)
		return 0;
2015-09-05 01:45:43 +03:00
2015-11-21 02:57:49 +03:00
	/* Start new detached freelist */
	set_freepointer(s, object, NULL);
	df->page = virt_to_head_page(object);
	df->tail = object;
	df->freelist = object;
	p[size] = NULL; /* mark object processed */
	df->cnt = 1;
	while (size) {
		object = p[--size];
		if (!object)
			continue; /* Skip processed objects */
		/* df->page is always set at this point */
		if (df->page == virt_to_head_page(object)) {
			/* Opportunity to build freelist */
			set_freepointer(s, object, df->freelist);
			df->freelist = object;
			df->cnt++;
			p[size] = NULL; /* mark object processed */
			continue;
2015-09-05 01:45:43 +03:00
		}
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
bulk - fallback bulk - improvement with this patch
1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
Performance data, compared current in-kernel bulking:
bulk - curr in-kernel - improvement with this patch
1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
With normal SLUB merging, performance is significantly slower for
larger bulk sizes. The slab_nomerge advantage is believed to come
(primarily) from not having to share the per-CPU data structures, as
tuning the per-CPU size can achieve similar performance.
bulk - slab_nomerge - normal SLUB merge
1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
Joint work with Alexander Duyck.
[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
[akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-21 02:57:49 +03:00
/* Limit look ahead search */
if ( ! - - lookahead )
break ;
if ( ! first_skipped_index )
first_skipped_index = size + 1 ;
2015-09-05 01:45:43 +03:00
}
2015-11-21 02:57:49 +03:00
return first_skipped_index ;
}
/* Note that interrupts must be enabled when calling this function. */
2015-11-21 02:57:55 +03:00
void kmem_cache_free_bulk ( struct kmem_cache * orig_s , size_t size , void * * p )
2015-11-21 02:57:49 +03:00
{
if ( WARN_ON ( ! size ) )
return ;
do {
struct detached_freelist df ;
2015-11-21 02:57:55 +03:00
struct kmem_cache * s ;
/* Support for memcg */
s = cache_from_obj ( orig_s , p [ size - 1 ] ) ;
2015-11-21 02:57:49 +03:00
size = build_detached_freelist ( s , size , p , & df ) ;
if ( unlikely ( ! df . page ) )
continue ;
slab_free ( s , df . page , df . freelist , df . tail , df . cnt , _RET_IP_ ) ;
} while ( likely ( size ) ) ;
2015-09-05 01:45:34 +03:00
}
EXPORT_SYMBOL ( kmem_cache_free_bulk ) ;
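The loop in kmem_cache_free_bulk() repeatedly asks build_detached_freelist() for one batch of same-page objects and hands that batch to slab_free(). The following is a minimal user-space sketch of how such a batch could be collected, not the kernel implementation. Every `sk_` name is hypothetical, the page of an object is assumed to be its address rounded down to a 4096-byte boundary, and the free pointer is assumed to live at offset 0 of a free object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE 4096u	/* assumed page size for this sketch */

struct sk_detached_freelist {
	uintptr_t page;		/* page base address; 0 when nothing was collected */
	void *freelist;		/* head of the private list */
	void *tail;		/* last object in the list */
	int cnt;		/* number of linked objects */
};

/* Assumption: an object's page is its address rounded down to SK_PAGE_SIZE. */
static uintptr_t sk_page_of(const void *obj)
{
	return (uintptr_t)obj & ~(uintptr_t)(SK_PAGE_SIZE - 1);
}

/* Assumption: the free pointer is stored at offset 0 of a free object. */
static void sk_set_freepointer(void *obj, void *next)
{
	memcpy(obj, &next, sizeof(next));
}

static void *sk_get_freepointer(const void *obj)
{
	void *next;

	memcpy(&next, obj, sizeof(next));
	return next;
}

/*
 * Scan p[] from the end, linking every object that shares a page with the
 * last unprocessed entry. Consumed slots are set to NULL. Returns the array
 * size the caller should use for the next round (0 when nothing is left).
 */
static size_t sk_build_detached_freelist(size_t size, void **p,
					 struct sk_detached_freelist *df)
{
	size_t first_skipped_index = 0;
	int lookahead = 3;
	void *object;

	assert(size > 0);
	df->page = 0;

	/* The last unprocessed object seeds the list. */
	do {
		object = p[--size];
	} while (!object && size);
	if (!object)
		return 0;

	sk_set_freepointer(object, NULL);
	df->page = sk_page_of(object);
	df->tail = object;
	df->freelist = object;
	df->cnt = 1;
	p[size] = NULL;			/* mark as processed */

	while (size) {
		object = p[--size];
		if (!object)
			continue;	/* already processed */

		if (sk_page_of(object) == df->page) {
			/* Same page: push onto the private list. */
			sk_set_freepointer(object, df->freelist);
			df->freelist = object;
			df->cnt++;
			p[size] = NULL;
			continue;
		}

		/* Limit look ahead search */
		if (!--lookahead)
			break;
		if (!first_skipped_index)
			first_skipped_index = size + 1;
	}
	return first_skipped_index;
}
```

The list is built only through writes into objects the caller already owns, and its head lives in the caller's `df`, so no other CPU can observe it until the batch is handed over in one step.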
2015-09-05 01:45:37 +03:00
/* Note that interrupts must be enabled when calling this function. */
2015-11-21 02:57:58 +03:00
int kmem_cache_alloc_bulk ( struct kmem_cache * s , gfp_t flags , size_t size ,
void * * p )
2015-09-05 01:45:34 +03:00
{
2015-09-05 01:45:37 +03:00
struct kmem_cache_cpu * c ;
int i ;
2015-11-21 02:57:52 +03:00
/* memcg and kmem_cache debug support */
s = slab_pre_alloc_hook ( s , flags ) ;
if ( unlikely ( ! s ) )
return false ;
2015-09-05 01:45:37 +03:00
/*
* Drain objects in the per cpu slab , while disabling local
* IRQs , which protects against PREEMPT and interrupt
* handlers invoking the normal fastpath .
*/
local_irq_disable ( ) ;
c = this_cpu_ptr ( s - > cpu_slab ) ;
for ( i = 0 ; i < size ; i + + ) {
void * object = c - > freelist ;
2015-09-05 01:45:40 +03:00
if ( unlikely ( ! object ) ) {
/*
* Invoking the slow path likely has the side - effect
* of re - populating the per CPU c - > freelist
*/
2015-11-21 02:57:38 +03:00
p [ i ] = ___slab_alloc ( s , flags , NUMA_NO_NODE ,
2015-09-05 01:45:40 +03:00
_RET_IP_ , c ) ;
2015-11-21 02:57:38 +03:00
if ( unlikely ( ! p [ i ] ) )
goto error ;
2015-09-05 01:45:40 +03:00
c = this_cpu_ptr ( s - > cpu_slab ) ;
continue ; /* goto for-loop */
}
2015-09-05 01:45:37 +03:00
c - > freelist = get_freepointer ( s , object ) ;
p [ i ] = object ;
}
c - > tid = next_tid ( c - > tid ) ;
local_irq_enable ( ) ;
/* Clear memory outside IRQ disabled fastpath loop */
if ( unlikely ( flags & __GFP_ZERO ) ) {
int j ;
for ( j = 0 ; j < i ; j + + )
memset ( p [ j ] , 0 , s - > object_size ) ;
}
2015-11-21 02:57:52 +03:00
/* memcg and kmem_cache debug support */
slab_post_alloc_hook ( s , flags , size , p ) ;
2015-11-21 02:57:58 +03:00
return i ;
2015-11-21 02:57:38 +03:00
error :
local_irq_enable ( ) ;
2015-11-21 02:57:52 +03:00
slab_post_alloc_hook ( s , flags , i , p ) ;
__kmem_cache_free_bulk ( s , i , p ) ;
2015-11-21 02:57:58 +03:00
return 0 ;
2015-09-05 01:45:34 +03:00
}
EXPORT_SYMBOL ( kmem_cache_alloc_bulk ) ;
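kmem_cache_alloc_bulk() above is all-or-nothing: it returns the number of objects on success and, if the slow path fails part-way, frees what it already took and returns 0. The following user-space sketch models only that contract. The `sk_` names are hypothetical, and the `reserve` chain is an invented stand-in for whatever the slow path could still supply, so this is not how ___slab_alloc() works internally.

```c
#include <stddef.h>

struct sk_obj {
	struct sk_obj *next;		/* free pointer at offset 0 (assumption) */
};

struct sk_cache {
	struct sk_obj *freelist;	/* stands in for the per-CPU c->freelist */
	struct sk_obj *reserve;		/* hypothetical supply used by the "slow path" */
};

/* Push n objects from p[] back onto the freelist. */
static void sk_free_bulk(struct sk_cache *c, size_t n, void **p)
{
	while (n--) {
		struct sk_obj *o = p[n];

		o->next = c->freelist;
		c->freelist = o;
	}
}

/* Returns size on success, or 0 after giving back any partial result. */
static size_t sk_alloc_bulk(struct sk_cache *c, size_t size, void **p)
{
	size_t i;

	for (i = 0; i < size; i++) {
		if (!c->freelist) {
			/* "Slow path": refill from the reserve, or fail. */
			c->freelist = c->reserve;
			c->reserve = NULL;
			if (!c->freelist) {
				sk_free_bulk(c, i, p);
				return 0;
			}
		}
		p[i] = c->freelist;
		c->freelist = c->freelist->next;
	}
	return i;
}
```

A caller therefore only needs to check for a zero return; it never has to clean up a partially filled array.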
2007-05-07 01:49:36 +04:00
/*
2007-05-09 13:32:39 +04:00
* Object placement in a slab is made very easy because we always start at
* offset 0. If we tune the size of the object to the alignment then we can
* get the required alignment by putting one properly sized object after
* another .
2007-05-07 01:49:36 +04:00
*
* Notice that the allocation order determines the sizes of the per cpu
* caches . Each processor always has one slab available for allocations .
* Increasing the allocation order reduces the number of times that slabs
2007-05-09 13:32:39 +04:00
* must be moved on and off the partial lists and is therefore a factor in
2007-05-07 01:49:36 +04:00
* locking overhead .
*/
/*
* Minimum / Maximum order of slab pages . This influences locking overhead
* and slab fragmentation . A higher order reduces the number of partial slabs
* and increases the number of allocations possible without having to
* take the list_lock .
*/
static int slub_min_order ;
2008-04-14 20:11:41 +04:00
static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER ;
2008-04-14 20:11:41 +04:00
static int slub_min_objects ;
2007-05-07 01:49:36 +04:00
/*
* Calculate the order of allocation given a slab object size .
*
2007-05-09 13:32:39 +04:00
* The order of allocation has significant impact on performance and other
* system components . Generally order 0 allocations should be preferred since
* order 0 does not cause fragmentation in the page allocator . Larger objects
* may be problematic to put into order 0 slabs because there may be too much
2008-04-14 20:13:29 +04:00
* unused space left . We go to a higher order if more than 1 / 16 th of the slab
2007-05-09 13:32:39 +04:00
* would be wasted .
*
* In order to reach satisfactory performance we must ensure that a minimum
* number of objects is in one slab . Otherwise we may generate too much
* activity on the partial lists which requires taking the list_lock . This is
* less of a concern for large slabs , though , which are rarely used .
2007-05-07 01:49:36 +04:00
*
2007-05-09 13:32:39 +04:00
* slub_max_order specifies the order where we begin to stop considering the
* number of objects in a slab as critical . If we reach slub_max_order then
* we try to keep the page order as low as possible . So we accept more waste
* of space in favor of a small page order .
2007-05-07 01:49:36 +04:00
*
2007-05-09 13:32:39 +04:00
* Higher order allocations also allow the placement of more objects in a
* slab and thereby reduce object handling overhead . If the user has
* requested a higher minimum order then we start with that one instead of
* the smallest order which will fit the object .
2007-05-07 01:49:36 +04:00
*/
2007-05-09 13:32:46 +04:00
static inline int slab_order ( int size , int min_objects ,
2011-03-10 10:21:48 +03:00
int max_order , int fract_leftover , int reserved )
2007-05-07 01:49:36 +04:00
{
int order ;
int rem ;
2007-07-17 15:03:20 +04:00
int min_order = slub_min_order ;
2007-05-07 01:49:36 +04:00
2011-03-10 10:21:48 +03:00
if ( order_objects ( min_order , size , reserved ) > MAX_OBJS_PER_PAGE )
2008-10-22 23:00:38 +04:00
return get_order ( size * MAX_OBJS_PER_PAGE ) - 1 ;
2008-04-14 20:11:30 +04:00
2015-11-06 05:45:51 +03:00
for ( order = max ( min_order , get_order ( min_objects * size + reserved ) ) ;
2007-05-09 13:32:46 +04:00
order < = max_order ; order + + ) {
2007-05-07 01:49:36 +04:00
2007-05-09 13:32:46 +04:00
unsigned long slab_size = PAGE_SIZE < < order ;
2007-05-07 01:49:36 +04:00
2011-03-10 10:21:48 +03:00
rem = ( slab_size - reserved ) % size ;
2007-05-07 01:49:36 +04:00
2007-05-09 13:32:46 +04:00
if ( rem < = slab_size / fract_leftover )
2007-05-07 01:49:36 +04:00
break ;
}
2007-05-09 13:32:39 +04:00
2007-05-07 01:49:36 +04:00
return order ;
}
2011-03-10 10:21:48 +03:00
static inline int calculate_order ( int size , int reserved )
2007-05-09 13:32:46 +04:00
{
int order ;
int min_objects ;
int fraction ;
2009-02-12 19:00:17 +03:00
int max_objects ;
2007-05-09 13:32:46 +04:00
/*
* Attempt to find best configuration for a slab . This
* works by first attempting to generate a layout with
* the best configuration and backing off gradually .
*
2015-11-06 05:45:46 +03:00
* First we increase the acceptable waste in a slab . Then
2007-05-09 13:32:46 +04:00
* we reduce the minimum objects required in a slab .
*/
min_objects = slub_min_objects ;
2008-04-14 20:11:41 +04:00
if ( ! min_objects )
min_objects = 4 * ( fls ( nr_cpu_ids ) + 1 ) ;
2011-03-10 10:21:48 +03:00
max_objects = order_objects ( slub_max_order , size , reserved ) ;
2009-02-12 19:00:17 +03:00
min_objects = min ( min_objects , max_objects ) ;
2007-05-09 13:32:46 +04:00
while ( min_objects > 1 ) {
2008-04-14 20:13:29 +04:00
fraction = 16 ;
2007-05-09 13:32:46 +04:00
while ( fraction > = 4 ) {
order = slab_order ( size , min_objects ,
2011-03-10 10:21:48 +03:00
slub_max_order , fraction , reserved ) ;
2007-05-09 13:32:46 +04:00
if ( order < = slub_max_order )
return order ;
fraction / = 2 ;
}
2009-08-19 22:44:13 +04:00
min_objects - - ;
2007-05-09 13:32:46 +04:00
}
/*
* We were unable to place multiple objects in a slab . Now
* let's see if we can place a single object there .
*/
2011-03-10 10:21:48 +03:00
order = slab_order ( size , 1 , slub_max_order , 1 , reserved ) ;
2007-05-09 13:32:46 +04:00
if ( order < = slub_max_order )
return order ;
/*
* Doh this slab cannot be placed using slub_max_order .
*/
2011-03-10 10:21:48 +03:00
order = slab_order ( size , 1 , MAX_ORDER , 1 , reserved ) ;
2009-04-23 10:58:22 +04:00
if ( order < MAX_ORDER )
2007-05-09 13:32:46 +04:00
return order ;
return - ENOSYS ;
}
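The back-off in calculate_order() can be tried out in user space. The sketch below is a simplified model under stated assumptions, not the kernel code: it uses a 4096-byte page, a maximum slab order of 3, a MAX_ORDER of 11, drops the `reserved` and slub_min_order handling as well as the MAX_OBJS_PER_PAGE check, and takes `min_objects` as a parameter instead of deriving it from nr_cpu_ids. All `sk_` names are hypothetical.

```c
#include <assert.h>

#define SK_PAGE_SIZE		4096UL	/* assumed page size */
#define SK_SLUB_MAX_ORDER	3	/* assumed value of slub_max_order */
#define SK_MAX_ORDER		11	/* assumed value of MAX_ORDER */

/* Smallest order such that (SK_PAGE_SIZE << order) >= size. */
static int sk_get_order(unsigned long size)
{
	int order = 0;

	while ((SK_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/*
 * Lowest order that holds min_objects and wastes at most
 * 1/fract_leftover of the slab; max_order + 1 if none qualifies.
 */
static int sk_slab_order(unsigned long size, int min_objects,
			 int max_order, int fract_leftover)
{
	int order;

	for (order = sk_get_order(min_objects * size);
	     order <= max_order; order++) {
		unsigned long slab_size = SK_PAGE_SIZE << order;
		unsigned long rem = slab_size % size;

		if (rem <= slab_size / fract_leftover)
			break;
	}
	return order;
}

static int sk_calculate_order(unsigned long size, int min_objects)
{
	int order, fraction, max_objects;

	assert(size > 0);
	max_objects = (int)((SK_PAGE_SIZE << SK_SLUB_MAX_ORDER) / size);
	if (min_objects > max_objects)
		min_objects = max_objects;

	/* First relax the acceptable waste, then the object count. */
	while (min_objects > 1) {
		for (fraction = 16; fraction >= 4; fraction /= 2) {
			order = sk_slab_order(size, min_objects,
					      SK_SLUB_MAX_ORDER, fraction);
			if (order <= SK_SLUB_MAX_ORDER)
				return order;
		}
		min_objects--;
	}

	/* A single object per slab, still within the preferred maximum. */
	order = sk_slab_order(size, 1, SK_SLUB_MAX_ORDER, 1);
	if (order <= SK_SLUB_MAX_ORDER)
		return order;

	/* Last resort: any order the page allocator can serve. */
	order = sk_slab_order(size, 1, SK_MAX_ORDER, 1);
	if (order < SK_MAX_ORDER)
		return order;
	return -1;
}
```

Under these assumptions a 3000-byte object with `min_objects` of 8 fails the 1/16 waste limit at order 3 (2768 bytes left over against a limit of 2048) and is accepted once the limit is relaxed to 1/8.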
2008-08-05 10:28:47 +04:00
static void
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( struct kmem_cache_node * n )
2007-05-07 01:49:36 +04:00
{
n - > nr_partial = 0 ;
spin_lock_init ( & n - > list_lock ) ;
INIT_LIST_HEAD ( & n - > partial ) ;
2007-07-17 15:03:32 +04:00
# ifdef CONFIG_SLUB_DEBUG
2008-04-14 19:53:02 +04:00
atomic_long_set ( & n - > nr_slabs , 0 ) ;
2008-09-11 23:25:41 +04:00
atomic_long_set ( & n - > total_objects , 0 ) ;
2007-05-07 01:49:42 +04:00
INIT_LIST_HEAD ( & n - > full ) ;
2007-07-17 15:03:32 +04:00
# endif
2007-05-07 01:49:36 +04:00
}
2010-08-20 21:37:13 +04:00
static inline int alloc_kmem_cache_cpus ( struct kmem_cache * s )
2007-10-16 12:26:08 +04:00
{
2010-08-20 21:37:14 +04:00
BUILD_BUG_ON ( PERCPU_DYNAMIC_EARLY_SIZE <
2013-01-10 23:14:19 +04:00
KMALLOC_SHIFT_HIGH * sizeof ( struct kmem_cache_cpu ) ) ;
2007-10-16 12:26:08 +04:00
2011-02-25 20:38:54 +03:00
/*
2011-06-02 18:19:41 +04:00
* Must align to double word boundary for the double cmpxchg
* instructions to work ; see __pcpu_double_call_return_bool ( ) .
2011-02-25 20:38:54 +03:00
*/
2011-06-02 18:19:41 +04:00
s - > cpu_slab = __alloc_percpu ( sizeof ( struct kmem_cache_cpu ) ,
2 * sizeof ( void * ) ) ;
2011-02-25 20:38:54 +03:00
if ( ! s - > cpu_slab )
return 0 ;
init_kmem_cache_cpus ( s ) ;
2007-10-16 12:26:08 +04:00
2011-02-25 20:38:54 +03:00
return 1 ;
2007-10-16 12:26:08 +04:00
}
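The alignment argument passed to __alloc_percpu() above is twice the pointer size because a double-word compare-and-exchange needs its two-word operand naturally aligned. A minimal user-space illustration of that constraint follows, assuming a hypothetical two-word structure `sk_cpu_slab` and using C11 aligned_alloc() in place of the per-CPU allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pair of words that would be updated together. */
struct sk_cpu_slab {
	void *freelist;
	uintptr_t tid;
};

/* The pair must occupy exactly two pointer-sized words. */
_Static_assert(sizeof(struct sk_cpu_slab) == 2 * sizeof(void *),
	       "sk_cpu_slab must be two words");

/* Allocate one instance aligned to a double-word boundary. */
static struct sk_cpu_slab *sk_alloc_cpu_slab(void)
{
	const size_t align = 2 * sizeof(void *);
	struct sk_cpu_slab *c = aligned_alloc(align, sizeof(*c));

	if (!c)
		return NULL;
	assert((uintptr_t)c % align == 0);
	c->freelist = NULL;
	c->tid = 0;
	return c;
}
```

The returned pointer is released with free() when no longer needed.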
2010-08-20 21:37:15 +04:00
static struct kmem_cache * kmem_cache_node ;
2007-05-07 01:49:36 +04:00
/*
* No kmalloc_node yet so do it by hand . We know that this is the first
* slab on the node for this slabcache . There are no concurrent accesses
* possible .
*
2013-11-08 16:47:37 +04:00
* Note that this function only works on the kmem_cache_node cache , i.e.
* when allocating a struct kmem_cache_node . This is used for bootstrapping
2007-10-16 12:26:08 +04:00
* memory on a fresh node that has no slab structures yet .
2007-05-07 01:49:36 +04:00
*/
2010-08-20 21:37:13 +04:00
static void early_kmem_cache_node_alloc ( int node )
2007-05-07 01:49:36 +04:00
{
struct page * page ;
struct kmem_cache_node * n ;
2010-08-20 21:37:15 +04:00
BUG_ON ( kmem_cache_node - > size < sizeof ( struct kmem_cache_node ) ) ;
2007-05-07 01:49:36 +04:00
2010-08-20 21:37:15 +04:00
page = new_slab ( kmem_cache_node , GFP_NOWAIT , node ) ;
2007-05-07 01:49:36 +04:00
BUG_ON ( ! page ) ;
2007-08-23 01:01:57 +04:00
if ( page_to_nid ( page ) ! = node ) {
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB: Unable to allocate memory from node %d \n " , node ) ;
pr_err ( " SLUB: Allocating a useless per node structure in order to be able to continue \n " ) ;
2007-08-23 01:01:57 +04:00
}
2007-05-07 01:49:36 +04:00
n = page - > freelist ;
BUG_ON ( ! n ) ;
2010-08-20 21:37:15 +04:00
page - > freelist = get_freepointer ( kmem_cache_node , n ) ;
2011-08-10 01:12:24 +04:00
page - > inuse = 1 ;
2011-06-01 21:25:46 +04:00
page - > frozen = 0 ;
2010-08-20 21:37:15 +04:00
kmem_cache_node - > node [ node ] = n ;
2007-07-17 15:03:32 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-09-29 16:15:01 +04:00
init_object ( kmem_cache_node , n , SLUB_RED_ACTIVE ) ;
2010-08-20 21:37:15 +04:00
init_tracking ( kmem_cache_node , n ) ;
2007-07-17 15:03:32 +04:00
# endif
2015-02-14 01:39:42 +03:00
kasan_kmalloc ( kmem_cache_node , n , sizeof ( struct kmem_cache_node ) ) ;
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( n ) ;
2010-08-20 21:37:15 +04:00
inc_slabs_node ( kmem_cache_node , node , page - > objects ) ;
2008-02-16 10:45:26 +03:00
2014-01-24 19:20:23 +04:00
/*
2014-02-11 02:25:46 +04:00
* No locks need to be taken here as it has just been
* initialized and there is no concurrent access .
2014-01-24 19:20:23 +04:00
*/
2014-02-11 02:25:46 +04:00
__add_partial ( n , page , DEACTIVATE_TO_HEAD ) ;
2007-05-07 01:49:36 +04:00
}
static void free_kmem_cache_nodes ( struct kmem_cache * s )
{
int node ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2007-05-07 01:49:36 +04:00
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
kmem_cache_free ( kmem_cache_node , n ) ;
2007-05-07 01:49:36 +04:00
s - > node [ node ] = NULL ;
}
}
2010-08-20 21:37:13 +04:00
static int init_kmem_cache_nodes ( struct kmem_cache * s )
2007-05-07 01:49:36 +04:00
{
int node ;
2007-10-16 12:25:33 +04:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-07 01:49:36 +04:00
struct kmem_cache_node * n ;
2010-05-22 01:41:35 +04:00
if ( slab_state = = DOWN ) {
2010-08-20 21:37:13 +04:00
early_kmem_cache_node_alloc ( node ) ;
2010-05-22 01:41:35 +04:00
continue ;
}
2010-08-20 21:37:15 +04:00
n = kmem_cache_alloc_node ( kmem_cache_node ,
2010-08-20 21:37:13 +04:00
GFP_KERNEL , node ) ;
2007-05-07 01:49:36 +04:00
2010-05-22 01:41:35 +04:00
if ( ! n ) {
free_kmem_cache_nodes ( s ) ;
return 0 ;
2007-05-07 01:49:36 +04:00
}
2010-05-22 01:41:35 +04:00
2007-05-07 01:49:36 +04:00
s - > node [ node ] = n ;
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( n ) ;
2007-05-07 01:49:36 +04:00
}
return 1 ;
}
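
init_kmem_cache_nodes() follows an allocate-or-roll-back pattern: if any per-node allocation fails, everything allocated so far is released and 0 is returned. A minimal userspace sketch of that pattern follows. The names sketch_cache, sketch_alloc_fn and the fixed node count are hypothetical, and malloc()/free() stand in for the slab calls.

```c
#include <stdlib.h>

#define SKETCH_NODES 4	/* assumed node count, for illustration only */

struct sketch_cache {
	void *node[SKETCH_NODES];
};

/* Hypothetical allocator hook so that a failure can be injected. */
typedef void *(*sketch_alloc_fn)(int node);

/* Release every per-node structure and clear the slots. */
static void sketch_free_nodes(struct sketch_cache *s)
{
	int node;

	for (node = 0; node < SKETCH_NODES; node++) {
		free(s->node[node]);	/* free(NULL) is a no-op */
		s->node[node] = NULL;
	}
}

/* Returns 1 on success, 0 after rolling back a partial setup. */
static int sketch_init_nodes(struct sketch_cache *s, sketch_alloc_fn alloc)
{
	int node;

	for (node = 0; node < SKETCH_NODES; node++) {
		void *n = alloc(node);

		if (!n) {
			sketch_free_nodes(s);
			return 0;
		}
		s->node[node] = n;
	}
	return 1;
}

/* Two example allocators: one always succeeds, one fails on node 2. */
static void *sketch_alloc_ok(int node)
{
	(void)node;
	return malloc(16);
}

static void *sketch_alloc_fail_on_2(int node)
{
	return node == 2 ? NULL : malloc(16);
}
```

The cache structure must start out zeroed so that the roll-back only frees slots that were actually filled.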
2009-02-25 10:16:35 +03:00
static void set_min_partial ( struct kmem_cache * s , unsigned long min )
2009-02-23 04:40:07 +03:00
{
if ( min < MIN_PARTIAL )
min = MIN_PARTIAL ;
else if ( min > MAX_PARTIAL )
min = MAX_PARTIAL ;
s - > min_partial = min ;
}
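
The clamp in set_min_partial() can be exercised in isolation. This is a minimal userspace sketch. It assumes bounds of 5 and 10 for MIN_PARTIAL and MAX_PARTIAL (both are defined outside this excerpt, so treat the values as placeholders) and uses a hypothetical sketch_ilog2() helper in place of the kernel's ilog2(). It mirrors the set_min_partial(s, ilog2(s->size) / 2) call made in kmem_cache_open().

```c
/* Assumed placeholder bounds; the real values are defined elsewhere. */
#define SKETCH_MIN_PARTIAL 5UL
#define SKETCH_MAX_PARTIAL 10UL

/* Same clamp as set_min_partial(), returning the value instead of storing it. */
static unsigned long sketch_clamp_min_partial(unsigned long min)
{
	if (min < SKETCH_MIN_PARTIAL)
		min = SKETCH_MIN_PARTIAL;
	else if (min > SKETCH_MAX_PARTIAL)
		min = SKETCH_MAX_PARTIAL;
	return min;
}

/* floor(log2(v)) for v > 0; hypothetical stand-in for ilog2(). */
static unsigned int sketch_ilog2(unsigned long v)
{
	unsigned int log = 0;

	while (v >>= 1)
		log++;
	return log;
}

/* min_partial as derived from the object size. */
static unsigned long sketch_min_partial_for_size(unsigned long size)
{
	return sketch_clamp_min_partial(sketch_ilog2(size) / 2);
}
```

Under these assumed bounds, small objects hit the lower bound and very large ones the upper bound, so only a narrow band of sizes changes the result.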
2007-05-07 01:49:36 +04:00
/*
* calculate_sizes ( ) determines the order and the distribution of data within
* a slab object .
*/
2008-04-14 20:11:41 +04:00
static int calculate_sizes ( struct kmem_cache * s , int forced_order )
2007-05-07 01:49:36 +04:00
{
unsigned long flags = s - > flags ;
2012-06-13 19:24:57 +04:00
unsigned long size = s - > object_size ;
2008-04-14 20:11:31 +04:00
int order ;
2007-05-07 01:49:36 +04:00
2008-02-16 10:45:25 +03:00
/*
* Round up object size to the next word boundary . We can only
* place the free pointer at word boundaries and this determines
* the possible location of the free pointer .
*/
size = ALIGN ( size , sizeof ( void * ) ) ;
# ifdef CONFIG_SLUB_DEBUG
2007-05-07 01:49:36 +04:00
/*
* Determine if we can poison the object itself . If the user of
* the slab may touch the object after free or before allocation
* then we should never poison the object itself .
*/
if ( ( flags & SLAB_POISON ) & & ! ( flags & SLAB_DESTROY_BY_RCU ) & &
2007-05-17 09:10:50 +04:00
! s - > ctor )
2007-05-07 01:49:36 +04:00
s - > flags | = __OBJECT_POISON ;
else
s - > flags & = ~ __OBJECT_POISON ;
/*
2007-05-09 13:32:39 +04:00
* If we are Redzoning then check if there is some space between the
2007-05-07 01:49:36 +04:00
* end of the object and the free pointer . If not then add an
2007-05-09 13:32:39 +04:00
* additional word to have some bytes to store Redzone information .
2007-05-07 01:49:36 +04:00
*/
2012-06-13 19:24:57 +04:00
if ( ( flags & SLAB_RED_ZONE ) & & size = = s - > object_size )
2007-05-07 01:49:36 +04:00
size + = sizeof ( void * ) ;
2007-05-09 13:32:44 +04:00
# endif
2007-05-07 01:49:36 +04:00
/*
2007-05-09 13:32:39 +04:00
* With that we have determined the number of bytes in actual use
* by the object . This is the potential offset to the free pointer .
2007-05-07 01:49:36 +04:00
*/
s - > inuse = size ;
if ( ( ( flags & ( SLAB_DESTROY_BY_RCU | SLAB_POISON ) ) | |
2007-05-17 09:10:50 +04:00
s - > ctor ) ) {
2007-05-07 01:49:36 +04:00
/*
* Relocate free pointer after the object if it is not
* permitted to overwrite the first word of the object on
* kmem_cache_free .
*
* This is the case if we do RCU , have a constructor or are
* poisoning the objects .
*/
s - > offset = size ;
size + = sizeof ( void * ) ;
}
2007-05-24 00:57:31 +04:00
# ifdef CONFIG_SLUB_DEBUG
2007-05-07 01:49:36 +04:00
if ( flags & SLAB_STORE_USER )
/*
* Need to store information about allocs and frees after
* the object .
*/
size + = 2 * sizeof ( struct track ) ;
2007-05-09 13:32:36 +04:00
if ( flags & SLAB_RED_ZONE )
2007-05-07 01:49:36 +04:00
/*
* Add some empty padding so that we can catch
* overwrites from earlier objects rather than let
* tracking information or the free pointer be
2008-12-30 00:14:56 +03:00
* corrupted if a user writes before the start
2007-05-07 01:49:36 +04:00
* of the object .
*/
size + = sizeof ( void * ) ;
2007-05-09 13:32:44 +04:00
# endif
2007-05-09 13:32:39 +04:00
2007-05-07 01:49:36 +04:00
/*
* SLUB stores one object immediately after another beginning from
* offset 0. In order to align the objects we have to simply size
* each object to conform to the alignment .
*/
2012-11-28 20:23:16 +04:00
size = ALIGN ( size , s - > align ) ;
2007-05-07 01:49:36 +04:00
s - > size = size ;
2008-04-14 20:11:41 +04:00
if ( forced_order > = 0 )
order = forced_order ;
else
2011-03-10 10:21:48 +03:00
order = calculate_order ( size , s - > reserved ) ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:31 +04:00
if ( order < 0 )
2007-05-07 01:49:36 +04:00
return 0 ;
2008-02-15 01:21:32 +03:00
s - > allocflags = 0 ;
2008-04-14 20:11:31 +04:00
if ( order )
2008-02-15 01:21:32 +03:00
s - > allocflags | = __GFP_COMP ;
if ( s - > flags & SLAB_CACHE_DMA )
2013-01-10 23:14:19 +04:00
s - > allocflags | = GFP_DMA ;
2008-02-15 01:21:32 +03:00
if ( s - > flags & SLAB_RECLAIM_ACCOUNT )
s - > allocflags | = __GFP_RECLAIMABLE ;
2007-05-07 01:49:36 +04:00
/*
* Determine the number of objects per slab
*/
2011-03-10 10:21:48 +03:00
s - > oo = oo_make ( order , size , s - > reserved ) ;
s - > min = oo_make ( get_order ( size ) , size , s - > reserved ) ;
2008-04-14 20:11:40 +04:00
if ( oo_objects ( s - > oo ) > oo_objects ( s - > max ) )
s - > max = s - > oo ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:31 +04:00
return ! ! oo_objects ( s - > oo ) ;
2007-05-07 01:49:36 +04:00
}
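
The size arithmetic in calculate_sizes() can be followed step by step in a simplified userspace sketch. It is an illustration under stated assumptions, not the kernel code. The word size is passed in explicitly instead of using sizeof(void *), the flag tests are reduced to plain booleans, the tracking area is given as a byte count rather than 2 * sizeof(struct track), and the order calculation is left out. All sketch_ names are hypothetical.

```c
#include <stddef.h>

/* Round x up to a multiple of a (a must be a power of two). */
#define SKETCH_ALIGN(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

struct sketch_layout {
	size_t inuse;	/* bytes in use by the object, incl. a red zone word */
	size_t offset;	/* free pointer offset (0 = overlays the object) */
	size_t size;	/* final per-object stride within the slab */
};

static struct sketch_layout sketch_calculate_sizes(size_t object_size,
		size_t word, size_t align, int relocate_free_pointer,
		int red_zone, size_t tracking_bytes)
{
	struct sketch_layout l = { 0, 0, 0 };
	/* Round the object up to the next word boundary. */
	size_t size = SKETCH_ALIGN(object_size, word);

	/* Red zoning needs at least one spare word after the object. */
	if (red_zone && size == object_size)
		size += word;
	l.inuse = size;

	/* RCU, poisoning or a constructor: keep the free pointer outside. */
	if (relocate_free_pointer) {
		l.offset = size;
		size += word;
	}

	/* Alloc/free tracking records live after the object. */
	size += tracking_bytes;

	/* Extra padding word to catch overwrites from earlier objects. */
	if (red_zone)
		size += word;

	/* Objects are packed back to back, so the stride carries alignment. */
	l.size = SKETCH_ALIGN(size, align);
	return l;
}
```

For a 20-byte object with 8-byte words and no debugging, the stride is 24 bytes and the free pointer overlays the object. Relocating the free pointer moves it to offset 24 and grows the stride to 32.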
2012-09-05 03:18:33 +04:00
static int kmem_cache_open ( struct kmem_cache * s , unsigned long flags )
2007-05-07 01:49:36 +04:00
{
2012-09-05 03:18:33 +04:00
s - > flags = kmem_cache_flags ( s - > size , flags , s - > name , s - > ctor ) ;
2011-03-10 10:21:48 +03:00
s - > reserved = 0 ;
2007-05-07 01:49:36 +04:00
2011-03-10 10:22:00 +03:00
if ( need_reserve_slab_rcu & & ( s - > flags & SLAB_DESTROY_BY_RCU ) )
s - > reserved = sizeof ( struct rcu_head ) ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:41 +04:00
if ( ! calculate_sizes ( s , - 1 ) )
2007-05-07 01:49:36 +04:00
goto error ;
2009-07-28 05:30:35 +04:00
if ( disable_higher_order_debug ) {
/*
* Disable debugging flags that store metadata if the min slab
* order increased .
*/
2012-06-13 19:24:57 +04:00
if ( get_order ( s - > size ) > get_order ( s - > object_size ) ) {
2009-07-28 05:30:35 +04:00
s - > flags & = ~ DEBUG_METADATA_FLAGS ;
s - > offset = 0 ;
if ( ! calculate_sizes ( s , - 1 ) )
goto error ;
}
}
2007-05-07 01:49:36 +04:00
2012-01-13 05:17:33 +04:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined ( CONFIG_HAVE_ALIGNED_STRUCT_PAGE )
2011-06-01 21:25:49 +04:00
if ( system_has_cmpxchg_double ( ) & & ( s - > flags & SLAB_DEBUG_FLAGS ) = = 0 )
/* Enable fast mode */
s - > flags | = __CMPXCHG_DOUBLE ;
# endif
2009-02-23 04:40:07 +03:00
/*
* The larger the object size is , the more pages we want on the partial
* list to avoid pounding the page allocator excessively .
*/
2011-08-10 01:12:27 +04:00
set_min_partial ( s , ilog2 ( s - > size ) / 2 ) ;
/*
* cpu_partial determines the maximum number of objects kept in the
* per cpu partial lists of a processor .
*
* Per cpu partial lists mainly contain slabs that just have one
* object freed . If they are used for allocation then they can be
* filled up again with minimal effort . The slab will never hit the
* per node partial lists and therefore no locking will be required .
*
* This setting also determines
*
* A ) The number of objects from per cpu partial slabs dumped to the
* per node list when we reach the limit .
2011-09-01 07:32:18 +04:00
* B ) The number of objects in cpu partial slabs to extract from the
2013-07-15 05:05:29 +04:00
* per node list when we run out of per cpu objects . We only fetch
* 50 % to keep some capacity around for frees .
2011-08-10 01:12:27 +04:00
*/
2013-06-19 09:05:52 +04:00
if ( ! kmem_cache_has_cpu_partial ( s ) )
2011-11-23 19:24:27 +04:00
s - > cpu_partial = 0 ;
else if ( s - > size > = PAGE_SIZE )
2011-08-10 01:12:27 +04:00
s - > cpu_partial = 2 ;
else if ( s - > size > = 1024 )
s - > cpu_partial = 6 ;
else if ( s - > size > = 256 )
s - > cpu_partial = 13 ;
else
s - > cpu_partial = 30 ;
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_NUMA
2008-08-19 17:51:22 +04:00
s - > remote_node_defrag_ratio = 1000 ;
2007-05-07 01:49:36 +04:00
# endif
2010-08-20 21:37:13 +04:00
if ( ! init_kmem_cache_nodes ( s ) )
2007-10-16 12:26:05 +04:00
goto error ;
2007-05-07 01:49:36 +04:00
2010-08-20 21:37:13 +04:00
if ( alloc_kmem_cache_cpus ( s ) )
2012-09-05 04:20:34 +04:00
return 0 ;
2009-12-19 01:26:22 +03:00
2007-10-16 12:26:08 +04:00
free_kmem_cache_nodes ( s ) ;
2007-05-07 01:49:36 +04:00
error :
if ( flags & SLAB_PANIC )
panic ( " Cannot create slab %s size=%lu realsize=%u "
" order=%u offset=%u flags=%lx \n " ,
2013-07-15 05:05:29 +04:00
s - > name , ( unsigned long ) s - > size , s - > size ,
oo_order ( s - > oo ) , s - > offset , flags ) ;
2012-09-05 04:20:34 +04:00
return - EINVAL ;
2007-05-07 01:49:36 +04:00
}
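
The cpu_partial tiers chosen in kmem_cache_open() amount to a small lookup on the object size. A minimal sketch, assuming a 4096-byte page and reducing kmem_cache_has_cpu_partial() to a boolean argument; sketch_cpu_partial() is a hypothetical name.

```c
#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */

/* Larger objects get fewer objects cached on the per cpu partial lists. */
static unsigned int sketch_cpu_partial(unsigned long size, int has_cpu_partial)
{
	if (!has_cpu_partial)
		return 0;
	if (size >= SKETCH_PAGE_SIZE)
		return 2;
	if (size >= 1024)
		return 6;
	if (size >= 256)
		return 13;
	return 30;
}
```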
2008-04-25 23:22:43 +04:00
static void list_slab_objects ( struct kmem_cache * s , struct page * page ,
const char * text )
{
# ifdef CONFIG_SLUB_DEBUG
void * addr = page_address ( page ) ;
void * p ;
2010-09-29 16:02:13 +04:00
unsigned long * map = kzalloc ( BITS_TO_LONGS ( page - > objects ) *
sizeof ( long ) , GFP_ATOMIC ) ;
2010-03-25 00:25:47 +03:00
if ( ! map )
return ;
2012-09-05 03:18:33 +04:00
slab_err ( s , page , text , s - > name ) ;
2008-04-25 23:22:43 +04:00
slab_lock ( page ) ;
2011-04-15 23:48:13 +04:00
get_map ( s , page , map ) ;
2008-04-25 23:22:43 +04:00
for_each_object ( p , s , addr , page - > objects ) {
if ( ! test_bit ( slab_index ( p , s , addr ) , map ) ) {
2014-06-05 03:06:34 +04:00
pr_err ( " INFO: Object 0x%p @offset=%tu \n " , p , p - addr ) ;
2008-04-25 23:22:43 +04:00
print_tracking ( s , p ) ;
}
}
slab_unlock ( page ) ;
2010-03-25 00:25:47 +03:00
kfree ( map ) ;
2008-04-25 23:22:43 +04:00
# endif
}
2007-05-07 01:49:36 +04:00
/*
2008-04-23 23:36:52 +04:00
* Attempt to free all partial slabs on a node .
2011-08-10 01:12:22 +04:00
* This is called from kmem_cache_close ( ) . We must be the last thread
* using the cache and therefore we do not need to lock anymore .
2007-05-07 01:49:36 +04:00
*/
2008-04-23 23:36:52 +04:00
static void free_partial ( struct kmem_cache * s , struct kmem_cache_node * n )
2007-05-07 01:49:36 +04:00
{
struct page * page , * h ;
2008-04-25 23:22:43 +04:00
list_for_each_entry_safe ( page , h , & n - > partial , lru ) {
2007-05-07 01:49:36 +04:00
if ( ! page - > inuse ) {
2014-02-11 02:25:46 +04:00
__remove_partial ( n , page ) ;
2007-05-07 01:49:36 +04:00
discard_slab ( s , page ) ;
2008-04-25 23:22:43 +04:00
} else {
list_slab_objects ( s , page ,
2012-09-05 03:18:33 +04:00
" Objects remaining in %s on kmem_cache_close() " ) ;
2008-04-23 23:36:52 +04:00
}
2008-04-25 23:22:43 +04:00
}
2007-05-07 01:49:36 +04:00
}
/*
2007-05-09 13:32:39 +04:00
* Release all resources used by a slab cache .
2007-05-07 01:49:36 +04:00
*/
2007-07-17 15:03:24 +04:00
static inline int kmem_cache_close ( struct kmem_cache * s )
2007-05-07 01:49:36 +04:00
{
int node ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2007-05-07 01:49:36 +04:00
flush_all ( s ) ;
/* Attempt to free all objects */
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2008-04-23 23:36:52 +04:00
free_partial ( s , n ) ;
if ( n - > nr_partial | | slabs_node ( s , node ) )
2007-05-07 01:49:36 +04:00
return 1 ;
}
2012-09-05 03:18:33 +04:00
free_percpu ( s - > cpu_slab ) ;
2007-05-07 01:49:36 +04:00
free_kmem_cache_nodes ( s ) ;
return 0 ;
}
2012-09-05 03:18:33 +04:00
int __kmem_cache_shutdown ( struct kmem_cache * s )
2007-05-07 01:49:36 +04:00
{
2014-05-06 23:50:08 +04:00
return kmem_cache_close ( s ) ;
2007-05-07 01:49:36 +04:00
}
/********************************************************************
* Kmalloc subsystem
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
static int __init setup_slub_min_order ( char * str )
{
2008-01-08 10:20:27 +03:00
get_option ( & str , & slub_min_order ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
}
__setup ( " slub_min_order= " , setup_slub_min_order ) ;
static int __init setup_slub_max_order ( char * str )
{
2008-01-08 10:20:27 +03:00
get_option ( & str , & slub_max_order ) ;
2009-04-23 10:58:22 +04:00
slub_max_order = min ( slub_max_order , MAX_ORDER - 1 ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
}
__setup ( " slub_max_order= " , setup_slub_max_order ) ;
static int __init setup_slub_min_objects ( char * str )
{
2008-01-08 10:20:27 +03:00
get_option ( & str , & slub_min_objects ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
}
__setup ( " slub_min_objects= " , setup_slub_min_objects ) ;
void * __kmalloc ( size_t size , gfp_t flags )
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-19 21:43:26 +04:00
void * ret ;
2007-05-07 01:49:36 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) )
2008-02-11 23:47:46 +03:00
return kmalloc_large ( size , flags ) ;
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , flags ) ;
2007-10-16 12:24:38 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2012-09-09 00:47:58 +04:00
ret = slab_alloc ( s , flags , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2009-03-23 16:12:24 +03:00
trace_kmalloc ( _RET_IP_ , ret , size , s - > size , flags ) ;
2008-08-19 21:43:26 +04:00
2015-02-14 01:39:42 +03:00
kasan_kmalloc ( s , ret , size ) ;
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( __kmalloc ) ;
2010-09-29 16:02:15 +04:00
# ifdef CONFIG_NUMA
2008-03-02 00:56:40 +03:00
static void * kmalloc_large_node ( size_t size , gfp_t flags , int node )
{
2008-11-25 18:55:53 +03:00
struct page * page ;
2009-07-07 13:32:59 +04:00
void * ptr = NULL ;
2008-03-02 00:56:40 +03:00
2014-06-05 03:06:39 +04:00
flags | = __GFP_COMP | __GFP_NOTRACK ;
page = alloc_kmem_pages_node ( node , flags , get_order ( size ) ) ;
2008-03-02 00:56:40 +03:00
if ( page )
2009-07-07 13:32:59 +04:00
ptr = page_address ( page ) ;
2013-10-09 02:58:57 +04:00
kmalloc_large_node_hook ( ptr , size , flags ) ;
2009-07-07 13:32:59 +04:00
return ptr ;
2008-03-02 00:56:40 +03:00
}
2007-05-07 01:49:36 +04:00
void * __kmalloc_node ( size_t size , gfp_t flags , int node )
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-19 21:43:26 +04:00
void * ret ;
2007-05-07 01:49:36 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) ) {
2008-08-19 21:43:26 +04:00
ret = kmalloc_large_node ( size , flags , node ) ;
2009-03-23 16:12:24 +03:00
trace_kmalloc_node ( _RET_IP_ , ret ,
size , PAGE_SIZE < < get_order ( size ) ,
flags , node ) ;
2008-08-19 21:43:26 +04:00
return ret ;
}
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , flags ) ;
2007-10-16 12:24:38 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2012-09-09 00:47:58 +04:00
ret = slab_alloc_node ( s , flags , node , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2009-03-23 16:12:24 +03:00
trace_kmalloc_node ( _RET_IP_ , ret , size , s - > size , flags , node ) ;
2008-08-19 21:43:26 +04:00
2015-02-14 01:39:42 +03:00
kasan_kmalloc ( s , ret , size ) ;
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( __kmalloc_node ) ;
# endif
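
Requests larger than KMALLOC_MAX_CACHE_SIZE bypass the slab caches and are served from whole pages, which is why the trace call reports PAGE_SIZE << get_order(size) as the allocated size. A userspace sketch of that rounding, assuming 4096-byte pages; sketch_get_order() is a hypothetical stand-in for the kernel's get_order().

```c
#define SKETCH_PAGE_SHIFT 12	/* assumed: 4096-byte pages */
#define SKETCH_PAGE_SIZE (1UL << SKETCH_PAGE_SHIFT)

/* Smallest order such that (PAGE_SIZE << order) >= size, for size > 0. */
static int sketch_get_order(unsigned long size)
{
	int order = 0;

	size = (size - 1) >> SKETCH_PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/* Bytes actually handed out for a large request of the given size. */
static unsigned long sketch_large_alloc_bytes(unsigned long size)
{
	return SKETCH_PAGE_SIZE << sketch_get_order(size);
}
```

A request one byte over two pages therefore consumes four pages, since orders grow in powers of two.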
2015-02-14 01:39:42 +03:00
static size_t __ksize ( const void * object )
2007-05-07 01:49:36 +04:00
{
2007-06-09 00:46:49 +04:00
struct page * page ;
2007-05-07 01:49:36 +04:00
2007-10-16 12:24:46 +04:00
if ( unlikely ( object = = ZERO_SIZE_PTR ) )
2007-06-09 00:46:49 +04:00
return 0 ;
2007-12-05 10:45:30 +03:00
page = virt_to_head_page ( object ) ;
2008-05-22 20:22:25 +04:00
if ( unlikely ( ! PageSlab ( page ) ) ) {
WARN_ON ( ! PageCompound ( page ) ) ;
2007-12-05 10:45:30 +03:00
return PAGE_SIZE < < compound_order ( page ) ;
2008-05-22 20:22:25 +04:00
}
2007-05-07 01:49:36 +04:00
slub: Commonize slab_cache field in struct page
Right now, slab and slub have fields in struct page to derive which
cache a page belongs to, but they do it slightly differently.
slab uses a field called slab_cache, that lives in the third double
word. slub, uses a field called "slab", living outside of the
doublewords area.
Ideally, we could use the same field for this. Since slub heavily makes
use of the doubleword region, there isn't really much room to move
slub's slab_cache field around. Since slab does not have such strict
placement restrictions, we can move it outside the doubleword area.
The naming used by slab, "slab_cache", is less confusing, and it is
preferred over slub's generic "slab".
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-22 18:05:36 +04:00
return slab_ksize ( page - > slab_cache ) ;
2007-05-07 01:49:36 +04:00
}
2015-02-14 01:39:42 +03:00
size_t ksize ( const void * object )
{
size_t size = __ksize ( object ) ;
/* We assume that ksize callers could use the whole allocated area,
so we need to unpoison this area . */
kasan_krealloc ( object , size ) ;
return size ;
}
2009-02-10 16:21:44 +03:00
EXPORT_SYMBOL ( ksize ) ;
2007-05-07 01:49:36 +04:00
void kfree ( const void * x )
{
struct page * page ;
2008-02-08 04:47:41 +03:00
void * object = ( void * ) x ;
2007-05-07 01:49:36 +04:00
2009-03-25 12:05:57 +03:00
trace_kfree ( _RET_IP_ , x ) ;
2007-10-16 12:24:44 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( x ) ) )
2007-05-07 01:49:36 +04:00
return ;
2007-05-07 01:49:41 +04:00
page = virt_to_head_page ( x ) ;
2007-10-16 12:24:38 +04:00
if ( unlikely ( ! PageSlab ( page ) ) ) {
2008-05-28 21:32:22 +04:00
BUG_ON ( ! PageCompound ( page ) ) ;
2013-10-09 02:58:57 +04:00
kfree_hook ( x ) ;
2014-06-05 03:06:39 +04:00
__free_kmem_pages ( page , compound_order ( page ) ) ;
2007-10-16 12:24:38 +04:00
return ;
}
2015-11-21 02:57:46 +03:00
slab_free ( page - > slab_cache , page , object , NULL , 1 , _RET_IP_ ) ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kfree ) ;
2015-02-13 01:59:41 +03:00
# define SHRINK_PROMOTE_MAX 32
2007-05-07 01:49:46 +04:00
/*
2015-02-13 01:59:41 +03:00
* kmem_cache_shrink discards empty slabs and promotes the slabs filled
* up most to the head of the partial lists . New allocations will then
* fill those up and thus they can be removed from the partial lists .
2007-05-09 13:32:39 +04:00
*
* The slabs with the fewest objects are placed last . This results in them
* being allocated from last , increasing the chance that the last objects
* are freed in them .
2007-05-07 01:49:46 +04:00
*/
slub: make dead caches discard free slabs immediately
To speed up further allocations SLUB may store empty slabs in per cpu/node
partial lists instead of freeing them immediately. This prevents per
memcg caches destruction, because kmem caches created for a memory cgroup
are only destroyed after the last page charged to the cgroup is freed.
To fix this issue, this patch resurrects approach first proposed in [1].
It forbids SLUB to cache empty slabs after the memory cgroup that the
cache belongs to was destroyed. It is achieved by setting kmem_cache's
cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
that it would drop frozen empty slabs immediately if cpu_partial = 0.
The runtime overhead is minimal. From all the hot functions, we only
touch relatively cold put_cpu_partial(): we make it call
unfreeze_partials() after freezing a slab that belongs to an offline
memory cgroup. Since slab freezing exists to avoid moving slabs from/to a
partial list on free/alloc, and there can't be allocations from dead
caches, it shouldn't cause any overhead. We do have to disable preemption
for put_cpu_partial() to achieve that though.
The original patch was accepted well and even merged to the mm tree.
However, I decided to withdraw it due to changes happening to the memcg
core at that time. I had an idea of introducing per-memcg shrinkers for
kmem caches, but now, as memcg has finally settled down, I do not see it
as an option, because SLUB shrinker would be too costly to call since SLUB
does not keep free slabs on a separate list. Besides, we currently do not
even call per-memcg shrinkers for offline memcgs. Overall, it would
introduce much more complexity to both SLUB and memcg than this small
patch.
Regarding to SLAB, there's no problem with it, because it shrinks
per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
longer keep entries for offline cgroups in per-memcg arrays (such as
memcg_cache_params->memcg_caches), so we do not have to bother if a
per-memcg cache will be shrunk a bit later than it could be.
[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 01:59:47 +03:00
int __kmem_cache_shrink ( struct kmem_cache * s , bool deactivate )
2007-05-07 01:49:46 +04:00
{
int node ;
int i ;
struct kmem_cache_node * n ;
struct page * page ;
struct page * t ;
2015-02-13 01:59:41 +03:00
struct list_head discard ;
struct list_head promote [ SHRINK_PROMOTE_MAX ] ;
2007-05-07 01:49:46 +04:00
unsigned long flags ;
2015-02-13 01:59:44 +03:00
int ret = 0 ;
2007-05-07 01:49:46 +04:00
2015-02-13 01:59:47 +03:00
if ( deactivate ) {
/*
* Disable empty slabs caching . Used to avoid pinning offline
* memory cgroups by kmem pages that can be freed .
*/
s - > cpu_partial = 0 ;
s - > min_partial = 0 ;
/*
* s - > cpu_partial is checked locklessly ( see put_cpu_partial ) ,
* so we have to make sure the change is visible .
*/
kick_all_cpus_sync ( ) ;
}
2007-05-07 01:49:46 +04:00
flush_all ( s ) ;
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2015-02-13 01:59:41 +03:00
INIT_LIST_HEAD ( & discard ) ;
for ( i = 0 ; i < SHRINK_PROMOTE_MAX ; i + + )
INIT_LIST_HEAD ( promote + i ) ;
2007-05-07 01:49:46 +04:00
spin_lock_irqsave ( & n - > list_lock , flags ) ;
/*
2015-02-13 01:59:41 +03:00
* Build lists of slabs to discard or promote .
2007-05-07 01:49:46 +04:00
*
2007-05-09 13:32:39 +04:00
* Note that concurrent frees may occur while we hold the
* list_lock . page - > inuse here is the upper limit .
2007-05-07 01:49:46 +04:00
*/
list_for_each_entry_safe ( page , t , & n - > partial , lru ) {
2015-02-13 01:59:41 +03:00
int free = page - > objects - page - > inuse ;
/* Do not reread page->inuse */
barrier ( ) ;
/* We do not keep full slabs on the list */
BUG_ON ( free < = 0 ) ;
if ( free = = page - > objects ) {
list_move ( & page - > lru , & discard ) ;
2011-08-10 01:12:22 +04:00
n - > nr_partial - - ;
2015-02-13 01:59:41 +03:00
} else if ( free < = SHRINK_PROMOTE_MAX )
list_move ( & page - > lru , promote + free - 1 ) ;
2007-05-07 01:49:46 +04:00
}
/*
2015-02-13 01:59:41 +03:00
* Promote the slabs filled up most to the head of the
* partial list .
2007-05-07 01:49:46 +04:00
*/
2015-02-13 01:59:41 +03:00
for ( i = SHRINK_PROMOTE_MAX - 1 ; i > = 0 ; i - - )
list_splice ( promote + i , & n - > partial ) ;
2007-05-07 01:49:46 +04:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2011-08-10 01:12:22 +04:00
/* Release empty slabs */
2015-02-13 01:59:41 +03:00
list_for_each_entry_safe ( page , t , & discard , lru )
2011-08-10 01:12:22 +04:00
discard_slab ( s , page ) ;
2015-02-13 01:59:44 +03:00
if ( slabs_node ( s , node ) )
ret = 1 ;
2007-05-07 01:49:46 +04:00
}
2015-02-13 01:59:44 +03:00
return ret ;
2007-05-07 01:49:46 +04:00
}
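
The reordering done by __kmem_cache_shrink() is a bucket sort on the number of free objects per slab. The following userspace sketch works on arrays instead of list_head lists and takes no locks. It assumes the input holds no completely full slabs (the kernel asserts this with BUG_ON) and it keeps the input order inside each bucket, which may differ from the order the kernel's list_move() calls produce. All sketch_ names are hypothetical.

```c
#include <stddef.h>

#define SKETCH_PROMOTE_MAX 32

struct sketch_slab {
	int objects;	/* capacity of the slab */
	int inuse;	/* objects currently allocated */
};

/*
 * Copy the slabs that survive a shrink from in[0..n) to out[] in their
 * new partial-list order and return how many were kept. Empty slabs
 * (inuse == 0) are discarded.
 */
static size_t sketch_shrink_order(const struct sketch_slab *in, size_t n,
				  struct sketch_slab *out)
{
	size_t kept = 0;
	size_t i;
	int bucket;

	/* Promoted slabs first: fewest free objects at the head. */
	for (bucket = 1; bucket <= SKETCH_PROMOTE_MAX; bucket++) {
		for (i = 0; i < n; i++) {
			int free_objs = in[i].objects - in[i].inuse;

			if (in[i].inuse > 0 && free_objs == bucket)
				out[kept++] = in[i];
		}
	}

	/* Slabs with more free objects stay behind them, in list order. */
	for (i = 0; i < n; i++) {
		int free_objs = in[i].objects - in[i].inuse;

		if (in[i].inuse > 0 && free_objs > SKETCH_PROMOTE_MAX)
			out[kept++] = in[i];
	}
	return kept;
}
```

Putting the fullest slabs at the head means new allocations fill them first, so they leave the partial list sooner.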
2007-10-22 03:41:37 +04:00
static int slab_mem_going_offline_callback ( void * arg )
{
struct kmem_cache * s ;
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
list_for_each_entry ( s , & slab_caches , list )
2015-02-13 01:59:47 +03:00
__kmem_cache_shrink ( s , false ) ;
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
return 0 ;
}
static void slab_mem_offline_callback ( void * arg )
{
struct kmem_cache_node * n ;
struct kmem_cache * s ;
struct memory_notify * marg = arg ;
int offline_node ;
2012-12-12 04:01:05 +04:00
offline_node = marg - > status_change_nid_normal ;
2007-10-22 03:41:37 +04:00
/*
* If the node still has available memory , we still need its
* kmem_cache_node .
*/
if ( offline_node < 0 )
return ;
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
n = get_node ( s , offline_node ) ;
if ( n ) {
/*
* If n - > nr_slabs > 0 , slabs still exist on the node
* that is going down . We were unable to free them ,
2009-12-18 23:40:42 +03:00
* and offline_pages ( ) function shouldn ' t call this
2007-10-22 03:41:37 +04:00
* callback . So , we must fail .
*/
2008-04-14 19:53:02 +04:00
BUG_ON ( slabs_node ( s , offline_node ) ) ;
2007-10-22 03:41:37 +04:00
s - > node [ offline_node ] = NULL ;
2010-08-25 23:51:14 +04:00
kmem_cache_free ( kmem_cache_node , n ) ;
2007-10-22 03:41:37 +04:00
}
}
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
}
static int slab_mem_going_online_callback ( void * arg )
{
struct kmem_cache_node * n ;
struct kmem_cache * s ;
struct memory_notify * marg = arg ;
2012-12-12 04:01:05 +04:00
int nid = marg - > status_change_nid_normal ;
2007-10-22 03:41:37 +04:00
int ret = 0 ;
/*
* If the node ' s memory is already available , then kmem_cache_node is
* already created . Nothing to do .
*/
if ( nid < 0 )
return 0 ;
/*
2008-04-30 03:11:12 +04:00
* We are bringing a node online . No memory is available yet . We must
2007-10-22 03:41:37 +04:00
* allocate a kmem_cache_node structure in order to bring the node
* online .
*/
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
/*
* XXX : kmem_cache_alloc_node will fall back to other nodes
* since memory is not yet available from the node that
* is brought up .
*/
2010-08-25 23:51:14 +04:00
n = kmem_cache_alloc ( kmem_cache_node , GFP_KERNEL ) ;
2007-10-22 03:41:37 +04:00
if ( ! n ) {
ret = - ENOMEM ;
goto out ;
}
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( n ) ;
2007-10-22 03:41:37 +04:00
s - > node [ nid ] = n ;
}
out :
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
return ret ;
}
static int slab_memory_callback ( struct notifier_block * self ,
unsigned long action , void * arg )
{
int ret = 0 ;
switch ( action ) {
case MEM_GOING_ONLINE :
ret = slab_mem_going_online_callback ( arg ) ;
break ;
case MEM_GOING_OFFLINE :
ret = slab_mem_going_offline_callback ( arg ) ;
break ;
case MEM_OFFLINE :
case MEM_CANCEL_ONLINE :
slab_mem_offline_callback ( arg ) ;
break ;
case MEM_ONLINE :
case MEM_CANCEL_OFFLINE :
break ;
}
2008-12-02 00:13:48 +03:00
if ( ret )
ret = notifier_from_errno ( ret ) ;
else
ret = NOTIFY_OK ;
2007-10-22 03:41:37 +04:00
return ret ;
}
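The callback above folds an errno into a notifier return value with notifier_from_errno(). The following userspace sketch models that encoding as I recall it from include/linux/notifier.h; the TOY_ names and the constant values are assumptions for illustration, not a copy of the kernel header.

```c
#include <assert.h>   /* only needed for the usage checks */
#include <errno.h>

/* Assumed values, modelled on NOTIFY_OK and NOTIFY_STOP_MASK. */
#define TOY_NOTIFY_OK        0x0001
#define TOY_NOTIFY_STOP_MASK 0x8000

/* Encode a negative errno so the notifier chain stops and carries the error. */
static int toy_notifier_from_errno(int err)
{
	if (err)
		return TOY_NOTIFY_STOP_MASK | (TOY_NOTIFY_OK - err);
	return TOY_NOTIFY_OK;
}

/* Decode the value again; anything at or below TOY_NOTIFY_OK means "no error". */
static int toy_notifier_to_errno(int ret)
{
	ret &= ~TOY_NOTIFY_STOP_MASK;
	return ret > TOY_NOTIFY_OK ? TOY_NOTIFY_OK - ret : 0;
}
```

With this encoding, a failed MEM_GOING_ONLINE (for example -ENOMEM from the node allocation) both stops the chain and lets the caller recover the original errno.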
2013-04-30 02:08:06 +04:00
static struct notifier_block slab_memory_callback_nb = {
. notifier_call = slab_memory_callback ,
. priority = SLAB_CALLBACK_PRI ,
} ;
2007-10-22 03:41:37 +04:00
2007-05-07 01:49:36 +04:00
/********************************************************************
* Basic setup of slabs
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
2010-08-20 21:37:15 +04:00
/*
* Used for early kmem_cache structures that were allocated using
2012-11-28 20:23:07 +04:00
* the page allocator . Allocate them properly then fix up the pointers
* that may be pointing to the wrong kmem_cache structure .
2010-08-20 21:37:15 +04:00
*/
2012-11-28 20:23:07 +04:00
static struct kmem_cache * __init bootstrap ( struct kmem_cache * static_cache )
2010-08-20 21:37:15 +04:00
{
int node ;
2012-11-28 20:23:07 +04:00
struct kmem_cache * s = kmem_cache_zalloc ( kmem_cache , GFP_NOWAIT ) ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2010-08-20 21:37:15 +04:00
2012-11-28 20:23:07 +04:00
memcpy ( s , static_cache , kmem_cache - > object_size ) ;
2010-08-20 21:37:15 +04:00
2013-02-22 20:20:00 +04:00
/*
* This runs very early , and only the boot processor is supposed to be
* up . Even if it weren ' t true , IRQs are not up so we couldn ' t fire
* IPIs around .
*/
__flush_cpu_slab ( s , smp_processor_id ( ) ) ;
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2010-08-20 21:37:15 +04:00
struct page * p ;
2014-08-07 03:04:09 +04:00
list_for_each_entry ( p , & n - > partial , lru )
p - > slab_cache = s ;
2010-08-20 21:37:15 +04:00
2011-04-12 11:22:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2014-08-07 03:04:09 +04:00
list_for_each_entry ( p , & n - > full , lru )
p - > slab_cache = s ;
2010-08-20 21:37:15 +04:00
# endif
}
2015-02-13 01:59:20 +03:00
slab_init_memcg_params ( s ) ;
2012-11-28 20:23:07 +04:00
list_add ( & s - > list , & slab_caches ) ;
return s ;
2010-08-20 21:37:15 +04:00
}
2007-05-07 01:49:36 +04:00
void __init kmem_cache_init ( void )
{
2012-11-28 20:23:07 +04:00
static __initdata struct kmem_cache boot_kmem_cache ,
boot_kmem_cache_node ;
2010-08-20 21:37:15 +04:00
2012-01-11 03:07:32 +04:00
if ( debug_guardpage_minorder ( ) )
slub_max_order = 0 ;
2012-11-28 20:23:07 +04:00
kmem_cache_node = & boot_kmem_cache_node ;
kmem_cache = & boot_kmem_cache ;
2010-08-20 21:37:15 +04:00
2012-11-28 20:23:07 +04:00
create_boot_cache ( kmem_cache_node , " kmem_cache_node " ,
sizeof ( struct kmem_cache_node ) , SLAB_HWCACHE_ALIGN ) ;
2007-10-22 03:41:37 +04:00
2013-04-30 02:08:06 +04:00
register_hotmemory_notifier ( & slab_memory_callback_nb ) ;
2007-05-07 01:49:36 +04:00
/* Able to allocate the per node structures */
slab_state = PARTIAL ;
2012-11-28 20:23:07 +04:00
create_boot_cache ( kmem_cache , " kmem_cache " ,
offsetof ( struct kmem_cache , node ) +
nr_node_ids * sizeof ( struct kmem_cache_node * ) ,
SLAB_HWCACHE_ALIGN ) ;
2012-09-05 03:18:33 +04:00
2012-11-28 20:23:07 +04:00
kmem_cache = bootstrap ( & boot_kmem_cache ) ;
2007-05-07 01:49:36 +04:00
2010-08-20 21:37:15 +04:00
/*
* Allocate kmem_cache_node properly from the kmem_cache slab .
* kmem_cache_node is separately allocated so no need to
* update any list pointers .
*/
2012-11-28 20:23:07 +04:00
kmem_cache_node = bootstrap ( & boot_kmem_cache_node ) ;
2010-08-20 21:37:15 +04:00
/* Now we can use the kmem_cache to allocate kmalloc slabs */
2015-06-25 02:55:57 +03:00
setup_kmalloc_cache_index_table ( ) ;
2013-01-10 23:12:17 +04:00
create_kmalloc_caches ( 0 ) ;
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_SMP
register_cpu_notifier ( & slab_notifier ) ;
2009-12-19 01:26:20 +03:00
# endif
2007-05-07 01:49:36 +04:00
2014-06-05 03:06:34 +04:00
pr_info ( " SLUB: HWalign=%d, Order=%d-%d, MinObjects=%d, CPUs=%d, Nodes=%d \n " ,
2013-01-10 23:12:17 +04:00
cache_line_size ( ) ,
2007-05-07 01:49:36 +04:00
slub_min_order , slub_max_order , slub_min_objects ,
nr_cpu_ids , nr_node_ids ) ;
}
2009-06-12 15:03:06 +04:00
void __init kmem_cache_init_late ( void )
{
}
2012-12-19 02:22:34 +04:00
struct kmem_cache *
memcg, slab: never try to merge memcg caches
When a kmem cache is created (kmem_cache_create_memcg()), we first try to
find a compatible cache that already exists and can handle requests from
the new cache, i.e. has the same object size, alignment, ctor, etc. If
there is such a cache, we do not create any new caches, instead we simply
increment the refcount of the cache found and return it.
Currently we do this procedure not only when creating root caches, but
also for memcg caches. However, there is no point in that, because, as
every memcg cache has exactly the same parameters as its parent and cache
merging cannot be turned off at runtime (only at boot by passing
"slub_nomerge"), the root caches of any two potentially mergeable memcg
caches should be merged already, i.e. it must be the same root cache, and
therefore we couldn't even get to the memcg cache creation, because it
already exists.
The only exception is boot caches - they are explicitly forbidden to be
merged by setting their refcount to -1. There are currently only two of
them - kmem_cache and kmem_cache_node, which are used in slab internals (I
do not count kmalloc caches as their refcount is set to 1 immediately
after creation). Since they are prevented from merging to begin with, I
guess we should avoid merging their children too.
So let's remove the useless code responsible for merging memcg caches.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 02:39:23 +04:00
__kmem_cache_alias ( const char * name , size_t size , size_t align ,
unsigned long flags , void ( * ctor ) ( void * ) )
2007-05-07 01:49:36 +04:00
{
2015-02-13 01:59:23 +03:00
struct kmem_cache * s , * c ;
2007-05-07 01:49:36 +04:00
2014-04-08 02:39:23 +04:00
s = find_mergeable ( size , align , flags , name , ctor ) ;
2007-05-07 01:49:36 +04:00
if ( s ) {
s - > refcount + + ;
2014-04-08 02:39:29 +04:00
2007-05-07 01:49:36 +04:00
/*
* Adjust the object sizes so that we clear
* the complete object on kzalloc .
*/
2012-06-13 19:24:57 +04:00
s - > object_size = max ( s - > object_size , ( int ) size ) ;
2007-05-07 01:49:36 +04:00
s - > inuse = max_t ( int , s - > inuse , ALIGN ( size , sizeof ( void * ) ) ) ;
2008-02-16 10:45:26 +03:00
2015-02-13 01:59:23 +03:00
for_each_memcg_cache ( c , s ) {
2014-04-08 02:39:29 +04:00
c - > object_size = s - > object_size ;
c - > inuse = max_t ( int , c - > inuse ,
ALIGN ( size , sizeof ( void * ) ) ) ;
}
2008-12-18 09:09:46 +03:00
if ( sysfs_slab_alias ( s , name ) ) {
s - > refcount - - ;
2012-09-05 04:18:32 +04:00
s = NULL ;
2008-12-18 09:09:46 +03:00
}
2007-07-17 15:03:31 +04:00
}
2008-02-16 10:45:26 +03:00
2012-09-05 04:18:32 +04:00
return s ;
}
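The size adjustment done when an existing cache is reused as an alias can be sketched in userspace as follows. This is a simplified model, not the kernel code: struct toy_cache, toy_alias() and ALIGN_UP are hypothetical names, and the "negative refcount means never merge" rule is taken from the commit message above.

```c
#include <assert.h>   /* only needed for the usage checks */
#include <stddef.h>

/* Round x up to a multiple of a (hypothetical helper). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

/* Hypothetical, trimmed-down stand-in for struct kmem_cache. */
struct toy_cache {
	int refcount;     /* negative: boot cache, must never be merged */
	int object_size;  /* bytes cleared by a zeroing allocation */
	int inuse;        /* bytes of the object that carry payload */
};

/*
 * Reuse s for a request of `size` bytes. Returns s on success, or NULL
 * when s is missing or marked unmergeable.
 */
static struct toy_cache *toy_alias(struct toy_cache *s, size_t size)
{
	int aligned = (int)ALIGN_UP(size, sizeof(void *));

	if (!s || s->refcount < 0)
		return NULL;

	s->refcount++;
	/* Grow the recorded sizes so a zeroing alloc clears the whole object. */
	if (s->object_size < (int)size)
		s->object_size = (int)size;
	if (s->inuse < aligned)
		s->inuse = aligned;
	return s;
}
```

The sizes only ever grow, so an earlier user of the merged cache never sees less of its object cleared than before.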
2010-09-15 00:21:12 +04:00
2012-09-05 03:18:33 +04:00
int __kmem_cache_create ( struct kmem_cache * s , unsigned long flags )
2012-09-05 04:18:32 +04:00
{
2012-09-05 13:07:44 +04:00
int err ;
err = kmem_cache_open ( s , flags ) ;
if ( err )
return err ;
2012-07-07 00:25:13 +04:00
2012-11-28 20:23:07 +04:00
/* Mutex is not taken during early boot */
if ( slab_state < = UP )
return 0 ;
slub: slub-specific propagation changes
SLUB allows us to tune a particular cache behavior with sysfs-based
tunables. When creating a new memcg cache copy, we'd like to preserve any
tunables the parent cache already had.
This can be done by tapping into the store attribute function provided by
the allocator. We of course don't need to mess with read-only fields.
Since the attributes can have multiple types and are stored internally by
sysfs, the best strategy is to issue a ->show() in the root cache, and
then ->store() in the memcg cache.
The drawback is that sysfs can allocate up to a page of buffering for
show(), which we are unlikely to need but cannot rule out. To
avoid always allocating a page for that, we can update the caches at store
time with the maximum attribute size ever stored to the root cache. We
will then get a buffer big enough to hold it. The corollary is
that if no stores happened, nothing will be propagated.
It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.
[akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 02:23:05 +04:00
memcg_propagate_slab_attrs ( s ) ;
2012-09-05 13:07:44 +04:00
err = sysfs_slab_add ( s ) ;
if ( err )
kmem_cache_close ( s ) ;
2012-07-07 00:25:13 +04:00
2012-09-05 13:07:44 +04:00
return err ;
2007-05-07 01:49:36 +04:00
}
# ifdef CONFIG_SMP
/*
2007-05-09 13:32:39 +04:00
* Use the cpu notifier to ensure that the cpu slabs are flushed when
* necessary .
2007-05-07 01:49:36 +04:00
*/
2013-06-19 22:53:51 +04:00
static int slab_cpuup_callback ( struct notifier_block * nfb ,
2007-05-07 01:49:36 +04:00
unsigned long action , void * hcpu )
{
long cpu = ( long ) hcpu ;
2007-07-17 15:03:19 +04:00
struct kmem_cache * s ;
unsigned long flags ;
2007-05-07 01:49:36 +04:00
switch ( action ) {
case CPU_UP_CANCELED :
2007-05-09 13:35:10 +04:00
case CPU_UP_CANCELED_FROZEN :
2007-05-07 01:49:36 +04:00
case CPU_DEAD :
2007-05-09 13:35:10 +04:00
case CPU_DEAD_FROZEN :
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2007-07-17 15:03:19 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
local_irq_save ( flags ) ;
__flush_cpu_slab ( s , cpu ) ;
local_irq_restore ( flags ) ;
}
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-05-07 01:49:36 +04:00
break ;
default :
break ;
}
return NOTIFY_OK ;
}
2013-06-19 22:53:51 +04:00
static struct notifier_block slab_notifier = {
2008-02-06 04:57:39 +03:00
. notifier_call = slab_cpuup_callback
2008-01-08 10:20:27 +03:00
} ;
2007-05-07 01:49:36 +04:00
# endif
2008-08-19 21:43:25 +04:00
void * __kmalloc_track_caller ( size_t size , gfp_t gfpflags , unsigned long caller )
2007-05-07 01:49:36 +04:00
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-24 21:49:35 +04:00
void * ret ;
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) )
2008-02-11 23:47:46 +03:00
return kmalloc_large ( size , gfpflags ) ;
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , gfpflags ) ;
2007-05-07 01:49:36 +04:00
2007-10-16 12:24:44 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2007-05-07 01:49:36 +04:00
2012-09-09 00:47:58 +04:00
ret = slab_alloc ( s , gfpflags , caller ) ;
2008-08-24 21:49:35 +04:00
2011-03-31 05:57:33 +04:00
/* Honor the call site pointer we received. */
2009-03-23 16:12:24 +03:00
trace_kmalloc ( caller , ret , size , s - > size , gfpflags ) ;
2008-08-24 21:49:35 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
2010-09-29 16:02:15 +04:00
# ifdef CONFIG_NUMA
2007-05-07 01:49:36 +04:00
void * __kmalloc_node_track_caller ( size_t size , gfp_t gfpflags ,
2008-08-19 21:43:25 +04:00
int node , unsigned long caller )
2007-05-07 01:49:36 +04:00
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-24 21:49:35 +04:00
void * ret ;
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) ) {
2010-04-08 13:26:44 +04:00
ret = kmalloc_large_node ( size , gfpflags , node ) ;
trace_kmalloc_node ( caller , ret ,
size , PAGE_SIZE < < get_order ( size ) ,
gfpflags , node ) ;
return ret ;
}
2008-02-11 23:47:46 +03:00
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , gfpflags ) ;
2007-05-07 01:49:36 +04:00
2007-10-16 12:24:44 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2007-05-07 01:49:36 +04:00
2012-09-09 00:47:58 +04:00
ret = slab_alloc_node ( s , gfpflags , node , caller ) ;
2008-08-24 21:49:35 +04:00
2011-03-31 05:57:33 +04:00
/* Honor the call site pointer we received. */
2009-03-23 16:12:24 +03:00
trace_kmalloc_node ( caller , ret , size , s - > size , gfpflags , node ) ;
2008-08-24 21:49:35 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
2010-09-29 16:02:15 +04:00
# endif
2007-05-07 01:49:36 +04:00
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SYSFS
2008-04-14 20:11:40 +04:00
static int count_inuse ( struct page * page )
{
return page - > inuse ;
}
static int count_total ( struct page * page )
{
return page - > objects ;
}
2010-10-05 22:57:26 +04:00
# endif
2008-04-14 20:11:40 +04:00
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2007-07-17 15:03:30 +04:00
static int validate_slab ( struct kmem_cache * s , struct page * page ,
unsigned long * map )
2007-05-07 01:49:43 +04:00
{
void * p ;
2008-03-02 00:40:44 +03:00
void * addr = page_address ( page ) ;
2007-05-07 01:49:43 +04:00
if ( ! check_slab ( s , page ) | |
! on_freelist ( s , page , NULL ) )
return 0 ;
/* Now we know that a valid freelist exists */
2008-04-14 20:11:30 +04:00
bitmap_zero ( map , page - > objects ) ;
2007-05-07 01:49:43 +04:00
2011-04-15 23:48:13 +04:00
get_map ( s , page , map ) ;
for_each_object ( p , s , addr , page - > objects ) {
if ( test_bit ( slab_index ( p , s , addr ) , map ) )
if ( ! check_object ( s , page , p , SLUB_RED_INACTIVE ) )
return 0 ;
2007-05-07 01:49:43 +04:00
}
2008-04-14 20:11:31 +04:00
for_each_object ( p , s , addr , page - > objects )
2007-05-09 13:32:40 +04:00
if ( ! test_bit ( slab_index ( p , s , addr ) , map ) )
2010-12-01 21:04:20 +03:00
if ( ! check_object ( s , page , p , SLUB_RED_ACTIVE ) )
2007-05-07 01:49:43 +04:00
return 0 ;
return 1 ;
}
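The two passes in validate_slab() can be modelled in userspace: build a bitmap of free objects by walking the freelist, then expect the inactive red zone on free objects and the active one on allocated objects. Everything named toy_ or TOY_ is hypothetical; the red zone byte values follow what I recall of SLUB_RED_INACTIVE and SLUB_RED_ACTIVE and should be treated as assumptions.

```c
#include <assert.h>   /* only needed for the usage checks */
#include <stdint.h>

#define TOY_OBJECTS      8      /* objects per toy slab (assumption) */
#define TOY_END          (-1)   /* freelist terminator (assumption) */
#define TOY_RED_INACTIVE 0xbb   /* assumed red zone byte of a free object */
#define TOY_RED_ACTIVE   0xcc   /* assumed red zone byte of a live object */

/* Walk the freelist (next_free[i] is the index after i) and mark free objects. */
static uint32_t toy_get_map(const int *next_free, int head)
{
	uint32_t map = 0;
	int steps = 0;
	int i = head;

	/* The step limit guards against a corrupted, cyclic freelist. */
	while (i >= 0 && i < TOY_OBJECTS && steps++ < TOY_OBJECTS) {
		map |= UINT32_C(1) << i;
		i = next_free[i];
	}
	return map;
}

/* Returns 1 if every object carries the red zone matching its state, else 0. */
static int toy_validate_slab(const int *next_free, int head,
			     const unsigned char *redzone)
{
	uint32_t map = toy_get_map(next_free, head);
	int i;

	for (i = 0; i < TOY_OBJECTS; i++) {
		int is_free = (map >> i) & 1;
		unsigned char want = is_free ? TOY_RED_INACTIVE : TOY_RED_ACTIVE;

		if (redzone[i] != want)
			return 0;
	}
	return 1;
}
```

The sketch merges the two kernel loops into one pass over the objects; the checks performed per object are the same.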
2007-07-17 15:03:30 +04:00
static void validate_slab_slab ( struct kmem_cache * s , struct page * page ,
unsigned long * map )
2007-05-07 01:49:43 +04:00
{
2011-06-01 21:25:53 +04:00
slab_lock ( page ) ;
validate_slab ( s , page , map ) ;
slab_unlock ( page ) ;
2007-05-07 01:49:43 +04:00
}
2007-07-17 15:03:30 +04:00
static int validate_slab_node ( struct kmem_cache * s ,
struct kmem_cache_node * n , unsigned long * map )
2007-05-07 01:49:43 +04:00
{
unsigned long count = 0 ;
struct page * page ;
unsigned long flags ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
list_for_each_entry ( page , & n - > partial , lru ) {
2007-07-17 15:03:30 +04:00
validate_slab_slab ( s , page , map ) ;
2007-05-07 01:49:43 +04:00
count + + ;
}
if ( count ! = n - > nr_partial )
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB %s: %ld partial slabs counted but counter=%ld \n " ,
s - > name , count , n - > nr_partial ) ;
2007-05-07 01:49:43 +04:00
if ( ! ( s - > flags & SLAB_STORE_USER ) )
goto out ;
list_for_each_entry ( page , & n - > full , lru ) {
2007-07-17 15:03:30 +04:00
validate_slab_slab ( s , page , map ) ;
2007-05-07 01:49:43 +04:00
count + + ;
}
if ( count ! = atomic_long_read ( & n - > nr_slabs ) )
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB: %s %ld slabs counted but counter=%ld \n " ,
s - > name , count , atomic_long_read ( & n - > nr_slabs ) ) ;
2007-05-07 01:49:43 +04:00
out :
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
return count ;
}
2007-07-17 15:03:30 +04:00
static long validate_slab_cache ( struct kmem_cache * s )
2007-05-07 01:49:43 +04:00
{
int node ;
unsigned long count = 0 ;
2008-04-14 20:11:40 +04:00
unsigned long * map = kmalloc ( BITS_TO_LONGS ( oo_objects ( s - > max ) ) *
2007-07-17 15:03:30 +04:00
sizeof ( unsigned long ) , GFP_KERNEL ) ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2007-07-17 15:03:30 +04:00
if ( ! map )
return - ENOMEM ;
2007-05-07 01:49:43 +04:00
flush_all ( s ) ;
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n )
2007-07-17 15:03:30 +04:00
count + = validate_slab_node ( s , n , map ) ;
kfree ( map ) ;
2007-05-07 01:49:43 +04:00
return count ;
}
2007-05-07 01:49:45 +04:00
/*
2007-05-09 13:32:39 +04:00
* Generate lists of code addresses where slabcache objects are allocated
2007-05-07 01:49:45 +04:00
* and freed .
*/
struct location {
unsigned long count ;
2008-08-19 21:43:25 +04:00
unsigned long addr ;
2007-05-09 13:32:45 +04:00
long long sum_time ;
long min_time ;
long max_time ;
long min_pid ;
long max_pid ;
2009-01-01 02:42:29 +03:00
DECLARE_BITMAP ( cpus , NR_CPUS ) ;
2007-05-09 13:32:45 +04:00
nodemask_t nodes ;
2007-05-07 01:49:45 +04:00
} ;
struct loc_track {
unsigned long max ;
unsigned long count ;
struct location * loc ;
} ;
static void free_loc_track ( struct loc_track * t )
{
if ( t - > max )
free_pages ( ( unsigned long ) t - > loc ,
get_order ( sizeof ( struct location ) * t - > max ) ) ;
}
2007-07-17 15:03:20 +04:00
static int alloc_loc_track ( struct loc_track * t , unsigned long max , gfp_t flags )
2007-05-07 01:49:45 +04:00
{
struct location * l ;
int order ;
order = get_order ( sizeof ( struct location ) * max ) ;
2007-07-17 15:03:20 +04:00
l = ( void * ) __get_free_pages ( flags , order ) ;
2007-05-07 01:49:45 +04:00
if ( ! l )
return 0 ;
if ( t - > count ) {
memcpy ( l , t - > loc , sizeof ( struct location ) * t - > count ) ;
free_loc_track ( t ) ;
}
t - > max = max ;
t - > loc = l ;
return 1 ;
}
static int add_location ( struct loc_track * t , struct kmem_cache * s ,
2007-05-09 13:32:45 +04:00
const struct track * track )
2007-05-07 01:49:45 +04:00
{
long start , end , pos ;
struct location * l ;
2008-08-19 21:43:25 +04:00
unsigned long caddr ;
2007-05-09 13:32:45 +04:00
unsigned long age = jiffies - track - > when ;
2007-05-07 01:49:45 +04:00
start = - 1 ;
end = t - > count ;
for ( ; ; ) {
pos = start + ( end - start + 1 ) / 2 ;
/*
* There is nothing at " end " . If we end up there
* we need to insert the new entry before end .
*/
if ( pos = = end )
break ;
caddr = t - > loc [ pos ] . addr ;
2007-05-09 13:32:45 +04:00
if ( track - > addr = = caddr ) {
l = & t - > loc [ pos ] ;
l - > count + + ;
if ( track - > when ) {
l - > sum_time + = age ;
if ( age < l - > min_time )
l - > min_time = age ;
if ( age > l - > max_time )
l - > max_time = age ;
if ( track - > pid < l - > min_pid )
l - > min_pid = track - > pid ;
if ( track - > pid > l - > max_pid )
l - > max_pid = track - > pid ;
2009-01-01 02:42:29 +03:00
cpumask_set_cpu ( track - > cpu ,
to_cpumask ( l - > cpus ) ) ;
2007-05-09 13:32:45 +04:00
}
node_set ( page_to_nid ( virt_to_page ( track ) ) , l - > nodes ) ;
2007-05-07 01:49:45 +04:00
return 1 ;
}
2007-05-09 13:32:45 +04:00
if ( track - > addr < caddr )
2007-05-07 01:49:45 +04:00
end = pos ;
else
start = pos ;
}
/*
2007-05-09 13:32:39 +04:00
* Not found . Insert new tracking element .
2007-05-07 01:49:45 +04:00
*/
2007-07-17 15:03:20 +04:00
if ( t - > count > = t - > max & & ! alloc_loc_track ( t , 2 * t - > max , GFP_ATOMIC ) )
2007-05-07 01:49:45 +04:00
return 0 ;
l = t - > loc + pos ;
if ( pos < t - > count )
memmove ( l + 1 , l ,
( t - > count - pos ) * sizeof ( struct location ) ) ;
t - > count + + ;
l - > count = 1 ;
2007-05-09 13:32:45 +04:00
l - > addr = track - > addr ;
l - > sum_time = age ;
l - > min_time = age ;
l - > max_time = age ;
l - > min_pid = track - > pid ;
l - > max_pid = track - > pid ;
2009-01-01 02:42:29 +03:00
cpumask_clear ( to_cpumask ( l - > cpus ) ) ;
cpumask_set_cpu ( track - > cpu , to_cpumask ( l - > cpus ) ) ;
2007-05-09 13:32:45 +04:00
nodes_clear ( l - > nodes ) ;
node_set ( page_to_nid ( virt_to_page ( track ) ) , l - > nodes ) ;
2007-05-07 01:49:45 +04:00
return 1 ;
}
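The search-then-insert logic of add_location() can be tried out in userspace with the sketch below. It keeps the same binary search (start at -1, end at count, stop when pos reaches end) but uses realloc() and an initial capacity of 4 instead of page allocations; struct loc, struct loc_table and loc_table_add() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct location. */
struct loc {
	unsigned long addr;   /* call-site address, the sort key */
	unsigned long count;  /* objects seen for this address */
};

/* Hypothetical stand-in for struct loc_track. */
struct loc_table {
	size_t max;           /* allocated capacity */
	size_t count;         /* entries in use */
	struct loc *loc;      /* sorted by addr, ascending */
};

/* Returns 1 on success, 0 if the table could not be grown. */
static int loc_table_add(struct loc_table *t, unsigned long addr)
{
	long start = -1, end = (long)t->count, pos;

	for (;;) {
		pos = start + (end - start + 1) / 2;
		if (pos == end)          /* nothing lives at "end" */
			break;
		if (t->loc[pos].addr == addr) {
			t->loc[pos].count++; /* known call site: just count it */
			return 1;
		}
		if (addr < t->loc[pos].addr)
			end = pos;
		else
			start = pos;
	}

	/* Not found: grow by doubling if full, then insert at pos. */
	if (t->count >= t->max) {
		size_t new_max = t->max ? 2 * t->max : 4;
		struct loc *l = realloc(t->loc, new_max * sizeof(*l));

		if (!l)
			return 0;
		t->loc = l;
		t->max = new_max;
	}
	assert(t->count < t->max);

	if ((size_t)pos < t->count)
		memmove(&t->loc[pos + 1], &t->loc[pos],
			(t->count - (size_t)pos) * sizeof(struct loc));
	t->loc[pos].addr = addr;
	t->loc[pos].count = 1;
	t->count++;
	return 1;
}
```

Because the table stays sorted on every insert, lookups remain logarithmic and the final listing comes out ordered by address without a separate sort.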
static void process_slab ( struct loc_track * t , struct kmem_cache * s ,
2010-03-25 00:25:47 +03:00
struct page * page , enum track_item alloc ,
2010-09-29 16:02:13 +04:00
unsigned long * map )
2007-05-07 01:49:45 +04:00
{
2008-03-02 00:40:44 +03:00
void * addr = page_address ( page ) ;
2007-05-07 01:49:45 +04:00
void * p ;
2008-04-14 20:11:30 +04:00
bitmap_zero ( map , page - > objects ) ;
2011-04-15 23:48:13 +04:00
get_map ( s , page , map ) ;
2007-05-07 01:49:45 +04:00
2008-04-14 20:11:31 +04:00
for_each_object ( p , s , addr , page - > objects )
2007-05-09 13:32:45 +04:00
if ( ! test_bit ( slab_index ( p , s , addr ) , map ) )
add_location ( t , s , get_track ( s , p , alloc ) ) ;
2007-05-07 01:49:45 +04:00
}
static int list_locations ( struct kmem_cache * s , char * buf ,
enum track_item alloc )
{
2008-02-01 02:20:50 +03:00
int len = 0 ;
2007-05-07 01:49:45 +04:00
unsigned long i ;
2007-07-17 15:03:20 +04:00
struct loc_track t = { 0 , 0 , NULL } ;
2007-05-07 01:49:45 +04:00
int node ;
2010-03-25 00:25:47 +03:00
unsigned long * map = kmalloc ( BITS_TO_LONGS ( oo_objects ( s - > max ) ) *
sizeof ( unsigned long ) , GFP_KERNEL ) ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2007-05-07 01:49:45 +04:00
2010-03-25 00:25:47 +03:00
if ( ! map | | ! alloc_loc_track ( & t , PAGE_SIZE / sizeof ( struct location ) ,
GFP_TEMPORARY ) ) {
kfree ( map ) ;
2007-07-17 15:03:20 +04:00
return sprintf ( buf , " Out of memory \n " ) ;
2010-03-25 00:25:47 +03:00
}
2007-05-07 01:49:45 +04:00
/* Push back cpu slabs */
flush_all ( s ) ;
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2007-05-07 01:49:45 +04:00
unsigned long flags ;
struct page * page ;
2007-08-23 01:01:56 +04:00
if ( ! atomic_long_read ( & n - > nr_slabs ) )
2007-05-07 01:49:45 +04:00
continue ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
list_for_each_entry ( page , & n - > partial , lru )
2010-03-25 00:25:47 +03:00
process_slab ( & t , s , page , alloc , map ) ;
2007-05-07 01:49:45 +04:00
list_for_each_entry ( page , & n - > full , lru )
2010-03-25 00:25:47 +03:00
process_slab ( & t , s , page , alloc , map ) ;
2007-05-07 01:49:45 +04:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
}
for ( i = 0 ; i < t . count ; i + + ) {
2007-05-09 13:32:45 +04:00
struct location * l = & t . loc [ i ] ;
2007-05-07 01:49:45 +04:00
2008-12-10 00:14:27 +03:00
if ( len > PAGE_SIZE - KSYM_SYMBOL_LEN - 100 )
2007-05-07 01:49:45 +04:00
break ;
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " %7ld " , l - > count ) ;
2007-05-09 13:32:45 +04:00
if ( l - > addr )
2011-01-14 02:45:52 +03:00
len + = sprintf ( buf + len , " %pS " , ( void * ) l - > addr ) ;
2007-05-07 01:49:45 +04:00
else
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " <not-available> " ) ;
2007-05-09 13:32:45 +04:00
if ( l - > sum_time ! = l - > min_time ) {
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " age=%ld/%ld/%ld " ,
2008-05-01 15:34:31 +04:00
l - > min_time ,
( long ) div_u64 ( l - > sum_time , l - > count ) ,
l - > max_time ) ;
2007-05-09 13:32:45 +04:00
} else
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " age=%ld " ,
2007-05-09 13:32:45 +04:00
l - > min_time ) ;
if ( l - > min_pid ! = l - > max_pid )
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " pid=%ld-%ld " ,
2007-05-09 13:32:45 +04:00
l - > min_pid , l - > max_pid ) ;
else
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " pid=%ld " ,
2007-05-09 13:32:45 +04:00
l - > min_pid ) ;
2009-01-01 02:42:29 +03:00
if ( num_online_cpus ( ) > 1 & &
! cpumask_empty ( to_cpumask ( l - > cpus ) ) & &
2015-02-14 01:37:59 +03:00
len < PAGE_SIZE - 60 )
len + = scnprintf ( buf + len , PAGE_SIZE - len - 50 ,
" cpus=%*pbl " ,
cpumask_pr_args ( to_cpumask ( l - > cpus ) ) ) ;
2007-05-09 13:32:45 +04:00
2009-06-17 02:32:15 +04:00
if ( nr_online_nodes > 1 & & ! nodes_empty ( l - > nodes ) & &
2015-02-14 01:37:59 +03:00
len < PAGE_SIZE - 60 )
len + = scnprintf ( buf + len , PAGE_SIZE - len - 50 ,
" nodes=%*pbl " ,
nodemask_pr_args ( & l - > nodes ) ) ;
2007-05-09 13:32:45 +04:00
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " \n " ) ;
2007-05-07 01:49:45 +04:00
}
free_loc_track ( & t ) ;
2010-03-25 00:25:47 +03:00
kfree ( map ) ;
2007-05-07 01:49:45 +04:00
if ( ! t . count )
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf , " No data \n " ) ;
return len ;
2007-05-07 01:49:45 +04:00
}
2010-10-05 22:57:26 +04:00
# endif
2007-05-07 01:49:45 +04:00
2010-10-05 22:57:27 +04:00
# ifdef SLUB_RESILIENCY_TEST
2014-08-07 03:04:16 +04:00
static void __init resiliency_test ( void )
2010-10-05 22:57:27 +04:00
{
u8 * p ;
2013-01-10 23:14:19 +04:00
BUILD_BUG_ON ( KMALLOC_MIN_SIZE > 16 | | KMALLOC_SHIFT_HIGH < 10 ) ;
2010-10-05 22:57:27 +04:00
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB resiliency testing \n " ) ;
pr_err ( " ----------------------- \n " ) ;
pr_err ( " A. Corruption after allocation \n " ) ;
2010-10-05 22:57:27 +04:00
p = kzalloc ( 16 , GFP_KERNEL ) ;
p [ 16 ] = 0x12 ;
2014-06-05 03:06:34 +04:00
pr_err ( " \n 1. kmalloc-16: Clobber Redzone/next pointer 0x12->0x%p \n \n " ,
p + 16 ) ;
2010-10-05 22:57:27 +04:00
validate_slab_cache ( kmalloc_caches [ 4 ] ) ;
/* Hmmm... The next two are dangerous */
p = kzalloc ( 32 , GFP_KERNEL ) ;
p [ 32 + sizeof ( void * ) ] = 0x34 ;
2014-06-05 03:06:34 +04:00
pr_err ( " \n 2. kmalloc-32: Clobber next pointer/next slab 0x34 -> -0x%p \n " ,
p ) ;
pr_err ( " If allocated object is overwritten then not detectable \n \n " ) ;
2010-10-05 22:57:27 +04:00
validate_slab_cache ( kmalloc_caches [ 5 ] ) ;
p = kzalloc ( 64 , GFP_KERNEL ) ;
p + = 64 + ( get_cycles ( ) & 0xff ) * sizeof ( void * ) ;
* p = 0x56 ;
2014-06-05 03:06:34 +04:00
pr_err ( " \n 3. kmalloc-64: corrupting random byte 0x56->0x%p \n " ,
p ) ;
pr_err ( " If allocated object is overwritten then not detectable \n \n " ) ;
2010-10-05 22:57:27 +04:00
validate_slab_cache ( kmalloc_caches [ 6 ] ) ;
2014-06-05 03:06:34 +04:00
pr_err ( " \n B. Corruption after free \n " ) ;
2010-10-05 22:57:27 +04:00
p = kzalloc ( 128 , GFP_KERNEL ) ;
kfree ( p ) ;
* p = 0x78 ;
2014-06-05 03:06:34 +04:00
pr_err ( " 1. kmalloc-128: Clobber first word 0x78->0x%p \n \n " , p ) ;
2010-10-05 22:57:27 +04:00
validate_slab_cache ( kmalloc_caches [ 7 ] ) ;
p = kzalloc ( 256 , GFP_KERNEL ) ;
kfree ( p ) ;
p [ 50 ] = 0x9a ;
2014-06-05 03:06:34 +04:00
pr_err ( " \n 2. kmalloc-256: Clobber 50th byte 0x9a->0x%p \n \n " , p ) ;
2010-10-05 22:57:27 +04:00
validate_slab_cache ( kmalloc_caches [ 8 ] ) ;
p = kzalloc ( 512 , GFP_KERNEL ) ;
kfree ( p ) ;
p [ 512 ] = 0xab ;
2014-06-05 03:06:34 +04:00
pr_err ( " \n 3. kmalloc-512: Clobber redzone 0xab->0x%p \n \n " , p ) ;
2010-10-05 22:57:27 +04:00
validate_slab_cache ( kmalloc_caches [ 9 ] ) ;
}
# else
# ifdef CONFIG_SYSFS
static void resiliency_test ( void ) { } ;
# endif
# endif
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SYSFS
2007-05-07 01:49:36 +04:00
enum slab_stat_type {
2008-04-14 20:11:40 +04:00
SL_ALL , /* All slabs */
SL_PARTIAL , /* Only partially allocated slabs */
SL_CPU , /* Only slabs used for cpu caches */
SL_OBJECTS , /* Determine allocated objects not slabs */
SL_TOTAL /* Determine object capacity not slabs */
2007-05-07 01:49:36 +04:00
} ;
2008-04-14 20:11:40 +04:00
# define SO_ALL (1 << SL_ALL)
2007-05-07 01:49:36 +04:00
# define SO_PARTIAL (1 << SL_PARTIAL)
# define SO_CPU (1 << SL_CPU)
# define SO_OBJECTS (1 << SL_OBJECTS)
2008-04-14 20:11:40 +04:00
# define SO_TOTAL (1 << SL_TOTAL)
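The enum values above are bit positions and the SO_ macros turn them into combinable flags. A minimal userspace sketch of how such flags select what gets counted per slab, mirroring the SO_TOTAL / SO_OBJECTS / default branches of show_slab_objects(), could look like this; all toy_ and TOY_ names are hypothetical.

```c
#include <assert.h>   /* only needed for the usage checks */

/* Bit positions, mirroring enum slab_stat_type. */
enum toy_stat_type {
	TOY_SL_ALL,
	TOY_SL_PARTIAL,
	TOY_SL_CPU,
	TOY_SL_OBJECTS,
	TOY_SL_TOTAL
};

#define TOY_SO_ALL     (1 << TOY_SL_ALL)
#define TOY_SO_PARTIAL (1 << TOY_SL_PARTIAL)
#define TOY_SO_CPU     (1 << TOY_SL_CPU)
#define TOY_SO_OBJECTS (1 << TOY_SL_OBJECTS)
#define TOY_SO_TOTAL   (1 << TOY_SL_TOTAL)

/* Hypothetical stand-in for the two struct page fields that matter here. */
struct toy_slab {
	int objects;  /* capacity of the slab */
	int inuse;    /* objects currently allocated */
};

/* What one slab contributes to the total for a given flag combination. */
static unsigned long toy_slab_weight(const struct toy_slab *slab,
				     unsigned long flags)
{
	if (flags & TOY_SO_TOTAL)
		return (unsigned long)slab->objects;  /* object capacity */
	if (flags & TOY_SO_OBJECTS)
		return (unsigned long)slab->inuse;    /* allocated objects */
	return 1;                                     /* count the slab itself */
}
```

The first group of flags (ALL, PARTIAL, CPU) picks which slabs are visited, while OBJECTS and TOTAL pick the unit being summed, which is why the two groups are freely combined.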
2007-05-07 01:49:36 +04:00
2008-03-02 23:28:24 +03:00
static ssize_t show_slab_objects ( struct kmem_cache * s ,
char * buf , unsigned long flags )
2007-05-07 01:49:36 +04:00
{
unsigned long total = 0 ;
int node ;
int x ;
unsigned long * nodes ;
2013-07-12 04:23:48 +04:00
nodes = kzalloc ( sizeof ( unsigned long ) * nr_node_ids , GFP_KERNEL ) ;
2008-03-02 23:28:24 +03:00
if ( ! nodes )
return - ENOMEM ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:40 +04:00
if ( flags & SO_CPU ) {
int cpu ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:40 +04:00
for_each_possible_cpu ( cpu ) {
2013-07-15 05:05:29 +04:00
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab ,
cpu ) ;
2012-05-09 19:09:56 +04:00
int node ;
2011-08-10 01:12:27 +04:00
struct page * page ;
2007-10-16 12:26:05 +04:00
2015-04-16 02:14:08 +03:00
page = READ_ONCE ( c - > page ) ;
2012-05-09 19:09:56 +04:00
if ( ! page )
continue ;
2008-04-14 20:11:40 +04:00
2012-05-09 19:09:56 +04:00
node = page_to_nid ( page ) ;
if ( flags & SO_TOTAL )
x = page - > objects ;
else if ( flags & SO_OBJECTS )
x = page - > inuse ;
else
x = 1 ;
2011-08-10 01:12:27 +04:00
2012-05-09 19:09:56 +04:00
total + = x ;
nodes [ node ] + = x ;
2015-04-16 02:14:08 +03:00
page = READ_ONCE ( c - > partial ) ;
2011-08-10 01:12:27 +04:00
if ( page ) {
2013-09-10 07:43:37 +04:00
node = page_to_nid ( page ) ;
if ( flags & SO_TOTAL )
WARN_ON_ONCE ( 1 ) ;
else if ( flags & SO_OBJECTS )
WARN_ON_ONCE ( 1 ) ;
else
x = page - > pages ;
2011-11-22 19:02:02 +04:00
total + = x ;
nodes [ node ] + = x ;
2011-08-10 01:12:27 +04:00
}
2007-05-07 01:49:36 +04:00
}
}
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
myself, because it used an rw semaphore for get/put_online_mems,
making them deadlock prone. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 03:07:18 +04:00
get_online_mems ( ) ;
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2008-04-14 20:11:40 +04:00
if ( flags & SO_ALL ) {
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
for_each_kmem_cache_node ( s , node , n ) {
2008-04-14 20:11:40 +04:00
2013-07-15 05:05:29 +04:00
if ( flags & SO_TOTAL )
x = atomic_long_read ( & n - > total_objects ) ;
else if ( flags & SO_OBJECTS )
x = atomic_long_read ( & n - > total_objects ) -
count_partial ( n , count_free ) ;
2007-05-07 01:49:36 +04:00
else
2008-04-14 20:11:40 +04:00
x = atomic_long_read ( & n - > nr_slabs ) ;
2007-05-07 01:49:36 +04:00
total + = x ;
nodes [ node ] + = x ;
}
2010-10-05 22:57:26 +04:00
} else
# endif
if ( flags & SO_PARTIAL ) {
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2007-05-07 01:49:36 +04:00
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2008-04-14 20:11:40 +04:00
if ( flags & SO_TOTAL )
x = count_partial ( n , count_total ) ;
else if ( flags & SO_OBJECTS )
x = count_partial ( n , count_inuse ) ;
2007-05-07 01:49:36 +04:00
else
2008-04-14 20:11:40 +04:00
x = n - > nr_partial ;
2007-05-07 01:49:36 +04:00
total + = x ;
nodes [ node ] + = x ;
}
}
x = sprintf ( buf , " %lu " , total ) ;
# ifdef CONFIG_NUMA
2014-08-07 03:04:09 +04:00
for ( node = 0 ; node < nr_node_ids ; node + + )
2007-05-07 01:49:36 +04:00
if ( nodes [ node ] )
x + = sprintf ( buf + x , " N%d=%lu " ,
node , nodes [ node ] ) ;
# endif
2014-06-05 03:07:18 +04:00
put_online_mems ( ) ;
2007-05-07 01:49:36 +04:00
kfree ( nodes ) ;
return x + sprintf ( buf + x , " \n " ) ;
}
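The function above emits one line per attribute: a total, followed by a " N<node>=<count>" entry for every node with a non-zero count. A minimal user-space sketch of a reader for that line format follows. The helper name parse_slab_objects and the MAX_NODES bound are hypothetical and exist only for this illustration; the exact spacing of the kernel's output is assumed to be "42 N0=30 N1=12".

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 64 /* hypothetical bound, chosen only for this sketch */

/*
 * Parse a line such as "42 N0=30 N1=12\n".
 * Stores the leading total in *total and each per-node count in per_node[].
 * Returns 0 on success, -1 if the total is missing or a node id is out of range.
 */
static int parse_slab_objects(const char *line, unsigned long *total,
                              unsigned long per_node[MAX_NODES])
{
    const char *p = line;
    char *end;

    assert(line && total && per_node);
    memset(per_node, 0, MAX_NODES * sizeof(per_node[0]));

    /* The line starts with the total across all nodes. */
    *total = strtoul(p, &end, 10);
    if (end == p)
        return -1;
    p = end;

    /* Zero or more " N<node>=<count>" entries follow. */
    for (;;) {
        int node, consumed;
        unsigned long count;

        if (sscanf(p, " N%d=%lu%n", &node, &count, &consumed) != 2)
            break;
        if (node < 0 || node >= MAX_NODES)
            return -1;
        per_node[node] = count;
        p += consumed;
    }
    return 0;
}
```

A monitoring tool could read, for example, the "objects" attribute of a cache into a buffer and pass it to this helper. Nodes that are absent from the line simply keep a count of zero, which matches the way the kernel skips empty nodes.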
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2007-05-07 01:49:36 +04:00
static int any_slab_objects ( struct kmem_cache * s )
{
int node ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2007-05-07 01:49:36 +04:00
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n )
2008-05-07 07:42:39 +04:00
if ( atomic_long_read ( & n - > total_objects ) )
2007-05-07 01:49:36 +04:00
return 1 ;
2014-08-07 03:04:09 +04:00
2007-05-07 01:49:36 +04:00
return 0 ;
}
2010-10-05 22:57:26 +04:00
# endif
2007-05-07 01:49:36 +04:00
# define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
2011-07-14 16:07:13 +04:00
# define to_slab(n) container_of(n, struct kmem_cache, kobj)
2007-05-07 01:49:36 +04:00
struct slab_attribute {
struct attribute attr ;
ssize_t ( * show ) ( struct kmem_cache * s , char * buf ) ;
ssize_t ( * store ) ( struct kmem_cache * s , const char * x , size_t count ) ;
} ;
# define SLAB_ATTR_RO(_name) \
mm: restrict access to slab files under procfs and sysfs
Historically /proc/slabinfo and files under /sys/kernel/slab/* have
world read permissions and are accessible to the world. slabinfo
contains rather private information related both to the kernel and
userspace tasks. Depending on the situation, it might reveal either
private information per se or information useful to make another
targeted attack. Some examples of what can be learned by
reading/watching for /proc/slabinfo entries:
1) dentry (and different *inode*) number might reveal other processes fs
activity. The number of dentry "active objects" doesn't strictly show
file count opened/touched by a process, however, there is a good
correlation between them. The patch "proc: force dcache drop on
unauthorized access" relies on the privacy of dentry count.
2) different inode entries might reveal the same information as (1), but
these are more fine-grained counters. If a filesystem is mounted in a
private mount point (or even a private namespace) and fs type differs from
other mounted fs types, fs activity in this mount point/namespace is
revealed. If there is a single ecryptfs mount point, the whole fs
activity of a single user is revealed. The number of files in an ecryptfs
mount point is private information per se.
3) fuse_* reveals number of files / fs activity of a user in a user
private mount point. It is approx. the same severity as ecryptfs
infoleak in (2).
4) sysfs_dir_cache similar to (2) reveals devices' addition/removal,
which can be otherwise hidden by "chmod 0700 /sys/". With 0444 slabinfo
the precise number of sysfs files is known to the world.
5) buffer_head might reveal some kernel activity. With other
information leaks an attacker might identify what specific kernel
routines generate buffer_head activity.
6) *kmalloc* infoleaks are very situational. Attacker should watch for
the specific kmalloc size entry and filter the noise related to the unrelated
kernel activity. If an attacker has relatively silent victim system, he
might get rather precise counters.
Additional information sources might significantly increase the slabinfo
infoleak benefits. E.g. if an attacker knows that the processes
activity on the system is very low (only core daemons like syslog and
cron), he may run setxid binaries / trigger local daemon activity /
trigger network services activity / await sporadic cron jobs activity
/ etc. and get rather precise counters for fs and network activity of
these privileged tasks, which is unknown otherwise.
Hiding slabinfo and /sys/kernel/slab/* is also one step toward complicating
exploitation of kernel heap overflows (and possibly, other bugs). The
related discussion:
http://thread.gmane.org/gmane.linux.kernel/1108378
To keep compatibility with the old permission model, where a non-root
monitoring daemon could watch for kernel memleaks through slabinfo, one
should do:
groupadd slabinfo
usermod -a -G slabinfo $MONITOR_USER
And add the following commands to init scripts (to mountall.conf in
Ubuntu's upstart case):
chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Reviewed-by: Kees Cook <kees@ubuntu.com>
Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: David Rientjes <rientjes@google.com>
CC: Valdis.Kletnieks@vt.edu
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Alan Cox <alan@linux.intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-27 21:54:53 +04:00
static struct slab_attribute _name # # _attr = \
__ATTR ( _name , 0400 , _name # # _show , NULL )
2007-05-07 01:49:36 +04:00
# define SLAB_ATTR(_name) \
static struct slab_attribute _name # # _attr = \
2011-09-27 21:54:53 +04:00
__ATTR ( _name , 0600 , _name # # _show , _name # # _store )
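The two macros above rely on token pasting: the attribute name is glued to "_attr", "_show" and "_store" to name the generated object and its callbacks. A minimal user-space sketch of the same pattern follows. The names demo_cache, demo_attribute and DEMO_ATTR_RO are hypothetical stand-ins, and the initializer is a plain struct initializer rather than the kernel's __ATTR macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct kmem_cache. */
struct demo_cache {
    int size;
};

/* Hypothetical stand-in for struct slab_attribute. */
struct demo_attribute {
    const char *name;
    unsigned short mode;
    int (*show)(struct demo_cache *s, char *buf);
    int (*store)(struct demo_cache *s, const char *buf, size_t count);
};

/* Read-only attribute: mode 0400, no store callback. */
#define DEMO_ATTR_RO(_name) \
    static struct demo_attribute _name##_attr = \
        { #_name, 0400, _name##_show, NULL }

static int slab_size_show(struct demo_cache *s, char *buf)
{
    assert(s && buf);
    return sprintf(buf, "%d\n", s->size);
}
DEMO_ATTR_RO(slab_size); /* defines slab_size_attr */
```

Writing DEMO_ATTR_RO(slab_size) therefore only requires that a function named slab_size_show exists; the macro derives every other identifier from the attribute name.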
2007-05-07 01:49:36 +04:00
static ssize_t slab_size_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > size ) ;
}
SLAB_ATTR_RO ( slab_size ) ;
static ssize_t align_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > align ) ;
}
SLAB_ATTR_RO ( align ) ;
static ssize_t object_size_show ( struct kmem_cache * s , char * buf )
{
2012-06-13 19:24:57 +04:00
return sprintf ( buf , " %d \n " , s - > object_size ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( object_size ) ;
static ssize_t objs_per_slab_show ( struct kmem_cache * s , char * buf )
{
2008-04-14 20:11:31 +04:00
return sprintf ( buf , " %d \n " , oo_objects ( s - > oo ) ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( objs_per_slab ) ;
2008-04-14 20:11:41 +04:00
static ssize_t order_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
2008-04-30 03:11:12 +04:00
unsigned long order ;
int err ;
2013-09-12 01:20:25 +04:00
err = kstrtoul ( buf , 10 , & order ) ;
2008-04-30 03:11:12 +04:00
if ( err )
return err ;
2008-04-14 20:11:41 +04:00
if ( order > slub_max_order | | order < slub_min_order )
return - EINVAL ;
calculate_sizes ( s , order ) ;
return length ;
}
2007-05-07 01:49:36 +04:00
static ssize_t order_show ( struct kmem_cache * s , char * buf )
{
2008-04-14 20:11:31 +04:00
return sprintf ( buf , " %d \n " , oo_order ( s - > oo ) ) ;
2007-05-07 01:49:36 +04:00
}
2008-04-14 20:11:41 +04:00
SLAB_ATTR ( order ) ;
2007-05-07 01:49:36 +04:00
slub: add min_partial sysfs tunable
Now that a cache's min_partial has been moved to struct kmem_cache, it's
possible to easily tune it from userspace by adding a sysfs attribute.
It may not be desirable to keep a large number of partial slabs around
if a cache is used infrequently and memory, especially when constrained
by a cgroup, is scarce. It's better to allow userspace to set the
minimum policy per cache instead of relying explicitly on
kmem_cache_shrink().
The memory savings from simply moving min_partial from struct
kmem_cache_node to struct kmem_cache is obviously not significant
(unless maybe you're from SGI or something), at the largest it's
# allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)
The true savings occur when userspace reduces the number of partial
slabs that would otherwise be wasted, especially on machines with a
large number of nodes (ia64 with CONFIG_NODES_SHIFT at 10 for default?).
However well the kernel estimates ideal values for n->min_partial and
keeps them within a sane range, userspace has no input other
than writing to /sys/kernel/slab/cache/shrink.
There simply isn't any better heuristic to add when calculating the
partial values for a better estimate that works for all possible caches.
And since it's currently a static value, the user really has no way of
reclaiming that wasted space, which can be significant when constrained
by a cgroup (either cpusets or, later, memory controller slab limits)
without shrinking it entirely.
This also allows the user to specify that increased fragmentation and
more partial slabs are actually desired to avoid the cost of allocating
new slabs at runtime for specific caches.
There's also no reason why this should be a per-struct kmem_cache_node
value in the first place. You could argue that a machine would have
such node size asymmetries that it should be specified on a per-node
basis, but we know nobody is doing that right now since it's a purely
static value at the moment and there's no convenient way to tune that
via slub's sysfs interface.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
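The upper bound quoted in the message above, caches * (MAX_NUMNODES - 1) * sizeof(unsigned long), can be worked through with a small sketch. The helper name min_partial_savings is hypothetical, and MAX_NUMNODES is assumed to be 1 << CONFIG_NODES_SHIFT.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Upper bound on the bytes saved by keeping one min_partial per cache
 * instead of one per node: every cache drops (MAX_NUMNODES - 1) copies
 * of an unsigned long.
 */
static size_t min_partial_savings(size_t nr_caches, unsigned int nodes_shift)
{
    size_t max_numnodes;

    assert(nodes_shift < sizeof(size_t) * CHAR_BIT);
    max_numnodes = (size_t)1 << nodes_shift; /* assumed MAX_NUMNODES */
    return nr_caches * (max_numnodes - 1) * sizeof(unsigned long);
}
```

With a hypothetical 100 caches and a node shift of 10, a build where unsigned long is 8 bytes gives 100 * 1023 * 8 = 818,400 bytes, which supports the message's point that the structural saving is small compared with what tuning min_partial itself can reclaim.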
2009-02-23 04:40:09 +03:00
static ssize_t min_partial_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %lu \n " , s - > min_partial ) ;
}
static ssize_t min_partial_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
unsigned long min ;
int err ;
2013-09-12 01:20:25 +04:00
err = kstrtoul ( buf , 10 , & min ) ;
2009-02-23 04:40:09 +03:00
if ( err )
return err ;
2009-02-25 10:16:35 +03:00
set_min_partial ( s , min ) ;
2009-02-23 04:40:09 +03:00
return length ;
}
SLAB_ATTR ( min_partial ) ;
2011-08-10 01:12:27 +04:00
static ssize_t cpu_partial_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %u \n " , s - > cpu_partial ) ;
}
static ssize_t cpu_partial_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
unsigned long objects ;
int err ;
2013-09-12 01:20:25 +04:00
err = kstrtoul ( buf , 10 , & objects ) ;
2011-08-10 01:12:27 +04:00
if ( err )
return err ;
2013-06-19 09:05:52 +04:00
if ( objects & & ! kmem_cache_has_cpu_partial ( s ) )
2012-01-10 01:19:45 +04:00
return - EINVAL ;
2011-08-10 01:12:27 +04:00
s - > cpu_partial = objects ;
flush_all ( s ) ;
return length ;
}
SLAB_ATTR ( cpu_partial ) ;
2007-05-07 01:49:36 +04:00
static ssize_t ctor_show ( struct kmem_cache * s , char * buf )
{
2011-01-14 02:45:52 +03:00
if ( ! s - > ctor )
return 0 ;
return sprintf ( buf , " %pS \n " , s - > ctor ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( ctor ) ;
static ssize_t aliases_show ( struct kmem_cache * s , char * buf )
{
2014-08-07 03:04:51 +04:00
return sprintf ( buf , " %d \n " , s - > refcount < 0 ? 0 : s - > refcount - 1 ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( aliases ) ;
static ssize_t partial_show ( struct kmem_cache * s , char * buf )
{
2008-02-16 02:22:21 +03:00
return show_slab_objects ( s , buf , SO_PARTIAL ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( partial ) ;
static ssize_t cpu_slabs_show ( struct kmem_cache * s , char * buf )
{
2008-02-16 02:22:21 +03:00
return show_slab_objects ( s , buf , SO_CPU ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( cpu_slabs ) ;
static ssize_t objects_show ( struct kmem_cache * s , char * buf )
{
2008-04-14 20:11:40 +04:00
return show_slab_objects ( s , buf , SO_ALL | SO_OBJECTS ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( objects ) ;
2008-04-14 20:11:40 +04:00
static ssize_t objects_partial_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_PARTIAL | SO_OBJECTS ) ;
}
SLAB_ATTR_RO ( objects_partial ) ;
2011-08-10 01:12:27 +04:00
static ssize_t slabs_cpu_partial_show ( struct kmem_cache * s , char * buf )
{
int objects = 0 ;
int pages = 0 ;
int cpu ;
int len ;
for_each_online_cpu ( cpu ) {
struct page * page = per_cpu_ptr ( s - > cpu_slab , cpu ) - > partial ;
if ( page ) {
pages + = page - > pages ;
objects + = page - > pobjects ;
}
}
len = sprintf ( buf , " %d(%d) " , objects , pages ) ;
# ifdef CONFIG_SMP
for_each_online_cpu ( cpu ) {
struct page * page = per_cpu_ptr ( s - > cpu_slab , cpu ) - > partial ;
if ( page & & len < PAGE_SIZE - 20 )
len + = sprintf ( buf + len , " C%d=%d(%d) " , cpu ,
page - > pobjects , page - > pages ) ;
}
# endif
return len + sprintf ( buf + len , " \n " ) ;
}
SLAB_ATTR_RO ( slabs_cpu_partial ) ;
2010-10-05 22:57:27 +04:00
static ssize_t reclaim_account_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_RECLAIM_ACCOUNT ) ) ;
}
static ssize_t reclaim_account_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
s - > flags & = ~ SLAB_RECLAIM_ACCOUNT ;
if ( buf [ 0 ] = = ' 1 ' )
s - > flags | = SLAB_RECLAIM_ACCOUNT ;
return length ;
}
SLAB_ATTR ( reclaim_account ) ;
static ssize_t hwcache_align_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_HWCACHE_ALIGN ) ) ;
}
SLAB_ATTR_RO ( hwcache_align ) ;
# ifdef CONFIG_ZONE_DMA
static ssize_t cache_dma_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_CACHE_DMA ) ) ;
}
SLAB_ATTR_RO ( cache_dma ) ;
# endif
static ssize_t destroy_by_rcu_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_DESTROY_BY_RCU ) ) ;
}
SLAB_ATTR_RO ( destroy_by_rcu ) ;
2011-03-10 10:21:48 +03:00
static ssize_t reserved_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > reserved ) ;
}
SLAB_ATTR_RO ( reserved ) ;
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-10-05 22:57:27 +04:00
static ssize_t slabs_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_ALL ) ;
}
SLAB_ATTR_RO ( slabs ) ;
2008-04-14 20:11:40 +04:00
static ssize_t total_objects_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_ALL | SO_TOTAL ) ;
}
SLAB_ATTR_RO ( total_objects ) ;
2007-05-07 01:49:36 +04:00
static ssize_t sanity_checks_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_DEBUG_FREE ) ) ;
}
static ssize_t sanity_checks_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
s - > flags & = ~ SLAB_DEBUG_FREE ;
2011-06-01 21:25:49 +04:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-07 01:49:36 +04:00
s - > flags | = SLAB_DEBUG_FREE ;
2011-06-01 21:25:49 +04:00
}
2007-05-07 01:49:36 +04:00
return length ;
}
SLAB_ATTR ( sanity_checks ) ;
static ssize_t trace_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_TRACE ) ) ;
}
static ssize_t trace_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
2014-10-10 02:26:11 +04:00
/*
* Tracing a merged cache is going to give confusing results
* as well as cause other issues like converting a mergeable
* cache into an unmergeable one .
*/
if ( s - > refcount > 1 )
return - EINVAL ;
2007-05-07 01:49:36 +04:00
s - > flags & = ~ SLAB_TRACE ;
2011-06-01 21:25:49 +04:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-07 01:49:36 +04:00
s - > flags | = SLAB_TRACE ;
2011-06-01 21:25:49 +04:00
}
2007-05-07 01:49:36 +04:00
return length ;
}
SLAB_ATTR ( trace ) ;
static ssize_t red_zone_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_RED_ZONE ) ) ;
}
static ssize_t red_zone_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
if ( any_slab_objects ( s ) )
return - EBUSY ;
s - > flags & = ~ SLAB_RED_ZONE ;
2011-06-01 21:25:49 +04:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-07 01:49:36 +04:00
s - > flags | = SLAB_RED_ZONE ;
2011-06-01 21:25:49 +04:00
}
2008-04-14 20:11:41 +04:00
calculate_sizes ( s , - 1 ) ;
2007-05-07 01:49:36 +04:00
return length ;
}
SLAB_ATTR ( red_zone ) ;
static ssize_t poison_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_POISON ) ) ;
}
static ssize_t poison_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
if ( any_slab_objects ( s ) )
return - EBUSY ;
s - > flags & = ~ SLAB_POISON ;
2011-06-01 21:25:49 +04:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-07 01:49:36 +04:00
s - > flags | = SLAB_POISON ;
2011-06-01 21:25:49 +04:00
}
2008-04-14 20:11:41 +04:00
calculate_sizes ( s , - 1 ) ;
2007-05-07 01:49:36 +04:00
return length ;
}
SLAB_ATTR ( poison ) ;
static ssize_t store_user_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_STORE_USER ) ) ;
}
static ssize_t store_user_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
if ( any_slab_objects ( s ) )
return - EBUSY ;
s - > flags & = ~ SLAB_STORE_USER ;
2011-06-01 21:25:49 +04:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-07 01:49:36 +04:00
s - > flags | = SLAB_STORE_USER ;
2011-06-01 21:25:49 +04:00
}
2008-04-14 20:11:41 +04:00
calculate_sizes ( s , - 1 ) ;
2007-05-07 01:49:36 +04:00
return length ;
}
SLAB_ATTR ( store_user ) ;
2007-05-07 01:49:43 +04:00
static ssize_t validate_show ( struct kmem_cache * s , char * buf )
{
return 0 ;
}
static ssize_t validate_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
2007-07-17 15:03:30 +04:00
int ret = - EINVAL ;
if ( buf [ 0 ] = = ' 1 ' ) {
ret = validate_slab_cache ( s ) ;
if ( ret > = 0 )
ret = length ;
}
return ret ;
2007-05-07 01:49:43 +04:00
}
SLAB_ATTR ( validate ) ;
2010-10-05 22:57:27 +04:00
static ssize_t alloc_calls_show ( struct kmem_cache * s , char * buf )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return - ENOSYS ;
return list_locations ( s , buf , TRACK_ALLOC ) ;
}
SLAB_ATTR_RO ( alloc_calls ) ;
static ssize_t free_calls_show ( struct kmem_cache * s , char * buf )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return - ENOSYS ;
return list_locations ( s , buf , TRACK_FREE ) ;
}
SLAB_ATTR_RO ( free_calls ) ;
# endif /* CONFIG_SLUB_DEBUG */
# ifdef CONFIG_FAILSLAB
static ssize_t failslab_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_FAILSLAB ) ) ;
}
static ssize_t failslab_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
2014-10-10 02:26:11 +04:00
if ( s - > refcount > 1 )
return - EINVAL ;
2010-10-05 22:57:27 +04:00
s - > flags & = ~ SLAB_FAILSLAB ;
if ( buf [ 0 ] = = ' 1 ' )
s - > flags | = SLAB_FAILSLAB ;
return length ;
}
SLAB_ATTR ( failslab ) ;
2010-10-05 22:57:26 +04:00
# endif
2007-05-07 01:49:43 +04:00
2007-05-07 01:49:46 +04:00
static ssize_t shrink_show ( struct kmem_cache * s , char * buf )
{
return 0 ;
}
static ssize_t shrink_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
2015-02-13 01:59:41 +03:00
if ( buf [ 0 ] = = ' 1 ' )
kmem_cache_shrink ( s ) ;
else
2007-05-07 01:49:46 +04:00
return - EINVAL ;
return length ;
}
SLAB_ATTR ( shrink ) ;
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_NUMA
2008-01-08 10:20:26 +03:00
static ssize_t remote_node_defrag_ratio_show ( struct kmem_cache * s , char * buf )
2007-05-07 01:49:36 +04:00
{
2008-01-08 10:20:26 +03:00
return sprintf ( buf , " %d \n " , s - > remote_node_defrag_ratio / 10 ) ;
2007-05-07 01:49:36 +04:00
}
2008-01-08 10:20:26 +03:00
static ssize_t remote_node_defrag_ratio_store ( struct kmem_cache * s ,
2007-05-07 01:49:36 +04:00
const char * buf , size_t length )
{
2008-04-30 03:11:12 +04:00
unsigned long ratio ;
int err ;
2013-09-12 01:20:25 +04:00
err = kstrtoul ( buf , 10 , & ratio ) ;
2008-04-30 03:11:12 +04:00
if ( err )
return err ;
2008-08-19 17:51:22 +04:00
if ( ratio < = 100 )
2008-04-30 03:11:12 +04:00
s - > remote_node_defrag_ratio = ratio * 10 ;
2007-05-07 01:49:36 +04:00
return length ;
}
2008-01-08 10:20:26 +03:00
SLAB_ATTR ( remote_node_defrag_ratio ) ;
2007-05-07 01:49:36 +04:00
# endif
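The remote_node_defrag_ratio handlers above store the percentage scaled by ten and silently ignore values above 100. A minimal userspace sketch of that round trip follows. The struct `ratio_sketch` and the helpers `ratio_store` and `ratio_show` are hypothetical names, and `strtoul` only approximates the kernel's `kstrtoul`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the single kmem_cache field used here. */
struct ratio_sketch {
	unsigned int remote_node_defrag_ratio;	/* kept as percent * 10 */
};

/*
 * Parse a decimal percentage. As in the handler above, a value
 * above 100 is not an error: it is simply not applied.
 */
static long ratio_store(struct ratio_sketch *s, const char *buf, size_t length)
{
	char *end;
	unsigned long ratio;

	if (buf[0] < '0' || buf[0] > '9')
		return -EINVAL;

	errno = 0;
	ratio = strtoul(buf, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;
	if (*end == '\n')		/* tolerate one trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (ratio <= 100)
		s->remote_node_defrag_ratio = ratio * 10;
	return (long)length;
}

/* Report the value back as a plain percentage. */
static int ratio_show(const struct ratio_sketch *s, char *buf, size_t size)
{
	return snprintf(buf, size, "%u\n", s->remote_node_defrag_ratio / 10);
}
```

Writing "25" stores 250 internally and reads back as "25". Writing "150" returns success but leaves the stored value unchanged, which is the behaviour a caller of the real attribute should probably expect as well.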
2008-02-08 04:47:41 +03:00
# ifdef CONFIG_SLUB_STATS
static int show_stat ( struct kmem_cache * s , char * buf , enum stat_item si )
{
unsigned long sum = 0 ;
int cpu ;
int len ;
int * data = kmalloc ( nr_cpu_ids * sizeof ( int ) , GFP_KERNEL ) ;
if ( ! data )
return - ENOMEM ;
for_each_online_cpu ( cpu ) {
2009-12-19 01:26:20 +03:00
unsigned x = per_cpu_ptr ( s - > cpu_slab , cpu ) - > stat [ si ] ;
2008-02-08 04:47:41 +03:00
data [ cpu ] = x ;
sum + = x ;
}
len = sprintf ( buf , " %lu " , sum ) ;
2008-04-14 19:52:05 +04:00
# ifdef CONFIG_SMP
2008-02-08 04:47:41 +03:00
for_each_online_cpu ( cpu ) {
if ( data [ cpu ] & & len < PAGE_SIZE - 20 )
2008-04-14 19:52:05 +04:00
len + = sprintf ( buf + len , " C%d=%u " , cpu , data [ cpu ] ) ;
2008-02-08 04:47:41 +03:00
}
2008-04-14 19:52:05 +04:00
# endif
2008-02-08 04:47:41 +03:00
kfree ( data ) ;
return len + sprintf ( buf + len , " \n " ) ;
}
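The output format of show_stat() is a total followed by one " C<cpu>=<count>" entry per CPU with a nonzero count. A minimal userspace sketch of that formatting follows. It assumes a plain array in place of the per-CPU data, and `format_stat` and `SKETCH_BUF_SIZE` are hypothetical names (the latter standing in for PAGE_SIZE).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_BUF_SIZE 4096	/* assumption: stands in for PAGE_SIZE */

/*
 * Write "<sum> C<cpu>=<count> ...\n" into buf and return its length.
 * CPUs with a zero count are skipped, and per-CPU entries stop once
 * fewer than 20 bytes of headroom remain, as in show_stat().
 */
static int format_stat(char *buf, const unsigned int *per_cpu, int ncpus)
{
	unsigned long sum = 0;
	int cpu;
	int len;

	for (cpu = 0; cpu < ncpus; cpu++)
		sum += per_cpu[cpu];

	len = sprintf(buf, "%lu", sum);

	for (cpu = 0; cpu < ncpus; cpu++) {
		if (per_cpu[cpu] && len < SKETCH_BUF_SIZE - 20)
			len += sprintf(buf + len, " C%d=%u", cpu, per_cpu[cpu]);
	}

	return len + sprintf(buf + len, "\n");
}
```

The 20-byte headroom check bounds each appended entry, so the total always fits in the buffer even when many CPUs report nonzero counts.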
2009-10-15 13:20:22 +04:00
static void clear_stat ( struct kmem_cache * s , enum stat_item si )
{
int cpu ;
for_each_online_cpu ( cpu )
2009-12-19 01:26:20 +03:00
per_cpu_ptr ( s - > cpu_slab , cpu ) - > stat [ si ] = 0 ;
2009-10-15 13:20:22 +04:00
}
2008-02-08 04:47:41 +03:00
# define STAT_ATTR(si, text) \
static ssize_t text # # _show ( struct kmem_cache * s , char * buf ) \
{ \
return show_stat ( s , buf , si ) ; \
} \
2009-10-15 13:20:22 +04:00
static ssize_t text # # _store ( struct kmem_cache * s , \
const char * buf , size_t length ) \
{ \
if ( buf [ 0 ] ! = ' 0 ' ) \
return - EINVAL ; \
clear_stat ( s , si ) ; \
return length ; \
} \
SLAB_ATTR ( text ) ; \
2008-02-08 04:47:41 +03:00
STAT_ATTR ( ALLOC_FASTPATH , alloc_fastpath ) ;
STAT_ATTR ( ALLOC_SLOWPATH , alloc_slowpath ) ;
STAT_ATTR ( FREE_FASTPATH , free_fastpath ) ;
STAT_ATTR ( FREE_SLOWPATH , free_slowpath ) ;
STAT_ATTR ( FREE_FROZEN , free_frozen ) ;
STAT_ATTR ( FREE_ADD_PARTIAL , free_add_partial ) ;
STAT_ATTR ( FREE_REMOVE_PARTIAL , free_remove_partial ) ;
STAT_ATTR ( ALLOC_FROM_PARTIAL , alloc_from_partial ) ;
STAT_ATTR ( ALLOC_SLAB , alloc_slab ) ;
STAT_ATTR ( ALLOC_REFILL , alloc_refill ) ;
2011-06-01 21:25:57 +04:00
STAT_ATTR ( ALLOC_NODE_MISMATCH , alloc_node_mismatch ) ;
2008-02-08 04:47:41 +03:00
STAT_ATTR ( FREE_SLAB , free_slab ) ;
STAT_ATTR ( CPUSLAB_FLUSH , cpuslab_flush ) ;
STAT_ATTR ( DEACTIVATE_FULL , deactivate_full ) ;
STAT_ATTR ( DEACTIVATE_EMPTY , deactivate_empty ) ;
STAT_ATTR ( DEACTIVATE_TO_HEAD , deactivate_to_head ) ;
STAT_ATTR ( DEACTIVATE_TO_TAIL , deactivate_to_tail ) ;
STAT_ATTR ( DEACTIVATE_REMOTE_FREES , deactivate_remote_frees ) ;
2011-06-01 21:25:58 +04:00
STAT_ATTR ( DEACTIVATE_BYPASS , deactivate_bypass ) ;
2008-04-14 20:11:40 +04:00
STAT_ATTR ( ORDER_FALLBACK , order_fallback ) ;
2011-06-01 21:25:49 +04:00
STAT_ATTR ( CMPXCHG_DOUBLE_CPU_FAIL , cmpxchg_double_cpu_fail ) ;
STAT_ATTR ( CMPXCHG_DOUBLE_FAIL , cmpxchg_double_fail ) ;
2011-08-10 01:12:27 +04:00
STAT_ATTR ( CPU_PARTIAL_ALLOC , cpu_partial_alloc ) ;
STAT_ATTR ( CPU_PARTIAL_FREE , cpu_partial_free ) ;
2012-02-03 19:34:56 +04:00
STAT_ATTR ( CPU_PARTIAL_NODE , cpu_partial_node ) ;
STAT_ATTR ( CPU_PARTIAL_DRAIN , cpu_partial_drain ) ;
2008-02-08 04:47:41 +03:00
# endif
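STAT_ATTR relies on preprocessor token pasting to stamp out one show/store pair per counter, where the store handler only accepts "0" and resets the counter. A minimal userspace sketch of the same pattern follows. `STAT_ATTR_SKETCH` and `struct stats_sketch` are hypothetical, and the counter is a single array rather than per-CPU data.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

enum stat_item_sketch { ALLOC_FASTPATH, FREE_FASTPATH, NR_STAT_ITEMS_SKETCH };

/* Hypothetical cache with one counter per statistic (not per CPU). */
struct stats_sketch {
	unsigned long stat[NR_STAT_ITEMS_SKETCH];
};

/*
 * Generate text_show() and text_store() for statistic si.
 * Only a leading '0' is accepted by the store side; it clears the counter.
 */
#define STAT_ATTR_SKETCH(si, text)					\
static int text##_show(const struct stats_sketch *s, char *buf)	\
{									\
	return sprintf(buf, "%lu\n", s->stat[si]);			\
}									\
static long text##_store(struct stats_sketch *s,			\
			 const char *buf, long length)			\
{									\
	if (buf[0] != '0')						\
		return -EINVAL;						\
	s->stat[si] = 0;						\
	return length;							\
}

STAT_ATTR_SKETCH(ALLOC_FASTPATH, alloc_fastpath)
STAT_ATTR_SKETCH(FREE_FASTPATH, free_fastpath)
```

Each invocation expands to two functions named after the second argument, so adding a statistic costs one enum entry and one macro line.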
2008-01-08 10:20:27 +03:00
static struct attribute * slab_attrs [ ] = {
2007-05-07 01:49:36 +04:00
& slab_size_attr . attr ,
& object_size_attr . attr ,
& objs_per_slab_attr . attr ,
& order_attr . attr ,
slub: add min_partial sysfs tunable
Now that a cache's min_partial has been moved to struct kmem_cache, it's
possible to easily tune it from userspace by adding a sysfs attribute.
It may not be desirable to keep a large number of partial slabs around
if a cache is used infrequently and memory, especially when constrained
by a cgroup, is scarce. It's better to allow userspace to set the
minimum policy per cache instead of relying explicitly on
kmem_cache_shrink().
The memory savings from simply moving min_partial from struct
kmem_cache_node to struct kmem_cache is obviously not significant
(unless maybe you're from SGI or something), at the largest it's
# allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)
The true savings occur when userspace reduces the number of partial
slabs that would otherwise be wasted, especially on machines with a
large number of nodes (ia64 with CONFIG_NODES_SHIFT at 10 for default?).
Although the kernel estimates ideal values for n->min_partial and
ensures they are within a sane range, userspace has no input other
than writing to /sys/kernel/slab/cache/shrink.
There simply isn't any better heuristic to add when calculating the
partial values for a better estimate that works for all possible caches.
And since it's currently a static value, the user really has no way of
reclaiming that wasted space, which can be significant when constrained
by a cgroup (either cpusets or, later, memory controller slab limits)
without shrinking it entirely.
This also allows the user to specify that increased fragmentation and
more partial slabs are actually desired to avoid the cost of allocating
new slabs at runtime for specific caches.
There's also no reason why this should be a per-struct kmem_cache_node
value in the first place. You could argue that a machine would have
such node size asymmetries that it should be specified on a per-node
basis, but we know nobody is doing that right now since it's a purely
static value at the moment and there's no convenient way to tune that
via slub's sysfs interface.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-02-23 04:40:09 +03:00
& min_partial_attr . attr ,
2011-08-10 01:12:27 +04:00
& cpu_partial_attr . attr ,
2007-05-07 01:49:36 +04:00
& objects_attr . attr ,
2008-04-14 20:11:40 +04:00
& objects_partial_attr . attr ,
2007-05-07 01:49:36 +04:00
& partial_attr . attr ,
& cpu_slabs_attr . attr ,
& ctor_attr . attr ,
& aliases_attr . attr ,
& align_attr . attr ,
& hwcache_align_attr . attr ,
& reclaim_account_attr . attr ,
& destroy_by_rcu_attr . attr ,
2010-10-05 22:57:27 +04:00
& shrink_attr . attr ,
2011-03-10 10:21:48 +03:00
& reserved_attr . attr ,
2011-08-10 01:12:27 +04:00
& slabs_cpu_partial_attr . attr ,
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-10-05 22:57:27 +04:00
& total_objects_attr . attr ,
& slabs_attr . attr ,
& sanity_checks_attr . attr ,
& trace_attr . attr ,
2007-05-07 01:49:36 +04:00
& red_zone_attr . attr ,
& poison_attr . attr ,
& store_user_attr . attr ,
2007-05-07 01:49:43 +04:00
& validate_attr . attr ,
2007-05-07 01:49:45 +04:00
& alloc_calls_attr . attr ,
& free_calls_attr . attr ,
2010-10-05 22:57:26 +04:00
# endif
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_ZONE_DMA
& cache_dma_attr . attr ,
# endif
# ifdef CONFIG_NUMA
2008-01-08 10:20:26 +03:00
& remote_node_defrag_ratio_attr . attr ,
2008-02-08 04:47:41 +03:00
# endif
# ifdef CONFIG_SLUB_STATS
& alloc_fastpath_attr . attr ,
& alloc_slowpath_attr . attr ,
& free_fastpath_attr . attr ,
& free_slowpath_attr . attr ,
& free_frozen_attr . attr ,
& free_add_partial_attr . attr ,
& free_remove_partial_attr . attr ,
& alloc_from_partial_attr . attr ,
& alloc_slab_attr . attr ,
& alloc_refill_attr . attr ,
2011-06-01 21:25:57 +04:00
& alloc_node_mismatch_attr . attr ,
2008-02-08 04:47:41 +03:00
& free_slab_attr . attr ,
& cpuslab_flush_attr . attr ,
& deactivate_full_attr . attr ,
& deactivate_empty_attr . attr ,
& deactivate_to_head_attr . attr ,
& deactivate_to_tail_attr . attr ,
& deactivate_remote_frees_attr . attr ,
2011-06-01 21:25:58 +04:00
& deactivate_bypass_attr . attr ,
2008-04-14 20:11:40 +04:00
& order_fallback_attr . attr ,
2011-06-01 21:25:49 +04:00
& cmpxchg_double_fail_attr . attr ,
& cmpxchg_double_cpu_fail_attr . attr ,
2011-08-10 01:12:27 +04:00
& cpu_partial_alloc_attr . attr ,
& cpu_partial_free_attr . attr ,
2012-02-03 19:34:56 +04:00
& cpu_partial_node_attr . attr ,
& cpu_partial_drain_attr . attr ,
2007-05-07 01:49:36 +04:00
# endif
2010-02-26 09:36:12 +03:00
# ifdef CONFIG_FAILSLAB
& failslab_attr . attr ,
# endif
2007-05-07 01:49:36 +04:00
NULL
} ;
static struct attribute_group slab_attr_group = {
. attrs = slab_attrs ,
} ;
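The min_partial commit message quoted in the attribute list above bounds the structural saving at "# allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)". A minimal sketch of that arithmetic follows. The helper name `min_partial_savings` is hypothetical, and it assumes MAX_NUMNODES is 1 << CONFIG_NODES_SHIFT.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Upper bound, in bytes, on the memory saved by keeping min_partial once
 * per cache instead of once per node: one unsigned long is retained per
 * cache, and the remaining (MAX_NUMNODES - 1) copies are dropped.
 */
static size_t min_partial_savings(size_t nr_caches, unsigned int nodes_shift)
{
	size_t max_numnodes = (size_t)1 << nodes_shift;

	return nr_caches * (max_numnodes - 1) * sizeof(unsigned long);
}
```

With a nodes shift of 10 the per-cache factor is 1023 unsigned longs, which illustrates why the message calls the structural saving insignificant next to the partial slabs that userspace can reclaim by lowering the tunable.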
static ssize_t slab_attr_show ( struct kobject * kobj ,
struct attribute * attr ,
char * buf )
{
struct slab_attribute * attribute ;
struct kmem_cache * s ;
int err ;
attribute = to_slab_attr ( attr ) ;
s = to_slab ( kobj ) ;
if ( ! attribute - > show )
return - EIO ;
err = attribute - > show ( s , buf ) ;
return err ;
}
static ssize_t slab_attr_store ( struct kobject * kobj ,
struct attribute * attr ,
const char * buf , size_t len )
{
struct slab_attribute * attribute ;
struct kmem_cache * s ;
int err ;
attribute = to_slab_attr ( attr ) ;
s = to_slab ( kobj ) ;
if ( ! attribute - > store )
return - EIO ;
err = attribute - > store ( s , buf , len ) ;
slub: slub-specific propagation changes
SLUB allows us to tune a particular cache behavior with sysfs-based
tunables. When creating a new memcg cache copy, we'd like to preserve any
tunables the parent cache already had.
This can be done by tapping into the store attribute function provided by
the allocator. We of course don't need to mess with read-only fields.
Since the attributes can have multiple types and are stored internally by
sysfs, the best strategy is to issue a ->show() in the root cache, and
then ->store() in the memcg cache.
The drawback is that sysfs can allocate up to a page of buffering for
show(), which we are unlikely to need but also can't rule out. To
avoid always allocating a page for that, we can update the caches at store
time with the maximum attribute size ever stored to the root cache. We
will then get a buffer big enough to hold it. The corollary is
that if no stores happened, nothing will be propagated.
It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.
[akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 02:23:05 +04:00
# ifdef CONFIG_MEMCG_KMEM
if ( slab_state > = FULL & & err > = 0 & & is_root_cache ( s ) ) {
2015-02-13 01:59:23 +03:00
struct kmem_cache * c ;
2007-05-07 01:49:36 +04:00
2012-12-19 02:23:05 +04:00
mutex_lock ( & slab_mutex ) ;
if ( s - > max_attr_size < len )
s - > max_attr_size = len ;
2012-12-19 02:23:10 +04:00
/*
* This is a best effort propagation , so this function ' s return
* value will be determined by the parent cache only . This is
* basically because not all attributes will have well
* defined semantics for rollbacks - most of the actions will
* have permanent effects .
*
* Returning the error value of any of the children that fail
* is not 100 % defined , in the sense that users seeing the
* error code won ' t be able to know anything about the state of
* the cache .
*
* Only returning the error code for the parent cache at least
* has well defined semantics . The cache being written to
* directly either failed or succeeded , in which case we loop
* through the descendants with best - effort propagation .
*/
2015-02-13 01:59:23 +03:00
for_each_memcg_cache ( c , s )
attribute - > store ( c , buf , len ) ;
2012-12-19 02:23:05 +04:00
mutex_unlock ( & slab_mutex ) ;
}
# endif
2007-05-07 01:49:36 +04:00
return err ;
}
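slab_attr_store() applies a write to the root cache first and then forwards it to the child caches on a best-effort basis: only the root's result is returned, and failures in children are ignored. A minimal userspace sketch of that policy follows. `struct node_sketch`, its `reject` field, and both helpers are hypothetical, and no locking is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical cache node: one tunable plus a flag that forces failure. */
struct node_sketch {
	int value;
	int reject;			/* nonzero: stores to this node fail */
	struct node_sketch *children;
	size_t nr_children;
	size_t max_attr_size;		/* largest write accepted by the root */
};

/* Store a single leading decimal digit into the node. */
static long value_store(struct node_sketch *c, const char *buf, size_t len)
{
	if (c->reject || len == 0 || buf[0] < '0' || buf[0] > '9')
		return -EINVAL;
	c->value = buf[0] - '0';
	return (long)len;
}

/*
 * Write to the root; on success record the write size and forward the
 * same buffer to every child. Child errors are deliberately dropped, so
 * the return value describes the root only.
 */
static long store_and_propagate(struct node_sketch *root,
				const char *buf, size_t len)
{
	long err = value_store(root, buf, len);
	size_t i;

	if (err >= 0) {
		if (root->max_attr_size < len)
			root->max_attr_size = len;
		for (i = 0; i < root->nr_children; i++)
			value_store(&root->children[i], buf, len);
	}
	return err;
}
```

Returning only the root's result keeps the error code meaningful: the caller knows whether the cache it wrote to changed, even though some children may not have followed.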
2012-12-19 02:23:05 +04:00
static void memcg_propagate_slab_attrs ( struct kmem_cache * s )
{
# ifdef CONFIG_MEMCG_KMEM
int i ;
char * buffer = NULL ;
2014-05-06 23:49:59 +04:00
struct kmem_cache * root_cache ;
2012-12-19 02:23:05 +04:00
2014-05-06 23:49:59 +04:00
if ( is_root_cache ( s ) )
2012-12-19 02:23:05 +04:00
return ;
2015-02-13 01:59:20 +03:00
root_cache = s - > memcg_params . root_cache ;
2014-05-06 23:49:59 +04:00
2012-12-19 02:23:05 +04:00
/*
* This means this cache had no attribute written . Therefore , no point
* in copying default values around
*/
2014-05-06 23:49:59 +04:00
if ( ! root_cache - > max_attr_size )
2012-12-19 02:23:05 +04:00
return ;
for ( i = 0 ; i < ARRAY_SIZE ( slab_attrs ) ; i + + ) {
char mbuf [ 64 ] ;
char * buf ;
struct slab_attribute * attr = to_slab_attr ( slab_attrs [ i ] ) ;
if ( ! attr | | ! attr - > store | | ! attr - > show )
continue ;
/*
* It is really bad that we have to allocate here , so we will
* do it only as a fallback . If we actually allocate , though ,
* we can just use the allocated buffer until the end .
*
* Most of the slub attributes will tend to be very small in
* size , but sysfs allows buffers up to a page , so they can
* theoretically happen .
*/
if ( buffer )
buf = buffer ;
2014-05-06 23:49:59 +04:00
else if ( root_cache - > max_attr_size < ARRAY_SIZE ( mbuf ) )
2012-12-19 02:23:05 +04:00
buf = mbuf ;
else {
buffer = ( char * ) get_zeroed_page ( GFP_KERNEL ) ;
if ( WARN_ON ( ! buffer ) )
continue ;
buf = buffer ;
}
2014-05-06 23:49:59 +04:00
attr - > show ( root_cache , buf ) ;
2012-12-19 02:23:05 +04:00
attr - > store ( s , buf , strlen ( buf ) ) ;
}
if ( buffer )
free_page ( ( unsigned long ) buffer ) ;
# endif
}
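memcg_propagate_slab_attrs() copies a tunable by calling show() on the root cache and feeding the text to store() on the child. It uses a 64-byte stack buffer unless the largest write ever seen could exceed it, and skips the copy entirely when nothing was ever written. A minimal userspace sketch for a single attribute follows. All names are hypothetical, and `calloc` stands in for get_zeroed_page().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_BUF_SKETCH 64
#define BIG_BUF_SKETCH 4096	/* assumption: stands in for one page */

/* Hypothetical cache with one tunable and the recorded maximum write size. */
struct tunable_sketch {
	unsigned long min_partial;
	size_t max_attr_size;
};

static int mp_show(const struct tunable_sketch *c, char *buf)
{
	return sprintf(buf, "%lu\n", c->min_partial);
}

static long mp_store(struct tunable_sketch *c, const char *buf, size_t len)
{
	c->min_partial = strtoul(buf, NULL, 10);
	return (long)len;
}

/*
 * Copy the root's tunable into the child through its text form.
 * Returns 0 if nothing was propagated, 1 if the stack buffer was used,
 * 2 if a heap buffer had to be allocated.
 */
static int propagate_sketch(struct tunable_sketch *child,
			    const struct tunable_sketch *root)
{
	char mbuf[SMALL_BUF_SKETCH];
	char *buffer = NULL;
	char *buf;
	int used;

	if (!root->max_attr_size)	/* never written: keep defaults */
		return 0;

	if (root->max_attr_size < sizeof(mbuf)) {
		buf = mbuf;
		used = 1;
	} else {
		buffer = calloc(1, BIG_BUF_SKETCH);
		if (!buffer)
			return 0;
		buf = buffer;
		used = 2;
	}

	mp_show(root, buf);
	mp_store(child, buf, strlen(buf));

	free(buffer);			/* free(NULL) is a no-op */
	return used;
}
```

Going through the text form means the copy needs no per-attribute knowledge of types, at the cost of a format and parse step for each tunable.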
2014-05-06 23:50:08 +04:00
static void kmem_cache_release ( struct kobject * k )
{
slab_kmem_cache_release ( to_slab ( k ) ) ;
}
2010-01-19 04:58:23 +03:00
static const struct sysfs_ops slab_sysfs_ops = {
2007-05-07 01:49:36 +04:00
. show = slab_attr_show ,
. store = slab_attr_store ,
} ;
static struct kobj_type slab_ktype = {
. sysfs_ops = & slab_sysfs_ops ,
2014-05-06 23:50:08 +04:00
. release = kmem_cache_release ,
2007-05-07 01:49:36 +04:00
} ;
static int uevent_filter ( struct kset * kset , struct kobject * kobj )
{
struct kobj_type * ktype = get_ktype ( kobj ) ;
if ( ktype = = & slab_ktype )
return 1 ;
return 0 ;
}
2009-12-31 16:52:51 +03:00
static const struct kset_uevent_ops slab_uevent_ops = {
2007-05-07 01:49:36 +04:00
. filter = uevent_filter ,
} ;
2007-11-01 18:29:06 +03:00
static struct kset * slab_kset ;
2007-05-07 01:49:36 +04:00
2014-04-08 02:39:31 +04:00
static inline struct kset * cache_kset ( struct kmem_cache * s )
{
# ifdef CONFIG_MEMCG_KMEM
if ( ! is_root_cache ( s ) )
2015-02-13 01:59:20 +03:00
return s - > memcg_params . root_cache - > memcg_kset ;
2014-04-08 02:39:31 +04:00
# endif
return slab_kset ;
}
2007-05-07 01:49:36 +04:00
# define ID_STR_LENGTH 64
/* Create a unique string id for a slab cache:
2008-02-16 10:45:26 +03:00
*
* Format : [ flags - ] size
2007-05-07 01:49:36 +04:00
*/
static char * create_unique_id ( struct kmem_cache * s )
{
char * name = kmalloc ( ID_STR_LENGTH , GFP_KERNEL ) ;
char * p = name ;
BUG_ON ( ! name ) ;
* p + + = ' : ' ;
/*
* First flags affecting slabcache operations . We will only
* get here for aliasable slabs so we do not need to support
* too many flags . The flags here must cover all flags that
* are matched during merging to guarantee that the id is
* unique .
*/
if ( s - > flags & SLAB_CACHE_DMA )
* p + + = ' d ' ;
if ( s - > flags & SLAB_RECLAIM_ACCOUNT )
* p + + = ' a ' ;
if ( s - > flags & SLAB_DEBUG_FREE )
* p + + = ' F ' ;
2008-04-04 02:54:48 +04:00
if ( ! ( s - > flags & SLAB_NOTRACK ) )
* p + + = ' t ' ;
2016-01-15 02:18:15 +03:00
if ( s - > flags & SLAB_ACCOUNT )
* p + + = ' A ' ;
2007-05-07 01:49:36 +04:00
if ( p ! = name + 1 )
* p + + = ' - ' ;
p + = sprintf ( p , " %07d " , s - > size ) ;
2012-12-19 02:22:34 +04:00
2007-05-07 01:49:36 +04:00
BUG_ON ( p > name + ID_STR_LENGTH - 1 ) ;
return name ;
}
static int sysfs_slab_add ( struct kmem_cache * s )
{
int err ;
const char * name ;
2012-11-28 20:23:07 +04:00
int unmergeable = slab_unmergeable ( s ) ;
2007-05-07 01:49:36 +04:00
if ( unmergeable ) {
/*
* Slabcache can never be merged so we can use the name proper .
* This is typically the case for debug situations . In that
* case we can catch duplicate names easily .
*/
2007-11-01 18:29:06 +03:00
sysfs_remove_link ( & slab_kset - > kobj , s - > name ) ;
2007-05-07 01:49:36 +04:00
name = s - > name ;
} else {
/*
* Create a unique name for the slab as a target
* for the symlinks .
*/
name = create_unique_id ( s ) ;
}
2014-04-08 02:39:31 +04:00
s - > kobj . kset = cache_kset ( s ) ;
2014-01-04 11:32:31 +04:00
err = kobject_init_and_add ( & s - > kobj , & slab_ktype , NULL , " %s " , name ) ;
2014-04-08 02:39:32 +04:00
if ( err )
2015-09-05 01:45:51 +03:00
goto out ;
2007-05-07 01:49:36 +04:00
err = sysfs_create_group ( & s - > kobj , & slab_attr_group ) ;
2014-04-08 02:39:32 +04:00
if ( err )
goto out_del_kobj ;
2014-04-08 02:39:31 +04:00
# ifdef CONFIG_MEMCG_KMEM
if ( is_root_cache ( s ) ) {
s - > memcg_kset = kset_create_and_add ( " cgroup " , NULL , & s - > kobj ) ;
if ( ! s - > memcg_kset ) {
2014-04-08 02:39:32 +04:00
err = - ENOMEM ;
goto out_del_kobj ;
2014-04-08 02:39:31 +04:00
}
}
# endif
2007-05-07 01:49:36 +04:00
kobject_uevent ( & s - > kobj , KOBJ_ADD ) ;
if ( ! unmergeable ) {
/* Setup first alias */
sysfs_slab_alias ( s , s - > name ) ;
}
2014-04-08 02:39:32 +04:00
out :
if ( ! unmergeable )
kfree ( name ) ;
return err ;
out_del_kobj :
kobject_del ( & s - > kobj ) ;
goto out ;
2007-05-07 01:49:36 +04:00
}
2014-05-06 23:50:08 +04:00
void sysfs_slab_remove ( struct kmem_cache * s )
2007-05-07 01:49:36 +04:00
{
2012-07-07 00:25:11 +04:00
if ( slab_state < FULL )
2010-07-19 20:39:11 +04:00
/*
* Sysfs has not been setup yet so no need to remove the
* cache from sysfs .
*/
return ;
2014-04-08 02:39:31 +04:00
# ifdef CONFIG_MEMCG_KMEM
kset_unregister ( s - > memcg_kset ) ;
# endif
2007-05-07 01:49:36 +04:00
kobject_uevent ( & s - > kobj , KOBJ_REMOVE ) ;
kobject_del ( & s - > kobj ) ;
2008-01-08 09:29:05 +03:00
kobject_put ( & s - > kobj ) ;
2007-05-07 01:49:36 +04:00
}
/*
* Need to buffer aliases during bootup until sysfs becomes
2008-12-05 06:08:08 +03:00
* available lest we lose that information .
2007-05-07 01:49:36 +04:00
*/
struct saved_alias {
struct kmem_cache * s ;
const char * name ;
struct saved_alias * next ;
} ;
2007-07-17 15:03:27 +04:00
static struct saved_alias * alias_list ;
2007-05-07 01:49:36 +04:00
static int sysfs_slab_alias ( struct kmem_cache * s , const char * name )
{
struct saved_alias * al ;
2012-07-07 00:25:11 +04:00
if ( slab_state = = FULL ) {
2007-05-07 01:49:36 +04:00
/*
* If we have a leftover link then remove it .
*/
2007-11-01 18:29:06 +03:00
sysfs_remove_link ( & slab_kset - > kobj , name ) ;
return sysfs_create_link ( & slab_kset - > kobj , & s - > kobj , name ) ;
2007-05-07 01:49:36 +04:00
}
al = kmalloc ( sizeof ( struct saved_alias ) , GFP_KERNEL ) ;
if ( ! al )
return - ENOMEM ;
al - > s = s ;
al - > name = name ;
al - > next = alias_list ;
alias_list = al ;
return 0 ;
}
static int __init slab_sysfs_init ( void )
{
2007-07-17 15:03:19 +04:00
struct kmem_cache * s ;
2007-05-07 01:49:36 +04:00
int err ;
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2010-07-19 20:39:11 +04:00
2007-11-06 21:36:58 +03:00
slab_kset = kset_create_and_add ( " slab " , & slab_uevent_ops , kernel_kobj ) ;
2007-11-01 18:29:06 +03:00
if ( ! slab_kset ) {
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2014-06-05 03:06:34 +04:00
pr_err ( " Cannot register slab subsystem. \n " ) ;
2007-05-07 01:49:36 +04:00
return - ENOSYS ;
}
2012-07-07 00:25:11 +04:00
slab_state = FULL ;
2007-05-09 13:32:39 +04:00
2007-07-17 15:03:19 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
2007-05-09 13:32:39 +04:00
err = sysfs_slab_add ( s ) ;
2007-08-31 10:56:26 +04:00
if ( err )
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB: Unable to add boot slab %s to sysfs \n " ,
s - > name ) ;
2007-05-09 13:32:39 +04:00
}
2007-05-07 01:49:36 +04:00
while ( alias_list ) {
struct saved_alias * al = alias_list ;
alias_list = alias_list - > next ;
err = sysfs_slab_alias ( al - > s , al - > name ) ;
2007-08-31 10:56:26 +04:00
if ( err )
2014-06-05 03:06:34 +04:00
pr_err ( " SLUB: Unable to add boot slab alias %s to sysfs \n " ,
al - > name ) ;
2007-05-07 01:49:36 +04:00
kfree ( al ) ;
}
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-05-07 01:49:36 +04:00
resiliency_test ( ) ;
return 0 ;
}
__initcall ( slab_sysfs_init ) ;
2010-10-05 22:57:26 +04:00
# endif /* CONFIG_SYSFS */
2008-01-01 19:23:28 +03:00
/*
* The / proc / slabinfo ABI
*/
2008-01-03 00:04:48 +03:00
# ifdef CONFIG_SLABINFO
2012-10-19 18:20:27 +04:00
void get_slabinfo ( struct kmem_cache * s , struct slabinfo * sinfo )
2008-01-01 19:23:28 +03:00
{
unsigned long nr_slabs = 0 ;
2008-04-14 20:11:40 +04:00
unsigned long nr_objs = 0 ;
unsigned long nr_free = 0 ;
2008-01-01 19:23:28 +03:00
int node ;
2014-08-07 03:04:09 +04:00
struct kmem_cache_node * n ;
2008-01-01 19:23:28 +03:00
2014-08-07 03:04:09 +04:00
for_each_kmem_cache_node ( s , node , n ) {
2013-07-04 04:33:26 +04:00
nr_slabs + = node_nr_slabs ( n ) ;
nr_objs + = node_nr_objs ( n ) ;
2008-04-14 20:11:40 +04:00
nr_free + = count_partial ( n , count_free ) ;
2008-01-01 19:23:28 +03:00
}
2012-10-19 18:20:27 +04:00
sinfo - > active_objs = nr_objs - nr_free ;
sinfo - > num_objs = nr_objs ;
sinfo - > active_slabs = nr_slabs ;
sinfo - > num_slabs = nr_slabs ;
sinfo - > objects_per_slab = oo_objects ( s - > oo ) ;
sinfo - > cache_order = oo_order ( s - > oo ) ;
2008-01-01 19:23:28 +03:00
}
2012-10-19 18:20:27 +04:00
void slabinfo_show_stats ( struct seq_file * m , struct kmem_cache * s )
2008-10-06 02:42:17 +04:00
{
}
2012-10-19 18:20:25 +04:00
ssize_t slabinfo_write ( struct file * file , const char __user * buffer ,
size_t count , loff_t * ppos )
2008-10-06 02:42:17 +04:00
{
2012-10-19 18:20:25 +04:00
return - EIO ;
2008-10-06 02:42:17 +04:00
}
2008-01-03 00:04:48 +03:00
# endif /* CONFIG_SLABINFO */