2007-05-07 01:49:36 +04:00
/*
* SLUB : A slab allocator that limits cache line use instead of queuing
* objects in per cpu and per node lists .
*
2011-06-01 21:25:53 +04:00
* The allocator synchronizes using per slab locks or atomic operations
* and only uses a centralized lock to manage a pool of partial slabs .
2007-05-07 01:49:36 +04:00
*
2008-07-04 20:59:22 +04:00
* ( C ) 2007 SGI , Christoph Lameter
2011-06-01 21:25:53 +04:00
* ( C ) 2011 Linux Foundation , Christoph Lameter
2007-05-07 01:49:36 +04:00
*/
# include <linux/mm.h>
2009-05-05 13:13:44 +04:00
# include <linux/swap.h> /* struct reclaim_state */
2007-05-07 01:49:36 +04:00
# include <linux/module.h>
# include <linux/bit_spinlock.h>
# include <linux/interrupt.h>
# include <linux/bitops.h>
# include <linux/slab.h>
2012-07-07 00:25:11 +04:00
# include "slab.h"
2008-10-06 02:42:17 +04:00
# include <linux/proc_fs.h>
2013-04-30 02:08:06 +04:00
# include <linux/notifier.h>
2007-05-07 01:49:36 +04:00
# include <linux/seq_file.h>
2008-04-04 02:54:48 +04:00
# include <linux/kmemcheck.h>
2007-05-07 01:49:36 +04:00
# include <linux/cpu.h>
# include <linux/cpuset.h>
# include <linux/mempolicy.h>
# include <linux/ctype.h>
2008-04-30 11:55:01 +04:00
# include <linux/debugobjects.h>
2007-05-07 01:49:36 +04:00
# include <linux/kallsyms.h>
2007-10-22 03:41:37 +04:00
# include <linux/memory.h>
2008-05-01 15:34:31 +04:00
# include <linux/math64.h>
2008-12-23 13:37:01 +03:00
# include <linux/fault-inject.h>
2011-07-07 23:47:01 +04:00
# include <linux/stacktrace.h>
2012-01-31 01:53:51 +04:00
# include <linux/prefetch.h>
2012-12-19 02:22:34 +04:00
# include <linux/memcontrol.h>
2007-05-07 01:49:36 +04:00
2010-10-21 13:29:19 +04:00
# include <trace/events/kmem.h>
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allow network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propagate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel ran to completion. However, the patched
kernel completed the test 45% faster.
MICRO
                                         3.5.0-rc2    3.5.0-rc2
                                           vanilla      swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds)             197.80       173.07
User+Sys Time Running Test (seconds)        206.96       182.03
Total Elapsed Time (seconds)               3240.70      1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
allocated from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to be able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:43:58 +04:00
# include "internal.h"
2007-05-07 01:49:36 +04:00
/*
* Lock order :
2012-07-07 00:25:12 +04:00
* 1. slab_mutex ( Global Mutex )
2011-06-01 21:25:53 +04:00
* 2. node->list_lock
* 3. slab_lock ( page ) ( Only on some arches and for debugging )
2007-05-07 01:49:36 +04:00
*
2012-07-07 00:25:12 +04:00
* slab_mutex
2011-06-01 21:25:53 +04:00
*
2012-07-07 00:25:12 +04:00
* The role of the slab_mutex is to protect the list of all the slabs
2011-06-01 21:25:53 +04:00
* and to synchronize major metadata changes to slab cache structures .
*
* The slab_lock is only used for debugging and on arches that do not
* have the ability to do a cmpxchg_double . It only protects the second
* double word in the page struct . Meaning
* A. page->freelist -> List of object free in a page
* B. page->counters -> Counters of objects
* C. page->frozen -> frozen state
*
* If a slab is frozen then it is exempt from list management . It is not
* on any list . The processor that froze the slab is the one who can
* perform list operations on the page . Other processors may put objects
* onto the freelist but the processor that froze the slab is the only
* one that can retrieve the objects from the page's freelist .
2007-05-07 01:49:36 +04:00
*
* The list_lock protects the partial and full list on each node and
* the partial slab counter . If taken then no new slabs may be added or
* removed from the lists nor make the number of partial slabs be modified .
* ( Note that the total number of slabs is an atomic value that may be
* modified without taking the list lock ) .
*
* The list_lock is a centralized lock and thus we avoid taking it as
* much as possible . As long as SLUB does not have to handle partial
* slabs , operations can continue without any centralized lock . F . e .
* allocating a long series of objects that fill up slabs does not require
* the list lock .
* Interrupts are disabled during allocation and deallocation in order to
* make the slab allocator safe to use in the context of an irq . In addition
* interrupts are disabled to ensure that the processor does not change
* while handling per_cpu slabs , due to kernel preemption .
*
* SLUB assigns one slab for allocation to each processor .
* Allocations only occur from these slabs called cpu slabs .
*
2007-05-09 13:32:39 +04:00
* Slabs with free elements are kept on a partial list and during regular
* operations no list for full slabs is used . If an object in a full slab is
2007-05-07 01:49:36 +04:00
* freed then the slab will show up again on the partial lists .
2007-05-09 13:32:39 +04:00
* We track full slabs for debugging purposes though because otherwise we
* cannot scan all objects .
2007-05-07 01:49:36 +04:00
*
* Slabs are freed when they become empty . Teardown and setup is
* minimal so we rely on the page allocators per cpu caches for
* fast frees and allocs .
*
* Overloading of page flags that are otherwise used for LRU management .
*
2007-05-17 09:10:53 +04:00
* PageActive The slab is frozen and exempt from list processing .
* This means that the slab is dedicated to a purpose
* such as satisfying allocations for a specific
* processor . Objects may be freed in the slab while
* it is frozen but slab_free will then skip the usual
* list operations . It is up to the processor holding
* the slab to integrate the slab into the slab lists
* when the slab is no longer needed .
*
* One use of this flag is to mark slabs that are
* used for allocations . Then such a slab becomes a cpu
* slab . The cpu slab may be equipped with an additional
2007-10-16 12:26:05 +04:00
* freelist that allows lockless access to
2007-05-10 14:15:16 +04:00
* free objects in addition to the regular freelist
* that requires the slab lock .
2007-05-07 01:49:36 +04:00
*
* PageError Slab requires special handling due to debug
* options set . This moves slab handling out of
2007-05-10 14:15:16 +04:00
* the fast path and disables lockless freelists .
2007-05-07 01:49:36 +04:00
*/
2010-07-09 23:07:14 +04:00
static inline int kmem_cache_debug ( struct kmem_cache * s )
{
2007-05-17 09:10:56 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-07-09 23:07:14 +04:00
return unlikely ( s->flags & SLAB_DEBUG_FLAGS ) ;
2007-05-17 09:10:56 +04:00
# else
2010-07-09 23:07:14 +04:00
return 0 ;
2007-05-17 09:10:56 +04:00
# endif
2010-07-09 23:07:14 +04:00
}
2007-05-17 09:10:56 +04:00
2013-06-19 09:05:52 +04:00
static inline bool kmem_cache_has_cpu_partial ( struct kmem_cache * s )
{
# ifdef CONFIG_SLUB_CPU_PARTIAL
return ! kmem_cache_debug ( s ) ;
# else
return false ;
# endif
}
2007-05-07 01:49:36 +04:00
/*
* Issues still to be resolved :
*
* - Support PAGE_ALLOC_DEBUG . Should be easy to do .
*
* - Variable sizing of the per node arrays
*/
/* Enable to test recovery from slab corruption on boot */
# undef SLUB_RESILIENCY_TEST
2011-06-01 21:25:49 +04:00
/* Enable to log cmpxchg failures */
# undef SLUB_DEBUG_CMPXCHG
2007-05-07 01:49:46 +04:00
/*
* Minimum number of partial slabs . These will be left on the partial
* lists even if they are empty . kmem_cache_shrink may reclaim them .
*/
2007-12-22 01:37:37 +03:00
# define MIN_PARTIAL 5
2007-05-07 01:49:44 +04:00
2007-05-07 01:49:46 +04:00
/*
* Maximum number of desirable partial slabs .
* The existence of more partial slabs makes kmem_cache_shrink
2013-11-08 16:47:37 +04:00
* sort the partial list by the number of objects in use .
2007-05-07 01:49:46 +04:00
*/
# define MAX_PARTIAL 10
2007-05-07 01:49:36 +04:00
# define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER )
2007-05-09 13:32:39 +04:00
2009-07-07 11:14:14 +04:00
/*
2009-07-28 05:30:35 +04:00
* Debugging flags that require metadata to be stored in the slab . These get
* disabled when slub_debug=O is used and a cache's min order increases with
* metadata .
2009-07-07 11:14:14 +04:00
*/
2009-07-28 05:30:35 +04:00
# define DEBUG_METADATA_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
2009-07-07 11:14:14 +04:00
2007-05-07 01:49:36 +04:00
/*
* Set of flags that will prevent slab merging
*/
# define SLUB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
2010-02-26 09:36:12 +03:00
SLAB_TRACE | SLAB_DESTROY_BY_RCU | SLAB_NOLEAKTRACE | \
SLAB_FAILSLAB )
2007-05-07 01:49:36 +04:00
# define SLUB_MERGE_SAME (SLAB_DEBUG_FREE | SLAB_RECLAIM_ACCOUNT | \
2008-04-04 02:54:48 +04:00
SLAB_CACHE_DMA | SLAB_NOTRACK )
2007-05-07 01:49:36 +04:00
2008-10-22 23:00:38 +04:00
# define OO_SHIFT 16
# define OO_MASK ((1 << OO_SHIFT) - 1)
2011-06-01 21:25:45 +04:00
# define MAX_OBJS_PER_PAGE 32767 /* since page.objects is u15 */
2008-10-22 23:00:38 +04:00
2007-05-07 01:49:36 +04:00
/* Internal SLUB flags */
2010-07-09 23:07:11 +04:00
# define __OBJECT_POISON 0x80000000UL /* Poison object */
2011-06-01 21:25:49 +04:00
# define __CMPXCHG_DOUBLE 0x40000000UL /* Use cmpxchg_double */
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_SMP
static struct notifier_block slab_notifier ;
# endif
2007-05-09 13:32:43 +04:00
/*
* Tracking user of a slab .
*/
2011-07-07 22:36:36 +04:00
# define TRACK_ADDRS_COUNT 16
2007-05-09 13:32:43 +04:00
struct track {
2008-08-19 21:43:25 +04:00
unsigned long addr ; /* Called from address */
2011-07-07 22:36:36 +04:00
# ifdef CONFIG_STACKTRACE
unsigned long addrs [ TRACK_ADDRS_COUNT ] ; /* Called from address */
# endif
2007-05-09 13:32:43 +04:00
int cpu ; /* Was running on cpu */
int pid ; /* Pid context */
unsigned long when ; /* When did the operation occur */
} ;
enum track_item { TRACK_ALLOC , TRACK_FREE } ;
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SYSFS
2007-05-07 01:49:36 +04:00
static int sysfs_slab_add ( struct kmem_cache * ) ;
static int sysfs_slab_alias ( struct kmem_cache * , const char * ) ;
static void sysfs_slab_remove ( struct kmem_cache * ) ;
slub: slub-specific propagation changes
SLUB allows us to tune a particular cache behavior with sysfs-based
tunables. When creating a new memcg cache copy, we'd like to preserve any
tunables the parent cache already had.
This can be done by tapping into the store attribute function provided by
the allocator. We of course don't need to mess with read-only fields.
Since the attributes can have multiple types and are stored internally by
sysfs, the best strategy is to issue a ->show() in the root cache, and
then ->store() in the memcg cache.
The drawback of that, is that sysfs can allocate up to a page in buffering
for show(), that we are likely not to need, but also can't guarantee. To
avoid always allocating a page for that, we can update the caches at store
time with the maximum attribute size ever stored to the root cache. We
will then get a buffer big enough to hold it. The corollary to this, is
that if no stores happened, nothing will be propagated.
It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.
[akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 02:23:05 +04:00
static void memcg_propagate_slab_attrs ( struct kmem_cache * s ) ;
2007-05-07 01:49:36 +04:00
# else
2007-07-17 15:03:24 +04:00
static inline int sysfs_slab_add ( struct kmem_cache * s ) { return 0 ; }
static inline int sysfs_slab_alias ( struct kmem_cache * s , const char * p )
{ return 0 ; }
2012-09-05 03:18:33 +04:00
static inline void sysfs_slab_remove ( struct kmem_cache * s ) { }
2008-02-08 04:47:41 +03:00
2012-12-19 02:23:05 +04:00
static inline void memcg_propagate_slab_attrs ( struct kmem_cache * s ) { }
2007-05-07 01:49:36 +04:00
# endif
2011-03-22 21:35:00 +03:00
static inline void stat ( const struct kmem_cache * s , enum stat_item si )
2008-02-08 04:47:41 +03:00
{
# ifdef CONFIG_SLUB_STATS
2014-04-08 02:39:42 +04:00
/*
* The rmw is racy on a preemptible kernel but this is acceptable , so
* avoid this_cpu_add()'s irq-disable overhead .
*/
raw_cpu_inc ( s->cpu_slab->stat [ si ] ) ;
2008-02-08 04:47:41 +03:00
# endif
}
2007-05-07 01:49:36 +04:00
/********************************************************************
* Core slab cache functions
*******************************************************************/
static inline struct kmem_cache_node * get_node ( struct kmem_cache * s , int node )
{
return s->node [ node ] ;
}
2008-02-16 10:45:26 +03:00
/* Verify that a pointer has an address that is valid within a slab page */
2007-05-09 13:32:43 +04:00
static inline int check_valid_pointer ( struct kmem_cache * s ,
struct page * page , const void * object )
{
void * base ;
2008-03-02 00:40:44 +03:00
if ( ! object )
2007-05-09 13:32:43 +04:00
return 1 ;
2008-03-02 00:40:44 +03:00
base = page_address ( page ) ;
2008-04-14 20:11:30 +04:00
if ( object < base || object >= base + page->objects * s->size ||
2007-05-09 13:32:43 +04:00
( object - base ) % s->size ) {
return 0 ;
}
return 1 ;
}
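The check above can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel code: struct demo_slab and demo_valid_pointer() are hypothetical names that stand in for the few fields of struct kmem_cache and struct page that the check reads. It keeps the same three tests: NULL is accepted, the pointer must lie inside the slab, and it must sit on an object boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields check_valid_pointer() reads. */
struct demo_slab {
	const char *base;	/* start of the slab memory (page_address) */
	size_t size;		/* stride between objects (s->size) */
	size_t objects;		/* objects in this slab (page->objects) */
};

/* Returns 1 if object is NULL or the start of an object slot, else 0. */
static int demo_valid_pointer(const struct demo_slab *slab, const void *object)
{
	const char *p = object;

	assert(slab->size != 0);
	if (!p)
		return 1;	/* NULL ends a freelist, so it is accepted */
	if (p < slab->base || p >= slab->base + slab->objects * slab->size)
		return 0;	/* outside the slab */
	if ((size_t)(p - slab->base) % slab->size)
		return 0;	/* inside the slab, but not on a slot boundary */
	return 1;
}
```

A pointer into the middle of an object is rejected by the modulo test even though it lies inside the slab, which is what lets the debug code detect a corrupted freelist link.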
2007-05-09 13:32:40 +04:00
static inline void * get_freepointer ( struct kmem_cache * s , void * object )
{
return * ( void ** ) ( object + s->offset ) ;
}
slub: prefetch next freelist pointer in slab_alloc()
Recycling a page is a problem, since freelist link chain is hot on
cpu(s) which freed objects, and possibly very cold on cpu currently
owning slab.
Adding a prefetch of cache line containing the pointer to next object in
slab_alloc() helps a lot in many workloads, in particular on asymmetric
ones (allocations done on one cpu, frees on other cpus). Added cost is
three machine instructions only.
Examples on my dual socket quad core ht machine (Intel CPU E5540
@2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
Before patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% )
28 866 491 context-switches # 0,088 M/sec ( +- 1,80% )
1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% )
127 151 page-faults # 0,000 M/sec ( +- 0,16% )
829 399 813 448 cycles # 2,532 GHz ( +- 0,64% )
580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% )
197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% )
503 548 648 975 instructions # 0,61 insns per cycle
# 1,15 stalled cycles per insn ( +- 0,46% )
95 780 068 471 branches # 292,389 M/sec ( +- 0,48% )
1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% )
20,705679994 seconds time elapsed ( +- 0,64% )
After patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% )
19 703 372 context-switches # 0,069 M/sec ( +- 4,99% )
1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% )
126 776 page-faults # 0,000 M/sec ( +- 0,12% )
724 636 593 213 cycles # 2,532 GHz ( +- 1,32% )
499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% )
156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% )
463 897 792 661 instructions # 0,64 insns per cycle
# 1,08 stalled cycles per insn ( +- 0,94% )
87 717 352 563 branches # 306,451 M/sec ( +- 0,99% )
941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% )
18,132070670 seconds time elapsed ( +- 1,30% )
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-12-16 19:25:34 +04:00
static void prefetch_freepointer ( const struct kmem_cache * s , void * object )
{
prefetch ( object + s->offset ) ;
}
2011-05-17 00:26:08 +04:00
static inline void * get_freepointer_safe ( struct kmem_cache * s , void * object )
{
void * p ;
# ifdef CONFIG_DEBUG_PAGEALLOC
probe_kernel_read ( &p , ( void ** ) ( object + s->offset ) , sizeof ( p ) ) ;
# else
p = get_freepointer ( s , object ) ;
# endif
return p ;
}
2007-05-09 13:32:40 +04:00
static inline void set_freepointer ( struct kmem_cache * s , void * object , void * fp )
{
* ( void ** ) ( object + s->offset ) = fp ;
}
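get_freepointer() and set_freepointer() store the link to the next free object inside the free object itself, at s->offset. A minimal user-space sketch of that idea follows. All demo_* names are hypothetical, and memcpy() is used in place of the kernel's direct pointer dereference so the sketch does not depend on the alignment of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two kmem_cache fields used here. */
struct demo_cache {
	size_t size;	/* stride between objects */
	size_t offset;	/* where the free pointer lives inside a free object */
};

static void *demo_get_freepointer(const struct demo_cache *c, void *object)
{
	void *fp;

	memcpy(&fp, (char *)object + c->offset, sizeof(fp));
	return fp;
}

static void demo_set_freepointer(const struct demo_cache *c, void *object,
				 void *fp)
{
	memcpy((char *)object + c->offset, &fp, sizeof(fp));
}

/* Thread every slot of addr onto one freelist and return its head. */
static void *demo_init_freelist(const struct demo_cache *c, void *addr,
				size_t objects)
{
	char *p = addr;
	size_t i;

	assert(c->offset + sizeof(void *) <= c->size);
	for (i = 0; i < objects; i++)
		demo_set_freepointer(c, p + i * c->size,
				     i + 1 < objects ? p + (i + 1) * c->size : NULL);
	return objects ? p : NULL;
}
```

Because the link lives in the object, a free object needs no separate bookkeeping memory. Allocation then amounts to taking the head and replacing it with demo_get_freepointer(head).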
/* Loop over all objects in a slab */
2008-04-14 20:11:31 +04:00
# define for_each_object(__p, __s, __addr, __objects) \
for ( __p = ( __addr ) ; __p < ( __addr ) + ( __objects ) * ( __s )->size ; \
2007-05-09 13:32:40 +04:00
__p += ( __s )->size )
/* Determine object index from a given position */
static inline int slab_index ( void * p , struct kmem_cache * s , void * addr )
{
return ( p - addr ) / s->size ;
}
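The iteration macro and the index helper are both plain stride arithmetic over the slab memory. As a rough user-space illustration (demo_slab_index() and demo_count_marked() are hypothetical names, and "marked" simply means the first byte of the slot is non-zero):

```c
#include <assert.h>
#include <stddef.h>

/* Index of the slot containing p, for a slab starting at addr. */
static size_t demo_slab_index(const void *p, size_t size, const void *addr)
{
	assert(size != 0);
	return (size_t)((const char *)p - (const char *)addr) / size;
}

/* Walk all slots the way for_each_object() does and count marked ones. */
static size_t demo_count_marked(const void *addr, size_t size, size_t objects)
{
	const char *start = addr;
	const char *p;
	size_t marked = 0;

	for (p = start; p < start + objects * size; p += size)
		if (*p)
			marked++;
	return marked;
}
```

Note that the division truncates, so a pointer into the interior of an object maps to the index of the object that contains it.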
2011-02-26 22:10:26 +03:00
static inline size_t slab_ksize ( const struct kmem_cache * s )
{
# ifdef CONFIG_SLUB_DEBUG
/*
* Debugging requires use of the padding between object
* and whatever may come after it .
*/
if ( s->flags & ( SLAB_RED_ZONE | SLAB_POISON ) )
2012-06-13 19:24:57 +04:00
return s->object_size ;
2011-02-26 22:10:26 +03:00
# endif
/*
* If we have the need to store the freelist pointer
* back there or track user information then we can
* only use the space before that information .
*/
if ( s->flags & ( SLAB_DESTROY_BY_RCU | SLAB_STORE_USER ) )
return s->inuse ;
/*
* Else we can use all the padding etc for the allocation
*/
return s->size ;
}
2011-03-10 10:21:48 +03:00
static inline int order_objects ( int order , unsigned long size , int reserved )
{
return ( ( PAGE_SIZE << order ) - reserved ) / size ;
}
2008-04-14 20:11:31 +04:00
static inline struct kmem_cache_order_objects oo_make ( int order ,
2011-03-10 10:21:48 +03:00
unsigned long size , int reserved )
2008-04-14 20:11:31 +04:00
{
struct kmem_cache_order_objects x = {
2011-03-10 10:21:48 +03:00
( order << OO_SHIFT ) + order_objects ( order , size , reserved )
2008-04-14 20:11:31 +04:00
} ;
return x ;
}
static inline int oo_order ( struct kmem_cache_order_objects x )
{
2008-10-22 23:00:38 +04:00
return x.x >> OO_SHIFT ;
2008-04-14 20:11:31 +04:00
}
static inline int oo_objects ( struct kmem_cache_order_objects x )
{
2008-10-22 23:00:38 +04:00
return x.x & OO_MASK ;
2008-04-14 20:11:31 +04:00
}
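oo_make(), oo_order() and oo_objects() pack the page order and the object count into one word: the order goes above OO_SHIFT and the count below it. A minimal user-space sketch of the arithmetic follows. The demo_* names are hypothetical, and DEMO_PAGE_SIZE assumes 4 KiB pages, which the kernel does not guarantee.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE	4096UL	/* assumption: 4 KiB pages */
#define DEMO_OO_SHIFT	16
#define DEMO_OO_MASK	((1UL << DEMO_OO_SHIFT) - 1)

/* Objects of the given size that fit in 2^order pages minus reserved bytes. */
static unsigned long demo_order_objects(unsigned int order, unsigned long size,
					unsigned long reserved)
{
	assert(size != 0);
	return ((DEMO_PAGE_SIZE << order) - reserved) / size;
}

/* Pack order (high bits) and object count (low 16 bits) into one word. */
static unsigned long demo_oo_make(unsigned int order, unsigned long size,
				  unsigned long reserved)
{
	return ((unsigned long)order << DEMO_OO_SHIFT) +
	       demo_order_objects(order, size, reserved);
}

static unsigned int demo_oo_order(unsigned long x)
{
	return (unsigned int)(x >> DEMO_OO_SHIFT);
}

static unsigned int demo_oo_objects(unsigned long x)
{
	return (unsigned int)(x & DEMO_OO_MASK);
}
```

For example, order 1 with 256-byte objects and nothing reserved gives 8192 / 256 = 32 objects, packed as (1 << 16) + 32. The packing only round-trips while the count fits in the low 16 bits, which is consistent with MAX_OBJS_PER_PAGE being 32767.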
2011-06-01 21:25:53 +04:00
/*
* Per slab locking using the pagelock
*/
static __always_inline void slab_lock ( struct page * page )
{
bit_spin_lock ( PG_locked , &page->flags ) ;
}
static __always_inline void slab_unlock ( struct page * page )
{
__bit_spin_unlock ( PG_locked , &page->flags ) ;
}
mm/slub.c: fix page->_count corruption (again)
Commit abca7c496584 ("mm: fix slab->page _count corruption when using
slub") notes that we can not _set_ a page->counters directly, except
when using a real double-cmpxchg. Doing so can lose updates to
->_count.
That is an absolute rule:
You may not *set* page->counters except via a cmpxchg.
Commit abca7c496584 fixed this for the folks who have the slub
cmpxchg_double code turned off at compile time, but it left the bad case
alone. It can still be reached, and the same bug triggered in two
cases:
1. Turning on slub debugging at runtime, which is available on
the distro kernels that I looked at.
2. On 64-bit CPUs with no CMPXCHG16B (some early AMD x86-64
cpus, evidently)
There are at least 3 ways we could fix this:
1. Take all of the existing calls to cmpxchg_double_slab() and
__cmpxchg_double_slab() and convert them to take an old, new
and target 'struct page'.
2. Do (1), but with the newly-introduced 'slub_data'.
3. Do some magic inside the two cmpxchg...slab() functions to
pull the counters out of new_counters and only set those
fields in page->{inuse,frozen,objects}.
I've done (2) as well, but it's a bunch more code. This patch is an
attempt at (3). This was the most straightforward and foolproof way
that I could think to do this.
This would also technically allow us to get rid of the ugly
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
in 'struct page', but leaving it alone has the added benefit that
'counters' stays 'unsigned' instead of 'unsigned long', so all the
copies that the slub code does stay a bit smaller.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-31 03:46:09 +04:00
static inline void set_page_slub_counters ( struct page * page , unsigned long counters_new )
{
struct page tmp ;
tmp . counters = counters_new ;
/*
* page->counters can cover frozen/inuse/objects as well
* as page->_count . If we assign to ->counters directly
* we run the risk of losing updates to page->_count , so
* be careful and only assign to the fields we need .
*/
page->frozen = tmp.frozen ;
page->inuse = tmp.inuse ;
page->objects = tmp.objects ;
}
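The reason for copying field by field can be shown with a small user-space model. This is a sketch under stated assumptions, not the real struct page: struct demo_page is hypothetical, its refcount member stands in for page->_count, and the exact bit layout of the bit-fields is implementation-defined. The point is only that the wide counters view overlaps the reference count, so a whole-word store would write back a stale count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: counters overlays the slab fields and a refcount. */
struct demo_page {
	union {
		uint64_t counters;	/* wide view covering everything below */
		struct {
			unsigned inuse:16;
			unsigned objects:15;
			unsigned frozen:1;
			int32_t refcount;	/* stands in for page->_count */
		};
	};
};

/* Copy only the slab fields out of counters_new; leave refcount alone. */
static void demo_set_counters(struct demo_page *page, uint64_t counters_new)
{
	struct demo_page tmp;

	assert(page != 0);
	tmp.counters = counters_new;
	page->frozen = tmp.frozen;
	page->inuse = tmp.inuse;
	page->objects = tmp.objects;
}
```

If a snapshot of counters is taken, the reference count is then changed by someone else, and the snapshot is written back with demo_set_counters(), the newer reference count survives. A plain `page->counters = counters_new` would have overwritten it.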
2011-07-14 21:49:12 +04:00
/* Interrupts must be disabled (for the fallback code to work right) */
static inline bool __cmpxchg_double_slab ( struct kmem_cache * s , struct page * page ,
void * freelist_old , unsigned long counters_old ,
void * freelist_new , unsigned long counters_new ,
const char * n )
{
VM_BUG_ON ( ! irqs_disabled ( ) ) ;
2012-01-13 05:17:33 +04:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined ( CONFIG_HAVE_ALIGNED_STRUCT_PAGE )
2011-07-14 21:49:12 +04:00
if ( s->flags & __CMPXCHG_DOUBLE ) {
2012-01-02 21:02:18 +04:00
if ( cmpxchg_double ( &page->freelist , &page->counters ,
2011-07-14 21:49:12 +04:00
freelist_old , counters_old ,
freelist_new , counters_new ) )
return 1 ;
} else
# endif
{
slab_lock ( page ) ;
2013-07-15 05:05:29 +04:00
if ( page->freelist == freelist_old &&
page->counters == counters_old ) {
2011-07-14 21:49:12 +04:00
page->freelist = freelist_new ;
2014-01-31 03:46:09 +04:00
set_page_slub_counters ( page , counters_new ) ;
2011-07-14 21:49:12 +04:00
slab_unlock ( page ) ;
return 1 ;
}
slab_unlock ( page ) ;
}
cpu_relax ( ) ;
stat ( s , CMPXCHG_DOUBLE_FAIL ) ;
# ifdef SLUB_DEBUG_CMPXCHG
printk ( KERN_INFO " %s %s: cmpxchg double redo " , n , s - > name ) ;
# endif
return 0 ;
}
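/*
 * Illustration only, not part of this file: a user-space sketch of the rule
 * from the commit message quoted inside __cmpxchg_double_slab(), namely that
 * the whole counters word must not be stored directly because it overlays a
 * reference count. The struct layout, the field widths and every cnt_* name
 * are assumptions made for the example; only the idea of copying the
 * individual inuse/objects/frozen fields comes from the text.
 */

```c
#include <assert.h>

/* Hypothetical stand-in for the part of struct page that the allocator uses. */
struct cnt_page {
	union {
		unsigned long counters;		/* whole word, as seen by a cmpxchg */
		struct {
			unsigned inuse:16;
			unsigned objects:15;
			unsigned frozen:1;
			int refcount;		/* stands in for page->_count */
		};
	};
};

/* Copy only the allocator fields out of counters_new; refcount is never written. */
static void cnt_set_slub_counters(struct cnt_page *page,
				  unsigned long counters_new)
{
	struct cnt_page tmp;

	tmp.counters = counters_new;
	page->inuse = tmp.inuse;
	page->objects = tmp.objects;
	page->frozen = tmp.frozen;
}

/* Build a counters word through a scratch copy, with a stale refcount in it. */
static unsigned long cnt_pack(unsigned inuse, unsigned objects, unsigned frozen)
{
	struct cnt_page tmp;

	tmp.counters = 0;
	tmp.inuse = inuse;
	tmp.objects = objects;
	tmp.frozen = frozen;
	tmp.refcount = -1;	/* must not leak into the target page */
	return tmp.counters;
}
```

/*
 * With this shape, a concurrent change to refcount between building the new
 * counters word and storing it is not overwritten, because the store never
 * touches that field.
 */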
2011-06-01 21:25:49 +04:00
static inline bool cmpxchg_double_slab ( struct kmem_cache * s , struct page * page ,
void * freelist_old , unsigned long counters_old ,
void * freelist_new , unsigned long counters_new ,
const char * n )
{
2012-01-13 05:17:33 +04:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined ( CONFIG_HAVE_ALIGNED_STRUCT_PAGE )
2011-06-01 21:25:49 +04:00
if ( s - > flags & __CMPXCHG_DOUBLE ) {
2012-01-02 21:02:18 +04:00
if ( cmpxchg_double ( & page - > freelist , & page - > counters ,
2011-06-01 21:25:49 +04:00
freelist_old , counters_old ,
freelist_new , counters_new ) )
return 1 ;
} else
# endif
{
2011-07-14 21:49:12 +04:00
unsigned long flags ;
local_irq_save ( flags ) ;
2011-06-01 21:25:53 +04:00
slab_lock ( page ) ;
2013-07-15 05:05:29 +04:00
if ( page - > freelist = = freelist_old & &
page - > counters = = counters_old ) {
2011-06-01 21:25:49 +04:00
page - > freelist = freelist_new ;
2014-01-31 03:46:09 +04:00
set_page_slub_counters ( page , counters_new ) ;
2011-06-01 21:25:53 +04:00
slab_unlock ( page ) ;
2011-07-14 21:49:12 +04:00
local_irq_restore ( flags ) ;
2011-06-01 21:25:49 +04:00
return 1 ;
}
2011-06-01 21:25:53 +04:00
slab_unlock ( page ) ;
2011-07-14 21:49:12 +04:00
local_irq_restore ( flags ) ;
2011-06-01 21:25:49 +04:00
}
cpu_relax ( ) ;
stat ( s , CMPXCHG_DOUBLE_FAIL ) ;
# ifdef SLUB_DEBUG_CMPXCHG
printk ( KERN_INFO " %s %s: cmpxchg double redo " , n , s - > name ) ;
# endif
return 0 ;
}
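/*
 * Illustration only, not part of this file: a user-space sketch of the
 * lock-based fallback that both cmpxchg helpers above use when no hardware
 * double-word cmpxchg is available. The dcx_* names and the atomic_flag spin
 * lock are assumptions standing in for slab_lock()/slab_unlock(); interrupt
 * disabling is not modelled at all.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical slab descriptor: two words guarded by a single lock. */
struct dcx_slab {
	atomic_flag lock;
	void *freelist;
	unsigned long counters;	/* nothing overlays this word in the sketch */
};

static void dcx_lock(struct dcx_slab *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void dcx_unlock(struct dcx_slab *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Replace both words only if both still hold the expected old values. */
static bool dcx_cmpxchg_double(struct dcx_slab *s,
			       void *freelist_old, unsigned long counters_old,
			       void *freelist_new, unsigned long counters_new)
{
	bool ok = false;

	dcx_lock(s);
	if (s->freelist == freelist_old && s->counters == counters_old) {
		s->freelist = freelist_new;
		s->counters = counters_new;
		ok = true;
	}
	dcx_unlock(s);
	return ok;
}
```

/*
 * A caller would typically retry in a loop: re-read both words, compute the
 * new values, and call the helper again until it returns true.
 */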
2007-05-09 13:32:44 +04:00
# ifdef CONFIG_SLUB_DEBUG
2011-04-15 23:48:13 +04:00
/*
* Determine a map of objects in use on a page . Bits are set for the objects
* on the freelist , so the objects in use are the bits left clear .
*
2011-06-01 21:25:53 +04:00
* Node listlock must be held to guarantee that the page does
2011-04-15 23:48:13 +04:00
* not vanish from under us .
*/
static void get_map ( struct kmem_cache * s , struct page * page , unsigned long * map )
{
void * p ;
void * addr = page_address ( page ) ;
for ( p = page - > freelist ; p ; p = get_freepointer ( s , p ) )
set_bit ( slab_index ( p , s , addr ) , map ) ;
}
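/*
 * Illustration only, not part of this file: a user-space sketch of the
 * freelist walk in get_map(). It assumes the free pointer is stored in the
 * first word of each free object and that objects are MAP_OBJ_SIZE bytes
 * apart; the map_* names and both constants are made up for the example.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAP_OBJ_SIZE 32		/* assumed object stride */
#define MAP_NR_OBJS 4		/* assumed objects per slab */

/* Read the link stored in the first word of a free object. */
static void *map_get_freepointer(void *object)
{
	void *next;

	memcpy(&next, object, sizeof(next));
	return next;
}

/* Store the link in the first word of a free object. */
static void map_set_freepointer(void *object, void *next)
{
	memcpy(object, &next, sizeof(next));
}

/* Index of an object within the slab that starts at addr. */
static unsigned map_slab_index(void *p, void *addr)
{
	return (unsigned)(((char *)p - (char *)addr) / MAP_OBJ_SIZE);
}

/* Set one bit per object reachable from the freelist. */
static unsigned long map_get_map(void *freelist, void *addr)
{
	unsigned long map = 0;
	void *p;

	for (p = freelist; p; p = map_get_freepointer(p))
		map |= 1UL << map_slab_index(p, addr);
	return map;
}
```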
2007-05-09 13:32:44 +04:00
/*
* Debug settings :
*/
2007-07-16 10:38:14 +04:00
# ifdef CONFIG_SLUB_DEBUG_ON
static int slub_debug = DEBUG_DEFAULT_FLAGS ;
# else
2007-05-09 13:32:44 +04:00
static int slub_debug ;
2007-07-16 10:38:14 +04:00
# endif
2007-05-09 13:32:44 +04:00
static char * slub_debug_slabs ;
2009-07-07 11:14:14 +04:00
static int disable_higher_order_debug ;
2007-05-09 13:32:44 +04:00
2007-05-07 01:49:36 +04:00
/*
* Object debugging
*/
static void print_section ( char * text , u8 * addr , unsigned int length )
{
2011-07-29 16:10:20 +04:00
print_hex_dump ( KERN_ERR , text , DUMP_PREFIX_ADDRESS , 16 , 1 , addr ,
length , 1 ) ;
2007-05-07 01:49:36 +04:00
}
static struct track * get_track ( struct kmem_cache * s , void * object ,
enum track_item alloc )
{
struct track * p ;
if ( s - > offset )
p = object + s - > offset + sizeof ( void * ) ;
else
p = object + s - > inuse ;
return p + alloc ;
}
static void set_track ( struct kmem_cache * s , void * object ,
2008-08-19 21:43:25 +04:00
enum track_item alloc , unsigned long addr )
2007-05-07 01:49:36 +04:00
{
2009-03-06 18:36:21 +03:00
struct track * p = get_track ( s , object , alloc ) ;
2007-05-07 01:49:36 +04:00
if ( addr ) {
2011-07-07 22:36:36 +04:00
# ifdef CONFIG_STACKTRACE
struct stack_trace trace ;
int i ;
trace . nr_entries = 0 ;
trace . max_entries = TRACK_ADDRS_COUNT ;
trace . entries = p - > addrs ;
trace . skip = 3 ;
save_stack_trace ( & trace ) ;
/* See rant in lockdep.c */
if ( trace . nr_entries ! = 0 & &
trace . entries [ trace . nr_entries - 1 ] = = ULONG_MAX )
trace . nr_entries - - ;
for ( i = trace . nr_entries ; i < TRACK_ADDRS_COUNT ; i + + )
p - > addrs [ i ] = 0 ;
# endif
2007-05-07 01:49:36 +04:00
p - > addr = addr ;
p - > cpu = smp_processor_id ( ) ;
2008-06-23 02:58:37 +04:00
p - > pid = current - > pid ;
2007-05-07 01:49:36 +04:00
p - > when = jiffies ;
} else
memset ( p , 0 , sizeof ( struct track ) ) ;
}
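/*
 * Illustration only, not part of this file: a user-space sketch of the
 * clean-up that set_track() applies to a saved stack trace. A trailing
 * ULONG_MAX terminator is dropped and the unused slots are zeroed, so that a
 * reader such as print_track() can stop at the first zero entry.
 * TRK_ADDRS_COUNT and trk_trim_trace() are assumed names; the capacity of 8
 * is arbitrary.
 */

```c
#include <assert.h>
#include <limits.h>

#define TRK_ADDRS_COUNT 8	/* assumed capacity of the address array */

/* Returns the number of entries kept after trimming and zero-filling. */
static unsigned trk_trim_trace(unsigned long *addrs, unsigned nr_entries)
{
	unsigned i;

	if (nr_entries != 0 && addrs[nr_entries - 1] == ULONG_MAX)
		nr_entries--;
	for (i = nr_entries; i < TRK_ADDRS_COUNT; i++)
		addrs[i] = 0;
	return nr_entries;
}
```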
static void init_tracking ( struct kmem_cache * s , void * object )
{
2007-07-17 15:03:18 +04:00
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2008-08-19 21:43:25 +04:00
set_track ( s , object , TRACK_FREE , 0UL ) ;
set_track ( s , object , TRACK_ALLOC , 0UL ) ;
2007-05-07 01:49:36 +04:00
}
static void print_track ( const char * s , struct track * t )
{
if ( ! t - > addr )
return ;
2008-07-14 23:12:53 +04:00
printk ( KERN_ERR " INFO: %s in %pS age=%lu cpu=%u pid=%d \n " ,
2008-08-19 21:43:25 +04:00
s , ( void * ) t - > addr , jiffies - t - > when , t - > cpu , t - > pid ) ;
2011-07-07 22:36:36 +04:00
# ifdef CONFIG_STACKTRACE
{
int i ;
for ( i = 0 ; i < TRACK_ADDRS_COUNT ; i + + )
if ( t - > addrs [ i ] )
printk ( KERN_ERR " \t %pS \n " , ( void * ) t - > addrs [ i ] ) ;
else
break ;
}
# endif
2007-07-17 15:03:18 +04:00
}
static void print_tracking ( struct kmem_cache * s , void * object )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
print_track ( " Allocated " , get_track ( s , object , TRACK_ALLOC ) ) ;
print_track ( " Freed " , get_track ( s , object , TRACK_FREE ) ) ;
}
static void print_page_info ( struct page * page )
{
2013-07-15 05:05:29 +04:00
printk ( KERN_ERR
" INFO: Slab 0x%p objects=%u used=%u fp=0x%p flags=0x%04lx \n " ,
page , page - > objects , page - > inuse , page - > freelist , page - > flags ) ;
2007-07-17 15:03:18 +04:00
}
static void slab_bug ( struct kmem_cache * s , char * fmt , . . . )
{
va_list args ;
char buf [ 100 ] ;
va_start ( args , fmt ) ;
vsnprintf ( buf , sizeof ( buf ) , fmt , args ) ;
va_end ( args ) ;
printk ( KERN_ERR " ======================================== "
" ===================================== \n " ) ;
2011-11-16 03:04:00 +04:00
printk ( KERN_ERR " BUG %s (%s): %s \n " , s - > name , print_tainted ( ) , buf ) ;
2007-07-17 15:03:18 +04:00
printk ( KERN_ERR " ---------------------------------------- "
" ------------------------------------- \n \n " ) ;
2012-09-18 23:54:12 +04:00
2013-01-21 10:47:39 +04:00
add_taint ( TAINT_BAD_PAGE , LOCKDEP_NOW_UNRELIABLE ) ;
2007-05-07 01:49:36 +04:00
}
2007-07-17 15:03:18 +04:00
static void slab_fix ( struct kmem_cache * s , char * fmt , . . . )
{
va_list args ;
char buf [ 100 ] ;
va_start ( args , fmt ) ;
vsnprintf ( buf , sizeof ( buf ) , fmt , args ) ;
va_end ( args ) ;
printk ( KERN_ERR " FIX %s: %s \n " , s - > name , buf ) ;
}
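/*
 * Illustration only, not part of this file: slab_bug() and slab_fix() both
 * format a variadic message into a fixed 100-byte buffer before printing it.
 * The sketch below shows that pattern in user space. rpt_format() is an
 * assumed name, and the format attribute is a GCC/Clang extension that lets
 * the compiler compare the arguments against the format string.
 */

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Format into a caller-supplied buffer. vsnprintf() truncates to fit and,
 * for size > 0, always NUL-terminates; it returns the untruncated length.
 */
#if defined(__GNUC__)
__attribute__((format(printf, 3, 4)))
#endif
static int rpt_format(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);
	return n;
}
```

/*
 * A return value greater than or equal to the buffer size means the message
 * was cut short, which is the trade-off of using a small on-stack buffer.
 */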
static void print_trailer ( struct kmem_cache * s , struct page * page , u8 * p )
2007-05-07 01:49:36 +04:00
{
unsigned int off ; /* Offset of last byte */
2008-03-02 00:40:44 +03:00
u8 * addr = page_address ( page ) ;
2007-07-17 15:03:18 +04:00
print_tracking ( s , p ) ;
print_page_info ( page ) ;
printk ( KERN_ERR " INFO: Object 0x%p @offset=%tu fp=0x%p \n \n " ,
p , p - addr , get_freepointer ( s , p ) ) ;
if ( p > addr + 16 )
2011-07-29 16:10:20 +04:00
print_section ( " Bytes b4 " , p - 16 , 16 ) ;
2007-05-07 01:49:36 +04:00
2012-06-13 19:24:57 +04:00
print_section ( " Object " , p , min_t ( unsigned long , s - > object_size ,
2011-07-29 16:10:20 +04:00
PAGE_SIZE ) ) ;
2007-05-07 01:49:36 +04:00
if ( s - > flags & SLAB_RED_ZONE )
2012-06-13 19:24:57 +04:00
print_section ( " Redzone " , p + s - > object_size ,
s - > inuse - s - > object_size ) ;
2007-05-07 01:49:36 +04:00
if ( s - > offset )
off = s - > offset + sizeof ( void * ) ;
else
off = s - > inuse ;
2007-07-17 15:03:18 +04:00
if ( s - > flags & SLAB_STORE_USER )
2007-05-07 01:49:36 +04:00
off + = 2 * sizeof ( struct track ) ;
if ( off ! = s - > size )
/* Beginning of the filler is the free pointer */
2011-07-29 16:10:20 +04:00
print_section ( " Padding " , p + off , s - > size - off ) ;
2007-07-17 15:03:18 +04:00
dump_stack ( ) ;
2007-05-07 01:49:36 +04:00
}
static void object_err ( struct kmem_cache * s , struct page * page ,
u8 * object , char * reason )
{
2008-04-23 23:28:01 +04:00
slab_bug ( s , " %s " , reason ) ;
2007-07-17 15:03:18 +04:00
print_trailer ( s , page , object ) ;
2007-05-07 01:49:36 +04:00
}
2013-07-15 05:05:29 +04:00
static void slab_err ( struct kmem_cache * s , struct page * page ,
const char * fmt , . . . )
2007-05-07 01:49:36 +04:00
{
va_list args ;
char buf [ 100 ] ;
2007-07-17 15:03:18 +04:00
va_start ( args , fmt ) ;
vsnprintf ( buf , sizeof ( buf ) , fmt , args ) ;
2007-05-07 01:49:36 +04:00
va_end ( args ) ;
2008-04-23 23:28:01 +04:00
slab_bug ( s , " %s " , buf ) ;
2007-07-17 15:03:18 +04:00
print_page_info ( page ) ;
2007-05-07 01:49:36 +04:00
dump_stack ( ) ;
}
2010-09-29 16:15:01 +04:00
static void init_object ( struct kmem_cache * s , void * object , u8 val )
2007-05-07 01:49:36 +04:00
{
u8 * p = object ;
if ( s - > flags & __OBJECT_POISON ) {
2012-06-13 19:24:57 +04:00
memset ( p , POISON_FREE , s - > object_size - 1 ) ;
p [ s - > object_size - 1 ] = POISON_END ;
2007-05-07 01:49:36 +04:00
}
if ( s - > flags & SLAB_RED_ZONE )
2012-06-13 19:24:57 +04:00
memset ( p + s - > object_size , val , s - > inuse - s - > object_size ) ;
2007-05-07 01:49:36 +04:00
}
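/*
 * Illustration only, not part of this file: a user-space sketch of what
 * init_object() writes. The byte values are the ones named in the object
 * layout comment of this file (0x6b, 0xa5, 0xbb, 0xcc); the poi_* names and
 * the plain int switches replacing the cache flags are assumptions.
 */

```c
#include <assert.h>
#include <string.h>

#define POI_POISON_FREE  0x6b	/* fills a free object */
#define POI_POISON_END   0xa5	/* last byte of a poisoned object */
#define POI_RED_INACTIVE 0xbb	/* red zone of a free object */
#define POI_RED_ACTIVE   0xcc	/* red zone of an object in use */

/* Poison the object body and paint the red zone between object_size and inuse. */
static void poi_init_object(unsigned char *p, size_t object_size, size_t inuse,
			    int poison, int red_zone, unsigned char val)
{
	if (poison) {
		memset(p, POI_POISON_FREE, object_size - 1);
		p[object_size - 1] = POI_POISON_END;
	}
	if (red_zone)
		memset(p + object_size, val, inuse - object_size);
}
```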
2007-07-17 15:03:18 +04:00
static void restore_bytes ( struct kmem_cache * s , char * message , u8 data ,
void * from , void * to )
{
slab_fix ( s , " Restoring 0x%p-0x%p=0x%x \n " , from , to - 1 , data ) ;
memset ( from , data , to - from ) ;
}
static int check_bytes_and_report ( struct kmem_cache * s , struct page * page ,
u8 * object , char * what ,
2008-01-08 10:20:27 +03:00
u8 * start , unsigned int value , unsigned int bytes )
2007-07-17 15:03:18 +04:00
{
u8 * fault ;
u8 * end ;
2011-11-01 04:08:07 +04:00
fault = memchr_inv ( start , value , bytes ) ;
2007-07-17 15:03:18 +04:00
if ( ! fault )
return 1 ;
end = start + bytes ;
while ( end > fault & & end [ - 1 ] = = value )
end - - ;
slab_bug ( s , " %s overwritten " , what ) ;
printk ( KERN_ERR " INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x \n " ,
fault , end - 1 , fault [ 0 ] , value ) ;
print_trailer ( s , page , object ) ;
restore_bytes ( s , what , value , fault , end ) ;
return 0 ;
2007-05-07 01:49:36 +04:00
}
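/*
 * Illustration only, not part of this file: a user-space sketch of how
 * check_bytes_and_report() narrows down a corrupted range. It finds the first
 * byte that differs from the expected value, then walks back from the end
 * over bytes that are still intact. memchr_inv() is a kernel helper, so the
 * sketch uses its own simple loop; all chk_* names are assumptions.
 */

```c
#include <assert.h>
#include <stddef.h>

/* First byte in [start, start + bytes) that differs from value, or NULL. */
static unsigned char *chk_first_mismatch(unsigned char *start,
					 unsigned char value, size_t bytes)
{
	size_t i;

	for (i = 0; i < bytes; i++)
		if (start[i] != value)
			return start + i;
	return NULL;
}

/* Returns 1 if intact, else 0 with the damaged range in [*fault, *end). */
static int chk_bytes(unsigned char *start, unsigned char value, size_t bytes,
		     unsigned char **fault, unsigned char **end)
{
	*fault = chk_first_mismatch(start, value, bytes);
	if (!*fault)
		return 1;

	*end = start + bytes;
	while (*end > *fault && (*end)[-1] == value)
		(*end)--;
	return 0;
}
```

/*
 * Bytes inside the reported range may still hold the expected value; the
 * range only brackets the first and the last damaged byte.
 */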
/*
* Object layout :
*
* object address
* Bytes of the object to be managed .
* If the freepointer may overlay the object then the free
* pointer is the first word of the object .
2007-05-09 13:32:39 +04:00
*
2007-05-07 01:49:36 +04:00
* Poisoning uses 0x6b ( POISON_FREE ) and the last byte is
* 0xa5 ( POISON_END )
*
2012-06-13 19:24:57 +04:00
* object + s - > object_size
2007-05-07 01:49:36 +04:00
* Padding to reach word boundary . This is also used for Redzoning .
2007-05-09 13:32:39 +04:00
* Padding is extended by another word if Redzoning is enabled and
2012-06-13 19:24:57 +04:00
* object_size = = inuse .
2007-05-09 13:32:39 +04:00
*
2007-05-07 01:49:36 +04:00
* We fill with 0xbb ( RED_INACTIVE ) for inactive objects and with
* 0xcc ( RED_ACTIVE ) for objects in use .
*
* object + s - > inuse
2007-05-09 13:32:39 +04:00
* Meta data starts here .
*
2007-05-07 01:49:36 +04:00
* A . Free pointer ( if we cannot overwrite object on free )
* B . Tracking data for SLAB_STORE_USER
2007-05-09 13:32:39 +04:00
* C . Padding to reach required alignment boundary or at minimum
2008-02-16 10:45:26 +03:00
* one word if debugging is on to be able to detect writes
2007-05-09 13:32:39 +04:00
* before the word boundary .
*
* Padding is done using 0x5a ( POISON_INUSE )
2007-05-07 01:49:36 +04:00
*
* object + s - > size
2007-05-09 13:32:39 +04:00
* Nothing is used beyond s - > size .
2007-05-07 01:49:36 +04:00
*
2012-06-13 19:24:57 +04:00
* If slabcaches are merged then the object_size and inuse boundaries are mostly
2007-05-09 13:32:39 +04:00
* ignored . And therefore no slab options that rely on these boundaries
2007-05-07 01:49:36 +04:00
* may be used with merged slabcaches .
*/
static int check_pad_bytes ( struct kmem_cache * s , struct page * page , u8 * p )
{
unsigned long off = s - > inuse ; /* The end of info */
if ( s - > offset )
/* Freepointer is placed after the object. */
off + = sizeof ( void * ) ;
if ( s - > flags & SLAB_STORE_USER )
/* We also have user information there */
off + = 2 * sizeof ( struct track ) ;
if ( s - > size = = off )
return 1 ;
2007-07-17 15:03:18 +04:00
return check_bytes_and_report ( s , page , p , " Object padding " ,
p + off , POISON_INUSE , s - > size - off ) ;
2007-05-07 01:49:36 +04:00
}
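/*
 * Illustration only, not part of this file: a user-space sketch of the offset
 * arithmetic in check_pad_bytes(), following the object layout comment above
 * it. struct lay_cache is a hypothetical, flattened view of the cache fields
 * involved, and track_size stands in for sizeof(struct track).
 */

```c
#include <assert.h>
#include <stddef.h>

struct lay_cache {
	size_t inuse;		/* end of the object plus red zone */
	size_t offset;		/* nonzero if the free pointer follows the object */
	size_t size;		/* full stride of one object slot */
	int store_user;		/* tracking data present? */
	size_t track_size;	/* assumed size of one tracking record */
};

/* Offset at which the trailing padding starts; equals size if there is none. */
static size_t lay_pad_start(const struct lay_cache *s)
{
	size_t off = s->inuse;

	if (s->offset)
		off += sizeof(void *);		/* free pointer after the object */
	if (s->store_user)
		off += 2 * s->track_size;	/* alloc and free records */
	return off;
}

/* Number of padding bytes that a checker would have to verify. */
static size_t lay_pad_bytes(const struct lay_cache *s)
{
	return s->size - lay_pad_start(s);
}
```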
2008-04-14 20:11:30 +04:00
/* Check the pad bytes at the end of a slab page */
2007-05-07 01:49:36 +04:00
static int slab_pad_check ( struct kmem_cache * s , struct page * page )
{
2007-07-17 15:03:18 +04:00
u8 * start ;
u8 * fault ;
u8 * end ;
int length ;
int remainder ;
2007-05-07 01:49:36 +04:00
if ( ! ( s - > flags & SLAB_POISON ) )
return 1 ;
2008-03-02 00:40:44 +03:00
start = page_address ( page ) ;
2011-03-10 10:21:48 +03:00
length = ( PAGE_SIZE < < compound_order ( page ) ) - s - > reserved ;
2008-04-14 20:11:30 +04:00
end = start + length ;
remainder = length % s - > size ;
2007-05-07 01:49:36 +04:00
if ( ! remainder )
return 1 ;
2011-11-01 04:08:07 +04:00
fault = memchr_inv ( end - remainder , POISON_INUSE , remainder ) ;
2007-07-17 15:03:18 +04:00
if ( ! fault )
return 1 ;
while ( end > fault & & end [ - 1 ] = = POISON_INUSE )
end - - ;
slab_err ( s , page , " Padding overwritten. 0x%p-0x%p " , fault , end - 1 ) ;
2011-07-29 16:10:20 +04:00
print_section ( " Padding " , end - remainder , remainder ) ;
2007-07-17 15:03:18 +04:00
2009-09-03 18:08:06 +04:00
restore_bytes ( s , " slab padding " , POISON_INUSE , end - remainder , end ) ;
2007-07-17 15:03:18 +04:00
return 0 ;
2007-05-07 01:49:36 +04:00
}
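/*
 * Illustration only, not part of this file: the arithmetic behind
 * slab_pad_check(). A slab of (PAGE_SIZE << order) bytes, minus any reserved
 * tail, holds a whole number of objects; whatever is left over is the padding
 * that gets checked. PAD_PAGE_SIZE is assumed to be 4096 here, and the pad_*
 * names are made up for the example.
 */

```c
#include <assert.h>
#include <stddef.h>

#define PAD_PAGE_SIZE 4096UL	/* assumed; the real value depends on the architecture */

/* Usable bytes in a slab of the given order. */
static size_t pad_slab_length(unsigned order, size_t reserved)
{
	return (PAD_PAGE_SIZE << order) - reserved;
}

/* Objects of the given size that fit into the slab. */
static size_t pad_objects(unsigned order, size_t size, size_t reserved)
{
	return pad_slab_length(order, reserved) / size;
}

/* Bytes left over behind the last object. */
static size_t pad_remainder(unsigned order, size_t size, size_t reserved)
{
	return pad_slab_length(order, reserved) % size;
}
```

/*
 * For example, 192-byte objects in a single 4096-byte page give 21 objects
 * and 64 bytes of padding.
 */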
static int check_object ( struct kmem_cache * s , struct page * page ,
2010-09-29 16:15:01 +04:00
void * object , u8 val )
2007-05-07 01:49:36 +04:00
{
u8 * p = object ;
2012-06-13 19:24:57 +04:00
u8 * endobject = object + s - > object_size ;
2007-05-07 01:49:36 +04:00
if ( s - > flags & SLAB_RED_ZONE ) {
2007-07-17 15:03:18 +04:00
if ( ! check_bytes_and_report ( s , page , object , " Redzone " ,
2012-06-13 19:24:57 +04:00
endobject , val , s - > inuse - s - > object_size ) )
2007-05-07 01:49:36 +04:00
return 0 ;
} else {
2012-06-13 19:24:57 +04:00
if ( ( s - > flags & SLAB_POISON ) & & s - > object_size < s - > inuse ) {
2008-02-06 04:57:39 +03:00
check_bytes_and_report ( s , page , p , " Alignment padding " ,
2013-07-15 05:05:29 +04:00
endobject , POISON_INUSE ,
s - > inuse - s - > object_size ) ;
2008-02-06 04:57:39 +03:00
}
2007-05-07 01:49:36 +04:00
}
if ( s - > flags & SLAB_POISON ) {
2010-09-29 16:15:01 +04:00
if ( val ! = SLUB_RED_ACTIVE & & ( s - > flags & __OBJECT_POISON ) & &
2007-07-17 15:03:18 +04:00
( ! check_bytes_and_report ( s , page , p , " Poison " , p ,
2012-06-13 19:24:57 +04:00
POISON_FREE , s - > object_size - 1 ) | |
2007-07-17 15:03:18 +04:00
! check_bytes_and_report ( s , page , p , " Poison " ,
2012-06-13 19:24:57 +04:00
p + s - > object_size - 1 , POISON_END , 1 ) ) )
2007-05-07 01:49:36 +04:00
return 0 ;
/*
* check_pad_bytes cleans up on its own .
*/
check_pad_bytes ( s , page , p ) ;
}
2010-09-29 16:15:01 +04:00
if ( ! s - > offset & & val = = SLUB_RED_ACTIVE )
2007-05-07 01:49:36 +04:00
/*
* Object and freepointer overlap . Cannot check
* freepointer while object is allocated .
*/
return 1 ;
/* Check free pointer validity */
if ( ! check_valid_pointer ( s , page , get_freepointer ( s , p ) ) ) {
object_err ( s , page , p , " Freepointer corrupt " ) ;
/*
2008-12-05 06:08:08 +03:00
* No choice but to zap it and thus lose the remainder
2007-05-07 01:49:36 +04:00
* of the free objects in this slab . May cause
2007-05-09 13:32:39 +04:00
* another error because the object count is now wrong .
2007-05-07 01:49:36 +04:00
*/
2008-03-02 00:40:44 +03:00
set_freepointer ( s , p , NULL ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
return 1 ;
}
static int check_slab ( struct kmem_cache * s , struct page * page )
{
2008-04-14 20:11:30 +04:00
int maxobj ;
2007-05-07 01:49:36 +04:00
VM_BUG_ON ( ! irqs_disabled ( ) ) ;
if ( ! PageSlab ( page ) ) {
2007-07-17 15:03:18 +04:00
slab_err ( s , page , " Not a valid slab page " ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
2008-04-14 20:11:30 +04:00
2011-03-10 10:21:48 +03:00
maxobj = order_objects ( compound_order ( page ) , s - > size , s - > reserved ) ;
2008-04-14 20:11:30 +04:00
if ( page - > objects > maxobj ) {
slab_err ( s , page , " objects %u > max %u " ,
page - > objects , maxobj ) ;
return 0 ;
}
if ( page - > inuse > page - > objects ) {
2007-07-17 15:03:18 +04:00
slab_err ( s , page , " inuse %u > max %u " ,
2008-04-14 20:11:30 +04:00
page - > inuse , page - > objects ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
/* slab_pad_check() fixes things up after itself */
slab_pad_check ( s , page ) ;
return 1 ;
}
/*
2007-05-09 13:32:39 +04:00
* Determine if a certain object on a page is on the freelist . Must hold the
* slab lock to guarantee that the chains are in a consistent state .
2007-05-07 01:49:36 +04:00
*/
static int on_freelist ( struct kmem_cache * s , struct page * page , void * search )
{
int nr = 0 ;
2011-06-01 21:25:53 +04:00
void * fp ;
2007-05-07 01:49:36 +04:00
void * object = NULL ;
2008-04-14 20:11:31 +04:00
unsigned long max_objects ;
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:53 +04:00
fp = page - > freelist ;
2008-04-14 20:11:30 +04:00
while ( fp & & nr < = page - > objects ) {
2007-05-07 01:49:36 +04:00
if ( fp = = search )
return 1 ;
if ( ! check_valid_pointer ( s , page , fp ) ) {
if ( object ) {
object_err ( s , page , object ,
" Freechain corrupt " ) ;
2008-03-02 00:40:44 +03:00
set_freepointer ( s , object , NULL ) ;
2007-05-07 01:49:36 +04:00
} else {
2007-07-17 15:03:18 +04:00
slab_err ( s , page , " Freepointer corrupt " ) ;
2008-03-02 00:40:44 +03:00
page - > freelist = NULL ;
2008-04-14 20:11:30 +04:00
page - > inuse = page - > objects ;
2007-07-17 15:03:18 +04:00
slab_fix ( s , " Freelist cleared " ) ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
break ;
}
object = fp ;
fp = get_freepointer ( s , object ) ;
nr + + ;
}
2011-03-10 10:21:48 +03:00
max_objects = order_objects ( compound_order ( page ) , s - > size , s - > reserved ) ;
2008-10-22 23:00:38 +04:00
if ( max_objects > MAX_OBJS_PER_PAGE )
max_objects = MAX_OBJS_PER_PAGE ;
2008-04-14 20:11:31 +04:00
if ( page - > objects ! = max_objects ) {
slab_err ( s , page , " Wrong number of objects. Found %d but "
" should be %lu " , page - > objects , max_objects ) ;
page - > objects = max_objects ;
slab_fix ( s , " Number of objects adjusted. " ) ;
}
2008-04-14 20:11:30 +04:00
if ( page - > inuse ! = page - > objects - nr ) {
2007-05-07 01:49:47 +04:00
slab_err ( s , page , " Wrong object count. Counter is %d but "
2008-04-14 20:11:30 +04:00
" counted were %d " , page - > inuse , page - > objects - nr ) ;
page - > inuse = page - > objects - nr ;
2007-07-17 15:03:18 +04:00
slab_fix ( s , " Object count adjusted. " ) ;
2007-05-07 01:49:36 +04:00
}
return search = = NULL ;
}
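/*
 * Illustration only, not part of this file: a user-space sketch of the
 * bounded walk in on_freelist(). The loop stops after max_objects + 1 links
 * (the kernel code tests nr <= page->objects), so a corrupted chain that
 * loops back on itself cannot spin forever. Pointer validation and the
 * repair steps are left out; struct fl_obj and fl_on_freelist() are assumed
 * names.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free object: the first word links to the next free object. */
struct fl_obj {
	struct fl_obj *next;
};

/*
 * Returns 1 if search is reachable from freelist. *nr_links receives the
 * number of links followed before the walk stopped.
 */
static int fl_on_freelist(struct fl_obj *freelist, struct fl_obj *search,
			  unsigned max_objects, unsigned *nr_links)
{
	struct fl_obj *fp = freelist;
	unsigned nr = 0;

	while (fp && nr <= max_objects) {
		if (fp == search) {
			*nr_links = nr;
			return 1;
		}
		fp = fp->next;
		nr++;
	}
	*nr_links = nr;
	return 0;
}
```

/*
 * A link count larger than the number of objects in the slab is itself a
 * sign of corruption, which is why the count is reported to the caller.
 */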
2008-04-30 03:11:12 +04:00
static void trace ( struct kmem_cache * s , struct page * page , void * object ,
int alloc )
2007-05-17 09:11:00 +04:00
{
if ( s - > flags & SLAB_TRACE ) {
printk ( KERN_INFO " TRACE %s %s 0x%p inuse=%d fp=0x%p \n " ,
s - > name ,
alloc ? " alloc " : " free " ,
object , page - > inuse ,
page - > freelist ) ;
if ( ! alloc )
2013-07-15 05:05:29 +04:00
print_section ( " Object " , ( void * ) object ,
s - > object_size ) ;
2007-05-17 09:11:00 +04:00
dump_stack ( ) ;
}
}
2010-08-20 21:37:16 +04:00
/*
* Hooks for other subsystems that check memory allocations . In a typical
* production configuration these hooks all should produce no code at all .
*/
2013-10-09 02:58:57 +04:00
static inline void kmalloc_large_node_hook ( void * ptr , size_t size , gfp_t flags )
{
kmemleak_alloc ( ptr , size , 1 , flags ) ;
}
static inline void kfree_hook ( const void * x )
{
kmemleak_free ( x ) ;
}
2010-08-20 21:37:16 +04:00
static inline int slab_pre_alloc_hook ( struct kmem_cache * s , gfp_t flags )
{
2010-08-20 21:37:17 +04:00
flags & = gfp_allowed_mask ;
2010-08-20 21:37:16 +04:00
lockdep_trace_alloc ( flags ) ;
might_sleep_if ( flags & __GFP_WAIT ) ;
2012-06-13 19:24:57 +04:00
return should_failslab ( s - > object_size , flags , s - > flags ) ;
2010-08-20 21:37:16 +04:00
}
2013-07-15 05:05:29 +04:00
static inline void slab_post_alloc_hook ( struct kmem_cache * s ,
gfp_t flags , void * object )
2010-08-20 21:37:16 +04:00
{
2010-08-20 21:37:17 +04:00
flags & = gfp_allowed_mask ;
2011-02-14 20:35:22 +03:00
kmemcheck_slab_alloc ( s , flags , object , slab_ksize ( s ) ) ;
2012-06-13 19:24:57 +04:00
kmemleak_alloc_recursive ( object , s - > object_size , 1 , s - > flags , flags ) ;
2010-08-20 21:37:16 +04:00
}
static inline void slab_free_hook ( struct kmem_cache * s , void * x )
{
kmemleak_free_recursive ( x , s - > flags ) ;
2011-02-25 20:38:52 +03:00
/*
2013-10-18 05:12:43 +04:00
* Trouble is that we may no longer disable interrupts in the fast path .
2011-02-25 20:38:52 +03:00
* So in order to make the debug calls that expect irqs to be
* disabled we need to disable interrupts temporarily .
*/
# if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP)
{
unsigned long flags ;
local_irq_save ( flags ) ;
2012-06-13 19:24:57 +04:00
kmemcheck_slab_free ( s , x , s - > object_size ) ;
debug_check_no_locks_freed ( x , s - > object_size ) ;
2011-02-25 20:38:52 +03:00
local_irq_restore ( flags ) ;
}
# endif
2011-03-24 22:26:46 +03:00
if ( ! ( s - > flags & SLAB_DEBUG_OBJECTS ) )
2012-06-13 19:24:57 +04:00
debug_check_no_obj_freed ( x , s - > object_size ) ;
2010-08-20 21:37:16 +04:00
}
2007-05-07 01:49:42 +04:00
/*
2007-05-09 13:32:39 +04:00
* Tracking of fully allocated slabs for debugging purposes .
2007-05-07 01:49:42 +04:00
*/
2011-06-01 21:25:50 +04:00
static void add_full ( struct kmem_cache * s ,
struct kmem_cache_node * n , struct page * page )
2007-05-07 01:49:42 +04:00
{
2011-06-01 21:25:50 +04:00
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2014-02-11 02:25:39 +04:00
lockdep_assert_held ( & n - > list_lock ) ;
2007-05-07 01:49:42 +04:00
list_add ( & page - > lru , & n - > full ) ;
}
2014-01-10 16:23:49 +04:00
static void remove_full ( struct kmem_cache * s , struct kmem_cache_node * n , struct page * page )
2007-05-07 01:49:42 +04:00
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2014-02-11 02:25:39 +04:00
lockdep_assert_held ( & n - > list_lock ) ;
2007-05-07 01:49:42 +04:00
list_del ( & page - > lru ) ;
}
2008-04-14 19:53:02 +04:00
/* Tracking of the number of slabs for debugging purposes */
static inline unsigned long slabs_node ( struct kmem_cache * s , int node )
{
struct kmem_cache_node * n = get_node ( s , node ) ;
return atomic_long_read ( & n - > nr_slabs ) ;
}
2009-06-11 14:08:48 +04:00
static inline unsigned long node_nr_slabs ( struct kmem_cache_node * n )
{
return atomic_long_read ( & n - > nr_slabs ) ;
}
2008-04-14 20:11:40 +04:00
static inline void inc_slabs_node ( struct kmem_cache * s , int node , int objects )
2008-04-14 19:53:02 +04:00
{
struct kmem_cache_node * n = get_node ( s , node ) ;
/*
* May be called early in order to allocate a slab for the
* kmem_cache_node structure . Solve the chicken - egg
* dilemma by deferring the increment of the count during
* bootstrap ( see early_kmem_cache_node_alloc ) .
*/
2013-01-21 12:01:27 +04:00
if ( likely ( n ) ) {
2008-04-14 19:53:02 +04:00
atomic_long_inc ( & n - > nr_slabs ) ;
2008-04-14 20:11:40 +04:00
atomic_long_add ( objects , & n - > total_objects ) ;
}
2008-04-14 19:53:02 +04:00
}
2008-04-14 20:11:40 +04:00
static inline void dec_slabs_node ( struct kmem_cache * s , int node , int objects )
2008-04-14 19:53:02 +04:00
{
struct kmem_cache_node * n = get_node ( s , node ) ;
atomic_long_dec ( & n - > nr_slabs ) ;
2008-04-14 20:11:40 +04:00
atomic_long_sub ( objects , & n - > total_objects ) ;
2008-04-14 19:53:02 +04:00
}
/* Object debug checks for alloc/free paths */
2007-05-17 09:11:00 +04:00
static void setup_object_debug ( struct kmem_cache * s , struct page * page ,
void * object )
{
if ( ! ( s - > flags & ( SLAB_STORE_USER | SLAB_RED_ZONE | __OBJECT_POISON ) ) )
return ;
2010-09-29 16:15:01 +04:00
init_object ( s , object , SLUB_RED_INACTIVE ) ;
2007-05-17 09:11:00 +04:00
init_tracking ( s , object ) ;
}
2013-07-15 05:05:29 +04:00
static noinline int alloc_debug_processing ( struct kmem_cache * s ,
struct page * page ,
2008-08-19 21:43:25 +04:00
void * object , unsigned long addr )
2007-05-07 01:49:36 +04:00
{
if ( ! check_slab ( s , page ) )
goto bad ;
if ( ! check_valid_pointer ( s , page , object ) ) {
object_err ( s , page , object , " Freelist Pointer check fails " ) ;
2007-05-07 01:49:47 +04:00
goto bad ;
2007-05-07 01:49:36 +04:00
}
2010-09-29 16:15:01 +04:00
if ( ! check_object ( s , page , object , SLUB_RED_INACTIVE ) )
2007-05-07 01:49:36 +04:00
goto bad ;
2007-05-17 09:11:00 +04:00
/* Success. Perform special debug activities for allocs. */
if ( s - > flags & SLAB_STORE_USER )
set_track ( s , object , TRACK_ALLOC , addr ) ;
trace ( s , page , object , 1 ) ;
2010-09-29 16:15:01 +04:00
init_object ( s , object , SLUB_RED_ACTIVE ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
2007-05-17 09:11:00 +04:00
2007-05-07 01:49:36 +04:00
bad :
if ( PageSlab ( page ) ) {
/*
* If this is a slab page then let's do the best we can
* to avoid issues in the future . Marking all objects
2007-05-09 13:32:39 +04:00
* as used avoids touching the remaining objects .
2007-05-07 01:49:36 +04:00
*/
2007-07-17 15:03:18 +04:00
slab_fix ( s , " Marking all objects used " ) ;
2008-04-14 20:11:30 +04:00
page - > inuse = page - > objects ;
2008-03-02 00:40:44 +03:00
page - > freelist = NULL ;
2007-05-07 01:49:36 +04:00
}
return 0 ;
}
2012-05-30 21:54:46 +04:00
static noinline struct kmem_cache_node * free_debug_processing (
struct kmem_cache * s , struct page * page , void * object ,
unsigned long addr , unsigned long * flags )
2007-05-07 01:49:36 +04:00
{
2012-05-30 21:54:46 +04:00
struct kmem_cache_node * n = get_node ( s , page_to_nid ( page ) ) ;
2011-06-01 21:25:54 +04:00
2012-05-30 21:54:46 +04:00
spin_lock_irqsave ( & n - > list_lock , * flags ) ;
2011-06-01 21:25:53 +04:00
slab_lock ( page ) ;
2007-05-07 01:49:36 +04:00
if ( ! check_slab ( s , page ) )
goto fail ;
if ( ! check_valid_pointer ( s , page , object ) ) {
2007-05-07 01:49:47 +04:00
slab_err ( s , page , " Invalid object pointer 0x%p " , object ) ;
2007-05-07 01:49:36 +04:00
goto fail ;
}
if ( on_freelist ( s , page , object ) ) {
2007-07-17 15:03:18 +04:00
object_err ( s , page , object , " Object already free " ) ;
2007-05-07 01:49:36 +04:00
goto fail ;
}
2010-09-29 16:15:01 +04:00
if ( ! check_object ( s , page , object , SLUB_RED_ACTIVE ) )
2011-06-01 21:25:54 +04:00
goto out ;
2007-05-07 01:49:36 +04:00
slub: Commonize slab_cache field in struct page
Right now, slab and slub have fields in struct page to derive which
cache a page belongs to, but they do it slightly differently.
slab uses a field called slab_cache, that lives in the third double
word. slub, uses a field called "slab", living outside of the
doublewords area.
Ideally, we could use the same field for this. Since slub heavily makes
use of the doubleword region, there isn't really much room to move
slub's slab_cache field around. Since slab does not have such strict
placement restrictions, we can move it outside the doubleword area.
The naming used by slab, "slab_cache", is less confusing, and it is
preferred over slub's generic "slab".
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-22 18:05:36 +04:00
if ( unlikely ( s ! = page - > slab_cache ) ) {
2008-02-06 04:57:39 +03:00
if ( ! PageSlab ( page ) ) {
2007-05-07 01:49:47 +04:00
slab_err ( s , page , " Attempt to free object(0x%p) "
" outside of slab " , object ) ;
2012-10-22 18:05:36 +04:00
		} else if (!page->slab_cache) {
2007-05-07 01:49:36 +04:00
			printk(KERN_ERR
2007-05-07 01:49:47 +04:00
				"SLUB <none>: no slab for object 0x%p.\n",
2007-05-07 01:49:36 +04:00
				object);
2007-05-07 01:49:47 +04:00
			dump_stack();
2008-01-08 10:20:27 +03:00
		} else
2007-07-17 15:03:18 +04:00
			object_err(s, page, object,
				"page slab pointer corrupt.");
2007-05-07 01:49:36 +04:00
		goto fail;
	}
2007-05-17 09:11:00 +04:00
	if (s->flags & SLAB_STORE_USER)
		set_track(s, object, TRACK_FREE, addr);
	trace(s, page, object, 0);
2010-09-29 16:15:01 +04:00
	init_object(s, object, SLUB_RED_INACTIVE);
2011-06-01 21:25:54 +04:00
out:
2011-06-01 21:25:53 +04:00
	slab_unlock(page);
2012-05-30 21:54:46 +04:00
	/*
	 * Keep node_lock to preserve integrity
	 * until the object is actually freed
	 */
	return n;
2007-05-17 09:11:00 +04:00
2007-05-07 01:49:36 +04:00
fail:
2012-05-30 21:54:46 +04:00
	slab_unlock(page);
	spin_unlock_irqrestore(&n->list_lock, *flags);
2007-07-17 15:03:18 +04:00
	slab_fix(s, "Object at 0x%p not freed", object);
2012-05-30 21:54:46 +04:00
	return NULL;
2007-05-07 01:49:36 +04:00
}
2007-05-09 13:32:44 +04:00
static int __init setup_slub_debug(char *str)
{
2007-07-16 10:38:14 +04:00
	slub_debug = DEBUG_DEFAULT_FLAGS;
	if (*str++ != '=' || !*str)
		/*
		 * No options specified. Switch on full debugging.
		 */
		goto out;
	if (*str == ',')
		/*
		 * No options but restriction on slabs. This means full
		 * debugging for slabs matching a pattern.
		 */
		goto check_slabs;
2009-07-07 11:14:14 +04:00
	if (tolower(*str) == 'o') {
		/*
		 * Avoid enabling debugging on caches if its minimum order
		 * would increase as a result.
		 */
		disable_higher_order_debug = 1;
		goto out;
	}
2007-07-16 10:38:14 +04:00
	slub_debug = 0;
	if (*str == '-')
		/*
		 * Switch off all debugging measures.
		 */
		goto out;
	/*
	 * Determine which debug features should be switched on
	 */
2008-01-08 10:20:27 +03:00
	for (; *str && *str != ','; str++) {
2007-07-16 10:38:14 +04:00
		switch (tolower(*str)) {
		case 'f':
			slub_debug |= SLAB_DEBUG_FREE;
			break;
		case 'z':
			slub_debug |= SLAB_RED_ZONE;
			break;
		case 'p':
			slub_debug |= SLAB_POISON;
			break;
		case 'u':
			slub_debug |= SLAB_STORE_USER;
			break;
		case 't':
			slub_debug |= SLAB_TRACE;
			break;
2010-02-26 09:36:12 +03:00
		case 'a':
			slub_debug |= SLAB_FAILSLAB;
			break;
2007-07-16 10:38:14 +04:00
		default:
			printk(KERN_ERR "slub_debug option '%c' "
2008-01-08 10:20:27 +03:00
				"unknown. skipped\n", *str);
2007-07-16 10:38:14 +04:00
		}
2007-05-09 13:32:44 +04:00
	}
2007-07-16 10:38:14 +04:00
check_slabs:
2007-05-09 13:32:44 +04:00
	if (*str == ',')
		slub_debug_slabs = str + 1;
2007-07-16 10:38:14 +04:00
out:
2007-05-09 13:32:44 +04:00
	return 1;
}
__setup("slub_debug", setup_slub_debug);
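The option grammar handled by setup_slub_debug() can be tried out in user space. The following is only a sketch of the same control flow, not the kernel code: the DBG_* bit values, struct debug_opts and parse_slub_debug() are hypothetical names made up for illustration, and DBG_DEFAULT is assumed to stand in for DEBUG_DEFAULT_FLAGS.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical flag bits for illustration; the kernel's SLAB_* values differ. */
enum {
	DBG_FREE     = 1u << 0,	/* 'f' */
	DBG_RED_ZONE = 1u << 1,	/* 'z' */
	DBG_POISON   = 1u << 2,	/* 'p' */
	DBG_USER     = 1u << 3,	/* 'u' */
	DBG_TRACE    = 1u << 4,	/* 't' */
	DBG_FAILSLAB = 1u << 5,	/* 'a' */
	/* Assumed stand-in for DEBUG_DEFAULT_FLAGS. */
	DBG_DEFAULT  = DBG_FREE | DBG_RED_ZONE | DBG_POISON | DBG_USER
};

struct debug_opts {
	unsigned int flags;	/* selected debug features */
	const char *slabs;	/* cache name prefix after ',' or NULL */
	int no_higher_order;	/* set by the 'o' option */
};

/* str is the text following "slub_debug", e.g. "=zp,dentry". */
static struct debug_opts parse_slub_debug(const char *str)
{
	struct debug_opts o = { DBG_DEFAULT, NULL, 0 };

	assert(str != NULL);
	if (*str++ != '=' || !*str)
		return o;		/* no options: full debugging */
	if (*str == ',')
		goto check_slabs;	/* full debugging, restricted by name */
	if (tolower((unsigned char)*str) == 'o') {
		o.no_higher_order = 1;
		return o;
	}
	o.flags = 0;
	if (*str == '-')
		return o;		/* all debugging switched off */
	for (; *str && *str != ','; str++) {
		switch (tolower((unsigned char)*str)) {
		case 'f': o.flags |= DBG_FREE; break;
		case 'z': o.flags |= DBG_RED_ZONE; break;
		case 'p': o.flags |= DBG_POISON; break;
		case 'u': o.flags |= DBG_USER; break;
		case 't': o.flags |= DBG_TRACE; break;
		case 'a': o.flags |= DBG_FAILSLAB; break;
		default: break;		/* unknown letters are skipped */
		}
	}
check_slabs:
	if (*str == ',')
		o.slabs = str + 1;
	return o;
}
```

With this sketch, "=zp,dentry" selects red zoning and poisoning and records "dentry" as the name prefix, while "=,kmalloc" keeps the default flags and only sets the prefix.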
2012-06-13 19:24:57 +04:00
static unsigned long kmem_cache_flags(unsigned long object_size,
2007-09-12 02:24:11 +04:00
	unsigned long flags, const char *name,
2008-07-26 06:45:34 +04:00
	void (*ctor)(void *))
2007-05-09 13:32:44 +04:00
{
	/*
2008-02-16 10:45:24 +03:00
	 * Enable debugging if selected on the kernel commandline.
2007-05-09 13:32:44 +04:00
	 */
2013-11-07 20:29:15 +04:00
	if (slub_debug && (!slub_debug_slabs || (name &&
		!strncmp(slub_debug_slabs, name, strlen(slub_debug_slabs)))))
2009-07-28 05:30:35 +04:00
		flags |= slub_debug;
2007-09-12 02:24:11 +04:00
	return flags;
2007-05-09 13:32:44 +04:00
}
#else
2007-05-17 09:11:00 +04:00
static inline void setup_object_debug(struct kmem_cache *s,
	struct page *page, void *object) {}
2007-05-09 13:32:44 +04:00
2007-05-17 09:11:00 +04:00
static inline int alloc_debug_processing(struct kmem_cache *s,
2008-08-19 21:43:25 +04:00
	struct page *page, void *object, unsigned long addr) { return 0; }
2007-05-09 13:32:44 +04:00
2012-05-30 21:54:46 +04:00
static inline struct kmem_cache_node *free_debug_processing(
	struct kmem_cache *s, struct page *page, void *object,
	unsigned long addr, unsigned long *flags) { return NULL; }
2007-05-09 13:32:44 +04:00
static inline int slab_pad_check(struct kmem_cache *s, struct page *page)
	{ return 1; }
static inline int check_object(struct kmem_cache *s, struct page *page,
2010-09-29 16:15:01 +04:00
	void *object, u8 val) { return 1; }
2011-06-01 21:25:50 +04:00
static inline void add_full(struct kmem_cache *s, struct kmem_cache_node *n,
	struct page *page) {}
2014-01-10 16:23:49 +04:00
static inline void remove_full(struct kmem_cache *s, struct kmem_cache_node *n,
	struct page *page) {}
2012-06-13 19:24:57 +04:00
static inline unsigned long kmem_cache_flags(unsigned long object_size,
2007-09-12 02:24:11 +04:00
	unsigned long flags, const char *name,
2008-07-26 06:45:34 +04:00
	void (*ctor)(void *))
2007-09-12 02:24:11 +04:00
{
	return flags;
}
2007-05-09 13:32:44 +04:00
#define slub_debug 0
2008-04-14 19:53:02 +04:00
2009-09-15 13:00:26 +04:00
#define disable_higher_order_debug 0
2008-04-14 19:53:02 +04:00
static inline unsigned long slabs_node(struct kmem_cache *s, int node)
	{ return 0; }
2009-06-11 14:08:48 +04:00
static inline unsigned long node_nr_slabs(struct kmem_cache_node *n)
	{ return 0; }
2008-04-14 20:11:40 +04:00
static inline void inc_slabs_node(struct kmem_cache *s, int node,
	int objects) {}
static inline void dec_slabs_node(struct kmem_cache *s, int node,
	int objects) {}
2010-08-25 23:07:16 +04:00
2013-10-09 02:58:57 +04:00
static inline void kmalloc_large_node_hook(void *ptr, size_t size, gfp_t flags)
{
	kmemleak_alloc(ptr, size, 1, flags);
}
static inline void kfree_hook(const void *x)
{
	kmemleak_free(x);
}
2010-08-25 23:07:16 +04:00
static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
	{ return 0; }
static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
2013-10-09 02:58:57 +04:00
	void *object)
{
	kmemleak_alloc_recursive(object, s->object_size, 1, s->flags,
		flags & gfp_allowed_mask);
}
2010-08-25 23:07:16 +04:00
2013-10-09 02:58:57 +04:00
static inline void slab_free_hook(struct kmem_cache *s, void *x)
{
	kmemleak_free_recursive(x, s->flags);
}
2010-08-25 23:07:16 +04:00
2010-10-05 22:57:26 +04:00
#endif /* CONFIG_SLUB_DEBUG */
2008-04-14 20:11:40 +04:00
2007-05-07 01:49:36 +04:00
/*
 * Slab allocation and freeing
 */
2008-04-14 20:11:40 +04:00
static inline struct page *alloc_slab_page(gfp_t flags, int node,
	struct kmem_cache_order_objects oo)
{
	int order = oo_order(oo);
2008-11-25 18:55:53 +03:00
	flags |= __GFP_NOTRACK;
2010-07-09 23:07:10 +04:00
	if (node == NUMA_NO_NODE)
2008-04-14 20:11:40 +04:00
		return alloc_pages(flags, order);
	else
2010-04-14 18:58:36 +04:00
		return alloc_pages_exact_node(node, flags, order);
2008-04-14 20:11:40 +04:00
}
2007-05-07 01:49:36 +04:00
static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
2008-01-08 10:20:27 +03:00
	struct page *page;
2008-04-14 20:11:31 +04:00
	struct kmem_cache_order_objects oo = s->oo;
2009-06-24 22:59:51 +04:00
	gfp_t alloc_gfp;
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:44 +04:00
	flags &= gfp_allowed_mask;
	if (flags & __GFP_WAIT)
		local_irq_enable();
2008-02-15 01:21:32 +03:00
	flags |= s->allocflags;
2007-10-16 12:25:52 +04:00
2009-06-24 22:59:51 +04:00
	/*
	 * Let the initial higher-order allocation fail under memory pressure
	 * so we fall-back to the minimum order allocation.
	 */
	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
	page = alloc_slab_page(alloc_gfp, node, oo);
2008-04-14 20:11:40 +04:00
	if (unlikely(!page)) {
		oo = s->min;
2014-03-12 12:26:20 +04:00
		alloc_gfp = flags;
2008-04-14 20:11:40 +04:00
		/*
		 * Allocation may have failed due to fragmentation.
		 * Try a lower order alloc if possible
		 */
2014-03-12 12:26:20 +04:00
		page = alloc_slab_page(alloc_gfp, node, oo);
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:44 +04:00
		if (page)
			stat(s, ORDER_FALLBACK);
2008-04-14 20:11:40 +04:00
	}
2008-04-04 02:54:48 +04:00
2012-07-10 01:00:38 +04:00
	if (kmemcheck_enabled && page
2009-08-19 22:44:13 +04:00
		&& !(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
2008-11-25 18:55:53 +03:00
		int pages = 1 << oo_order(oo);
2014-03-12 12:26:20 +04:00
		kmemcheck_alloc_shadow(page, oo_order(oo), alloc_gfp, node);
2008-11-25 18:55:53 +03:00
		/*
		 * Objects from caches that have a constructor don't get
		 * cleared when they're allocated, so we need to do it here.
		 */
		if (s->ctor)
			kmemcheck_mark_uninitialized_pages(page, pages);
		else
			kmemcheck_mark_unallocated_pages(page, pages);
2008-04-04 02:54:48 +04:00
	}
2012-07-10 01:00:38 +04:00
	if (flags & __GFP_WAIT)
		local_irq_disable();
	if (!page)
		return NULL;
2008-04-14 20:11:31 +04:00
	page->objects = oo_objects(oo);
2007-05-07 01:49:36 +04:00
	mod_zone_page_state(page_zone(page),
		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
2008-04-14 20:11:40 +04:00
		1 << oo_order(oo));
2007-05-07 01:49:36 +04:00
	return page;
}
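The order fallback in allocate_slab() (try the preferred order first, then retry once at the minimum order and note that a fallback happened) can be illustrated outside the kernel. This is a rough user-space sketch under stated assumptions: struct fake_allocator, fake_alloc_pages() and alloc_with_fallback() are hypothetical stand-ins, malloc() replaces the page allocator, and a 4096-byte page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical page allocator stand-in: refuses orders above max_order. */
struct fake_allocator {
	int max_order;	/* largest order that still succeeds */
	int calls;	/* number of allocation attempts made */
};

static void *fake_alloc_pages(struct fake_allocator *a, int order)
{
	a->calls++;
	if (order > a->max_order)
		return NULL;			/* simulated fragmentation */
	return malloc((size_t)4096 << order);	/* assumed 4 KiB pages */
}

struct slab_result {
	void *mem;	/* NULL if both attempts failed */
	int order;	/* order of the last attempt */
	int fell_back;	/* 1 if the minimum order was used */
};

/* Try the preferred order, then fall back once to the minimum order. */
static struct slab_result alloc_with_fallback(struct fake_allocator *a,
					      int preferred, int minimum)
{
	struct slab_result r = { NULL, preferred, 0 };

	assert(minimum <= preferred);
	r.mem = fake_alloc_pages(a, preferred);
	if (!r.mem) {
		r.order = minimum;
		r.mem = fake_alloc_pages(a, minimum);
		if (r.mem)
			r.fell_back = 1;	/* mirrors stat(s, ORDER_FALLBACK) */
	}
	return r;
}
```

The sketch leaves out what the first attempt changes in the kernel version: there the initial request adds __GFP_NOWARN and __GFP_NORETRY and drops __GFP_NOFAIL so that it may fail quickly, and only the retry uses the caller's flags unchanged.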
static void setup_object(struct kmem_cache *s, struct page *page,
	void *object)
{
2007-05-17 09:11:00 +04:00
	setup_object_debug(s, page, object);
2007-05-07 01:50:17 +04:00
	if (unlikely(s->ctor))
2008-07-26 06:45:34 +04:00
		s->ctor(object);
2007-05-07 01:49:36 +04:00
}
static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
{
	struct page *page;
	void *start;
	void *last;
	void *p;
2012-12-19 02:22:50 +04:00
	int order;
2007-05-07 01:49:36 +04:00
2007-10-16 12:25:41 +04:00
	BUG_ON(flags & GFP_SLAB_BUG_MASK);
2007-05-07 01:49:36 +04:00
2007-10-16 12:25:41 +04:00
	page = allocate_slab(s,
		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
2007-05-07 01:49:36 +04:00
	if (!page)
		goto out;
2012-12-19 02:22:50 +04:00
	order = compound_order(page);
2008-04-14 20:11:40 +04:00
	inc_slabs_node(s, page_to_nid(page), page->objects);
2012-12-19 02:22:50 +04:00
	memcg_bind_pages(s, order);
2012-10-22 18:05:36 +04:00
	page->slab_cache = s;
2012-05-17 19:47:47 +04:00
	__SetPageSlab(page);
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allow network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propagate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes;
with them applied, it runs to completion. With SLAB, the story is
different as an unpatched kernel ran to completion. However, the patched
kernel completed the test 45% faster.
MICRO
                                         3.5.0-rc2   3.5.0-rc2
                                           vanilla     swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds)             197.80      173.07
User+Sys Time Running Test (seconds)        206.96      182.03
Total Elapsed Time (seconds)               3240.70     1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to be able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:43:58 +04:00
	if (page->pfmemalloc)
		SetPageSlabPfmemalloc(page);
2007-05-07 01:49:36 +04:00
	start = page_address(page);
	if (unlikely(s->flags & SLAB_POISON))
2012-12-19 02:22:50 +04:00
		memset(start, POISON_INUSE, PAGE_SIZE << order);
2007-05-07 01:49:36 +04:00
	last = start;
2008-04-14 20:11:31 +04:00
	for_each_object(p, s, start, page->objects) {
2007-05-07 01:49:36 +04:00
		setup_object(s, page, last);
		set_freepointer(s, last, p);
		last = p;
	}
	setup_object(s, page, last);
2008-03-02 00:40:44 +03:00
	set_freepointer(s, last, NULL);
2007-05-07 01:49:36 +04:00
	page->freelist = start;
2011-08-10 01:12:24 +04:00
	page->inuse = page->objects;
2011-06-01 21:25:46 +04:00
	page->frozen = 1;
2007-05-07 01:49:36 +04:00
out:
	return page;
}
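The loop in new_slab() threads a free pointer through every object so that the fresh slab becomes a singly linked list ending in NULL. The sketch below shows that idea on a plain byte buffer. It is an illustration only: fp_set(), fp_get(), build_freelist() and freelist_length() are hypothetical helpers, the free pointer is assumed to live at a fixed byte offset inside each object, and debug setup and constructors are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Store the next-free pointer at `offset` bytes into an object.
 * memcpy is used so the sketch makes no alignment assumptions. */
static void fp_set(void *object, size_t offset, void *next)
{
	memcpy((char *)object + offset, &next, sizeof(next));
}

/* Read back the next-free pointer stored by fp_set(). */
static void *fp_get(const void *object, size_t offset)
{
	void *next;

	memcpy(&next, (const char *)object + offset, sizeof(next));
	return next;
}

/* Link `count` objects of `size` bytes laid out back to back at `start`.
 * Each object points to its successor; the last one points to NULL. */
static void *build_freelist(void *start, size_t size, size_t count,
			    size_t offset)
{
	char *base = start;
	size_t i;

	assert(count > 0 && offset + sizeof(void *) <= size);
	for (i = 0; i + 1 < count; i++)
		fp_set(base + i * size, offset, base + (i + 1) * size);
	fp_set(base + (count - 1) * size, offset, NULL);
	return start;
}

/* Walk the list and count its entries. */
static size_t freelist_length(void *head, size_t offset)
{
	size_t n = 0;

	for (; head; head = fp_get(head, offset))
		n++;
	return n;
}
```

For a 128-byte buffer split into four 32-byte objects, build_freelist() returns the buffer start and walking the list visits all four objects in address order.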
static void __free_slab(struct kmem_cache *s, struct page *page)
{
2008-04-14 20:11:31 +04:00
	int order = compound_order(page);
	int pages = 1 << order;
2007-05-07 01:49:36 +04:00
2010-07-09 23:07:14 +04:00
	if (kmem_cache_debug(s)) {
2007-05-07 01:49:36 +04:00
		void *p;
		slab_pad_check(s, page);
2008-04-14 20:11:31 +04:00
		for_each_object(p, s, page_address(page),
			page->objects)
2010-09-29 16:15:01 +04:00
			check_object(s, page, p, SLUB_RED_INACTIVE);
2007-05-07 01:49:36 +04:00
	}
2008-11-25 18:55:53 +03:00
	kmemcheck_free_shadow(page, compound_order(page));
2008-04-04 02:54:48 +04:00
2007-05-07 01:49:36 +04:00
	mod_zone_page_state(page_zone(page),
		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
2008-01-08 10:20:27 +03:00
		-pages);
2007-05-07 01:49:36 +04:00
2012-08-01 03:43:58 +04:00
	__ClearPageSlabPfmemalloc(page);
2008-04-14 19:52:18 +04:00
	__ClearPageSlab(page);
2012-12-19 02:22:50 +04:00
	memcg_release_pages(s, order);
2013-02-23 04:34:59 +04:00
	page_mapcount_reset(page);
2009-05-05 13:13:44 +04:00
	if (current->reclaim_state)
		current->reclaim_state->reclaimed_slab += pages;
2012-12-19 02:22:48 +04:00
	__free_memcg_kmem_pages(page, order);
2007-05-07 01:49:36 +04:00
}
2011-03-10 10:22:00 +03:00
#define need_reserve_slab_rcu						\
	(sizeof(((struct page *)NULL)->lru) < sizeof(struct rcu_head))
2007-05-07 01:49:36 +04:00
static void rcu_free_slab(struct rcu_head *h)
{
	struct page *page;
2011-03-10 10:22:00 +03:00
	if (need_reserve_slab_rcu)
		page = virt_to_head_page(h);
	else
		page = container_of((struct list_head *)h, struct page, lru);
2012-10-22 18:05:36 +04:00
	__free_slab(page->slab_cache, page);
2007-05-07 01:49:36 +04:00
}
static void free_slab(struct kmem_cache *s, struct page *page)
{
	if (unlikely(s->flags & SLAB_DESTROY_BY_RCU)) {
2011-03-10 10:22:00 +03:00
		struct rcu_head *head;
		if (need_reserve_slab_rcu) {
			int order = compound_order(page);
			int offset = (PAGE_SIZE << order) - s->reserved;
			VM_BUG_ON(s->reserved != sizeof(*head));
			head = page_address(page) + offset;
		} else {
			/*
			 * RCU free overloads the RCU head over the LRU
			 */
			head = (void *)&page->lru;
		}
2007-05-07 01:49:36 +04:00
		call_rcu(head, rcu_free_slab);
	} else
		__free_slab(s, page);
}
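When the RCU head does not fit into page->lru, free_slab() places it in the bytes reserved at the very end of the slab, at offset (PAGE_SIZE << order) - s->reserved. The arithmetic can be checked with a small helper. This is a sketch only: reserved_tail_offset() is a hypothetical name, and the page size is passed in as a parameter instead of using PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of a reserved tail area inside a slab of 2^order pages. */
static size_t reserved_tail_offset(size_t page_size, unsigned int order,
				   size_t reserved)
{
	size_t slab_bytes = page_size << order;

	assert(reserved <= slab_bytes);
	return slab_bytes - reserved;
}
```

For example, with 4096-byte pages, an order-1 slab and 16 reserved bytes, the helper returns 8192 - 16 = 8176.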
static void discard_slab(struct kmem_cache *s, struct page *page)
{
2008-04-14 20:11:40 +04:00
	dec_slabs_node(s, page_to_nid(page), page->objects);
2007-05-07 01:49:36 +04:00
	free_slab(s, page);
}
/*
2011-06-01 21:25:50 +04:00
 * Management of partially allocated slabs.
2007-05-07 01:49:36 +04:00
 */
2014-02-11 02:25:46 +04:00
static inline void
__add_partial(struct kmem_cache_node *n, struct page *page, int tail)
2007-05-07 01:49:36 +04:00
{
2007-05-07 01:49:44 +04:00
	n->nr_partial++;
2011-08-24 04:57:52 +04:00
	if (tail == DEACTIVATE_TO_TAIL)
2008-01-08 10:20:27 +03:00
		list_add_tail(&page->lru, &n->partial);
	else
		list_add(&page->lru, &n->partial);
2007-05-07 01:49:36 +04:00
}
2014-02-11 02:25:46 +04:00
static inline void add_partial(struct kmem_cache_node *n,
	struct page *page, int tail)
2010-09-28 17:10:28 +04:00
{
2014-01-10 16:23:49 +04:00
	lockdep_assert_held(&n->list_lock);
2014-02-11 02:25:46 +04:00
	__add_partial(n, page, tail);
}
2014-01-10 16:23:49 +04:00
2014-02-11 02:25:46 +04:00
static inline void
__remove_partial(struct kmem_cache_node *n, struct page *page)
{
2010-09-28 17:10:28 +04:00
	list_del(&page->lru);
	n->nr_partial--;
}
2014-02-11 02:25:46 +04:00
static inline void remove_partial(struct kmem_cache_node *n,
	struct page *page)
{
	lockdep_assert_held(&n->list_lock);
	__remove_partial(n, page);
}
2007-05-07 01:49:36 +04:00
/*
2012-05-09 19:09:53 +04:00
 * Remove slab from the partial list, freeze it and
 * return the pointer to the freelist.
2007-05-07 01:49:36 +04:00
 *
2011-08-10 01:12:26 +04:00
 * Returns a list of objects or NULL if it fails.
2007-05-07 01:49:36 +04:00
 */
2011-08-10 01:12:26 +04:00
static inline void *acquire_slab(struct kmem_cache *s,
2011-08-10 01:12:25 +04:00
	struct kmem_cache_node *n, struct page *page,
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating the number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse"
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! available is always 0 !!!
...
Therefore, "available > s->cpu_partial / 2" is always false and
we always go to the second iteration.
This patch corrects this problem.
After that, we don't need the return value of put_cpu_partial(),
so remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
	int mode, int *objects)
2007-05-07 01:49:36 +04:00
{
2011-06-01 21:25:52 +04:00
	void *freelist;
	unsigned long counters;
	struct page new;
2014-01-10 16:23:49 +04:00
	lockdep_assert_held(&n->list_lock);
2011-06-01 21:25:52 +04:00
	/*
	 * Zap the freelist and set the frozen bit.
	 * The old freelist is the list of objects for the
	 * per cpu allocation list.
	 */
2012-05-09 19:09:53 +04:00
	freelist = page->freelist;
	counters = page->counters;
	new.counters = counters;
2013-01-21 12:01:25 +04:00
* objects = new . objects - new . inuse ;
2012-06-04 11:14:58 +04:00
if ( mode ) {
2012-05-09 19:09:53 +04:00
new . inuse = page - > objects ;
2012-06-04 11:14:58 +04:00
new . freelist = NULL ;
} else {
new . freelist = freelist ;
}
2011-06-01 21:25:52 +04:00
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( new . frozen ) ;
2012-05-09 19:09:53 +04:00
new . frozen = 1 ;
2011-06-01 21:25:52 +04:00
2012-05-09 19:09:53 +04:00
if ( ! __cmpxchg_double_slab ( s , page ,
2011-06-01 21:25:52 +04:00
freelist , counters ,
2012-05-16 19:13:02 +04:00
new . freelist , new . counters ,
2012-05-09 19:09:53 +04:00
" acquire_slab " ) )
return NULL ;
2011-06-01 21:25:52 +04:00
remove_partial ( n , page ) ;
2012-05-09 19:09:53 +04:00
WARN_ON ( ! freelist ) ;
2011-08-10 01:12:27 +04:00
return freelist ;
2007-05-07 01:49:36 +04:00
}
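The ordering in acquire_slab() matters: the number of free objects is read before inuse is overwritten with page->objects. Below is a minimal user-space sketch of that ordering. It assumes a simplified, hypothetical `struct slab_counters` and a helper named `freeze_counters()`; the kernel packs these fields into page->counters and updates them atomically with __cmpxchg_double_slab(), which this sketch does not model.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the counters packed into struct page. */
struct slab_counters {
	unsigned int inuse;	/* objects currently allocated */
	unsigned int objects;	/* total objects in the slab */
	unsigned int frozen;	/* 1 while a cpu owns the slab */
};

/*
 * Model of the freeze step: compute the free-object count first, then,
 * if mode is set, claim every object for the per cpu freelist.
 * Not atomic; this only illustrates the order of the two steps.
 */
static int freeze_counters(struct slab_counters *c, int mode)
{
	int free_objects = (int)(c->objects - c->inuse);

	assert(!c->frozen);		/* mirrors VM_BUG_ON(new.frozen) */
	if (mode)
		c->inuse = c->objects;	/* the whole freelist is handed over */
	c->frozen = 1;
	return free_objects;
}
```

Computing `objects - inuse` after the call with mode set would always give 0, which is the situation the commit message above describes.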
2013-01-21 12:01:25 +04:00
static void put_cpu_partial ( struct kmem_cache * s , struct page * page , int drain ) ;
2012-09-18 01:09:09 +04:00
static inline bool pfmemalloc_match ( struct page * page , gfp_t gfpflags ) ;
2011-08-10 01:12:27 +04:00
2007-05-07 01:49:36 +04:00
/*
2007-05-09 13:32:39 +04:00
* Try to allocate a partial slab from a specific node .
2007-05-07 01:49:36 +04:00
*/
2012-09-18 01:09:09 +04:00
static void * get_partial_node ( struct kmem_cache * s , struct kmem_cache_node * n ,
struct kmem_cache_cpu * c , gfp_t flags )
2007-05-07 01:49:36 +04:00
{
2011-08-10 01:12:27 +04:00
struct page * page , * page2 ;
void * object = NULL ;
2013-01-21 12:01:25 +04:00
int available = 0 ;
int objects ;
2007-05-07 01:49:36 +04:00
/*
* Racy check . If we mistakenly see no partial slabs then we
* just allocate an empty slab . If we mistakenly try to get a
2007-05-09 13:32:39 +04:00
* partial slab and there is none available then get_partial_node ( )
* will return NULL .
2007-05-07 01:49:36 +04:00
*/
if ( ! n | | ! n - > nr_partial )
return NULL ;
spin_lock ( & n - > list_lock ) ;
2011-08-10 01:12:27 +04:00
list_for_each_entry_safe ( page , page2 , & n - > partial , lru ) {
2012-09-18 01:09:09 +04:00
void * t ;
2011-08-10 01:12:27 +04:00
2012-09-18 01:09:09 +04:00
if ( ! pfmemalloc_match ( page , flags ) )
continue ;
2013-01-21 12:01:25 +04:00
t = acquire_slab ( s , n , page , object = = NULL , & objects ) ;
2011-08-10 01:12:27 +04:00
if ( ! t )
break ;
2013-01-21 12:01:25 +04:00
available + = objects ;
2011-09-07 06:26:36 +04:00
if ( ! object ) {
2011-08-10 01:12:27 +04:00
c - > page = page ;
stat ( s , ALLOC_FROM_PARTIAL ) ;
object = t ;
} else {
2013-01-21 12:01:25 +04:00
put_cpu_partial ( s , page , 0 ) ;
2012-02-03 19:34:56 +04:00
stat ( s , CPU_PARTIAL_NODE ) ;
2011-08-10 01:12:27 +04:00
}
2013-06-19 09:05:52 +04:00
if ( ! kmem_cache_has_cpu_partial ( s )
| | available > s - > cpu_partial / 2 )
2011-08-10 01:12:27 +04:00
break ;
2011-08-10 01:12:26 +04:00
}
2007-05-07 01:49:36 +04:00
spin_unlock ( & n - > list_lock ) ;
2011-08-10 01:12:26 +04:00
return object ;
2007-05-07 01:49:36 +04:00
}
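The stop condition of the loop in get_partial_node() can be sketched in user space as follows. This is an illustration under assumptions, not the kernel code: `slabs_taken()` and the `free_per_slab` array are hypothetical, and the sketch ignores the pfmemalloc_match() filter and the early exit when acquire_slab() fails.

```c
#include <assert.h>

/*
 * Hypothetical model: free_per_slab[i] is the number of free objects in
 * the i-th partial slab of a node. Returns how many slabs would be taken
 * (the first for the cpu slab, the rest for the per cpu partial list).
 */
static int slabs_taken(const int *free_per_slab, int nr_slabs,
		       int has_cpu_partial, int cpu_partial)
{
	int available = 0;
	int taken = 0;
	int i;

	for (i = 0; i < nr_slabs; i++) {
		available += free_per_slab[i];
		taken++;
		/* Same stop condition as the loop in get_partial_node(). */
		if (!has_cpu_partial || available > cpu_partial / 2)
			break;
	}
	return taken;
}
```

With cpu_partial = 30 and slabs holding 4, 5, 9 and 2 free objects, the running total passes 15 at the third slab, so three slabs are taken. Without per cpu partial support the loop stops after the first slab.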
/*
2007-05-09 13:32:39 +04:00
* Get a page from somewhere . Search in increasing NUMA distances .
2007-05-07 01:49:36 +04:00
*/
2012-01-27 12:12:23 +04:00
static void * get_any_partial ( struct kmem_cache * s , gfp_t flags ,
2011-08-10 01:12:25 +04:00
struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
# ifdef CONFIG_NUMA
struct zonelist * zonelist ;
2008-04-28 13:12:17 +04:00
struct zoneref * z ;
2008-04-28 13:12:16 +04:00
struct zone * zone ;
enum zone_type high_zoneidx = gfp_zone ( flags ) ;
2011-08-10 01:12:26 +04:00
void * object ;
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating a large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
                             3.3.0-rc3             3.3.0-rc3
                           rc3-vanilla        nobarrier-v2r1
Clients 1 UserTime         0.07 ( 0.00%)         0.08 (-14.19%)
Clients 2 UserTime         0.07 ( 0.00%)         0.07 (  2.72%)
Clients 4 UserTime         0.08 ( 0.00%)         0.07 (  3.29%)
Clients 1 SysTime          0.70 ( 0.00%)         0.65 (  6.65%)
Clients 2 SysTime          0.85 ( 0.00%)         0.82 (  3.65%)
Clients 4 SysTime          1.41 ( 0.00%)         1.41 (  0.32%)
Clients 1 WallTime         0.77 ( 0.00%)         0.74 (  4.19%)
Clients 2 WallTime         0.47 ( 0.00%)         0.45 (  3.73%)
Clients 4 WallTime         0.38 ( 0.00%)         0.37 (  1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%)    520294.53 (  4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%)    429882.01 (  3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%)    258761.48 (  0.31%)
Clients 1 Flt/sec     495161.39 ( 0.00%)    517292.87 (  4.47%)
Clients 2 Flt/sec     820325.95 ( 0.00%)    850289.77 (  3.65%)
Clients 4 Flt/sec    1020068.93 ( 0.00%)   1022674.06 (  0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds)         135.68    132.17
User+Sys Time Running Test (seconds)    164.2     160.13
Total Elapsed Time (seconds)            123.46    120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22 03:34:11 +04:00
unsigned int cpuset_mems_cookie ;
2007-05-07 01:49:36 +04:00
/*
2007-05-09 13:32:39 +04:00
* The defrag ratio allows a configuration of the tradeoffs between
* inter node defragmentation and node local allocations . A lower
* defrag_ratio increases the tendency to do local allocations
* instead of attempting to obtain partial slabs from other nodes .
2007-05-07 01:49:36 +04:00
*
2007-05-09 13:32:39 +04:00
* If the defrag_ratio is set to 0 then kmalloc ( ) always
* returns node local objects . If the ratio is higher then kmalloc ( )
* may return off node objects because partial slabs are obtained
* from other nodes and filled up .
2007-05-07 01:49:36 +04:00
*
2008-02-16 10:45:26 +03:00
* If /sys/kernel/slab/xx/defrag_ratio is set to 100 ( which makes
2007-05-09 13:32:39 +04:00
* defrag_ratio = 1000 ) then every ( well almost ) allocation will
* first attempt to defrag slab caches on other nodes . This means
* scanning over all nodes to look for partial slabs which may be
* expensive if we do it every time we are trying to find a slab
* with available objects .
2007-05-07 01:49:36 +04:00
*/
2008-01-08 10:20:26 +03:00
if ( ! s - > remote_node_defrag_ratio | |
get_cycles ( ) % 1024 > s - > remote_node_defrag_ratio )
2007-05-07 01:49:36 +04:00
return NULL ;
2012-03-22 03:34:11 +04:00
do {
2014-04-04 01:47:24 +04:00
cpuset_mems_cookie = read_mems_allowed_begin ( ) ;
2014-04-08 02:37:29 +04:00
zonelist = node_zonelist ( mempolicy_slab_node ( ) , flags ) ;
2012-03-22 03:34:11 +04:00
for_each_zone_zonelist ( zone , z , zonelist , high_zoneidx ) {
struct kmem_cache_node * n ;
n = get_node ( s , zone_to_nid ( zone ) ) ;
if ( n & & cpuset_zone_allowed_hardwall ( zone , flags ) & &
n - > nr_partial > s - > min_partial ) {
2012-09-18 01:09:09 +04:00
object = get_partial_node ( s , n , c , flags ) ;
2012-03-22 03:34:11 +04:00
if ( object ) {
/*
2014-04-04 01:47:24 +04:00
* Don ' t check read_mems_allowed_retry ( )
* here - if mems_allowed was updated in
* parallel , that was a harmless race
* between allocation and the cpuset
* update
2012-03-22 03:34:11 +04:00
*/
return object ;
}
2010-05-25 01:32:08 +04:00
}
2007-05-07 01:49:36 +04:00
}
2014-04-04 01:47:24 +04:00
} while ( read_mems_allowed_retry ( cpuset_mems_cookie ) ) ;
2007-05-07 01:49:36 +04:00
# endif
return NULL ;
}
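The defrag ratio check at the top of get_any_partial() acts as a cheap probabilistic throttle on remote node searches. A minimal sketch, assuming a hypothetical helper `should_search_remote()` in which `cycles` stands in for the value returned by get_cycles() and `ratio` is the internal remote_node_defrag_ratio (the sysfs percentage scaled by 10, as the comment in the function describes):

```c
#include <assert.h>

/*
 * Returns 1 if a search of other nodes would be attempted, 0 if the
 * function would return NULL right away. Mirrors the condition
 * "!ratio || get_cycles() % 1024 > ratio" from get_any_partial().
 */
static int should_search_remote(unsigned int ratio, unsigned long cycles)
{
	if (!ratio || cycles % 1024 > ratio)
		return 0;
	return 1;
}
```

Because the cycle counter is reduced modulo 1024 while the ratio tops out at 1000, even the maximum setting skips the remote search for the residues 1001 to 1023. That is the "well almost" in the comment.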
/*
* Get a partial page , lock it and return it .
*/
2011-08-10 01:12:26 +04:00
static void * get_partial ( struct kmem_cache * s , gfp_t flags , int node ,
2011-08-10 01:12:25 +04:00
struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
2011-08-10 01:12:26 +04:00
void * object ;
2010-07-09 23:07:10 +04:00
int searchnode = ( node = = NUMA_NO_NODE ) ? numa_node_id ( ) : node ;
2007-05-07 01:49:36 +04:00
2012-09-18 01:09:09 +04:00
object = get_partial_node ( s , get_node ( s , searchnode ) , c , flags ) ;
2011-08-10 01:12:26 +04:00
if ( object | | node ! = NUMA_NO_NODE )
return object ;
2007-05-07 01:49:36 +04:00
2011-08-10 01:12:25 +04:00
return get_any_partial ( s , flags , c ) ;
2007-05-07 01:49:36 +04:00
}
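The fallback policy of get_partial() can be summarised in a small decision function. This is only a sketch: `pick_source()`, `NO_NODE` and the two boolean inputs are hypothetical stand-ins for the results of get_partial_node() and get_any_partial().

```c
#include <assert.h>

#define NO_NODE (-1)	/* stand-in for NUMA_NO_NODE */

enum source { FROM_NONE, FROM_SEARCH_NODE, FROM_OTHER_NODE };

/*
 * searched_has_partial: the node that was searched first (the requested
 * node, or the current node when none was requested) yielded an object.
 * other_has_partial: a search of the remaining nodes would yield one.
 */
static enum source pick_source(int node, int searched_has_partial,
			       int other_has_partial)
{
	if (searched_has_partial)
		return FROM_SEARCH_NODE;
	if (node != NO_NODE)
		return FROM_NONE;	/* explicit node: no fallback */
	return other_has_partial ? FROM_OTHER_NODE : FROM_NONE;
}
```

A caller that names a specific node never receives a partial slab from a different node; only callers that pass NUMA_NO_NODE fall through to the wider search.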
2011-02-25 20:38:54 +03:00
# ifdef CONFIG_PREEMPT
/*
* Calculate the next globally unique transaction for disambiguation
* during cmpxchg . The transactions start with the cpu number and are then
* incremented by CONFIG_NR_CPUS .
*/
# define TID_STEP roundup_pow_of_two(CONFIG_NR_CPUS)
# else
/*
* No preemption supported therefore also no need to check for
* different cpus .
*/
# define TID_STEP 1
# endif
static inline unsigned long next_tid ( unsigned long tid )
{
return tid + TID_STEP ;
}
static inline unsigned int tid_to_cpu ( unsigned long tid )
{
return tid % TID_STEP ;
}
static inline unsigned long tid_to_event ( unsigned long tid )
{
return tid / TID_STEP ;
}
static inline unsigned int init_tid ( int cpu )
{
return cpu ;
}
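The transaction id arithmetic above can be tried out in user space. The sketch below assumes a hypothetical CPU count of 6 and uses the names `ex_roundup_pow2()`, `ex_next_tid()`, `ex_tid_to_cpu()` and `ex_tid_to_event()`, which are illustrative only; the step is passed as a parameter instead of coming from CONFIG_NR_CPUS.

```c
#include <assert.h>

/* Smallest power of two >= n (n >= 1); stand-in for roundup_pow_of_two(). */
static unsigned long ex_roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Each completed operation advances the tid by one step. */
static unsigned long ex_next_tid(unsigned long tid, unsigned long step)
{
	return tid + step;
}

/* The low part of the tid is the cpu it was initialised with. */
static unsigned int ex_tid_to_cpu(unsigned long tid, unsigned long step)
{
	return (unsigned int)(tid % step);
}

/* The high part counts the operations performed on that cpu. */
static unsigned long ex_tid_to_event(unsigned long tid, unsigned long step)
{
	return tid / step;
}
```

With 6 CPUs the step is 8. A tid that starts at 3 (cpu 3) and is advanced three times becomes 27, which decodes to cpu 3 and event 3, so tids from different CPUs never collide.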
static inline void note_cmpxchg_failure ( const char * n ,
const struct kmem_cache * s , unsigned long tid )
{
# ifdef SLUB_DEBUG_CMPXCHG
unsigned long actual_tid = __this_cpu_read ( s - > cpu_slab - > tid ) ;
printk ( KERN_INFO " %s %s: cmpxchg redo " , n , s - > name ) ;
# ifdef CONFIG_PREEMPT
if ( tid_to_cpu ( tid ) ! = tid_to_cpu ( actual_tid ) )
printk ( " due to cpu change %d -> %d \n " ,
tid_to_cpu ( tid ) , tid_to_cpu ( actual_tid ) ) ;
else
# endif
if ( tid_to_event ( tid ) ! = tid_to_event ( actual_tid ) )
printk ( " due to cpu running other code. Event %ld->%ld \n " ,
tid_to_event ( tid ) , tid_to_event ( actual_tid ) ) ;
else
printk ( " for unknown reason: actual=%lx was=%lx target=%lx \n " ,
actual_tid , tid , next_tid ( tid ) ) ;
# endif
2011-03-22 21:35:00 +03:00
stat ( s , CMPXCHG_DOUBLE_CPU_FAIL ) ;
2011-02-25 20:38:54 +03:00
}
2012-09-28 12:34:05 +04:00
static void init_kmem_cache_cpus ( struct kmem_cache * s )
2011-02-25 20:38:54 +03:00
{
int cpu ;
for_each_possible_cpu ( cpu )
per_cpu_ptr ( s - > cpu_slab , cpu ) - > tid = init_tid ( cpu ) ;
}
2011-06-01 21:25:52 +04:00
2007-05-07 01:49:36 +04:00
/*
* Remove the cpu slab
*/
2013-07-15 05:05:29 +04:00
static void deactivate_slab ( struct kmem_cache * s , struct page * page ,
void * freelist )
2007-05-07 01:49:36 +04:00
{
2011-06-01 21:25:52 +04:00
enum slab_modes { M_NONE , M_PARTIAL , M_FULL , M_FREE } ;
struct kmem_cache_node * n = get_node ( s , page_to_nid ( page ) ) ;
int lock = 0 ;
enum slab_modes l = M_NONE , m = M_NONE ;
void * nextfree ;
2011-08-24 04:57:52 +04:00
int tail = DEACTIVATE_TO_HEAD ;
2011-06-01 21:25:52 +04:00
struct page new ;
struct page old ;
if ( page - > freelist ) {
2009-12-19 01:26:23 +03:00
stat ( s , DEACTIVATE_REMOTE_FREES ) ;
2011-08-24 04:57:52 +04:00
tail = DEACTIVATE_TO_TAIL ;
2011-06-01 21:25:52 +04:00
}
2007-05-10 14:15:16 +04:00
/*
2011-06-01 21:25:52 +04:00
* Stage one : Free all available per cpu objects back
* to the page freelist while it is still frozen . Leave the
* last one .
*
* There is no need to take the list - > lock because the page
* is still frozen .
*/
while ( freelist & & ( nextfree = get_freepointer ( s , freelist ) ) ) {
void * prior ;
unsigned long counters ;
do {
prior = page - > freelist ;
counters = page - > counters ;
set_freepointer ( s , freelist , prior ) ;
new . counters = counters ;
new . inuse - - ;
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! new . frozen ) ;
2011-06-01 21:25:52 +04:00
2011-07-14 21:49:12 +04:00
} while ( ! __cmpxchg_double_slab ( s , page ,
2011-06-01 21:25:52 +04:00
prior , counters ,
freelist , new . counters ,
" drain percpu freelist " ) ) ;
freelist = nextfree ;
}
2007-05-10 14:15:16 +04:00
/*
2011-06-01 21:25:52 +04:00
* Stage two : Ensure that the page is unfrozen while the
* list presence reflects the actual number of objects
* during unfreeze .
*
* We setup the list membership and then perform a cmpxchg
* with the count . If there is a mismatch then the page
* is not unfrozen but the page is on the wrong list .
*
* Then we restart the process which may have to remove
* the page from the list that we just put it on again
* because the number of objects in the slab may have
* changed .
2007-05-10 14:15:16 +04:00
*/
2011-06-01 21:25:52 +04:00
redo :
2007-05-10 14:15:16 +04:00
2011-06-01 21:25:52 +04:00
old . freelist = page - > freelist ;
old . counters = page - > counters ;
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! old . frozen ) ;
2008-01-08 10:20:27 +03:00
2011-06-01 21:25:52 +04:00
/* Determine target state of the slab */
new . counters = old . counters ;
if ( freelist ) {
new . inuse - - ;
set_freepointer ( s , freelist , old . freelist ) ;
new . freelist = freelist ;
} else
new . freelist = old . freelist ;
new . frozen = 0 ;
2011-08-09 22:01:32 +04:00
if ( ! new . inuse & & n - > nr_partial > s - > min_partial )
2011-06-01 21:25:52 +04:00
m = M_FREE ;
else if ( new . freelist ) {
m = M_PARTIAL ;
if ( ! lock ) {
lock = 1 ;
/*
* Taking the spinlock removes the possibility
* that acquire_slab ( ) will see a slab page that
* is frozen
*/
spin_lock ( & n - > list_lock ) ;
}
} else {
m = M_FULL ;
if ( kmem_cache_debug ( s ) & & ! lock ) {
lock = 1 ;
/*
* This also ensures that the scanning of full
* slabs from diagnostic functions will not see
* any frozen slabs .
*/
spin_lock ( & n - > list_lock ) ;
}
}
if ( l ! = m ) {
if ( l = = M_PARTIAL )
remove_partial ( n , page ) ;
else if ( l = = M_FULL )
2007-05-10 14:15:16 +04:00
2014-01-10 16:23:49 +04:00
remove_full ( s , n , page ) ;
2011-06-01 21:25:52 +04:00
if ( m = = M_PARTIAL ) {
add_partial ( n , page , tail ) ;
2011-08-24 04:57:52 +04:00
stat ( s , tail ) ;
2011-06-01 21:25:52 +04:00
} else if ( m = = M_FULL ) {
2007-05-10 14:15:16 +04:00
2011-06-01 21:25:52 +04:00
stat ( s , DEACTIVATE_FULL ) ;
add_full ( s , n , page ) ;
}
}
l = m ;
2011-07-14 21:49:12 +04:00
if ( ! __cmpxchg_double_slab ( s , page ,
2011-06-01 21:25:52 +04:00
old . freelist , old . counters ,
new . freelist , new . counters ,
" unfreezing slab " ) )
goto redo ;
if ( lock )
spin_unlock ( & n - > list_lock ) ;
if ( m = = M_FREE ) {
stat ( s , DEACTIVATE_EMPTY ) ;
discard_slab ( s , page ) ;
stat ( s , FREE_SLAB ) ;
2007-05-10 14:15:16 +04:00
}
2007-05-07 01:49:36 +04:00
}
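deactivate_slab() copies page - > counters into a scratch struct page and then adjusts new . inuse and new . frozen before handing new . counters to the cmpxchg. That works because the counter fields overlay a single word. The sketch below illustrates the idea in userspace C. The field widths are an assumption made for illustration, the bit layout is implementation-defined, and the `sketch_` names are hypothetical.

```c
#include <assert.h>

/* Assumed layout: three bit-fields sharing storage with one word. */
union sketch_counters {
	unsigned long counters;
	struct {
		unsigned inuse:16;    /* objects currently allocated */
		unsigned objects:15;  /* total objects in the slab */
		unsigned frozen:1;    /* slab is owned by one cpu */
	};
};

/*
 * Compute the word that an unfreeze would try to install:
 * optionally give one object back, then clear the frozen bit.
 */
static unsigned long sketch_target_counters(unsigned long old, int give_back_one)
{
	union sketch_counters target;

	target.counters = old;      /* start from a snapshot of the old word */
	if (give_back_one)
		target.inuse--;     /* one fewer object in use */
	target.frozen = 0;          /* target state is unfrozen */
	return target.counters;
}
```

Working on a private copy and publishing the whole word in one compare-and-exchange means a concurrent change to any of the fields makes the exchange fail, so the caller simply re-reads and retries.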
2012-05-18 17:01:17 +04:00
/*
* Unfreeze all the cpu partial slabs .
*
2012-11-28 20:23:00 +04:00
* This function must be called with interrupts disabled
* for the cpu using c ( or some other mechanism must guarantee that
* there are no concurrent accesses ) .
2012-05-18 17:01:17 +04:00
*/
2012-11-28 20:23:00 +04:00
static void unfreeze_partials ( struct kmem_cache * s ,
struct kmem_cache_cpu * c )
2011-08-10 01:12:27 +04:00
{
2013-06-19 09:05:52 +04:00
# ifdef CONFIG_SLUB_CPU_PARTIAL
slub: refactoring unfreeze_partials()
Current implementation of unfreeze_partials() is so complicated,
but benefit from it is insignificant. In addition many code in
do {} while loop have a bad influence to a fail rate of cmpxchg_double_slab.
Under current implementation which test status of cpu partial slab
and acquire list_lock in do {} while loop,
we don't need to acquire a list_lock and gain a little benefit
when front of the cpu partial slab is to be discarded, but this is a rare case.
In case that add_partial is performed and cmpxchg_double_slab is failed,
remove_partial should be called case by case.
I think that these are disadvantages of current implementation,
so I do refactoring unfreeze_partials().
Minimizing code in do {} while loop introduce a reduced fail rate
of cmpxchg_double_slab. Below is output of 'slabinfo -r kmalloc-256'
when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is done.
** before **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 182685
Unlocked Cmpxchg Double redos 0
** after **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 177995
Unlocked Cmpxchg Double redos 1
We can see cmpxchg_double_slab fail rate is improved slightly.
Below is output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.
** before **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% )
2,919,550 context-switches # 0.027 M/sec ( +- 3.07% )
100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% )
124,201 page-faults # 0.001 M/sec ( +- 0.15% )
401,500,234,387 cycles # 3.700 GHz ( +- 0.24% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% )
45,934,956,860 branches # 423.297 M/sec ( +- 0.14% )
188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% )
13.691837307 seconds time elapsed ( +- 0.24% )
** after **
Performance counter stats for './hackbench 50 process 4000' (30 runs):
107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% )
2,834,781 context-switches # 0.026 M/sec ( +- 2.33% )
93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% )
123,967 page-faults # 0.001 M/sec ( +- 0.15% )
398,781,421,836 cycles # 3.700 GHz ( +- 0.22% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% )
45,855,370,128 branches # 425.436 M/sec ( +- 0.10% )
169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% )
13.596272341 seconds time elapsed ( +- 0.22% )
No regression is found, but rather we can see slightly better result.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-08 21:23:16 +04:00
struct kmem_cache_node * n = NULL , * n2 = NULL ;
2011-11-14 09:34:13 +04:00
struct page * page , * discard_page = NULL ;
2011-08-10 01:12:27 +04:00
while ( ( page = c - > partial ) ) {
struct page new ;
struct page old ;
c - > partial = page - > next ;
2012-06-08 21:23:16 +04:00
n2 = get_node ( s , page_to_nid ( page ) ) ;
if ( n ! = n2 ) {
if ( n )
spin_unlock ( & n - > list_lock ) ;
n = n2 ;
spin_lock ( & n - > list_lock ) ;
}
2011-08-10 01:12:27 +04:00
do {
old . freelist = page - > freelist ;
old . counters = page - > counters ;
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! old . frozen ) ;
2011-08-10 01:12:27 +04:00
new . counters = old . counters ;
new . freelist = old . freelist ;
new . frozen = 0 ;
2012-05-18 17:01:17 +04:00
} while ( ! __cmpxchg_double_slab ( s , page ,
2011-08-10 01:12:27 +04:00
old . freelist , old . counters ,
new . freelist , new . counters ,
" unfreezing slab " ) ) ;
2012-06-08 21:23:16 +04:00
if ( unlikely ( ! new . inuse & & n - > nr_partial > s - > min_partial ) ) {
2011-11-14 09:34:13 +04:00
page - > next = discard_page ;
discard_page = page ;
2012-06-08 21:23:16 +04:00
} else {
add_partial ( n , page , DEACTIVATE_TO_TAIL ) ;
stat ( s , FREE_ADD_PARTIAL ) ;
2011-08-10 01:12:27 +04:00
}
}
if ( n )
spin_unlock ( & n - > list_lock ) ;
2011-11-14 09:34:13 +04:00
while ( discard_page ) {
page = discard_page ;
discard_page = discard_page - > next ;
stat ( s , DEACTIVATE_EMPTY ) ;
discard_slab ( s , page ) ;
stat ( s , FREE_SLAB ) ;
}
2013-06-19 09:05:52 +04:00
# endif
2011-08-10 01:12:27 +04:00
}
/*
* Put a page that was just frozen ( in __slab_free ) into a partial page
* slot if available . This is done without interrupts disabled and without
* preemption disabled . The cmpxchg is racy and may put the partial page
* onto a random cpu's partial slot .
*
* If we did not find a slot then simply move all the partials to the
* per node partial list .
*/
slub: correct to calculate num of acquired objects in get_partial_node()
There is a subtle bug when calculating a number of acquired objects.
Currently, we calculate "available = page->objects - page->inuse",
after acquire_slab() is called in get_partial_node().
In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
So,
acquire_slab(s, n, page, object == NULL);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
object = t;
available = page->objects - page->inuse;
!!! available is always 0 !!!
...
Therefore, "available > s->cpu_partial / 2" is always false and
we always go to second iteration.
This patch corrects this problem.
After that, we don't need return value of put_cpu_partial().
So remove it.
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-21 12:01:25 +04:00
static void put_cpu_partial ( struct kmem_cache * s , struct page * page , int drain )
2011-08-10 01:12:27 +04:00
{
2013-06-19 09:05:52 +04:00
# ifdef CONFIG_SLUB_CPU_PARTIAL
2011-08-10 01:12:27 +04:00
struct page * oldpage ;
int pages ;
int pobjects ;
do {
pages = 0 ;
pobjects = 0 ;
oldpage = this_cpu_read ( s - > cpu_slab - > partial ) ;
if ( oldpage ) {
pobjects = oldpage - > pobjects ;
pages = oldpage - > pages ;
if ( drain & & pobjects > s - > cpu_partial ) {
unsigned long flags ;
/*
* partial array is full . Move the existing
* set to the per node partial list .
*/
local_irq_save ( flags ) ;
2012-11-28 20:23:00 +04:00
unfreeze_partials ( s , this_cpu_ptr ( s - > cpu_slab ) ) ;
2011-08-10 01:12:27 +04:00
local_irq_restore ( flags ) ;
2012-06-22 22:22:38 +04:00
oldpage = NULL ;
2011-08-10 01:12:27 +04:00
pobjects = 0 ;
pages = 0 ;
2012-02-03 19:34:56 +04:00
stat ( s , CPU_PARTIAL_DRAIN ) ;
2011-08-10 01:12:27 +04:00
}
}
pages + + ;
pobjects + = page - > objects - page - > inuse ;
page - > pages = pages ;
page - > pobjects = pobjects ;
page - > next = oldpage ;
2013-07-15 05:05:29 +04:00
} while ( this_cpu_cmpxchg ( s - > cpu_slab - > partial , oldpage , page )
! = oldpage ) ;
2013-06-19 09:05:52 +04:00
# endif
2011-08-10 01:12:27 +04:00
}
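The retry loop in put_cpu_partial() is a lock-free push onto a singly linked list whose head node caches the running totals for the whole list. A minimal sketch of that pattern with C11 atomics follows. It is an illustration under stated assumptions, not the kernel implementation: it uses a single global list head instead of a per cpu pointer, it omits the drain path, and all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_page {
	struct sketch_page *next;
	int pages;     /* slabs on the list starting at this node */
	int pobjects;  /* free objects on the list starting at this node */
	int free;      /* free objects in this slab alone */
};

/* Head of the partial list; a static object starts out as NULL. */
static _Atomic(struct sketch_page *) sketch_partial;

static void sketch_put_partial(struct sketch_page *page)
{
	struct sketch_page *old = atomic_load(&sketch_partial);

	do {
		/* Recompute the cached totals from the head we observed. */
		page->pages = (old ? old->pages : 0) + 1;
		page->pobjects = (old ? old->pobjects : 0) + page->free;
		page->next = old;
		/* On failure, old is refreshed with the current head and we retry. */
	} while (!atomic_compare_exchange_weak(&sketch_partial, &old, page));
}
```

Keeping the totals in the head node lets a caller check how much is queued by reading one node instead of walking the list, at the cost of recomputing them on every retry.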
2007-10-16 12:26:05 +04:00
static inline void flush_slab ( struct kmem_cache * s , struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
2009-12-19 01:26:23 +03:00
stat ( s , CPUSLAB_FLUSH ) ;
2012-05-09 19:09:57 +04:00
deactivate_slab ( s , c - > page , c - > freelist ) ;
c - > tid = next_tid ( c - > tid ) ;
c - > page = NULL ;
c - > freelist = NULL ;
2007-05-07 01:49:36 +04:00
}
/*
* Flush cpu slab .
2008-02-16 10:45:26 +03:00
*
2007-05-07 01:49:36 +04:00
* Called from IPI handler with interrupts disabled .
*/
2007-07-17 15:03:24 +04:00
static inline void __flush_cpu_slab ( struct kmem_cache * s , int cpu )
2007-05-07 01:49:36 +04:00
{
2009-12-19 01:26:20 +03:00
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab , cpu ) ;
2007-05-07 01:49:36 +04:00
2011-08-10 01:12:27 +04:00
if ( likely ( c ) ) {
if ( c - > page )
flush_slab ( s , c ) ;
2012-11-28 20:23:00 +04:00
unfreeze_partials ( s , c ) ;
2011-08-10 01:12:27 +04:00
}
2007-05-07 01:49:36 +04:00
}
static void flush_cpu_slab ( void * d )
{
struct kmem_cache * s = d ;
2007-10-16 12:26:05 +04:00
__flush_cpu_slab ( s , smp_processor_id ( ) ) ;
2007-05-07 01:49:36 +04:00
}
2012-03-29 01:42:44 +04:00
static bool has_cpu_slab ( int cpu , void * info )
{
struct kmem_cache * s = info ;
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab , cpu ) ;
2012-05-18 04:03:26 +04:00
return c - > page | | c - > partial ;
2012-03-29 01:42:44 +04:00
}
2007-05-07 01:49:36 +04:00
static void flush_all ( struct kmem_cache * s )
{
2012-03-29 01:42:44 +04:00
on_each_cpu_cond ( has_cpu_slab , flush_cpu_slab , s , 1 , GFP_ATOMIC ) ;
2007-05-07 01:49:36 +04:00
}
2007-10-16 12:26:05 +04:00
/*
* Check if the objects in a per cpu structure fit numa
* locality expectations .
*/
2012-05-09 19:09:59 +04:00
static inline int node_match ( struct page * page , int node )
2007-10-16 12:26:05 +04:00
{
# ifdef CONFIG_NUMA
2013-01-24 01:45:47 +04:00
if ( ! page | | ( node ! = NUMA_NO_NODE & & page_to_nid ( page ) ! = node ) )
2007-10-16 12:26:05 +04:00
return 0 ;
# endif
return 1 ;
}
2009-06-10 19:50:32 +04:00
static int count_free ( struct page * page )
{
return page - > objects - page - > inuse ;
}
static unsigned long count_partial ( struct kmem_cache_node * n ,
int ( * get_count ) ( struct page * ) )
{
unsigned long flags ;
unsigned long x = 0 ;
struct page * page ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
list_for_each_entry ( page , & n - > partial , lru )
x + = get_count ( page ) ;
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
return x ;
}
2009-06-11 14:08:48 +04:00
static inline unsigned long node_nr_objs ( struct kmem_cache_node * n )
{
# ifdef CONFIG_SLUB_DEBUG
return atomic_long_read ( & n - > total_objects ) ;
# else
return 0 ;
# endif
}
2009-06-10 19:50:32 +04:00
static noinline void
slab_out_of_memory ( struct kmem_cache * s , gfp_t gfpflags , int nid )
{
int node ;
printk ( KERN_WARNING
" SLUB: Unable to allocate memory on node %d (gfp=0x%x) \n " ,
nid , gfpflags ) ;
printk ( KERN_WARNING " cache: %s, object size: %d, buffer size: %d, "
2012-06-13 19:24:57 +04:00
" default order: %d, min order: %d \n " , s - > name , s - > object_size ,
2009-06-10 19:50:32 +04:00
s - > size , oo_order ( s - > oo ) , oo_order ( s - > min ) ) ;
2012-06-13 19:24:57 +04:00
if ( oo_order ( s - > min ) > get_order ( s - > object_size ) )
2009-07-07 11:14:14 +04:00
printk ( KERN_WARNING " %s debugging increased min order, use "
" slub_debug=O to disable. \n " , s - > name ) ;
2009-06-10 19:50:32 +04:00
for_each_online_node ( node ) {
struct kmem_cache_node * n = get_node ( s , node ) ;
unsigned long nr_slabs ;
unsigned long nr_objs ;
unsigned long nr_free ;
if ( ! n )
continue ;
2009-06-11 14:08:48 +04:00
nr_free = count_partial ( n , count_free ) ;
nr_slabs = node_nr_slabs ( n ) ;
nr_objs = node_nr_objs ( n ) ;
2009-06-10 19:50:32 +04:00
printk ( KERN_WARNING
" node %d: slabs: %ld, objs: %ld, free: %ld \n " ,
node , nr_slabs , nr_objs , nr_free ) ;
}
}
2011-08-10 01:12:26 +04:00
static inline void * new_slab_objects ( struct kmem_cache * s , gfp_t flags ,
int node , struct kmem_cache_cpu * * pc )
{
2012-05-09 19:09:51 +04:00
void * freelist ;
2012-05-09 19:09:55 +04:00
struct kmem_cache_cpu * c = * pc ;
struct page * page ;
2011-08-10 01:12:26 +04:00
2012-05-09 19:09:55 +04:00
freelist = get_partial ( s , flags , node , c ) ;
2011-08-10 01:12:26 +04:00
2012-05-09 19:09:55 +04:00
if ( freelist )
return freelist ;
page = new_slab ( s , flags , node ) ;
2011-08-10 01:12:26 +04:00
if ( page ) {
c = __this_cpu_ptr ( s - > cpu_slab ) ;
if ( c - > page )
flush_slab ( s , c ) ;
/*
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
*/
2012-05-09 19:09:51 +04:00
freelist = page - > freelist ;
2011-08-10 01:12:26 +04:00
page - > freelist = NULL ;
stat ( s , ALLOC_SLAB ) ;
c - > page = page ;
* pc = c ;
} else
2012-05-09 19:09:51 +04:00
freelist = NULL ;
2011-08-10 01:12:26 +04:00
2012-05-09 19:09:51 +04:00
return freelist ;
2011-08-10 01:12:26 +04:00
}
mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. Swap over the network is considered as an option in diskless
systems. The two likely scenarios are when blade servers are used as part
of a cluster where the form factor or maintenance costs do not allow the
use of disks and thin clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap according to the manual at
https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
There is also documentation and tutorials on how to setup swap over NBD at
places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
nbd-client also documents the use of NBD as swap. Despite this, the fact
is that a machine using NBD for swap can deadlock within minutes if swap
is used intensively. This patch series addresses the problem.
The core issue is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need. Some years ago, Peter Zijlstra developed a series of
patches that supported swap over an NFS that at least one distribution is
carrying within their kernels. This patch series borrows very heavily
from Peter's work to support swapping over NBD as a pre-requisite to
supporting swap-over-NFS. The bulk of the complexity is concerned with
preserving memory that is allocated from the PFMEMALLOC reserves for use
by the network layer which is needed for both NBD and NFS.
Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
preserve access to pages allocated under low memory situations
to callers that are freeing memory.
Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks
Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
reserves without setting PFMEMALLOC.
Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
for later use by network packet processing.
Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required
Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.
Patches 7-12 allows network processing to use PFMEMALLOC reserves when
the socket has been marked as being used by the VM to clean pages. If
packets are received and stored in pages that were allocated under
low-memory situations and are unrelated to the VM, the packets
are dropped.
Patch 11 reintroduces __skb_alloc_page which the networking
folk may object to but is needed in some cases to propagate
pfmemalloc from a newly allocated page to an skb. If there is a
strong objection, this patch can be dropped with the impact being
that swap-over-network will be slower in some cases but it should
not fail.
Patch 13 is a micro-optimisation to avoid a function call in the
common case.
Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
PFMEMALLOC if necessary.
Patch 15 notes that it is still possible for the PFMEMALLOC reserve
to be depleted. To prevent this, direct reclaimers get throttled on
a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
expected that kswapd and the direct reclaimers already running
will clean enough pages for the low watermark to be reached and
the throttled processes are woken up.
Patch 16 adds a statistic to track how often processes get throttled
Some basic performance testing was run using kernel builds, netperf on
loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
sysbench. Each of them were expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant performance
variances.
For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 8*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure.
Without the patches and using SLUB, the machine locks up within minutes
and runs to completion with them applied. With SLAB, the story is
different as an unpatched kernel runs to completion. However, the patched
kernel completed the test 45% faster.
MICRO
3.5.0-rc2 3.5.0-rc2
vanilla swapnbd
Unrecognised test vmscan-anon-mmap-write
MMTests Statistics: duration
Sys Time Running Test (seconds) 197.80 173.07
User+Sys Time Running Test (seconds) 206.96 182.03
Total Elapsed Time (seconds) 3240.70 1762.09
This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages
Allocations of pages below the min watermark run a risk of the machine
hanging due to a lack of memory. To prevent this, only callers who have
PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
a slab though, nothing prevents other callers consuming free objects
within those slabs. This patch limits access to slab pages that were
alloced from the PFMEMALLOC reserves.
When this patch is applied, pages allocated from below the low watermark
are returned with page->pfmemalloc set and it is up to the caller to
determine how the page should be protected. SLAB restricts access to any
page with page->pfmemalloc set to callers which are known to be able to
access the PFMEMALLOC reserve. If one is not available, an attempt is
made to allocate a new page rather than use a reserve. SLUB is a bit more
relaxed in that it only records if the current per-CPU page was allocated
from PFMEMALLOC reserve and uses another partial slab if the caller does
not have the necessary GFP or process flags. This was found to be
sufficient in tests to avoid hangs due to SLUB generally maintaining
smaller lists than SLAB.
In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
a slab allocation even though free objects are available because they are
being preserved for callers that are freeing pages.
[a.p.zijlstra@chello.nl: Original implementation]
[sebastian@breakpoint.cc: Correct order of page flag clearing]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:43:58 +04:00
static inline bool pfmemalloc_match ( struct page * page , gfp_t gfpflags )
{
if ( unlikely ( PageSlabPfmemalloc ( page ) ) )
return gfp_pfmemalloc_allowed ( gfpflags ) ;
return true ;
}
2011-11-12 00:07:14 +04:00
/*
2013-07-15 05:05:29 +04:00
* Check the page - > freelist of a page and either transfer the freelist to the
* per cpu freelist or deactivate the page .
2011-11-12 00:07:14 +04:00
*
* The page is still frozen if the return value is not NULL .
*
* If this function returns NULL then the page has been unfrozen .
2012-05-18 17:01:17 +04:00
*
* This function must be called with interrupts disabled .
2011-11-12 00:07:14 +04:00
*/
static inline void * get_freelist ( struct kmem_cache * s , struct page * page )
{
struct page new ;
unsigned long counters ;
void * freelist ;
do {
freelist = page - > freelist ;
counters = page - > counters ;
2012-05-09 19:09:51 +04:00
2011-11-12 00:07:14 +04:00
new . counters = counters ;
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! new . frozen ) ;
2011-11-12 00:07:14 +04:00
new . inuse = page - > objects ;
new . frozen = freelist ! = NULL ;
2012-05-18 17:01:17 +04:00
} while ( ! __cmpxchg_double_slab ( s , page ,
2011-11-12 00:07:14 +04:00
freelist , counters ,
NULL , new . counters ,
" get_freelist " ) ) ;
return freelist ;
}
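The retry loop in get_freelist() detaches the whole free list in a single atomic step. The following is a rough user-space sketch of that idea using C11 atomics. It is an assumption-laden simplification: the kernel swaps the freelist pointer and the counters word together with a double-word cmpxchg, while this sketch swaps only a pointer, and `struct gf_obj`, `struct gf_slab` and `gf_take_freelist()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical free object: the link lives inside the object itself. */
struct gf_obj {
	struct gf_obj *next;
};

struct gf_slab {
	_Atomic(struct gf_obj *) freelist;
};

/* Detach the entire free list at once; returns NULL if it was empty. */
static struct gf_obj *gf_take_freelist(struct gf_slab *slab)
{
	struct gf_obj *empty = NULL;
	struct gf_obj *head;

	assert(slab != NULL);
	head = atomic_load(&slab->freelist);

	/* On failure, head is refreshed with the current value and we retry. */
	while (!atomic_compare_exchange_weak(&slab->freelist, &head, empty))
		;
	return head;
}
```

A NULL return corresponds to the case in which the kernel function leaves the page unfrozen, because there was nothing to take over.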
2007-05-07 01:49:36 +04:00
/*
2007-05-10 14:15:16 +04:00
* Slow path . The lockless freelist is empty or we need to perform
* debugging duties .
*
* Processing is still very fast if new objects have been freed to the
* regular freelist . In that case we simply take over the regular freelist
* as the lockless freelist and zap the regular freelist .
2007-05-07 01:49:36 +04:00
*
2007-05-10 14:15:16 +04:00
* If that is not working then we fall back to the partial lists . We take the
* first element of the freelist as the object to allocate now and move the
* rest of the freelist to the lockless freelist .
2007-05-07 01:49:36 +04:00
*
2007-05-10 14:15:16 +04:00
* And if we were unable to get a new slab from the partial slab lists then
2008-02-16 10:45:26 +03:00
* we need to allocate a new slab . This is the slowest path since it involves
* a call to the page allocator and the setup of a new slab .
2007-05-07 01:49:36 +04:00
*/
2008-08-19 21:43:25 +04:00
static void * __slab_alloc ( struct kmem_cache * s , gfp_t gfpflags , int node ,
unsigned long addr , struct kmem_cache_cpu * c )
2007-05-07 01:49:36 +04:00
{
2012-05-09 19:09:51 +04:00
void * freelist ;
2012-05-09 19:09:58 +04:00
struct page * page ;
2011-02-25 20:38:54 +03:00
unsigned long flags ;
local_irq_save ( flags ) ;
# ifdef CONFIG_PREEMPT
/*
* We may have been preempted and rescheduled on a different
* cpu before disabling interrupts . Need to reload cpu area
* pointer .
*/
c = this_cpu_ptr ( s - > cpu_slab ) ;
# endif
2007-05-07 01:49:36 +04:00
2012-05-09 19:09:58 +04:00
page = c - > page ;
if ( ! page )
2007-05-07 01:49:36 +04:00
goto new_slab ;
2011-08-10 01:12:27 +04:00
redo :
2012-05-09 19:09:51 +04:00
2012-05-09 19:09:59 +04:00
if ( unlikely ( ! node_match ( page , node ) ) ) {
2011-06-01 21:25:57 +04:00
stat ( s , ALLOC_NODE_MISMATCH ) ;
2012-05-09 19:09:58 +04:00
deactivate_slab ( s , page , c - > freelist ) ;
2012-05-09 19:09:57 +04:00
c - > page = NULL ;
c - > freelist = NULL ;
2011-06-01 21:25:56 +04:00
goto new_slab ;
}
2008-02-16 10:45:26 +03:00
2012-08-01 03:43:58 +04:00
/*
* By rights , we should be searching for a slab page that was
* PFMEMALLOC but right now , we are losing the pfmemalloc
* information when the page leaves the per - cpu allocator
*/
if ( unlikely ( ! pfmemalloc_match ( page , gfpflags ) ) ) {
deactivate_slab ( s , page , c - > freelist ) ;
c - > page = NULL ;
c - > freelist = NULL ;
goto new_slab ;
}
2011-12-13 07:57:06 +04:00
/* must check again c->freelist in case of cpu migration or IRQ */
2012-05-09 19:09:51 +04:00
freelist = c - > freelist ;
if ( freelist )
2011-12-13 07:57:06 +04:00
goto load_freelist ;
2011-06-01 21:25:58 +04:00
2011-06-01 21:25:52 +04:00
stat ( s , ALLOC_SLOWPATH ) ;
2011-06-01 21:25:58 +04:00
2012-05-09 19:09:58 +04:00
freelist = get_freelist ( s , page ) ;
2008-02-16 10:45:26 +03:00
2012-05-09 19:09:51 +04:00
if ( ! freelist ) {
2011-06-01 21:25:58 +04:00
c - > page = NULL ;
stat ( s , DEACTIVATE_BYPASS ) ;
2011-06-01 21:25:56 +04:00
goto new_slab ;
2011-06-01 21:25:58 +04:00
}
2008-02-16 10:45:26 +03:00
2009-12-19 01:26:23 +03:00
stat ( s , ALLOC_REFILL ) ;
2008-02-16 10:45:26 +03:00
2007-05-10 14:15:16 +04:00
load_freelist :
2012-05-09 19:09:52 +04:00
/*
* freelist is pointing to the list of objects to be used .
* page is pointing to the page from which the objects are obtained .
* That page must be frozen for per cpu allocations to work .
*/
2014-01-30 02:05:50 +04:00
VM_BUG_ON ( ! c - > page - > frozen ) ;
2012-05-09 19:09:51 +04:00
c - > freelist = get_freepointer ( s , freelist ) ;
2011-02-25 20:38:54 +03:00
c - > tid = next_tid ( c - > tid ) ;
local_irq_restore ( flags ) ;
2012-05-09 19:09:51 +04:00
return freelist ;
2007-05-07 01:49:36 +04:00
new_slab :
2011-06-01 21:25:52 +04:00
2011-08-10 01:12:27 +04:00
if ( c - > partial ) {
2012-05-09 19:09:58 +04:00
page = c - > page = c - > partial ;
c - > partial = page - > next ;
2011-08-10 01:12:27 +04:00
stat ( s , CPU_PARTIAL_ALLOC ) ;
c - > freelist = NULL ;
goto redo ;
2007-05-07 01:49:36 +04:00
}
2012-05-09 19:09:55 +04:00
freelist = new_slab_objects ( s , gfpflags , node , & c ) ;
2011-04-15 23:48:14 +04:00
2012-05-09 19:09:54 +04:00
if ( unlikely ( ! freelist ) ) {
if ( ! ( gfpflags & __GFP_NOWARN ) & & printk_ratelimit ( ) )
slab_out_of_memory ( s , gfpflags , node ) ;
2011-06-01 21:25:52 +04:00
2012-05-09 19:09:54 +04:00
local_irq_restore ( flags ) ;
return NULL ;
2007-05-07 01:49:36 +04:00
}
2011-06-01 21:25:52 +04:00
2012-05-09 19:09:58 +04:00
page = c - > page ;
2012-08-01 03:44:00 +04:00
if ( likely ( ! kmem_cache_debug ( s ) & & pfmemalloc_match ( page , gfpflags ) ) )
2007-05-17 09:10:53 +04:00
goto load_freelist ;
2011-06-01 21:25:52 +04:00
2011-08-10 01:12:26 +04:00
/* Only entered in the debug case */
2013-07-15 05:05:29 +04:00
if ( kmem_cache_debug ( s ) & &
! alloc_debug_processing ( s , page , freelist , addr ) )
2011-08-10 01:12:26 +04:00
goto new_slab ; /* Slab failed checks. Next slab needed */
2007-05-10 14:15:16 +04:00
2012-05-09 19:09:58 +04:00
deactivate_slab ( s , page , get_freepointer ( s , freelist ) ) ;
2012-05-09 19:09:57 +04:00
c - > page = NULL ;
c - > freelist = NULL ;
2011-05-25 18:47:43 +04:00
local_irq_restore ( flags ) ;
2012-05-09 19:09:51 +04:00
return freelist ;
2007-05-10 14:15:16 +04:00
}
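The order in which __slab_alloc() looks for objects can be summarised as a small decision function. This is only a sketch of the fallback order described in the comment above the function. The names `enum sp_source`, `struct sp_cpu_state` and `sp_pick_source()` are hypothetical, and the real code re-checks a slab taken from the per cpu partial list by jumping back to `redo`, which is collapsed into a single step here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sp_source {
	SP_FROM_CPU_FREELIST,	/* c->freelist was refilled in the meantime */
	SP_FROM_PAGE_FREELIST,	/* objects were freed back to the current slab */
	SP_FROM_CPU_PARTIAL,	/* a slab parked on the per cpu partial list */
	SP_FROM_NEW_SLAB,	/* node partial lists or the page allocator */
};

/* Hypothetical snapshot of what the slow path inspects. */
struct sp_cpu_state {
	bool has_page;
	bool page_usable;	/* node and pfmemalloc checks both pass */
	bool cpu_freelist;
	bool page_freelist;
	bool cpu_partial;
};

static enum sp_source sp_pick_source(const struct sp_cpu_state *c)
{
	assert(c != NULL);
	if (c->has_page && c->page_usable) {
		if (c->cpu_freelist)
			return SP_FROM_CPU_FREELIST;
		if (c->page_freelist)
			return SP_FROM_PAGE_FREELIST;
	}
	if (c->cpu_partial)
		return SP_FROM_CPU_PARTIAL;
	return SP_FROM_NEW_SLAB;
}
```

Each later branch is more expensive than the one before it, which is why the current slab is exhausted before any list or the page allocator is touched.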
/*
* Inlined fastpath so that allocation functions ( kmalloc , kmem_cache_alloc )
* have the fastpath folded into their functions . So no function call
* overhead for requests that can be satisfied on the fastpath .
*
* The fastpath works by first checking if the lockless freelist can be used .
* If not then __slab_alloc is called for slow processing .
*
* Otherwise we can simply pick the next object from the lockless free list .
*/
2012-09-09 00:47:58 +04:00
static __always_inline void * slab_alloc_node ( struct kmem_cache * s ,
2008-08-19 21:43:25 +04:00
gfp_t gfpflags , int node , unsigned long addr )
2007-05-10 14:15:16 +04:00
{
void * * object ;
2007-10-16 12:26:05 +04:00
struct kmem_cache_cpu * c ;
2012-05-09 19:09:59 +04:00
struct page * page ;
2011-02-25 20:38:54 +03:00
unsigned long tid ;
2008-01-08 10:20:30 +03:00
2010-08-20 21:37:16 +04:00
if ( slab_pre_alloc_hook ( s , gfpflags ) )
2008-12-23 13:37:01 +03:00
return NULL ;
2008-01-08 10:20:30 +03:00
2012-12-19 02:22:48 +04:00
s = memcg_kmem_get_cache ( s , gfpflags ) ;
2011-02-25 20:38:54 +03:00
redo :
/*
* Must read kmem_cache cpu data via this cpu ptr . Preemption is
* enabled . We may switch back and forth between cpus while
* reading from one cpu area . That does not matter as long
* as we end up on the original cpu again when doing the cmpxchg .
2013-01-24 01:45:48 +04:00
*
* Preemption is disabled for the retrieval of the tid because that
* must occur from the current processor . We cannot allow rescheduling
* on a different processor between the determination of the pointer
* and the retrieval of the tid .
2011-02-25 20:38:54 +03:00
*/
2013-01-24 01:45:48 +04:00
preempt_disable ( ) ;
2009-12-19 01:26:20 +03:00
c = __this_cpu_ptr ( s - > cpu_slab ) ;
2011-02-25 20:38:54 +03:00
/*
* The transaction ids are globally unique per cpu and per operation on
* a per cpu queue . Thus they can guarantee that the cmpxchg_double
* occurs on the right processor and that there was no operation on the
* linked list in between .
*/
tid = c - > tid ;
2013-01-24 01:45:48 +04:00
preempt_enable ( ) ;
2011-02-25 20:38:54 +03:00
2009-12-19 01:26:20 +03:00
object = c - > freelist ;
2012-05-09 19:09:59 +04:00
page = c - > page ;
2013-07-18 11:39:51 +04:00
if ( unlikely ( ! object | | ! node_match ( page , node ) ) )
2007-10-16 12:26:05 +04:00
object = __slab_alloc ( s , gfpflags , node , addr , c ) ;
2007-05-10 14:15:16 +04:00
else {
slub: prefetch next freelist pointer in slab_alloc()
Recycling a page is a problem, since the freelist link chain is hot on
the cpu(s) which freed objects, and possibly very cold on the cpu currently
owning the slab.
Adding a prefetch of the cache line containing the pointer to the next object in
slab_alloc() helps a lot in many workloads, in particular on asymmetric
ones (allocations done on one cpu, frees on other cpus). The added cost is
three machine instructions only.
Examples on my dual socket quad core ht machine (Intel CPU E5540
@2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
Before patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% )
28 866 491 context-switches # 0,088 M/sec ( +- 1,80% )
1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% )
127 151 page-faults # 0,000 M/sec ( +- 0,16% )
829 399 813 448 cycles # 2,532 GHz ( +- 0,64% )
580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% )
197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% )
503 548 648 975 instructions # 0,61 insns per cycle
# 1,15 stalled cycles per insn ( +- 0,46% )
95 780 068 471 branches # 292,389 M/sec ( +- 0,48% )
1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% )
20,705679994 seconds time elapsed ( +- 0,64% )
After patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% )
19 703 372 context-switches # 0,069 M/sec ( +- 4,99% )
1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% )
126 776 page-faults # 0,000 M/sec ( +- 0,12% )
724 636 593 213 cycles # 2,532 GHz ( +- 1,32% )
499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% )
156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% )
463 897 792 661 instructions # 0,64 insns per cycle
# 1,08 stalled cycles per insn ( +- 0,94% )
87 717 352 563 branches # 306,451 M/sec ( +- 0,99% )
941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% )
18,132070670 seconds time elapsed ( +- 1,30% )
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
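The idea in this commit message can be illustrated outside the kernel as follows. This is a hedged sketch, assuming a compiler that provides the GCC/Clang builtin `__builtin_prefetch`. `struct pf_obj` and `pf_pop()` are hypothetical names, and the free pointer is assumed to sit at offset 0 of each object, whereas the kernel adds a per-cache offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free-list node; the link sits at offset 0 of each object. */
struct pf_obj {
	struct pf_obj *next;
};

/* Pop the head and start pulling the new head's cache line in early. */
static struct pf_obj *pf_pop(struct pf_obj **head)
{
	struct pf_obj *object;

	assert(head != NULL);
	object = *head;
	if (!object)
		return NULL;

	*head = object->next;
	if (*head)
		__builtin_prefetch(*head, 0, 3);	/* read access, keep cached */
	return object;
}
```

The prefetch targets the line that the next allocation will have to read, so the miss overlaps with the work the caller does on the object just returned.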
2011-12-16 19:25:34 +04:00
void * next_object = get_freepointer_safe ( s , object ) ;
2011-02-25 20:38:54 +03:00
/*
2011-03-31 05:57:33 +04:00
* The cmpxchg will only match if there was no additional
2011-02-25 20:38:54 +03:00
* operation and if we are on the right processor .
*
2013-07-15 05:05:29 +04:00
* The cmpxchg does the following atomically ( without lock
* semantics ! )
2011-02-25 20:38:54 +03:00
* 1. Relocate first pointer to the current per cpu area .
* 2. Verify that tid and freelist have not been changed
* 3. If they were not changed replace tid and freelist
*
2013-07-15 05:05:29 +04:00
* Since this is without lock semantics the protection is only
* against code executing on this cpu * not * from access by
* other cpus .
2011-02-25 20:38:54 +03:00
*/
2011-12-22 21:58:51 +04:00
if ( unlikely ( ! this_cpu_cmpxchg_double (
2011-02-25 20:38:54 +03:00
s - > cpu_slab - > freelist , s - > cpu_slab - > tid ,
object , tid ,
2011-12-16 19:25:34 +04:00
next_object , next_tid ( tid ) ) ) ) {
2011-02-25 20:38:54 +03:00
note_cmpxchg_failure ( " slab_alloc " , s , tid ) ;
goto redo ;
}
2011-12-16 19:25:34 +04:00
prefetch_freepointer ( s , next_object ) ;
2009-12-19 01:26:23 +03:00
stat ( s , ALLOC_FASTPATH ) ;
2007-05-10 14:15:16 +04:00
}
2011-02-25 20:38:54 +03:00
2009-11-25 21:14:48 +03:00
if ( unlikely ( gfpflags & __GFP_ZERO ) & & object )
2012-06-13 19:24:57 +04:00
memset ( object , 0 , s - > object_size ) ;
2007-07-17 15:03:23 +04:00
2010-08-20 21:37:16 +04:00
slab_post_alloc_hook ( s , gfpflags , object ) ;
2008-04-04 02:54:48 +04:00
2007-05-10 14:15:16 +04:00
return object ;
2007-05-07 01:49:36 +04:00
}
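The role of the transaction id in the fastpath can be shown with a single-threaded model. This is a sketch under stated assumptions: `fp_cmpxchg_double()` is a plain C stand-in that is NOT atomic and only shows the comparison that `this_cpu_cmpxchg_double()` performs in one step, the tid is advanced by 1 rather than by the kernel's `next_tid()` step, and all `fp_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fp_obj {
	struct fp_obj *next;
};

/* Hypothetical model of the per cpu state the fastpath works on. */
struct fp_cpu {
	struct fp_obj *freelist;
	unsigned long tid;
};

/* Not atomic: swaps both fields only if both still hold the old values. */
static bool fp_cmpxchg_double(struct fp_cpu *c,
			      struct fp_obj *old_free, unsigned long old_tid,
			      struct fp_obj *new_free, unsigned long new_tid)
{
	if (c->freelist != old_free || c->tid != old_tid)
		return false;
	c->freelist = new_free;
	c->tid = new_tid;
	return true;
}

/* Pop one object; retries like the "goto redo" above if the swap fails. */
static struct fp_obj *fp_fast_alloc(struct fp_cpu *c)
{
	assert(c != NULL);
	for (;;) {
		unsigned long tid = c->tid;
		struct fp_obj *object = c->freelist;

		if (!object)
			return NULL;	/* the kernel would enter the slow path */
		if (fp_cmpxchg_double(c, object, tid, object->next, tid + 1))
			return object;
	}
}
```

Comparing the tid as well as the pointer is what rejects a swap when the list head happens to have the same address again after intervening operations.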
2012-09-09 00:47:58 +04:00
static __always_inline void * slab_alloc ( struct kmem_cache * s ,
gfp_t gfpflags , unsigned long addr )
{
return slab_alloc_node ( s , gfpflags , NUMA_NO_NODE , addr ) ;
}
2007-05-07 01:49:36 +04:00
void * kmem_cache_alloc ( struct kmem_cache * s , gfp_t gfpflags )
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc ( s , gfpflags , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2013-07-15 05:05:29 +04:00
trace_kmem_cache_alloc ( _RET_IP_ , ret , s - > object_size ,
s - > size , gfpflags ) ;
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kmem_cache_alloc ) ;
2009-12-11 10:45:30 +03:00
# ifdef CONFIG_TRACING
2010-10-21 13:29:19 +04:00
void * kmem_cache_alloc_trace ( struct kmem_cache * s , gfp_t gfpflags , size_t size )
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc ( s , gfpflags , _RET_IP_ ) ;
2010-10-21 13:29:19 +04:00
trace_kmalloc ( _RET_IP_ , ret , size , s - > size , gfpflags ) ;
return ret ;
}
EXPORT_SYMBOL ( kmem_cache_alloc_trace ) ;
2008-08-19 21:43:26 +04:00
# endif
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_NUMA
void * kmem_cache_alloc_node ( struct kmem_cache * s , gfp_t gfpflags , int node )
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc_node ( s , gfpflags , node , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2009-03-23 16:12:24 +03:00
trace_kmem_cache_alloc_node ( _RET_IP_ , ret ,
2012-06-13 19:24:57 +04:00
s - > object_size , s - > size , gfpflags , node ) ;
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kmem_cache_alloc_node ) ;
2009-12-11 10:45:30 +03:00
# ifdef CONFIG_TRACING
2010-10-21 13:29:19 +04:00
void * kmem_cache_alloc_node_trace ( struct kmem_cache * s ,
2008-08-19 21:43:26 +04:00
gfp_t gfpflags ,
2010-10-21 13:29:19 +04:00
int node , size_t size )
2008-08-19 21:43:26 +04:00
{
2012-09-09 00:47:58 +04:00
void * ret = slab_alloc_node ( s , gfpflags , node , _RET_IP_ ) ;
2010-10-21 13:29:19 +04:00
trace_kmalloc_node ( _RET_IP_ , ret ,
size , s - > size , gfpflags , node ) ;
return ret ;
2008-08-19 21:43:26 +04:00
}
2010-10-21 13:29:19 +04:00
EXPORT_SYMBOL ( kmem_cache_alloc_node_trace ) ;
2008-08-19 21:43:26 +04:00
# endif
2010-09-29 16:02:15 +04:00
# endif
2008-08-19 21:43:26 +04:00
2007-05-07 01:49:36 +04:00
/*
2007-05-10 14:15:16 +04:00
* Slow path handling . This may still be called frequently since objects
* have a longer lifetime than the cpu slabs in most processing loads .
2007-05-07 01:49:36 +04:00
*
2007-05-10 14:15:16 +04:00
* So we still attempt to reduce cache line usage . Just take the slab
* lock and free the item . If there is no additional partial page
* handling required then we can return immediately .
2007-05-07 01:49:36 +04:00
*/
2007-05-10 14:15:16 +04:00
static void __slab_free ( struct kmem_cache * s , struct page * page ,
2009-12-19 01:26:22 +03:00
void * x , unsigned long addr )
2007-05-07 01:49:36 +04:00
{
void * prior ;
void * * object = ( void * ) x ;
2011-06-01 21:25:52 +04:00
int was_frozen ;
struct page new ;
unsigned long counters ;
struct kmem_cache_node * n = NULL ;
2011-06-01 21:25:51 +04:00
unsigned long uninitialized_var ( flags ) ;
2007-05-07 01:49:36 +04:00
2011-02-25 20:38:54 +03:00
stat ( s , FREE_SLOWPATH ) ;
2007-05-07 01:49:36 +04:00
2012-05-30 21:54:46 +04:00
if ( kmem_cache_debug ( s ) & &
! ( n = free_debug_processing ( s , page , x , addr , & flags ) ) )
2011-06-01 21:25:55 +04:00
return ;
2008-02-16 10:45:26 +03:00
2011-06-01 21:25:52 +04:00
do {
slub: remove one code path and reduce lock contention in __slab_free()
When we try to free an object, there are some cases in which we need
to take a node lock. This is a necessary step for preventing a race.
After taking the lock, we try cmpxchg_double_slab().
But there is a possible scenario in which cmpxchg_double_slab() fails
while the lock is held. The following example explains it.
CPU A                   CPU B
need lock
...                     need lock
...                     lock!!
lock..but spin          free success
spin...                 unlock
lock!!
free fail
In this case, CPU A retries while holding the lock.
I think that in this case, for CPU A,
"release a lock first, and re-take a lock if necessary" is the preferable way.
There are two reasons for this.
First, this makes __slab_free()'s logic somewhat simpler.
With this patch, 'was_frozen = 1' is "always" handled without taking a lock.
So we can remove one code path.
Second, it may reduce lock contention.
When we retry, the status of the slab has already changed,
so we don't need a lock anymore in almost every case.
The "release a lock first, and re-take a lock if necessary" policy
helps with this.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
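The "release a lock first, and re-take a lock if necessary" policy can be sketched in user space with C11 atomics. This is an illustrative model, not the kernel code: `struct rl_state`, `rl_lock()`, `rl_unlock()` and `rl_dec()` are hypothetical names, an `atomic_flag` spin loop stands in for `n->list_lock`, and a single counter stands in for the slab's freelist and counters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct rl_state {
	atomic_int value;	/* stands in for the slab's freelist/counters */
	atomic_flag lock;	/* stands in for n->list_lock */
	int lock_acquisitions;	/* bookkeeping for the example only */
};

static void rl_lock(struct rl_state *s)
{
	while (atomic_flag_test_and_set(&s->lock))
		;
	s->lock_acquisitions++;
}

static void rl_unlock(struct rl_state *s)
{
	atomic_flag_clear(&s->lock);
}

/*
 * Decrement value. The lock is only wanted when the result reaches zero
 * (think: the slab becomes empty and must leave a list). Returns true if
 * the lock is held on return; the caller then has to release it.
 */
static bool rl_dec(struct rl_state *s)
{
	bool locked = false;
	int old;

	assert(s != NULL);
	do {
		if (locked) {		/* previous attempt failed: drop first */
			rl_unlock(s);
			locked = false;
		}
		old = atomic_load(&s->value);
		if (old - 1 == 0) {	/* decide again from the fresh state */
			rl_lock(s);
			locked = true;
		}
	} while (!atomic_compare_exchange_strong(&s->value, &old, old - 1));

	return locked;
}
```

Because the decision is made again from fresh state on every iteration, a retry that no longer needs the lock never holds it.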
2012-08-15 19:02:40 +04:00
if ( unlikely ( n ) ) {
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
n = NULL ;
}
2011-06-01 21:25:52 +04:00
prior = page - > freelist ;
counters = page - > counters ;
set_freepointer ( s , object , prior ) ;
new . counters = counters ;
was_frozen = new . frozen ;
new . inuse - - ;
2012-08-15 19:02:40 +04:00
if ( ( ! new . inuse | | ! prior ) & & ! was_frozen ) {
2011-08-10 01:12:27 +04:00
2014-01-10 16:23:49 +04:00
if ( kmem_cache_has_cpu_partial ( s ) & & ! prior ) {
2011-08-10 01:12:27 +04:00
/*
2013-07-15 05:05:29 +04:00
* Slab was on no list before and will be
* partially empty
* We can defer the list move and instead
* freeze it .
2011-08-10 01:12:27 +04:00
*/
new . frozen = 1 ;
2014-01-10 16:23:49 +04:00
} else { /* Needs to be taken off a list */
2011-08-10 01:12:27 +04:00
n = get_node ( s , page_to_nid ( page ) ) ;
/*
* Speculatively acquire the list_lock .
* If the cmpxchg does not succeed then we may
* drop the list_lock without any processing .
*
* Otherwise the list_lock will synchronize with
* other processors updating the list of slabs .
*/
spin_lock_irqsave ( & n - > list_lock , flags ) ;
}
2011-06-01 21:25:52 +04:00
}
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:52 +04:00
} while ( ! cmpxchg_double_slab ( s , page ,
prior , counters ,
object , new . counters ,
" __slab_free " ) ) ;
2007-05-07 01:49:36 +04:00
2011-06-01 21:25:52 +04:00
if ( likely ( ! n ) ) {
2011-08-10 01:12:27 +04:00
/*
* If we just froze the page then put it onto the
* per cpu partial list .
*/
2012-02-03 19:34:56 +04:00
if ( new . frozen & & ! was_frozen ) {
2011-08-10 01:12:27 +04:00
put_cpu_partial ( s , page , 1 ) ;
2012-02-03 19:34:56 +04:00
stat ( s , CPU_PARTIAL_FREE ) ;
}
2011-08-10 01:12:27 +04:00
/*
2011-06-01 21:25:52 +04:00
* The list lock was not taken therefore no list
* activity can be necessary .
*/
if ( was_frozen )
stat ( s , FREE_FROZEN ) ;
2011-06-01 21:25:55 +04:00
return ;
2011-06-01 21:25:52 +04:00
}
2007-05-07 01:49:36 +04:00
slub: remove one code path and reduce lock contention in __slab_free()
When we try to free an object, there are some cases in which we need
to take a node lock. This is a necessary step for preventing a race.
After taking the lock, we try cmpxchg_double_slab().
But there is a possible scenario in which cmpxchg_double_slab() fails
while the lock is held. The following example explains it.
CPU A                 CPU B
need lock
...                   need lock
...                   lock!!
lock..but spin        free success
spin...               unlock
lock!!
free fail
In this case, CPU A retries while still holding the lock.
I think that in this case, for CPU A,
"release a lock first, and re-take a lock if necessary" is the preferable way.
There are two reasons for this.
First, this makes __slab_free()'s logic somewhat simpler.
With this patch, 'was_frozen = 1' is "always" handled without taking a lock.
So we can remove one code path.
Second, it may reduce lock contention.
When we retry, the status of the slab has already changed,
so we don't need a lock anymore in almost every case.
The "release a lock first, and re-take a lock if necessary" policy
helps with this.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-08-15 19:02:40 +04:00
if ( unlikely ( ! new . inuse & & n - > nr_partial > s - > min_partial ) )
goto slab_empty ;
2007-05-07 01:49:36 +04:00
/*
2012-08-15 19:02:40 +04:00
* Objects left in the slab . If it was not on the partial list before
* then add it .
2007-05-07 01:49:36 +04:00
*/
2013-06-19 09:05:52 +04:00
if ( ! kmem_cache_has_cpu_partial ( s ) & & unlikely ( ! prior ) ) {
if ( kmem_cache_debug ( s ) )
2014-01-10 16:23:49 +04:00
remove_full ( s , n , page ) ;
2012-08-15 19:02:40 +04:00
add_partial ( n , page , DEACTIVATE_TO_TAIL ) ;
stat ( s , FREE_ADD_PARTIAL ) ;
2008-02-08 04:47:41 +03:00
}
2011-06-01 21:25:55 +04:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2007-05-07 01:49:36 +04:00
return ;
slab_empty :
2008-03-02 00:40:44 +03:00
if ( prior ) {
2007-05-07 01:49:36 +04:00
/*
2011-08-08 20:16:56 +04:00
* Slab on the partial list .
2007-05-07 01:49:36 +04:00
*/
2011-06-01 21:25:50 +04:00
remove_partial ( n , page ) ;
2009-12-19 01:26:23 +03:00
stat ( s , FREE_REMOVE_PARTIAL ) ;
2014-01-10 16:23:49 +04:00
} else {
2011-08-08 20:16:56 +04:00
/* Slab must be on the full list */
2014-01-10 16:23:49 +04:00
remove_full ( s , n , page ) ;
}
2011-06-01 21:25:52 +04:00
2011-06-01 21:25:55 +04:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2009-12-19 01:26:23 +03:00
stat ( s , FREE_SLAB ) ;
2007-05-07 01:49:36 +04:00
discard_slab ( s , page ) ;
}
2007-05-10 14:15:16 +04:00
/*
* Fastpath with forced inlining to produce a kfree and kmem_cache_free that
* can perform fastpath freeing without additional function calls .
*
* The fastpath is only possible if we are freeing to the current cpu slab
* of this processor . This is typically the case if we have just allocated
* the item before .
*
* If fastpath is not possible then fall back to __slab_free where we deal
* with all sorts of special processing .
*/
2008-01-08 10:20:27 +03:00
static __always_inline void slab_free ( struct kmem_cache * s ,
2008-08-19 21:43:25 +04:00
struct page * page , void * x , unsigned long addr )
2007-05-10 14:15:16 +04:00
{
void * * object = ( void * ) x ;
2007-10-16 12:26:05 +04:00
struct kmem_cache_cpu * c ;
2011-02-25 20:38:54 +03:00
unsigned long tid ;
2008-01-08 10:20:30 +03:00
2010-08-20 21:37:16 +04:00
slab_free_hook ( s , x ) ;
2011-02-25 20:38:54 +03:00
redo :
/*
* Determine the per cpu slab of the current cpu .
* The cpu may change afterward . However that does not matter since
* data is retrieved via this pointer . If we are on the same cpu
* during the cmpxchg then the free will succeed .
*/
2013-01-24 01:45:48 +04:00
preempt_disable ( ) ;
2009-12-19 01:26:20 +03:00
c = __this_cpu_ptr ( s - > cpu_slab ) ;
2010-08-20 21:37:16 +04:00
2011-02-25 20:38:54 +03:00
tid = c - > tid ;
2013-01-24 01:45:48 +04:00
preempt_enable ( ) ;
2010-08-20 21:37:16 +04:00
2011-05-18 01:29:31 +04:00
if ( likely ( page = = c - > page ) ) {
2009-12-19 01:26:22 +03:00
set_freepointer ( s , object , c - > freelist ) ;
2011-02-25 20:38:54 +03:00
2011-12-22 21:58:51 +04:00
if ( unlikely ( ! this_cpu_cmpxchg_double (
2011-02-25 20:38:54 +03:00
s - > cpu_slab - > freelist , s - > cpu_slab - > tid ,
c - > freelist , tid ,
object , next_tid ( tid ) ) ) ) {
note_cmpxchg_failure ( " slab_free " , s , tid ) ;
goto redo ;
}
2009-12-19 01:26:23 +03:00
stat ( s , FREE_FASTPATH ) ;
2007-05-10 14:15:16 +04:00
} else
2009-12-19 01:26:22 +03:00
__slab_free ( s , page , x , addr ) ;
2007-05-10 14:15:16 +04:00
}
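The fastpath above pairs the freelist head with a transaction id (tid) and retries when either changed between the read and the cmpxchg. The following is a minimal userspace sketch of that retry pattern, not the kernel implementation: it packs a 32-bit object index and a 32-bit tid into one 64-bit word and uses a C11 atomic compare-exchange in place of this_cpu_cmpxchg_double(). All `sketch_` names, the index-based freelist, and the tid step of 1 are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_OBJECTS 8
#define SKETCH_NIL     0xffffffffu	/* hypothetical "empty freelist" marker */

struct sketch_cpu_slab {
	_Atomic uint64_t head;		/* low 32 bits: first free index, high 32 bits: tid */
	uint32_t next[SKETCH_OBJECTS];	/* stand-in for the in-object free pointer */
};

static uint64_t sketch_pack(uint32_t first, uint32_t tid)
{
	return ((uint64_t)tid << 32) | first;
}

static void sketch_init(struct sketch_cpu_slab *c)
{
	atomic_init(&c->head, sketch_pack(SKETCH_NIL, 0));
}

/* Push obj onto the freelist; retry if the head or the tid changed meanwhile. */
static void sketch_free(struct sketch_cpu_slab *c, uint32_t obj)
{
	uint64_t old, new;

	assert(obj < SKETCH_OBJECTS);
	old = atomic_load(&c->head);
	do {
		uint32_t first = (uint32_t)old;
		uint32_t tid = (uint32_t)(old >> 32);

		c->next[obj] = first;			/* plays the role of set_freepointer() */
		new = sketch_pack(obj, tid + 1);	/* plays the role of next_tid(); step of 1 is assumed */
	} while (!atomic_compare_exchange_weak(&c->head, &old, new));
}
```

Because the tid advances on every successful update, a stale snapshot fails the compare even if the freelist head happens to hold the same value again, which is the property the tid provides in slab_free().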
2007-05-07 01:49:36 +04:00
void kmem_cache_free ( struct kmem_cache * s , void * x )
{
2012-12-19 02:22:46 +04:00
s = cache_from_obj ( s , x ) ;
if ( ! s )
2012-09-05 03:06:14 +04:00
return ;
2012-12-19 02:22:46 +04:00
slab_free ( s , virt_to_head_page ( x ) , x , _RET_IP_ ) ;
2009-03-23 16:12:24 +03:00
trace_kmem_cache_free ( _RET_IP_ , x ) ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kmem_cache_free ) ;
/*
2007-05-09 13:32:39 +04:00
* Object placement in a slab is made very easy because we always start at
* offset 0. If we tune the size of the object to the alignment then we can
* get the required alignment by putting one properly sized object after
* another .
2007-05-07 01:49:36 +04:00
*
* Notice that the allocation order determines the sizes of the per cpu
* caches . Each processor has always one slab available for allocations .
* Increasing the allocation order reduces the number of times that slabs
2007-05-09 13:32:39 +04:00
* must be moved on and off the partial lists and is therefore a factor in
2007-05-07 01:49:36 +04:00
* locking overhead .
*/
/*
* Minimum / Maximum order of slab pages . This influences locking overhead
* and slab fragmentation . A higher order reduces the number of partial slabs
* and increases the number of allocations possible without having to
* take the list_lock .
*/
static int slub_min_order ;
2008-04-14 20:11:41 +04:00
static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER ;
2008-04-14 20:11:41 +04:00
static int slub_min_objects ;
2007-05-07 01:49:36 +04:00
/*
* Merge control . If this is set then no merging of slab caches will occur .
2007-05-09 13:32:39 +04:00
* ( Could be removed . This was introduced to pacify the merge skeptics . )
2007-05-07 01:49:36 +04:00
*/
static int slub_nomerge ;
/*
* Calculate the order of allocation given a slab object size .
*
2007-05-09 13:32:39 +04:00
* The order of allocation has significant impact on performance and other
* system components . Generally order 0 allocations should be preferred since
* order 0 does not cause fragmentation in the page allocator . Larger objects
* may be problematic to put into order 0 slabs because there may be too much
2008-04-14 20:13:29 +04:00
* unused space left . We go to a higher order if more than 1 / 16 th of the slab
2007-05-09 13:32:39 +04:00
* would be wasted .
*
* In order to reach satisfactory performance we must ensure that a minimum
* number of objects is in one slab . Otherwise we may generate too much
* activity on the partial lists which requires taking the list_lock . This is
* less a concern for large slabs though which are rarely used .
2007-05-07 01:49:36 +04:00
*
2007-05-09 13:32:39 +04:00
* slub_max_order specifies the order where we begin to stop considering the
* number of objects in a slab as critical . If we reach slub_max_order then
* we try to keep the page order as low as possible . So we accept more waste
* of space in favor of a small page order .
2007-05-07 01:49:36 +04:00
*
2007-05-09 13:32:39 +04:00
* Higher order allocations also allow the placement of more objects in a
* slab and thereby reduce object handling overhead . If the user has
* requested a higher minimum order then we start with that one instead of
* the smallest order which will fit the object .
2007-05-07 01:49:36 +04:00
*/
2007-05-09 13:32:46 +04:00
static inline int slab_order ( int size , int min_objects ,
2011-03-10 10:21:48 +03:00
int max_order , int fract_leftover , int reserved )
2007-05-07 01:49:36 +04:00
{
int order ;
int rem ;
2007-07-17 15:03:20 +04:00
int min_order = slub_min_order ;
2007-05-07 01:49:36 +04:00
2011-03-10 10:21:48 +03:00
if ( order_objects ( min_order , size , reserved ) > MAX_OBJS_PER_PAGE )
2008-10-22 23:00:38 +04:00
return get_order ( size * MAX_OBJS_PER_PAGE ) - 1 ;
2008-04-14 20:11:30 +04:00
2007-07-17 15:03:20 +04:00
for ( order = max ( min_order ,
2007-05-09 13:32:46 +04:00
fls ( min_objects * size - 1 ) - PAGE_SHIFT ) ;
order < = max_order ; order + + ) {
2007-05-07 01:49:36 +04:00
2007-05-09 13:32:46 +04:00
unsigned long slab_size = PAGE_SIZE < < order ;
2007-05-07 01:49:36 +04:00
2011-03-10 10:21:48 +03:00
if ( slab_size < min_objects * size + reserved )
2007-05-07 01:49:36 +04:00
continue ;
2011-03-10 10:21:48 +03:00
rem = ( slab_size - reserved ) % size ;
2007-05-07 01:49:36 +04:00
2007-05-09 13:32:46 +04:00
if ( rem < = slab_size / fract_leftover )
2007-05-07 01:49:36 +04:00
break ;
}
2007-05-09 13:32:39 +04:00
2007-05-07 01:49:36 +04:00
return order ;
}
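The search in slab_order() can be tried out in userspace. The sketch below is an approximation under stated assumptions: a 4 KiB page size, no MAX_OBJS_PER_PAGE check, and a loop that starts at min_order instead of the fls()-based lower bound (orders that are too small are skipped by the same `continue`, so the result is the same). The name sketch_slab_order and its parameter list are illustrative, not kernel API.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12				/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/*
 * Return the smallest order in [min_order, max_order] whose slab holds at
 * least min_objects objects and wastes at most 1/fract_leftover of the slab.
 * Returns max_order + 1 when no order qualifies.
 */
static int sketch_slab_order(unsigned long size, unsigned long min_objects,
			     int min_order, int max_order,
			     unsigned long fract_leftover,
			     unsigned long reserved)
{
	int order;

	assert(size > 0 && fract_leftover > 0);

	for (order = min_order; order <= max_order; order++) {
		unsigned long slab_size = SKETCH_PAGE_SIZE << order;
		unsigned long rem;

		/* Slab too small for the requested number of objects. */
		if (slab_size < min_objects * size + reserved)
			continue;

		/* Bytes left over after packing whole objects. */
		rem = (slab_size - reserved) % size;
		if (rem <= slab_size / fract_leftover)
			break;
	}
	return order;
}
```

For a 700 byte object with min_objects = 4 and fract_leftover = 16, order 0 leaves 596 bytes unused (more than 4096 / 16 = 256), while order 1 leaves 492 bytes (at most 8192 / 16 = 512), so the sketch settles on order 1.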
2011-03-10 10:21:48 +03:00
static inline int calculate_order ( int size , int reserved )
2007-05-09 13:32:46 +04:00
{
int order ;
int min_objects ;
int fraction ;
2009-02-12 19:00:17 +03:00
int max_objects ;
2007-05-09 13:32:46 +04:00
/*
* Attempt to find best configuration for a slab . This
* works by first attempting to generate a layout with
* the best configuration and backing off gradually .
*
* First we reduce the acceptable waste in a slab . Then
* we reduce the minimum objects required in a slab .
*/
min_objects = slub_min_objects ;
2008-04-14 20:11:41 +04:00
if ( ! min_objects )
min_objects = 4 * ( fls ( nr_cpu_ids ) + 1 ) ;
2011-03-10 10:21:48 +03:00
max_objects = order_objects ( slub_max_order , size , reserved ) ;
2009-02-12 19:00:17 +03:00
min_objects = min ( min_objects , max_objects ) ;
2007-05-09 13:32:46 +04:00
while ( min_objects > 1 ) {
2008-04-14 20:13:29 +04:00
fraction = 16 ;
2007-05-09 13:32:46 +04:00
while ( fraction > = 4 ) {
order = slab_order ( size , min_objects ,
2011-03-10 10:21:48 +03:00
slub_max_order , fraction , reserved ) ;
2007-05-09 13:32:46 +04:00
if ( order < = slub_max_order )
return order ;
fraction / = 2 ;
}
2009-08-19 22:44:13 +04:00
min_objects - - ;
2007-05-09 13:32:46 +04:00
}
/*
* We were unable to place multiple objects in a slab . Now
* let us see if we can place a single object there .
*/
2011-03-10 10:21:48 +03:00
order = slab_order ( size , 1 , slub_max_order , 1 , reserved ) ;
2007-05-09 13:32:46 +04:00
if ( order < = slub_max_order )
return order ;
/*
* Doh this slab cannot be placed using slub_max_order .
*/
2011-03-10 10:21:48 +03:00
order = slab_order ( size , 1 , MAX_ORDER , 1 , reserved ) ;
2009-04-23 10:58:22 +04:00
if ( order < MAX_ORDER )
2007-05-09 13:32:46 +04:00
return order ;
return - ENOSYS ;
}
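When slub_min_objects is not set, calculate_order() derives the starting minimum from the number of possible CPUs as 4 * (fls(nr_cpu_ids) + 1). A small userspace sketch of that formula follows; sketch_fls is a hand-written stand-in for the kernel's fls(), and both function names are assumptions for illustration only.

```c
#include <assert.h>

/* Stand-in for fls(): 1-based index of the highest set bit, 0 for x == 0. */
static int sketch_fls(unsigned int x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Default minimum number of objects per slab for a given CPU count. */
static int sketch_default_min_objects(unsigned int nr_cpu_ids)
{
	assert(nr_cpu_ids > 0);
	return 4 * (sketch_fls(nr_cpu_ids) + 1);
}
```

The value grows only logarithmically with the CPU count: 1 CPU gives 8, 4 CPUs give 16, and 64 CPUs give 32. calculate_order() then caps this value by the number of objects that fit at slub_max_order before starting the backoff loop.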
2008-08-05 10:28:47 +04:00
static void
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( struct kmem_cache_node * n )
2007-05-07 01:49:36 +04:00
{
n - > nr_partial = 0 ;
spin_lock_init ( & n - > list_lock ) ;
INIT_LIST_HEAD ( & n - > partial ) ;
2007-07-17 15:03:32 +04:00
# ifdef CONFIG_SLUB_DEBUG
2008-04-14 19:53:02 +04:00
atomic_long_set ( & n - > nr_slabs , 0 ) ;
2008-09-11 23:25:41 +04:00
atomic_long_set ( & n - > total_objects , 0 ) ;
2007-05-07 01:49:42 +04:00
INIT_LIST_HEAD ( & n - > full ) ;
2007-07-17 15:03:32 +04:00
# endif
2007-05-07 01:49:36 +04:00
}
2010-08-20 21:37:13 +04:00
static inline int alloc_kmem_cache_cpus ( struct kmem_cache * s )
2007-10-16 12:26:08 +04:00
{
2010-08-20 21:37:14 +04:00
BUILD_BUG_ON ( PERCPU_DYNAMIC_EARLY_SIZE <
2013-01-10 23:14:19 +04:00
KMALLOC_SHIFT_HIGH * sizeof ( struct kmem_cache_cpu ) ) ;
2007-10-16 12:26:08 +04:00
2011-02-25 20:38:54 +03:00
/*
2011-06-02 18:19:41 +04:00
* Must align to double word boundary for the double cmpxchg
* instructions to work ; see __pcpu_double_call_return_bool ( ) .
2011-02-25 20:38:54 +03:00
*/
2011-06-02 18:19:41 +04:00
s - > cpu_slab = __alloc_percpu ( sizeof ( struct kmem_cache_cpu ) ,
2 * sizeof ( void * ) ) ;
2011-02-25 20:38:54 +03:00
if ( ! s - > cpu_slab )
return 0 ;
init_kmem_cache_cpus ( s ) ;
2007-10-16 12:26:08 +04:00
2011-02-25 20:38:54 +03:00
return 1 ;
2007-10-16 12:26:08 +04:00
}
2010-08-20 21:37:15 +04:00
static struct kmem_cache * kmem_cache_node ;
2007-05-07 01:49:36 +04:00
/*
* No kmalloc_node yet so do it by hand . We know that this is the first
* slab on the node for this slabcache . There are no concurrent accesses
* possible .
*
2013-11-08 16:47:37 +04:00
* Note that this function only works on the kmem_cache_node cache
* when allocating a struct kmem_cache_node . This is used for bootstrapping
2007-10-16 12:26:08 +04:00
* memory on a fresh node that has no slab structures yet .
2007-05-07 01:49:36 +04:00
*/
2010-08-20 21:37:13 +04:00
static void early_kmem_cache_node_alloc ( int node )
2007-05-07 01:49:36 +04:00
{
struct page * page ;
struct kmem_cache_node * n ;
2010-08-20 21:37:15 +04:00
BUG_ON ( kmem_cache_node - > size < sizeof ( struct kmem_cache_node ) ) ;
2007-05-07 01:49:36 +04:00
2010-08-20 21:37:15 +04:00
page = new_slab ( kmem_cache_node , GFP_NOWAIT , node ) ;
2007-05-07 01:49:36 +04:00
BUG_ON ( ! page ) ;
2007-08-23 01:01:57 +04:00
if ( page_to_nid ( page ) ! = node ) {
printk ( KERN_ERR " SLUB: Unable to allocate memory from "
" node %d \n " , node ) ;
printk ( KERN_ERR " SLUB: Allocating a useless per node structure "
" in order to be able to continue \n " ) ;
}
2007-05-07 01:49:36 +04:00
n = page - > freelist ;
BUG_ON ( ! n ) ;
2010-08-20 21:37:15 +04:00
page - > freelist = get_freepointer ( kmem_cache_node , n ) ;
2011-08-10 01:12:24 +04:00
page - > inuse = 1 ;
2011-06-01 21:25:46 +04:00
page - > frozen = 0 ;
2010-08-20 21:37:15 +04:00
kmem_cache_node - > node [ node ] = n ;
2007-07-17 15:03:32 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-09-29 16:15:01 +04:00
init_object ( kmem_cache_node , n , SLUB_RED_ACTIVE ) ;
2010-08-20 21:37:15 +04:00
init_tracking ( kmem_cache_node , n ) ;
2007-07-17 15:03:32 +04:00
# endif
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( n ) ;
2010-08-20 21:37:15 +04:00
inc_slabs_node ( kmem_cache_node , node , page - > objects ) ;
2008-02-16 10:45:26 +03:00
2014-01-24 19:20:23 +04:00
/*
2014-02-11 02:25:46 +04:00
* No locks need to be taken here as it has just been
* initialized and there is no concurrent access .
2014-01-24 19:20:23 +04:00
*/
2014-02-11 02:25:46 +04:00
__add_partial ( n , page , DEACTIVATE_TO_HEAD ) ;
2007-05-07 01:49:36 +04:00
}
static void free_kmem_cache_nodes ( struct kmem_cache * s )
{
int node ;
2007-10-16 12:25:33 +04:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-07 01:49:36 +04:00
struct kmem_cache_node * n = s - > node [ node ] ;
2010-08-20 21:37:15 +04:00
2010-05-22 01:41:35 +04:00
if ( n )
2010-08-20 21:37:15 +04:00
kmem_cache_free ( kmem_cache_node , n ) ;
2007-05-07 01:49:36 +04:00
s - > node [ node ] = NULL ;
}
}
2010-08-20 21:37:13 +04:00
static int init_kmem_cache_nodes ( struct kmem_cache * s )
2007-05-07 01:49:36 +04:00
{
int node ;
2007-10-16 12:25:33 +04:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-07 01:49:36 +04:00
struct kmem_cache_node * n ;
2010-05-22 01:41:35 +04:00
if ( slab_state = = DOWN ) {
2010-08-20 21:37:13 +04:00
early_kmem_cache_node_alloc ( node ) ;
2010-05-22 01:41:35 +04:00
continue ;
}
2010-08-20 21:37:15 +04:00
n = kmem_cache_alloc_node ( kmem_cache_node ,
2010-08-20 21:37:13 +04:00
GFP_KERNEL , node ) ;
2007-05-07 01:49:36 +04:00
2010-05-22 01:41:35 +04:00
if ( ! n ) {
free_kmem_cache_nodes ( s ) ;
return 0 ;
2007-05-07 01:49:36 +04:00
}
2010-05-22 01:41:35 +04:00
2007-05-07 01:49:36 +04:00
s - > node [ node ] = n ;
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( n ) ;
2007-05-07 01:49:36 +04:00
}
return 1 ;
}
2009-02-25 10:16:35 +03:00
static void set_min_partial ( struct kmem_cache * s , unsigned long min )
2009-02-23 04:40:07 +03:00
{
if ( min < MIN_PARTIAL )
min = MIN_PARTIAL ;
else if ( min > MAX_PARTIAL )
min = MAX_PARTIAL ;
s - > min_partial = min ;
}
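set_min_partial() is a plain clamp, and kmem_cache_open() feeds it ilog2(s->size) / 2. The sketch below combines both steps in userspace. The bounds 5 and 10 are assumptions standing in for MIN_PARTIAL and MAX_PARTIAL, which are defined outside this section, and sketch_ilog2 is a hand-written floor(log2(x)).

```c
#include <assert.h>

#define SKETCH_MIN_PARTIAL 5UL	/* assumed value of MIN_PARTIAL */
#define SKETCH_MAX_PARTIAL 10UL	/* assumed value of MAX_PARTIAL */

/* floor(log2(x)) for x > 0, standing in for the kernel's ilog2(). */
static unsigned long sketch_ilog2(unsigned long x)
{
	unsigned long r = 0;

	assert(x > 0);
	while (x >>= 1)
		r++;
	return r;
}

/* min_partial as derived from the object size, clamped to the bounds. */
static unsigned long sketch_min_partial(unsigned long size)
{
	unsigned long min = sketch_ilog2(size) / 2;

	if (min < SKETCH_MIN_PARTIAL)
		min = SKETCH_MIN_PARTIAL;
	else if (min > SKETCH_MAX_PARTIAL)
		min = SKETCH_MAX_PARTIAL;
	return min;
}
```

Under these assumed bounds, small objects all land on the lower bound (an 8 byte object yields 1 before clamping, 5 after), and only objects of 4 KiB and larger move above it.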
2007-05-07 01:49:36 +04:00
/*
* calculate_sizes ( ) determines the order and the distribution of data within
* a slab object .
*/
2008-04-14 20:11:41 +04:00
static int calculate_sizes ( struct kmem_cache * s , int forced_order )
2007-05-07 01:49:36 +04:00
{
unsigned long flags = s - > flags ;
2012-06-13 19:24:57 +04:00
unsigned long size = s - > object_size ;
2008-04-14 20:11:31 +04:00
int order ;
2007-05-07 01:49:36 +04:00
2008-02-16 10:45:25 +03:00
/*
* Round up object size to the next word boundary . We can only
* place the free pointer at word boundaries and this determines
* the possible location of the free pointer .
*/
size = ALIGN ( size , sizeof ( void * ) ) ;
# ifdef CONFIG_SLUB_DEBUG
2007-05-07 01:49:36 +04:00
/*
* Determine if we can poison the object itself . If the user of
* the slab may touch the object after free or before allocation
* then we should never poison the object itself .
*/
if ( ( flags & SLAB_POISON ) & & ! ( flags & SLAB_DESTROY_BY_RCU ) & &
2007-05-17 09:10:50 +04:00
! s - > ctor )
2007-05-07 01:49:36 +04:00
s - > flags | = __OBJECT_POISON ;
else
s - > flags & = ~ __OBJECT_POISON ;
/*
2007-05-09 13:32:39 +04:00
* If we are Redzoning then check if there is some space between the
2007-05-07 01:49:36 +04:00
* end of the object and the free pointer . If not then add an
2007-05-09 13:32:39 +04:00
* additional word to have some bytes to store Redzone information .
2007-05-07 01:49:36 +04:00
*/
2012-06-13 19:24:57 +04:00
if ( ( flags & SLAB_RED_ZONE ) & & size = = s - > object_size )
2007-05-07 01:49:36 +04:00
size + = sizeof ( void * ) ;
2007-05-09 13:32:44 +04:00
# endif
2007-05-07 01:49:36 +04:00
/*
2007-05-09 13:32:39 +04:00
* With that we have determined the number of bytes in actual use
* by the object . This is the potential offset to the free pointer .
2007-05-07 01:49:36 +04:00
*/
s - > inuse = size ;
if ( ( ( flags & ( SLAB_DESTROY_BY_RCU | SLAB_POISON ) ) | |
2007-05-17 09:10:50 +04:00
s - > ctor ) ) {
2007-05-07 01:49:36 +04:00
/*
* Relocate free pointer after the object if it is not
* permitted to overwrite the first word of the object on
* kmem_cache_free .
*
* This is the case if we do RCU , have a constructor
* or are poisoning the objects .
*/
s - > offset = size ;
size + = sizeof ( void * ) ;
}
2007-05-24 00:57:31 +04:00
# ifdef CONFIG_SLUB_DEBUG
2007-05-07 01:49:36 +04:00
if ( flags & SLAB_STORE_USER )
/*
* Need to store information about allocs and frees after
* the object .
*/
size + = 2 * sizeof ( struct track ) ;
2007-05-09 13:32:36 +04:00
if ( flags & SLAB_RED_ZONE )
2007-05-07 01:49:36 +04:00
/*
* Add some empty padding so that we can catch
* overwrites from earlier objects rather than let
* tracking information or the free pointer be
2008-12-30 00:14:56 +03:00
* corrupted if a user writes before the start
2007-05-07 01:49:36 +04:00
* of the object .
*/
size + = sizeof ( void * ) ;
2007-05-09 13:32:44 +04:00
# endif
2007-05-09 13:32:39 +04:00
2007-05-07 01:49:36 +04:00
/*
* SLUB stores one object immediately after another beginning from
* offset 0. In order to align the objects we have to simply size
* each object to conform to the alignment .
*/
2012-11-28 20:23:16 +04:00
size = ALIGN ( size , s - > align ) ;
2007-05-07 01:49:36 +04:00
s - > size = size ;
2008-04-14 20:11:41 +04:00
if ( forced_order > = 0 )
order = forced_order ;
else
2011-03-10 10:21:48 +03:00
order = calculate_order ( size , s - > reserved ) ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:31 +04:00
if ( order < 0 )
2007-05-07 01:49:36 +04:00
return 0 ;
2008-02-15 01:21:32 +03:00
s - > allocflags = 0 ;
2008-04-14 20:11:31 +04:00
if ( order )
2008-02-15 01:21:32 +03:00
s - > allocflags | = __GFP_COMP ;
if ( s - > flags & SLAB_CACHE_DMA )
2013-01-10 23:14:19 +04:00
s - > allocflags | = GFP_DMA ;
2008-02-15 01:21:32 +03:00
if ( s - > flags & SLAB_RECLAIM_ACCOUNT )
s - > allocflags | = __GFP_RECLAIMABLE ;
2007-05-07 01:49:36 +04:00
/*
* Determine the number of objects per slab
*/
2011-03-10 10:21:48 +03:00
s - > oo = oo_make ( order , size , s - > reserved ) ;
s - > min = oo_make ( get_order ( size ) , size , s - > reserved ) ;
2008-04-14 20:11:40 +04:00
if ( oo_objects ( s - > oo ) > oo_objects ( s - > max ) )
s - > max = s - > oo ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:31 +04:00
return ! ! oo_objects ( s - > oo ) ;
2007-05-07 01:49:36 +04:00
}
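The size bookkeeping in calculate_sizes() is easier to follow with concrete numbers. The sketch below mirrors the order of the steps in userspace under simplifying assumptions: the flags are reduced to plain booleans (relocate_fp stands for "RCU, poisoning, or a constructor is in use"), the size of the two struct track records is passed in as track_bytes, and the order calculation is left out. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of a (works for any a > 0). */
#define SKETCH_ALIGN(x, a) (((x) + (a) - 1) / (a) * (a))

struct sketch_layout {
	size_t inuse;	/* bytes used by the object, including a red zone word */
	size_t offset;	/* free pointer offset; 0 means it overlays the object */
	size_t size;	/* distance from one object to the next in the slab */
};

static struct sketch_layout sketch_calculate_sizes(size_t object_size,
						   size_t align, int red_zone,
						   int relocate_fp,
						   size_t track_bytes)
{
	struct sketch_layout l = { 0, 0, 0 };
	size_t size;

	assert(object_size > 0 && align > 0);

	/* The free pointer can only sit at a word boundary. */
	size = SKETCH_ALIGN(object_size, sizeof(void *));

	/* Red zoning needs at least one spare byte after the object. */
	if (red_zone && size == object_size)
		size += sizeof(void *);
	l.inuse = size;

	/* Move the free pointer behind the object if it must not be overwritten. */
	if (relocate_fp) {
		l.offset = size;
		size += sizeof(void *);
	}

	size += track_bytes;		/* alloc/free tracking records, if any */
	if (red_zone)
		size += sizeof(void *);	/* padding to catch writes before the next object */

	l.size = SKETCH_ALIGN(size, align);
	return l;
}
```

With a three-word object, word alignment, red zoning, and a relocated free pointer, the sketch yields inuse = 4 words, offset = 4 words, and size = 6 words: one word of red zone, one for the free pointer, and one of trailing padding.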
2012-09-05 03:18:33 +04:00
static int kmem_cache_open ( struct kmem_cache * s , unsigned long flags )
2007-05-07 01:49:36 +04:00
{
2012-09-05 03:18:33 +04:00
s - > flags = kmem_cache_flags ( s - > size , flags , s - > name , s - > ctor ) ;
2011-03-10 10:21:48 +03:00
s - > reserved = 0 ;
2007-05-07 01:49:36 +04:00
2011-03-10 10:22:00 +03:00
if ( need_reserve_slab_rcu & & ( s - > flags & SLAB_DESTROY_BY_RCU ) )
s - > reserved = sizeof ( struct rcu_head ) ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:41 +04:00
if ( ! calculate_sizes ( s , - 1 ) )
2007-05-07 01:49:36 +04:00
goto error ;
2009-07-28 05:30:35 +04:00
if ( disable_higher_order_debug ) {
/*
* Disable debugging flags that store metadata if the min slab
* order increased .
*/
2012-06-13 19:24:57 +04:00
if ( get_order ( s - > size ) > get_order ( s - > object_size ) ) {
2009-07-28 05:30:35 +04:00
s - > flags & = ~ DEBUG_METADATA_FLAGS ;
s - > offset = 0 ;
if ( ! calculate_sizes ( s , - 1 ) )
goto error ;
}
}
2007-05-07 01:49:36 +04:00
2012-01-13 05:17:33 +04:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined ( CONFIG_HAVE_ALIGNED_STRUCT_PAGE )
2011-06-01 21:25:49 +04:00
if ( system_has_cmpxchg_double ( ) & & ( s - > flags & SLAB_DEBUG_FLAGS ) = = 0 )
/* Enable fast mode */
s - > flags | = __CMPXCHG_DOUBLE ;
# endif
2009-02-23 04:40:07 +03:00
/*
* The larger the object size is , the more pages we want on the partial
* list to avoid pounding the page allocator excessively .
*/
2011-08-10 01:12:27 +04:00
set_min_partial ( s , ilog2 ( s - > size ) / 2 ) ;
/*
* cpu_partial determines the maximum number of objects kept in the
* per cpu partial lists of a processor .
*
* Per cpu partial lists mainly contain slabs that just have one
* object freed . If they are used for allocation then they can be
* filled up again with minimal effort . The slab will never hit the
* per node partial lists and therefore no locking will be required .
*
* This setting also determines
*
* A ) The number of objects from per cpu partial slabs dumped to the
* per node list when we reach the limit .
2011-09-01 07:32:18 +04:00
* B ) The number of objects in cpu partial slabs to extract from the
2013-07-15 05:05:29 +04:00
* per node list when we run out of per cpu objects . We only fetch
* 50 % to keep some capacity around for frees .
2011-08-10 01:12:27 +04:00
*/
2013-06-19 09:05:52 +04:00
if ( ! kmem_cache_has_cpu_partial ( s ) )
2011-11-23 19:24:27 +04:00
s - > cpu_partial = 0 ;
else if ( s - > size > = PAGE_SIZE )
2011-08-10 01:12:27 +04:00
s - > cpu_partial = 2 ;
else if ( s - > size > = 1024 )
s - > cpu_partial = 6 ;
else if ( s - > size > = 256 )
s - > cpu_partial = 13 ;
else
s - > cpu_partial = 30 ;
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_NUMA
2008-08-19 17:51:22 +04:00
s - > remote_node_defrag_ratio = 1000 ;
2007-05-07 01:49:36 +04:00
# endif
2010-08-20 21:37:13 +04:00
if ( ! init_kmem_cache_nodes ( s ) )
2007-10-16 12:26:05 +04:00
goto error ;
2007-05-07 01:49:36 +04:00
2010-08-20 21:37:13 +04:00
if ( alloc_kmem_cache_cpus ( s ) )
2012-09-05 04:20:34 +04:00
return 0 ;
2009-12-19 01:26:22 +03:00
2007-10-16 12:26:08 +04:00
free_kmem_cache_nodes ( s ) ;
2007-05-07 01:49:36 +04:00
error :
if ( flags & SLAB_PANIC )
panic ( " Cannot create slab %s size=%lu realsize=%u "
" order=%u offset=%u flags=%lx \n " ,
2013-07-15 05:05:29 +04:00
s - > name , ( unsigned long ) s - > size , s - > size ,
oo_order ( s - > oo ) , s - > offset , flags ) ;
2012-09-05 04:20:34 +04:00
return - EINVAL ;
2007-05-07 01:49:36 +04:00
}
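The cpu_partial selection in kmem_cache_open() is a simple step function of the object size. A minimal sketch of that ladder follows, assuming a 4 KiB PAGE_SIZE and replacing kmem_cache_has_cpu_partial() with a boolean argument; sketch_cpu_partial is an illustrative name, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Number of objects to keep on the per cpu partial lists for a cache. */
static unsigned int sketch_cpu_partial(size_t size, int has_cpu_partial)
{
	assert(size > 0);

	if (!has_cpu_partial)
		return 0;
	if (size >= SKETCH_PAGE_SIZE)
		return 2;
	if (size >= 1024)
		return 6;
	if (size >= 256)
		return 13;
	return 30;
}
```

Larger objects get a smaller budget, so the memory parked on per cpu partial lists stays roughly bounded regardless of the object size.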
2008-04-25 23:22:43 +04:00
static void list_slab_objects ( struct kmem_cache * s , struct page * page ,
const char * text )
{
# ifdef CONFIG_SLUB_DEBUG
void * addr = page_address ( page ) ;
void * p ;
2010-09-29 16:02:13 +04:00
unsigned long * map = kzalloc ( BITS_TO_LONGS ( page - > objects ) *
sizeof ( long ) , GFP_ATOMIC ) ;
2010-03-25 00:25:47 +03:00
if ( ! map )
return ;
2012-09-05 03:18:33 +04:00
slab_err ( s , page , text , s - > name ) ;
2008-04-25 23:22:43 +04:00
slab_lock ( page ) ;
2011-04-15 23:48:13 +04:00
get_map ( s , page , map ) ;
2008-04-25 23:22:43 +04:00
for_each_object ( p , s , addr , page - > objects ) {
if ( ! test_bit ( slab_index ( p , s , addr ) , map ) ) {
printk ( KERN_ERR " INFO: Object 0x%p @offset=%tu \n " ,
p , p - addr ) ;
print_tracking ( s , p ) ;
}
}
slab_unlock ( page ) ;
2010-03-25 00:25:47 +03:00
kfree ( map ) ;
2008-04-25 23:22:43 +04:00
# endif
}
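list_slab_objects() sizes a bitmap with BITS_TO_LONGS(), lets get_map() mark the objects that are on the freelist, and then reports every object whose bit is clear. The userspace sketch below reproduces that bookkeeping with hand-written helpers; the `sketch_` names are assumptions, and the function counts the in-use objects instead of printing them.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITS_TO_LONGS(n) \
	(((n) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Mark object nr as free in the map. */
static void sketch_set_bit(size_t nr, unsigned long *map)
{
	map[nr / SKETCH_BITS_PER_LONG] |= 1UL << (nr % SKETCH_BITS_PER_LONG);
}

/* Return nonzero if object nr is marked free. */
static int sketch_test_bit(size_t nr, const unsigned long *map)
{
	return (map[nr / SKETCH_BITS_PER_LONG] >> (nr % SKETCH_BITS_PER_LONG)) & 1UL;
}

/* Count objects whose bit is clear, i.e. objects that are still allocated. */
static size_t sketch_count_in_use(const unsigned long *free_map, size_t objects)
{
	size_t i, in_use = 0;

	for (i = 0; i < objects; i++)
		if (!sketch_test_bit(i, free_map))
			in_use++;
	return in_use;
}
```

A caller would allocate SKETCH_BITS_TO_LONGS(objects) zeroed longs, set one bit per free object, and then walk the slab, which is the same sequence list_slab_objects() follows with kzalloc(), get_map(), and for_each_object().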
2007-05-07 01:49:36 +04:00
/*
2008-04-23 23:36:52 +04:00
* Attempt to free all partial slabs on a node .
2011-08-10 01:12:22 +04:00
* This is called from kmem_cache_close ( ) . We must be the last thread
* using the cache and therefore we do not need to lock anymore .
2007-05-07 01:49:36 +04:00
*/
2008-04-23 23:36:52 +04:00
static void free_partial ( struct kmem_cache * s , struct kmem_cache_node * n )
2007-05-07 01:49:36 +04:00
{
struct page * page , * h ;
2008-04-25 23:22:43 +04:00
list_for_each_entry_safe ( page , h , & n - > partial , lru ) {
2007-05-07 01:49:36 +04:00
if ( ! page - > inuse ) {
2014-02-11 02:25:46 +04:00
__remove_partial ( n , page ) ;
2007-05-07 01:49:36 +04:00
discard_slab ( s , page ) ;
2008-04-25 23:22:43 +04:00
} else {
list_slab_objects ( s , page ,
2012-09-05 03:18:33 +04:00
" Objects remaining in %s on kmem_cache_close() " ) ;
2008-04-23 23:36:52 +04:00
}
2008-04-25 23:22:43 +04:00
}
2007-05-07 01:49:36 +04:00
}
/*
2007-05-09 13:32:39 +04:00
* Release all resources used by a slab cache .
2007-05-07 01:49:36 +04:00
*/
2007-07-17 15:03:24 +04:00
static inline int kmem_cache_close ( struct kmem_cache * s )
2007-05-07 01:49:36 +04:00
{
int node ;
flush_all ( s ) ;
/* Attempt to free all objects */
2007-10-16 12:25:33 +04:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-07 01:49:36 +04:00
struct kmem_cache_node * n = get_node ( s , node ) ;
2008-04-23 23:36:52 +04:00
free_partial ( s , n ) ;
if ( n - > nr_partial | | slabs_node ( s , node ) )
2007-05-07 01:49:36 +04:00
return 1 ;
}
2012-09-05 03:18:33 +04:00
free_percpu ( s - > cpu_slab ) ;
2007-05-07 01:49:36 +04:00
free_kmem_cache_nodes ( s ) ;
return 0 ;
}
2012-09-05 03:18:33 +04:00
int __kmem_cache_shutdown ( struct kmem_cache * s )
2007-05-07 01:49:36 +04:00
{
2012-09-05 03:38:33 +04:00
int rc = kmem_cache_close ( s ) ;
2012-09-05 03:18:33 +04:00
slub: drop mutex before deleting sysfs entry
Sasha Levin recently reported a lockdep problem resulting from the new
attribute propagation introduced by kmemcg series. In short, slab_mutex
will be called from within the sysfs attribute store function. This will
create a dependency, that will later be held backwards when a cache is
destroyed - since destruction occurs with the slab_mutex held, and then
calls in to the sysfs directory removal function.
In this patch, I propose to adopt a strategy close to what
__kmem_cache_create does before calling sysfs_slab_add, and release the
lock before the call to sysfs_slab_remove. This is pretty much the last
operation in the kmem_cache_shutdown() path, so we could do better by
splitting this and moving this call alone to later on. This will fit
nicely when sysfs handling is consistent between all caches, but will look
weird now.
Lockdep info:
======================================================
[ INFO: possible circular locking dependency detected ]
3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W
-------------------------------------------------------
trinity-child13/6961 is trying to acquire lock:
(s_active#43){++++.+}, at: sysfs_addrm_finish+0x31/0x60
but task is already holding lock:
(slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (slab_mutex){+.+.+.}:
lock_acquire+0x1aa/0x240
__mutex_lock_common+0x59/0x5a0
mutex_lock_nested+0x3f/0x50
slab_attr_store+0xde/0x110
sysfs_write_file+0xfa/0x150
vfs_write+0xb0/0x180
sys_pwrite64+0x60/0xb0
tracesys+0xe1/0xe6
-> #0 (s_active#43){++++.+}:
__lock_acquire+0x14df/0x1ca0
lock_acquire+0x1aa/0x240
sysfs_deactivate+0x122/0x1a0
sysfs_addrm_finish+0x31/0x60
sysfs_remove_dir+0x89/0xd0
kobject_del+0x16/0x40
__kmem_cache_shutdown+0x40/0x60
kmem_cache_destroy+0x40/0xe0
mon_text_release+0x78/0xe0
__fput+0x122/0x2d0
____fput+0x9/0x10
task_work_run+0xbe/0x100
do_exit+0x432/0xbd0
do_group_exit+0x84/0xd0
get_signal_to_deliver+0x81d/0x930
do_signal+0x3a/0x950
do_notify_resume+0x3e/0x90
int_signal+0x12/0x17
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(slab_mutex);
lock(s_active#43);
lock(slab_mutex);
lock(s_active#43);
*** DEADLOCK ***
2 locks held by trinity-child13/6961:
#0: (mon_lock){+.+.+.}, at: mon_text_release+0x25/0xe0
#1: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0
stack backtrace:
Pid: 6961, comm: trinity-child13 Tainted: G W 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117
Call Trace:
print_circular_bug+0x1fb/0x20c
__lock_acquire+0x14df/0x1ca0
lock_acquire+0x1aa/0x240
sysfs_deactivate+0x122/0x1a0
sysfs_addrm_finish+0x31/0x60
sysfs_remove_dir+0x89/0xd0
kobject_del+0x16/0x40
__kmem_cache_shutdown+0x40/0x60
kmem_cache_destroy+0x40/0xe0
mon_text_release+0x78/0xe0
__fput+0x122/0x2d0
____fput+0x9/0x10
task_work_run+0xbe/0x100
do_exit+0x432/0xbd0
do_group_exit+0x84/0xd0
get_signal_to_deliver+0x81d/0x930
do_signal+0x3a/0x950
do_notify_resume+0x3e/0x90
int_signal+0x12/0x17
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 02:23:13 +04:00
if ( ! rc ) {
/*
slub: do not drop slab_mutex for sysfs_slab_add
We release the slab_mutex while calling sysfs_slab_add from
__kmem_cache_create since commit 66c4c35c6bc5 ("slub: Do not hold
slub_lock when calling sysfs_slab_add()"), because kobject_uevent called
by sysfs_slab_add might block waiting for the usermode helper to exec,
which would result in a deadlock if we took the slab_mutex while
executing it.
However, apart from complicating synchronization rules, releasing the
slab_mutex on kmem cache creation can result in a kmemcg-related race.
The point is that we check if the memcg cache exists before going to
__kmem_cache_create, but register the new cache in memcg subsys after
it. Since we can drop the mutex there, several threads can see that the
memcg cache does not exist and proceed to creating it, which is wrong.
Fortunately, kobject_uevent was recently patched to call the usermode
helper with the UMH_NO_WAIT flag, making the deadlock impossible.
Therefore there is no point in releasing the slab_mutex while calling
sysfs_slab_add, so let's simplify kmem_cache_create synchronization and
fix the kmemcg-race mentioned above by holding the slab_mutex during the
whole cache creation path.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Greg KH <greg@kroah.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-04 01:48:22 +04:00
* Since slab_attr_store may take the slab_mutex , we should
* release the lock while removing the sysfs entry in order to
* avoid a deadlock . Because this is pretty much the last
2012-12-19 02:23:13 +04:00
* operation we do and the lock will be released shortly after
* that in slab_common . c , we could just move sysfs_slab_remove
* to a later point in common code . We should do that when we
* have a common sysfs framework for all allocators .
*/
mutex_unlock ( & slab_mutex ) ;
2007-05-07 01:49:36 +04:00
sysfs_slab_remove ( s ) ;
2012-12-19 02:23:13 +04:00
mutex_lock ( & slab_mutex ) ;
}
2012-09-05 03:38:33 +04:00
return rc ;
2007-05-07 01:49:36 +04:00
}
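The lock-ordering problem that the comment and the lockdep report describe can be illustrated outside the kernel. What follows is a minimal userspace sketch, not kernel code: the lock identifiers, `struct lock_tracker`, and every function name are hypothetical, and the tracker only imitates the basic idea of recording "B was taken while A was held" and flagging the reverse order later. It assumes a single thread and just two locks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock identifiers; only their relative ordering matters. */
enum { LOCK_SLAB_MUTEX, LOCK_S_ACTIVE, NR_LOCKS };

struct lock_tracker {
	bool held[NR_LOCKS];
	/* seen_before[a][b]: lock b was acquired while lock a was held. */
	bool seen_before[NR_LOCKS][NR_LOCKS];
	int inversions;
};

static void track_acquire(struct lock_tracker *t, int lock)
{
	assert(lock >= 0 && lock < NR_LOCKS);
	for (int h = 0; h < NR_LOCKS; h++) {
		if (!t->held[h])
			continue;
		/* The opposite order was recorded earlier: possible deadlock. */
		if (t->seen_before[lock][h])
			t->inversions++;
		t->seen_before[h][lock] = true;
	}
	t->held[lock] = true;
}

static void track_release(struct lock_tracker *t, int lock)
{
	t->held[lock] = false;
}

/* Attribute store path: s_active is held first, then slab_mutex is taken. */
static void attr_store_path(struct lock_tracker *t)
{
	track_acquire(t, LOCK_S_ACTIVE);
	track_acquire(t, LOCK_SLAB_MUTEX);
	track_release(t, LOCK_SLAB_MUTEX);
	track_release(t, LOCK_S_ACTIVE);
}

/* Cache destruction path; drop_mutex selects the fixed behaviour. */
static void destroy_path(struct lock_tracker *t, bool drop_mutex)
{
	track_acquire(t, LOCK_SLAB_MUTEX);
	if (drop_mutex)
		track_release(t, LOCK_SLAB_MUTEX);
	track_acquire(t, LOCK_S_ACTIVE);	/* stands in for sysfs removal */
	track_release(t, LOCK_S_ACTIVE);
	if (drop_mutex)
		track_acquire(t, LOCK_SLAB_MUTEX);
	track_release(t, LOCK_SLAB_MUTEX);
}
```

Running `attr_store_path()` followed by `destroy_path(t, false)` on a zeroed tracker records one inversion, while `destroy_path(t, true)` records none, which is the effect the unlock/relock pair above is after.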
/********************************************************************
* Kmalloc subsystem
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
static int __init setup_slub_min_order ( char * str )
{
2008-01-08 10:20:27 +03:00
get_option ( & str , & slub_min_order ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
}
__setup ( " slub_min_order= " , setup_slub_min_order ) ;
static int __init setup_slub_max_order ( char * str )
{
2008-01-08 10:20:27 +03:00
get_option ( & str , & slub_max_order ) ;
2009-04-23 10:58:22 +04:00
slub_max_order = min ( slub_max_order , MAX_ORDER - 1 ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
}
__setup ( " slub_max_order= " , setup_slub_max_order ) ;
static int __init setup_slub_min_objects ( char * str )
{
2008-01-08 10:20:27 +03:00
get_option ( & str , & slub_min_objects ) ;
2007-05-07 01:49:36 +04:00
return 1 ;
}
__setup ( " slub_min_objects= " , setup_slub_min_objects ) ;
static int __init setup_slub_nomerge ( char * str )
{
slub_nomerge = 1 ;
return 1 ;
}
__setup ( " slub_nomerge " , setup_slub_nomerge ) ;
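The boot-option handlers above all follow the same pattern: parse an integer from the option string, optionally clamp it, and return 1 to mark the option as handled. A rough userspace sketch of that pattern is shown here. `parse_int_option()` is a hypothetical stand-in built on `strtol()`, not the kernel's `get_option()`, and the values chosen for `MODEL_MAX_ORDER` and the initial `model_slub_max_order` are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_MAX_ORDER 11	/* assumption: stands in for MAX_ORDER */

static int model_slub_max_order = 3;	/* assumption: illustrative default */

/*
 * Parse a leading integer from str into *value.
 * Returns 1 if a number was found, 0 otherwise (*value is left untouched).
 */
static int parse_int_option(const char *str, int *value)
{
	char *end;
	long v;

	assert(str && value);
	v = strtol(str, &end, 0);
	if (end == str)
		return 0;
	*value = (int)v;
	return 1;
}

/* Mirrors the shape of setup_slub_max_order(): parse, then clamp. */
static int setup_max_order_model(const char *str)
{
	parse_int_option(str, &model_slub_max_order);
	if (model_slub_max_order > MODEL_MAX_ORDER - 1)
		model_slub_max_order = MODEL_MAX_ORDER - 1;
	return 1;
}
```

With these assumptions, passing "64" leaves `model_slub_max_order` at 10, and a non-numeric string leaves the previous value in place.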
void * __kmalloc ( size_t size , gfp_t flags )
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-19 21:43:26 +04:00
void * ret ;
2007-05-07 01:49:36 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) )
2008-02-11 23:47:46 +03:00
return kmalloc_large ( size , flags ) ;
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , flags ) ;
2007-10-16 12:24:38 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2012-09-09 00:47:58 +04:00
ret = slab_alloc ( s , flags , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2009-03-23 16:12:24 +03:00
trace_kmalloc ( _RET_IP_ , ret , size , s - > size , flags ) ;
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( __kmalloc ) ;
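The dispatch in `__kmalloc` can be summarised as a three-way decision on the requested size. The sketch below models only that decision in userspace. The page size, the cache-size limit, the 8-byte minimum class, and the pure power-of-two size classes are all assumptions made for illustration; the real kmalloc caches are not guaranteed to follow this exact layout.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE      4096u			/* assumption */
#define MODEL_MAX_CACHE_SIZE (2 * MODEL_PAGE_SIZE)	/* assumption */

enum alloc_path { PATH_ZERO_SIZE, PATH_SLAB_CACHE, PATH_LARGE };

/* Same order of checks as __kmalloc: large first, then zero size. */
static enum alloc_path classify_kmalloc(size_t size)
{
	if (size > MODEL_MAX_CACHE_SIZE)
		return PATH_LARGE;	/* goes to the page allocator */
	if (size == 0)
		return PATH_ZERO_SIZE;	/* caller gets a special marker pointer */
	return PATH_SLAB_CACHE;
}

/* Simplified size class: smallest power of two that fits the request. */
static size_t model_size_class(size_t size)
{
	size_t cls = 8;	/* assumption: smallest class is 8 bytes */

	assert(classify_kmalloc(size) == PATH_SLAB_CACHE);
	while (cls < size)
		cls <<= 1;
	return cls;
}
```

The gap between the requested size and the size class is why `trace_kmalloc` above reports both `size` and `s->size`.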
2010-09-29 16:02:15 +04:00
# ifdef CONFIG_NUMA
2008-03-02 00:56:40 +03:00
static void * kmalloc_large_node ( size_t size , gfp_t flags , int node )
{
2008-11-25 18:55:53 +03:00
struct page * page ;
2009-07-07 13:32:59 +04:00
void * ptr = NULL ;
2008-03-02 00:56:40 +03:00
2012-12-19 02:22:48 +04:00
flags | = __GFP_COMP | __GFP_NOTRACK | __GFP_KMEMCG ;
2008-11-25 18:55:53 +03:00
page = alloc_pages_node ( node , flags , get_order ( size ) ) ;
2008-03-02 00:56:40 +03:00
if ( page )
2009-07-07 13:32:59 +04:00
ptr = page_address ( page ) ;
2013-10-09 02:58:57 +04:00
kmalloc_large_node_hook ( ptr , size , flags ) ;
2009-07-07 13:32:59 +04:00
return ptr ;
2008-03-02 00:56:40 +03:00
}
2007-05-07 01:49:36 +04:00
void * __kmalloc_node ( size_t size , gfp_t flags , int node )
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-19 21:43:26 +04:00
void * ret ;
2007-05-07 01:49:36 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) ) {
2008-08-19 21:43:26 +04:00
ret = kmalloc_large_node ( size , flags , node ) ;
2009-03-23 16:12:24 +03:00
trace_kmalloc_node ( _RET_IP_ , ret ,
size , PAGE_SIZE < < get_order ( size ) ,
flags , node ) ;
2008-08-19 21:43:26 +04:00
return ret ;
}
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , flags ) ;
2007-10-16 12:24:38 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2012-09-09 00:47:58 +04:00
ret = slab_alloc_node ( s , flags , node , _RET_IP_ ) ;
2008-08-19 21:43:26 +04:00
2009-03-23 16:12:24 +03:00
trace_kmalloc_node ( _RET_IP_ , ret , size , s - > size , flags , node ) ;
2008-08-19 21:43:26 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( __kmalloc_node ) ;
# endif
size_t ksize ( const void * object )
{
2007-06-09 00:46:49 +04:00
struct page * page ;
2007-05-07 01:49:36 +04:00
2007-10-16 12:24:46 +04:00
if ( unlikely ( object = = ZERO_SIZE_PTR ) )
2007-06-09 00:46:49 +04:00
return 0 ;
2007-12-05 10:45:30 +03:00
page = virt_to_head_page ( object ) ;
2008-05-22 20:22:25 +04:00
if ( unlikely ( ! PageSlab ( page ) ) ) {
WARN_ON ( ! PageCompound ( page ) ) ;
2007-12-05 10:45:30 +03:00
return PAGE_SIZE < < compound_order ( page ) ;
2008-05-22 20:22:25 +04:00
}
2007-05-07 01:49:36 +04:00
slub: Commonize slab_cache field in struct page
Right now, slab and slub have fields in struct page to derive which
cache a page belongs to, but they do it slightly differently.
slab uses a field called slab_cache that lives in the third double
word. slub uses a field called "slab", which lives outside the
doubleword area.
Ideally, we could use the same field for this. Since slub heavily makes
use of the doubleword region, there isn't really much room to move
slub's slab_cache field around. Since slab does not have such strict
placement restrictions, we can move it outside the doubleword area.
The naming used by slab, "slab_cache", is less confusing, and it is
preferred over slub's generic "slab".
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-10-22 18:05:36 +04:00
return slab_ksize ( page - > slab_cache ) ;
2007-05-07 01:49:36 +04:00
}
2009-02-10 16:21:44 +03:00
EXPORT_SYMBOL ( ksize ) ;
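For objects that did not come from a slab, `ksize` reports the size of the whole compound page, that is, the page size shifted left by the allocation order. The following userspace sketch shows that rounding. `model_get_order()` is a hypothetical loop-based stand-in for the kernel's `get_order()`, and the 4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption */

/* Smallest order such that (MODEL_PAGE_SIZE << order) >= size. */
static unsigned int model_get_order(size_t size)
{
	unsigned int order = 0;
	size_t span = MODEL_PAGE_SIZE;

	assert(size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Usable size of a large allocation: the whole 2^order page block. */
static size_t model_large_ksize(size_t size)
{
	return (size_t)MODEL_PAGE_SIZE << model_get_order(size);
}
```

Under these assumptions a 10000-byte request occupies an order-2 block, so the usable size is 16384 bytes rather than 10000.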
2007-05-07 01:49:36 +04:00
void kfree ( const void * x )
{
struct page * page ;
2008-02-08 04:47:41 +03:00
void * object = ( void * ) x ;
2007-05-07 01:49:36 +04:00
2009-03-25 12:05:57 +03:00
trace_kfree ( _RET_IP_ , x ) ;
2007-10-16 12:24:44 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( x ) ) )
2007-05-07 01:49:36 +04:00
return ;
2007-05-07 01:49:41 +04:00
page = virt_to_head_page ( x ) ;
2007-10-16 12:24:38 +04:00
if ( unlikely ( ! PageSlab ( page ) ) ) {
2008-05-28 21:32:22 +04:00
BUG_ON ( ! PageCompound ( page ) ) ;
2013-10-09 02:58:57 +04:00
kfree_hook ( x ) ;
2012-12-19 02:22:48 +04:00
__free_memcg_kmem_pages ( page , compound_order ( page ) ) ;
2007-10-16 12:24:38 +04:00
return ;
}
2012-10-22 18:05:36 +04:00
slab_free ( page - > slab_cache , page , object , _RET_IP_ ) ;
2007-05-07 01:49:36 +04:00
}
EXPORT_SYMBOL ( kfree ) ;
2007-05-07 01:49:46 +04:00
/*
2007-05-09 13:32:39 +04:00
* kmem_cache_shrink removes empty slabs from the partial lists and sorts
* the remaining slabs by the number of items in use . The slabs with the
* most items in use come first . New allocations will then fill those up
* and thus they can be removed from the partial lists .
*
* The slabs with the least items are placed last . This results in them
* being allocated from last , increasing the chance that the last objects
* are freed in them .
2007-05-07 01:49:46 +04:00
*/
int kmem_cache_shrink ( struct kmem_cache * s )
{
int node ;
int i ;
struct kmem_cache_node * n ;
struct page * page ;
struct page * t ;
2008-04-14 20:11:40 +04:00
int objects = oo_objects ( s - > max ) ;
2007-05-07 01:49:46 +04:00
struct list_head * slabs_by_inuse =
2008-04-14 20:11:31 +04:00
kmalloc ( sizeof ( struct list_head ) * objects , GFP_KERNEL ) ;
2007-05-07 01:49:46 +04:00
unsigned long flags ;
if ( ! slabs_by_inuse )
return - ENOMEM ;
flush_all ( s ) ;
2007-10-16 12:25:33 +04:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-07 01:49:46 +04:00
n = get_node ( s , node ) ;
if ( ! n - > nr_partial )
continue ;
2008-04-14 20:11:31 +04:00
for ( i = 0 ; i < objects ; i + + )
2007-05-07 01:49:46 +04:00
INIT_LIST_HEAD ( slabs_by_inuse + i ) ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
/*
2007-05-09 13:32:39 +04:00
* Build lists indexed by the items in use in each slab .
2007-05-07 01:49:46 +04:00
*
2007-05-09 13:32:39 +04:00
* Note that concurrent frees may occur while we hold the
* list_lock . page - > inuse here is the upper limit .
2007-05-07 01:49:46 +04:00
*/
list_for_each_entry_safe ( page , t , & n - > partial , lru ) {
2011-08-10 01:12:22 +04:00
list_move ( & page - > lru , slabs_by_inuse + page - > inuse ) ;
if ( ! page - > inuse )
n - > nr_partial - - ;
2007-05-07 01:49:46 +04:00
}
/*
2007-05-09 13:32:39 +04:00
* Rebuild the partial list with the slabs filled up most
* first and the least used slabs at the end .
2007-05-07 01:49:46 +04:00
*/
2011-08-10 01:12:22 +04:00
for ( i = objects - 1 ; i > 0 ; i - - )
2007-05-07 01:49:46 +04:00
list_splice ( slabs_by_inuse + i , n - > partial . prev ) ;
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2011-08-10 01:12:22 +04:00
/* Release empty slabs */
list_for_each_entry_safe ( page , t , slabs_by_inuse , lru )
discard_slab ( s , page ) ;
2007-05-07 01:49:46 +04:00
}
kfree ( slabs_by_inuse ) ;
return 0 ;
}
EXPORT_SYMBOL ( kmem_cache_shrink ) ;
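The reordering done by `kmem_cache_shrink` is essentially a bucket sort keyed on the number of objects in use. Below is a small userspace sketch of the same idea. It is a simplification under stated assumptions: slabs are represented only by their `inuse` counts in a plain array (so it degenerates to a counting sort), there is no locking, and the 64-object capacity limit and the name `shrink_partial_model()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Reorder inuse[] in place so that the fullest slabs come first and
 * drop empty slabs (inuse == 0). Every entry must be below 'objects',
 * since a slab on the partial list is never completely full.
 * Returns the number of slabs kept.
 */
static size_t shrink_partial_model(unsigned int *inuse, size_t n,
				   unsigned int objects)
{
	size_t count[64] = {0};	/* assumption: at most 64 objects per slab */
	size_t kept = 0;

	assert(objects > 0 && objects <= 64);

	/* Pass 1: bucket the slabs by their in-use count. */
	for (size_t i = 0; i < n; i++) {
		assert(inuse[i] < objects);
		count[inuse[i]]++;
	}

	/* Pass 2: emit buckets fullest first; bucket 0 is discarded. */
	for (unsigned int b = objects - 1; b > 0; b--)
		for (size_t c = 0; c < count[b]; c++)
			inuse[kept++] = b;

	return kept;
}
```

As in the kernel loop, the downward walk stops before index 0, because the slabs in that bucket are released rather than put back on the partial list.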
2007-10-22 03:41:37 +04:00
static int slab_mem_going_offline_callback ( void * arg )
{
struct kmem_cache * s ;
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
list_for_each_entry ( s , & slab_caches , list )
kmem_cache_shrink ( s ) ;
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
return 0 ;
}
static void slab_mem_offline_callback ( void * arg )
{
struct kmem_cache_node * n ;
struct kmem_cache * s ;
struct memory_notify * marg = arg ;
int offline_node ;
2012-12-12 04:01:05 +04:00
offline_node = marg - > status_change_nid_normal ;
2007-10-22 03:41:37 +04:00
/*
* If the node still has available memory , we still need its
* kmem_cache_node .
*/
if ( offline_node < 0 )
return ;
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
n = get_node ( s , offline_node ) ;
if ( n ) {
/*
* If n - > nr_slabs > 0 , slabs still exist on the node
* that is going down . We were unable to free them ,
2009-12-18 23:40:42 +03:00
* so offline_pages ( ) should not have called this
2007-10-22 03:41:37 +04:00
* callback . Treat this as a bug .
*/
2008-04-14 19:53:02 +04:00
BUG_ON ( slabs_node ( s , offline_node ) ) ;
2007-10-22 03:41:37 +04:00
s - > node [ offline_node ] = NULL ;
2010-08-25 23:51:14 +04:00
kmem_cache_free ( kmem_cache_node , n ) ;
2007-10-22 03:41:37 +04:00
}
}
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
}
static int slab_mem_going_online_callback ( void * arg )
{
struct kmem_cache_node * n ;
struct kmem_cache * s ;
struct memory_notify * marg = arg ;
2012-12-12 04:01:05 +04:00
int nid = marg - > status_change_nid_normal ;
2007-10-22 03:41:37 +04:00
int ret = 0 ;
/*
* If the node ' s memory is already available , then kmem_cache_node is
* already created . Nothing to do .
*/
if ( nid < 0 )
return 0 ;
/*
2008-04-30 03:11:12 +04:00
* We are bringing a node online . No memory is available yet . We must
2007-10-22 03:41:37 +04:00
* allocate a kmem_cache_node structure in order to bring the node
* online .
*/
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
/*
* XXX : kmem_cache_alloc_node will fall back to other nodes
* since memory is not yet available from the node that
* is brought up .
*/
2010-08-25 23:51:14 +04:00
n = kmem_cache_alloc ( kmem_cache_node , GFP_KERNEL ) ;
2007-10-22 03:41:37 +04:00
if ( ! n ) {
ret = - ENOMEM ;
goto out ;
}
2012-05-10 19:50:47 +04:00
init_kmem_cache_node ( n ) ;
2007-10-22 03:41:37 +04:00
s - > node [ nid ] = n ;
}
out :
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-10-22 03:41:37 +04:00
return ret ;
}
static int slab_memory_callback ( struct notifier_block * self ,
unsigned long action , void * arg )
{
int ret = 0 ;
switch ( action ) {
case MEM_GOING_ONLINE :
ret = slab_mem_going_online_callback ( arg ) ;
break ;
case MEM_GOING_OFFLINE :
ret = slab_mem_going_offline_callback ( arg ) ;
break ;
case MEM_OFFLINE :
case MEM_CANCEL_ONLINE :
slab_mem_offline_callback ( arg ) ;
break ;
case MEM_ONLINE :
case MEM_CANCEL_OFFLINE :
break ;
}
2008-12-02 00:13:48 +03:00
if ( ret )
ret = notifier_from_errno ( ret ) ;
else
ret = NOTIFY_OK ;
2007-10-22 03:41:37 +04:00
return ret ;
}
2013-04-30 02:08:06 +04:00
static struct notifier_block slab_memory_callback_nb = {
. notifier_call = slab_memory_callback ,
. priority = SLAB_CALLBACK_PRI ,
} ;
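The callback above turns a negative errno into a notifier return value before handing it back to the notifier chain. The userspace sketch below models that conversion. The constants and the two helper functions are written from memory of the encoding in `include/linux/notifier.h` and should be treated as assumptions; the `enum model_action` values and `model_memory_callback()` are hypothetical simplifications of the memory-hotplug actions.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_NOTIFY_OK        0x0001	/* assumption */
#define MODEL_NOTIFY_STOP_MASK 0x8000	/* assumption */

/* Encode a negative errno (or 0) as a notifier return value. */
static int model_notifier_from_errno(int err)
{
	assert(err <= 0);
	if (err)
		return MODEL_NOTIFY_STOP_MASK | (MODEL_NOTIFY_OK - err);
	return MODEL_NOTIFY_OK;
}

/* Recover the errno from a notifier return value. */
static int model_notifier_to_errno(int ret)
{
	ret &= ~MODEL_NOTIFY_STOP_MASK;
	return ret > MODEL_NOTIFY_OK ? MODEL_NOTIFY_OK - ret : 0;
}

enum model_action { GOING_ONLINE, GOING_OFFLINE, OFFLINE, ONLINE };

/* Same shape as slab_memory_callback(): dispatch, then convert. */
static int model_memory_callback(enum model_action action, int simulate_enomem)
{
	int ret = 0;

	switch (action) {
	case GOING_ONLINE:
		/* Only this step allocates, so only it can fail here. */
		if (simulate_enomem)
			ret = -ENOMEM;
		break;
	case GOING_OFFLINE:
	case OFFLINE:
	case ONLINE:
		break;
	}
	return ret ? model_notifier_from_errno(ret) : MODEL_NOTIFY_OK;
}
```

The stop bit lets the chain abort the hotplug operation, while the low bits still carry enough information for the caller to recover the original error code.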
2007-10-22 03:41:37 +04:00
2007-05-07 01:49:36 +04:00
/********************************************************************
* Basic setup of slabs
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
2010-08-20 21:37:15 +04:00
/*
* Used for early kmem_cache structures that were allocated using
2012-11-28 20:23:07 +04:00
* the page allocator . Allocate them properly then fix up the pointers
* that may be pointing to the wrong kmem_cache structure .
2010-08-20 21:37:15 +04:00
*/
2012-11-28 20:23:07 +04:00
static struct kmem_cache * __init bootstrap ( struct kmem_cache * static_cache )
2010-08-20 21:37:15 +04:00
{
int node ;
2012-11-28 20:23:07 +04:00
struct kmem_cache * s = kmem_cache_zalloc ( kmem_cache , GFP_NOWAIT ) ;
2010-08-20 21:37:15 +04:00
2012-11-28 20:23:07 +04:00
memcpy ( s , static_cache , kmem_cache - > object_size ) ;
2010-08-20 21:37:15 +04:00
2013-02-22 20:20:00 +04:00
/*
* This runs very early , and only the boot processor is supposed to be
* up . Even if it weren ' t true , IRQs are not up so we couldn ' t fire
* IPIs around .
*/
__flush_cpu_slab ( s , smp_processor_id ( ) ) ;
2010-08-20 21:37:15 +04:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
struct kmem_cache_node * n = get_node ( s , node ) ;
struct page * p ;
if ( n ) {
list_for_each_entry ( p , & n - > partial , lru )
2012-10-22 18:05:36 +04:00
p - > slab_cache = s ;
2010-08-20 21:37:15 +04:00
2011-04-12 11:22:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-08-20 21:37:15 +04:00
list_for_each_entry ( p , & n - > full , lru )
2012-10-22 18:05:36 +04:00
p - > slab_cache = s ;
2010-08-20 21:37:15 +04:00
# endif
}
}
2012-11-28 20:23:07 +04:00
list_add ( & s - > list , & slab_caches ) ;
return s ;
2010-08-20 21:37:15 +04:00
}
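The core of `bootstrap` is a copy-then-repoint step: the statically allocated cache descriptor is copied into properly allocated memory, and every page that still points at the static copy is redirected to the new one. A simplified userspace sketch of that step follows. The `model_cache` and `model_page` types, the singly linked page list, and the use of `malloc()` are all assumptions for illustration; per-CPU flushing and the per-node lists are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct model_cache;

struct model_page {
	struct model_cache *slab_cache;	/* back-pointer to the owning cache */
	struct model_page *next;
};

struct model_cache {
	const char *name;
	struct model_page *partial;	/* singly linked for brevity */
};

/*
 * Copy a statically allocated cache into heap memory and repoint the
 * back-pointers of its pages. Returns NULL if the allocation fails.
 */
static struct model_cache *bootstrap_model(const struct model_cache *static_cache)
{
	struct model_cache *s;

	assert(static_cache != NULL);
	s = malloc(sizeof(*s));
	if (!s)
		return NULL;

	memcpy(s, static_cache, sizeof(*s));

	/* Pages still reference the static copy; fix them up. */
	for (struct model_page *p = s->partial; p; p = p->next)
		p->slab_cache = s;

	return s;
}
```

Without the fix-up loop, freeing an object from one of those pages would look up the stale static descriptor instead of the live one.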
2007-05-07 01:49:36 +04:00
void __init kmem_cache_init ( void )
{
2012-11-28 20:23:07 +04:00
static __initdata struct kmem_cache boot_kmem_cache ,
boot_kmem_cache_node ;
2010-08-20 21:37:15 +04:00
2012-01-11 03:07:32 +04:00
if ( debug_guardpage_minorder ( ) )
slub_max_order = 0 ;
2012-11-28 20:23:07 +04:00
kmem_cache_node = & boot_kmem_cache_node ;
kmem_cache = & boot_kmem_cache ;
2010-08-20 21:37:15 +04:00
2012-11-28 20:23:07 +04:00
create_boot_cache ( kmem_cache_node , " kmem_cache_node " ,
sizeof ( struct kmem_cache_node ) , SLAB_HWCACHE_ALIGN ) ;
2007-10-22 03:41:37 +04:00
2013-04-30 02:08:06 +04:00
register_hotmemory_notifier ( & slab_memory_callback_nb ) ;
2007-05-07 01:49:36 +04:00
/* Able to allocate the per node structures */
slab_state = PARTIAL ;
2012-11-28 20:23:07 +04:00
create_boot_cache ( kmem_cache , " kmem_cache " ,
offsetof ( struct kmem_cache , node ) +
nr_node_ids * sizeof ( struct kmem_cache_node * ) ,
SLAB_HWCACHE_ALIGN ) ;
2012-09-05 03:18:33 +04:00
2012-11-28 20:23:07 +04:00
kmem_cache = bootstrap ( & boot_kmem_cache ) ;
2007-05-07 01:49:36 +04:00
2010-08-20 21:37:15 +04:00
/*
* Allocate kmem_cache_node properly from the kmem_cache slab .
* kmem_cache_node is separately allocated so no need to
* update any list pointers .
*/
2012-11-28 20:23:07 +04:00
kmem_cache_node = bootstrap ( & boot_kmem_cache_node ) ;
2010-08-20 21:37:15 +04:00
/* Now we can use the kmem_cache to allocate kmalloc slabs */
2013-01-10 23:12:17 +04:00
create_kmalloc_caches ( 0 ) ;
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_SMP
register_cpu_notifier ( & slab_notifier ) ;
2009-12-19 01:26:20 +03:00
# endif
2007-05-07 01:49:36 +04:00
2008-02-06 04:57:39 +03:00
printk ( KERN_INFO
2013-01-10 23:12:17 +04:00
" SLUB: HWalign=%d, Order=%d-%d, MinObjects=%d, "
2007-06-16 21:16:13 +04:00
" CPUs=%d, Nodes=%d \n " ,
2013-01-10 23:12:17 +04:00
cache_line_size ( ) ,
2007-05-07 01:49:36 +04:00
slub_min_order , slub_max_order , slub_min_objects ,
nr_cpu_ids , nr_node_ids ) ;
}
2009-06-12 15:03:06 +04:00
void __init kmem_cache_init_late ( void )
{
}
2007-05-07 01:49:36 +04:00
/*
* Find a mergeable slab cache
*/
static int slab_unmergeable ( struct kmem_cache * s )
{
if ( slub_nomerge | | ( s - > flags & SLUB_NEVER_MERGE ) )
return 1 ;
memcg, slab: never try to merge memcg caches
When a kmem cache is created (kmem_cache_create_memcg()), we first try to
find a compatible cache that already exists and can handle requests from
the new cache, i.e. has the same object size, alignment, ctor, etc. If
there is such a cache, we do not create any new caches, instead we simply
increment the refcount of the cache found and return it.
Currently we do this procedure not only when creating root caches, but
also for memcg caches. However, there is no point in that, because, as
every memcg cache has exactly the same parameters as its parent and cache
merging cannot be turned off in runtime (only on boot by passing
"slub_nomerge"), the root caches of any two potentially mergeable memcg
caches should be merged already, i.e. it must be the same root cache, and
therefore we couldn't even get to the memcg cache creation, because it
already exists.
The only exception is boot caches - they are explicitly forbidden to be
merged by setting their refcount to -1. There are currently only two of
them - kmem_cache and kmem_cache_node, which are used in slab internals (I
do not count kmalloc caches as their refcount is set to 1 immediately
after creation). Since they are already prevented from merging, I
guess we should avoid merging their children too.
So let's remove the useless code responsible for merging memcg caches.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 02:39:23 +04:00
if ( ! is_root_cache ( s ) )
return 1 ;
2007-05-17 09:10:50 +04:00
if ( s - > ctor )
2007-05-07 01:49:36 +04:00
return 1 ;
2007-05-31 11:40:51 +04:00
/*
* We may have set a slab to be unmergeable during bootstrap .
*/
if ( s - > refcount < 0 )
return 1 ;
2007-05-07 01:49:36 +04:00
return 0 ;
}
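The checks in `slab_unmergeable` amount to a short predicate over a few properties of the cache. A userspace sketch of that predicate is given below. `struct model_cache_info`, the `MODEL_NEVER_MERGE` bit, and the `nomerge` parameter (standing in for the global `slub_nomerge`) are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NEVER_MERGE 0x1u	/* hypothetical flag bit */

struct model_cache_info {
	unsigned int flags;
	bool has_ctor;	/* cache has a constructor */
	bool is_root;	/* false for per-memcg child caches */
	int refcount;	/* negative marks a boot cache */
};

/* Returns true if the cache must never be shared with another user. */
static bool model_unmergeable(const struct model_cache_info *s, bool nomerge)
{
	assert(s != NULL);

	if (nomerge || (s->flags & MODEL_NEVER_MERGE))
		return true;
	if (!s->is_root)
		return true;
	if (s->has_ctor)
		return true;
	if (s->refcount < 0)
		return true;
	return false;
}
```

Each test is independent, so a cache is mergeable only when none of the conditions apply.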
memcg, slab: never try to merge memcg caches
When a kmem cache is created (kmem_cache_create_memcg()), we first try to
find a compatible cache that already exists and can handle requests from
the new cache, i.e. has the same object size, alignment, ctor, etc. If
there is such a cache, we do not create any new caches, instead we simply
increment the refcount of the cache found and return it.
Currently we do this procedure not only when creating root caches, but
also for memcg caches. However, there is no point in that, because, as
every memcg cache has exactly the same parameters as its parent and cache
merging cannot be turned off in runtime (only on boot by passing
"slub_nomerge"), the root caches of any two potentially mergeable memcg
caches should be merged already, i.e. it must be the same root cache, and
therefore we couldn't even get to the memcg cache creation, because it
already exists.
The only exception is boot caches - they are explicitly forbidden to be
merged by setting their refcount to -1. There are currently only two of
them - kmem_cache and kmem_cache_node, which are used in slab internals (I
do not count kmalloc caches as their refcount is set to 1 immediately
after creation). Since they are prevented from merging to begin with, I
guess we should avoid merging their children too.
So let's remove the useless code responsible for merging memcg caches.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 02:39:23 +04:00
static struct kmem_cache * find_mergeable ( size_t size , size_t align ,
unsigned long flags , const char * name , void ( * ctor ) ( void * ) )
2007-05-07 01:49:36 +04:00
{
2007-07-17 15:03:19 +04:00
struct kmem_cache * s ;
2007-05-07 01:49:36 +04:00
if ( slub_nomerge | | ( flags & SLUB_NEVER_MERGE ) )
return NULL ;
2007-05-17 09:10:50 +04:00
if ( ctor )
2007-05-07 01:49:36 +04:00
return NULL ;
size = ALIGN ( size , sizeof ( void * ) ) ;
align = calculate_alignment ( flags , align , size ) ;
size = ALIGN ( size , align ) ;
2007-09-12 02:24:11 +04:00
flags = kmem_cache_flags ( size , flags , name , NULL ) ;
2007-05-07 01:49:36 +04:00
2007-07-17 15:03:19 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
2007-05-07 01:49:36 +04:00
if ( slab_unmergeable ( s ) )
continue ;
if ( size > s - > size )
continue ;
2007-09-12 02:24:11 +04:00
if ( ( flags & SLUB_MERGE_SAME ) ! = ( s - > flags & SLUB_MERGE_SAME ) )
2014-04-08 02:39:23 +04:00
continue ;
2007-05-07 01:49:36 +04:00
/*
* Check if alignment is compatible .
* Courtesy of Adrian Drzewiecki
*/
2008-01-08 10:20:27 +03:00
if ( ( s - > size & ~ ( align - 1 ) ) ! = s - > size )
2007-05-07 01:49:36 +04:00
continue ;
if ( s - > size - size > = sizeof ( void * ) )
continue ;
return s ;
}
return NULL ;
}
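The size and alignment tests in the find_mergeable() loop above can be tried out in isolation. The following is a minimal user-space sketch, not the kernel code: align_up() and size_align_compatible() are hypothetical names, and the sketch assumes the requested alignment is already final (it skips calculate_alignment(), the flag comparison, and slab_unmergeable()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Round x up to the next multiple of a; a must be a power of two. */
static size_t align_up(size_t x, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/*
 * Hypothetical helper mirroring the three size/alignment tests of the loop.
 * cache_size is the slot size of an existing cache; size and align describe
 * the cache being requested.
 */
static bool size_align_compatible(size_t cache_size, size_t size, size_t align)
{
	size = align_up(size, sizeof(void *));
	size = align_up(size, align);

	if (size > cache_size)
		return false;	/* the request does not fit in a slot */
	if ((cache_size & ~(align - 1)) != cache_size)
		return false;	/* slot size is not a multiple of align */
	if (cache_size - size >= sizeof(void *))
		return false;	/* a pointer's worth or more would be wasted */
	return true;
}
```

Under these assumptions a 57-byte request with 8-byte alignment can share a cache with 64-byte slots, while a 64-byte request cannot share a cache with 128-byte slots because too much of each slot would go unused.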
2012-12-19 02:22:34 +04:00
struct kmem_cache *
2014-04-08 02:39:23 +04:00
__kmem_cache_alias ( const char * name , size_t size , size_t align ,
unsigned long flags , void ( * ctor ) ( void * ) )
2007-05-07 01:49:36 +04:00
{
struct kmem_cache * s ;
2014-04-08 02:39:23 +04:00
s = find_mergeable ( size , align , flags , name , ctor ) ;
2007-05-07 01:49:36 +04:00
if ( s ) {
2014-04-08 02:39:29 +04:00
int i ;
struct kmem_cache * c ;
2007-05-07 01:49:36 +04:00
s - > refcount + + ;
2014-04-08 02:39:29 +04:00
2007-05-07 01:49:36 +04:00
/*
* Adjust the object sizes so that we clear
* the complete object on kzalloc .
*/
2012-06-13 19:24:57 +04:00
s - > object_size = max ( s - > object_size , ( int ) size ) ;
2007-05-07 01:49:36 +04:00
s - > inuse = max_t ( int , s - > inuse , ALIGN ( size , sizeof ( void * ) ) ) ;
2008-02-16 10:45:26 +03:00
2014-04-08 02:39:29 +04:00
for_each_memcg_cache_index ( i ) {
c = cache_from_memcg_idx ( s , i ) ;
if ( ! c )
continue ;
c - > object_size = s - > object_size ;
c - > inuse = max_t ( int , c - > inuse ,
ALIGN ( size , sizeof ( void * ) ) ) ;
}
2008-12-18 09:09:46 +03:00
if ( sysfs_slab_alias ( s , name ) ) {
s - > refcount - - ;
2012-09-05 04:18:32 +04:00
s = NULL ;
2008-12-18 09:09:46 +03:00
}
2007-07-17 15:03:31 +04:00
}
2008-02-16 10:45:26 +03:00
2012-09-05 04:18:32 +04:00
return s ;
}
2010-09-15 00:21:12 +04:00
2012-09-05 03:18:33 +04:00
int __kmem_cache_create ( struct kmem_cache * s , unsigned long flags )
2012-09-05 04:18:32 +04:00
{
2012-09-05 13:07:44 +04:00
int err ;
err = kmem_cache_open ( s , flags ) ;
if ( err )
return err ;
2012-07-07 00:25:13 +04:00
2012-11-28 20:23:07 +04:00
/* Mutex is not taken during early boot */
if ( slab_state < = UP )
return 0 ;
slub: slub-specific propagation changes
SLUB allows us to tune a particular cache behavior with sysfs-based
tunables. When creating a new memcg cache copy, we'd like to preserve any
tunables the parent cache already had.
This can be done by tapping into the store attribute function provided by
the allocator. We of course don't need to mess with read-only fields.
Since the attributes can have multiple types and are stored internally by
sysfs, the best strategy is to issue a ->show() in the root cache, and
then ->store() in the memcg cache.
The drawback is that sysfs can allocate up to a page of buffer space for
show(), which we are unlikely to need but cannot rule out. To avoid
always allocating a page, we can update the caches at store time with the
maximum attribute size ever stored to the root cache. We will then get a
buffer big enough to hold it. The corollary is that if no stores
happened, nothing will be propagated.
It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.
[akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 02:23:05 +04:00
memcg_propagate_slab_attrs ( s ) ;
2012-09-05 13:07:44 +04:00
err = sysfs_slab_add ( s ) ;
if ( err )
kmem_cache_close ( s ) ;
2012-07-07 00:25:13 +04:00
2012-09-05 13:07:44 +04:00
return err ;
2007-05-07 01:49:36 +04:00
}
# ifdef CONFIG_SMP
/*
2007-05-09 13:32:39 +04:00
* Use the cpu notifier to ensure that the cpu slabs are flushed when
* necessary .
2007-05-07 01:49:36 +04:00
*/
2013-06-19 22:53:51 +04:00
static int slab_cpuup_callback ( struct notifier_block * nfb ,
2007-05-07 01:49:36 +04:00
unsigned long action , void * hcpu )
{
long cpu = ( long ) hcpu ;
2007-07-17 15:03:19 +04:00
struct kmem_cache * s ;
unsigned long flags ;
2007-05-07 01:49:36 +04:00
switch ( action ) {
case CPU_UP_CANCELED :
2007-05-09 13:35:10 +04:00
case CPU_UP_CANCELED_FROZEN :
2007-05-07 01:49:36 +04:00
case CPU_DEAD :
2007-05-09 13:35:10 +04:00
case CPU_DEAD_FROZEN :
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2007-07-17 15:03:19 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
local_irq_save ( flags ) ;
__flush_cpu_slab ( s , cpu ) ;
local_irq_restore ( flags ) ;
}
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-05-07 01:49:36 +04:00
break ;
default :
break ;
}
return NOTIFY_OK ;
}
2013-06-19 22:53:51 +04:00
static struct notifier_block slab_notifier = {
2008-02-06 04:57:39 +03:00
. notifier_call = slab_cpuup_callback
2008-01-08 10:20:27 +03:00
} ;
2007-05-07 01:49:36 +04:00
# endif
2008-08-19 21:43:25 +04:00
void * __kmalloc_track_caller ( size_t size , gfp_t gfpflags , unsigned long caller )
2007-05-07 01:49:36 +04:00
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-24 21:49:35 +04:00
void * ret ;
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) )
2008-02-11 23:47:46 +03:00
return kmalloc_large ( size , gfpflags ) ;
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , gfpflags ) ;
2007-05-07 01:49:36 +04:00
2007-10-16 12:24:44 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2007-05-07 01:49:36 +04:00
2012-09-09 00:47:58 +04:00
ret = slab_alloc ( s , gfpflags , caller ) ;
2008-08-24 21:49:35 +04:00
2011-03-31 05:57:33 +04:00
/* Honor the call site pointer we received. */
2009-03-23 16:12:24 +03:00
trace_kmalloc ( caller , ret , size , s - > size , gfpflags ) ;
2008-08-24 21:49:35 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
2010-09-29 16:02:15 +04:00
# ifdef CONFIG_NUMA
2007-05-07 01:49:36 +04:00
void * __kmalloc_node_track_caller ( size_t size , gfp_t gfpflags ,
2008-08-19 21:43:25 +04:00
int node , unsigned long caller )
2007-05-07 01:49:36 +04:00
{
2007-10-16 12:24:38 +04:00
struct kmem_cache * s ;
2008-08-24 21:49:35 +04:00
void * ret ;
2007-10-16 12:24:38 +04:00
2013-01-10 23:14:19 +04:00
if ( unlikely ( size > KMALLOC_MAX_CACHE_SIZE ) ) {
2010-04-08 13:26:44 +04:00
ret = kmalloc_large_node ( size , gfpflags , node ) ;
trace_kmalloc_node ( caller , ret ,
size , PAGE_SIZE < < get_order ( size ) ,
gfpflags , node ) ;
return ret ;
}
2008-02-11 23:47:46 +03:00
2013-01-10 23:14:19 +04:00
s = kmalloc_slab ( size , gfpflags ) ;
2007-05-07 01:49:36 +04:00
2007-10-16 12:24:44 +04:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 15:03:22 +04:00
return s ;
2007-05-07 01:49:36 +04:00
2012-09-09 00:47:58 +04:00
ret = slab_alloc_node ( s , gfpflags , node , caller ) ;
2008-08-24 21:49:35 +04:00
2011-03-31 05:57:33 +04:00
/* Honor the call site pointer we received. */
2009-03-23 16:12:24 +03:00
trace_kmalloc_node ( caller , ret , size , s - > size , gfpflags , node ) ;
2008-08-24 21:49:35 +04:00
return ret ;
2007-05-07 01:49:36 +04:00
}
2010-09-29 16:02:15 +04:00
# endif
2007-05-07 01:49:36 +04:00
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SYSFS
2008-04-14 20:11:40 +04:00
static int count_inuse ( struct page * page )
{
return page - > inuse ;
}
static int count_total ( struct page * page )
{
return page - > objects ;
}
2010-10-05 22:57:26 +04:00
# endif
2008-04-14 20:11:40 +04:00
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2007-07-17 15:03:30 +04:00
static int validate_slab ( struct kmem_cache * s , struct page * page ,
unsigned long * map )
2007-05-07 01:49:43 +04:00
{
void * p ;
2008-03-02 00:40:44 +03:00
void * addr = page_address ( page ) ;
2007-05-07 01:49:43 +04:00
if ( ! check_slab ( s , page ) | |
! on_freelist ( s , page , NULL ) )
return 0 ;
/* Now we know that a valid freelist exists */
2008-04-14 20:11:30 +04:00
bitmap_zero ( map , page - > objects ) ;
2007-05-07 01:49:43 +04:00
2011-04-15 23:48:13 +04:00
get_map ( s , page , map ) ;
for_each_object ( p , s , addr , page - > objects ) {
if ( test_bit ( slab_index ( p , s , addr ) , map ) )
if ( ! check_object ( s , page , p , SLUB_RED_INACTIVE ) )
return 0 ;
2007-05-07 01:49:43 +04:00
}
2008-04-14 20:11:31 +04:00
for_each_object ( p , s , addr , page - > objects )
2007-05-09 13:32:40 +04:00
if ( ! test_bit ( slab_index ( p , s , addr ) , map ) )
2010-12-01 21:04:20 +03:00
if ( ! check_object ( s , page , p , SLUB_RED_ACTIVE ) )
2007-05-07 01:49:43 +04:00
return 0 ;
return 1 ;
}
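validate_slab() above first records every free object in a bitmap and then walks all objects, treating a set bit as "free" and a clear bit as "allocated". The sketch below shows that two-pass pattern in user space. It is an illustration only: map_set(), map_test(), and count_allocated() are hypothetical stand-ins, and the free list is assumed to be given as a plain array of object indices rather than a linked freelist.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)
#define ULONGS_FOR(bits) (((bits) + BITS_PER_ULONG - 1) / BITS_PER_ULONG)

/* Set bit i in the bitmap. */
static void map_set(unsigned long *map, size_t i)
{
	map[i / BITS_PER_ULONG] |= 1UL << (i % BITS_PER_ULONG);
}

/* Return 1 if bit i is set, 0 otherwise. */
static int map_test(const unsigned long *map, size_t i)
{
	return (map[i / BITS_PER_ULONG] >> (i % BITS_PER_ULONG)) & 1UL;
}

/*
 * Pass 1: clear the map and mark every index found on the free list.
 * Pass 2: count the objects whose bit is clear, i.e. still allocated.
 */
static size_t count_allocated(const size_t *free_idx, size_t nfree,
			      size_t objects, unsigned long *map)
{
	size_t i, used = 0;

	memset(map, 0, ULONGS_FOR(objects) * sizeof(*map));
	for (i = 0; i < nfree; i++) {
		assert(free_idx[i] < objects);
		map_set(map, free_idx[i]);
	}
	for (i = 0; i < objects; i++)
		if (!map_test(map, i))
			used++;
	return used;
}
```

The caller supplies the map so that one buffer, sized for the largest slab, can be reused across slabs, which is the reason validate_slab_cache() allocates it once up front.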
2007-07-17 15:03:30 +04:00
static void validate_slab_slab ( struct kmem_cache * s , struct page * page ,
unsigned long * map )
2007-05-07 01:49:43 +04:00
{
2011-06-01 21:25:53 +04:00
slab_lock ( page ) ;
validate_slab ( s , page , map ) ;
slab_unlock ( page ) ;
2007-05-07 01:49:43 +04:00
}
2007-07-17 15:03:30 +04:00
static int validate_slab_node ( struct kmem_cache * s ,
struct kmem_cache_node * n , unsigned long * map )
2007-05-07 01:49:43 +04:00
{
unsigned long count = 0 ;
struct page * page ;
unsigned long flags ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
list_for_each_entry ( page , & n - > partial , lru ) {
2007-07-17 15:03:30 +04:00
validate_slab_slab ( s , page , map ) ;
2007-05-07 01:49:43 +04:00
count + + ;
}
if ( count ! = n - > nr_partial )
printk ( KERN_ERR " SLUB %s: %ld partial slabs counted but "
" counter=%ld \n " , s - > name , count , n - > nr_partial ) ;
if ( ! ( s - > flags & SLAB_STORE_USER ) )
goto out ;
list_for_each_entry ( page , & n - > full , lru ) {
2007-07-17 15:03:30 +04:00
validate_slab_slab ( s , page , map ) ;
2007-05-07 01:49:43 +04:00
count + + ;
}
if ( count ! = atomic_long_read ( & n - > nr_slabs ) )
printk ( KERN_ERR " SLUB: %s %ld slabs counted but "
" counter=%ld \n " , s - > name , count ,
atomic_long_read ( & n - > nr_slabs ) ) ;
out :
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
return count ;
}
2007-07-17 15:03:30 +04:00
static long validate_slab_cache ( struct kmem_cache * s )
2007-05-07 01:49:43 +04:00
{
int node ;
unsigned long count = 0 ;
2008-04-14 20:11:40 +04:00
unsigned long * map = kmalloc ( BITS_TO_LONGS ( oo_objects ( s - > max ) ) *
2007-07-17 15:03:30 +04:00
sizeof ( unsigned long ) , GFP_KERNEL ) ;
if ( ! map )
return - ENOMEM ;
2007-05-07 01:49:43 +04:00
flush_all ( s ) ;
2007-10-16 12:25:33 +04:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-07 01:49:43 +04:00
struct kmem_cache_node * n = get_node ( s , node ) ;
2007-07-17 15:03:30 +04:00
count + = validate_slab_node ( s , n , map ) ;
2007-05-07 01:49:43 +04:00
}
2007-07-17 15:03:30 +04:00
kfree ( map ) ;
2007-05-07 01:49:43 +04:00
return count ;
}
2007-05-07 01:49:45 +04:00
/*
2007-05-09 13:32:39 +04:00
* Generate lists of code addresses where slabcache objects are allocated
2007-05-07 01:49:45 +04:00
* and freed .
*/
struct location {
unsigned long count ;
2008-08-19 21:43:25 +04:00
unsigned long addr ;
2007-05-09 13:32:45 +04:00
long long sum_time ;
long min_time ;
long max_time ;
long min_pid ;
long max_pid ;
2009-01-01 02:42:29 +03:00
DECLARE_BITMAP ( cpus , NR_CPUS ) ;
2007-05-09 13:32:45 +04:00
nodemask_t nodes ;
2007-05-07 01:49:45 +04:00
} ;
struct loc_track {
unsigned long max ;
unsigned long count ;
struct location * loc ;
} ;
static void free_loc_track ( struct loc_track * t )
{
if ( t - > max )
free_pages ( ( unsigned long ) t - > loc ,
get_order ( sizeof ( struct location ) * t - > max ) ) ;
}
2007-07-17 15:03:20 +04:00
static int alloc_loc_track ( struct loc_track * t , unsigned long max , gfp_t flags )
2007-05-07 01:49:45 +04:00
{
struct location * l ;
int order ;
order = get_order ( sizeof ( struct location ) * max ) ;
2007-07-17 15:03:20 +04:00
l = ( void * ) __get_free_pages ( flags , order ) ;
2007-05-07 01:49:45 +04:00
if ( ! l )
return 0 ;
if ( t - > count ) {
memcpy ( l , t - > loc , sizeof ( struct location ) * t - > count ) ;
free_loc_track ( t ) ;
}
t - > max = max ;
t - > loc = l ;
return 1 ;
}
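alloc_loc_track() above grows the location table by allocating a larger buffer, copying the live entries across, and releasing the old buffer. A minimal user-space sketch of the same pattern follows, assuming malloc()/free() in place of __get_free_pages()/free_pages(); struct vec and vec_grow() are hypothetical names, and unlike the kernel function the sketch releases the old buffer unconditionally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct vec {
	size_t max;	/* capacity in elements */
	size_t count;	/* elements currently in use */
	int *items;
};

/* Grow (or first allocate) the buffer to hold max elements. Returns 1 on success. */
static int vec_grow(struct vec *v, size_t max)
{
	int *p;

	assert(max >= v->count);
	p = malloc(max * sizeof(*p));
	if (!p)
		return 0;	/* old buffer is left untouched on failure */

	if (v->count)
		memcpy(p, v->items, v->count * sizeof(*p));
	free(v->items);		/* free(NULL) is a no-op on first use */

	v->max = max;
	v->items = p;
	return 1;
}
```

Doubling the capacity on each call, as add_location() does with `2 * t->max`, keeps the number of copies logarithmic in the final table size.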
static int add_location ( struct loc_track * t , struct kmem_cache * s ,
2007-05-09 13:32:45 +04:00
const struct track * track )
2007-05-07 01:49:45 +04:00
{
long start , end , pos ;
struct location * l ;
2008-08-19 21:43:25 +04:00
unsigned long caddr ;
2007-05-09 13:32:45 +04:00
unsigned long age = jiffies - track - > when ;
2007-05-07 01:49:45 +04:00
start = - 1 ;
end = t - > count ;
for ( ; ; ) {
pos = start + ( end - start + 1 ) / 2 ;
/*
* There is nothing at " end " . If we end up there
* we need to add something before end .
*/
if ( pos = = end )
break ;
caddr = t - > loc [ pos ] . addr ;
2007-05-09 13:32:45 +04:00
if ( track - > addr = = caddr ) {
l = & t - > loc [ pos ] ;
l - > count + + ;
if ( track - > when ) {
l - > sum_time + = age ;
if ( age < l - > min_time )
l - > min_time = age ;
if ( age > l - > max_time )
l - > max_time = age ;
if ( track - > pid < l - > min_pid )
l - > min_pid = track - > pid ;
if ( track - > pid > l - > max_pid )
l - > max_pid = track - > pid ;
2009-01-01 02:42:29 +03:00
cpumask_set_cpu ( track - > cpu ,
to_cpumask ( l - > cpus ) ) ;
2007-05-09 13:32:45 +04:00
}
node_set ( page_to_nid ( virt_to_page ( track ) ) , l - > nodes ) ;
2007-05-07 01:49:45 +04:00
return 1 ;
}
2007-05-09 13:32:45 +04:00
if ( track - > addr < caddr )
2007-05-07 01:49:45 +04:00
end = pos ;
else
start = pos ;
}
/*
2007-05-09 13:32:39 +04:00
* Not found . Insert new tracking element .
2007-05-07 01:49:45 +04:00
*/
2007-07-17 15:03:20 +04:00
if ( t - > count > = t - > max & & ! alloc_loc_track ( t , 2 * t - > max , GFP_ATOMIC ) )
2007-05-07 01:49:45 +04:00
return 0 ;
l = t - > loc + pos ;
if ( pos < t - > count )
memmove ( l + 1 , l ,
( t - > count - pos ) * sizeof ( struct location ) ) ;
t - > count + + ;
l - > count = 1 ;
2007-05-09 13:32:45 +04:00
l - > addr = track - > addr ;
l - > sum_time = age ;
l - > min_time = age ;
l - > max_time = age ;
l - > min_pid = track - > pid ;
l - > max_pid = track - > pid ;
2009-01-01 02:42:29 +03:00
cpumask_clear ( to_cpumask ( l - > cpus ) ) ;
cpumask_set_cpu ( track - > cpu , to_cpumask ( l - > cpus ) ) ;
2007-05-09 13:32:45 +04:00
nodes_clear ( l - > nodes ) ;
node_set ( page_to_nid ( virt_to_page ( track ) ) , l - > nodes ) ;
2007-05-07 01:49:45 +04:00
return 1 ;
}
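add_location() above keeps its table sorted by address and uses a bisection in which `start` is the last index known to be below the key (or -1) and `end` is the first index known to be above it (or the element count). The sketch below reproduces that search-or-insert loop in user space with a fixed-capacity table. struct entry and count_addr() are hypothetical names, and the time, pid, cpu, and node bookkeeping is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct entry {
	unsigned long addr;
	unsigned long count;
};

/*
 * Count one occurrence of addr in the sorted table tab[0..*n).
 * Returns 1 on success and 0 if a new entry is needed but the table is full.
 */
static int count_addr(struct entry *tab, size_t *n, size_t cap,
		      unsigned long addr)
{
	long start = -1, end = (long)*n, pos;

	assert(*n <= cap);
	for (;;) {
		pos = start + (end - start + 1) / 2;
		if (pos == end)
			break;		/* not found: insert just before end */
		if (tab[pos].addr == addr) {
			tab[pos].count++;
			return 1;
		}
		if (addr < tab[pos].addr)
			end = pos;
		else
			start = pos;
	}

	if (*n >= cap)
		return 0;

	/* Shift the tail up by one slot to open a gap at pos. */
	memmove(&tab[pos + 1], &tab[pos], (*n - (size_t)pos) * sizeof(*tab));
	tab[pos].addr = addr;
	tab[pos].count = 1;
	(*n)++;
	return 1;
}
```

Because the midpoint is rounded up, the loop ends with `pos == end` exactly when the key is absent, and `pos` is then the index at which the new entry keeps the table sorted.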
static void process_slab ( struct loc_track * t , struct kmem_cache * s ,
2010-03-25 00:25:47 +03:00
struct page * page , enum track_item alloc ,
2010-09-29 16:02:13 +04:00
unsigned long * map )
2007-05-07 01:49:45 +04:00
{
2008-03-02 00:40:44 +03:00
void * addr = page_address ( page ) ;
2007-05-07 01:49:45 +04:00
void * p ;
2008-04-14 20:11:30 +04:00
bitmap_zero ( map , page - > objects ) ;
2011-04-15 23:48:13 +04:00
get_map ( s , page , map ) ;
2007-05-07 01:49:45 +04:00
2008-04-14 20:11:31 +04:00
for_each_object ( p , s , addr , page - > objects )
2007-05-09 13:32:45 +04:00
if ( ! test_bit ( slab_index ( p , s , addr ) , map ) )
add_location ( t , s , get_track ( s , p , alloc ) ) ;
2007-05-07 01:49:45 +04:00
}
static int list_locations ( struct kmem_cache * s , char * buf ,
enum track_item alloc )
{
2008-02-01 02:20:50 +03:00
int len = 0 ;
2007-05-07 01:49:45 +04:00
unsigned long i ;
2007-07-17 15:03:20 +04:00
struct loc_track t = { 0 , 0 , NULL } ;
2007-05-07 01:49:45 +04:00
int node ;
2010-03-25 00:25:47 +03:00
unsigned long * map = kmalloc ( BITS_TO_LONGS ( oo_objects ( s - > max ) ) *
sizeof ( unsigned long ) , GFP_KERNEL ) ;
2007-05-07 01:49:45 +04:00
2010-03-25 00:25:47 +03:00
if ( ! map | | ! alloc_loc_track ( & t , PAGE_SIZE / sizeof ( struct location ) ,
GFP_TEMPORARY ) ) {
kfree ( map ) ;
2007-07-17 15:03:20 +04:00
return sprintf ( buf , " Out of memory \n " ) ;
2010-03-25 00:25:47 +03:00
}
2007-05-07 01:49:45 +04:00
/* Push back cpu slabs */
flush_all ( s ) ;
2007-10-16 12:25:33 +04:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-07 01:49:45 +04:00
struct kmem_cache_node * n = get_node ( s , node ) ;
unsigned long flags ;
struct page * page ;
2007-08-23 01:01:56 +04:00
if ( ! atomic_long_read ( & n - > nr_slabs ) )
2007-05-07 01:49:45 +04:00
continue ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
list_for_each_entry ( page , & n - > partial , lru )
2010-03-25 00:25:47 +03:00
process_slab ( & t , s , page , alloc , map ) ;
2007-05-07 01:49:45 +04:00
list_for_each_entry ( page , & n - > full , lru )
2010-03-25 00:25:47 +03:00
process_slab ( & t , s , page , alloc , map ) ;
2007-05-07 01:49:45 +04:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
}
for ( i = 0 ; i < t . count ; i + + ) {
2007-05-09 13:32:45 +04:00
struct location * l = & t . loc [ i ] ;
2007-05-07 01:49:45 +04:00
2008-12-10 00:14:27 +03:00
if ( len > PAGE_SIZE - KSYM_SYMBOL_LEN - 100 )
2007-05-07 01:49:45 +04:00
break ;
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " %7ld " , l - > count ) ;
2007-05-09 13:32:45 +04:00
if ( l - > addr )
2011-01-14 02:45:52 +03:00
len + = sprintf ( buf + len , " %pS " , ( void * ) l - > addr ) ;
2007-05-07 01:49:45 +04:00
else
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " <not-available> " ) ;
2007-05-09 13:32:45 +04:00
if ( l - > sum_time ! = l - > min_time ) {
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " age=%ld/%ld/%ld " ,
2008-05-01 15:34:31 +04:00
l - > min_time ,
( long ) div_u64 ( l - > sum_time , l - > count ) ,
l - > max_time ) ;
2007-05-09 13:32:45 +04:00
} else
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " age=%ld " ,
2007-05-09 13:32:45 +04:00
l - > min_time ) ;
if ( l - > min_pid ! = l - > max_pid )
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " pid=%ld-%ld " ,
2007-05-09 13:32:45 +04:00
l - > min_pid , l - > max_pid ) ;
else
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " pid=%ld " ,
2007-05-09 13:32:45 +04:00
l - > min_pid ) ;
2009-01-01 02:42:29 +03:00
if ( num_online_cpus ( ) > 1 & &
! cpumask_empty ( to_cpumask ( l - > cpus ) ) & &
2008-02-01 02:20:50 +03:00
len < PAGE_SIZE - 60 ) {
len + = sprintf ( buf + len , " cpus= " ) ;
2013-07-15 05:05:29 +04:00
len + = cpulist_scnprintf ( buf + len ,
PAGE_SIZE - len - 50 ,
2009-01-01 02:42:29 +03:00
to_cpumask ( l - > cpus ) ) ;
2007-05-09 13:32:45 +04:00
}
2009-06-17 02:32:15 +04:00
if ( nr_online_nodes > 1 & & ! nodes_empty ( l - > nodes ) & &
2008-02-01 02:20:50 +03:00
len < PAGE_SIZE - 60 ) {
len + = sprintf ( buf + len , " nodes= " ) ;
2013-07-15 05:05:29 +04:00
len + = nodelist_scnprintf ( buf + len ,
PAGE_SIZE - len - 50 ,
l - > nodes ) ;
2007-05-09 13:32:45 +04:00
}
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf + len , " \n " ) ;
2007-05-07 01:49:45 +04:00
}
free_loc_track ( & t ) ;
2010-03-25 00:25:47 +03:00
kfree ( map ) ;
2007-05-07 01:49:45 +04:00
if ( ! t . count )
2008-02-01 02:20:50 +03:00
len + = sprintf ( buf , " No data \n " ) ;
return len ;
2007-05-07 01:49:45 +04:00
}
2010-10-05 22:57:26 +04:00
# endif
2007-05-07 01:49:45 +04:00
2010-10-05 22:57:27 +04:00
# ifdef SLUB_RESILIENCY_TEST
static void resiliency_test ( void )
{
u8 * p ;
2013-01-10 23:14:19 +04:00
BUILD_BUG_ON ( KMALLOC_MIN_SIZE > 16 | | KMALLOC_SHIFT_HIGH < 10 ) ;
2010-10-05 22:57:27 +04:00
printk ( KERN_ERR " SLUB resiliency testing \n " ) ;
printk ( KERN_ERR " ----------------------- \n " ) ;
printk ( KERN_ERR " A. Corruption after allocation \n " ) ;
p = kzalloc ( 16 , GFP_KERNEL ) ;
p [ 16 ] = 0x12 ;
printk ( KERN_ERR " \n 1. kmalloc-16: Clobber Redzone/next pointer "
" 0x12->0x%p \n \n " , p + 16 ) ;
validate_slab_cache ( kmalloc_caches [ 4 ] ) ;
/* Hmmm... The next two are dangerous */
p = kzalloc ( 32 , GFP_KERNEL ) ;
p [ 32 + sizeof ( void * ) ] = 0x34 ;
printk ( KERN_ERR " \n 2. kmalloc-32: Clobber next pointer/next slab "
" 0x34 -> -0x%p \n " , p ) ;
printk ( KERN_ERR
" If allocated object is overwritten then not detectable \n \n " ) ;
validate_slab_cache ( kmalloc_caches [ 5 ] ) ;
p = kzalloc ( 64 , GFP_KERNEL ) ;
p + = 64 + ( get_cycles ( ) & 0xff ) * sizeof ( void * ) ;
* p = 0x56 ;
printk ( KERN_ERR " \n 3. kmalloc-64: corrupting random byte 0x56->0x%p \n " ,
p ) ;
printk ( KERN_ERR
" If allocated object is overwritten then not detectable \n \n " ) ;
validate_slab_cache ( kmalloc_caches [ 6 ] ) ;
printk ( KERN_ERR " \n B. Corruption after free \n " ) ;
p = kzalloc ( 128 , GFP_KERNEL ) ;
kfree ( p ) ;
* p = 0x78 ;
printk ( KERN_ERR " 1. kmalloc-128: Clobber first word 0x78->0x%p \n \n " , p ) ;
validate_slab_cache ( kmalloc_caches [ 7 ] ) ;
p = kzalloc ( 256 , GFP_KERNEL ) ;
kfree ( p ) ;
p [ 50 ] = 0x9a ;
printk ( KERN_ERR " \n 2. kmalloc-256: Clobber 50th byte 0x9a->0x%p \n \n " ,
p ) ;
validate_slab_cache ( kmalloc_caches [ 8 ] ) ;
p = kzalloc ( 512 , GFP_KERNEL ) ;
kfree ( p ) ;
p [ 512 ] = 0xab ;
printk ( KERN_ERR " \n 3. kmalloc-512: Clobber redzone 0xab->0x%p \n \n " , p ) ;
validate_slab_cache ( kmalloc_caches [ 9 ] ) ;
}
# else
# ifdef CONFIG_SYSFS
static void resiliency_test ( void ) { } ;
# endif
# endif
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SYSFS
2007-05-07 01:49:36 +04:00
enum slab_stat_type {
2008-04-14 20:11:40 +04:00
SL_ALL , /* All slabs */
SL_PARTIAL , /* Only partially allocated slabs */
SL_CPU , /* Only slabs used for cpu caches */
SL_OBJECTS , /* Determine allocated objects not slabs */
SL_TOTAL /* Determine object capacity not slabs */
2007-05-07 01:49:36 +04:00
} ;
2008-04-14 20:11:40 +04:00
# define SO_ALL (1 << SL_ALL)
2007-05-07 01:49:36 +04:00
# define SO_PARTIAL (1 << SL_PARTIAL)
# define SO_CPU (1 << SL_CPU)
# define SO_OBJECTS (1 << SL_OBJECTS)
2008-04-14 20:11:40 +04:00
# define SO_TOTAL (1 << SL_TOTAL)
2007-05-07 01:49:36 +04:00
2008-03-02 23:28:24 +03:00
static ssize_t show_slab_objects ( struct kmem_cache * s ,
char * buf , unsigned long flags )
2007-05-07 01:49:36 +04:00
{
unsigned long total = 0 ;
int node ;
int x ;
unsigned long * nodes ;
2013-07-12 04:23:48 +04:00
nodes = kzalloc ( sizeof ( unsigned long ) * nr_node_ids , GFP_KERNEL ) ;
2008-03-02 23:28:24 +03:00
if ( ! nodes )
return - ENOMEM ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:40 +04:00
if ( flags & SO_CPU ) {
int cpu ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:40 +04:00
for_each_possible_cpu ( cpu ) {
2013-07-15 05:05:29 +04:00
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab ,
cpu ) ;
2012-05-09 19:09:56 +04:00
int node ;
2011-08-10 01:12:27 +04:00
struct page * page ;
2007-10-16 12:26:05 +04:00
2011-11-22 19:02:02 +04:00
page = ACCESS_ONCE ( c - > page ) ;
2012-05-09 19:09:56 +04:00
if ( ! page )
continue ;
2008-04-14 20:11:40 +04:00
2012-05-09 19:09:56 +04:00
node = page_to_nid ( page ) ;
if ( flags & SO_TOTAL )
x = page - > objects ;
else if ( flags & SO_OBJECTS )
x = page - > inuse ;
else
x = 1 ;
2011-08-10 01:12:27 +04:00
2012-05-09 19:09:56 +04:00
total + = x ;
nodes [ node ] + = x ;
page = ACCESS_ONCE ( c - > partial ) ;
2011-08-10 01:12:27 +04:00
if ( page ) {
2013-09-10 07:43:37 +04:00
node = page_to_nid ( page ) ;
if ( flags & SO_TOTAL )
WARN_ON_ONCE ( 1 ) ;
else if ( flags & SO_OBJECTS )
WARN_ON_ONCE ( 1 ) ;
else
x = page - > pages ;
2011-11-22 19:02:02 +04:00
total + = x ;
nodes [ node ] + = x ;
2011-08-10 01:12:27 +04:00
}
2007-05-07 01:49:36 +04:00
}
}
2011-01-10 19:15:15 +03:00
lock_memory_hotplug ( ) ;
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2008-04-14 20:11:40 +04:00
if ( flags & SO_ALL ) {
for_each_node_state ( node , N_NORMAL_MEMORY ) {
struct kmem_cache_node * n = get_node ( s , node ) ;
2013-07-15 05:05:29 +04:00
if ( flags & SO_TOTAL )
x = atomic_long_read ( & n - > total_objects ) ;
else if ( flags & SO_OBJECTS )
x = atomic_long_read ( & n - > total_objects ) -
count_partial ( n , count_free ) ;
2007-05-07 01:49:36 +04:00
else
2008-04-14 20:11:40 +04:00
x = atomic_long_read ( & n - > nr_slabs ) ;
2007-05-07 01:49:36 +04:00
total + = x ;
nodes [ node ] + = x ;
}
2010-10-05 22:57:26 +04:00
} else
# endif
if ( flags & SO_PARTIAL ) {
2008-04-14 20:11:40 +04:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
struct kmem_cache_node * n = get_node ( s , node ) ;
2007-05-07 01:49:36 +04:00
2008-04-14 20:11:40 +04:00
if ( flags & SO_TOTAL )
x = count_partial ( n , count_total ) ;
else if ( flags & SO_OBJECTS )
x = count_partial ( n , count_inuse ) ;
2007-05-07 01:49:36 +04:00
else
2008-04-14 20:11:40 +04:00
x = n - > nr_partial ;
2007-05-07 01:49:36 +04:00
total + = x ;
nodes [ node ] + = x ;
}
}
x = sprintf ( buf , " %lu " , total ) ;
# ifdef CONFIG_NUMA
2007-10-16 12:25:33 +04:00
for_each_node_state ( node , N_NORMAL_MEMORY )
2007-05-07 01:49:36 +04:00
if ( nodes [ node ] )
x + = sprintf ( buf + x , " N%d=%lu " ,
node , nodes [ node ] ) ;
# endif
2011-01-10 19:15:15 +03:00
unlock_memory_hotplug ( ) ;
2007-05-07 01:49:36 +04:00
kfree ( nodes ) ;
return x + sprintf ( buf + x , " \n " ) ;
}
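The function above emits one line per attribute: a total, followed by one field per NUMA node that has a non-zero count. A userspace reader could split that line as sketched below. This is a minimal sketch and not kernel code: the helper name `parse_slab_counts` is hypothetical, and the exact field spacing (`<total> N<node>=<count> ...`) is an assumption, since this view of the source pads the format strings with extra spaces.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical userspace helper (not part of the kernel): parse a line
 * such as "42 N0=30 N1=12\n". Stores the leading total in *total and
 * each per-node count in per_node[node]. Returns the number of per-node
 * fields parsed, or -1 on malformed input.
 */
int parse_slab_counts(const char *line, unsigned long *total,
                      unsigned long *per_node, int max_nodes)
{
    char *end;
    int found = 0;

    *total = strtoul(line, &end, 10);
    if (end == line)
        return -1;                      /* no leading total */

    for (;;) {
        int node;
        int consumed = 0;
        unsigned long count;

        /* %n records how many characters this field used. */
        if (sscanf(end, " N%d=%lu%n", &node, &count, &consumed) != 2)
            break;                      /* no further node fields */
        if (node < 0 || node >= max_nodes)
            return -1;                  /* node id out of range */
        per_node[node] = count;
        found++;
        end += consumed;
    }
    return found;
}
```

Nodes with a zero count are omitted from the line, so callers should zero `per_node` before parsing.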
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2007-05-07 01:49:36 +04:00
static int any_slab_objects ( struct kmem_cache * s )
{
int node ;
2007-10-16 12:26:05 +04:00
for_each_online_node ( node ) {
2007-05-07 01:49:36 +04:00
struct kmem_cache_node * n = get_node ( s , node ) ;
2007-10-16 12:26:05 +04:00
if ( ! n )
continue ;
2008-05-07 07:42:39 +04:00
if ( atomic_long_read ( & n - > total_objects ) )
2007-05-07 01:49:36 +04:00
return 1 ;
}
return 0 ;
}
2010-10-05 22:57:26 +04:00
# endif
2007-05-07 01:49:36 +04:00
# define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
2011-07-14 16:07:13 +04:00
# define to_slab(n) container_of(n, struct kmem_cache, kobj)
2007-05-07 01:49:36 +04:00
struct slab_attribute {
struct attribute attr ;
ssize_t ( * show ) ( struct kmem_cache * s , char * buf ) ;
ssize_t ( * store ) ( struct kmem_cache * s , const char * x , size_t count ) ;
} ;
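The two `container_of` macros above recover the enclosing `slab_attribute` and `kmem_cache` from the generic `attribute` and `kobject` pointers that sysfs hands back. The userspace sketch below illustrates that recovery step. All `*_sketch` types, `size_attr`, and `dispatch_show` are hypothetical stand-ins; the real dispatch function is not shown in this section.

```c
#include <stddef.h>
#include <stdio.h>

/* Userspace equivalent of the kernel macro, without type checking. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for the kernel structures (assumptions). */
struct attribute { const char *name; };
struct kobject_sketch { int unused; };
struct cache_sketch { int size; struct kobject_sketch kobj; };

struct slab_attribute_sketch {
    struct attribute attr;
    int (*show)(struct cache_sketch *s, char *buf);
};

static int size_show(struct cache_sketch *s, char *buf)
{
    return sprintf(buf, "%d\n", s->size);
}

struct slab_attribute_sketch size_attr = { { "slab_size" }, size_show };

/*
 * Given only the embedded members, step back to the containing
 * structures and call the attribute's show method.
 */
int dispatch_show(struct kobject_sketch *kobj, struct attribute *attr,
                  char *buf)
{
    struct slab_attribute_sketch *a =
        container_of(attr, struct slab_attribute_sketch, attr);
    struct cache_sketch *s = container_of(kobj, struct cache_sketch, kobj);

    if (!a->show)
        return -1;      /* attribute is not readable */
    return a->show(s, buf);
}
```

Embedding the generic member and stepping back with `container_of` lets one pair of sysfs callbacks serve every attribute without a lookup table.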
# define SLAB_ATTR_RO(_name) \
mm: restrict access to slab files under procfs and sysfs
Historically /proc/slabinfo and files under /sys/kernel/slab/* have
world read permissions and are accessible to the world. slabinfo
contains rather private information related both to the kernel and
userspace tasks. Depending on the situation, it might reveal either
private information per se or information useful to make another
targeted attack. Some examples of what can be learned by
reading/watching for /proc/slabinfo entries:
1) dentry (and different *inode*) number might reveal other processes fs
activity. The number of dentry "active objects" doesn't strictly show
file count opened/touched by a process, however, there is a good
correlation between them. The patch "proc: force dcache drop on
unauthorized access" relies on the privacy of dentry count.
2) different inode entries might reveal the same information as (1), but
these are more fine-grained counters. If a filesystem is mounted in a
private mount point (or even a private namespace) and fs type differs from
other mounted fs types, fs activity in this mount point/namespace is
revealed. If there is a single ecryptfs mount point, the whole fs
activity of a single user is revealed. Number of files in ecryptfs
mount point is private information per se.
3) fuse_* reveals number of files / fs activity of a user in a user
private mount point. It is approx. the same severity as ecryptfs
infoleak in (2).
4) sysfs_dir_cache similar to (2) reveals devices' addition/removal,
which can be otherwise hidden by "chmod 0700 /sys/". With 0444 slabinfo
the precise number of sysfs files is known to the world.
5) buffer_head might reveal some kernel activity. With other
information leaks an attacker might identify what specific kernel
routines generate buffer_head activity.
6) *kmalloc* infoleaks are very situational. Attacker should watch for
the specific kmalloc size entry and filter the noise related to the unrelated
kernel activity. If an attacker has a relatively silent victim system, he
might get rather precise counters.
Additional information sources might significantly increase the slabinfo
infoleak benefits. E.g. if an attacker knows that the processes
activity on the system is very low (only core daemons like syslog and
cron), he may run setxid binaries / trigger local daemon activity /
trigger network services activity / await sporadic cron jobs activity
/ etc. and get rather precise counters for fs and network activity of
these privileged tasks, which is unknown otherwise.
Also hiding slabinfo and /sys/kernel/slab/* is one step to complicate
exploitation of kernel heap overflows (and possibly, other bugs). The
related discussion:
http://thread.gmane.org/gmane.linux.kernel/1108378
To keep compatibility with old permission model where non-root
monitoring daemon could watch for kernel memleaks through slabinfo one
should do:
groupadd slabinfo
usermod -a -G slabinfo $MONITOR_USER
And add the following commands to init scripts (to mountall.conf in
Ubuntu's upstart case):
chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Reviewed-by: Kees Cook <kees@ubuntu.com>
Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: David Rientjes <rientjes@google.com>
CC: Valdis.Kletnieks@vt.edu
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Alan Cox <alan@linux.intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-27 21:54:53 +04:00
static struct slab_attribute _name # # _attr = \
__ATTR ( _name , 0400 , _name # # _show , NULL )
2007-05-07 01:49:36 +04:00
# define SLAB_ATTR(_name) \
static struct slab_attribute _name # # _attr = \
2011-09-27 21:54:53 +04:00
__ATTR ( _name , 0600 , _name # # _show , _name # # _store )
2007-05-07 01:49:36 +04:00
static ssize_t slab_size_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > size ) ;
}
SLAB_ATTR_RO ( slab_size ) ;
static ssize_t align_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > align ) ;
}
SLAB_ATTR_RO ( align ) ;
static ssize_t object_size_show ( struct kmem_cache * s , char * buf )
{
2012-06-13 19:24:57 +04:00
return sprintf ( buf , " %d \n " , s - > object_size ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( object_size ) ;
static ssize_t objs_per_slab_show ( struct kmem_cache * s , char * buf )
{
2008-04-14 20:11:31 +04:00
return sprintf ( buf , " %d \n " , oo_objects ( s - > oo ) ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( objs_per_slab ) ;
2008-04-14 20:11:41 +04:00
static ssize_t order_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
2008-04-30 03:11:12 +04:00
unsigned long order ;
int err ;
2013-09-12 01:20:25 +04:00
err = kstrtoul ( buf , 10 , & order ) ;
2008-04-30 03:11:12 +04:00
if ( err )
return err ;
2008-04-14 20:11:41 +04:00
if ( order > slub_max_order | | order < slub_min_order )
return - EINVAL ;
calculate_sizes ( s , order ) ;
return length ;
}
2007-05-07 01:49:36 +04:00
static ssize_t order_show ( struct kmem_cache * s , char * buf )
{
2008-04-14 20:11:31 +04:00
return sprintf ( buf , " %d \n " , oo_order ( s - > oo ) ) ;
2007-05-07 01:49:36 +04:00
}
2008-04-14 20:11:41 +04:00
SLAB_ATTR ( order ) ;
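The store handler above follows a pattern that recurs in this file: parse the whole buffer as an unsigned integer, reject values outside the permitted range, and return the number of bytes consumed on success. The sketch below imitates that flow in userspace. `parse_ulong_strict` is a hypothetical approximation of `kstrtoul` built on `strtoul` (it accepts digits with at most one trailing newline), and `order_store_sketch` takes the limits as parameters instead of reading `slub_min_order` and `slub_max_order`.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical approximation of kstrtoul(buf, 10, out): the string must
 * be decimal digits, optionally followed by a single newline.
 */
int parse_ulong_strict(const char *buf, unsigned long *out)
{
    char *end;
    unsigned long v;

    /* strtoul would accept leading spaces or a sign; reject them here. */
    if (*buf < '0' || *buf > '9')
        return -EINVAL;

    errno = 0;
    v = strtoul(buf, &end, 10);
    if (errno == ERANGE)
        return -ERANGE;
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;         /* trailing garbage */

    *out = v;
    return 0;
}

/*
 * Mirrors the shape of order_store(): a negative errno on failure, the
 * input length on success. *order_out is only written on success.
 */
long order_store_sketch(const char *buf, size_t length,
                        unsigned long min_order, unsigned long max_order,
                        unsigned long *order_out)
{
    unsigned long order;
    int err = parse_ulong_strict(buf, &order);

    if (err)
        return err;
    if (order > max_order || order < min_order)
        return -EINVAL;

    *order_out = order;
    return (long)length;
}
```

Returning the full input length tells the caller that the whole write was consumed, so userspace does not retry with the remainder.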
2007-05-07 01:49:36 +04:00
slub: add min_partial sysfs tunable
Now that a cache's min_partial has been moved to struct kmem_cache, it's
possible to easily tune it from userspace by adding a sysfs attribute.
It may not be desirable to keep a large number of partial slabs around
if a cache is used infrequently and memory, especially when constrained
by a cgroup, is scarce. It's better to allow userspace to set the
minimum policy per cache instead of relying explicitly on
kmem_cache_shrink().
The memory savings from simply moving min_partial from struct
kmem_cache_node to struct kmem_cache is obviously not significant
(unless maybe you're from SGI or something), at the largest it's
# allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)
The true savings occur when userspace reduces the number of partial
slabs that would otherwise be wasted, especially on machines with a
large number of nodes (ia64 with CONFIG_NODES_SHIFT at 10 for default?).
While the kernel estimates ideal values for n->min_partial and
ensures they are within a sane range, userspace has no input other
than writing to /sys/kernel/slab/cache/shrink.
There simply isn't any better heuristic to add when calculating the
partial values for a better estimate that works for all possible caches.
And since it's currently a static value, the user really has no way of
reclaiming that wasted space, which can be significant when constrained
by a cgroup (either cpusets or, later, memory controller slab limits)
without shrinking it entirely.
This also allows the user to specify that increased fragmentation and
more partial slabs are actually desired to avoid the cost of allocating
new slabs at runtime for specific caches.
There's also no reason why this should be a per-struct kmem_cache_node
value in the first place. You could argue that a machine would have
such node size asymmetries that it should be specified on a per-node
basis, but we know nobody is doing that right now since it's a purely
static value at the moment and there's no convenient way to tune that
via slub's sysfs interface.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-02-23 04:40:09 +03:00
static ssize_t min_partial_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %lu \n " , s - > min_partial ) ;
}
static ssize_t min_partial_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
unsigned long min ;
int err ;
2013-09-12 01:20:25 +04:00
err = kstrtoul ( buf , 10 , & min ) ;
2009-02-23 04:40:09 +03:00
if ( err )
return err ;
2009-02-25 10:16:35 +03:00
set_min_partial ( s , min ) ;
2009-02-23 04:40:09 +03:00
return length ;
}
SLAB_ATTR ( min_partial ) ;
2011-08-10 01:12:27 +04:00
static ssize_t cpu_partial_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %u \n " , s - > cpu_partial ) ;
}
static ssize_t cpu_partial_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
unsigned long objects ;
int err ;
2013-09-12 01:20:25 +04:00
err = kstrtoul ( buf , 10 , & objects ) ;
2011-08-10 01:12:27 +04:00
if ( err )
return err ;
2013-06-19 09:05:52 +04:00
if ( objects & & ! kmem_cache_has_cpu_partial ( s ) )
2012-01-10 01:19:45 +04:00
return - EINVAL ;
2011-08-10 01:12:27 +04:00
s - > cpu_partial = objects ;
flush_all ( s ) ;
return length ;
}
SLAB_ATTR ( cpu_partial ) ;
2007-05-07 01:49:36 +04:00
static ssize_t ctor_show ( struct kmem_cache * s , char * buf )
{
2011-01-14 02:45:52 +03:00
if ( ! s - > ctor )
return 0 ;
return sprintf ( buf , " %pS \n " , s - > ctor ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( ctor ) ;
static ssize_t aliases_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > refcount - 1 ) ;
}
SLAB_ATTR_RO ( aliases ) ;
static ssize_t partial_show ( struct kmem_cache * s , char * buf )
{
2008-02-16 02:22:21 +03:00
return show_slab_objects ( s , buf , SO_PARTIAL ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( partial ) ;
static ssize_t cpu_slabs_show ( struct kmem_cache * s , char * buf )
{
2008-02-16 02:22:21 +03:00
return show_slab_objects ( s , buf , SO_CPU ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( cpu_slabs ) ;
static ssize_t objects_show ( struct kmem_cache * s , char * buf )
{
2008-04-14 20:11:40 +04:00
return show_slab_objects ( s , buf , SO_ALL | SO_OBJECTS ) ;
2007-05-07 01:49:36 +04:00
}
SLAB_ATTR_RO ( objects ) ;
2008-04-14 20:11:40 +04:00
static ssize_t objects_partial_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_PARTIAL | SO_OBJECTS ) ;
}
SLAB_ATTR_RO ( objects_partial ) ;
2011-08-10 01:12:27 +04:00
static ssize_t slabs_cpu_partial_show ( struct kmem_cache * s , char * buf )
{
int objects = 0 ;
int pages = 0 ;
int cpu ;
int len ;
for_each_online_cpu ( cpu ) {
struct page * page = per_cpu_ptr ( s - > cpu_slab , cpu ) - > partial ;
if ( page ) {
pages + = page - > pages ;
objects + = page - > pobjects ;
}
}
len = sprintf ( buf , " %d(%d) " , objects , pages ) ;
# ifdef CONFIG_SMP
for_each_online_cpu ( cpu ) {
struct page * page = per_cpu_ptr ( s - > cpu_slab , cpu ) - > partial ;
if ( page & & len < PAGE_SIZE - 20 )
len + = sprintf ( buf + len , " C%d=%d(%d) " , cpu ,
page - > pobjects , page - > pages ) ;
}
# endif
return len + sprintf ( buf + len , " \n " ) ;
}
SLAB_ATTR_RO ( slabs_cpu_partial ) ;
2010-10-05 22:57:27 +04:00
static ssize_t reclaim_account_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_RECLAIM_ACCOUNT ) ) ;
}
static ssize_t reclaim_account_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
s - > flags & = ~ SLAB_RECLAIM_ACCOUNT ;
if ( buf [ 0 ] = = ' 1 ' )
s - > flags | = SLAB_RECLAIM_ACCOUNT ;
return length ;
}
SLAB_ATTR ( reclaim_account ) ;
static ssize_t hwcache_align_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_HWCACHE_ALIGN ) ) ;
}
SLAB_ATTR_RO ( hwcache_align ) ;
# ifdef CONFIG_ZONE_DMA
static ssize_t cache_dma_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_CACHE_DMA ) ) ;
}
SLAB_ATTR_RO ( cache_dma ) ;
# endif
static ssize_t destroy_by_rcu_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_DESTROY_BY_RCU ) ) ;
}
SLAB_ATTR_RO ( destroy_by_rcu ) ;
2011-03-10 10:21:48 +03:00
static ssize_t reserved_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > reserved ) ;
}
SLAB_ATTR_RO ( reserved ) ;
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-10-05 22:57:27 +04:00
static ssize_t slabs_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_ALL ) ;
}
SLAB_ATTR_RO ( slabs ) ;
2008-04-14 20:11:40 +04:00
static ssize_t total_objects_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_ALL | SO_TOTAL ) ;
}
SLAB_ATTR_RO ( total_objects ) ;
2007-05-07 01:49:36 +04:00
static ssize_t sanity_checks_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_DEBUG_FREE ) ) ;
}
static ssize_t sanity_checks_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
s - > flags & = ~ SLAB_DEBUG_FREE ;
2011-06-01 21:25:49 +04:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-07 01:49:36 +04:00
s - > flags | = SLAB_DEBUG_FREE ;
2011-06-01 21:25:49 +04:00
}
2007-05-07 01:49:36 +04:00
return length ;
}
SLAB_ATTR ( sanity_checks ) ;
static ssize_t trace_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_TRACE ) ) ;
}
static ssize_t trace_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
s - > flags & = ~ SLAB_TRACE ;
2011-06-01 21:25:49 +04:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-07 01:49:36 +04:00
s - > flags | = SLAB_TRACE ;
2011-06-01 21:25:49 +04:00
}
2007-05-07 01:49:36 +04:00
return length ;
}
SLAB_ATTR ( trace ) ;
static ssize_t red_zone_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_RED_ZONE ) ) ;
}
static ssize_t red_zone_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
if ( any_slab_objects ( s ) )
return - EBUSY ;
s - > flags & = ~ SLAB_RED_ZONE ;
2011-06-01 21:25:49 +04:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-07 01:49:36 +04:00
s - > flags | = SLAB_RED_ZONE ;
2011-06-01 21:25:49 +04:00
}
2008-04-14 20:11:41 +04:00
calculate_sizes ( s , - 1 ) ;
2007-05-07 01:49:36 +04:00
return length ;
}
SLAB_ATTR ( red_zone ) ;
static ssize_t poison_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_POISON ) ) ;
}
static ssize_t poison_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
if ( any_slab_objects ( s ) )
return - EBUSY ;
s - > flags & = ~ SLAB_POISON ;
2011-06-01 21:25:49 +04:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-07 01:49:36 +04:00
s - > flags | = SLAB_POISON ;
2011-06-01 21:25:49 +04:00
}
2008-04-14 20:11:41 +04:00
calculate_sizes ( s , - 1 ) ;
2007-05-07 01:49:36 +04:00
return length ;
}
SLAB_ATTR ( poison ) ;
static ssize_t store_user_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_STORE_USER ) ) ;
}
static ssize_t store_user_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
if ( any_slab_objects ( s ) )
return - EBUSY ;
s - > flags & = ~ SLAB_STORE_USER ;
2011-06-01 21:25:49 +04:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-07 01:49:36 +04:00
s - > flags | = SLAB_STORE_USER ;
2011-06-01 21:25:49 +04:00
}
2008-04-14 20:11:41 +04:00
calculate_sizes ( s , - 1 ) ;
2007-05-07 01:49:36 +04:00
return length ;
}
SLAB_ATTR ( store_user ) ;
2007-05-07 01:49:43 +04:00
static ssize_t validate_show ( struct kmem_cache * s , char * buf )
{
return 0 ;
}
static ssize_t validate_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
2007-07-17 15:03:30 +04:00
int ret = - EINVAL ;
if ( buf [ 0 ] = = ' 1 ' ) {
ret = validate_slab_cache ( s ) ;
if ( ret > = 0 )
ret = length ;
}
return ret ;
2007-05-07 01:49:43 +04:00
}
SLAB_ATTR ( validate ) ;
2010-10-05 22:57:27 +04:00
static ssize_t alloc_calls_show ( struct kmem_cache * s , char * buf )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return - ENOSYS ;
return list_locations ( s , buf , TRACK_ALLOC ) ;
}
SLAB_ATTR_RO ( alloc_calls ) ;
static ssize_t free_calls_show ( struct kmem_cache * s , char * buf )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return - ENOSYS ;
return list_locations ( s , buf , TRACK_FREE ) ;
}
SLAB_ATTR_RO ( free_calls ) ;
# endif /* CONFIG_SLUB_DEBUG */
# ifdef CONFIG_FAILSLAB
static ssize_t failslab_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_FAILSLAB ) ) ;
}
static ssize_t failslab_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
s - > flags & = ~ SLAB_FAILSLAB ;
if ( buf [ 0 ] = = ' 1 ' )
s - > flags | = SLAB_FAILSLAB ;
return length ;
}
SLAB_ATTR ( failslab ) ;
2010-10-05 22:57:26 +04:00
# endif
2007-05-07 01:49:43 +04:00
2007-05-07 01:49:46 +04:00
static ssize_t shrink_show ( struct kmem_cache * s , char * buf )
{
return 0 ;
}
static ssize_t shrink_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
if ( buf [ 0 ] = = ' 1 ' ) {
int rc = kmem_cache_shrink ( s ) ;
if ( rc )
return rc ;
} else
return - EINVAL ;
return length ;
}
SLAB_ATTR ( shrink ) ;
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_NUMA
2008-01-08 10:20:26 +03:00
static ssize_t remote_node_defrag_ratio_show ( struct kmem_cache * s , char * buf )
2007-05-07 01:49:36 +04:00
{
2008-01-08 10:20:26 +03:00
return sprintf ( buf , " %d \n " , s - > remote_node_defrag_ratio / 10 ) ;
2007-05-07 01:49:36 +04:00
}
2008-01-08 10:20:26 +03:00
static ssize_t remote_node_defrag_ratio_store ( struct kmem_cache * s ,
2007-05-07 01:49:36 +04:00
const char * buf , size_t length )
{
2008-04-30 03:11:12 +04:00
unsigned long ratio ;
int err ;
2013-09-12 01:20:25 +04:00
err = kstrtoul ( buf , 10 , & ratio ) ;
2008-04-30 03:11:12 +04:00
if ( err )
return err ;
2008-08-19 17:51:22 +04:00
if ( ratio < = 100 )
2008-04-30 03:11:12 +04:00
s - > remote_node_defrag_ratio = ratio * 10 ;
2007-05-07 01:49:36 +04:00
return length ;
}
2008-01-08 10:20:26 +03:00
SLAB_ATTR ( remote_node_defrag_ratio ) ;
2007-05-07 01:49:36 +04:00
# endif
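The NUMA attribute above stores the ratio internally at ten times the percentage that userspace writes, and it silently ignores values above 100 while still reporting success. The sketch below models only that arithmetic; `ratio_cache_sketch`, `ratio_show`, and `ratio_store` are hypothetical names and not kernel interfaces.

```c
/* Hypothetical stand-in for the one field of struct kmem_cache used here. */
struct ratio_cache_sketch {
    int remote_node_defrag_ratio;   /* stored as percent * 10 */
};

/* Value presented to userspace: the stored ratio scaled back down. */
int ratio_show(const struct ratio_cache_sketch *s)
{
    return s->remote_node_defrag_ratio / 10;
}

/*
 * Accepts a percentage in 0..100. Out-of-range values leave the stored
 * ratio unchanged, matching the handler above, which returns success
 * either way.
 */
void ratio_store(struct ratio_cache_sketch *s, unsigned long ratio)
{
    if (ratio <= 100)
        s->remote_node_defrag_ratio = (int)(ratio * 10);
}
```

Because an out-of-range write is not reported as an error, a caller that needs certainty should read the attribute back after writing it.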
2008-02-08 04:47:41 +03:00
# ifdef CONFIG_SLUB_STATS
static int show_stat ( struct kmem_cache * s , char * buf , enum stat_item si )
{
unsigned long sum = 0 ;
int cpu ;
int len ;
int * data = kmalloc ( nr_cpu_ids * sizeof ( int ) , GFP_KERNEL ) ;
if ( ! data )
return - ENOMEM ;
for_each_online_cpu ( cpu ) {
2009-12-19 01:26:20 +03:00
unsigned x = per_cpu_ptr ( s - > cpu_slab , cpu ) - > stat [ si ] ;
2008-02-08 04:47:41 +03:00
data [ cpu ] = x ;
sum + = x ;
}
len = sprintf ( buf , " %lu " , sum ) ;
2008-04-14 19:52:05 +04:00
# ifdef CONFIG_SMP
2008-02-08 04:47:41 +03:00
for_each_online_cpu ( cpu ) {
if ( data [ cpu ] & & len < PAGE_SIZE - 20 )
2008-04-14 19:52:05 +04:00
len + = sprintf ( buf + len , " C%d=%u " , cpu , data [ cpu ] ) ;
2008-02-08 04:47:41 +03:00
}
2008-04-14 19:52:05 +04:00
# endif
2008-02-08 04:47:41 +03:00
kfree ( data ) ;
return len + sprintf ( buf + len , " \n " ) ;
}
2009-10-15 13:20:22 +04:00
static void clear_stat ( struct kmem_cache * s , enum stat_item si )
{
int cpu ;
for_each_online_cpu ( cpu )
2009-12-19 01:26:20 +03:00
per_cpu_ptr ( s - > cpu_slab , cpu ) - > stat [ si ] = 0 ;
2009-10-15 13:20:22 +04:00
}
2008-02-08 04:47:41 +03:00
# define STAT_ATTR(si, text) \
static ssize_t text # # _show ( struct kmem_cache * s , char * buf ) \
{ \
return show_stat ( s , buf , si ) ; \
} \
2009-10-15 13:20:22 +04:00
static ssize_t text # # _store ( struct kmem_cache * s , \
const char * buf , size_t length ) \
{ \
if ( buf [ 0 ] ! = ' 0 ' ) \
return - EINVAL ; \
clear_stat ( s , si ) ; \
return length ; \
} \
SLAB_ATTR ( text ) ; \
2008-02-08 04:47:41 +03:00
STAT_ATTR ( ALLOC_FASTPATH , alloc_fastpath ) ;
STAT_ATTR ( ALLOC_SLOWPATH , alloc_slowpath ) ;
STAT_ATTR ( FREE_FASTPATH , free_fastpath ) ;
STAT_ATTR ( FREE_SLOWPATH , free_slowpath ) ;
STAT_ATTR ( FREE_FROZEN , free_frozen ) ;
STAT_ATTR ( FREE_ADD_PARTIAL , free_add_partial ) ;
STAT_ATTR ( FREE_REMOVE_PARTIAL , free_remove_partial ) ;
STAT_ATTR ( ALLOC_FROM_PARTIAL , alloc_from_partial ) ;
STAT_ATTR ( ALLOC_SLAB , alloc_slab ) ;
STAT_ATTR ( ALLOC_REFILL , alloc_refill ) ;
2011-06-01 21:25:57 +04:00
STAT_ATTR ( ALLOC_NODE_MISMATCH , alloc_node_mismatch ) ;
2008-02-08 04:47:41 +03:00
STAT_ATTR ( FREE_SLAB , free_slab ) ;
STAT_ATTR ( CPUSLAB_FLUSH , cpuslab_flush ) ;
STAT_ATTR ( DEACTIVATE_FULL , deactivate_full ) ;
STAT_ATTR ( DEACTIVATE_EMPTY , deactivate_empty ) ;
STAT_ATTR ( DEACTIVATE_TO_HEAD , deactivate_to_head ) ;
STAT_ATTR ( DEACTIVATE_TO_TAIL , deactivate_to_tail ) ;
STAT_ATTR ( DEACTIVATE_REMOTE_FREES , deactivate_remote_frees ) ;
2011-06-01 21:25:58 +04:00
STAT_ATTR ( DEACTIVATE_BYPASS , deactivate_bypass ) ;
2008-04-14 20:11:40 +04:00
STAT_ATTR ( ORDER_FALLBACK , order_fallback ) ;
2011-06-01 21:25:49 +04:00
STAT_ATTR ( CMPXCHG_DOUBLE_CPU_FAIL , cmpxchg_double_cpu_fail ) ;
STAT_ATTR ( CMPXCHG_DOUBLE_FAIL , cmpxchg_double_fail ) ;
2011-08-10 01:12:27 +04:00
STAT_ATTR ( CPU_PARTIAL_ALLOC , cpu_partial_alloc ) ;
STAT_ATTR ( CPU_PARTIAL_FREE , cpu_partial_free ) ;
2012-02-03 19:34:56 +04:00
STAT_ATTR ( CPU_PARTIAL_NODE , cpu_partial_node ) ;
STAT_ATTR ( CPU_PARTIAL_DRAIN , cpu_partial_drain ) ;
2008-02-08 04:47:41 +03:00
# endif
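`STAT_ATTR` above uses token pasting (`##`) to stamp out a show/store pair for every statistics counter, where the store method accepts only a leading `0` and resets the counter. The userspace sketch below shows the same macro technique in isolation. It keeps a single global counter array rather than per-CPU data, and every `*_sketch` name and `count_stat` is a hypothetical stand-in.

```c
#include <stdio.h>

enum stat_item_sketch { ALLOC_FASTPATH, ALLOC_SLOWPATH, NR_STAT_ITEMS };

/* One global counter per item; the kernel keeps these per CPU. */
static unsigned long stats[NR_STAT_ITEMS];

/*
 * Generates <text>_show() and <text>_store() for one counter.
 * The store function only accepts input starting with '0', which
 * clears the counter; anything else is rejected with -1.
 */
#define STAT_ATTR_SKETCH(si, text)                      \
int text##_show(char *buf)                              \
{                                                       \
    return sprintf(buf, "%lu\n", stats[si]);            \
}                                                       \
int text##_store(const char *buf)                       \
{                                                       \
    if (buf[0] != '0')                                  \
        return -1;                                      \
    stats[si] = 0;                                      \
    return 0;                                           \
}

STAT_ATTR_SKETCH(ALLOC_FASTPATH, alloc_fastpath)
STAT_ATTR_SKETCH(ALLOC_SLOWPATH, alloc_slowpath)

/* Increment a counter, as an allocation path would. */
void count_stat(enum stat_item_sketch si)
{
    stats[si]++;
}
```

Generating both functions from one macro keeps each counter's name and its enum index in a single place, so adding a statistic is a one-line change.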
2008-01-08 10:20:27 +03:00
static struct attribute * slab_attrs [ ] = {
2007-05-07 01:49:36 +04:00
& slab_size_attr . attr ,
& object_size_attr . attr ,
& objs_per_slab_attr . attr ,
& order_attr . attr ,
2009-02-23 04:40:09 +03:00
& min_partial_attr . attr ,
2011-08-10 01:12:27 +04:00
& cpu_partial_attr . attr ,
2007-05-07 01:49:36 +04:00
& objects_attr . attr ,
2008-04-14 20:11:40 +04:00
& objects_partial_attr . attr ,
2007-05-07 01:49:36 +04:00
& partial_attr . attr ,
& cpu_slabs_attr . attr ,
& ctor_attr . attr ,
& aliases_attr . attr ,
& align_attr . attr ,
& hwcache_align_attr . attr ,
& reclaim_account_attr . attr ,
& destroy_by_rcu_attr . attr ,
2010-10-05 22:57:27 +04:00
& shrink_attr . attr ,
2011-03-10 10:21:48 +03:00
& reserved_attr . attr ,
2011-08-10 01:12:27 +04:00
& slabs_cpu_partial_attr . attr ,
2010-10-05 22:57:26 +04:00
# ifdef CONFIG_SLUB_DEBUG
2010-10-05 22:57:27 +04:00
& total_objects_attr . attr ,
& slabs_attr . attr ,
& sanity_checks_attr . attr ,
& trace_attr . attr ,
2007-05-07 01:49:36 +04:00
& red_zone_attr . attr ,
& poison_attr . attr ,
& store_user_attr . attr ,
2007-05-07 01:49:43 +04:00
& validate_attr . attr ,
2007-05-07 01:49:45 +04:00
& alloc_calls_attr . attr ,
& free_calls_attr . attr ,
2010-10-05 22:57:26 +04:00
# endif
2007-05-07 01:49:36 +04:00
# ifdef CONFIG_ZONE_DMA
& cache_dma_attr . attr ,
# endif
# ifdef CONFIG_NUMA
2008-01-08 10:20:26 +03:00
& remote_node_defrag_ratio_attr . attr ,
2008-02-08 04:47:41 +03:00
# endif
# ifdef CONFIG_SLUB_STATS
& alloc_fastpath_attr . attr ,
& alloc_slowpath_attr . attr ,
& free_fastpath_attr . attr ,
& free_slowpath_attr . attr ,
& free_frozen_attr . attr ,
& free_add_partial_attr . attr ,
& free_remove_partial_attr . attr ,
& alloc_from_partial_attr . attr ,
& alloc_slab_attr . attr ,
& alloc_refill_attr . attr ,
2011-06-01 21:25:57 +04:00
& alloc_node_mismatch_attr . attr ,
2008-02-08 04:47:41 +03:00
& free_slab_attr . attr ,
& cpuslab_flush_attr . attr ,
& deactivate_full_attr . attr ,
& deactivate_empty_attr . attr ,
& deactivate_to_head_attr . attr ,
& deactivate_to_tail_attr . attr ,
& deactivate_remote_frees_attr . attr ,
2011-06-01 21:25:58 +04:00
& deactivate_bypass_attr . attr ,
2008-04-14 20:11:40 +04:00
& order_fallback_attr . attr ,
2011-06-01 21:25:49 +04:00
& cmpxchg_double_fail_attr . attr ,
& cmpxchg_double_cpu_fail_attr . attr ,
2011-08-10 01:12:27 +04:00
& cpu_partial_alloc_attr . attr ,
& cpu_partial_free_attr . attr ,
2012-02-03 19:34:56 +04:00
& cpu_partial_node_attr . attr ,
& cpu_partial_drain_attr . attr ,
2007-05-07 01:49:36 +04:00
# endif
2010-02-26 09:36:12 +03:00
# ifdef CONFIG_FAILSLAB
& failslab_attr . attr ,
# endif
2007-05-07 01:49:36 +04:00
NULL
} ;
static struct attribute_group slab_attr_group = {
. attrs = slab_attrs ,
} ;
static ssize_t slab_attr_show ( struct kobject * kobj ,
struct attribute * attr ,
char * buf )
{
struct slab_attribute * attribute ;
struct kmem_cache * s ;
int err ;
attribute = to_slab_attr ( attr ) ;
s = to_slab ( kobj ) ;
if ( ! attribute - > show )
return - EIO ;
err = attribute - > show ( s , buf ) ;
return err ;
}
static ssize_t slab_attr_store ( struct kobject * kobj ,
struct attribute * attr ,
const char * buf , size_t len )
{
struct slab_attribute * attribute ;
struct kmem_cache * s ;
int err ;
attribute = to_slab_attr ( attr ) ;
s = to_slab ( kobj ) ;
if ( ! attribute - > store )
return - EIO ;
err = attribute - > store ( s , buf , len ) ;
slub: slub-specific propagation changes
SLUB allows us to tune a particular cache behavior with sysfs-based
tunables. When creating a new memcg cache copy, we'd like to preserve any
tunables the parent cache already had.
This can be done by tapping into the store attribute function provided by
the allocator. We of course don't need to mess with read-only fields.
Since the attributes can have multiple types and are stored internally by
sysfs, the best strategy is to issue a ->show() in the root cache, and
then ->store() in the memcg cache.
The drawback is that sysfs can allocate up to a page of buffering
for show(), which we are unlikely to need but cannot rule out. To
avoid always allocating a page for that, we can update the caches at store
time with the maximum attribute size ever stored to the root cache. We
will then get a buffer big enough to hold it. The corollary is
that if no stores happened, nothing will be propagated.
It can also happen that a root cache has its tunables updated during
normal system operation. In this case, we will propagate the change to
all caches that are already active.
[akpm@linux-foundation.org: tweak code to avoid __maybe_unused]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-19 02:23:05 +04:00
# ifdef CONFIG_MEMCG_KMEM
if ( slab_state > = FULL & & err > = 0 & & is_root_cache ( s ) ) {
int i ;
2007-05-07 01:49:36 +04:00
2012-12-19 02:23:05 +04:00
mutex_lock ( & slab_mutex ) ;
if ( s - > max_attr_size < len )
s - > max_attr_size = len ;
2012-12-19 02:23:10 +04:00
/*
* This is a best effort propagation , so this function ' s return
* value will be determined by the parent cache only . This is
* basically because not all attributes will have well
* defined semantics for rollbacks - most of the actions will
* have permanent effects .
*
* Returning the error value of any of the children that fail
* is not 100 % defined , in the sense that users seeing the
* error code won ' t be able to know anything about the state of
* the cache .
*
* Only returning the error code for the parent cache at least
* has well defined semantics . The cache being written to
* directly either failed or succeeded , in which case we loop
* through the descendants with best - effort propagation .
*/
2012-12-19 02:23:05 +04:00
for_each_memcg_cache_index ( i ) {
2013-11-13 03:08:23 +04:00
struct kmem_cache * c = cache_from_memcg_idx ( s , i ) ;
2012-12-19 02:23:05 +04:00
if ( c )
attribute - > store ( c , buf , len ) ;
}
mutex_unlock ( & slab_mutex ) ;
}
# endif
2007-05-07 01:49:36 +04:00
return err ;
}
2012-12-19 02:23:05 +04:00
static void memcg_propagate_slab_attrs ( struct kmem_cache * s )
{
# ifdef CONFIG_MEMCG_KMEM
int i ;
char * buffer = NULL ;
/* Only memcg ( child ) caches inherit ; a root cache has no parent . */
if ( is_root_cache ( s ) )
return ;
/*
* This means the root cache had no attribute written . Therefore , no
* point in copying default values around
*/
if ( ! s - > memcg_params - > root_cache - > max_attr_size )
return ;
for ( i = 0 ; i < ARRAY_SIZE ( slab_attrs ) ; i + + ) {
char mbuf [ 64 ] ;
char * buf ;
struct slab_attribute * attr = to_slab_attr ( slab_attrs [ i ] ) ;
if ( ! attr | | ! attr - > store | | ! attr - > show )
continue ;
/*
* It is really bad that we have to allocate here , so we will
* do it only as a fallback . If we actually allocate , though ,
* we can just use the allocated buffer until the end .
*
* Most of the slub attributes will tend to be very small in
* size , but sysfs allows buffers up to a page , so they can
* theoretically happen .
*/
if ( buffer )
buf = buffer ;
else if ( s - > memcg_params - > root_cache - > max_attr_size < ARRAY_SIZE ( mbuf ) )
buf = mbuf ;
else {
buffer = ( char * ) get_zeroed_page ( GFP_KERNEL ) ;
if ( WARN_ON ( ! buffer ) )
continue ;
buf = buffer ;
}
attr - > show ( s - > memcg_params - > root_cache , buf ) ;
attr - > store ( s , buf , strlen ( buf ) ) ;
}
if ( buffer )
free_page ( ( unsigned long ) buffer ) ;
# endif
}
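The show-then-store strategy from the propagation commit message, together with the small-buffer-first fallback used in memcg_propagate_slab_attrs(), can be modelled in userspace. The sketch below is only an approximation under stated assumptions: struct toy_cache, its single integer tunable, and every toy_* helper are hypothetical, and calloc() stands in for get_zeroed_page().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TOY_SMALL_BUF 64	/* mirrors the on-stack mbuf[64] */
#define TOY_PAGE_BUF 4096	/* assumed page size for the fallback buffer */

/* Hypothetical stand-in for a cache with one integer tunable. */
struct toy_cache {
	int min_partial;
	size_t max_attr_size;	/* longest string ever stored to this cache */
};

/* Format the tunable as text, the way a sysfs show() callback would. */
static int toy_show(const struct toy_cache *c, char *buf, size_t size)
{
	return snprintf(buf, size, "%d\n", c->min_partial);
}

/* Parse text into the tunable and remember the largest write seen. */
static int toy_store(struct toy_cache *c, const char *buf, size_t len)
{
	char *end;
	long v = strtol(buf, &end, 10);

	if (end == buf)
		return -1;	/* no digits parsed */
	c->min_partial = (int)v;
	if (c->max_attr_size < len)
		c->max_attr_size = len;
	return 0;
}

/* Copy the tunable from root to child via show() then store(). */
static int toy_propagate(const struct toy_cache *root, struct toy_cache *child)
{
	char small[TOY_SMALL_BUF];
	char *heap = NULL;
	char *buf = small;
	size_t size = sizeof(small);
	int ret;

	if (!root->max_attr_size)
		return 0;	/* nothing was ever stored: keep defaults */

	/* Fall back to a page-sized buffer only if a large write was seen. */
	if (root->max_attr_size >= sizeof(small)) {
		heap = calloc(1, TOY_PAGE_BUF);
		if (!heap)
			return -1;
		buf = heap;
		size = TOY_PAGE_BUF;
	}

	toy_show(root, buf, size);
	ret = toy_store(child, buf, strlen(buf));
	free(heap);		/* free(NULL) is a no-op */
	return ret;
}
```

Going through the text representation keeps the propagation code independent of each attribute's type, at the cost of a format and parse round trip per attribute.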
2010-01-19 04:58:23 +03:00
static const struct sysfs_ops slab_sysfs_ops = {
2007-05-07 01:49:36 +04:00
. show = slab_attr_show ,
. store = slab_attr_store ,
} ;
static struct kobj_type slab_ktype = {
. sysfs_ops = & slab_sysfs_ops ,
} ;
static int uevent_filter ( struct kset * kset , struct kobject * kobj )
{
struct kobj_type * ktype = get_ktype ( kobj ) ;
if ( ktype = = & slab_ktype )
return 1 ;
return 0 ;
}
2009-12-31 16:52:51 +03:00
static const struct kset_uevent_ops slab_uevent_ops = {
2007-05-07 01:49:36 +04:00
. filter = uevent_filter ,
} ;
2007-11-01 18:29:06 +03:00
static struct kset * slab_kset ;
2007-05-07 01:49:36 +04:00
2014-04-08 02:39:31 +04:00
static inline struct kset * cache_kset ( struct kmem_cache * s )
{
# ifdef CONFIG_MEMCG_KMEM
if ( ! is_root_cache ( s ) )
return s - > memcg_params - > root_cache - > memcg_kset ;
# endif
return slab_kset ;
}
2007-05-07 01:49:36 +04:00
# define ID_STR_LENGTH 64
/* Create a unique string id for a slab cache:
2008-02-16 10:45:26 +03:00
*
* Format : [ flags - ] size
2007-05-07 01:49:36 +04:00
*/
static char * create_unique_id ( struct kmem_cache * s )
{
char * name = kmalloc ( ID_STR_LENGTH , GFP_KERNEL ) ;
char * p = name ;
BUG_ON ( ! name ) ;
* p + + = ' : ' ;
/*
* First flags affecting slabcache operations . We will only
* get here for aliasable slabs so we do not need to support
* too many flags . The flags here must cover all flags that
* are matched during merging to guarantee that the id is
* unique .
*/
if ( s - > flags & SLAB_CACHE_DMA )
* p + + = ' d ' ;
if ( s - > flags & SLAB_RECLAIM_ACCOUNT )
* p + + = ' a ' ;
if ( s - > flags & SLAB_DEBUG_FREE )
* p + + = ' F ' ;
2008-04-04 02:54:48 +04:00
if ( ! ( s - > flags & SLAB_NOTRACK ) )
* p + + = ' t ' ;
2007-05-07 01:49:36 +04:00
if ( p ! = name + 1 )
* p + + = ' - ' ;
p + = sprintf ( p , " %07d " , s - > size ) ;
2012-12-19 02:22:34 +04:00
# ifdef CONFIG_MEMCG_KMEM
if ( ! is_root_cache ( s ) )
2013-07-15 05:05:29 +04:00
p + = sprintf ( p , " -%08d " ,
memcg_cache_id ( s - > memcg_params - > memcg ) ) ;
2012-12-19 02:22:34 +04:00
# endif
2007-05-07 01:49:36 +04:00
BUG_ON ( p > name + ID_STR_LENGTH - 1 ) ;
return name ;
}
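The id layout that create_unique_id() documents as "[ flags - ] size" can be tried out in userspace. This is a reduced sketch, not the kernel function: the TOY_* flag values and toy_unique_id() are hypothetical, only the 'd' and 'a' flag letters are modelled, and the 'F', 't' and memcg suffix handling are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_ID_LEN 64		/* same length as ID_STR_LENGTH */
#define TOY_FLAG_DMA 0x1u	/* hypothetical flag bits for this sketch */
#define TOY_FLAG_RECLAIM 0x2u

/* Build ":<flag letters>-<size>" or ":<size>" when no flag is set. */
static void toy_unique_id(char out[TOY_ID_LEN], unsigned int flags, int size)
{
	char *p = out;

	assert(size >= 0);
	*p++ = ':';
	if (flags & TOY_FLAG_DMA)
		*p++ = 'd';
	if (flags & TOY_FLAG_RECLAIM)
		*p++ = 'a';
	if (p != out + 1)	/* at least one flag letter was emitted */
		*p++ = '-';
	/* snprintf bounds the write, unlike the sprintf plus BUG_ON pair. */
	snprintf(p, TOY_ID_LEN - (size_t)(p - out), "%07d", size);
}
```

With no flags and a size of 192 the sketch produces ":0000192"; with both flags and a size of 64 it produces ":da-0000064". The leading colon keeps these generated names from colliding with ordinary cache names, which then appear as symlinks to them.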
static int sysfs_slab_add ( struct kmem_cache * s )
{
int err ;
const char * name ;
2012-11-28 20:23:07 +04:00
int unmergeable = slab_unmergeable ( s ) ;
2007-05-07 01:49:36 +04:00
if ( unmergeable ) {
/*
* Slabcache can never be merged so we can use the name proper .
* This is typically the case for debug situations . In that
* case we can catch duplicate names easily .
*/
2007-11-01 18:29:06 +03:00
sysfs_remove_link ( & slab_kset - > kobj , s - > name ) ;
2007-05-07 01:49:36 +04:00
name = s - > name ;
} else {
/*
* Create a unique name for the slab as a target
* for the symlinks .
*/
name = create_unique_id ( s ) ;
}
2014-04-08 02:39:31 +04:00
s - > kobj . kset = cache_kset ( s ) ;
2014-01-04 11:32:31 +04:00
err = kobject_init_and_add ( & s - > kobj , & slab_ktype , NULL , " %s " , name ) ;
2014-04-08 02:39:32 +04:00
if ( err )
goto out_put_kobj ;
2007-05-07 01:49:36 +04:00
err = sysfs_create_group ( & s - > kobj , & slab_attr_group ) ;
2014-04-08 02:39:32 +04:00
if ( err )
goto out_del_kobj ;
2014-04-08 02:39:31 +04:00
# ifdef CONFIG_MEMCG_KMEM
if ( is_root_cache ( s ) ) {
s - > memcg_kset = kset_create_and_add ( " cgroup " , NULL , & s - > kobj ) ;
if ( ! s - > memcg_kset ) {
2014-04-08 02:39:32 +04:00
err = - ENOMEM ;
goto out_del_kobj ;
2014-04-08 02:39:31 +04:00
}
}
# endif
2007-05-07 01:49:36 +04:00
kobject_uevent ( & s - > kobj , KOBJ_ADD ) ;
if ( ! unmergeable ) {
/* Setup first alias */
sysfs_slab_alias ( s , s - > name ) ;
}
2014-04-08 02:39:32 +04:00
out :
if ( ! unmergeable )
kfree ( name ) ;
return err ;
out_del_kobj :
kobject_del ( & s - > kobj ) ;
out_put_kobj :
kobject_put ( & s - > kobj ) ;
goto out ;
2007-05-07 01:49:36 +04:00
}
static void sysfs_slab_remove ( struct kmem_cache * s )
{
2012-07-07 00:25:11 +04:00
if ( slab_state < FULL )
2010-07-19 20:39:11 +04:00
/*
* Sysfs has not been setup yet so no need to remove the
* cache from sysfs .
*/
return ;
2014-04-08 02:39:31 +04:00
# ifdef CONFIG_MEMCG_KMEM
kset_unregister ( s - > memcg_kset ) ;
# endif
2007-05-07 01:49:36 +04:00
kobject_uevent ( & s - > kobj , KOBJ_REMOVE ) ;
kobject_del ( & s - > kobj ) ;
2008-01-08 09:29:05 +03:00
kobject_put ( & s - > kobj ) ;
2007-05-07 01:49:36 +04:00
}
/*
* Need to buffer aliases during bootup until sysfs becomes
2008-12-05 06:08:08 +03:00
* available lest we lose that information .
2007-05-07 01:49:36 +04:00
*/
struct saved_alias {
struct kmem_cache * s ;
const char * name ;
struct saved_alias * next ;
} ;
2007-07-17 15:03:27 +04:00
static struct saved_alias * alias_list ;
2007-05-07 01:49:36 +04:00
static int sysfs_slab_alias ( struct kmem_cache * s , const char * name )
{
struct saved_alias * al ;
2012-07-07 00:25:11 +04:00
if ( slab_state = = FULL ) {
2007-05-07 01:49:36 +04:00
/*
* If we have a leftover link then remove it .
*/
2007-11-01 18:29:06 +03:00
sysfs_remove_link ( & slab_kset - > kobj , name ) ;
return sysfs_create_link ( & slab_kset - > kobj , & s - > kobj , name ) ;
2007-05-07 01:49:36 +04:00
}
al = kmalloc ( sizeof ( struct saved_alias ) , GFP_KERNEL ) ;
if ( ! al )
return - ENOMEM ;
al - > s = s ;
al - > name = name ;
al - > next = alias_list ;
alias_list = al ;
return 0 ;
}
static int __init slab_sysfs_init ( void )
{
2007-07-17 15:03:19 +04:00
struct kmem_cache * s ;
2007-05-07 01:49:36 +04:00
int err ;
2012-07-07 00:25:12 +04:00
mutex_lock ( & slab_mutex ) ;
2010-07-19 20:39:11 +04:00
2007-11-06 21:36:58 +03:00
slab_kset = kset_create_and_add ( " slab " , & slab_uevent_ops , kernel_kobj ) ;
2007-11-01 18:29:06 +03:00
if ( ! slab_kset ) {
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-05-07 01:49:36 +04:00
printk ( KERN_ERR " Cannot register slab subsystem. \n " ) ;
return - ENOSYS ;
}
2012-07-07 00:25:11 +04:00
slab_state = FULL ;
2007-05-09 13:32:39 +04:00
2007-07-17 15:03:19 +04:00
list_for_each_entry ( s , & slab_caches , list ) {
2007-05-09 13:32:39 +04:00
err = sysfs_slab_add ( s ) ;
2007-08-31 10:56:26 +04:00
if ( err )
printk ( KERN_ERR " SLUB: Unable to add boot slab %s "
" to sysfs \n " , s - > name ) ;
2007-05-09 13:32:39 +04:00
}
2007-05-07 01:49:36 +04:00
while ( alias_list ) {
struct saved_alias * al = alias_list ;
alias_list = alias_list - > next ;
err = sysfs_slab_alias ( al - > s , al - > name ) ;
2007-08-31 10:56:26 +04:00
if ( err )
printk ( KERN_ERR " SLUB: Unable to add boot slab alias "
2012-07-08 15:37:40 +04:00
" %s to sysfs \n " , al - > name ) ;
2007-05-07 01:49:36 +04:00
kfree ( al ) ;
}
2012-07-07 00:25:12 +04:00
mutex_unlock ( & slab_mutex ) ;
2007-05-07 01:49:36 +04:00
resiliency_test ( ) ;
return 0 ;
}
__initcall ( slab_sysfs_init ) ;
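sysfs_slab_alias() queues aliases on a singly linked list until sysfs is usable, and slab_sysfs_init() replays that list once slab_state reaches FULL. A minimal userspace model of this defer-then-replay pattern is sketched below; all toy_* names are hypothetical, and a counter stands in for creating the sysfs link.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pending-alias record, modelled on struct saved_alias. */
struct toy_alias {
	const char *name;
	struct toy_alias *next;
};

static struct toy_alias *toy_pending;	/* aliases waiting for the backend */
static int toy_ready;			/* models slab_state == FULL */
static int toy_published;		/* counts aliases made visible */

/* Publish immediately if the backend is ready, otherwise queue. */
static int toy_alias_add(const char *name)
{
	struct toy_alias *al;

	if (toy_ready) {
		toy_published++;	/* stands in for sysfs_create_link() */
		return 0;
	}

	al = malloc(sizeof(*al));
	if (!al)
		return -1;
	al->name = name;
	al->next = toy_pending;
	toy_pending = al;
	return 0;
}

/* Mark the backend ready, then replay and free every queued alias. */
static void toy_alias_flush(void)
{
	toy_ready = 1;
	while (toy_pending) {
		struct toy_alias *al = toy_pending;

		toy_pending = al->next;
		toy_alias_add(al->name);
		free(al);
	}
}
```

Because new entries are pushed at the head, the replay runs in reverse order of registration; that is harmless here since each alias is independent of the others.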
2010-10-05 22:57:26 +04:00
# endif /* CONFIG_SYSFS */
2008-01-01 19:23:28 +03:00
/*
* The / proc / slabinfo ABI
*/
2008-01-03 00:04:48 +03:00
# ifdef CONFIG_SLABINFO
2012-10-19 18:20:27 +04:00
void get_slabinfo ( struct kmem_cache * s , struct slabinfo * sinfo )
2008-01-01 19:23:28 +03:00
{
unsigned long nr_slabs = 0 ;
2008-04-14 20:11:40 +04:00
unsigned long nr_objs = 0 ;
unsigned long nr_free = 0 ;
2008-01-01 19:23:28 +03:00
int node ;
for_each_online_node ( node ) {
struct kmem_cache_node * n = get_node ( s , node ) ;
if ( ! n )
continue ;
2013-07-04 04:33:26 +04:00
nr_slabs + = node_nr_slabs ( n ) ;
nr_objs + = node_nr_objs ( n ) ;
2008-04-14 20:11:40 +04:00
nr_free + = count_partial ( n , count_free ) ;
2008-01-01 19:23:28 +03:00
}
2012-10-19 18:20:27 +04:00
sinfo - > active_objs = nr_objs - nr_free ;
sinfo - > num_objs = nr_objs ;
sinfo - > active_slabs = nr_slabs ;
sinfo - > num_slabs = nr_slabs ;
sinfo - > objects_per_slab = oo_objects ( s - > oo ) ;
sinfo - > cache_order = oo_order ( s - > oo ) ;
2008-01-01 19:23:28 +03:00
}
2012-10-19 18:20:27 +04:00
void slabinfo_show_stats ( struct seq_file * m , struct kmem_cache * s )
2008-10-06 02:42:17 +04:00
{
}
2012-10-19 18:20:25 +04:00
ssize_t slabinfo_write ( struct file * file , const char __user * buffer ,
size_t count , loff_t * ppos )
2008-10-06 02:42:17 +04:00
{
2012-10-19 18:20:25 +04:00
return - EIO ;
2008-10-06 02:42:17 +04:00
}
2008-01-03 00:04:48 +03:00
# endif /* CONFIG_SLABINFO */