2007-05-06 14:49:36 -07:00
/*
* SLUB : A slab allocator that limits cache line use instead of queuing
* objects in per cpu and per node lists .
*
2011-06-01 12:25:53 -05:00
* The allocator synchronizes using per slab locks or atomic operations
* and only uses a centralized lock to manage a pool of partial slabs .
2007-05-06 14:49:36 -07:00
*
2008-07-04 09:59:22 -07:00
* ( C ) 2007 SGI , Christoph Lameter
2011-06-01 12:25:53 -05:00
* ( C ) 2011 Linux Foundation , Christoph Lameter
2007-05-06 14:49:36 -07:00
*/
# include <linux/mm.h>
2009-05-05 19:13:44 +10:00
# include <linux/swap.h> /* struct reclaim_state */
2007-05-06 14:49:36 -07:00
# include <linux/module.h>
# include <linux/bit_spinlock.h>
# include <linux/interrupt.h>
# include <linux/bitops.h>
# include <linux/slab.h>
2008-10-06 02:42:17 +04:00
# include <linux/proc_fs.h>
2007-05-06 14:49:36 -07:00
# include <linux/seq_file.h>
2008-04-04 00:54:48 +02:00
# include <linux/kmemcheck.h>
2007-05-06 14:49:36 -07:00
# include <linux/cpu.h>
# include <linux/cpuset.h>
# include <linux/mempolicy.h>
# include <linux/ctype.h>
2008-04-30 00:55:01 -07:00
# include <linux/debugobjects.h>
2007-05-06 14:49:36 -07:00
# include <linux/kallsyms.h>
2007-10-21 16:41:37 -07:00
# include <linux/memory.h>
2008-05-01 04:34:31 -07:00
# include <linux/math64.h>
2008-12-23 19:37:01 +09:00
# include <linux/fault-inject.h>
2011-07-07 22:47:01 +03:00
# include <linux/stacktrace.h>
2012-01-30 15:53:51 -06:00
# include <linux/prefetch.h>
2007-05-06 14:49:36 -07:00
2010-10-21 10:29:19 +01:00
# include <trace/events/kmem.h>
2007-05-06 14:49:36 -07:00
/*
* Lock order :
2011-06-01 12:25:53 -05:00
* 1. slub_lock ( Global Semaphore )
* 2. node - > list_lock
* 3. slab_lock ( page ) ( Only on some arches and for debugging )
2007-05-06 14:49:36 -07:00
*
2011-06-01 12:25:53 -05:00
* slub_lock
*
* The role of the slub_lock is to protect the list of all the slabs
* and to synchronize major metadata changes to slab cache structures .
*
* The slab_lock is only used for debugging and on arches that do not
* have the ability to do a cmpxchg_double . It only protects the second
* double word in the page struct . Meaning
* A. page->freelist -> List of free objects in a page
* B. page->counters -> Counters of objects
* C. page->frozen -> Frozen state
*
* If a slab is frozen then it is exempt from list management . It is not
* on any list . The processor that froze the slab is the one who can
* perform list operations on the page . Other processors may put objects
* onto the freelist but the processor that froze the slab is the only
* one that can retrieve the objects from the page ' s freelist .
2007-05-06 14:49:36 -07:00
*
* The list_lock protects the partial and full list on each node and
* the partial slab counter. If taken then no new slabs may be added or
* removed from the lists, nor may the number of partial slabs be modified.
* ( Note that the total number of slabs is an atomic value that may be
* modified without taking the list lock ) .
*
* The list_lock is a centralized lock and thus we avoid taking it as
* much as possible . As long as SLUB does not have to handle partial
* slabs , operations can continue without any centralized lock . F . e .
* allocating a long series of objects that fill up slabs does not require
* the list lock .
* Interrupts are disabled during allocation and deallocation in order to
* make the slab allocator safe to use in the context of an irq . In addition
* interrupts are disabled to ensure that the processor does not change
* while handling per_cpu slabs , due to kernel preemption .
*
* SLUB assigns one slab for allocation to each processor .
* Allocations only occur from these slabs called cpu slabs .
*
2007-05-09 02:32:39 -07:00
* Slabs with free elements are kept on a partial list and during regular
* operations no list for full slabs is used . If an object in a full slab is
2007-05-06 14:49:36 -07:00
* freed then the slab will show up again on the partial lists .
2007-05-09 02:32:39 -07:00
* We track full slabs for debugging purposes though because otherwise we
* cannot scan all objects .
2007-05-06 14:49:36 -07:00
*
* Slabs are freed when they become empty. Teardown and setup are
* minimal so we rely on the page allocator's per cpu caches for
* fast frees and allocs.
*
* Overloading of page flags that are otherwise used for LRU management .
*
2007-05-16 22:10:53 -07:00
* PageActive The slab is frozen and exempt from list processing .
* This means that the slab is dedicated to a purpose
* such as satisfying allocations for a specific
* processor . Objects may be freed in the slab while
* it is frozen but slab_free will then skip the usual
* list operations . It is up to the processor holding
* the slab to integrate the slab into the slab lists
* when the slab is no longer needed .
*
* One use of this flag is to mark slabs that are
* used for allocations . Then such a slab becomes a cpu
* slab . The cpu slab may be equipped with an additional
2007-10-16 01:26:05 -07:00
* freelist that allows lockless access to
2007-05-10 03:15:16 -07:00
* free objects in addition to the regular freelist
* that requires the slab lock .
2007-05-06 14:49:36 -07:00
*
* PageError Slab requires special handling due to debug
* options set . This moves slab handling out of
2007-05-10 03:15:16 -07:00
* the fast path and disables lockless freelists .
2007-05-06 14:49:36 -07:00
*/
2010-07-09 14:07:14 -05:00
# define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
SLAB_TRACE | SLAB_DEBUG_FREE )
static inline int kmem_cache_debug ( struct kmem_cache * s )
{
2007-05-16 22:10:56 -07:00
# ifdef CONFIG_SLUB_DEBUG
2010-07-09 14:07:14 -05:00
return unlikely ( s - > flags & SLAB_DEBUG_FLAGS ) ;
2007-05-16 22:10:56 -07:00
# else
2010-07-09 14:07:14 -05:00
return 0 ;
2007-05-16 22:10:56 -07:00
# endif
2010-07-09 14:07:14 -05:00
}
2007-05-16 22:10:56 -07:00
2007-05-06 14:49:36 -07:00
/*
* Issues still to be resolved :
*
* - Support PAGE_ALLOC_DEBUG . Should be easy to do .
*
* - Variable sizing of the per node arrays
*/
/* Enable to test recovery from slab corruption on boot */
# undef SLUB_RESILIENCY_TEST
2011-06-01 12:25:49 -05:00
/* Enable to log cmpxchg failures */
# undef SLUB_DEBUG_CMPXCHG
2007-05-06 14:49:46 -07:00
/*
* Minimum number of partial slabs. These will be left on the partial
* lists even if they are empty . kmem_cache_shrink may reclaim them .
*/
2007-12-21 14:37:37 -08:00
# define MIN_PARTIAL 5
2007-05-06 14:49:44 -07:00
2007-05-06 14:49:46 -07:00
/*
* Maximum number of desirable partial slabs .
* The existence of more partial slabs makes kmem_cache_shrink
* sort the partial list by the number of objects in use.
*/
# define MAX_PARTIAL 10
2007-05-06 14:49:36 -07:00
# define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER )
2007-05-09 02:32:39 -07:00
2009-07-07 00:14:14 -07:00
/*
2009-07-27 18:30:35 -07:00
* Debugging flags that require metadata to be stored in the slab . These get
* disabled when slub_debug = O is used and a cache ' s min order increases with
* metadata .
2009-07-07 00:14:14 -07:00
*/
2009-07-27 18:30:35 -07:00
# define DEBUG_METADATA_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
2009-07-07 00:14:14 -07:00
2007-05-06 14:49:36 -07:00
/*
* Set of flags that will prevent slab merging
*/
# define SLUB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
2010-02-26 09:36:12 +03:00
SLAB_TRACE | SLAB_DESTROY_BY_RCU | SLAB_NOLEAKTRACE | \
SLAB_FAILSLAB )
2007-05-06 14:49:36 -07:00
# define SLUB_MERGE_SAME (SLAB_DEBUG_FREE | SLAB_RECLAIM_ACCOUNT | \
2008-04-04 00:54:48 +02:00
SLAB_CACHE_DMA | SLAB_NOTRACK )
2007-05-06 14:49:36 -07:00
2008-10-22 23:00:38 +04:00
# define OO_SHIFT 16
# define OO_MASK ((1 << OO_SHIFT) - 1)
2011-06-01 12:25:45 -05:00
# define MAX_OBJS_PER_PAGE 32767 /* since page.objects is u15 */
2008-10-22 23:00:38 +04:00
2007-05-06 14:49:36 -07:00
/* Internal SLUB flags */
2010-07-09 14:07:11 -05:00
# define __OBJECT_POISON 0x80000000UL /* Poison object */
2011-06-01 12:25:49 -05:00
# define __CMPXCHG_DOUBLE 0x40000000UL /* Use cmpxchg_double */
2007-05-06 14:49:36 -07:00
static int kmem_size = sizeof ( struct kmem_cache ) ;
# ifdef CONFIG_SMP
static struct notifier_block slab_notifier ;
# endif
static enum {
DOWN , /* No slab functionality available */
2010-08-20 12:37:15 -05:00
PARTIAL , /* Kmem_cache_node works */
2007-05-09 02:32:39 -07:00
UP , /* Everything works but does not show up in sysfs */
2007-05-06 14:49:36 -07:00
SYSFS /* Sysfs up */
} slab_state = DOWN ;
/* A list of all slab caches on the system */
static DECLARE_RWSEM ( slub_lock ) ;
2007-07-17 04:03:27 -07:00
static LIST_HEAD ( slab_caches ) ;
2007-05-06 14:49:36 -07:00
2007-05-09 02:32:43 -07:00
/*
* Tracking user of a slab .
*/
2011-07-07 11:36:36 -07:00
# define TRACK_ADDRS_COUNT 16
2007-05-09 02:32:43 -07:00
struct track {
2008-08-19 20:43:25 +03:00
unsigned long addr ; /* Called from address */
2011-07-07 11:36:36 -07:00
# ifdef CONFIG_STACKTRACE
unsigned long addrs [ TRACK_ADDRS_COUNT ] ; /* Called from address */
# endif
2007-05-09 02:32:43 -07:00
int cpu ; /* Was running on cpu */
int pid ; /* Pid context */
unsigned long when ; /* When did the operation occur */
} ;
enum track_item { TRACK_ALLOC , TRACK_FREE } ;
2010-10-05 13:57:26 -05:00
# ifdef CONFIG_SYSFS
2007-05-06 14:49:36 -07:00
static int sysfs_slab_add ( struct kmem_cache * ) ;
static int sysfs_slab_alias ( struct kmem_cache * , const char * ) ;
static void sysfs_slab_remove ( struct kmem_cache * ) ;
2008-02-07 17:47:41 -08:00
2007-05-06 14:49:36 -07:00
# else
2007-07-17 04:03:24 -07:00
static inline int sysfs_slab_add ( struct kmem_cache * s ) { return 0 ; }
static inline int sysfs_slab_alias ( struct kmem_cache * s , const char * p )
{ return 0 ; }
2008-01-07 22:29:05 -08:00
static inline void sysfs_slab_remove ( struct kmem_cache * s )
{
2010-09-14 23:21:12 +03:00
kfree ( s - > name ) ;
2008-01-07 22:29:05 -08:00
kfree ( s ) ;
}
2008-02-07 17:47:41 -08:00
2007-05-06 14:49:36 -07:00
# endif
2011-03-22 13:35:00 -05:00
static inline void stat ( const struct kmem_cache * s , enum stat_item si )
2008-02-07 17:47:41 -08:00
{
# ifdef CONFIG_SLUB_STATS
2009-12-18 16:26:23 -06:00
__this_cpu_inc ( s - > cpu_slab - > stat [ si ] ) ;
2008-02-07 17:47:41 -08:00
# endif
}
2007-05-06 14:49:36 -07:00
/********************************************************************
* Core slab cache functions
 *******************************************************************/
int slab_is_available(void)
{
	return slab_state >= UP;
}

static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
{
	return s->node[node];
}
2008-02-15 23:45:26 -08:00
/* Verify that a pointer has an address that is valid within a slab page */
2007-05-09 02:32:43 -07:00
static inline int check_valid_pointer ( struct kmem_cache * s ,
struct page * page , const void * object )
{
void * base ;
2008-03-01 13:40:44 -08:00
if ( ! object )
2007-05-09 02:32:43 -07:00
return 1 ;
2008-03-01 13:40:44 -08:00
base = page_address ( page ) ;
2008-04-14 19:11:30 +03:00
if ( object < base | | object > = base + page - > objects * s - > size | |
2007-05-09 02:32:43 -07:00
( object - base ) % s - > size ) {
return 0 ;
}
return 1 ;
}
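/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of the same bounds-and-alignment test. struct demo_slab and
 * demo_valid_pointer() are hypothetical names that stand in for the
 * page/kmem_cache fields used above; a NULL pointer is accepted because
 * it terminates a freelist.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the slab fields used by the check. */
struct demo_slab {
	const char *base;	/* address of the first object */
	size_t objects;		/* number of objects in the slab */
	size_t size;		/* stride between objects, in bytes */
};

/* Returns 1 if object is NULL or points at the start of an object slot. */
static int demo_valid_pointer(const struct demo_slab *slab, const void *object)
{
	const char *p = object;

	assert(slab->size != 0);

	if (!p)
		return 1;

	/* Reject anything outside [base, base + objects * size). */
	if (p < slab->base || p >= slab->base + slab->objects * slab->size)
		return 0;

	/* Reject pointers into the middle of an object. */
	return (size_t)(p - slab->base) % slab->size == 0;
}
```

/*
 * With four 64-byte objects, offset 128 is accepted while offsets 130
 * (misaligned) and 256 (one past the last object) are rejected.
 */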
2007-05-09 02:32:40 -07:00
static inline void * get_freepointer ( struct kmem_cache * s , void * object )
{
return * ( void * * ) ( object + s - > offset ) ;
}
slub: prefetch next freelist pointer in slab_alloc()
Recycling a page is a problem, since freelist link chain is hot on
cpu(s) which freed objects, and possibly very cold on cpu currently
owning slab.
Adding a prefetch of cache line containing the pointer to next object in
slab_alloc() helps a lot in many workloads, in particular on asymmetric
ones (allocations done on one cpu, frees on other cpus). Added cost is
three machine instructions only.
Examples on my dual socket quad core ht machine (Intel CPU E5540
@2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
Before patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% )
28 866 491 context-switches # 0,088 M/sec ( +- 1,80% )
1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% )
127 151 page-faults # 0,000 M/sec ( +- 0,16% )
829 399 813 448 cycles # 2,532 GHz ( +- 0,64% )
580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% )
197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% )
503 548 648 975 instructions # 0,61 insns per cycle
# 1,15 stalled cycles per insn ( +- 0,46% )
95 780 068 471 branches # 292,389 M/sec ( +- 0,48% )
1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% )
20,705679994 seconds time elapsed ( +- 0,64% )
After patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% )
19 703 372 context-switches # 0,069 M/sec ( +- 4,99% )
1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% )
126 776 page-faults # 0,000 M/sec ( +- 0,12% )
724 636 593 213 cycles # 2,532 GHz ( +- 1,32% )
499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% )
156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% )
463 897 792 661 instructions # 0,64 insns per cycle
# 1,08 stalled cycles per insn ( +- 0,94% )
87 717 352 563 branches # 306,451 M/sec ( +- 0,99% )
941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% )
18,132070670 seconds time elapsed ( +- 1,30% )
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-12-16 16:25:34 +01:00
static void prefetch_freepointer ( const struct kmem_cache * s , void * object )
{
prefetch ( object + s - > offset ) ;
}
2011-05-16 15:26:08 -05:00
static inline void * get_freepointer_safe ( struct kmem_cache * s , void * object )
{
void * p ;
# ifdef CONFIG_DEBUG_PAGEALLOC
probe_kernel_read ( & p , ( void * * ) ( object + s - > offset ) , sizeof ( p ) ) ;
# else
p = get_freepointer ( s , object ) ;
# endif
return p ;
}
2007-05-09 02:32:40 -07:00
static inline void set_freepointer ( struct kmem_cache * s , void * object , void * fp )
{
* ( void * * ) ( object + s - > offset ) = fp ;
}
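/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of how a freelist can be threaded through the free objects
 * themselves, with the link stored at a fixed offset inside each object.
 * struct demo_cache and the demo_* helpers are hypothetical names; memcpy
 * is used for the loads and stores so the sketch does not depend on the
 * alignment of the buffer.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two kmem_cache fields the helpers use. */
struct demo_cache {
	size_t size;	/* stride between objects, in bytes */
	size_t offset;	/* where the free pointer lives inside an object */
};

static void *demo_get_freepointer(const struct demo_cache *s, void *object)
{
	void *next;

	memcpy(&next, (char *)object + s->offset, sizeof(next));
	return next;
}

static void demo_set_freepointer(const struct demo_cache *s, void *object,
				 void *fp)
{
	memcpy((char *)object + s->offset, &fp, sizeof(fp));
}

/* Link every object in the buffer; returns the head of the freelist. */
static void *demo_build_freelist(const struct demo_cache *s, void *slab,
				 size_t nobjects)
{
	char *base = slab;
	void *head = NULL;

	assert(s->offset + sizeof(void *) <= s->size);

	/* Walk backwards so the list ends up in address order. */
	for (size_t i = nobjects; i-- > 0; ) {
		void *object = base + i * s->size;

		demo_set_freepointer(s, object, head);
		head = object;
	}
	return head;
}

/* Count the objects reachable from a freelist head. */
static size_t demo_count_free(const struct demo_cache *s, void *freelist)
{
	size_t n = 0;

	for (void *p = freelist; p; p = demo_get_freepointer(s, p))
		n++;
	return n;
}
```

/*
 * Keeping the link inside the free object means the list needs no memory
 * of its own; the price is that the link is overwritten as soon as the
 * object is handed out and used.
 */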
/* Loop over all objects in a slab */
2008-04-14 19:11:31 +03:00
# define for_each_object(__p, __s, __addr, __objects) \
for ( __p = ( __addr ) ; __p < ( __addr ) + ( __objects ) * ( __s ) - > size ; \
2007-05-09 02:32:40 -07:00
__p + = ( __s ) - > size )
/* Determine object index from a given position */
static inline int slab_index(void *p, struct kmem_cache *s, void *addr)
{
	return (p - addr) / s->size;
}
2011-02-26 20:10:26 +01:00
static inline size_t slab_ksize ( const struct kmem_cache * s )
{
# ifdef CONFIG_SLUB_DEBUG
/*
* Debugging requires use of the padding between object
* and whatever may come after it .
*/
if ( s - > flags & ( SLAB_RED_ZONE | SLAB_POISON ) )
return s - > objsize ;
# endif
/*
* If we have the need to store the freelist pointer
* back there or track user information then we can
* only use the space before that information .
*/
if ( s - > flags & ( SLAB_DESTROY_BY_RCU | SLAB_STORE_USER ) )
return s - > inuse ;
/*
* Else we can use all the padding etc for the allocation
*/
return s - > size ;
}
2011-03-10 15:21:48 +08:00
static inline int order_objects(int order, unsigned long size, int reserved)
{
	return ((PAGE_SIZE << order) - reserved) / size;
}
2008-04-14 19:11:31 +03:00
static inline struct kmem_cache_order_objects oo_make ( int order ,
2011-03-10 15:21:48 +08:00
unsigned long size , int reserved )
2008-04-14 19:11:31 +03:00
{
struct kmem_cache_order_objects x = {
2011-03-10 15:21:48 +08:00
( order < < OO_SHIFT ) + order_objects ( order , size , reserved )
2008-04-14 19:11:31 +03:00
} ;
return x ;
}
static inline int oo_order ( struct kmem_cache_order_objects x )
{
2008-10-22 23:00:38 +04:00
return x . x > > OO_SHIFT ;
2008-04-14 19:11:31 +03:00
}
static inline int oo_objects ( struct kmem_cache_order_objects x )
{
2008-10-22 23:00:38 +04:00
return x . x & OO_MASK ;
2008-04-14 19:11:31 +03:00
}
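/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of the order/objects packing used by oo_make(), oo_order() and
 * oo_objects(). The page order goes in the bits above OO_SHIFT and the
 * object count in the low 16 bits. The DEMO_* macros and demo_* helpers
 * are hypothetical names, and a 4 KiB page size is assumed.
 */

```c
#include <assert.h>

#define DEMO_PAGE_SIZE	4096UL	/* assumption: 4 KiB pages */
#define DEMO_OO_SHIFT	16
#define DEMO_OO_MASK	((1UL << DEMO_OO_SHIFT) - 1)

struct demo_order_objects {
	unsigned long x;
};

/* Objects of a given size that fit in 2^order pages minus a reserved tail. */
static unsigned long demo_objects_for(int order, unsigned long size,
				      unsigned long reserved)
{
	assert(size != 0);
	return ((DEMO_PAGE_SIZE << order) - reserved) / size;
}

/* Pack the order into the high bits and the object count into the low bits. */
static struct demo_order_objects demo_oo_make(int order, unsigned long size,
					      unsigned long reserved)
{
	struct demo_order_objects x = {
		((unsigned long)order << DEMO_OO_SHIFT) +
			demo_objects_for(order, size, reserved)
	};

	return x;
}

static int demo_oo_order(struct demo_order_objects x)
{
	return (int)(x.x >> DEMO_OO_SHIFT);
}

static int demo_oo_objects(struct demo_order_objects x)
{
	return (int)(x.x & DEMO_OO_MASK);
}
```

/*
 * For example, order 1 with 256-byte objects gives 8192 / 256 = 32
 * objects, packed as (1 << 16) + 32. The packing only round-trips while
 * the object count fits in the low 16 bits.
 */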
2011-06-01 12:25:53 -05:00
/*
* Per slab locking using the pagelock
*/
static __always_inline void slab_lock(struct page *page)
{
	bit_spin_lock(PG_locked, &page->flags);
}

static __always_inline void slab_unlock(struct page *page)
{
	__bit_spin_unlock(PG_locked, &page->flags);
}
2011-07-14 12:49:12 -05:00
/* Interrupts must be disabled (for the fallback code to work right) */
static inline bool __cmpxchg_double_slab ( struct kmem_cache * s , struct page * page ,
void * freelist_old , unsigned long counters_old ,
void * freelist_new , unsigned long counters_new ,
const char * n )
{
VM_BUG_ON ( ! irqs_disabled ( ) ) ;
2012-01-12 17:17:33 -08:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined ( CONFIG_HAVE_ALIGNED_STRUCT_PAGE )
2011-07-14 12:49:12 -05:00
if ( s - > flags & __CMPXCHG_DOUBLE ) {
2012-01-02 17:02:18 +00:00
if ( cmpxchg_double ( & page - > freelist , & page - > counters ,
2011-07-14 12:49:12 -05:00
freelist_old , counters_old ,
freelist_new , counters_new ) )
return 1 ;
} else
# endif
{
slab_lock ( page ) ;
if ( page - > freelist = = freelist_old & & page - > counters = = counters_old ) {
page - > freelist = freelist_new ;
page - > counters = counters_new ;
slab_unlock ( page ) ;
return 1 ;
}
slab_unlock ( page ) ;
}
cpu_relax ( ) ;
stat ( s , CMPXCHG_DOUBLE_FAIL ) ;
# ifdef SLUB_DEBUG_CMPXCHG
printk ( KERN_INFO " %s %s: cmpxchg double redo " , n , s - > name ) ;
# endif
return 0 ;
}
2011-06-01 12:25:49 -05:00
static inline bool cmpxchg_double_slab ( struct kmem_cache * s , struct page * page ,
void * freelist_old , unsigned long counters_old ,
void * freelist_new , unsigned long counters_new ,
const char * n )
{
2012-01-12 17:17:33 -08:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined ( CONFIG_HAVE_ALIGNED_STRUCT_PAGE )
2011-06-01 12:25:49 -05:00
if ( s - > flags & __CMPXCHG_DOUBLE ) {
2012-01-02 17:02:18 +00:00
if ( cmpxchg_double ( & page - > freelist , & page - > counters ,
2011-06-01 12:25:49 -05:00
freelist_old , counters_old ,
freelist_new , counters_new ) )
return 1 ;
} else
# endif
{
2011-07-14 12:49:12 -05:00
unsigned long flags ;
local_irq_save ( flags ) ;
2011-06-01 12:25:53 -05:00
slab_lock ( page ) ;
2011-06-01 12:25:49 -05:00
if ( page - > freelist = = freelist_old & & page - > counters = = counters_old ) {
page - > freelist = freelist_new ;
page - > counters = counters_new ;
2011-06-01 12:25:53 -05:00
slab_unlock ( page ) ;
2011-07-14 12:49:12 -05:00
local_irq_restore ( flags ) ;
2011-06-01 12:25:49 -05:00
return 1 ;
}
2011-06-01 12:25:53 -05:00
slab_unlock ( page ) ;
2011-07-14 12:49:12 -05:00
local_irq_restore ( flags ) ;
2011-06-01 12:25:49 -05:00
}
cpu_relax ( ) ;
stat ( s , CMPXCHG_DOUBLE_FAIL ) ;
# ifdef SLUB_DEBUG_CMPXCHG
printk ( KERN_INFO " %s %s: cmpxchg double redo " , n , s - > name ) ;
# endif
return 0 ;
}
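/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of the locked fallback path above, where both words are compared
 * and replaced under a per-slab lock when no double-word cmpxchg is
 * available. A C11 atomic_flag spinlock stands in for the page bit
 * spinlock, and interrupt handling is left out entirely. struct demo_page
 * and the demo_* helpers are hypothetical names.
 */

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields of struct page that are updated. */
struct demo_page {
	atomic_flag lock;
	void *freelist;
	unsigned long counters;
};

static void demo_page_init(struct demo_page *page, void *freelist,
			   unsigned long counters)
{
	atomic_flag_clear(&page->lock);
	page->freelist = freelist;
	page->counters = counters;
}

static void demo_slab_lock(struct demo_page *page)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&page->lock,
						 memory_order_acquire))
		;
}

static void demo_slab_unlock(struct demo_page *page)
{
	atomic_flag_clear_explicit(&page->lock, memory_order_release);
}

/* Replace both words only if both still hold the expected old values. */
static bool demo_cmpxchg_double(struct demo_page *page,
				void *freelist_old, unsigned long counters_old,
				void *freelist_new, unsigned long counters_new)
{
	bool ok = false;

	demo_slab_lock(page);
	if (page->freelist == freelist_old && page->counters == counters_old) {
		page->freelist = freelist_new;
		page->counters = counters_new;
		ok = true;
	}
	demo_slab_unlock(page);

	return ok;
}
```

/*
 * A caller would typically read both words, compute the new values and
 * retry in a loop until the update succeeds, which is why a failure is
 * only counted rather than treated as an error.
 */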
2007-05-09 02:32:44 -07:00
# ifdef CONFIG_SLUB_DEBUG
2011-04-15 14:48:13 -05:00
/*
* Determine a map of the objects on a page: a bit is set for each object
* on the page's freelist, so a clear bit marks an object that is in use.
*
2011-06-01 12:25:53 -05:00
* Node listlock must be held to guarantee that the page does
2011-04-15 14:48:13 -05:00
* not vanish from under us .
*/
static void get_map(struct kmem_cache *s, struct page *page, unsigned long *map)
{
	void *p;
	void *addr = page_address(page);

	for (p = page->freelist; p; p = get_freepointer(s, p))
		set_bit(slab_index(p, s, addr), map);
}
2007-05-09 02:32:44 -07:00
/*
* Debug settings :
*/
2007-07-15 23:38:14 -07:00
# ifdef CONFIG_SLUB_DEBUG_ON
static int slub_debug = DEBUG_DEFAULT_FLAGS ;
# else
2007-05-09 02:32:44 -07:00
static int slub_debug ;
2007-07-15 23:38:14 -07:00
# endif
2007-05-09 02:32:44 -07:00
static char * slub_debug_slabs ;
2009-07-07 00:14:14 -07:00
static int disable_higher_order_debug ;
2007-05-09 02:32:44 -07:00
2007-05-06 14:49:36 -07:00
/*
* Object debugging
*/
static void print_section ( char * text , u8 * addr , unsigned int length )
{
2011-07-29 14:10:20 +02:00
print_hex_dump ( KERN_ERR , text , DUMP_PREFIX_ADDRESS , 16 , 1 , addr ,
length , 1 ) ;
2007-05-06 14:49:36 -07:00
}
static struct track *get_track(struct kmem_cache *s, void *object,
	enum track_item alloc)
{
	struct track *p;

	if (s->offset)
		p = object + s->offset + sizeof(void *);
	else
		p = object + s->inuse;

	return p + alloc;
}
static void set_track ( struct kmem_cache * s , void * object ,
2008-08-19 20:43:25 +03:00
enum track_item alloc , unsigned long addr )
2007-05-06 14:49:36 -07:00
{
2009-03-07 00:36:21 +09:00
struct track * p = get_track ( s , object , alloc ) ;
2007-05-06 14:49:36 -07:00
if ( addr ) {
2011-07-07 11:36:36 -07:00
# ifdef CONFIG_STACKTRACE
struct stack_trace trace ;
int i ;
trace . nr_entries = 0 ;
trace . max_entries = TRACK_ADDRS_COUNT ;
trace . entries = p - > addrs ;
trace . skip = 3 ;
save_stack_trace ( & trace ) ;
/* See rant in lockdep.c */
if ( trace . nr_entries ! = 0 & &
trace . entries [ trace . nr_entries - 1 ] = = ULONG_MAX )
trace . nr_entries - - ;
for ( i = trace . nr_entries ; i < TRACK_ADDRS_COUNT ; i + + )
p - > addrs [ i ] = 0 ;
# endif
2007-05-06 14:49:36 -07:00
p - > addr = addr ;
p - > cpu = smp_processor_id ( ) ;
2008-06-23 02:58:37 +04:00
p - > pid = current - > pid ;
2007-05-06 14:49:36 -07:00
p - > when = jiffies ;
} else
memset ( p , 0 , sizeof ( struct track ) ) ;
}
static void init_tracking ( struct kmem_cache * s , void * object )
{
2007-07-17 04:03:18 -07:00
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2008-08-19 20:43:25 +03:00
set_track ( s , object , TRACK_FREE , 0UL ) ;
set_track ( s , object , TRACK_ALLOC , 0UL ) ;
2007-05-06 14:49:36 -07:00
}
static void print_track ( const char * s , struct track * t )
{
if ( ! t - > addr )
return ;
2008-07-14 12:12:53 -07:00
printk ( KERN_ERR " INFO: %s in %pS age=%lu cpu=%u pid=%d \n " ,
2008-08-19 20:43:25 +03:00
s , ( void * ) t - > addr , jiffies - t - > when , t - > cpu , t - > pid ) ;
2011-07-07 11:36:36 -07:00
# ifdef CONFIG_STACKTRACE
{
int i ;
for ( i = 0 ; i < TRACK_ADDRS_COUNT ; i + + )
if ( t - > addrs [ i ] )
printk ( KERN_ERR " \t %pS \n " , ( void * ) t - > addrs [ i ] ) ;
else
break ;
}
# endif
2007-07-17 04:03:18 -07:00
}
static void print_tracking ( struct kmem_cache * s , void * object )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
print_track ( " Allocated " , get_track ( s , object , TRACK_ALLOC ) ) ;
print_track ( " Freed " , get_track ( s , object , TRACK_FREE ) ) ;
}
static void print_page_info ( struct page * page )
{
2008-04-14 19:11:30 +03:00
printk ( KERN_ERR " INFO: Slab 0x%p objects=%u used=%u fp=0x%p flags=0x%04lx \n " ,
page , page - > objects , page - > inuse , page - > freelist , page - > flags ) ;
2007-07-17 04:03:18 -07:00
}
static void slab_bug ( struct kmem_cache * s , char * fmt , . . . )
{
va_list args ;
char buf [ 100 ] ;
va_start ( args , fmt ) ;
vsnprintf ( buf , sizeof ( buf ) , fmt , args ) ;
va_end ( args ) ;
printk ( KERN_ERR " ======================================== "
" ===================================== \n " ) ;
2011-11-15 15:04:00 -08:00
printk ( KERN_ERR " BUG %s (%s): %s \n " , s - > name , print_tainted ( ) , buf ) ;
2007-07-17 04:03:18 -07:00
printk ( KERN_ERR " ---------------------------------------- "
" ------------------------------------- \n \n " ) ;
2007-05-06 14:49:36 -07:00
}
2007-07-17 04:03:18 -07:00
static void slab_fix ( struct kmem_cache * s , char * fmt , . . . )
{
va_list args ;
char buf [ 100 ] ;
va_start ( args , fmt ) ;
vsnprintf ( buf , sizeof ( buf ) , fmt , args ) ;
va_end ( args ) ;
printk ( KERN_ERR " FIX %s: %s \n " , s - > name , buf ) ;
}
static void print_trailer ( struct kmem_cache * s , struct page * page , u8 * p )
2007-05-06 14:49:36 -07:00
{
unsigned int off ; /* Offset of last byte */
2008-03-01 13:40:44 -08:00
u8 * addr = page_address ( page ) ;
2007-07-17 04:03:18 -07:00
print_tracking ( s , p ) ;
print_page_info ( page ) ;
printk ( KERN_ERR " INFO: Object 0x%p @offset=%tu fp=0x%p \n \n " ,
p , p - addr , get_freepointer ( s , p ) ) ;
if ( p > addr + 16 )
2011-07-29 14:10:20 +02:00
print_section ( " Bytes b4 " , p - 16 , 16 ) ;
2007-05-06 14:49:36 -07:00
2011-07-29 14:10:20 +02:00
print_section ( " Object " , p , min_t ( unsigned long , s - > objsize ,
PAGE_SIZE ) ) ;
2007-05-06 14:49:36 -07:00
if ( s - > flags & SLAB_RED_ZONE )
2011-07-29 14:10:20 +02:00
print_section ( " Redzone " , p + s - > objsize ,
2007-05-06 14:49:36 -07:00
s - > inuse - s - > objsize ) ;
if ( s - > offset )
off = s - > offset + sizeof ( void * ) ;
else
off = s - > inuse ;
2007-07-17 04:03:18 -07:00
if ( s - > flags & SLAB_STORE_USER )
2007-05-06 14:49:36 -07:00
off + = 2 * sizeof ( struct track ) ;
if ( off ! = s - > size )
/* Beginning of the filler is the free pointer */
2011-07-29 14:10:20 +02:00
print_section ( " Padding " , p + off , s - > size - off ) ;
2007-07-17 04:03:18 -07:00
dump_stack ( ) ;
2007-05-06 14:49:36 -07:00
}
static void object_err ( struct kmem_cache * s , struct page * page ,
u8 * object , char * reason )
{
2008-04-23 12:28:01 -07:00
slab_bug ( s , " %s " , reason ) ;
2007-07-17 04:03:18 -07:00
print_trailer ( s , page , object ) ;
2007-05-06 14:49:36 -07:00
}
2007-07-17 04:03:18 -07:00
static void slab_err ( struct kmem_cache * s , struct page * page , char * fmt , . . . )
2007-05-06 14:49:36 -07:00
{
va_list args ;
char buf [ 100 ] ;
2007-07-17 04:03:18 -07:00
va_start ( args , fmt ) ;
vsnprintf ( buf , sizeof ( buf ) , fmt , args ) ;
2007-05-06 14:49:36 -07:00
va_end ( args ) ;
2008-04-23 12:28:01 -07:00
slab_bug ( s , " %s " , buf ) ;
2007-07-17 04:03:18 -07:00
print_page_info ( page ) ;
2007-05-06 14:49:36 -07:00
dump_stack ( ) ;
}
2010-09-29 07:15:01 -05:00
static void init_object ( struct kmem_cache * s , void * object , u8 val )
2007-05-06 14:49:36 -07:00
{
u8 * p = object ;
if ( s - > flags & __OBJECT_POISON ) {
memset ( p , POISON_FREE , s - > objsize - 1 ) ;
2008-01-07 23:20:27 -08:00
p [ s - > objsize - 1 ] = POISON_END ;
2007-05-06 14:49:36 -07:00
}
if ( s - > flags & SLAB_RED_ZONE )
2010-09-29 07:15:01 -05:00
memset ( p + s - > objsize , val , s - > inuse - s - > objsize ) ;
2007-05-06 14:49:36 -07:00
}
2007-07-17 04:03:18 -07:00
static void restore_bytes ( struct kmem_cache * s , char * message , u8 data ,
void * from , void * to )
{
slab_fix ( s , " Restoring 0x%p-0x%p=0x%x \n " , from , to - 1 , data ) ;
memset ( from , data , to - from ) ;
}
static int check_bytes_and_report ( struct kmem_cache * s , struct page * page ,
u8 * object , char * what ,
2008-01-07 23:20:27 -08:00
u8 * start , unsigned int value , unsigned int bytes )
2007-07-17 04:03:18 -07:00
{
u8 * fault ;
u8 * end ;
2011-10-31 17:08:07 -07:00
fault = memchr_inv ( start , value , bytes ) ;
2007-07-17 04:03:18 -07:00
if ( ! fault )
return 1 ;
end = start + bytes ;
while ( end > fault & & end [ - 1 ] = = value )
end - - ;
slab_bug ( s , " %s overwritten " , what ) ;
printk ( KERN_ERR " INFO: 0x%p-0x%p. First byte 0x%x instead of 0x%x \n " ,
fault , end - 1 , fault [ 0 ] , value ) ;
print_trailer ( s , page , object ) ;
restore_bytes ( s , what , value , fault , end ) ;
return 0 ;
2007-05-06 14:49:36 -07:00
}
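/*
 * Illustration only, not part of the kernel source: a minimal userspace
 * sketch of how the damaged range is located. The first byte that differs
 * from the expected fill value marks the start, and matching bytes are
 * trimmed from the tail to find the end. demo_memchr_inv() is a simple
 * hypothetical stand-in for the kernel helper of the same purpose;
 * struct demo_fault and demo_find_fault() are also hypothetical names.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive byte offsets of the damaged range, valid only if found != 0. */
struct demo_fault {
	int found;
	size_t first;
	size_t last;
};

/* Return the first byte that is NOT equal to value, or NULL if none. */
static const unsigned char *demo_memchr_inv(const unsigned char *start,
					    unsigned char value, size_t bytes)
{
	for (size_t i = 0; i < bytes; i++)
		if (start[i] != value)
			return start + i;
	return NULL;
}

static struct demo_fault demo_find_fault(const unsigned char *start,
					 unsigned char value, size_t bytes)
{
	struct demo_fault f = { 0, 0, 0 };
	const unsigned char *fault = demo_memchr_inv(start, value, bytes);
	const unsigned char *end;

	assert(start != NULL);

	if (!fault)
		return f;

	/* Trim intact bytes from the tail so only the damage is reported. */
	end = start + bytes;
	while (end > fault && end[-1] == value)
		end--;

	f.found = 1;
	f.first = (size_t)(fault - start);
	f.last = (size_t)(end - 1 - start);
	return f;
}
```

/*
 * Intact bytes that lie between two damaged bytes stay inside the
 * reported range, so the range is a bounding interval, not an exact list
 * of bad bytes.
 */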
/*
* Object layout :
*
* object address
* Bytes of the object to be managed .
* If the freepointer may overlay the object then the free
* pointer is the first word of the object .
2007-05-09 02:32:39 -07:00
*
2007-05-06 14:49:36 -07:00
* Poisoning uses 0x6b ( POISON_FREE ) and the last byte is
* 0xa5 ( POISON_END )
*
* object + s - > objsize
* Padding to reach word boundary . This is also used for Redzoning .
2007-05-09 02:32:39 -07:00
* Padding is extended by another word if Redzoning is enabled and
* objsize = = inuse .
*
2007-05-06 14:49:36 -07:00
* We fill with 0xbb ( RED_INACTIVE ) for inactive objects and with
* 0xcc ( RED_ACTIVE ) for objects in use .
*
* object + s - > inuse
2007-05-09 02:32:39 -07:00
* Meta data starts here .
*
2007-05-06 14:49:36 -07:00
* A . Free pointer ( if we cannot overwrite object on free )
* B . Tracking data for SLAB_STORE_USER
2007-05-09 02:32:39 -07:00
* C. Padding to reach required alignment boundary or at minimum
2008-02-15 23:45:26 -08:00
* one word if debugging is on to be able to detect writes
2007-05-09 02:32:39 -07:00
* before the word boundary .
*
* Padding is done using 0x5a ( POISON_INUSE )
2007-05-06 14:49:36 -07:00
*
* object + s - > size
2007-05-09 02:32:39 -07:00
* Nothing is used beyond s - > size .
2007-05-06 14:49:36 -07:00
*
2007-05-09 02:32:39 -07:00
* If slabcaches are merged then the objsize and inuse boundaries are mostly
* ignored. Therefore no slab options that rely on these boundaries
2007-05-06 14:49:36 -07:00
* may be used with merged slabcaches .
*/
static int check_pad_bytes ( struct kmem_cache * s , struct page * page , u8 * p )
{
unsigned long off = s - > inuse ; /* The end of info */
if ( s - > offset )
/* Freepointer is placed after the object. */
off + = sizeof ( void * ) ;
if ( s - > flags & SLAB_STORE_USER )
/* We also have user information there */
off + = 2 * sizeof ( struct track ) ;
if ( s - > size = = off )
return 1 ;
2007-07-17 04:03:18 -07:00
return check_bytes_and_report ( s , page , p , " Object padding " ,
p + off , POISON_INUSE , s - > size - off ) ;
2007-05-06 14:49:36 -07:00
}
2008-04-14 19:11:30 +03:00
/* Check the pad bytes at the end of a slab page */
2007-05-06 14:49:36 -07:00
static int slab_pad_check ( struct kmem_cache * s , struct page * page )
{
2007-07-17 04:03:18 -07:00
u8 * start ;
u8 * fault ;
u8 * end ;
int length ;
int remainder ;
2007-05-06 14:49:36 -07:00
if ( ! ( s - > flags & SLAB_POISON ) )
return 1 ;
2008-03-01 13:40:44 -08:00
start = page_address ( page ) ;
2011-03-10 15:21:48 +08:00
length = ( PAGE_SIZE < < compound_order ( page ) ) - s - > reserved ;
2008-04-14 19:11:30 +03:00
end = start + length ;
remainder = length % s - > size ;
2007-05-06 14:49:36 -07:00
if ( ! remainder )
return 1 ;
2011-10-31 17:08:07 -07:00
fault = memchr_inv ( end - remainder , POISON_INUSE , remainder ) ;
2007-07-17 04:03:18 -07:00
if ( ! fault )
return 1 ;
while ( end > fault & & end [ - 1 ] = = POISON_INUSE )
end - - ;
slab_err ( s , page , " Padding overwritten. 0x%p-0x%p " , fault , end - 1 ) ;
2011-07-29 14:10:20 +02:00
print_section ( " Padding " , end - remainder , remainder ) ;
2007-07-17 04:03:18 -07:00
2009-09-03 16:08:06 +02:00
restore_bytes ( s , " slab padding " , POISON_INUSE , end - remainder , end ) ;
2007-07-17 04:03:18 -07:00
return 0 ;
2007-05-06 14:49:36 -07:00
}
static int check_object ( struct kmem_cache * s , struct page * page ,
2010-09-29 07:15:01 -05:00
void * object , u8 val )
2007-05-06 14:49:36 -07:00
{
u8 * p = object ;
u8 * endobject = object + s - > objsize ;
if ( s - > flags & SLAB_RED_ZONE ) {
2007-07-17 04:03:18 -07:00
if ( ! check_bytes_and_report ( s , page , object , " Redzone " ,
2010-09-29 07:15:01 -05:00
endobject , val , s - > inuse - s - > objsize ) )
2007-05-06 14:49:36 -07:00
return 0 ;
} else {
2008-02-05 17:57:39 -08:00
if ( ( s - > flags & SLAB_POISON ) & & s - > objsize < s - > inuse ) {
check_bytes_and_report ( s , page , p , " Alignment padding " ,
endobject , POISON_INUSE , s - > inuse - s - > objsize ) ;
}
2007-05-06 14:49:36 -07:00
}
if ( s - > flags & SLAB_POISON ) {
2010-09-29 07:15:01 -05:00
if ( val ! = SLUB_RED_ACTIVE & & ( s - > flags & __OBJECT_POISON ) & &
2007-07-17 04:03:18 -07:00
( ! check_bytes_and_report ( s , page , p , " Poison " , p ,
POISON_FREE , s - > objsize - 1 ) | |
! check_bytes_and_report ( s , page , p , " Poison " ,
2008-01-07 23:20:27 -08:00
p + s - > objsize - 1 , POISON_END , 1 ) ) )
2007-05-06 14:49:36 -07:00
return 0 ;
/*
* check_pad_bytes cleans up on its own .
*/
check_pad_bytes ( s , page , p ) ;
}
2010-09-29 07:15:01 -05:00
if ( ! s - > offset & & val = = SLUB_RED_ACTIVE )
2007-05-06 14:49:36 -07:00
/*
* Object and freepointer overlap . Cannot check
* freepointer while object is allocated .
*/
return 1 ;
/* Check free pointer validity */
if ( ! check_valid_pointer ( s , page , get_freepointer ( s , p ) ) ) {
object_err ( s , page , p , " Freepointer corrupt " ) ;
/*
2008-12-05 14:08:08 +11:00
* No choice but to zap it and thus lose the remainder
2007-05-06 14:49:36 -07:00
* of the free objects in this slab . May cause
2007-05-09 02:32:39 -07:00
* another error because the object count is now wrong .
2007-05-06 14:49:36 -07:00
*/
2008-03-01 13:40:44 -08:00
set_freepointer ( s , p , NULL ) ;
2007-05-06 14:49:36 -07:00
return 0 ;
}
return 1 ;
}
static int check_slab ( struct kmem_cache * s , struct page * page )
{
2008-04-14 19:11:30 +03:00
int maxobj ;
2007-05-06 14:49:36 -07:00
VM_BUG_ON ( ! irqs_disabled ( ) ) ;
if ( ! PageSlab ( page ) ) {
2007-07-17 04:03:18 -07:00
slab_err ( s , page , " Not a valid slab page " ) ;
2007-05-06 14:49:36 -07:00
return 0 ;
}
2008-04-14 19:11:30 +03:00
2011-03-10 15:21:48 +08:00
maxobj = order_objects ( compound_order ( page ) , s - > size , s - > reserved ) ;
2008-04-14 19:11:30 +03:00
if ( page - > objects > maxobj ) {
slab_err ( s , page , " objects %u > max %u " ,
page - > objects , maxobj ) ;
return 0 ;
}
if ( page - > inuse > page - > objects ) {
2007-07-17 04:03:18 -07:00
slab_err ( s , page , " inuse %u > max %u " ,
2008-04-14 19:11:30 +03:00
page - > inuse , page - > objects ) ;
2007-05-06 14:49:36 -07:00
return 0 ;
}
/* Slab_pad_check fixes things up after itself */
slab_pad_check ( s , page ) ;
return 1 ;
}
/*
2007-05-09 02:32:39 -07:00
* Determine if a certain object on a page is on the freelist . Must hold the
* slab lock to guarantee that the chains are in a consistent state .
2007-05-06 14:49:36 -07:00
*/
static int on_freelist ( struct kmem_cache * s , struct page * page , void * search )
{
int nr = 0 ;
2011-06-01 12:25:53 -05:00
void * fp ;
2007-05-06 14:49:36 -07:00
void * object = NULL ;
2008-04-14 19:11:31 +03:00
unsigned long max_objects ;
2007-05-06 14:49:36 -07:00
2011-06-01 12:25:53 -05:00
fp = page - > freelist ;
2008-04-14 19:11:30 +03:00
while ( fp & & nr < = page - > objects ) {
2007-05-06 14:49:36 -07:00
if ( fp = = search )
return 1 ;
if ( ! check_valid_pointer ( s , page , fp ) ) {
if ( object ) {
object_err ( s , page , object ,
" Freechain corrupt " ) ;
2008-03-01 13:40:44 -08:00
set_freepointer ( s , object , NULL ) ;
2007-05-06 14:49:36 -07:00
break ;
} else {
2007-07-17 04:03:18 -07:00
slab_err ( s , page , " Freepointer corrupt " ) ;
2008-03-01 13:40:44 -08:00
page - > freelist = NULL ;
2008-04-14 19:11:30 +03:00
page - > inuse = page - > objects ;
2007-07-17 04:03:18 -07:00
slab_fix ( s , " Freelist cleared " ) ;
2007-05-06 14:49:36 -07:00
return 0 ;
}
break ;
}
object = fp ;
fp = get_freepointer ( s , object ) ;
nr + + ;
}
2011-03-10 15:21:48 +08:00
max_objects = order_objects ( compound_order ( page ) , s - > size , s - > reserved ) ;
2008-10-22 23:00:38 +04:00
if ( max_objects > MAX_OBJS_PER_PAGE )
max_objects = MAX_OBJS_PER_PAGE ;
2008-04-14 19:11:31 +03:00
if ( page - > objects ! = max_objects ) {
slab_err ( s , page , " Wrong number of objects. Found %d but "
" should be %lu " , page - > objects , max_objects ) ;
page - > objects = max_objects ;
slab_fix ( s , " Number of objects adjusted. " ) ;
}
2008-04-14 19:11:30 +03:00
if ( page - > inuse ! = page - > objects - nr ) {
2007-05-06 14:49:47 -07:00
slab_err ( s , page , " Wrong object count. Counter is %d but "
2008-04-14 19:11:30 +03:00
" counted were %d " , page - > inuse , page - > objects - nr ) ;
page - > inuse = page - > objects - nr ;
2007-07-17 04:03:18 -07:00
slab_fix ( s , " Object count adjusted. " ) ;
2007-05-06 14:49:36 -07:00
}
return search = = NULL ;
}
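/*
 * Illustrative sketch only, not part of this file. It shows the bounded
 * free-chain walk that on_freelist ( ) relies on: counting links against the
 * number of objects in the slab stops the walk if the chain is corrupt or
 * cyclic. struct sketch_obj and sketch_count_free ( ) are hypothetical, and
 * the bound used here ( nr >= max_objects ) is a simplification of the loop
 * condition above. No repair of the chain is attempted.
 */

```c
#include <stddef.h>

struct sketch_obj {
	struct sketch_obj *next; /* free pointer stored inside a free object */
};

/*
 * Walk the free chain starting at head, following at most max_objects links.
 * Returns the number of free objects, or -1 if the chain is longer than
 * max_objects (which indicates a cycle or corruption).
 * *found is set to 1 if search is on the chain, 0 otherwise.
 */
int sketch_count_free(const struct sketch_obj *head, int max_objects,
		      const struct sketch_obj *search, int *found)
{
	int nr = 0;

	*found = 0;
	for (const struct sketch_obj *fp = head; fp; fp = fp->next) {
		if (nr >= max_objects)
			return -1;
		if (fp == search)
			*found = 1;
		nr++;
	}
	return nr;
}
```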
2008-04-29 16:11:12 -07:00
static void trace ( struct kmem_cache * s , struct page * page , void * object ,
int alloc )
2007-05-16 22:11:00 -07:00
{
if ( s - > flags & SLAB_TRACE ) {
printk ( KERN_INFO " TRACE %s %s 0x%p inuse=%d fp=0x%p \n " ,
s - > name ,
alloc ? " alloc " : " free " ,
object , page - > inuse ,
page - > freelist ) ;
if ( ! alloc )
2011-07-29 14:10:20 +02:00
print_section ( " Object " , ( void * ) object , s - > objsize ) ;
2007-05-16 22:11:00 -07:00
dump_stack ( ) ;
}
}
2010-08-20 12:37:16 -05:00
/*
* Hooks for other subsystems that check memory allocations . In a typical
* production configuration these hooks all should produce no code at all .
*/
static inline int slab_pre_alloc_hook ( struct kmem_cache * s , gfp_t flags )
{
2010-08-20 12:37:17 -05:00
flags & = gfp_allowed_mask ;
2010-08-20 12:37:16 -05:00
lockdep_trace_alloc ( flags ) ;
might_sleep_if ( flags & __GFP_WAIT ) ;
return should_failslab ( s - > objsize , flags , s - > flags ) ;
}
static inline void slab_post_alloc_hook ( struct kmem_cache * s , gfp_t flags , void * object )
{
2010-08-20 12:37:17 -05:00
flags & = gfp_allowed_mask ;
2011-02-14 18:35:22 +01:00
kmemcheck_slab_alloc ( s , flags , object , slab_ksize ( s ) ) ;
2010-08-20 12:37:16 -05:00
kmemleak_alloc_recursive ( object , s - > objsize , 1 , s - > flags , flags ) ;
}
static inline void slab_free_hook ( struct kmem_cache * s , void * x )
{
kmemleak_free_recursive ( x , s - > flags ) ;
2011-02-25 11:38:52 -06:00
/*
* Trouble is that we may no longer disable interrupts in the fast path .
* So in order to make the debug calls that expect irqs to be
* disabled we need to disable interrupts temporarily .
*/
# if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP)
{
unsigned long flags ;
local_irq_save ( flags ) ;
kmemcheck_slab_free ( s , x , s - > objsize ) ;
debug_check_no_locks_freed ( x , s - > objsize ) ;
local_irq_restore ( flags ) ;
}
# endif
2011-03-24 21:26:46 +02:00
if ( ! ( s - > flags & SLAB_DEBUG_OBJECTS ) )
debug_check_no_obj_freed ( x , s - > objsize ) ;
2010-08-20 12:37:16 -05:00
}
2007-05-06 14:49:42 -07:00
/*
2007-05-09 02:32:39 -07:00
* Tracking of fully allocated slabs for debugging purposes .
2011-06-01 12:25:50 -05:00
*
* list_lock must be held .
2007-05-06 14:49:42 -07:00
*/
2011-06-01 12:25:50 -05:00
static void add_full ( struct kmem_cache * s ,
struct kmem_cache_node * n , struct page * page )
2007-05-06 14:49:42 -07:00
{
2011-06-01 12:25:50 -05:00
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
2007-05-06 14:49:42 -07:00
list_add ( & page - > lru , & n - > full ) ;
}
2011-06-01 12:25:50 -05:00
/*
* list_lock must be held .
*/
2007-05-06 14:49:42 -07:00
static void remove_full ( struct kmem_cache * s , struct page * page )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return ;
list_del ( & page - > lru ) ;
}
2008-04-14 18:53:02 +03:00
/* Tracking of the number of slabs for debugging purposes */
static inline unsigned long slabs_node ( struct kmem_cache * s , int node )
{
struct kmem_cache_node * n = get_node ( s , node ) ;
return atomic_long_read ( & n - > nr_slabs ) ;
}
2009-06-11 14:08:48 +04:00
static inline unsigned long node_nr_slabs ( struct kmem_cache_node * n )
{
return atomic_long_read ( & n - > nr_slabs ) ;
}
2008-04-14 19:11:40 +03:00
static inline void inc_slabs_node ( struct kmem_cache * s , int node , int objects )
2008-04-14 18:53:02 +03:00
{
struct kmem_cache_node * n = get_node ( s , node ) ;
/*
* May be called early in order to allocate a slab for the
* kmem_cache_node structure . Solve the chicken - egg
* dilemma by deferring the increment of the count during
* bootstrap ( see early_kmem_cache_node_alloc ) .
*/
2010-09-28 08:10:26 -05:00
if ( n ) {
2008-04-14 18:53:02 +03:00
atomic_long_inc ( & n - > nr_slabs ) ;
2008-04-14 19:11:40 +03:00
atomic_long_add ( objects , & n - > total_objects ) ;
}
2008-04-14 18:53:02 +03:00
}
2008-04-14 19:11:40 +03:00
static inline void dec_slabs_node ( struct kmem_cache * s , int node , int objects )
2008-04-14 18:53:02 +03:00
{
struct kmem_cache_node * n = get_node ( s , node ) ;
atomic_long_dec ( & n - > nr_slabs ) ;
2008-04-14 19:11:40 +03:00
atomic_long_sub ( objects , & n - > total_objects ) ;
2008-04-14 18:53:02 +03:00
}
/* Object debug checks for alloc/free paths */
2007-05-16 22:11:00 -07:00
static void setup_object_debug ( struct kmem_cache * s , struct page * page ,
void * object )
{
if ( ! ( s - > flags & ( SLAB_STORE_USER | SLAB_RED_ZONE | __OBJECT_POISON ) ) )
return ;
2010-09-29 07:15:01 -05:00
init_object ( s , object , SLUB_RED_INACTIVE ) ;
2007-05-16 22:11:00 -07:00
init_tracking ( s , object ) ;
}
2010-08-20 12:37:12 -05:00
static noinline int alloc_debug_processing ( struct kmem_cache * s , struct page * page ,
2008-08-19 20:43:25 +03:00
void * object , unsigned long addr )
2007-05-06 14:49:36 -07:00
{
if ( ! check_slab ( s , page ) )
goto bad ;
if ( ! check_valid_pointer ( s , page , object ) ) {
object_err ( s , page , object , " Freelist Pointer check fails " ) ;
2007-05-06 14:49:47 -07:00
goto bad ;
2007-05-06 14:49:36 -07:00
}
2010-09-29 07:15:01 -05:00
if ( ! check_object ( s , page , object , SLUB_RED_INACTIVE ) )
2007-05-06 14:49:36 -07:00
goto bad ;
2007-05-16 22:11:00 -07:00
/* Success . Perform special debug activities for allocs . */
if ( s - > flags & SLAB_STORE_USER )
set_track ( s , object , TRACK_ALLOC , addr ) ;
trace ( s , page , object , 1 ) ;
2010-09-29 07:15:01 -05:00
init_object ( s , object , SLUB_RED_ACTIVE ) ;
2007-05-06 14:49:36 -07:00
return 1 ;
2007-05-16 22:11:00 -07:00
2007-05-06 14:49:36 -07:00
bad :
if ( PageSlab ( page ) ) {
/*
* If this is a slab page then lets do the best we can
* to avoid issues in the future . Marking all objects
2007-05-09 02:32:39 -07:00
* as used avoids touching the remaining objects .
2007-05-06 14:49:36 -07:00
*/
2007-07-17 04:03:18 -07:00
slab_fix ( s , " Marking all objects used " ) ;
2008-04-14 19:11:30 +03:00
page - > inuse = page - > objects ;
2008-03-01 13:40:44 -08:00
page - > freelist = NULL ;
2007-05-06 14:49:36 -07:00
}
return 0 ;
}
2010-08-20 12:37:12 -05:00
static noinline int free_debug_processing ( struct kmem_cache * s ,
struct page * page , void * object , unsigned long addr )
2007-05-06 14:49:36 -07:00
{
2011-06-01 12:25:54 -05:00
unsigned long flags ;
int rc = 0 ;
local_irq_save ( flags ) ;
2011-06-01 12:25:53 -05:00
slab_lock ( page ) ;
2007-05-06 14:49:36 -07:00
if ( ! check_slab ( s , page ) )
goto fail ;
if ( ! check_valid_pointer ( s , page , object ) ) {
2007-05-06 14:49:47 -07:00
slab_err ( s , page , " Invalid object pointer 0x%p " , object ) ;
2007-05-06 14:49:36 -07:00
goto fail ;
}
if ( on_freelist ( s , page , object ) ) {
2007-07-17 04:03:18 -07:00
object_err ( s , page , object , " Object already free " ) ;
2007-05-06 14:49:36 -07:00
goto fail ;
}
2010-09-29 07:15:01 -05:00
if ( ! check_object ( s , page , object , SLUB_RED_ACTIVE ) )
2011-06-01 12:25:54 -05:00
goto out ;
2007-05-06 14:49:36 -07:00
if ( unlikely ( s ! = page - > slab ) ) {
2008-02-05 17:57:39 -08:00
if ( ! PageSlab ( page ) ) {
2007-05-06 14:49:47 -07:00
slab_err ( s , page , " Attempt to free object(0x%p) "
" outside of slab " , object ) ;
2008-02-05 17:57:39 -08:00
} else if ( ! page - > slab ) {
2007-05-06 14:49:36 -07:00
printk ( KERN_ERR
2007-05-06 14:49:47 -07:00
" SLUB <none>: no slab for object 0x%p. \n " ,
2007-05-06 14:49:36 -07:00
object ) ;
2007-05-06 14:49:47 -07:00
dump_stack ( ) ;
2008-01-07 23:20:27 -08:00
} else
2007-07-17 04:03:18 -07:00
object_err ( s , page , object ,
" page slab pointer corrupt. " ) ;
2007-05-06 14:49:36 -07:00
goto fail ;
}
2007-05-16 22:11:00 -07:00
if ( s - > flags & SLAB_STORE_USER )
set_track ( s , object , TRACK_FREE , addr ) ;
trace ( s , page , object , 0 ) ;
2010-09-29 07:15:01 -05:00
init_object ( s , object , SLUB_RED_INACTIVE ) ;
2011-06-01 12:25:54 -05:00
rc = 1 ;
out :
2011-06-01 12:25:53 -05:00
slab_unlock ( page ) ;
2011-06-01 12:25:54 -05:00
local_irq_restore ( flags ) ;
return rc ;
2007-05-16 22:11:00 -07:00
2007-05-06 14:49:36 -07:00
fail :
2007-07-17 04:03:18 -07:00
slab_fix ( s , " Object at 0x%p not freed " , object ) ;
2011-06-01 12:25:54 -05:00
goto out ;
2007-05-06 14:49:36 -07:00
}
2007-05-09 02:32:44 -07:00
static int __init setup_slub_debug ( char * str )
{
2007-07-15 23:38:14 -07:00
slub_debug = DEBUG_DEFAULT_FLAGS ;
if ( * str + + ! = ' = ' | | ! * str )
/*
* No options specified . Switch on full debugging .
*/
goto out ;
if ( * str = = ' , ' )
/*
* No options but restriction on slabs . This means full
* debugging for slabs matching a pattern .
*/
goto check_slabs ;
2009-07-07 00:14:14 -07:00
if ( tolower ( * str ) = = ' o ' ) {
/*
* Avoid enabling debugging on caches if its minimum order
* would increase as a result .
*/
disable_higher_order_debug = 1 ;
goto out ;
}
2007-07-15 23:38:14 -07:00
slub_debug = 0 ;
if ( * str = = ' - ' )
/*
* Switch off all debugging measures .
*/
goto out ;
/*
* Determine which debug features should be switched on
*/
2008-01-07 23:20:27 -08:00
for ( ; * str & & * str ! = ' , ' ; str + + ) {
2007-07-15 23:38:14 -07:00
switch ( tolower ( * str ) ) {
case ' f ' :
slub_debug | = SLAB_DEBUG_FREE ;
break ;
case ' z ' :
slub_debug | = SLAB_RED_ZONE ;
break ;
case ' p ' :
slub_debug | = SLAB_POISON ;
break ;
case ' u ' :
slub_debug | = SLAB_STORE_USER ;
break ;
case ' t ' :
slub_debug | = SLAB_TRACE ;
break ;
2010-02-26 09:36:12 +03:00
case ' a ' :
slub_debug | = SLAB_FAILSLAB ;
break ;
2007-07-15 23:38:14 -07:00
default :
printk ( KERN_ERR " slub_debug option '%c' "
2008-01-07 23:20:27 -08:00
" unknown. skipped \n " , * str ) ;
2007-07-15 23:38:14 -07:00
}
2007-05-09 02:32:44 -07:00
}
2007-07-15 23:38:14 -07:00
check_slabs :
2007-05-09 02:32:44 -07:00
if ( * str = = ' , ' )
slub_debug_slabs = str + 1 ;
2007-07-15 23:38:14 -07:00
out :
2007-05-09 02:32:44 -07:00
return 1 ;
}
__setup ( " slub_debug " , setup_slub_debug ) ;
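/*
 * Illustrative sketch only, not part of this file. It models just the
 * option-letter loop of setup_slub_debug ( ): letters up to an optional
 * comma select debug features, and text after the comma names the slabs to
 * restrict debugging to. The SK_* bit values and the function name are
 * hypothetical stand-ins for the real SLAB_* flags; the handling of " = " ,
 * " - " and " o " is left out, and unknown letters are skipped silently.
 */

```c
#include <ctype.h>
#include <stddef.h>

enum {
	SK_DEBUG_FREE = 1 << 0, /* 'f' */
	SK_RED_ZONE   = 1 << 1, /* 'z' */
	SK_POISON     = 1 << 2, /* 'p' */
	SK_STORE_USER = 1 << 3, /* 'u' */
	SK_TRACE      = 1 << 4, /* 't' */
	SK_FAILSLAB   = 1 << 5, /* 'a' */
};

/*
 * Parse option letters from str. Returns the accumulated flag bits and
 * stores a pointer to the slab name pattern in *slabs (NULL if none given).
 */
unsigned int sketch_parse_debug_letters(const char *str, const char **slabs)
{
	unsigned int flags = 0;

	*slabs = NULL;
	for (; *str && *str != ','; str++) {
		switch (tolower((unsigned char)*str)) {
		case 'f': flags |= SK_DEBUG_FREE; break;
		case 'z': flags |= SK_RED_ZONE;   break;
		case 'p': flags |= SK_POISON;     break;
		case 'u': flags |= SK_STORE_USER; break;
		case 't': flags |= SK_TRACE;      break;
		case 'a': flags |= SK_FAILSLAB;   break;
		default:  break; /* unknown letter: skipped */
		}
	}
	if (*str == ',')
		*slabs = str + 1;
	return flags;
}
```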
2007-09-11 15:24:11 -07:00
static unsigned long kmem_cache_flags ( unsigned long objsize ,
unsigned long flags , const char * name ,
2008-07-25 19:45:34 -07:00
void ( * ctor ) ( void * ) )
2007-05-09 02:32:44 -07:00
{
/*
2008-02-15 23:45:24 -08:00
* Enable debugging if selected on the kernel commandline .
2007-05-09 02:32:44 -07:00
*/
2008-02-15 23:45:24 -08:00
if ( slub_debug & & ( ! slub_debug_slabs | |
2009-07-27 18:30:35 -07:00
! strncmp ( slub_debug_slabs , name , strlen ( slub_debug_slabs ) ) ) )
flags | = slub_debug ;
2007-09-11 15:24:11 -07:00
return flags ;
2007-05-09 02:32:44 -07:00
}
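/*
 * Illustrative sketch only, not part of this file. kmem_cache_flags ( )
 * treats slub_debug_slabs as a name prefix: comparing only
 * strlen ( pattern ) characters means a pattern of " kmalloc " selects
 * " kmalloc-64 " as well. sketch_cache_selected ( ) is a hypothetical helper
 * that isolates that test.
 */

```c
#include <string.h>

/*
 * Return 1 if a cache called name is selected by pattern. A NULL pattern
 * selects every cache; otherwise pattern must be a prefix of name.
 */
int sketch_cache_selected(const char *pattern, const char *name)
{
	return !pattern || !strncmp(pattern, name, strlen(pattern));
}
```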
# else
2007-05-16 22:11:00 -07:00
static inline void setup_object_debug ( struct kmem_cache * s ,
struct page * page , void * object ) { }
2007-05-09 02:32:44 -07:00
2007-05-16 22:11:00 -07:00
static inline int alloc_debug_processing ( struct kmem_cache * s ,
2008-08-19 20:43:25 +03:00
struct page * page , void * object , unsigned long addr ) { return 0 ; }
2007-05-09 02:32:44 -07:00
2007-05-16 22:11:00 -07:00
static inline int free_debug_processing ( struct kmem_cache * s ,
2008-08-19 20:43:25 +03:00
struct page * page , void * object , unsigned long addr ) { return 0 ; }
2007-05-09 02:32:44 -07:00
static inline int slab_pad_check ( struct kmem_cache * s , struct page * page )
{ return 1 ; }
static inline int check_object ( struct kmem_cache * s , struct page * page ,
2010-09-29 07:15:01 -05:00
void * object , u8 val ) { return 1 ; }
2011-06-01 12:25:50 -05:00
static inline void add_full ( struct kmem_cache * s , struct kmem_cache_node * n ,
struct page * page ) { }
2011-06-01 12:25:52 -05:00
static inline void remove_full ( struct kmem_cache * s , struct page * page ) { }
2007-09-11 15:24:11 -07:00
static inline unsigned long kmem_cache_flags ( unsigned long objsize ,
unsigned long flags , const char * name ,
2008-07-25 19:45:34 -07:00
void ( * ctor ) ( void * ) )
2007-09-11 15:24:11 -07:00
{
return flags ;
}
2007-05-09 02:32:44 -07:00
# define slub_debug 0
2008-04-14 18:53:02 +03:00
2009-09-15 11:00:26 +02:00
# define disable_higher_order_debug 0
2008-04-14 18:53:02 +03:00
static inline unsigned long slabs_node ( struct kmem_cache * s , int node )
{ return 0 ; }
2009-06-11 14:08:48 +04:00
static inline unsigned long node_nr_slabs ( struct kmem_cache_node * n )
{ return 0 ; }
2008-04-14 19:11:40 +03:00
static inline void inc_slabs_node ( struct kmem_cache * s , int node ,
int objects ) { }
static inline void dec_slabs_node ( struct kmem_cache * s , int node ,
int objects ) { }
2010-08-25 14:07:16 -05:00
static inline int slab_pre_alloc_hook ( struct kmem_cache * s , gfp_t flags )
{ return 0 ; }
static inline void slab_post_alloc_hook ( struct kmem_cache * s , gfp_t flags ,
void * object ) { }
static inline void slab_free_hook ( struct kmem_cache * s , void * x ) { }
2010-10-05 13:57:26 -05:00
# endif /* CONFIG_SLUB_DEBUG */
2008-04-14 19:11:40 +03:00
2007-05-06 14:49:36 -07:00
/*
* Slab allocation and freeing
*/
2008-04-14 19:11:40 +03:00
static inline struct page * alloc_slab_page ( gfp_t flags , int node ,
struct kmem_cache_order_objects oo )
{
int order = oo_order ( oo ) ;
2008-11-25 16:55:53 +01:00
flags | = __GFP_NOTRACK ;
2010-07-09 14:07:10 -05:00
if ( node = = NUMA_NO_NODE )
2008-04-14 19:11:40 +03:00
return alloc_pages ( flags , order ) ;
else
2010-04-14 23:58:36 +09:00
return alloc_pages_exact_node ( node , flags , order ) ;
2008-04-14 19:11:40 +03:00
}
2007-05-06 14:49:36 -07:00
static struct page * allocate_slab ( struct kmem_cache * s , gfp_t flags , int node )
{
2008-01-07 23:20:27 -08:00
struct page * page ;
2008-04-14 19:11:31 +03:00
struct kmem_cache_order_objects oo = s - > oo ;
2009-06-24 21:59:51 +03:00
gfp_t alloc_gfp ;
2007-05-06 14:49:36 -07:00
2011-06-01 12:25:44 -05:00
flags & = gfp_allowed_mask ;
if ( flags & __GFP_WAIT )
local_irq_enable ( ) ;
2008-02-14 14:21:32 -08:00
flags | = s - > allocflags ;
2007-10-16 01:25:52 -07:00
2009-06-24 21:59:51 +03:00
/*
* Let the initial higher - order allocation fail under memory pressure
* so we fall - back to the minimum order allocation .
*/
alloc_gfp = ( flags | __GFP_NOWARN | __GFP_NORETRY ) & ~ __GFP_NOFAIL ;
page = alloc_slab_page ( alloc_gfp , node , oo ) ;
2008-04-14 19:11:40 +03:00
if ( unlikely ( ! page ) ) {
oo = s - > min ;
/*
* Allocation may have failed due to fragmentation .
* Try a lower order alloc if possible
*/
page = alloc_slab_page ( flags , node , oo ) ;
2007-05-06 14:49:36 -07:00
2011-06-01 12:25:44 -05:00
if ( page )
stat ( s , ORDER_FALLBACK ) ;
2008-04-14 19:11:40 +03:00
}
2008-04-04 00:54:48 +02:00
2011-06-01 12:25:44 -05:00
if ( flags & __GFP_WAIT )
local_irq_disable ( ) ;
if ( ! page )
return NULL ;
2008-04-04 00:54:48 +02:00
if ( kmemcheck_enabled
2009-08-19 21:44:13 +03:00
& & ! ( s - > flags & ( SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS ) ) ) {
2008-11-25 16:55:53 +01:00
int pages = 1 < < oo_order ( oo ) ;
kmemcheck_alloc_shadow ( page , oo_order ( oo ) , flags , node ) ;
/*
* Objects from caches that have a constructor don ' t get
* cleared when they ' re allocated , so we need to do it here .
*/
if ( s - > ctor )
kmemcheck_mark_uninitialized_pages ( page , pages ) ;
else
kmemcheck_mark_unallocated_pages ( page , pages ) ;
2008-04-04 00:54:48 +02:00
}
2008-04-14 19:11:31 +03:00
page - > objects = oo_objects ( oo ) ;
2007-05-06 14:49:36 -07:00
mod_zone_page_state ( page_zone ( page ) ,
( s - > flags & SLAB_RECLAIM_ACCOUNT ) ?
NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE ,
2008-04-14 19:11:40 +03:00
1 < < oo_order ( oo ) ) ;
2007-05-06 14:49:36 -07:00
return page ;
}
static void setup_object ( struct kmem_cache * s , struct page * page ,
void * object )
{
2007-05-16 22:11:00 -07:00
setup_object_debug ( s , page , object ) ;
2007-05-06 14:50:17 -07:00
if ( unlikely ( s - > ctor ) )
2008-07-25 19:45:34 -07:00
s - > ctor ( object ) ;
2007-05-06 14:49:36 -07:00
}
static struct page * new_slab ( struct kmem_cache * s , gfp_t flags , int node )
{
struct page * page ;
void * start ;
void * last ;
void * p ;
2007-10-16 01:25:41 -07:00
BUG_ON ( flags & GFP_SLAB_BUG_MASK ) ;
2007-05-06 14:49:36 -07:00
2007-10-16 01:25:41 -07:00
page = allocate_slab ( s ,
flags & ( GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK ) , node ) ;
2007-05-06 14:49:36 -07:00
if ( ! page )
goto out ;
2008-04-14 19:11:40 +03:00
inc_slabs_node ( s , page_to_nid ( page ) , page - > objects ) ;
2007-05-06 14:49:36 -07:00
page - > slab = s ;
page - > flags | = 1 < < PG_slab ;
start = page_address ( page ) ;
if ( unlikely ( s - > flags & SLAB_POISON ) )
2008-04-14 19:11:31 +03:00
memset ( start , POISON_INUSE , PAGE_SIZE < < compound_order ( page ) ) ;
2007-05-06 14:49:36 -07:00
last = start ;
2008-04-14 19:11:31 +03:00
for_each_object ( p , s , start , page - > objects ) {
2007-05-06 14:49:36 -07:00
setup_object ( s , page , last ) ;
set_freepointer ( s , last , p ) ;
last = p ;
}
setup_object ( s , page , last ) ;
2008-03-01 13:40:44 -08:00
set_freepointer ( s , last , NULL ) ;
2007-05-06 14:49:36 -07:00
page - > freelist = start ;
2011-08-09 16:12:24 -05:00
page - > inuse = page - > objects ;
2011-06-01 12:25:46 -05:00
page - > frozen = 1 ;
2007-05-06 14:49:36 -07:00
out :
return page ;
}
static void __free_slab ( struct kmem_cache * s , struct page * page )
{
2008-04-14 19:11:31 +03:00
int order = compound_order ( page ) ;
int pages = 1 < < order ;
2007-05-06 14:49:36 -07:00
2010-07-09 14:07:14 -05:00
if ( kmem_cache_debug ( s ) ) {
2007-05-06 14:49:36 -07:00
void * p ;
slab_pad_check ( s , page ) ;
2008-04-14 19:11:31 +03:00
for_each_object ( p , s , page_address ( page ) ,
page - > objects )
2010-09-29 07:15:01 -05:00
check_object ( s , page , p , SLUB_RED_INACTIVE ) ;
2007-05-06 14:49:36 -07:00
}
2008-11-25 16:55:53 +01:00
kmemcheck_free_shadow ( page , compound_order ( page ) ) ;
2008-04-04 00:54:48 +02:00
2007-05-06 14:49:36 -07:00
mod_zone_page_state ( page_zone ( page ) ,
( s - > flags & SLAB_RECLAIM_ACCOUNT ) ?
NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE ,
2008-01-07 23:20:27 -08:00
- pages ) ;
2007-05-06 14:49:36 -07:00
2008-04-14 18:52:18 +03:00
__ClearPageSlab ( page ) ;
reset_page_mapcount ( page ) ;
2009-05-05 19:13:44 +10:00
if ( current - > reclaim_state )
current - > reclaim_state - > reclaimed_slab + = pages ;
2008-04-14 19:11:31 +03:00
__free_pages ( page , order ) ;
2007-05-06 14:49:36 -07:00
}
2011-03-10 15:22:00 +08:00
# define need_reserve_slab_rcu \
( sizeof ( ( ( struct page * ) NULL ) - > lru ) < sizeof ( struct rcu_head ) )
2007-05-06 14:49:36 -07:00
static void rcu_free_slab ( struct rcu_head * h )
{
struct page * page ;
2011-03-10 15:22:00 +08:00
if ( need_reserve_slab_rcu )
page = virt_to_head_page ( h ) ;
else
page = container_of ( ( struct list_head * ) h , struct page , lru ) ;
2007-05-06 14:49:36 -07:00
__free_slab ( page - > slab , page ) ;
}
static void free_slab ( struct kmem_cache * s , struct page * page )
{
if ( unlikely ( s - > flags & SLAB_DESTROY_BY_RCU ) ) {
2011-03-10 15:22:00 +08:00
struct rcu_head * head ;
if ( need_reserve_slab_rcu ) {
int order = compound_order ( page ) ;
int offset = ( PAGE_SIZE < < order ) - s - > reserved ;
VM_BUG_ON ( s - > reserved ! = sizeof ( * head ) ) ;
head = page_address ( page ) + offset ;
} else {
/*
* RCU free overloads the RCU head over the LRU
*/
head = ( void * ) & page - > lru ;
}
2007-05-06 14:49:36 -07:00
call_rcu ( head , rcu_free_slab ) ;
} else
__free_slab ( s , page ) ;
}
static void discard_slab ( struct kmem_cache * s , struct page * page )
{
2008-04-14 19:11:40 +03:00
dec_slabs_node ( s , page_to_nid ( page ) , page - > objects ) ;
2007-05-06 14:49:36 -07:00
free_slab ( s , page ) ;
}
/*
2011-06-01 12:25:50 -05:00
* Management of partially allocated slabs .
*
* list_lock must be held .
2007-05-06 14:49:36 -07:00
*/
2011-06-01 12:25:50 -05:00
static inline void add_partial ( struct kmem_cache_node * n ,
2008-01-07 23:20:27 -08:00
struct page * page , int tail )
2007-05-06 14:49:36 -07:00
{
2007-05-06 14:49:44 -07:00
n - > nr_partial + + ;
2011-08-24 08:57:52 +08:00
if ( tail = = DEACTIVATE_TO_TAIL )
2008-01-07 23:20:27 -08:00
list_add_tail ( & page - > lru , & n - > partial ) ;
else
list_add ( & page - > lru , & n - > partial ) ;
2007-05-06 14:49:36 -07:00
}
2011-06-01 12:25:50 -05:00
/*
* list_lock must be held .
*/
static inline void remove_partial ( struct kmem_cache_node * n ,
2010-09-28 08:10:28 -05:00
struct page * page )
{
list_del ( & page - > lru ) ;
n - > nr_partial - - ;
}
2007-05-06 14:49:36 -07:00
/*
2011-06-01 12:25:50 -05:00
* Lock slab , remove from the partial list and put the object into the
* per cpu freelist .
2007-05-06 14:49:36 -07:00
*
2011-08-09 16:12:26 -05:00
* Returns a list of objects or NULL if it fails .
*
2007-05-09 02:32:39 -07:00
* Must hold list_lock .
2007-05-06 14:49:36 -07:00
*/
2011-08-09 16:12:26 -05:00
static inline void * acquire_slab ( struct kmem_cache * s ,
2011-08-09 16:12:25 -05:00
struct kmem_cache_node * n , struct page * page ,
2011-08-09 16:12:27 -05:00
int mode )
2007-05-06 14:49:36 -07:00
{
2011-06-01 12:25:52 -05:00
void * freelist ;
unsigned long counters ;
struct page new ;
/*
* Zap the freelist and set the frozen bit .
* The old freelist is the list of objects for the
* per cpu allocation list .
*/
do {
freelist = page - > freelist ;
counters = page - > counters ;
new . counters = counters ;
2011-08-09 16:12:27 -05:00
if ( mode )
new . inuse = page - > objects ;
2011-06-01 12:25:52 -05:00
VM_BUG_ON ( new . frozen ) ;
new . frozen = 1 ;
2011-07-14 12:49:12 -05:00
} while ( ! __cmpxchg_double_slab ( s , page ,
2011-06-01 12:25:52 -05:00
freelist , counters ,
NULL , new . counters ,
" lock and freeze " ) ) ;
remove_partial ( n , page ) ;
2011-08-09 16:12:27 -05:00
return freelist ;
2007-05-06 14:49:36 -07:00
}
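/*
 * Illustrative sketch only, not part of this file. acquire_slab ( ) updates
 * the freelist pointer and the counters word together with
 * __cmpxchg_double_slab ( ) and retries until no other CPU changed them in
 * between. The userspace sketch below shows only the retry pattern, using
 * C11 atomics on a single hypothetical 32-bit counters word whose top bit
 * stands in for the frozen flag. Returning -1 for an already frozen slab is
 * an assumption of the sketch; the kernel code asserts instead.
 */

```c
#include <stdatomic.h>
#include <stdint.h>

#define SK_FROZEN (1u << 31) /* hypothetical position of the frozen bit */

/*
 * Atomically set the frozen bit. Returns the in-use count observed at the
 * moment the update succeeded, or -1 if the word was already frozen.
 */
int sketch_freeze(_Atomic uint32_t *counters)
{
	uint32_t old = atomic_load(counters);
	uint32_t updated;

	do {
		if (old & SK_FROZEN)
			return -1;
		updated = old | SK_FROZEN;
		/* On failure, old is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(counters, &old, updated));

	return (int)(old & ~SK_FROZEN);
}
```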
2011-08-09 16:12:27 -05:00
static int put_cpu_partial ( struct kmem_cache * s , struct page * page , int drain ) ;
2007-05-06 14:49:36 -07:00
/*
2007-05-09 02:32:39 -07:00
* Try to allocate a partial slab from a specific node .
2007-05-06 14:49:36 -07:00
*/
2011-08-09 16:12:26 -05:00
static void * get_partial_node ( struct kmem_cache * s ,
2011-08-09 16:12:25 -05:00
struct kmem_cache_node * n , struct kmem_cache_cpu * c )
2007-05-06 14:49:36 -07:00
{
2011-08-09 16:12:27 -05:00
struct page * page , * page2 ;
void * object = NULL ;
2007-05-06 14:49:36 -07:00
/*
* Racy check . If we mistakenly see no partial slabs then we
* just allocate an empty slab . If we mistakenly try to get a
2007-05-09 02:32:39 -07:00
* partial slab and there is none available then get_partial_node ( )
* will return NULL .
2007-05-06 14:49:36 -07:00
*/
if ( ! n | | ! n - > nr_partial )
return NULL ;
spin_lock ( & n - > list_lock ) ;
2011-08-09 16:12:27 -05:00
list_for_each_entry_safe ( page , page2 , & n - > partial , lru ) {
2011-09-07 10:26:36 +08:00
void * t = acquire_slab ( s , n , page , object = = NULL ) ;
2011-08-09 16:12:27 -05:00
int available ;
if ( ! t )
break ;
2011-09-07 10:26:36 +08:00
if ( ! object ) {
2011-08-09 16:12:27 -05:00
c - > page = page ;
c - > node = page_to_nid ( page ) ;
stat ( s , ALLOC_FROM_PARTIAL ) ;
object = t ;
available = page - > objects - page - > inuse ;
} else {
page - > freelist = t ;
available = put_cpu_partial ( s , page , 0 ) ;
2012-02-03 23:34:56 +08:00
stat ( s , CPU_PARTIAL_NODE ) ;
2011-08-09 16:12:27 -05:00
}
if ( kmem_cache_debug ( s ) | | available > s - > cpu_partial / 2 )
break ;
2011-08-09 16:12:26 -05:00
}
2007-05-06 14:49:36 -07:00
spin_unlock ( & n - > list_lock ) ;
2011-08-09 16:12:26 -05:00
return object ;
2007-05-06 14:49:36 -07:00
}
/*
2007-05-09 02:32:39 -07:00
* Get a page from somewhere . Search in increasing NUMA distances .
2007-05-06 14:49:36 -07:00
*/
2011-08-09 16:12:25 -05:00
static struct page * get_any_partial ( struct kmem_cache * s , gfp_t flags ,
struct kmem_cache_cpu * c )
2007-05-06 14:49:36 -07:00
{
# ifdef CONFIG_NUMA
struct zonelist * zonelist ;
2008-04-28 02:12:17 -07:00
struct zoneref * z ;
2008-04-28 02:12:16 -07:00
struct zone * zone ;
enum zone_type high_zoneidx = gfp_zone ( flags ) ;
2011-08-09 16:12:26 -05:00
void * object ;
cpuset: mm: reduce large amounts of memory barrier related damage v3
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
                        3.3.0-rc3              3.3.0-rc3
                        rc3-vanilla            nobarrier-v2r1
Clients 1 UserTime           0.07 ( 0.00%)          0.08 (-14.19%)
Clients 2 UserTime           0.07 ( 0.00%)          0.07 ( 2.72%)
Clients 4 UserTime           0.08 ( 0.00%)          0.07 ( 3.29%)
Clients 1 SysTime            0.70 ( 0.00%)          0.65 ( 6.65%)
Clients 2 SysTime            0.85 ( 0.00%)          0.82 ( 3.65%)
Clients 4 SysTime            1.41 ( 0.00%)          1.41 ( 0.32%)
Clients 1 WallTime           0.77 ( 0.00%)          0.74 ( 4.19%)
Clients 2 WallTime           0.47 ( 0.00%)          0.45 ( 3.73%)
Clients 4 WallTime           0.38 ( 0.00%)          0.37 ( 1.58%)
Clients 1 Flt/sec/cpu   497620.28 ( 0.00%)     520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu   414639.05 ( 0.00%)     429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu   257959.16 ( 0.00%)     258761.48 ( 0.31%)
Clients 1 Flt/sec       495161.39 ( 0.00%)     517292.87 ( 4.47%)
Clients 2 Flt/sec       820325.95 ( 0.00%)     850289.77 ( 3.65%)
Clients 4 Flt/sec      1020068.93 ( 0.00%)    1022674.06 ( 0.26%)

MMTests Statistics: duration
Sys Time Running Test (seconds)           135.68    132.17
User+Sys Time Running Test (seconds)      164.2     160.13
Total Elapsed Time (seconds)              123.46    120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21 16:34:11 -07:00
unsigned int cpuset_mems_cookie ;
2007-05-06 14:49:36 -07:00
/*
2007-05-09 02:32:39 -07:00
* The defrag ratio allows a configuration of the tradeoffs between
* inter node defragmentation and node local allocations . A lower
* defrag_ratio increases the tendency to do local allocations
* instead of attempting to obtain partial slabs from other nodes .
2007-05-06 14:49:36 -07:00
*
2007-05-09 02:32:39 -07:00
* If the defrag_ratio is set to 0 then kmalloc ( ) always
* returns node local objects . If the ratio is higher then kmalloc ( )
* may return off node objects because partial slabs are obtained
* from other nodes and filled up .
2007-05-06 14:49:36 -07:00
*
2008-02-15 23:45:26 -08:00
* If /sys/kernel/slab/xx/defrag_ratio is set to 100 ( which makes
2007-05-09 02:32:39 -07:00
* defrag_ratio = 1000 ) then every ( well almost ) allocation will
* first attempt to defrag slab caches on other nodes . This means
* scanning over all nodes to look for partial slabs which may be
* expensive if we do it every time we are trying to find a slab
* with available objects .
2007-05-06 14:49:36 -07:00
*/
2008-01-07 23:20:26 -08:00
if ( ! s - > remote_node_defrag_ratio | |
get_cycles ( ) % 1024 > s - > remote_node_defrag_ratio )
2007-05-06 14:49:36 -07:00
return NULL ;
2012-03-21 16:34:11 -07:00
do {
cpuset_mems_cookie = get_mems_allowed ( ) ;
zonelist = node_zonelist ( slab_node ( current - > mempolicy ) , flags ) ;
for_each_zone_zonelist ( zone , z , zonelist , high_zoneidx ) {
struct kmem_cache_node * n ;
n = get_node ( s , zone_to_nid ( zone ) ) ;
if ( n & & cpuset_zone_allowed_hardwall ( zone , flags ) & &
n - > nr_partial > s - > min_partial ) {
object = get_partial_node ( s , n , c ) ;
if ( object ) {
/*
* Return the object even if
* put_mems_allowed indicated that
* the cpuset mems_allowed was
* updated in parallel . It ' s a
* harmless race between the alloc
* and the cpuset update .
*/
put_mems_allowed ( cpuset_mems_cookie ) ;
return object ;
}
2010-05-24 14:32:08 -07:00
}
2007-05-06 14:49:36 -07:00
}
2012-03-21 16:34:11 -07:00
} while ( ! put_mems_allowed ( cpuset_mems_cookie ) ) ;
2007-05-06 14:49:36 -07:00
# endif
return NULL ;
}
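The defrag ratio check at the top of get_any_partial() can be read as a cheap probabilistic gate. The following is a minimal userspace sketch of that gate, not kernel code: the `sk_` names are hypothetical, and the cycle counter is passed in as a plain argument instead of being read with get_cycles().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Mirror of the gate in get_any_partial(): a ratio of 0 disables remote
 * searches; otherwise the search is skipped whenever the low bits of the
 * cycle counter exceed the ratio.
 */
static inline bool sk_try_remote(unsigned int ratio, unsigned long cycles)
{
	if (!ratio || cycles % 1024 > ratio)
		return false;
	return true;
}

/* Count how many of the 1024 possible residues allow a remote search. */
static inline unsigned int sk_count_remote(unsigned int ratio)
{
	unsigned int hits = 0;

	for (unsigned long c = 0; c < 1024; c++)
		if (sk_try_remote(ratio, c))
			hits++;
	return hits;
}
```

With an internal ratio of 1000, residues 0 through 1000 pass the gate, which is why the comment says "every (well almost)" allocation attempts the remote search.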
/*
* Get a partial page , lock it and return it .
*/
2011-08-09 16:12:26 -05:00
static void * get_partial ( struct kmem_cache * s , gfp_t flags , int node ,
2011-08-09 16:12:25 -05:00
struct kmem_cache_cpu * c )
2007-05-06 14:49:36 -07:00
{
2011-08-09 16:12:26 -05:00
void * object ;
2010-07-09 14:07:10 -05:00
int searchnode = ( node = = NUMA_NO_NODE ) ? numa_node_id ( ) : node ;
2007-05-06 14:49:36 -07:00
2011-08-09 16:12:26 -05:00
object = get_partial_node ( s , get_node ( s , searchnode ) , c ) ;
if ( object | | node ! = NUMA_NO_NODE )
return object ;
2007-05-06 14:49:36 -07:00
2011-08-09 16:12:25 -05:00
return get_any_partial ( s , flags , c ) ;
2007-05-06 14:49:36 -07:00
}
2011-02-25 11:38:54 -06:00
# ifdef CONFIG_PREEMPT
/*
* Calculate the next globally unique transaction for disambiguation
* during cmpxchg . The transactions start with the cpu number and are then
* incremented by CONFIG_NR_CPUS .
*/
# define TID_STEP roundup_pow_of_two(CONFIG_NR_CPUS)
# else
/*
* No preemption supported therefore also no need to check for
* different cpus .
*/
# define TID_STEP 1
# endif
static inline unsigned long next_tid ( unsigned long tid )
{
return tid + TID_STEP ;
}
static inline unsigned int tid_to_cpu ( unsigned long tid )
{
return tid % TID_STEP ;
}
static inline unsigned long tid_to_event ( unsigned long tid )
{
return tid / TID_STEP ;
}
static inline unsigned int init_tid ( int cpu )
{
return cpu ;
}
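The transaction id helpers above pack a cpu number and an event count into one word. A minimal userspace sketch of that arithmetic follows, assuming a hypothetical configuration of 6 possible cpus (so the step rounds up to 8); the `sketch_` names are illustrative stand-ins, not kernel symbols.

```c
#include <assert.h>

/* Assumption for illustration only: 6 possible cpus. */
#define SKETCH_NR_CPUS 6u

/* Userspace stand-in for roundup_pow_of_two(). */
static inline unsigned long sketch_roundup_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return p;
}

static inline unsigned long sketch_tid_step(void)
{
	return sketch_roundup_pow_of_two(SKETCH_NR_CPUS);
}

/* A tid starts at the cpu number ... */
static inline unsigned long sketch_init_tid(unsigned int cpu)
{
	return cpu;
}

/* ... and advances by the step, so the low part keeps naming the cpu. */
static inline unsigned long sketch_next_tid(unsigned long tid)
{
	return tid + sketch_tid_step();
}

static inline unsigned int sketch_tid_to_cpu(unsigned long tid)
{
	return tid % sketch_tid_step();
}

static inline unsigned long sketch_tid_to_event(unsigned long tid)
{
	return tid / sketch_tid_step();
}
```

Under these assumptions, cpu 3 starts at tid 3 and moves to 11 and then 19; the remainder modulo 8 stays 3 while the quotient counts the operations performed.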
static inline void note_cmpxchg_failure ( const char * n ,
const struct kmem_cache * s , unsigned long tid )
{
# ifdef SLUB_DEBUG_CMPXCHG
unsigned long actual_tid = __this_cpu_read ( s - > cpu_slab - > tid ) ;
printk ( KERN_INFO " %s %s: cmpxchg redo " , n , s - > name ) ;
# ifdef CONFIG_PREEMPT
if ( tid_to_cpu ( tid ) ! = tid_to_cpu ( actual_tid ) )
printk ( " due to cpu change %d -> %d \n " ,
tid_to_cpu ( tid ) , tid_to_cpu ( actual_tid ) ) ;
else
# endif
if ( tid_to_event ( tid ) ! = tid_to_event ( actual_tid ) )
printk ( " due to cpu running other code. Event %ld->%ld \n " ,
tid_to_event ( tid ) , tid_to_event ( actual_tid ) ) ;
else
printk ( " for unknown reason: actual=%lx was=%lx target=%lx \n " ,
actual_tid , tid , next_tid ( tid ) ) ;
# endif
2011-03-22 13:35:00 -05:00
stat ( s , CMPXCHG_DOUBLE_CPU_FAIL ) ;
2011-02-25 11:38:54 -06:00
}
void init_kmem_cache_cpus ( struct kmem_cache * s )
{
int cpu ;
for_each_possible_cpu ( cpu )
per_cpu_ptr ( s - > cpu_slab , cpu ) - > tid = init_tid ( cpu ) ;
}
2011-06-01 12:25:52 -05:00
2007-05-06 14:49:36 -07:00
/*
* Remove the cpu slab
*/
2007-10-16 01:26:05 -07:00
static void deactivate_slab ( struct kmem_cache * s , struct kmem_cache_cpu * c )
2007-05-06 14:49:36 -07:00
{
2011-06-01 12:25:52 -05:00
enum slab_modes { M_NONE , M_PARTIAL , M_FULL , M_FREE } ;
2007-10-16 01:26:05 -07:00
struct page * page = c - > page ;
2011-06-01 12:25:52 -05:00
struct kmem_cache_node * n = get_node ( s , page_to_nid ( page ) ) ;
int lock = 0 ;
enum slab_modes l = M_NONE , m = M_NONE ;
void * freelist ;
void * nextfree ;
2011-08-24 08:57:52 +08:00
int tail = DEACTIVATE_TO_HEAD ;
2011-06-01 12:25:52 -05:00
struct page new ;
struct page old ;
if ( page - > freelist ) {
2009-12-18 16:26:23 -06:00
stat ( s , DEACTIVATE_REMOTE_FREES ) ;
2011-08-24 08:57:52 +08:00
tail = DEACTIVATE_TO_TAIL ;
2011-06-01 12:25:52 -05:00
}
c - > tid = next_tid ( c - > tid ) ;
c - > page = NULL ;
freelist = c - > freelist ;
c - > freelist = NULL ;
2007-05-10 03:15:16 -07:00
/*
2011-06-01 12:25:52 -05:00
* Stage one : Free all available per cpu objects back
* to the page freelist while it is still frozen . Leave the
* last one .
*
* There is no need to take the list - > lock because the page
* is still frozen .
*/
while ( freelist & & ( nextfree = get_freepointer ( s , freelist ) ) ) {
void * prior ;
unsigned long counters ;
do {
prior = page - > freelist ;
counters = page - > counters ;
set_freepointer ( s , freelist , prior ) ;
new . counters = counters ;
new . inuse - - ;
VM_BUG_ON ( ! new . frozen ) ;
2011-07-14 12:49:12 -05:00
} while ( ! __cmpxchg_double_slab ( s , page ,
2011-06-01 12:25:52 -05:00
prior , counters ,
freelist , new . counters ,
" drain percpu freelist " ) ) ;
freelist = nextfree ;
}
2007-05-10 03:15:16 -07:00
/*
2011-06-01 12:25:52 -05:00
* Stage two : Ensure that the page is unfrozen while the
* list presence reflects the actual number of objects
* during unfreeze .
*
* We setup the list membership and then perform a cmpxchg
* with the count . If there is a mismatch then the page
* is not unfrozen but the page is on the wrong list .
*
* Then we restart the process which may have to remove
* the page from the list that we just put it on again
* because the number of objects in the slab may have
* changed .
2007-05-10 03:15:16 -07:00
*/
2011-06-01 12:25:52 -05:00
redo :
2007-05-10 03:15:16 -07:00
2011-06-01 12:25:52 -05:00
old . freelist = page - > freelist ;
old . counters = page - > counters ;
VM_BUG_ON ( ! old . frozen ) ;
2008-01-07 23:20:27 -08:00
2011-06-01 12:25:52 -05:00
/* Determine target state of the slab */
new . counters = old . counters ;
if ( freelist ) {
new . inuse - - ;
set_freepointer ( s , freelist , old . freelist ) ;
new . freelist = freelist ;
} else
new . freelist = old . freelist ;
new . frozen = 0 ;
2011-08-09 13:01:32 -05:00
if ( ! new . inuse & & n - > nr_partial > s - > min_partial )
2011-06-01 12:25:52 -05:00
m = M_FREE ;
else if ( new . freelist ) {
m = M_PARTIAL ;
if ( ! lock ) {
lock = 1 ;
/*
* Taking the spinlock removes the possibility
* that acquire_slab ( ) will see a slab page that
* is frozen
*/
spin_lock ( & n - > list_lock ) ;
}
} else {
m = M_FULL ;
if ( kmem_cache_debug ( s ) & & ! lock ) {
lock = 1 ;
/*
* This also ensures that the scanning of full
* slabs from diagnostic functions will not see
* any frozen slabs .
*/
spin_lock ( & n - > list_lock ) ;
}
}
if ( l ! = m ) {
if ( l = = M_PARTIAL )
remove_partial ( n , page ) ;
else if ( l = = M_FULL )
2007-05-10 03:15:16 -07:00
2011-06-01 12:25:52 -05:00
remove_full ( s , page ) ;
if ( m = = M_PARTIAL ) {
add_partial ( n , page , tail ) ;
2011-08-24 08:57:52 +08:00
stat ( s , tail ) ;
2011-06-01 12:25:52 -05:00
} else if ( m = = M_FULL ) {
2007-05-10 03:15:16 -07:00
2011-06-01 12:25:52 -05:00
stat ( s , DEACTIVATE_FULL ) ;
add_full ( s , n , page ) ;
}
}
l = m ;
2011-07-14 12:49:12 -05:00
if ( ! __cmpxchg_double_slab ( s , page ,
2011-06-01 12:25:52 -05:00
old . freelist , old . counters ,
new . freelist , new . counters ,
" unfreezing slab " ) )
goto redo ;
if ( lock )
spin_unlock ( & n - > list_lock ) ;
if ( m = = M_FREE ) {
stat ( s , DEACTIVATE_EMPTY ) ;
discard_slab ( s , page ) ;
stat ( s , FREE_SLAB ) ;
2007-05-10 03:15:16 -07:00
}
2007-05-06 14:49:36 -07:00
}
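Stage two of deactivate_slab() builds the target state by copying `counters` as one word and then adjusting individual fields. The sketch below shows why that works, assuming a union layout modelled loosely on the slab fields of struct page; the field widths and all `sketch_` and `SK_` names are assumptions for illustration, and the final cmpxchg step is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: three bitfields overlaid on a single counters word. */
struct sketch_slab {
	void *freelist;
	union {
		unsigned long counters;
		struct {
			unsigned inuse:16;
			unsigned objects:15;
			unsigned frozen:1;
		};
	};
};

enum sketch_mode { SK_FREE, SK_PARTIAL, SK_FULL };

/*
 * Compute the unfrozen target state. Copying counters carries inuse,
 * objects and frozen over together; the fields are then edited in place.
 */
static inline struct sketch_slab
sketch_unfreeze_target(const struct sketch_slab *old, void *extra_free)
{
	struct sketch_slab target;

	target.counters = old->counters;
	if (extra_free) {
		target.inuse--;			/* one more object goes back */
		target.freelist = extra_free;
	} else {
		target.freelist = old->freelist;
	}
	target.frozen = 0;
	return target;
}

/* Same decision order as the M_FREE / M_PARTIAL / M_FULL selection. */
static inline enum sketch_mode
sketch_classify(const struct sketch_slab *s, unsigned long nr_partial,
		unsigned long min_partial)
{
	if (!s->inuse && nr_partial > min_partial)
		return SK_FREE;
	if (s->freelist)
		return SK_PARTIAL;
	return SK_FULL;
}
```

In the function above, the old and target snapshots are then handed to __cmpxchg_double_slab(), which installs the target only if the page still matches the old snapshot.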
2011-08-09 16:12:27 -05:00
/* Unfreeze all the cpu partial slabs */
static void unfreeze_partials ( struct kmem_cache * s )
{
struct kmem_cache_node * n = NULL ;
struct kmem_cache_cpu * c = this_cpu_ptr ( s - > cpu_slab ) ;
2011-11-14 13:34:13 +08:00
struct page * page , * discard_page = NULL ;
2011-08-09 16:12:27 -05:00
while ( ( page = c - > partial ) ) {
enum slab_modes { M_PARTIAL , M_FREE } ;
enum slab_modes l , m ;
struct page new ;
struct page old ;
c - > partial = page - > next ;
l = M_FREE ;
do {
old . freelist = page - > freelist ;
old . counters = page - > counters ;
VM_BUG_ON ( ! old . frozen ) ;
new . counters = old . counters ;
new . freelist = old . freelist ;
new . frozen = 0 ;
2011-09-06 14:46:01 +08:00
if ( ! new . inuse & & ( ! n | | n - > nr_partial > s - > min_partial ) )
2011-08-09 16:12:27 -05:00
m = M_FREE ;
else {
struct kmem_cache_node * n2 = get_node ( s ,
page_to_nid ( page ) ) ;
m = M_PARTIAL ;
if ( n ! = n2 ) {
if ( n )
spin_unlock ( & n - > list_lock ) ;
n = n2 ;
spin_lock ( & n - > list_lock ) ;
}
}
if ( l ! = m ) {
2011-11-11 14:54:14 +08:00
if ( l = = M_PARTIAL ) {
2011-08-09 16:12:27 -05:00
remove_partial ( n , page ) ;
2011-11-11 14:54:14 +08:00
stat ( s , FREE_REMOVE_PARTIAL ) ;
} else {
2011-11-11 08:33:48 +08:00
add_partial ( n , page ,
DEACTIVATE_TO_TAIL ) ;
2011-11-11 14:54:14 +08:00
stat ( s , FREE_ADD_PARTIAL ) ;
}
2011-08-09 16:12:27 -05:00
l = m ;
}
} while ( ! cmpxchg_double_slab ( s , page ,
old . freelist , old . counters ,
new . freelist , new . counters ,
" unfreezing slab " ) ) ;
if ( m = = M_FREE ) {
2011-11-14 13:34:13 +08:00
page - > next = discard_page ;
discard_page = page ;
2011-08-09 16:12:27 -05:00
}
}
if ( n )
spin_unlock ( & n - > list_lock ) ;
2011-11-14 13:34:13 +08:00
while ( discard_page ) {
page = discard_page ;
discard_page = discard_page - > next ;
stat ( s , DEACTIVATE_EMPTY ) ;
discard_slab ( s , page ) ;
stat ( s , FREE_SLAB ) ;
}
2011-08-09 16:12:27 -05:00
}
/*
* Put a page that was just frozen ( in __slab_free ) into a partial page
* slot if available . This is done without interrupts disabled and without
* preemption disabled . The cmpxchg is racy and may put the partial page
* onto a random cpu's partial slot .
*
* If we did not find a slot then simply move all the partials to the
* per node partial list .
*/
int put_cpu_partial ( struct kmem_cache * s , struct page * page , int drain )
{
struct page * oldpage ;
int pages ;
int pobjects ;
do {
pages = 0 ;
pobjects = 0 ;
oldpage = this_cpu_read ( s - > cpu_slab - > partial ) ;
if ( oldpage ) {
pobjects = oldpage - > pobjects ;
pages = oldpage - > pages ;
if ( drain & & pobjects > s - > cpu_partial ) {
unsigned long flags ;
/*
* partial array is full . Move the existing
* set to the per node partial list .
*/
local_irq_save ( flags ) ;
unfreeze_partials ( s ) ;
local_irq_restore ( flags ) ;
pobjects = 0 ;
pages = 0 ;
2012-02-03 23:34:56 +08:00
stat ( s , CPU_PARTIAL_DRAIN ) ;
2011-08-09 16:12:27 -05:00
}
}
pages + + ;
pobjects + = page - > objects - page - > inuse ;
page - > pages = pages ;
page - > pobjects = pobjects ;
page - > next = oldpage ;
2011-12-22 11:58:51 -06:00
} while ( this_cpu_cmpxchg ( s - > cpu_slab - > partial , oldpage , page ) ! = oldpage ) ;
2011-08-09 16:12:27 -05:00
return pobjects ;
}
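The retry loop in put_cpu_partial() is a lock-free push onto a singly linked list whose head also carries running totals. A minimal userspace sketch using C11 atomics follows; it models a single list rather than a per cpu slot, omits the drain step, and all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_page {
	struct sketch_page *next;
	int pages;	/* pages on the list headed by this page */
	int pobjects;	/* free objects on that list (approximate) */
	int objects;	/* total objects in this page */
	int inuse;	/* allocated objects in this page */
};

/* Stand-in for the per cpu partial slot; starts out NULL. */
static _Atomic(struct sketch_page *) sketch_partial;

/* Push a page and return the updated free object total. */
static inline int sketch_put_partial(struct sketch_page *page)
{
	struct sketch_page *oldpage = atomic_load(&sketch_partial);
	int pages, pobjects;

	do {
		pages = 0;
		pobjects = 0;
		if (oldpage) {
			pages = oldpage->pages;
			pobjects = oldpage->pobjects;
		}
		pages++;
		pobjects += page->objects - page->inuse;

		page->pages = pages;
		page->pobjects = pobjects;
		page->next = oldpage;
		/* On failure oldpage is reloaded and the totals recomputed. */
	} while (!atomic_compare_exchange_weak(&sketch_partial, &oldpage, page));

	return pobjects;
}
```

Keeping the totals in the head page means a reader needs only one pointer load to learn the list length, at the cost of the totals being stale if pages are later modified.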
2007-10-16 01:26:05 -07:00
static inline void flush_slab ( struct kmem_cache * s , struct kmem_cache_cpu * c )
2007-05-06 14:49:36 -07:00
{
2009-12-18 16:26:23 -06:00
stat ( s , CPUSLAB_FLUSH ) ;
2007-10-16 01:26:05 -07:00
deactivate_slab ( s , c ) ;
2007-05-06 14:49:36 -07:00
}
/*
* Flush cpu slab .
2008-02-15 23:45:26 -08:00
*
2007-05-06 14:49:36 -07:00
* Called from IPI handler with interrupts disabled .
*/
2007-07-17 04:03:24 -07:00
static inline void __flush_cpu_slab ( struct kmem_cache * s , int cpu )
2007-05-06 14:49:36 -07:00
{
2009-12-18 16:26:20 -06:00
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab , cpu ) ;
2007-05-06 14:49:36 -07:00
2011-08-09 16:12:27 -05:00
if ( likely ( c ) ) {
if ( c - > page )
flush_slab ( s , c ) ;
unfreeze_partials ( s ) ;
}
2007-05-06 14:49:36 -07:00
}
static void flush_cpu_slab ( void * d )
{
struct kmem_cache * s = d ;
2007-10-16 01:26:05 -07:00
__flush_cpu_slab ( s , smp_processor_id ( ) ) ;
2007-05-06 14:49:36 -07:00
}
2012-03-28 14:42:44 -07:00
static bool has_cpu_slab ( int cpu , void * info )
{
struct kmem_cache * s = info ;
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab , cpu ) ;
2012-05-17 17:03:26 -07:00
return c - > page | | c - > partial ;
2012-03-28 14:42:44 -07:00
}
2007-05-06 14:49:36 -07:00
static void flush_all ( struct kmem_cache * s )
{
2012-03-28 14:42:44 -07:00
on_each_cpu_cond ( has_cpu_slab , flush_cpu_slab , s , 1 , GFP_ATOMIC ) ;
2007-05-06 14:49:36 -07:00
}
2007-10-16 01:26:05 -07:00
/*
* Check if the objects in a per cpu structure fit numa
* locality expectations .
*/
static inline int node_match ( struct kmem_cache_cpu * c , int node )
{
# ifdef CONFIG_NUMA
2010-07-09 14:07:10 -05:00
if ( node ! = NUMA_NO_NODE & & c - > node ! = node )
2007-10-16 01:26:05 -07:00
return 0 ;
# endif
return 1 ;
}
2009-06-10 18:50:32 +03:00
static int count_free ( struct page * page )
{
return page - > objects - page - > inuse ;
}
static unsigned long count_partial ( struct kmem_cache_node * n ,
int ( * get_count ) ( struct page * ) )
{
unsigned long flags ;
unsigned long x = 0 ;
struct page * page ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
list_for_each_entry ( page , & n - > partial , lru )
x + = get_count ( page ) ;
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
return x ;
}
2009-06-11 14:08:48 +04:00
static inline unsigned long node_nr_objs ( struct kmem_cache_node * n )
{
# ifdef CONFIG_SLUB_DEBUG
return atomic_long_read ( & n - > total_objects ) ;
# else
return 0 ;
# endif
}
2009-06-10 18:50:32 +03:00
static noinline void
slab_out_of_memory ( struct kmem_cache * s , gfp_t gfpflags , int nid )
{
int node ;
printk ( KERN_WARNING
" SLUB: Unable to allocate memory on node %d (gfp=0x%x) \n " ,
nid , gfpflags ) ;
printk ( KERN_WARNING " cache: %s, object size: %d, buffer size: %d, "
" default order: %d, min order: %d \n " , s - > name , s - > objsize ,
s - > size , oo_order ( s - > oo ) , oo_order ( s - > min ) ) ;
2009-07-07 00:14:14 -07:00
if ( oo_order ( s - > min ) > get_order ( s - > objsize ) )
printk ( KERN_WARNING " %s debugging increased min order, use "
" slub_debug=O to disable. \n " , s - > name ) ;
2009-06-10 18:50:32 +03:00
for_each_online_node ( node ) {
struct kmem_cache_node * n = get_node ( s , node ) ;
unsigned long nr_slabs ;
unsigned long nr_objs ;
unsigned long nr_free ;
if ( ! n )
continue ;
2009-06-11 14:08:48 +04:00
nr_free = count_partial ( n , count_free ) ;
nr_slabs = node_nr_slabs ( n ) ;
nr_objs = node_nr_objs ( n ) ;
2009-06-10 18:50:32 +03:00
printk ( KERN_WARNING
" node %d: slabs: %ld, objs: %ld, free: %ld \n " ,
node , nr_slabs , nr_objs , nr_free ) ;
}
}
2011-08-09 16:12:26 -05:00
static inline void * new_slab_objects ( struct kmem_cache * s , gfp_t flags ,
int node , struct kmem_cache_cpu * * pc )
{
void * object ;
struct kmem_cache_cpu * c ;
struct page * page = new_slab ( s , flags , node ) ;
if ( page ) {
c = __this_cpu_ptr ( s - > cpu_slab ) ;
if ( c - > page )
flush_slab ( s , c ) ;
/*
* No other reference to the page yet so we can
* muck around with it freely without cmpxchg
*/
object = page - > freelist ;
page - > freelist = NULL ;
stat ( s , ALLOC_SLAB ) ;
c - > node = page_to_nid ( page ) ;
c - > page = page ;
* pc = c ;
} else
object = NULL ;
return object ;
}
2011-11-11 14:07:14 -06:00
/*
* Check the page - > freelist of a page and either transfer the freelist to the per cpu freelist
* or deactivate the page .
*
* The page is still frozen if the return value is not NULL .
*
* If this function returns NULL then the page has been unfrozen .
*/
static inline void * get_freelist ( struct kmem_cache * s , struct page * page )
{
struct page new ;
unsigned long counters ;
void * freelist ;
do {
freelist = page - > freelist ;
counters = page - > counters ;
new . counters = counters ;
VM_BUG_ON ( ! new . frozen ) ;
new . inuse = page - > objects ;
new . frozen = freelist ! = NULL ;
} while ( ! cmpxchg_double_slab ( s , page ,
freelist , counters ,
NULL , new . counters ,
" get_freelist " ) ) ;
return freelist ;
}
2007-05-06 14:49:36 -07:00
/*
2007-05-10 03:15:16 -07:00
* Slow path . The lockless freelist is empty or we need to perform
* debugging duties .
*
* Processing is still very fast if new objects have been freed to the
* regular freelist . In that case we simply take over the regular freelist
* as the lockless freelist and zap the regular freelist .
2007-05-06 14:49:36 -07:00
*
2007-05-10 03:15:16 -07:00
* If that is not working then we fall back to the partial lists . We take the
* first element of the freelist as the object to allocate now and move the
* rest of the freelist to the lockless freelist .
2007-05-06 14:49:36 -07:00
*
2007-05-10 03:15:16 -07:00
* And if we were unable to get a new slab from the partial slab lists then
2008-02-15 23:45:26 -08:00
* we need to allocate a new slab . This is the slowest path since it involves
* a call to the page allocator and the setup of a new slab .
2007-05-06 14:49:36 -07:00
*/
2008-08-19 20:43:25 +03:00
static void * __slab_alloc ( struct kmem_cache * s , gfp_t gfpflags , int node ,
unsigned long addr , struct kmem_cache_cpu * c )
2007-05-06 14:49:36 -07:00
{
void * * object ;
2011-02-25 11:38:54 -06:00
unsigned long flags ;
local_irq_save ( flags ) ;
# ifdef CONFIG_PREEMPT
/*
* We may have been preempted and rescheduled on a different
* cpu before disabling interrupts . Need to reload cpu area
* pointer .
*/
c = this_cpu_ptr ( s - > cpu_slab ) ;
# endif
2007-05-06 14:49:36 -07:00
2011-08-09 16:12:26 -05:00
if ( ! c - > page )
2007-05-06 14:49:36 -07:00
goto new_slab ;
2011-08-09 16:12:27 -05:00
redo :
2011-06-01 12:25:56 -05:00
if ( unlikely ( ! node_match ( c , node ) ) ) {
2011-06-01 12:25:57 -05:00
stat ( s , ALLOC_NODE_MISMATCH ) ;
2011-06-01 12:25:56 -05:00
deactivate_slab ( s , c ) ;
goto new_slab ;
}
2008-02-15 23:45:26 -08:00
2011-12-13 04:57:06 +01:00
/* must check again c->freelist in case of cpu migration or IRQ */
object = c - > freelist ;
if ( object )
goto load_freelist ;
2011-06-01 12:25:58 -05:00
2011-06-01 12:25:52 -05:00
stat ( s , ALLOC_SLOWPATH ) ;
2011-06-01 12:25:58 -05:00
2011-11-11 14:07:14 -06:00
object = get_freelist ( s , c - > page ) ;
2008-02-15 23:45:26 -08:00
2011-08-09 16:12:27 -05:00
if ( ! object ) {
2011-06-01 12:25:58 -05:00
c - > page = NULL ;
stat ( s , DEACTIVATE_BYPASS ) ;
2011-06-01 12:25:56 -05:00
goto new_slab ;
2011-06-01 12:25:58 -05:00
}
2008-02-15 23:45:26 -08:00
2009-12-18 16:26:23 -06:00
stat ( s , ALLOC_REFILL ) ;
2008-02-15 23:45:26 -08:00
2007-05-10 03:15:16 -07:00
load_freelist :
2009-12-18 16:26:22 -06:00
c - > freelist = get_freepointer ( s , object ) ;
2011-02-25 11:38:54 -06:00
c - > tid = next_tid ( c - > tid ) ;
local_irq_restore ( flags ) ;
2007-05-06 14:49:36 -07:00
return object ;
new_slab :
2011-06-01 12:25:52 -05:00
2011-08-09 16:12:27 -05:00
if ( c - > partial ) {
c - > page = c - > partial ;
c - > partial = c - > page - > next ;
c - > node = page_to_nid ( c - > page ) ;
stat ( s , CPU_PARTIAL_ALLOC ) ;
c - > freelist = NULL ;
goto redo ;
2007-05-06 14:49:36 -07:00
}
2011-08-09 16:12:27 -05:00
/* Then do expensive stuff like retrieving pages from the partial lists */
2011-08-09 16:12:26 -05:00
object = get_partial ( s , gfpflags , node , c ) ;
2007-10-16 23:25:51 -07:00
2011-08-09 16:12:26 -05:00
if ( unlikely ( ! object ) ) {
2011-04-15 14:48:14 -05:00
2011-08-09 16:12:26 -05:00
object = new_slab_objects ( s , gfpflags , node , & c ) ;
2011-06-01 12:25:52 -05:00
2011-08-09 16:12:26 -05:00
if ( unlikely ( ! object ) ) {
if ( ! ( gfpflags & __GFP_NOWARN ) & & printk_ratelimit ( ) )
slab_out_of_memory ( s , gfpflags , node ) ;
2011-07-22 09:35:14 -05:00
2011-08-09 16:12:26 -05:00
local_irq_restore ( flags ) ;
return NULL ;
}
2007-05-06 14:49:36 -07:00
}
2011-06-01 12:25:52 -05:00
2011-08-09 16:12:26 -05:00
if ( likely ( ! kmem_cache_debug ( s ) ) )
2007-05-16 22:10:53 -07:00
goto load_freelist ;
2011-06-01 12:25:52 -05:00
2011-08-09 16:12:26 -05:00
/* Only entered in the debug case */
if ( ! alloc_debug_processing ( s , c - > page , object , addr ) )
goto new_slab ; /* Slab failed checks. Next slab needed */
2007-05-10 03:15:16 -07:00
2011-06-01 12:25:52 -05:00
c - > freelist = get_freepointer ( s , object ) ;
2011-05-17 16:29:31 -05:00
deactivate_slab ( s , c ) ;
2010-10-02 11:32:32 +03:00
c - > node = NUMA_NO_NODE ;
2011-05-25 09:47:43 -05:00
local_irq_restore ( flags ) ;
return object ;
2007-05-10 03:15:16 -07:00
}
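The load_freelist step relies on each free object storing the pointer to the next free object inside itself. The following minimal userspace sketch shows that embedded free list, assuming fixed object sizes and a free pointer at offset 0; the `sk_` and `SK_` names are hypothetical, and there is no locking, debugging or per cpu handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_OBJ_SIZE 32u
#define SK_NR_OBJS 4u

struct sketch_cache {
	size_t offset;	/* where the free pointer lives in a free object */
	void *freelist;	/* first free object, or NULL when exhausted */
};

static inline void *sk_get_freepointer(const struct sketch_cache *s, void *object)
{
	void *next;

	memcpy(&next, (char *)object + s->offset, sizeof(next));
	return next;
}

static inline void sk_set_freepointer(const struct sketch_cache *s, void *object,
				      void *fp)
{
	memcpy((char *)object + s->offset, &fp, sizeof(fp));
}

/* Thread every object in buf onto the freelist, lowest address first. */
static inline void sk_init(struct sketch_cache *s, unsigned char *buf)
{
	s->offset = 0;
	s->freelist = NULL;
	for (size_t i = SK_NR_OBJS; i-- > 0; ) {
		void *obj = buf + i * SK_OBJ_SIZE;

		sk_set_freepointer(s, obj, s->freelist);
		s->freelist = obj;
	}
}

/* Pop the head; the real code would enter the slow path on NULL. */
static inline void *sk_alloc(struct sketch_cache *s)
{
	void *object = s->freelist;

	if (!object)
		return NULL;
	s->freelist = sk_get_freepointer(s, object);
	return object;
}
```

Because the link lives in memory that is otherwise unused while the object is free, the list costs no extra space.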
/*
* Inlined fastpath so that allocation functions ( kmalloc , kmem_cache_alloc )
* have the fastpath folded into their functions . So no function call
* overhead for requests that can be satisfied on the fastpath .
*
* The fastpath works by first checking if the lockless freelist can be used .
* If not then __slab_alloc is called for slow processing .
*
* Otherwise we can simply pick the next object from the lockless free list .
*/
2008-01-07 23:20:27 -08:00
static __always_inline void * slab_alloc ( struct kmem_cache * s ,
2008-08-19 20:43:25 +03:00
gfp_t gfpflags , int node , unsigned long addr )
2007-05-10 03:15:16 -07:00
{
void * * object ;
2007-10-16 01:26:05 -07:00
struct kmem_cache_cpu * c ;
2011-02-25 11:38:54 -06:00
unsigned long tid ;
2008-01-07 23:20:30 -08:00
2010-08-20 12:37:16 -05:00
if ( slab_pre_alloc_hook ( s , gfpflags ) )
2008-12-23 19:37:01 +09:00
return NULL ;
2008-01-07 23:20:30 -08:00
2011-02-25 11:38:54 -06:00
redo :
/*
* Must read kmem_cache cpu data via this cpu ptr . Preemption is
* enabled . We may switch back and forth between cpus while
* reading from one cpu area . That does not matter as long
* as we end up on the original cpu again when doing the cmpxchg .
*/
2009-12-18 16:26:20 -06:00
c = __this_cpu_ptr ( s - > cpu_slab ) ;
2011-02-25 11:38:54 -06:00
/*
* The transaction ids are globally unique per cpu and per operation on
* a per cpu queue . Thus they can guarantee that the cmpxchg_double
* occurs on the right processor and that there was no operation on the
* linked list in between .
*/
tid = c - > tid ;
barrier ( ) ;
2009-12-18 16:26:20 -06:00
object = c - > freelist ;
if ( unlikely ( ! object | | ! node_match ( c , node ) ) )
2007-05-10 03:15:16 -07:00
2007-10-16 01:26:05 -07:00
object = __slab_alloc ( s , gfpflags , node , addr , c ) ;
2007-05-10 03:15:16 -07:00
else {
slub: prefetch next freelist pointer in slab_alloc()
Recycling a page is a problem, since freelist link chain is hot on
cpu(s) which freed objects, and possibly very cold on cpu currently
owning slab.
Adding a prefetch of cache line containing the pointer to next object in
slab_alloc() helps a lot in many workloads, in particular on asymmetric
ones (allocations done on one cpu, frees on other cpus). Added cost is
three machine instructions only.
Examples on my dual socket quad core ht machine (Intel CPU E5540
@2.53GHz) (16 logical cpus, 2 memory nodes), 64bit kernel.
Before patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
327577,471718 task-clock # 15,821 CPUs utilized ( +- 0,64% )
28 866 491 context-switches # 0,088 M/sec ( +- 1,80% )
1 506 929 CPU-migrations # 0,005 M/sec ( +- 3,24% )
127 151 page-faults # 0,000 M/sec ( +- 0,16% )
829 399 813 448 cycles # 2,532 GHz ( +- 0,64% )
580 664 691 740 stalled-cycles-frontend # 70,01% frontend cycles idle ( +- 0,71% )
197 431 700 448 stalled-cycles-backend # 23,80% backend cycles idle ( +- 1,03% )
503 548 648 975 instructions # 0,61 insns per cycle
# 1,15 stalled cycles per insn ( +- 0,46% )
95 780 068 471 branches # 292,389 M/sec ( +- 0,48% )
1 426 407 916 branch-misses # 1,49% of all branches ( +- 1,35% )
20,705679994 seconds time elapsed ( +- 0,64% )
After patch :
# perf stat -r 32 hackbench 50 process 4000 >/dev/null
Performance counter stats for 'hackbench 50 process 4000' (32 runs):
286236,542804 task-clock # 15,786 CPUs utilized ( +- 1,32% )
19 703 372 context-switches # 0,069 M/sec ( +- 4,99% )
1 658 249 CPU-migrations # 0,006 M/sec ( +- 6,62% )
126 776 page-faults # 0,000 M/sec ( +- 0,12% )
724 636 593 213 cycles # 2,532 GHz ( +- 1,32% )
499 320 714 837 stalled-cycles-frontend # 68,91% frontend cycles idle ( +- 1,47% )
156 555 126 809 stalled-cycles-backend # 21,60% backend cycles idle ( +- 2,22% )
463 897 792 661 instructions # 0,64 insns per cycle
# 1,08 stalled cycles per insn ( +- 0,94% )
87 717 352 563 branches # 306,451 M/sec ( +- 0,99% )
941 738 280 branch-misses # 1,07% of all branches ( +- 3,35% )
18,132070670 seconds time elapsed ( +- 1,30% )
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Rientjes <rientjes@google.com>
CC: "Alex,Shi" <alex.shi@intel.com>
CC: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-12-16 16:25:34 +01:00
void * next_object = get_freepointer_safe ( s , object ) ;
2011-02-25 11:38:54 -06:00
/*
2011-03-30 22:57:33 -03:00
* The cmpxchg will only match if there was no additional
2011-02-25 11:38:54 -06:00
* operation and if we are on the right processor .
*
* The cmpxchg does the following atomically ( without lock semantics ! )
* 1. Relocate first pointer to the current per cpu area .
* 2. Verify that tid and freelist have not been changed
* 3. If they were not changed replace tid and freelist
*
* Since this is without lock semantics the protection is only against
* code executing on this cpu * not * from access by other cpus .
*/
2011-12-22 11:58:51 -06:00
if ( unlikely ( ! this_cpu_cmpxchg_double (
2011-02-25 11:38:54 -06:00
s - > cpu_slab - > freelist , s - > cpu_slab - > tid ,
object , tid ,
2011-12-16 16:25:34 +01:00
next_object , next_tid ( tid ) ) ) ) {
2011-02-25 11:38:54 -06:00
note_cmpxchg_failure ( " slab_alloc " , s , tid ) ;
goto redo ;
}
2011-12-16 16:25:34 +01:00
prefetch_freepointer ( s , next_object ) ;
2009-12-18 16:26:23 -06:00
stat ( s , ALLOC_FASTPATH ) ;
2007-05-10 03:15:16 -07:00
}
2011-02-25 11:38:54 -06:00
2009-11-25 20:14:48 +02:00
if ( unlikely ( gfpflags & __GFP_ZERO ) & & object )
2009-12-18 16:26:22 -06:00
memset ( object , 0 , s - > objsize ) ;
2007-07-17 04:03:23 -07:00
2010-08-20 12:37:16 -05:00
slab_post_alloc_hook ( s , gfpflags , object ) ;
2008-04-04 00:54:48 +02:00
2007-05-10 03:15:16 -07:00
return object ;
2007-05-06 14:49:36 -07:00
}
void * kmem_cache_alloc ( struct kmem_cache * s , gfp_t gfpflags )
{
2010-07-09 14:07:10 -05:00
void * ret = slab_alloc ( s , gfpflags , NUMA_NO_NODE , _RET_IP_ ) ;
2008-08-19 20:43:26 +03:00
2009-03-23 15:12:24 +02:00
trace_kmem_cache_alloc ( _RET_IP_ , ret , s - > objsize , s - > size , gfpflags ) ;
2008-08-19 20:43:26 +03:00
return ret ;
2007-05-06 14:49:36 -07:00
}
EXPORT_SYMBOL ( kmem_cache_alloc ) ;
2009-12-11 15:45:30 +08:00
# ifdef CONFIG_TRACING
2010-10-21 10:29:19 +01:00
void * kmem_cache_alloc_trace ( struct kmem_cache * s , gfp_t gfpflags , size_t size )
{
void * ret = slab_alloc ( s , gfpflags , NUMA_NO_NODE , _RET_IP_ ) ;
trace_kmalloc ( _RET_IP_ , ret , size , s - > size , gfpflags ) ;
return ret ;
}
EXPORT_SYMBOL ( kmem_cache_alloc_trace ) ;
void * kmalloc_order_trace ( size_t size , gfp_t flags , unsigned int order )
2008-08-19 20:43:26 +03:00
{
2010-10-21 10:29:19 +01:00
void * ret = kmalloc_order ( size , flags , order ) ;
trace_kmalloc ( _RET_IP_ , ret , size , PAGE_SIZE < < order , flags ) ;
return ret ;
2008-08-19 20:43:26 +03:00
}
2010-10-21 10:29:19 +01:00
EXPORT_SYMBOL ( kmalloc_order_trace ) ;
2008-08-19 20:43:26 +03:00
# endif
2007-05-06 14:49:36 -07:00
# ifdef CONFIG_NUMA
void * kmem_cache_alloc_node ( struct kmem_cache * s , gfp_t gfpflags , int node )
{
2008-08-19 20:43:26 +03:00
void * ret = slab_alloc ( s , gfpflags , node , _RET_IP_ ) ;
2009-03-23 15:12:24 +02:00
trace_kmem_cache_alloc_node ( _RET_IP_ , ret ,
s - > objsize , s - > size , gfpflags , node ) ;
2008-08-19 20:43:26 +03:00
return ret ;
2007-05-06 14:49:36 -07:00
}
EXPORT_SYMBOL ( kmem_cache_alloc_node ) ;
2009-12-11 15:45:30 +08:00
# ifdef CONFIG_TRACING
2010-10-21 10:29:19 +01:00
void * kmem_cache_alloc_node_trace ( struct kmem_cache * s ,
2008-08-19 20:43:26 +03:00
gfp_t gfpflags ,
2010-10-21 10:29:19 +01:00
int node , size_t size )
2008-08-19 20:43:26 +03:00
{
2010-10-21 10:29:19 +01:00
void * ret = slab_alloc ( s , gfpflags , node , _RET_IP_ ) ;
trace_kmalloc_node ( _RET_IP_ , ret ,
size , s - > size , gfpflags , node ) ;
return ret ;
2008-08-19 20:43:26 +03:00
}
2010-10-21 10:29:19 +01:00
EXPORT_SYMBOL ( kmem_cache_alloc_node_trace ) ;
2008-08-19 20:43:26 +03:00
# endif
2010-09-29 21:02:15 +09:00
# endif
2008-08-19 20:43:26 +03:00
2007-05-06 14:49:36 -07:00
/*
2007-05-10 03:15:16 -07:00
* Slow path handling . This may still be called frequently since objects
* have a longer lifetime than the cpu slabs in most processing loads .
2007-05-06 14:49:36 -07:00
*
2007-05-10 03:15:16 -07:00
* So we still attempt to reduce cache line usage . Just take the slab
* lock and free the item . If there is no additional partial page
* handling required then we can return immediately .
2007-05-06 14:49:36 -07:00
*/
2007-05-10 03:15:16 -07:00
static void __slab_free ( struct kmem_cache * s , struct page * page ,
2009-12-18 16:26:22 -06:00
void * x , unsigned long addr )
2007-05-06 14:49:36 -07:00
{
void * prior ;
void * * object = ( void * ) x ;
2011-06-01 12:25:52 -05:00
int was_frozen ;
int inuse ;
struct page new ;
unsigned long counters ;
struct kmem_cache_node * n = NULL ;
2011-06-01 12:25:51 -05:00
unsigned long uninitialized_var ( flags ) ;
2007-05-06 14:49:36 -07:00
2011-02-25 11:38:54 -06:00
stat ( s , FREE_SLOWPATH ) ;
2007-05-06 14:49:36 -07:00
2011-04-15 14:48:16 -05:00
if ( kmem_cache_debug ( s ) & & ! free_debug_processing ( s , page , x , addr ) )
2011-06-01 12:25:55 -05:00
return ;
2008-02-15 23:45:26 -08:00
2011-06-01 12:25:52 -05:00
do {
prior = page - > freelist ;
counters = page - > counters ;
set_freepointer ( s , object , prior ) ;
new . counters = counters ;
was_frozen = new . frozen ;
new . inuse - - ;
if ( ( ! new . inuse | | ! prior ) & & ! was_frozen & & ! n ) {
2011-08-09 16:12:27 -05:00
if ( ! kmem_cache_debug ( s ) & & ! prior )
/*
* Slab was on no list before and will be partially empty .
* We can defer the list move and instead freeze it .
*/
new . frozen = 1 ;
else { /* Needs to be taken off a list */
n = get_node ( s , page_to_nid ( page ) ) ;
/*
* Speculatively acquire the list_lock .
* If the cmpxchg does not succeed then we may
* drop the list_lock without any processing .
*
* Otherwise the list_lock will synchronize with
* other processors updating the list of slabs .
*/
spin_lock_irqsave ( & n - > list_lock , flags ) ;
}
2011-06-01 12:25:52 -05:00
}
inuse = new . inuse ;
2007-05-06 14:49:36 -07:00
2011-06-01 12:25:52 -05:00
} while ( ! cmpxchg_double_slab ( s , page ,
prior , counters ,
object , new . counters ,
" __slab_free " ) ) ;
2007-05-06 14:49:36 -07:00
2011-06-01 12:25:52 -05:00
if ( likely ( ! n ) ) {
2011-08-09 16:12:27 -05:00
/*
* If we just froze the page then put it onto the
* per cpu partial list .
*/
2012-02-03 23:34:56 +08:00
if ( new . frozen & & ! was_frozen ) {
2011-08-09 16:12:27 -05:00
put_cpu_partial ( s , page , 1 ) ;
2012-02-03 23:34:56 +08:00
stat ( s , CPU_PARTIAL_FREE ) ;
}
2011-08-09 16:12:27 -05:00
/*
2011-06-01 12:25:52 -05:00
* The list lock was not taken therefore no list
* activity can be necessary .
*/
if ( was_frozen )
stat ( s , FREE_FROZEN ) ;
2011-06-01 12:25:55 -05:00
return ;
2011-06-01 12:25:52 -05:00
}
2007-05-06 14:49:36 -07:00
/*
2011-06-01 12:25:52 -05:00
* was_frozen may have been set after we acquired the list_lock in
* an earlier loop . So we need to check it here again .
2007-05-06 14:49:36 -07:00
*/
2011-06-01 12:25:52 -05:00
if ( was_frozen )
stat ( s , FREE_FROZEN ) ;
else {
if ( unlikely ( ! inuse & & n - > nr_partial > s - > min_partial ) )
goto slab_empty ;
2007-05-06 14:49:36 -07:00
2011-06-01 12:25:52 -05:00
/*
* Objects left in the slab . If it was not on the partial list before
* then add it .
*/
if ( unlikely ( ! prior ) ) {
remove_full ( s , page ) ;
2011-08-24 08:57:52 +08:00
add_partial ( n , page , DEACTIVATE_TO_TAIL ) ;
2011-06-01 12:25:52 -05:00
stat ( s , FREE_ADD_PARTIAL ) ;
}
2008-02-07 17:47:41 -08:00
}
2011-06-01 12:25:55 -05:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2007-05-06 14:49:36 -07:00
return ;
slab_empty :
2008-03-01 13:40:44 -08:00
if ( prior ) {
2007-05-06 14:49:36 -07:00
/*
2011-08-08 11:16:56 -05:00
* Slab on the partial list .
2007-05-06 14:49:36 -07:00
*/
2011-06-01 12:25:50 -05:00
remove_partial ( n , page ) ;
2009-12-18 16:26:23 -06:00
stat ( s , FREE_REMOVE_PARTIAL ) ;
2011-08-08 11:16:56 -05:00
} else
/* Slab must be on the full list */
remove_full ( s , page ) ;
2011-06-01 12:25:52 -05:00
2011-06-01 12:25:55 -05:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2009-12-18 16:26:23 -06:00
stat ( s , FREE_SLAB ) ;
2007-05-06 14:49:36 -07:00
discard_slab ( s , page ) ;
}
2007-05-10 03:15:16 -07:00
/*
* Fastpath with forced inlining to produce a kfree and kmem_cache_free that
* can perform fastpath freeing without additional function calls .
*
* The fastpath is only possible if we are freeing to the current cpu slab
* of this processor . This is typically the case if we have just allocated
* the item before .
*
* If fastpath is not possible then fall back to __slab_free where we deal
* with all sorts of special processing .
*/
2008-01-07 23:20:27 -08:00
static __always_inline void slab_free ( struct kmem_cache * s ,
2008-08-19 20:43:25 +03:00
struct page * page , void * x , unsigned long addr )
2007-05-10 03:15:16 -07:00
{
void * * object = ( void * ) x ;
2007-10-16 01:26:05 -07:00
struct kmem_cache_cpu * c ;
2011-02-25 11:38:54 -06:00
unsigned long tid ;
2008-01-07 23:20:30 -08:00
2010-08-20 12:37:16 -05:00
slab_free_hook ( s , x ) ;
2011-02-25 11:38:54 -06:00
redo :
/*
* Determine the per cpu slab of the current cpu .
* The cpu may change afterward . However that does not matter since
* data is retrieved via this pointer . If we are on the same cpu
* during the cmpxchg then the free will succeed .
*/
2009-12-18 16:26:20 -06:00
c = __this_cpu_ptr ( s - > cpu_slab ) ;
2010-08-20 12:37:16 -05:00
2011-02-25 11:38:54 -06:00
tid = c - > tid ;
barrier ( ) ;
2010-08-20 12:37:16 -05:00
2011-05-17 16:29:31 -05:00
if ( likely ( page = = c - > page ) ) {
2009-12-18 16:26:22 -06:00
set_freepointer ( s , object , c - > freelist ) ;
2011-02-25 11:38:54 -06:00
2011-12-22 11:58:51 -06:00
if ( unlikely ( ! this_cpu_cmpxchg_double (
2011-02-25 11:38:54 -06:00
s - > cpu_slab - > freelist , s - > cpu_slab - > tid ,
c - > freelist , tid ,
object , next_tid ( tid ) ) ) ) {
note_cmpxchg_failure ( " slab_free " , s , tid ) ;
goto redo ;
}
2009-12-18 16:26:23 -06:00
stat ( s , FREE_FASTPATH ) ;
2007-05-10 03:15:16 -07:00
} else
2009-12-18 16:26:22 -06:00
__slab_free ( s , page , x , addr ) ;
2007-05-10 03:15:16 -07:00
}
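
The fastpath comments above describe a compare-and-exchange on the pair (freelist, tid). The following is a minimal userspace sketch of that idea, not the kernel implementation. It assumes C11 atomics, packs a 32-bit free index and a 32-bit transaction id into one 64-bit word, and uses fully atomic operations, which is stronger than the per cpu cmpxchg "without lock semantics" used in the source. All `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_NOBJ 4
#define SKETCH_NIL  UINT32_MAX          /* marks an empty freelist */

struct sketch_cache {
	_Atomic uint64_t head;          /* low 32 bits: first free index, high 32 bits: tid */
	uint32_t next[SKETCH_NOBJ];     /* per object link to the next free index */
};

static uint64_t sketch_pack(uint32_t idx, uint32_t tid)
{
	return ((uint64_t)tid << 32) | idx;
}

static void sketch_init(struct sketch_cache *c)
{
	for (uint32_t i = 0; i < SKETCH_NOBJ; i++)
		c->next[i] = (i + 1 < SKETCH_NOBJ) ? i + 1 : SKETCH_NIL;
	atomic_init(&c->head, sketch_pack(0, 0));
}

/* Pop the first free object; every successful update also advances the tid. */
static uint32_t sketch_alloc(struct sketch_cache *c)
{
	uint64_t old = atomic_load(&c->head);

	for (;;) {
		uint32_t idx = (uint32_t)old;
		uint32_t tid = (uint32_t)(old >> 32);
		uint64_t desired;

		if (idx == SKETCH_NIL)
			return SKETCH_NIL;      /* the kernel would take the slow path here */
		desired = sketch_pack(c->next[idx], tid + 1);
		/* On failure 'old' is reloaded and the loop retries, like 'goto redo'. */
		if (atomic_compare_exchange_weak(&c->head, &old, desired))
			return idx;
	}
}

/* Push an object back; the link is written before the exchange publishes it. */
static void sketch_free(struct sketch_cache *c, uint32_t idx)
{
	uint64_t old = atomic_load(&c->head);

	assert(idx < SKETCH_NOBJ);
	for (;;) {
		uint64_t desired = sketch_pack(idx, (uint32_t)(old >> 32) + 1);

		c->next[idx] = (uint32_t)old;
		if (atomic_compare_exchange_weak(&c->head, &old, desired))
			return;
	}
}

static uint32_t sketch_tid(struct sketch_cache *c)
{
	return (uint32_t)(atomic_load(&c->head) >> 32);
}
```

Because the tid changes on every successful update, an exchange that started from a stale (freelist, tid) pair fails even if the freelist value happens to look the same again, which is the property the source relies on when it retries via `redo`.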
2007-05-06 14:49:36 -07:00
void kmem_cache_free ( struct kmem_cache * s , void * x )
{
2007-05-06 14:49:42 -07:00
struct page * page ;
2007-05-06 14:49:36 -07:00
2007-05-06 14:49:41 -07:00
page = virt_to_head_page ( x ) ;
2007-05-06 14:49:36 -07:00
2008-08-19 20:43:25 +03:00
slab_free ( s , page , x , _RET_IP_ ) ;
2008-08-19 20:43:26 +03:00
2009-03-23 15:12:24 +02:00
trace_kmem_cache_free ( _RET_IP_ , x ) ;
2007-05-06 14:49:36 -07:00
}
EXPORT_SYMBOL ( kmem_cache_free ) ;
/*
2007-05-09 02:32:39 -07:00
* Object placement in a slab is made very easy because we always start at
* offset 0. If we tune the size of the object to the alignment then we can
* get the required alignment by putting one properly sized object after
* another .
2007-05-06 14:49:36 -07:00
*
* Notice that the allocation order determines the sizes of the per cpu
* caches . Each processor has always one slab available for allocations .
* Increasing the allocation order reduces the number of times that slabs
2007-05-09 02:32:39 -07:00
* must be moved on and off the partial lists and is therefore a factor in
2007-05-06 14:49:36 -07:00
* locking overhead .
*/
/*
* Minimum / Maximum order of slab pages . This influences locking overhead
* and slab fragmentation . A higher order reduces the number of partial slabs
* and increases the number of allocations possible without having to
* take the list_lock .
*/
static int slub_min_order ;
2008-04-14 19:11:41 +03:00
static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER ;
2008-04-14 19:11:41 +03:00
static int slub_min_objects ;
2007-05-06 14:49:36 -07:00
/*
* Merge control . If this is set then no merging of slab caches will occur .
2007-05-09 02:32:39 -07:00
* ( Could be removed . This was introduced to pacify the merge skeptics . )
2007-05-06 14:49:36 -07:00
*/
static int slub_nomerge ;
/*
* Calculate the order of allocation given a slab object size .
*
2007-05-09 02:32:39 -07:00
* The order of allocation has significant impact on performance and other
* system components . Generally order 0 allocations should be preferred since
* order 0 does not cause fragmentation in the page allocator . Larger objects
* may be problematic to put into order 0 slabs because there may be too much
2008-04-14 19:13:29 +03:00
* unused space left . We go to a higher order if more than 1 / 16 th of the slab
2007-05-09 02:32:39 -07:00
* would be wasted .
*
* In order to reach satisfactory performance we must ensure that a minimum
* number of objects is in one slab . Otherwise we may generate too much
* activity on the partial lists which requires taking the list_lock . This is
* less a concern for large slabs though which are rarely used .
2007-05-06 14:49:36 -07:00
*
2007-05-09 02:32:39 -07:00
* slub_max_order specifies the order where we begin to stop considering the
* number of objects in a slab as critical . If we reach slub_max_order then
* we try to keep the page order as low as possible . So we accept more waste
* of space in favor of a small page order .
2007-05-06 14:49:36 -07:00
*
2007-05-09 02:32:39 -07:00
* Higher order allocations also allow the placement of more objects in a
* slab and thereby reduce object handling overhead . If the user has
* requested a higher minimum order then we start with that one instead of
* the smallest order which will fit the object .
2007-05-06 14:49:36 -07:00
*/
2007-05-09 02:32:46 -07:00
static inline int slab_order ( int size , int min_objects ,
2011-03-10 15:21:48 +08:00
int max_order , int fract_leftover , int reserved )
2007-05-06 14:49:36 -07:00
{
int order ;
int rem ;
2007-07-17 04:03:20 -07:00
int min_order = slub_min_order ;
2007-05-06 14:49:36 -07:00
2011-03-10 15:21:48 +08:00
if ( order_objects ( min_order , size , reserved ) > MAX_OBJS_PER_PAGE )
2008-10-22 23:00:38 +04:00
return get_order ( size * MAX_OBJS_PER_PAGE ) - 1 ;
2008-04-14 19:11:30 +03:00
2007-07-17 04:03:20 -07:00
for ( order = max ( min_order ,
2007-05-09 02:32:46 -07:00
fls ( min_objects * size - 1 ) - PAGE_SHIFT ) ;
order < = max_order ; order + + ) {
2007-05-06 14:49:36 -07:00
2007-05-09 02:32:46 -07:00
unsigned long slab_size = PAGE_SIZE < < order ;
2007-05-06 14:49:36 -07:00
2011-03-10 15:21:48 +08:00
if ( slab_size < min_objects * size + reserved )
2007-05-06 14:49:36 -07:00
continue ;
2011-03-10 15:21:48 +08:00
rem = ( slab_size - reserved ) % size ;
2007-05-06 14:49:36 -07:00
2007-05-09 02:32:46 -07:00
if ( rem < = slab_size / fract_leftover )
2007-05-06 14:49:36 -07:00
break ;
}
2007-05-09 02:32:39 -07:00
2007-05-06 14:49:36 -07:00
return order ;
}
2011-03-10 15:21:48 +08:00
static inline int calculate_order ( int size , int reserved )
2007-05-09 02:32:46 -07:00
{
int order ;
int min_objects ;
int fraction ;
2009-02-12 18:00:17 +02:00
int max_objects ;
2007-05-09 02:32:46 -07:00
/*
* Attempt to find best configuration for a slab . This
* works by first attempting to generate a layout with
* the best configuration and backing off gradually .
*
* First we reduce the acceptable waste in a slab . Then
* we reduce the minimum objects required in a slab .
*/
min_objects = slub_min_objects ;
2008-04-14 19:11:41 +03:00
if ( ! min_objects )
min_objects = 4 * ( fls ( nr_cpu_ids ) + 1 ) ;
2011-03-10 15:21:48 +08:00
max_objects = order_objects ( slub_max_order , size , reserved ) ;
2009-02-12 18:00:17 +02:00
min_objects = min ( min_objects , max_objects ) ;
2007-05-09 02:32:46 -07:00
while ( min_objects > 1 ) {
2008-04-14 19:13:29 +03:00
fraction = 16 ;
2007-05-09 02:32:46 -07:00
while ( fraction > = 4 ) {
order = slab_order ( size , min_objects ,
2011-03-10 15:21:48 +08:00
slub_max_order , fraction , reserved ) ;
2007-05-09 02:32:46 -07:00
if ( order < = slub_max_order )
return order ;
fraction / = 2 ;
}
2009-08-19 21:44:13 +03:00
min_objects - - ;
2007-05-09 02:32:46 -07:00
}
/*
* We were unable to place multiple objects in a slab . Now
* let's see if we can place a single object there .
*/
2011-03-10 15:21:48 +08:00
order = slab_order ( size , 1 , slub_max_order , 1 , reserved ) ;
2007-05-09 02:32:46 -07:00
if ( order < = slub_max_order )
return order ;
/*
* Doh this slab cannot be placed using slub_max_order .
*/
2011-03-10 15:21:48 +08:00
order = slab_order ( size , 1 , MAX_ORDER , 1 , reserved ) ;
2009-04-23 09:58:22 +03:00
if ( order < MAX_ORDER )
2007-05-09 02:32:46 -07:00
return order ;
return - ENOSYS ;
}
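
The order search above can be tried out in userspace. The sketch below mirrors the loop in slab_order() and the back-off in calculate_order() under simplifying assumptions: 4 KiB pages, no reserved bytes, slub_min_order and slub_min_objects both 0, and neither the MAX_OBJS_PER_PAGE check nor the final MAX_ORDER fallback. The `sketch_*` names and the -1 failure value are hypothetical.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12                    /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* 1-based position of the most significant set bit, 0 for x == 0. */
static int sketch_fls(unsigned int x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Returns an order > max_order if no acceptable order was found. */
static int sketch_slab_order(int size, int min_objects, int max_order,
			     int fract_leftover)
{
	int order;

	assert(size > 0 && min_objects > 0 && fract_leftover > 0);
	order = sketch_fls((unsigned int)(min_objects * size - 1)) - SKETCH_PAGE_SHIFT;
	if (order < 0)
		order = 0;                      /* assumed minimum order of 0 */

	for (; order <= max_order; order++) {
		unsigned long slab_size = SKETCH_PAGE_SIZE << order;

		if (slab_size < (unsigned long)min_objects * size)
			continue;
		/* Accept the order once the leftover is a small enough fraction. */
		if (slab_size % size <= slab_size / fract_leftover)
			break;
	}
	return order;
}

/* Back off first on acceptable waste, then on the minimum object count. */
static int sketch_calculate_order(int size, int nr_cpus, int max_order)
{
	int order;
	int min_objects = 4 * (sketch_fls((unsigned int)nr_cpus) + 1);
	int max_objects = (int)((SKETCH_PAGE_SIZE << max_order) / size);

	if (min_objects > max_objects)
		min_objects = max_objects;

	while (min_objects > 1) {
		for (int fraction = 16; fraction >= 4; fraction /= 2) {
			order = sketch_slab_order(size, min_objects,
						  max_order, fraction);
			if (order <= max_order)
				return order;
		}
		min_objects--;
	}

	/* Last attempt: a single object per slab, any amount of waste. */
	order = sketch_slab_order(size, 1, max_order, 1);
	return order <= max_order ? order : -1;
}
```

For example, with 4 cpus and a maximum order of 3, a 700 byte object starts with min_objects = 16, which needs at least 11200 bytes; order 2 (16384 bytes) leaves 284 bytes unused, well under 1/16th of the slab, so order 2 is chosen.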
2007-05-06 14:49:36 -07:00
/*
2007-05-09 02:32:39 -07:00
* Figure out what the alignment of the objects will be .
2007-05-06 14:49:36 -07:00
*/
static unsigned long calculate_alignment ( unsigned long flags ,
unsigned long align , unsigned long size )
{
/*
2008-02-15 23:45:26 -08:00
* If the user wants hardware cache aligned objects then follow that
* suggestion if the object is sufficiently large .
2007-05-06 14:49:36 -07:00
*
2008-02-15 23:45:26 -08:00
* The hardware cache alignment cannot override the specified
* alignment though . If that is greater then use it .
2007-05-06 14:49:36 -07:00
*/
2008-03-05 14:05:56 -08:00
if ( flags & SLAB_HWCACHE_ALIGN ) {
unsigned long ralign = cache_line_size ( ) ;
while ( size < = ralign / 2 )
ralign / = 2 ;
align = max ( align , ralign ) ;
}
2007-05-06 14:49:36 -07:00
if ( align < ARCH_SLAB_MINALIGN )
2008-03-05 14:05:56 -08:00
align = ARCH_SLAB_MINALIGN ;
2007-05-06 14:49:36 -07:00
return ALIGN ( align , sizeof ( void * ) ) ;
}
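
As a rough illustration of calculate_alignment(), the userspace sketch below repeats the same steps with fixed constants. It assumes a 64 byte cache line and a minimum alignment of 8 bytes; the kernel obtains these from cache_line_size() and ARCH_SLAB_MINALIGN, so both values and all `SKETCH_*` / `sketch_*` names are placeholders.

```c
#include <assert.h>

#define SKETCH_CACHE_LINE 64UL  /* assumption: stands in for cache_line_size() */
#define SKETCH_MINALIGN    8UL  /* assumption: stands in for ARCH_SLAB_MINALIGN */

static unsigned long sketch_calculate_alignment(int hwcache_align,
						unsigned long align,
						unsigned long size)
{
	unsigned long word = sizeof(void *);

	assert(size > 0);       /* size 0 would never leave the halving loop */

	if (hwcache_align) {
		unsigned long ralign = SKETCH_CACHE_LINE;

		/* Small objects only get the largest power of two they fill by more than half. */
		while (size <= ralign / 2)
			ralign /= 2;
		if (ralign > align)
			align = ralign;
	}

	if (align < SKETCH_MINALIGN)
		align = SKETCH_MINALIGN;

	/* Round up to a multiple of the pointer size, like ALIGN(align, sizeof(void *)). */
	return (align + word - 1) & ~(word - 1);
}
```

With these constants a 24 byte object requesting hardware cache alignment ends up 32 byte aligned rather than 64, since two such objects fit into one cache line without either straddling it.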
2008-08-05 09:28:47 +03:00
static void
init_kmem_cache_node ( struct kmem_cache_node * n , struct kmem_cache * s )
2007-05-06 14:49:36 -07:00
{
n - > nr_partial = 0 ;
spin_lock_init ( & n - > list_lock ) ;
INIT_LIST_HEAD ( & n - > partial ) ;
2007-07-17 04:03:32 -07:00
# ifdef CONFIG_SLUB_DEBUG
2008-04-14 18:53:02 +03:00
atomic_long_set ( & n - > nr_slabs , 0 ) ;
2008-09-11 12:25:41 -07:00
atomic_long_set ( & n - > total_objects , 0 ) ;
2007-05-06 14:49:42 -07:00
INIT_LIST_HEAD ( & n - > full ) ;
2007-07-17 04:03:32 -07:00
# endif
2007-05-06 14:49:36 -07:00
}
2010-08-20 12:37:13 -05:00
static inline int alloc_kmem_cache_cpus ( struct kmem_cache * s )
2007-10-16 01:26:08 -07:00
{
2010-08-20 12:37:14 -05:00
BUILD_BUG_ON ( PERCPU_DYNAMIC_EARLY_SIZE <
SLUB_PAGE_SHIFT * sizeof ( struct kmem_cache_cpu ) ) ;
2007-10-16 01:26:08 -07:00
2011-02-25 11:38:54 -06:00
/*
2011-06-02 10:19:41 -04:00
* Must align to double word boundary for the double cmpxchg
* instructions to work ; see __pcpu_double_call_return_bool ( ) .
2011-02-25 11:38:54 -06:00
*/
2011-06-02 10:19:41 -04:00
s - > cpu_slab = __alloc_percpu ( sizeof ( struct kmem_cache_cpu ) ,
2 * sizeof ( void * ) ) ;
2011-02-25 11:38:54 -06:00
if ( ! s - > cpu_slab )
return 0 ;
init_kmem_cache_cpus ( s ) ;
2007-10-16 01:26:08 -07:00
2011-02-25 11:38:54 -06:00
return 1 ;
2007-10-16 01:26:08 -07:00
}
2010-08-20 12:37:15 -05:00
static struct kmem_cache * kmem_cache_node ;
2007-05-06 14:49:36 -07:00
/*
* No kmalloc_node yet so do it by hand . We know that this is the first
* slab on the node for this slabcache . There are no concurrent accesses
* possible .
*
* Note that this function only works on the kmem_cache_node cache
2007-10-16 01:26:08 -07:00
* when allocating for the kmem_cache_node cache . This is used for bootstrapping
* memory on a fresh node that has no slab structures yet .
2007-05-06 14:49:36 -07:00
*/
2010-08-20 12:37:13 -05:00
static void early_kmem_cache_node_alloc ( int node )
2007-05-06 14:49:36 -07:00
{
struct page * page ;
struct kmem_cache_node * n ;
2010-08-20 12:37:15 -05:00
BUG_ON ( kmem_cache_node - > size < sizeof ( struct kmem_cache_node ) ) ;
2007-05-06 14:49:36 -07:00
2010-08-20 12:37:15 -05:00
page = new_slab ( kmem_cache_node , GFP_NOWAIT , node ) ;
2007-05-06 14:49:36 -07:00
BUG_ON ( ! page ) ;
2007-08-22 14:01:57 -07:00
if ( page_to_nid ( page ) ! = node ) {
printk ( KERN_ERR " SLUB: Unable to allocate memory from "
" node %d \n " , node ) ;
printk ( KERN_ERR " SLUB: Allocating a useless per node structure "
" in order to be able to continue \n " ) ;
}
2007-05-06 14:49:36 -07:00
n = page - > freelist ;
BUG_ON ( ! n ) ;
2010-08-20 12:37:15 -05:00
page - > freelist = get_freepointer ( kmem_cache_node , n ) ;
2011-08-09 16:12:24 -05:00
page - > inuse = 1 ;
2011-06-01 12:25:46 -05:00
page - > frozen = 0 ;
2010-08-20 12:37:15 -05:00
kmem_cache_node - > node [ node ] = n ;
2007-07-17 04:03:32 -07:00
# ifdef CONFIG_SLUB_DEBUG
2010-09-29 07:15:01 -05:00
init_object ( kmem_cache_node , n , SLUB_RED_ACTIVE ) ;
2010-08-20 12:37:15 -05:00
init_tracking ( kmem_cache_node , n ) ;
2007-07-17 04:03:32 -07:00
# endif
2010-08-20 12:37:15 -05:00
init_kmem_cache_node ( n , kmem_cache_node ) ;
inc_slabs_node ( kmem_cache_node , node , page - > objects ) ;
2008-02-15 23:45:26 -08:00
2011-08-24 08:57:52 +08:00
add_partial ( n , page , DEACTIVATE_TO_HEAD ) ;
2007-05-06 14:49:36 -07:00
}
static void free_kmem_cache_nodes ( struct kmem_cache * s )
{
int node ;
2007-10-16 01:25:33 -07:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-06 14:49:36 -07:00
struct kmem_cache_node * n = s - > node [ node ] ;
2010-08-20 12:37:15 -05:00
2010-05-21 14:41:35 -07:00
if ( n )
2010-08-20 12:37:15 -05:00
kmem_cache_free ( kmem_cache_node , n ) ;
2007-05-06 14:49:36 -07:00
s - > node [ node ] = NULL ;
}
}
2010-08-20 12:37:13 -05:00
static int init_kmem_cache_nodes ( struct kmem_cache * s )
2007-05-06 14:49:36 -07:00
{
int node ;
2007-10-16 01:25:33 -07:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-06 14:49:36 -07:00
struct kmem_cache_node * n ;
2010-05-21 14:41:35 -07:00
if ( slab_state = = DOWN ) {
2010-08-20 12:37:13 -05:00
early_kmem_cache_node_alloc ( node ) ;
2010-05-21 14:41:35 -07:00
continue ;
}
2010-08-20 12:37:15 -05:00
n = kmem_cache_alloc_node ( kmem_cache_node ,
2010-08-20 12:37:13 -05:00
GFP_KERNEL , node ) ;
2007-05-06 14:49:36 -07:00
2010-05-21 14:41:35 -07:00
if ( ! n ) {
free_kmem_cache_nodes ( s ) ;
return 0 ;
2007-05-06 14:49:36 -07:00
}
2010-05-21 14:41:35 -07:00
2007-05-06 14:49:36 -07:00
s - > node [ node ] = n ;
2008-08-05 09:28:47 +03:00
init_kmem_cache_node ( n , s ) ;
2007-05-06 14:49:36 -07:00
}
return 1 ;
}
2009-02-25 09:16:35 +02:00
static void set_min_partial ( struct kmem_cache * s , unsigned long min )
2009-02-22 17:40:07 -08:00
{
if ( min < MIN_PARTIAL )
min = MIN_PARTIAL ;
else if ( min > MAX_PARTIAL )
min = MAX_PARTIAL ;
s - > min_partial = min ;
}
2007-05-06 14:49:36 -07:00
/*
* calculate_sizes ( ) determines the order and the distribution of data within
* a slab object .
*/
2008-04-14 19:11:41 +03:00
static int calculate_sizes ( struct kmem_cache * s , int forced_order )
2007-05-06 14:49:36 -07:00
{
unsigned long flags = s - > flags ;
unsigned long size = s - > objsize ;
unsigned long align = s - > align ;
2008-04-14 19:11:31 +03:00
int order ;
2007-05-06 14:49:36 -07:00
2008-02-15 23:45:25 -08:00
/*
* Round up object size to the next word boundary . We can only
* place the free pointer at word boundaries and this determines
* the possible location of the free pointer .
*/
size = ALIGN ( size , sizeof ( void * ) ) ;
# ifdef CONFIG_SLUB_DEBUG
2007-05-06 14:49:36 -07:00
/*
* Determine if we can poison the object itself . If the user of
* the slab may touch the object after free or before allocation
* then we should never poison the object itself .
*/
if ( ( flags & SLAB_POISON ) & & ! ( flags & SLAB_DESTROY_BY_RCU ) & &
2007-05-16 22:10:50 -07:00
! s - > ctor )
2007-05-06 14:49:36 -07:00
s - > flags | = __OBJECT_POISON ;
else
s - > flags & = ~ __OBJECT_POISON ;
/*
2007-05-09 02:32:39 -07:00
* If we are Redzoning then check if there is some space between the
2007-05-06 14:49:36 -07:00
* end of the object and the free pointer . If not then add an
2007-05-09 02:32:39 -07:00
* additional word to have some bytes to store Redzone information .
2007-05-06 14:49:36 -07:00
*/
if ( ( flags & SLAB_RED_ZONE ) & & size = = s - > objsize )
size + = sizeof ( void * ) ;
2007-05-09 02:32:44 -07:00
# endif
2007-05-06 14:49:36 -07:00
/*
2007-05-09 02:32:39 -07:00
* With that we have determined the number of bytes in actual use
* by the object . This is the potential offset to the free pointer .
2007-05-06 14:49:36 -07:00
*/
s - > inuse = size ;
if ( ( ( flags & ( SLAB_DESTROY_BY_RCU | SLAB_POISON ) ) | |
2007-05-16 22:10:50 -07:00
s - > ctor ) ) {
2007-05-06 14:49:36 -07:00
/*
* Relocate free pointer after the object if it is not
* permitted to overwrite the first word of the object on
* kmem_cache_free .
*
* This is the case if we do RCU , have a constructor or
* destructor or are poisoning the objects .
*/
s - > offset = size ;
size + = sizeof ( void * ) ;
}
2007-05-23 13:57:31 -07:00
# ifdef CONFIG_SLUB_DEBUG
2007-05-06 14:49:36 -07:00
if ( flags & SLAB_STORE_USER )
/*
* Need to store information about allocs and frees after
* the object .
*/
size + = 2 * sizeof ( struct track ) ;
2007-05-09 02:32:36 -07:00
if ( flags & SLAB_RED_ZONE )
2007-05-06 14:49:36 -07:00
/*
* Add some empty padding so that we can catch
* overwrites from earlier objects rather than let
* tracking information or the free pointer be
2008-12-29 22:14:56 +01:00
* corrupted if a user writes before the start
2007-05-06 14:49:36 -07:00
* of the object .
*/
size + = sizeof ( void * ) ;
2007-05-09 02:32:44 -07:00
# endif
2007-05-09 02:32:39 -07:00
2007-05-06 14:49:36 -07:00
/*
* Determine the alignment based on various parameters that the
2007-05-09 02:32:35 -07:00
* user specified and the dynamic determination of cache line size
* on bootup .
2007-05-06 14:49:36 -07:00
*/
align = calculate_alignment ( flags , align , s - > objsize ) ;
2009-07-30 11:28:11 +08:00
s - > align = align ;
2007-05-06 14:49:36 -07:00
/*
* SLUB stores one object immediately after another beginning from
* offset 0. In order to align the objects we have to simply size
* each object to conform to the alignment .
*/
size = ALIGN ( size , align ) ;
s - > size = size ;
2008-04-14 19:11:41 +03:00
if ( forced_order > = 0 )
order = forced_order ;
else
2011-03-10 15:21:48 +08:00
order = calculate_order ( size , s - > reserved ) ;
2007-05-06 14:49:36 -07:00
2008-04-14 19:11:31 +03:00
if ( order < 0 )
2007-05-06 14:49:36 -07:00
return 0 ;
2008-02-14 14:21:32 -08:00
s - > allocflags = 0 ;
2008-04-14 19:11:31 +03:00
if ( order )
2008-02-14 14:21:32 -08:00
s - > allocflags | = __GFP_COMP ;
if ( s - > flags & SLAB_CACHE_DMA )
s - > allocflags | = SLUB_DMA ;
if ( s - > flags & SLAB_RECLAIM_ACCOUNT )
s - > allocflags | = __GFP_RECLAIMABLE ;
2007-05-06 14:49:36 -07:00
/*
* Determine the number of objects per slab
*/
2011-03-10 15:21:48 +08:00
s - > oo = oo_make ( order , size , s - > reserved ) ;
s - > min = oo_make ( get_order ( size ) , size , s - > reserved ) ;
2008-04-14 19:11:40 +03:00
if ( oo_objects ( s - > oo ) > oo_objects ( s - > max ) )
s - > max = s - > oo ;
2007-05-06 14:49:36 -07:00
2008-04-14 19:11:31 +03:00
return ! ! oo_objects ( s - > oo ) ;
2007-05-06 14:49:36 -07:00
}
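/*
 * Illustrative sketch only, not part of the allocator: a user-space model
 * of the core size arithmetic in calculate_sizes ( ) above. It assumes the
 * object size is first rounded up to a word boundary and it ignores the
 * debug padding (tracking data, red zone). The names align_up and
 * slot_size are hypothetical.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of a; a must be a power of two. */
static size_t align_up(size_t x, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/*
 * Hypothetical helper: per-object footprint inside a slab.
 * relocate_fp models the case where the free pointer may not overwrite
 * the first word of the object and is stored behind the payload instead.
 */
static size_t slot_size(size_t objsize, size_t align, int relocate_fp)
{
	size_t size = align_up(objsize, sizeof(void *));

	if (relocate_fp)
		size += sizeof(void *);	/* room for the relocated free pointer */

	/* Objects are packed back to back, so the slot itself carries the alignment. */
	return align_up(size, align);
}
```

/*
 * For example, a 24 byte object with a relocated free pointer and 64 byte
 * alignment occupies a full 64 byte slot in this model.
 */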
2010-08-20 12:37:13 -05:00
static int kmem_cache_open ( struct kmem_cache * s ,
2007-05-06 14:49:36 -07:00
const char * name , size_t size ,
size_t align , unsigned long flags ,
2008-07-25 19:45:34 -07:00
void ( * ctor ) ( void * ) )
2007-05-06 14:49:36 -07:00
{
memset ( s , 0 , kmem_size ) ;
s - > name = name ;
s - > ctor = ctor ;
s - > objsize = size ;
s - > align = align ;
2007-09-11 15:24:11 -07:00
s - > flags = kmem_cache_flags ( size , flags , name , ctor ) ;
2011-03-10 15:21:48 +08:00
s - > reserved = 0 ;
2007-05-06 14:49:36 -07:00
2011-03-10 15:22:00 +08:00
if ( need_reserve_slab_rcu & & ( s - > flags & SLAB_DESTROY_BY_RCU ) )
s - > reserved = sizeof ( struct rcu_head ) ;
2007-05-06 14:49:36 -07:00
2008-04-14 19:11:41 +03:00
if ( ! calculate_sizes ( s , - 1 ) )
2007-05-06 14:49:36 -07:00
goto error ;
2009-07-27 18:30:35 -07:00
if ( disable_higher_order_debug ) {
/*
* Disable debugging flags that store metadata if the min slab
* order increased .
*/
if ( get_order ( s - > size ) > get_order ( s - > objsize ) ) {
s - > flags & = ~ DEBUG_METADATA_FLAGS ;
s - > offset = 0 ;
if ( ! calculate_sizes ( s , - 1 ) )
goto error ;
}
}
2007-05-06 14:49:36 -07:00
2012-01-12 17:17:33 -08:00
# if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
defined ( CONFIG_HAVE_ALIGNED_STRUCT_PAGE )
2011-06-01 12:25:49 -05:00
if ( system_has_cmpxchg_double ( ) & & ( s - > flags & SLAB_DEBUG_FLAGS ) = = 0 )
/* Enable fast mode */
s - > flags | = __CMPXCHG_DOUBLE ;
# endif
2009-02-22 17:40:07 -08:00
/*
* The larger the object size is , the more pages we want on the partial
* list to avoid pounding the page allocator excessively .
*/
2011-08-09 16:12:27 -05:00
set_min_partial ( s , ilog2 ( s - > size ) / 2 ) ;
/*
* cpu_partial determines the maximum number of objects kept in the
* per cpu partial lists of a processor .
*
* Per cpu partial lists mainly contain slabs that just have one
* object freed . If they are used for allocation then they can be
* filled up again with minimal effort . The slab will never hit the
* per node partial lists and therefore no locking will be required .
*
* This setting also determines
*
* A ) The number of objects from per cpu partial slabs dumped to the
* per node list when we reach the limit .
2011-09-01 11:32:18 +08:00
* B ) The number of objects in cpu partial slabs to extract from the
2011-08-09 16:12:27 -05:00
* per node list when we run out of per cpu objects . We only fetch 50 %
* to keep some capacity around for frees .
*/
2011-11-23 09:24:27 -06:00
if ( kmem_cache_debug ( s ) )
s - > cpu_partial = 0 ;
else if ( s - > size > = PAGE_SIZE )
2011-08-09 16:12:27 -05:00
s - > cpu_partial = 2 ;
else if ( s - > size > = 1024 )
s - > cpu_partial = 6 ;
else if ( s - > size > = 256 )
s - > cpu_partial = 13 ;
else
s - > cpu_partial = 30 ;
2007-05-06 14:49:36 -07:00
s - > refcount = 1 ;
# ifdef CONFIG_NUMA
2008-08-19 08:51:22 -05:00
s - > remote_node_defrag_ratio = 1000 ;
2007-05-06 14:49:36 -07:00
# endif
2010-08-20 12:37:13 -05:00
if ( ! init_kmem_cache_nodes ( s ) )
2007-10-16 01:26:05 -07:00
goto error ;
2007-05-06 14:49:36 -07:00
2010-08-20 12:37:13 -05:00
if ( alloc_kmem_cache_cpus ( s ) )
2007-05-06 14:49:36 -07:00
return 1 ;
2009-12-18 16:26:22 -06:00
2007-10-16 01:26:08 -07:00
free_kmem_cache_nodes ( s ) ;
2007-05-06 14:49:36 -07:00
error :
if ( flags & SLAB_PANIC )
panic ( " Cannot create slab %s size=%lu realsize=%u "
" order=%u offset=%u flags=%lx \n " ,
2008-04-14 19:11:31 +03:00
s - > name , ( unsigned long ) size , s - > size , oo_order ( s - > oo ) ,
2007-05-06 14:49:36 -07:00
s - > offset , flags ) ;
return 0 ;
}
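/*
 * Illustrative sketch only: the partial list heuristics chosen in
 * kmem_cache_open ( ) above, restated as plain user-space functions.
 * SKETCH_PAGE_SIZE (4 KiB) and all function names are assumptions made
 * for the example. Any clamping that set_min_partial ( ) may apply to the
 * value is not modeled here.
 */

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Floor of log2(v) for v > 0. */
static unsigned int ilog2_floor(size_t v)
{
	unsigned int r = 0;

	assert(v != 0);
	while (v >>= 1)
		r++;
	return r;
}

/* Same size thresholds as the if/else chain above. */
static unsigned int pick_cpu_partial(size_t size, int debug)
{
	if (debug)
		return 0;	/* debug caches keep no per cpu partial slabs */
	if (size >= SKETCH_PAGE_SIZE)
		return 2;
	if (size >= 1024)
		return 6;
	if (size >= 256)
		return 13;
	return 30;
}

/* Value handed to set_min_partial ( ): larger objects ask for more slabs. */
static unsigned int min_partial_hint(size_t size)
{
	return ilog2_floor(size) / 2;
}
```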
/*
* Determine the size of a slab object
*/
unsigned int kmem_cache_size ( struct kmem_cache * s )
{
return s - > objsize ;
}
EXPORT_SYMBOL ( kmem_cache_size ) ;
2008-04-25 12:22:43 -07:00
static void list_slab_objects ( struct kmem_cache * s , struct page * page ,
const char * text )
{
# ifdef CONFIG_SLUB_DEBUG
void * addr = page_address ( page ) ;
void * p ;
2010-09-29 21:02:13 +09:00
unsigned long * map = kzalloc ( BITS_TO_LONGS ( page - > objects ) *
sizeof ( long ) , GFP_ATOMIC ) ;
2010-03-24 22:25:47 +01:00
if ( ! map )
return ;
2008-04-25 12:22:43 -07:00
slab_err ( s , page , " %s " , text ) ;
slab_lock ( page ) ;
2011-04-15 14:48:13 -05:00
get_map ( s , page , map ) ;
2008-04-25 12:22:43 -07:00
for_each_object ( p , s , addr , page - > objects ) {
if ( ! test_bit ( slab_index ( p , s , addr ) , map ) ) {
printk ( KERN_ERR " INFO: Object 0x%p @offset=%tu \n " ,
p , p - addr ) ;
print_tracking ( s , p ) ;
}
}
slab_unlock ( page ) ;
2010-03-24 22:25:47 +01:00
kfree ( map ) ;
2008-04-25 12:22:43 -07:00
# endif
}
2007-05-06 14:49:36 -07:00
/*
2008-04-23 12:36:52 -07:00
* Attempt to free all partial slabs on a node .
2011-08-09 16:12:22 -05:00
* This is called from kmem_cache_close ( ) . We must be the last thread
* using the cache and therefore we do not need to lock anymore .
2007-05-06 14:49:36 -07:00
*/
2008-04-23 12:36:52 -07:00
static void free_partial ( struct kmem_cache * s , struct kmem_cache_node * n )
2007-05-06 14:49:36 -07:00
{
struct page * page , * h ;
2008-04-25 12:22:43 -07:00
list_for_each_entry_safe ( page , h , & n - > partial , lru ) {
2007-05-06 14:49:36 -07:00
if ( ! page - > inuse ) {
2011-06-01 12:25:50 -05:00
remove_partial ( n , page ) ;
2007-05-06 14:49:36 -07:00
discard_slab ( s , page ) ;
2008-04-25 12:22:43 -07:00
} else {
list_slab_objects ( s , page ,
" Objects remaining on kmem_cache_close() " ) ;
2008-04-23 12:36:52 -07:00
}
2008-04-25 12:22:43 -07:00
}
2007-05-06 14:49:36 -07:00
}
/*
2007-05-09 02:32:39 -07:00
* Release all resources used by a slab cache .
2007-05-06 14:49:36 -07:00
*/
2007-07-17 04:03:24 -07:00
static inline int kmem_cache_close ( struct kmem_cache * s )
2007-05-06 14:49:36 -07:00
{
int node ;
flush_all ( s ) ;
2009-12-18 16:26:20 -06:00
free_percpu ( s - > cpu_slab ) ;
2007-05-06 14:49:36 -07:00
/* Attempt to free all objects */
2007-10-16 01:25:33 -07:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-06 14:49:36 -07:00
struct kmem_cache_node * n = get_node ( s , node ) ;
2008-04-23 12:36:52 -07:00
free_partial ( s , n ) ;
if ( n - > nr_partial | | slabs_node ( s , node ) )
2007-05-06 14:49:36 -07:00
return 1 ;
}
free_kmem_cache_nodes ( s ) ;
return 0 ;
}
/*
* Close a cache and release the kmem_cache structure
* ( must be used for caches created using kmem_cache_create )
*/
void kmem_cache_destroy ( struct kmem_cache * s )
{
down_write ( & slub_lock ) ;
s - > refcount - - ;
if ( ! s - > refcount ) {
list_del ( & s - > list ) ;
2011-08-09 16:12:22 -05:00
up_write ( & slub_lock ) ;
2008-04-23 22:31:08 +03:00
if ( kmem_cache_close ( s ) ) {
printk ( KERN_ERR " SLUB %s: %s called for cache that "
" still has objects. \n " , s - > name , __func__ ) ;
dump_stack ( ) ;
}
2009-09-03 22:38:59 +03:00
if ( s - > flags & SLAB_DESTROY_BY_RCU )
rcu_barrier ( ) ;
2007-05-06 14:49:36 -07:00
sysfs_slab_remove ( s ) ;
2011-08-09 16:12:22 -05:00
} else
up_write ( & slub_lock ) ;
2007-05-06 14:49:36 -07:00
}
EXPORT_SYMBOL ( kmem_cache_destroy ) ;
/********************************************************************
* Kmalloc subsystem
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
2010-08-20 12:37:15 -05:00
struct kmem_cache * kmalloc_caches [ SLUB_PAGE_SHIFT ] ;
2007-05-06 14:49:36 -07:00
EXPORT_SYMBOL ( kmalloc_caches ) ;
2010-08-20 12:37:15 -05:00
static struct kmem_cache * kmem_cache ;
2010-08-20 12:37:13 -05:00
# ifdef CONFIG_ZONE_DMA
2010-08-20 12:37:15 -05:00
static struct kmem_cache * kmalloc_dma_caches [ SLUB_PAGE_SHIFT ] ;
2010-08-20 12:37:13 -05:00
# endif
2007-05-06 14:49:36 -07:00
static int __init setup_slub_min_order ( char * str )
{
2008-01-07 23:20:27 -08:00
get_option ( & str , & slub_min_order ) ;
2007-05-06 14:49:36 -07:00
return 1 ;
}
__setup ( " slub_min_order= " , setup_slub_min_order ) ;
static int __init setup_slub_max_order ( char * str )
{
2008-01-07 23:20:27 -08:00
get_option ( & str , & slub_max_order ) ;
2009-04-23 09:58:22 +03:00
slub_max_order = min ( slub_max_order , MAX_ORDER - 1 ) ;
2007-05-06 14:49:36 -07:00
return 1 ;
}
__setup ( " slub_max_order= " , setup_slub_max_order ) ;
static int __init setup_slub_min_objects ( char * str )
{
2008-01-07 23:20:27 -08:00
get_option ( & str , & slub_min_objects ) ;
2007-05-06 14:49:36 -07:00
return 1 ;
}
__setup ( " slub_min_objects= " , setup_slub_min_objects ) ;
static int __init setup_slub_nomerge ( char * str )
{
slub_nomerge = 1 ;
return 1 ;
}
__setup ( " slub_nomerge " , setup_slub_nomerge ) ;
2010-08-20 12:37:15 -05:00
static struct kmem_cache * __init create_kmalloc_cache ( const char * name ,
int size , unsigned int flags )
2007-05-06 14:49:36 -07:00
{
2010-08-20 12:37:15 -05:00
struct kmem_cache * s ;
s = kmem_cache_alloc ( kmem_cache , GFP_NOWAIT ) ;
2009-06-10 19:40:04 +03:00
/*
* This function is called with IRQs disabled during early boot on
* a single CPU , so there ' s no need to take slub_lock here .
*/
2010-08-20 12:37:13 -05:00
if ( ! kmem_cache_open ( s , name , size , ARCH_KMALLOC_MINALIGN ,
2008-04-14 19:11:41 +03:00
flags , NULL ) )
2007-05-06 14:49:36 -07:00
goto panic ;
list_add ( & s - > list , & slab_caches ) ;
2010-08-20 12:37:15 -05:00
return s ;
2007-05-06 14:49:36 -07:00
panic :
panic ( " Creation of kmalloc slab %s size=%d failed. \n " , name , size ) ;
2010-08-20 12:37:15 -05:00
return NULL ;
2007-05-06 14:49:36 -07:00
}
2007-07-17 04:03:26 -07:00
/*
* Conversion table mapping small slab sizes / 8 to the index in the
* kmalloc array . This is necessary for sizes < = 192 since we have non power
* of two cache sizes there . The index for larger sizes can be determined using
* fls .
*/
static s8 size_index [ 24 ] = {
3 , /* 8 */
4 , /* 16 */
5 , /* 24 */
5 , /* 32 */
6 , /* 40 */
6 , /* 48 */
6 , /* 56 */
6 , /* 64 */
1 , /* 72 */
1 , /* 80 */
1 , /* 88 */
1 , /* 96 */
7 , /* 104 */
7 , /* 112 */
7 , /* 120 */
7 , /* 128 */
2 , /* 136 */
2 , /* 144 */
2 , /* 152 */
2 , /* 160 */
2 , /* 168 */
2 , /* 176 */
2 , /* 184 */
2 /* 192 */
} ;
2009-08-28 14:28:54 +03:00
static inline int size_index_elem ( size_t bytes )
{
return ( bytes - 1 ) / 8 ;
}
2007-05-06 14:49:36 -07:00
static struct kmem_cache * get_slab ( size_t size , gfp_t flags )
{
2007-07-17 04:03:26 -07:00
int index ;
2007-05-06 14:49:36 -07:00
2007-07-17 04:03:26 -07:00
if ( size < = 192 ) {
if ( ! size )
return ZERO_SIZE_PTR ;
2007-05-06 14:49:36 -07:00
2009-08-28 14:28:54 +03:00
index = size_index [ size_index_elem ( size ) ] ;
2007-10-16 01:24:38 -07:00
} else
2007-07-17 04:03:26 -07:00
index = fls ( size - 1 ) ;
2007-05-06 14:49:36 -07:00
# ifdef CONFIG_ZONE_DMA
2007-07-17 04:03:26 -07:00
if ( unlikely ( ( flags & SLUB_DMA ) ) )
2010-08-20 12:37:15 -05:00
return kmalloc_dma_caches [ index ] ;
2007-07-17 04:03:26 -07:00
2007-05-06 14:49:36 -07:00
# endif
2010-08-20 12:37:15 -05:00
return kmalloc_caches [ index ] ;
2007-05-06 14:49:36 -07:00
}
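/*
 * Illustrative sketch only: a user-space model of the size to index
 * lookup done by get_slab ( ) above, without the DMA and zero size
 * handling. The table is a copy of size_index with its default values
 * (before any boot time patching); fls_sketch is a portable stand-in
 * for the kernel ' s fls ( ). All names ending in _sketch are hypothetical.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Default contents of size_index: entry n covers sizes 8n+1 .. 8n+8. */
static const signed char size_index_sketch[24] = {
	3, 4, 5, 5, 6, 6, 6, 6,		/*   8 ..  64 */
	1, 1, 1, 1, 7, 7, 7, 7,		/*  72 .. 128 */
	2, 2, 2, 2, 2, 2, 2, 2		/* 136 .. 192 */
};

/* Position of the highest set bit, counting from 1; 0 for x == 0. */
static int fls_sketch(unsigned long x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Index into the kmalloc cache array for a non-zero request size. */
static int kmalloc_index_sketch(size_t size)
{
	assert(size != 0);
	if (size <= 192)
		return size_index_sketch[(size - 1) / 8];
	/* Power of two caches: index i serves sizes up to 1 << i. */
	return fls_sketch(size - 1);
}
```

/*
 * Indexes 1 and 2 are the 96 and 192 byte caches, which is why a 97 byte
 * request skips to index 7 (128 bytes) in this model.
 */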
void * __kmalloc ( size_t size , gfp_t flags )
{
2007-10-16 01:24:38 -07:00
struct kmem_cache * s ;
2008-08-19 20:43:26 +03:00
void * ret ;
2007-05-06 14:49:36 -07:00
2009-02-17 12:05:07 -05:00
if ( unlikely ( size > SLUB_MAX_SIZE ) )
2008-02-11 22:47:46 +02:00
return kmalloc_large ( size , flags ) ;
2007-10-16 01:24:38 -07:00
s = get_slab ( size , flags ) ;
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 04:03:22 -07:00
return s ;
2010-07-09 14:07:10 -05:00
ret = slab_alloc ( s , flags , NUMA_NO_NODE , _RET_IP_ ) ;
2008-08-19 20:43:26 +03:00
2009-03-23 15:12:24 +02:00
trace_kmalloc ( _RET_IP_ , ret , size , s - > size , flags ) ;
2008-08-19 20:43:26 +03:00
return ret ;
2007-05-06 14:49:36 -07:00
}
EXPORT_SYMBOL ( __kmalloc ) ;
2010-09-29 21:02:15 +09:00
# ifdef CONFIG_NUMA
2008-03-01 13:56:40 -08:00
static void * kmalloc_large_node ( size_t size , gfp_t flags , int node )
{
2008-11-25 16:55:53 +01:00
struct page * page ;
2009-07-07 10:32:59 +01:00
void * ptr = NULL ;
2008-03-01 13:56:40 -08:00
2008-11-25 16:55:53 +01:00
flags | = __GFP_COMP | __GFP_NOTRACK ;
page = alloc_pages_node ( node , flags , get_order ( size ) ) ;
2008-03-01 13:56:40 -08:00
if ( page )
2009-07-07 10:32:59 +01:00
ptr = page_address ( page ) ;
kmemleak_alloc ( ptr , size , 1 , flags ) ;
return ptr ;
2008-03-01 13:56:40 -08:00
}
2007-05-06 14:49:36 -07:00
void * __kmalloc_node ( size_t size , gfp_t flags , int node )
{
2007-10-16 01:24:38 -07:00
struct kmem_cache * s ;
2008-08-19 20:43:26 +03:00
void * ret ;
2007-05-06 14:49:36 -07:00
2009-02-20 12:15:30 +01:00
if ( unlikely ( size > SLUB_MAX_SIZE ) ) {
2008-08-19 20:43:26 +03:00
ret = kmalloc_large_node ( size , flags , node ) ;
2009-03-23 15:12:24 +02:00
trace_kmalloc_node ( _RET_IP_ , ret ,
size , PAGE_SIZE < < get_order ( size ) ,
flags , node ) ;
2008-08-19 20:43:26 +03:00
return ret ;
}
2007-10-16 01:24:38 -07:00
s = get_slab ( size , flags ) ;
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 04:03:22 -07:00
return s ;
2008-08-19 20:43:26 +03:00
ret = slab_alloc ( s , flags , node , _RET_IP_ ) ;
2009-03-23 15:12:24 +02:00
trace_kmalloc_node ( _RET_IP_ , ret , size , s - > size , flags , node ) ;
2008-08-19 20:43:26 +03:00
return ret ;
2007-05-06 14:49:36 -07:00
}
EXPORT_SYMBOL ( __kmalloc_node ) ;
# endif
size_t ksize ( const void * object )
{
2007-06-08 13:46:49 -07:00
struct page * page ;
2007-05-06 14:49:36 -07:00
2007-10-16 01:24:46 -07:00
if ( unlikely ( object = = ZERO_SIZE_PTR ) )
2007-06-08 13:46:49 -07:00
return 0 ;
2007-12-04 23:45:30 -08:00
page = virt_to_head_page ( object ) ;
2008-05-22 19:22:25 +03:00
if ( unlikely ( ! PageSlab ( page ) ) ) {
WARN_ON ( ! PageCompound ( page ) ) ;
2007-12-04 23:45:30 -08:00
return PAGE_SIZE < < compound_order ( page ) ;
2008-05-22 19:22:25 +03:00
}
2007-05-06 14:49:36 -07:00
2011-02-14 18:35:22 +01:00
return slab_ksize ( page - > slab ) ;
2007-05-06 14:49:36 -07:00
}
2009-02-10 15:21:44 +02:00
EXPORT_SYMBOL ( ksize ) ;
2007-05-06 14:49:36 -07:00
2011-07-07 11:36:37 -07:00
# ifdef CONFIG_SLUB_DEBUG
bool verify_mem_not_deleted ( const void * x )
{
struct page * page ;
void * object = ( void * ) x ;
unsigned long flags ;
bool rv ;
if ( unlikely ( ZERO_OR_NULL_PTR ( x ) ) )
return false ;
local_irq_save ( flags ) ;
page = virt_to_head_page ( x ) ;
if ( unlikely ( ! PageSlab ( page ) ) ) {
/* maybe it was from stack? */
rv = true ;
goto out_unlock ;
}
slab_lock ( page ) ;
if ( on_freelist ( page - > slab , page , object ) ) {
object_err ( page - > slab , page , object , " Object is on free-list " ) ;
rv = false ;
} else {
rv = true ;
}
slab_unlock ( page ) ;
out_unlock :
local_irq_restore ( flags ) ;
return rv ;
}
EXPORT_SYMBOL ( verify_mem_not_deleted ) ;
# endif
2007-05-06 14:49:36 -07:00
void kfree ( const void * x )
{
struct page * page ;
2008-02-07 17:47:41 -08:00
void * object = ( void * ) x ;
2007-05-06 14:49:36 -07:00
2009-03-25 11:05:57 +02:00
trace_kfree ( _RET_IP_ , x ) ;
2007-10-16 01:24:44 -07:00
if ( unlikely ( ZERO_OR_NULL_PTR ( x ) ) )
2007-05-06 14:49:36 -07:00
return ;
2007-05-06 14:49:41 -07:00
page = virt_to_head_page ( x ) ;
2007-10-16 01:24:38 -07:00
if ( unlikely ( ! PageSlab ( page ) ) ) {
2008-05-28 10:32:22 -07:00
BUG_ON ( ! PageCompound ( page ) ) ;
2009-07-07 10:32:59 +01:00
kmemleak_free ( x ) ;
2007-10-16 01:24:38 -07:00
put_page ( page ) ;
return ;
}
2008-08-19 20:43:25 +03:00
slab_free ( page - > slab , page , object , _RET_IP_ ) ;
2007-05-06 14:49:36 -07:00
}
EXPORT_SYMBOL ( kfree ) ;
2007-05-06 14:49:46 -07:00
/*
2007-05-09 02:32:39 -07:00
* kmem_cache_shrink removes empty slabs from the partial lists and sorts
* the remaining slabs by the number of items in use . The slabs with the
* most items in use come first . New allocations will then fill those up
* and thus they can be removed from the partial lists .
*
* The slabs with the least items are placed last . This results in them
* being allocated from last , increasing the chance that the last objects
* in them are freed .
2007-05-06 14:49:46 -07:00
*/
int kmem_cache_shrink ( struct kmem_cache * s )
{
int node ;
int i ;
struct kmem_cache_node * n ;
struct page * page ;
struct page * t ;
2008-04-14 19:11:40 +03:00
int objects = oo_objects ( s - > max ) ;
2007-05-06 14:49:46 -07:00
struct list_head * slabs_by_inuse =
2008-04-14 19:11:31 +03:00
kmalloc ( sizeof ( struct list_head ) * objects , GFP_KERNEL ) ;
2007-05-06 14:49:46 -07:00
unsigned long flags ;
if ( ! slabs_by_inuse )
return - ENOMEM ;
flush_all ( s ) ;
2007-10-16 01:25:33 -07:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-06 14:49:46 -07:00
n = get_node ( s , node ) ;
if ( ! n - > nr_partial )
continue ;
2008-04-14 19:11:31 +03:00
for ( i = 0 ; i < objects ; i + + )
2007-05-06 14:49:46 -07:00
INIT_LIST_HEAD ( slabs_by_inuse + i ) ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
/*
2007-05-09 02:32:39 -07:00
* Build lists indexed by the items in use in each slab .
2007-05-06 14:49:46 -07:00
*
2007-05-09 02:32:39 -07:00
* Note that concurrent frees may occur while we hold the
* list_lock . page - > inuse here is the upper limit .
2007-05-06 14:49:46 -07:00
*/
list_for_each_entry_safe ( page , t , & n - > partial , lru ) {
2011-08-09 16:12:22 -05:00
list_move ( & page - > lru , slabs_by_inuse + page - > inuse ) ;
if ( ! page - > inuse )
n - > nr_partial - - ;
2007-05-06 14:49:46 -07:00
}
/*
2007-05-09 02:32:39 -07:00
* Rebuild the partial list with the slabs filled up most
* first and the least used slabs at the end .
2007-05-06 14:49:46 -07:00
*/
2011-08-09 16:12:22 -05:00
for ( i = objects - 1 ; i > 0 ; i - - )
2007-05-06 14:49:46 -07:00
list_splice ( slabs_by_inuse + i , n - > partial . prev ) ;
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
2011-08-09 16:12:22 -05:00
/* Release empty slabs */
list_for_each_entry_safe ( page , t , slabs_by_inuse , lru )
discard_slab ( s , page ) ;
2007-05-06 14:49:46 -07:00
}
kfree ( slabs_by_inuse ) ;
return 0 ;
}
EXPORT_SYMBOL ( kmem_cache_shrink ) ;
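/*
 * Illustrative sketch only: the reordering performed by
 * kmem_cache_shrink ( ) above, modeled as a counting sort over plain
 * integers instead of list heads. Each input value stands for the inuse
 * count of one partial slab. shrink_order and MAX_OBJECTS are
 * hypothetical names; locking and the actual freeing of empty slabs
 * are left out.
 */

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJECTS 64	/* assumption: upper bound on objects per slab */

/*
 * Write the non-empty inuse counts to out[], fullest slabs first, and
 * return how many were kept. The remaining n - kept slabs were empty and
 * would be discarded. Every inuse[i] must be in [0, objects).
 */
static size_t shrink_order(const int *inuse, size_t n, int objects, int *out)
{
	size_t count[MAX_OBJECTS] = { 0 };
	size_t kept = 0;
	size_t i;
	int b;

	assert(objects > 0 && objects <= MAX_OBJECTS);

	/* One bucket per inuse value, like slabs_by_inuse + page->inuse. */
	for (i = 0; i < n; i++) {
		assert(inuse[i] >= 0 && inuse[i] < objects);
		count[inuse[i]]++;
	}

	/* Rebuild from the fullest bucket down; bucket 0 is skipped. */
	for (b = objects - 1; b > 0; b--)
		for (i = 0; i < count[b]; i++)
			out[kept++] = b;

	return kept;
}
```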
2010-10-06 16:58:16 +03:00
# if defined(CONFIG_MEMORY_HOTPLUG)
2007-10-21 16:41:37 -07:00
static int slab_mem_going_offline_callback ( void * arg )
{
struct kmem_cache * s ;
down_read ( & slub_lock ) ;
list_for_each_entry ( s , & slab_caches , list )
kmem_cache_shrink ( s ) ;
up_read ( & slub_lock ) ;
return 0 ;
}
static void slab_mem_offline_callback ( void * arg )
{
struct kmem_cache_node * n ;
struct kmem_cache * s ;
struct memory_notify * marg = arg ;
int offline_node ;
offline_node = marg - > status_change_nid ;
/*
* If the node still has available memory , we still need its
* kmem_cache_node .
*/
if ( offline_node < 0 )
return ;
down_read ( & slub_lock ) ;
list_for_each_entry ( s , & slab_caches , list ) {
n = get_node ( s , offline_node ) ;
if ( n ) {
/*
* if n - > nr_slabs > 0 , slabs still exist on the node
* that is going down . We were unable to free them ,
2009-12-18 15:40:42 -05:00
* and offline_pages ( ) function shouldn ' t call this
2007-10-21 16:41:37 -07:00
* callback . So , we must fail .
*/
2008-04-14 18:53:02 +03:00
BUG_ON ( slabs_node ( s , offline_node ) ) ;
2007-10-21 16:41:37 -07:00
s - > node [ offline_node ] = NULL ;
2010-08-25 14:51:14 -05:00
kmem_cache_free ( kmem_cache_node , n ) ;
2007-10-21 16:41:37 -07:00
}
}
up_read ( & slub_lock ) ;
}
static int slab_mem_going_online_callback ( void * arg )
{
struct kmem_cache_node * n ;
struct kmem_cache * s ;
struct memory_notify * marg = arg ;
int nid = marg - > status_change_nid ;
int ret = 0 ;
/*
* If the node ' s memory is already available , then kmem_cache_node is
* already created . Nothing to do .
*/
if ( nid < 0 )
return 0 ;
/*
2008-04-29 16:11:12 -07:00
* We are bringing a node online . No memory is available yet . We must
2007-10-21 16:41:37 -07:00
* allocate a kmem_cache_node structure in order to bring the node
* online .
*/
down_read ( & slub_lock ) ;
list_for_each_entry ( s , & slab_caches , list ) {
/*
* XXX : kmem_cache_alloc_node will fall back to other nodes
* since memory is not yet available from the node that
* is brought up .
*/
2010-08-25 14:51:14 -05:00
n = kmem_cache_alloc ( kmem_cache_node , GFP_KERNEL ) ;
2007-10-21 16:41:37 -07:00
if ( ! n ) {
ret = - ENOMEM ;
goto out ;
}
2008-08-05 09:28:47 +03:00
init_kmem_cache_node ( n , s ) ;
2007-10-21 16:41:37 -07:00
s - > node [ nid ] = n ;
}
out :
up_read ( & slub_lock ) ;
return ret ;
}
static int slab_memory_callback ( struct notifier_block * self ,
unsigned long action , void * arg )
{
int ret = 0 ;
switch ( action ) {
case MEM_GOING_ONLINE :
ret = slab_mem_going_online_callback ( arg ) ;
break ;
case MEM_GOING_OFFLINE :
ret = slab_mem_going_offline_callback ( arg ) ;
break ;
case MEM_OFFLINE :
case MEM_CANCEL_ONLINE :
slab_mem_offline_callback ( arg ) ;
break ;
case MEM_ONLINE :
case MEM_CANCEL_OFFLINE :
break ;
}
2008-12-01 13:13:48 -08:00
if ( ret )
ret = notifier_from_errno ( ret ) ;
else
ret = NOTIFY_OK ;
2007-10-21 16:41:37 -07:00
return ret ;
}
# endif /* CONFIG_MEMORY_HOTPLUG */
2007-05-06 14:49:36 -07:00
/********************************************************************
* Basic setup of slabs
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * */
2010-08-20 12:37:15 -05:00
/*
* Used for early kmem_cache structures that were allocated using
* the page allocator
*/
static void __init kmem_cache_bootstrap_fixup ( struct kmem_cache * s )
{
int node ;
list_add ( & s - > list , & slab_caches ) ;
s - > refcount = - 1 ;
for_each_node_state ( node , N_NORMAL_MEMORY ) {
struct kmem_cache_node * n = get_node ( s , node ) ;
struct page * p ;
if ( n ) {
list_for_each_entry ( p , & n - > partial , lru )
p - > slab = s ;
2011-04-12 15:22:26 +08:00
# ifdef CONFIG_SLUB_DEBUG
2010-08-20 12:37:15 -05:00
list_for_each_entry ( p , & n - > full , lru )
p - > slab = s ;
# endif
}
}
}
2007-05-06 14:49:36 -07:00
void __init kmem_cache_init ( void )
{
int i ;
2007-06-16 10:16:13 -07:00
int caches = 0 ;
2010-08-20 12:37:15 -05:00
struct kmem_cache * temp_kmem_cache ;
int order ;
struct kmem_cache * temp_kmem_cache_node ;
unsigned long kmalloc_size ;
2012-01-10 15:07:32 -08:00
if ( debug_guardpage_minorder ( ) )
slub_max_order = 0 ;
2010-08-20 12:37:15 -05:00
kmem_size = offsetof ( struct kmem_cache , node ) +
nr_node_ids * sizeof ( struct kmem_cache_node * ) ;
/* Allocate two kmem_caches from the page allocator */
kmalloc_size = ALIGN ( kmem_size , cache_line_size ( ) ) ;
order = get_order ( 2 * kmalloc_size ) ;
kmem_cache = ( void * ) __get_free_pages ( GFP_NOWAIT , order ) ;
2007-05-06 14:49:36 -07:00
/*
* Must first have the slab cache available for the allocations of the
2007-05-09 02:32:39 -07:00
* struct kmem_cache_node ' s . There is special bootstrap code in
2007-05-06 14:49:36 -07:00
* kmem_cache_open for slab_state = = DOWN .
*/
2010-08-20 12:37:15 -05:00
kmem_cache_node = ( void * ) kmem_cache + kmalloc_size ;
kmem_cache_open ( kmem_cache_node , " kmem_cache_node " ,
sizeof ( struct kmem_cache_node ) ,
0 , SLAB_HWCACHE_ALIGN | SLAB_PANIC , NULL ) ;
2007-10-21 16:41:37 -07:00
2008-04-29 01:00:41 -07:00
hotplug_memory_notifier ( slab_memory_callback , SLAB_CALLBACK_PRI ) ;
2007-05-06 14:49:36 -07:00
/* Able to allocate the per node structures */
slab_state = PARTIAL ;
2010-08-20 12:37:15 -05:00
temp_kmem_cache = kmem_cache ;
kmem_cache_open ( kmem_cache , " kmem_cache " , kmem_size ,
0 , SLAB_HWCACHE_ALIGN | SLAB_PANIC , NULL ) ;
kmem_cache = kmem_cache_alloc ( kmem_cache , GFP_NOWAIT ) ;
memcpy ( kmem_cache , temp_kmem_cache , kmem_size ) ;
2007-05-06 14:49:36 -07:00
2010-08-20 12:37:15 -05:00
/*
* Allocate kmem_cache_node properly from the kmem_cache slab .
* kmem_cache_node is separately allocated so no need to
* update any list pointers .
*/
temp_kmem_cache_node = kmem_cache_node ;
2007-05-06 14:49:36 -07:00
2010-08-20 12:37:15 -05:00
kmem_cache_node = kmem_cache_alloc ( kmem_cache , GFP_NOWAIT ) ;
memcpy ( kmem_cache_node , temp_kmem_cache_node , kmem_size ) ;
kmem_cache_bootstrap_fixup ( kmem_cache_node ) ;
caches + + ;
kmem_cache_bootstrap_fixup ( kmem_cache ) ;
caches + + ;
/* Free temporary boot structure */
free_pages ( ( unsigned long ) temp_kmem_cache , order ) ;
/* Now we can use the kmem_cache to allocate kmalloc slabs */
2007-07-17 04:03:26 -07:00
/*
* Patch up the size_index table if we have strange large alignment
* requirements for the kmalloc array . This is only the case for
2008-02-15 23:45:26 -08:00
* MIPS it seems . The standard arches will not generate any code here .
2007-07-17 04:03:26 -07:00
*
* Largest permitted alignment is 256 bytes due to the way we
* handle the index determination for the smaller caches .
*
* Make sure that nothing crazy happens if someone starts tinkering
* around with ARCH_KMALLOC_MINALIGN
*/
BUILD_BUG_ON ( KMALLOC_MIN_SIZE > 256 | |
( KMALLOC_MIN_SIZE & ( KMALLOC_MIN_SIZE - 1 ) ) ) ;
2009-08-28 14:28:54 +03:00
for ( i = 8 ; i < KMALLOC_MIN_SIZE ; i + = 8 ) {
int elem = size_index_elem ( i ) ;
if ( elem > = ARRAY_SIZE ( size_index ) )
break ;
size_index [ elem ] = KMALLOC_SHIFT_LOW ;
}
2007-07-17 04:03:26 -07:00
2009-08-28 14:28:54 +03:00
if ( KMALLOC_MIN_SIZE = = 64 ) {
/*
* The 96 byte size cache is not used if the alignment
* is 64 byte .
*/
for ( i = 64 + 8 ; i < = 96 ; i + = 8 )
size_index [ size_index_elem ( i ) ] = 7 ;
} else if ( KMALLOC_MIN_SIZE = = 128 ) {
2008-07-03 09:14:26 -05:00
/*
* The 192 byte sized cache is not used if the alignment
* is 128 byte . Redirect kmalloc to use the 256 byte cache
* instead .
*/
for ( i = 128 + 8 ; i < = 192 ; i + = 8 )
2009-08-28 14:28:54 +03:00
size_index [ size_index_elem ( i ) ] = 8 ;
2008-07-03 09:14:26 -05:00
}
2010-08-20 12:37:15 -05:00
/* Caches whose sizes are not powers of two */
if ( KMALLOC_MIN_SIZE < = 32 ) {
kmalloc_caches [ 1 ] = create_kmalloc_cache ( " kmalloc-96 " , 96 , 0 ) ;
caches + + ;
}
if ( KMALLOC_MIN_SIZE < = 64 ) {
kmalloc_caches [ 2 ] = create_kmalloc_cache ( " kmalloc-192 " , 192 , 0 ) ;
caches + + ;
}
for ( i = KMALLOC_SHIFT_LOW ; i < SLUB_PAGE_SHIFT ; i + + ) {
kmalloc_caches [ i ] = create_kmalloc_cache ( " kmalloc " , 1 < < i , 0 ) ;
caches + + ;
}
2007-05-06 14:49:36 -07:00
slab_state = UP ;
/* Provide the correct kmalloc names now that the caches are up */
2010-09-14 23:21:12 +03:00
if ( KMALLOC_MIN_SIZE < = 32 ) {
kmalloc_caches [ 1 ] - > name = kstrdup ( kmalloc_caches [ 1 ] - > name , GFP_NOWAIT ) ;
BUG_ON ( ! kmalloc_caches [ 1 ] - > name ) ;
}
if ( KMALLOC_MIN_SIZE < = 64 ) {
kmalloc_caches [ 2 ] - > name = kstrdup ( kmalloc_caches [ 2 ] - > name , GFP_NOWAIT ) ;
BUG_ON ( ! kmalloc_caches [ 2 ] - > name ) ;
}
2010-07-09 14:07:12 -05:00
for ( i = KMALLOC_SHIFT_LOW ; i < SLUB_PAGE_SHIFT ; i + + ) {
char * s = kasprintf ( GFP_NOWAIT , " kmalloc-%d " , 1 < < i ) ;
BUG_ON ( ! s ) ;
2010-08-20 12:37:15 -05:00
kmalloc_caches [ i ] - > name = s ;
2010-07-09 14:07:12 -05:00
}
2007-05-06 14:49:36 -07:00
# ifdef CONFIG_SMP
register_cpu_notifier ( & slab_notifier ) ;
2009-12-18 16:26:20 -06:00
# endif
2007-05-06 14:49:36 -07:00
2010-08-20 12:37:13 -05:00
# ifdef CONFIG_ZONE_DMA
2010-08-20 12:37:15 -05:00
for ( i = 0 ; i < SLUB_PAGE_SHIFT ; i + + ) {
struct kmem_cache * s = kmalloc_caches [ i ] ;
2010-08-20 12:37:13 -05:00
2010-08-20 12:37:15 -05:00
if ( s & & s - > size ) {
2010-08-20 12:37:13 -05:00
char * name = kasprintf ( GFP_NOWAIT ,
" dma-kmalloc-%d " , s - > objsize ) ;
BUG_ON ( ! name ) ;
2010-08-20 12:37:15 -05:00
kmalloc_dma_caches [ i ] = create_kmalloc_cache ( name ,
s - > objsize , SLAB_CACHE_DMA ) ;
2010-08-20 12:37:13 -05:00
}
}
# endif
2008-02-05 17:57:39 -08:00
printk ( KERN_INFO
" SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d, "
2007-06-16 10:16:13 -07:00
" CPUs=%d, Nodes=%d \n " ,
caches , cache_line_size ( ) ,
2007-05-06 14:49:36 -07:00
slub_min_order , slub_max_order , slub_min_objects ,
nr_cpu_ids , nr_node_ids ) ;
}
2009-06-12 14:03:06 +03:00
void __init kmem_cache_init_late ( void )
{
}
2007-05-06 14:49:36 -07:00
/*
* Find a mergeable slab cache
*/
static int slab_unmergeable ( struct kmem_cache * s )
{
if ( slub_nomerge | | ( s - > flags & SLUB_NEVER_MERGE ) )
return 1 ;
2007-05-16 22:10:50 -07:00
if ( s - > ctor )
2007-05-06 14:49:36 -07:00
return 1 ;
2007-05-31 00:40:51 -07:00
/*
* We may have set a slab to be unmergeable during bootstrap .
*/
if ( s - > refcount < 0 )
return 1 ;
2007-05-06 14:49:36 -07:00
return 0 ;
}
static struct kmem_cache * find_mergeable ( size_t size ,
2007-09-11 15:24:11 -07:00
size_t align , unsigned long flags , const char * name ,
2008-07-25 19:45:34 -07:00
void ( * ctor ) ( void * ) )
2007-05-06 14:49:36 -07:00
{
2007-07-17 04:03:19 -07:00
struct kmem_cache * s ;
2007-05-06 14:49:36 -07:00
if ( slub_nomerge | | ( flags & SLUB_NEVER_MERGE ) )
return NULL ;
2007-05-16 22:10:50 -07:00
if ( ctor )
2007-05-06 14:49:36 -07:00
return NULL ;
size = ALIGN ( size , sizeof ( void * ) ) ;
align = calculate_alignment ( flags , align , size ) ;
size = ALIGN ( size , align ) ;
2007-09-11 15:24:11 -07:00
flags = kmem_cache_flags ( size , flags , name , NULL ) ;
2007-05-06 14:49:36 -07:00
2007-07-17 04:03:19 -07:00
list_for_each_entry ( s , & slab_caches , list ) {
2007-05-06 14:49:36 -07:00
if ( slab_unmergeable ( s ) )
continue ;
if ( size > s - > size )
continue ;
2007-09-11 15:24:11 -07:00
if ( ( flags & SLUB_MERGE_SAME ) ! = ( s - > flags & SLUB_MERGE_SAME ) )
2007-05-06 14:49:36 -07:00
continue ;
/*
* Check if alignment is compatible .
* Courtesy of Adrian Drzewiecki
*/
2008-01-07 23:20:27 -08:00
if ( ( s - > size & ~ ( align - 1 ) ) ! = s - > size )
2007-05-06 14:49:36 -07:00
continue ;
if ( s - > size - size > = sizeof ( void * ) )
continue ;
return s ;
}
return NULL ;
}
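/*
 * Illustrative sketch only: the three size checks that find_mergeable ( )
 * above applies to each candidate cache, as a standalone predicate. The
 * flag comparison and the slab_unmergeable ( ) test are deliberately
 * omitted. merge_fits is a hypothetical name.
 */

```c
#include <assert.h>
#include <stddef.h>

/*
 * want:  requested object size, already rounded up and aligned
 * align: requested alignment (power of two)
 * have:  slot size of the existing cache
 * Returns 1 if the existing cache could serve the request.
 */
static int merge_fits(size_t want, size_t align, size_t have)
{
	assert(align != 0 && (align & (align - 1)) == 0);

	if (want > have)
		return 0;	/* existing slots are too small */
	if ((have & ~(align - 1)) != have)
		return 0;	/* slot size is not a multiple of the alignment */
	if (have - want >= sizeof(void *))
		return 0;	/* a word or more would be wasted per object */
	return 1;
}
```

/*
 * The second check works because objects start at offset 0 and follow
 * each other directly, so every object is aligned exactly when the slot
 * size is a multiple of the alignment.
 */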
struct kmem_cache * kmem_cache_create ( const char * name , size_t size ,
2008-07-25 19:45:34 -07:00
size_t align , unsigned long flags , void ( * ctor ) ( void * ) )
2007-05-06 14:49:36 -07:00
{
struct kmem_cache * s ;
2010-09-14 23:21:12 +03:00
char * n ;
2007-05-06 14:49:36 -07:00
2009-09-21 17:02:30 -07:00
if ( WARN_ON ( ! name ) )
return NULL ;
2007-05-06 14:49:36 -07:00
down_write ( & slub_lock ) ;
2007-09-11 15:24:11 -07:00
s = find_mergeable ( size , align , flags , name , ctor ) ;
2007-05-06 14:49:36 -07:00
if ( s ) {
s - > refcount + + ;
/*
* Adjust the object sizes so that we clear
* the complete object on kzalloc .
*/
s - > objsize = max ( s - > objsize , ( int ) size ) ;
s - > inuse = max_t ( int , s - > inuse , ALIGN ( size , sizeof ( void * ) ) ) ;
2008-02-15 23:45:26 -08:00
2008-12-17 22:09:46 -08:00
if ( sysfs_slab_alias ( s , name ) ) {
s - > refcount - - ;
2007-05-06 14:49:36 -07:00
goto err ;
2008-12-17 22:09:46 -08:00
}
2010-07-19 11:39:11 -05:00
up_write ( & slub_lock ) ;
2007-07-17 04:03:31 -07:00
return s ;
}
2008-02-15 23:45:26 -08:00
2010-09-14 23:21:12 +03:00
n = kstrdup ( name , GFP_KERNEL ) ;
if ( ! n )
goto err ;
2007-07-17 04:03:31 -07:00
s = kmalloc ( kmem_size , GFP_KERNEL ) ;
if ( s ) {
2010-09-14 23:21:12 +03:00
if ( kmem_cache_open ( s , n ,
2007-05-16 22:10:50 -07:00
size , align , flags , ctor ) ) {
2007-05-06 14:49:36 -07:00
list_add ( & s - > list , & slab_caches ) ;
2012-01-17 09:27:31 -06:00
up_write ( & slub_lock ) ;
2008-12-17 22:09:46 -08:00
if ( sysfs_slab_add ( s ) ) {
2012-01-17 09:27:31 -06:00
down_write ( & slub_lock ) ;
2008-12-17 22:09:46 -08:00
list_del ( & s - > list ) ;
2010-09-14 23:21:12 +03:00
kfree ( n ) ;
2008-12-17 22:09:46 -08:00
kfree ( s ) ;
2007-07-17 04:03:31 -07:00
goto err ;
2008-12-17 22:09:46 -08:00
}
2007-07-17 04:03:31 -07:00
return s ;
}
2010-09-14 23:21:12 +03:00
kfree ( n ) ;
2007-07-17 04:03:31 -07:00
kfree ( s ) ;
2007-05-06 14:49:36 -07:00
}
2010-10-28 13:50:37 +04:00
err :
2007-05-06 14:49:36 -07:00
up_write ( & slub_lock ) ;
if ( flags & SLAB_PANIC )
panic ( " Cannot create slabcache %s \n " , name ) ;
else
s = NULL ;
return s ;
}
EXPORT_SYMBOL ( kmem_cache_create ) ;
# ifdef CONFIG_SMP
/*
2007-05-09 02:32:39 -07:00
* Use the cpu notifier to ensure that the cpu slabs are flushed when
* necessary .
2007-05-06 14:49:36 -07:00
*/
static int __cpuinit slab_cpuup_callback ( struct notifier_block * nfb ,
unsigned long action , void * hcpu )
{
long cpu = ( long ) hcpu ;
2007-07-17 04:03:19 -07:00
struct kmem_cache * s ;
unsigned long flags ;
2007-05-06 14:49:36 -07:00
switch ( action ) {
case CPU_UP_CANCELED :
2007-05-09 02:35:10 -07:00
case CPU_UP_CANCELED_FROZEN :
2007-05-06 14:49:36 -07:00
case CPU_DEAD :
2007-05-09 02:35:10 -07:00
case CPU_DEAD_FROZEN :
2007-07-17 04:03:19 -07:00
down_read ( & slub_lock ) ;
list_for_each_entry ( s , & slab_caches , list ) {
local_irq_save ( flags ) ;
__flush_cpu_slab ( s , cpu ) ;
local_irq_restore ( flags ) ;
}
up_read ( & slub_lock ) ;
2007-05-06 14:49:36 -07:00
break ;
default :
break ;
}
return NOTIFY_OK ;
}
2008-01-07 23:20:27 -08:00
static struct notifier_block __cpuinitdata slab_notifier = {
2008-02-05 17:57:39 -08:00
. notifier_call = slab_cpuup_callback
2008-01-07 23:20:27 -08:00
} ;
2007-05-06 14:49:36 -07:00
# endif
2008-08-19 20:43:25 +03:00
void * __kmalloc_track_caller ( size_t size , gfp_t gfpflags , unsigned long caller )
2007-05-06 14:49:36 -07:00
{
2007-10-16 01:24:38 -07:00
struct kmem_cache * s ;
2008-08-24 20:49:35 +03:00
void * ret ;
2007-10-16 01:24:38 -07:00
2009-02-17 12:05:07 -05:00
if ( unlikely ( size > SLUB_MAX_SIZE ) )
2008-02-11 22:47:46 +02:00
return kmalloc_large ( size , gfpflags ) ;
2007-10-16 01:24:38 -07:00
s = get_slab ( size , gfpflags ) ;
2007-05-06 14:49:36 -07:00
2007-10-16 01:24:44 -07:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 04:03:22 -07:00
return s ;
2007-05-06 14:49:36 -07:00
2010-07-09 14:07:10 -05:00
ret = slab_alloc ( s , gfpflags , NUMA_NO_NODE , caller ) ;
2008-08-24 20:49:35 +03:00
2011-03-30 22:57:33 -03:00
/* Honor the call site pointer we received. */
2009-03-23 15:12:24 +02:00
trace_kmalloc ( caller , ret , size , s - > size , gfpflags ) ;
2008-08-24 20:49:35 +03:00
return ret ;
2007-05-06 14:49:36 -07:00
}
2010-09-29 21:02:15 +09:00
# ifdef CONFIG_NUMA
2007-05-06 14:49:36 -07:00
void * __kmalloc_node_track_caller ( size_t size , gfp_t gfpflags ,
2008-08-19 20:43:25 +03:00
int node , unsigned long caller )
2007-05-06 14:49:36 -07:00
{
2007-10-16 01:24:38 -07:00
struct kmem_cache * s ;
2008-08-24 20:49:35 +03:00
void * ret ;
2007-10-16 01:24:38 -07:00
2010-04-08 17:26:44 +08:00
if ( unlikely ( size > SLUB_MAX_SIZE ) ) {
ret = kmalloc_large_node ( size , gfpflags , node ) ;
trace_kmalloc_node ( caller , ret ,
size , PAGE_SIZE < < get_order ( size ) ,
gfpflags , node ) ;
return ret ;
}
2008-02-11 22:47:46 +02:00
2007-10-16 01:24:38 -07:00
s = get_slab ( size , gfpflags ) ;
2007-05-06 14:49:36 -07:00
2007-10-16 01:24:44 -07:00
if ( unlikely ( ZERO_OR_NULL_PTR ( s ) ) )
2007-07-17 04:03:22 -07:00
return s ;
2007-05-06 14:49:36 -07:00
2008-08-24 20:49:35 +03:00
ret = slab_alloc ( s , gfpflags , node , caller ) ;
2011-03-30 22:57:33 -03:00
/* Honor the call site pointer we received. */
2009-03-23 15:12:24 +02:00
trace_kmalloc_node ( caller , ret , size , s - > size , gfpflags , node ) ;
2008-08-24 20:49:35 +03:00
return ret ;
2007-05-06 14:49:36 -07:00
}
2010-09-29 21:02:15 +09:00
# endif
2007-05-06 14:49:36 -07:00
2010-10-05 13:57:26 -05:00
# ifdef CONFIG_SYSFS
2008-04-14 19:11:40 +03:00
static int count_inuse ( struct page * page )
{
return page - > inuse ;
}
static int count_total ( struct page * page )
{
return page - > objects ;
}
2010-10-05 13:57:26 -05:00
# endif
2008-04-14 19:11:40 +03:00
2010-10-05 13:57:26 -05:00
# ifdef CONFIG_SLUB_DEBUG
2007-07-17 04:03:30 -07:00
static int validate_slab ( struct kmem_cache * s , struct page * page ,
unsigned long * map )
2007-05-06 14:49:43 -07:00
{
void * p ;
2008-03-01 13:40:44 -08:00
void * addr = page_address ( page ) ;
2007-05-06 14:49:43 -07:00
if ( ! check_slab ( s , page ) | |
! on_freelist ( s , page , NULL ) )
return 0 ;
/* Now we know that a valid freelist exists */
2008-04-14 19:11:30 +03:00
bitmap_zero ( map , page - > objects ) ;
2007-05-06 14:49:43 -07:00
2011-04-15 14:48:13 -05:00
get_map ( s , page , map ) ;
/* Objects on the freelist must carry the inactive red zone. */
for_each_object ( p , s , addr , page - > objects ) {
if ( test_bit ( slab_index ( p , s , addr ) , map ) )
if ( ! check_object ( s , page , p , SLUB_RED_INACTIVE ) )
return 0 ;
2007-05-06 14:49:43 -07:00
}
2008-04-14 19:11:31 +03:00
/* Objects not on the freelist are allocated and must be active. */
for_each_object ( p , s , addr , page - > objects )
2007-05-09 02:32:40 -07:00
if ( ! test_bit ( slab_index ( p , s , addr ) , map ) )
2010-12-01 20:04:20 +02:00
if ( ! check_object ( s , page , p , SLUB_RED_ACTIVE ) )
2007-05-06 14:49:43 -07:00
return 0 ;
return 1 ;
}
2007-07-17 04:03:30 -07:00
static void validate_slab_slab ( struct kmem_cache * s , struct page * page ,
unsigned long * map )
2007-05-06 14:49:43 -07:00
{
2011-06-01 12:25:53 -05:00
slab_lock ( page ) ;
validate_slab ( s , page , map ) ;
slab_unlock ( page ) ;
2007-05-06 14:49:43 -07:00
}
2007-07-17 04:03:30 -07:00
static int validate_slab_node ( struct kmem_cache * s ,
struct kmem_cache_node * n , unsigned long * map )
2007-05-06 14:49:43 -07:00
{
unsigned long count = 0 ;
struct page * page ;
unsigned long flags ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
list_for_each_entry ( page , & n - > partial , lru ) {
2007-07-17 04:03:30 -07:00
validate_slab_slab ( s , page , map ) ;
2007-05-06 14:49:43 -07:00
count + + ;
}
if ( count ! = n - > nr_partial )
printk ( KERN_ERR " SLUB %s: %ld partial slabs counted but "
" counter=%ld \n " , s - > name , count , n - > nr_partial ) ;
if ( ! ( s - > flags & SLAB_STORE_USER ) )
goto out ;
list_for_each_entry ( page , & n - > full , lru ) {
2007-07-17 04:03:30 -07:00
validate_slab_slab ( s , page , map ) ;
2007-05-06 14:49:43 -07:00
count + + ;
}
if ( count ! = atomic_long_read ( & n - > nr_slabs ) )
printk ( KERN_ERR " SLUB %s: %ld slabs counted but "
" counter=%ld \n " , s - > name , count ,
atomic_long_read ( & n - > nr_slabs ) ) ;
out :
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
return count ;
}
2007-07-17 04:03:30 -07:00
static long validate_slab_cache ( struct kmem_cache * s )
2007-05-06 14:49:43 -07:00
{
int node ;
unsigned long count = 0 ;
2008-04-14 19:11:40 +03:00
unsigned long * map = kmalloc ( BITS_TO_LONGS ( oo_objects ( s - > max ) ) *
2007-07-17 04:03:30 -07:00
sizeof ( unsigned long ) , GFP_KERNEL ) ;
if ( ! map )
return - ENOMEM ;
2007-05-06 14:49:43 -07:00
flush_all ( s ) ;
2007-10-16 01:25:33 -07:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-06 14:49:43 -07:00
struct kmem_cache_node * n = get_node ( s , node ) ;
2007-07-17 04:03:30 -07:00
count + = validate_slab_node ( s , n , map ) ;
2007-05-06 14:49:43 -07:00
}
2007-07-17 04:03:30 -07:00
kfree ( map ) ;
2007-05-06 14:49:43 -07:00
return count ;
}
2007-05-06 14:49:45 -07:00
/*
2007-05-09 02:32:39 -07:00
* Generate lists of code addresses where slabcache objects are allocated
2007-05-06 14:49:45 -07:00
* and freed .
*/
struct location {
unsigned long count ;
2008-08-19 20:43:25 +03:00
unsigned long addr ;
2007-05-09 02:32:45 -07:00
long long sum_time ;
long min_time ;
long max_time ;
long min_pid ;
long max_pid ;
2009-01-01 10:12:29 +10:30
DECLARE_BITMAP ( cpus , NR_CPUS ) ;
2007-05-09 02:32:45 -07:00
nodemask_t nodes ;
2007-05-06 14:49:45 -07:00
} ;
struct loc_track {
unsigned long max ;
unsigned long count ;
struct location * loc ;
} ;
static void free_loc_track ( struct loc_track * t )
{
if ( t - > max )
free_pages ( ( unsigned long ) t - > loc ,
get_order ( sizeof ( struct location ) * t - > max ) ) ;
}
2007-07-17 04:03:20 -07:00
static int alloc_loc_track ( struct loc_track * t , unsigned long max , gfp_t flags )
2007-05-06 14:49:45 -07:00
{
struct location * l ;
int order ;
order = get_order ( sizeof ( struct location ) * max ) ;
2007-07-17 04:03:20 -07:00
l = ( void * ) __get_free_pages ( flags , order ) ;
2007-05-06 14:49:45 -07:00
if ( ! l )
return 0 ;
if ( t - > count ) {
memcpy ( l , t - > loc , sizeof ( struct location ) * t - > count ) ;
free_loc_track ( t ) ;
}
t - > max = max ;
t - > loc = l ;
return 1 ;
}
static int add_location ( struct loc_track * t , struct kmem_cache * s ,
2007-05-09 02:32:45 -07:00
const struct track * track )
2007-05-06 14:49:45 -07:00
{
long start , end , pos ;
struct location * l ;
2008-08-19 20:43:25 +03:00
unsigned long caddr ;
2007-05-09 02:32:45 -07:00
unsigned long age = jiffies - track - > when ;
2007-05-06 14:49:45 -07:00
start = - 1 ;
end = t - > count ;
for ( ; ; ) {
pos = start + ( end - start + 1 ) / 2 ;
/*
* There is nothing at " end " . If we end up there
* the new entry must be inserted before end .
*/
if ( pos = = end )
break ;
caddr = t - > loc [ pos ] . addr ;
2007-05-09 02:32:45 -07:00
if ( track - > addr = = caddr ) {
l = & t - > loc [ pos ] ;
l - > count + + ;
if ( track - > when ) {
l - > sum_time + = age ;
if ( age < l - > min_time )
l - > min_time = age ;
if ( age > l - > max_time )
l - > max_time = age ;
if ( track - > pid < l - > min_pid )
l - > min_pid = track - > pid ;
if ( track - > pid > l - > max_pid )
l - > max_pid = track - > pid ;
2009-01-01 10:12:29 +10:30
cpumask_set_cpu ( track - > cpu ,
to_cpumask ( l - > cpus ) ) ;
2007-05-09 02:32:45 -07:00
}
node_set ( page_to_nid ( virt_to_page ( track ) ) , l - > nodes ) ;
2007-05-06 14:49:45 -07:00
return 1 ;
}
2007-05-09 02:32:45 -07:00
if ( track - > addr < caddr )
2007-05-06 14:49:45 -07:00
end = pos ;
else
start = pos ;
}
/*
2007-05-09 02:32:39 -07:00
* Not found . Insert new tracking element .
2007-05-06 14:49:45 -07:00
*/
2007-07-17 04:03:20 -07:00
if ( t - > count > = t - > max & & ! alloc_loc_track ( t , 2 * t - > max , GFP_ATOMIC ) )
2007-05-06 14:49:45 -07:00
return 0 ;
l = t - > loc + pos ;
if ( pos < t - > count )
memmove ( l + 1 , l ,
( t - > count - pos ) * sizeof ( struct location ) ) ;
t - > count + + ;
l - > count = 1 ;
2007-05-09 02:32:45 -07:00
l - > addr = track - > addr ;
l - > sum_time = age ;
l - > min_time = age ;
l - > max_time = age ;
l - > min_pid = track - > pid ;
l - > max_pid = track - > pid ;
2009-01-01 10:12:29 +10:30
cpumask_clear ( to_cpumask ( l - > cpus ) ) ;
cpumask_set_cpu ( track - > cpu , to_cpumask ( l - > cpus ) ) ;
2007-05-09 02:32:45 -07:00
nodes_clear ( l - > nodes ) ;
node_set ( page_to_nid ( virt_to_page ( track ) ) , l - > nodes ) ;
2007-05-06 14:49:45 -07:00
return 1 ;
}
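/*
 * Illustration only, not part of slub.c: a userspace sketch of the technique
 * add_location() uses, namely a binary search over a table kept sorted by
 * address, with an in-place insert and capacity doubling on a miss. The
 * names site, site_table and site_add are hypothetical, and realloc() stands
 * in for the page-based alloc_loc_track(); timing, pid, cpu and node
 * bookkeeping is left out.
 */

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct site {
	unsigned long addr;	/* sort key */
	unsigned long count;	/* how often this address was seen */
};

struct site_table {
	size_t max;		/* allocated entries */
	size_t count;		/* used entries */
	struct site *loc;
};

/* Returns 1 on success, 0 if the table could not be grown. */
static int site_add(struct site_table *t, unsigned long addr)
{
	long start = -1, end = (long)t->count, pos;

	assert(t->count <= t->max);

	for (;;) {
		pos = start + (end - start + 1) / 2;
		/* Nothing lives at "end"; reaching it means "insert here". */
		if (pos == end)
			break;
		if (t->loc[pos].addr == addr) {
			t->loc[pos].count++;
			return 1;
		}
		if (addr < t->loc[pos].addr)
			end = pos;
		else
			start = pos;
	}

	/* Not found: grow by doubling when the table is full. */
	if (t->count >= t->max) {
		size_t max = t->max ? 2 * t->max : 4;
		struct site *l = realloc(t->loc, max * sizeof(struct site));

		if (!l)
			return 0;
		t->loc = l;
		t->max = max;
	}

	/* Shift the tail up by one slot to keep the table sorted. */
	if ((size_t)pos < t->count)
		memmove(&t->loc[pos + 1], &t->loc[pos],
			(t->count - pos) * sizeof(struct site));
	t->count++;
	t->loc[pos].addr = addr;
	t->loc[pos].count = 1;
	return 1;
}
```

/*
 * Starting the search at start = -1 and end = count lets the same loop
 * handle an empty table, an insert at the front and an insert at the back
 * without special cases.
 */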
static void process_slab ( struct loc_track * t , struct kmem_cache * s ,
2010-03-24 22:25:47 +01:00
struct page * page , enum track_item alloc ,
2010-09-29 21:02:13 +09:00
unsigned long * map )
2007-05-06 14:49:45 -07:00
{
2008-03-01 13:40:44 -08:00
void * addr = page_address ( page ) ;
2007-05-06 14:49:45 -07:00
void * p ;
2008-04-14 19:11:30 +03:00
bitmap_zero ( map , page - > objects ) ;
2011-04-15 14:48:13 -05:00
get_map ( s , page , map ) ;
2007-05-06 14:49:45 -07:00
2008-04-14 19:11:31 +03:00
for_each_object ( p , s , addr , page - > objects )
2007-05-09 02:32:45 -07:00
if ( ! test_bit ( slab_index ( p , s , addr ) , map ) )
add_location ( t , s , get_track ( s , p , alloc ) ) ;
2007-05-06 14:49:45 -07:00
}
static int list_locations ( struct kmem_cache * s , char * buf ,
enum track_item alloc )
{
2008-01-31 15:20:50 -08:00
int len = 0 ;
2007-05-06 14:49:45 -07:00
unsigned long i ;
2007-07-17 04:03:20 -07:00
struct loc_track t = { 0 , 0 , NULL } ;
2007-05-06 14:49:45 -07:00
int node ;
2010-03-24 22:25:47 +01:00
unsigned long * map = kmalloc ( BITS_TO_LONGS ( oo_objects ( s - > max ) ) *
sizeof ( unsigned long ) , GFP_KERNEL ) ;
2007-05-06 14:49:45 -07:00
2010-03-24 22:25:47 +01:00
if ( ! map | | ! alloc_loc_track ( & t , PAGE_SIZE / sizeof ( struct location ) ,
GFP_TEMPORARY ) ) {
kfree ( map ) ;
2007-07-17 04:03:20 -07:00
return sprintf ( buf , " Out of memory \n " ) ;
2010-03-24 22:25:47 +01:00
}
2007-05-06 14:49:45 -07:00
/* Push back cpu slabs */
flush_all ( s ) ;
2007-10-16 01:25:33 -07:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
2007-05-06 14:49:45 -07:00
struct kmem_cache_node * n = get_node ( s , node ) ;
unsigned long flags ;
struct page * page ;
2007-08-22 14:01:56 -07:00
if ( ! atomic_long_read ( & n - > nr_slabs ) )
2007-05-06 14:49:45 -07:00
continue ;
spin_lock_irqsave ( & n - > list_lock , flags ) ;
list_for_each_entry ( page , & n - > partial , lru )
2010-03-24 22:25:47 +01:00
process_slab ( & t , s , page , alloc , map ) ;
2007-05-06 14:49:45 -07:00
list_for_each_entry ( page , & n - > full , lru )
2010-03-24 22:25:47 +01:00
process_slab ( & t , s , page , alloc , map ) ;
2007-05-06 14:49:45 -07:00
spin_unlock_irqrestore ( & n - > list_lock , flags ) ;
}
for ( i = 0 ; i < t . count ; i + + ) {
2007-05-09 02:32:45 -07:00
struct location * l = & t . loc [ i ] ;
2007-05-06 14:49:45 -07:00
2008-12-09 13:14:27 -08:00
if ( len > PAGE_SIZE - KSYM_SYMBOL_LEN - 100 )
2007-05-06 14:49:45 -07:00
break ;
2008-01-31 15:20:50 -08:00
len + = sprintf ( buf + len , " %7ld " , l - > count ) ;
2007-05-09 02:32:45 -07:00
if ( l - > addr )
2011-01-13 15:45:52 -08:00
len + = sprintf ( buf + len , " %pS " , ( void * ) l - > addr ) ;
2007-05-06 14:49:45 -07:00
else
2008-01-31 15:20:50 -08:00
len + = sprintf ( buf + len , " <not-available> " ) ;
2007-05-09 02:32:45 -07:00
if ( l - > sum_time ! = l - > min_time ) {
2008-01-31 15:20:50 -08:00
len + = sprintf ( buf + len , " age=%ld/%ld/%ld " ,
2008-05-01 04:34:31 -07:00
l - > min_time ,
( long ) div_u64 ( l - > sum_time , l - > count ) ,
l - > max_time ) ;
2007-05-09 02:32:45 -07:00
} else
2008-01-31 15:20:50 -08:00
len + = sprintf ( buf + len , " age=%ld " ,
2007-05-09 02:32:45 -07:00
l - > min_time ) ;
if ( l - > min_pid ! = l - > max_pid )
2008-01-31 15:20:50 -08:00
len + = sprintf ( buf + len , " pid=%ld-%ld " ,
2007-05-09 02:32:45 -07:00
l - > min_pid , l - > max_pid ) ;
else
2008-01-31 15:20:50 -08:00
len + = sprintf ( buf + len , " pid=%ld " ,
2007-05-09 02:32:45 -07:00
l - > min_pid ) ;
2009-01-01 10:12:29 +10:30
if ( num_online_cpus ( ) > 1 & &
! cpumask_empty ( to_cpumask ( l - > cpus ) ) & &
2008-01-31 15:20:50 -08:00
len < PAGE_SIZE - 60 ) {
len + = sprintf ( buf + len , " cpus= " ) ;
len + = cpulist_scnprintf ( buf + len , PAGE_SIZE - len - 50 ,
2009-01-01 10:12:29 +10:30
to_cpumask ( l - > cpus ) ) ;
2007-05-09 02:32:45 -07:00
}
2009-06-16 15:32:15 -07:00
if ( nr_online_nodes > 1 & & ! nodes_empty ( l - > nodes ) & &
2008-01-31 15:20:50 -08:00
len < PAGE_SIZE - 60 ) {
len + = sprintf ( buf + len , " nodes= " ) ;
len + = nodelist_scnprintf ( buf + len , PAGE_SIZE - len - 50 ,
2007-05-09 02:32:45 -07:00
l - > nodes ) ;
}
2008-01-31 15:20:50 -08:00
len + = sprintf ( buf + len , " \n " ) ;
2007-05-06 14:49:45 -07:00
}
free_loc_track ( & t ) ;
2010-03-24 22:25:47 +01:00
kfree ( map ) ;
2007-05-06 14:49:45 -07:00
if ( ! t . count )
2008-01-31 15:20:50 -08:00
len + = sprintf ( buf , " No data \n " ) ;
return len ;
2007-05-06 14:49:45 -07:00
}
2010-10-05 13:57:26 -05:00
# endif
2007-05-06 14:49:45 -07:00
2010-10-05 13:57:27 -05:00
# ifdef SLUB_RESILIENCY_TEST
static void resiliency_test ( void )
{
u8 * p ;
BUILD_BUG_ON ( KMALLOC_MIN_SIZE > 16 | | SLUB_PAGE_SHIFT < 10 ) ;
printk ( KERN_ERR " SLUB resiliency testing \n " ) ;
printk ( KERN_ERR " ----------------------- \n " ) ;
printk ( KERN_ERR " A. Corruption after allocation \n " ) ;
p = kzalloc ( 16 , GFP_KERNEL ) ;
p [ 16 ] = 0x12 ;
printk ( KERN_ERR " \n 1. kmalloc-16: Clobber Redzone/next pointer "
" 0x12->0x%p \n \n " , p + 16 ) ;
validate_slab_cache ( kmalloc_caches [ 4 ] ) ;
/* Hmmm... The next two are dangerous */
p = kzalloc ( 32 , GFP_KERNEL ) ;
p [ 32 + sizeof ( void * ) ] = 0x34 ;
printk ( KERN_ERR " \n 2. kmalloc-32: Clobber next pointer/next slab "
" 0x34 -> -0x%p \n " , p ) ;
printk ( KERN_ERR
" If allocated object is overwritten then not detectable \n \n " ) ;
validate_slab_cache ( kmalloc_caches [ 5 ] ) ;
p = kzalloc ( 64 , GFP_KERNEL ) ;
p + = 64 + ( get_cycles ( ) & 0xff ) * sizeof ( void * ) ;
* p = 0x56 ;
printk ( KERN_ERR " \n 3. kmalloc-64: corrupting random byte 0x56->0x%p \n " ,
p ) ;
printk ( KERN_ERR
" If allocated object is overwritten then not detectable \n \n " ) ;
validate_slab_cache ( kmalloc_caches [ 6 ] ) ;
printk ( KERN_ERR " \n B. Corruption after free \n " ) ;
p = kzalloc ( 128 , GFP_KERNEL ) ;
kfree ( p ) ;
* p = 0x78 ;
printk ( KERN_ERR " 1. kmalloc-128: Clobber first word 0x78->0x%p \n \n " , p ) ;
validate_slab_cache ( kmalloc_caches [ 7 ] ) ;
p = kzalloc ( 256 , GFP_KERNEL ) ;
kfree ( p ) ;
p [ 50 ] = 0x9a ;
printk ( KERN_ERR " \n 2. kmalloc-256: Clobber 50th byte 0x9a->0x%p \n \n " ,
p ) ;
validate_slab_cache ( kmalloc_caches [ 8 ] ) ;
p = kzalloc ( 512 , GFP_KERNEL ) ;
kfree ( p ) ;
p [ 512 ] = 0xab ;
printk ( KERN_ERR " \n 3. kmalloc-512: Clobber redzone 0xab->0x%p \n \n " , p ) ;
validate_slab_cache ( kmalloc_caches [ 9 ] ) ;
}
# else
# ifdef CONFIG_SYSFS
static void resiliency_test ( void ) { }
# endif
# endif
2010-10-05 13:57:26 -05:00
# ifdef CONFIG_SYSFS
2007-05-06 14:49:36 -07:00
enum slab_stat_type {
2008-04-14 19:11:40 +03:00
SL_ALL , /* All slabs */
SL_PARTIAL , /* Only partially allocated slabs */
SL_CPU , /* Only slabs used for cpu caches */
SL_OBJECTS , /* Determine allocated objects not slabs */
SL_TOTAL /* Determine object capacity not slabs */
2007-05-06 14:49:36 -07:00
} ;
2008-04-14 19:11:40 +03:00
# define SO_ALL (1 << SL_ALL)
2007-05-06 14:49:36 -07:00
# define SO_PARTIAL (1 << SL_PARTIAL)
# define SO_CPU (1 << SL_CPU)
# define SO_OBJECTS (1 << SL_OBJECTS)
2008-04-14 19:11:40 +03:00
# define SO_TOTAL (1 << SL_TOTAL)
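/*
 * Illustration only, not part of slub.c: a userspace sketch of how masks
 * built from an enum, like the SO_* flags above, select what gets counted
 * per slab. The ST_* and SF_* names, slab_model and pick_value are
 * hypothetical; the precedence (capacity, then objects in use, then one per
 * slab) follows the if/else chains in show_slab_objects().
 */

```c
#include <assert.h>
#include <stddef.h>

enum stat_type_sketch { ST_ALL, ST_PARTIAL, ST_CPU, ST_OBJECTS, ST_TOTAL };

#define SF_ALL		(1UL << ST_ALL)
#define SF_PARTIAL	(1UL << ST_PARTIAL)
#define SF_CPU		(1UL << ST_CPU)
#define SF_OBJECTS	(1UL << ST_OBJECTS)
#define SF_TOTAL	(1UL << ST_TOTAL)

/* Hypothetical stand-in for the two struct page fields that are read. */
struct slab_model {
	unsigned int objects;	/* capacity of the slab */
	unsigned int inuse;	/* objects currently allocated */
};

/* Value one slab contributes to the total for the given flag combination. */
static unsigned long pick_value(const struct slab_model *p, unsigned long flags)
{
	assert(p != NULL);

	if (flags & SF_TOTAL)
		return p->objects;	/* object capacity */
	if (flags & SF_OBJECTS)
		return p->inuse;	/* allocated objects */
	return 1;			/* count the slab itself */
}
```

/*
 * A caller combines one "which slabs" bit with at most one "what to count"
 * bit, for example SF_CPU | SF_OBJECTS.
 */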
2007-05-06 14:49:36 -07:00
2008-03-02 23:28:24 +03:00
static ssize_t show_slab_objects ( struct kmem_cache * s ,
char * buf , unsigned long flags )
2007-05-06 14:49:36 -07:00
{
unsigned long total = 0 ;
int node ;
int x ;
unsigned long * nodes ;
unsigned long * per_cpu ;
nodes = kzalloc ( 2 * sizeof ( unsigned long ) * nr_node_ids , GFP_KERNEL ) ;
2008-03-02 23:28:24 +03:00
if ( ! nodes )
return - ENOMEM ;
2007-05-06 14:49:36 -07:00
per_cpu = nodes + nr_node_ids ;
2008-04-14 19:11:40 +03:00
if ( flags & SO_CPU ) {
int cpu ;
2007-05-06 14:49:36 -07:00
2008-04-14 19:11:40 +03:00
for_each_possible_cpu ( cpu ) {
2009-12-18 16:26:20 -06:00
struct kmem_cache_cpu * c = per_cpu_ptr ( s - > cpu_slab , cpu ) ;
2011-11-22 16:02:02 +01:00
int node = ACCESS_ONCE ( c - > node ) ;
2011-08-09 16:12:27 -05:00
struct page * page ;
2007-10-16 01:26:05 -07:00
2011-11-22 16:02:02 +01:00
if ( node < 0 )
2008-04-14 19:11:40 +03:00
continue ;
2011-11-22 16:02:02 +01:00
page = ACCESS_ONCE ( c - > page ) ;
if ( page ) {
if ( flags & SO_TOTAL )
x = page - > objects ;
2008-04-14 19:11:40 +03:00
else if ( flags & SO_OBJECTS )
2011-11-22 16:02:02 +01:00
x = page - > inuse ;
2007-05-06 14:49:36 -07:00
else
x = 1 ;
2008-04-14 19:11:40 +03:00
2007-05-06 14:49:36 -07:00
total + = x ;
2011-11-22 16:02:02 +01:00
nodes [ node ] + = x ;
2007-05-06 14:49:36 -07:00
}
2011-08-09 16:12:27 -05:00
page = c - > partial ;
if ( page ) {
x = page - > pobjects ;
2011-11-22 16:02:02 +01:00
total + = x ;
nodes [ node ] + = x ;
2011-08-09 16:12:27 -05:00
}
2011-11-22 16:02:02 +01:00
per_cpu [ node ] + + ;
2007-05-06 14:49:36 -07:00
}
}
2011-01-10 10:15:15 -06:00
lock_memory_hotplug ( ) ;
2010-10-05 13:57:26 -05:00
# ifdef CONFIG_SLUB_DEBUG
2008-04-14 19:11:40 +03:00
if ( flags & SO_ALL ) {
for_each_node_state ( node , N_NORMAL_MEMORY ) {
struct kmem_cache_node * n = get_node ( s , node ) ;
if ( flags & SO_TOTAL )
x = atomic_long_read ( & n - > total_objects ) ;
else if ( flags & SO_OBJECTS )
x = atomic_long_read ( & n - > total_objects ) -
count_partial ( n , count_free ) ;
2007-05-06 14:49:36 -07:00
else
2008-04-14 19:11:40 +03:00
x = atomic_long_read ( & n - > nr_slabs ) ;
2007-05-06 14:49:36 -07:00
total + = x ;
nodes [ node ] + = x ;
}
2010-10-05 13:57:26 -05:00
} else
# endif
if ( flags & SO_PARTIAL ) {
2008-04-14 19:11:40 +03:00
for_each_node_state ( node , N_NORMAL_MEMORY ) {
struct kmem_cache_node * n = get_node ( s , node ) ;
2007-05-06 14:49:36 -07:00
2008-04-14 19:11:40 +03:00
if ( flags & SO_TOTAL )
x = count_partial ( n , count_total ) ;
else if ( flags & SO_OBJECTS )
x = count_partial ( n , count_inuse ) ;
2007-05-06 14:49:36 -07:00
else
2008-04-14 19:11:40 +03:00
x = n - > nr_partial ;
2007-05-06 14:49:36 -07:00
total + = x ;
nodes [ node ] + = x ;
}
}
x = sprintf ( buf , " %lu " , total ) ;
# ifdef CONFIG_NUMA
2007-10-16 01:25:33 -07:00
for_each_node_state ( node , N_NORMAL_MEMORY )
2007-05-06 14:49:36 -07:00
if ( nodes [ node ] )
x + = sprintf ( buf + x , " N%d=%lu " ,
node , nodes [ node ] ) ;
# endif
2011-01-10 10:15:15 -06:00
unlock_memory_hotplug ( ) ;
2007-05-06 14:49:36 -07:00
kfree ( nodes ) ;
return x + sprintf ( buf + x , " \n " ) ;
}
2010-10-05 13:57:26 -05:00
# ifdef CONFIG_SLUB_DEBUG
2007-05-06 14:49:36 -07:00
static int any_slab_objects ( struct kmem_cache * s )
{
int node ;
2007-10-16 01:26:05 -07:00
for_each_online_node ( node ) {
2007-05-06 14:49:36 -07:00
struct kmem_cache_node * n = get_node ( s , node ) ;
2007-10-16 01:26:05 -07:00
if ( ! n )
continue ;
2008-05-06 20:42:39 -07:00
if ( atomic_long_read ( & n - > total_objects ) )
2007-05-06 14:49:36 -07:00
return 1 ;
}
return 0 ;
}
2010-10-05 13:57:26 -05:00
# endif
2007-05-06 14:49:36 -07:00
# define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
2011-07-14 15:07:13 +03:00
# define to_slab(n) container_of(n, struct kmem_cache, kobj)
2007-05-06 14:49:36 -07:00
struct slab_attribute {
struct attribute attr ;
ssize_t ( * show ) ( struct kmem_cache * s , char * buf ) ;
ssize_t ( * store ) ( struct kmem_cache * s , const char * x , size_t count ) ;
} ;
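/*
 * Illustration only, not part of slub.c: a userspace sketch of the pattern
 * behind to_slab_attr(). A generic attribute is embedded in a wrapper that
 * carries a show callback, and the wrapper is recovered from a pointer to
 * the embedded member. All names ending in _sketch, and the outer_of macro,
 * are hypothetical; outer_of is a simplified stand-in for the kernel's
 * container_of().
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Recover the enclosing structure from a pointer to one of its members. */
#define outer_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct attr_sketch {
	const char *name;
};

struct cache_sketch {
	int size;
};

struct cache_attr_sketch {
	struct attr_sketch attr;
	int (*show)(const struct cache_sketch *c, char *buf, size_t len);
};

static int size_show_sketch(const struct cache_sketch *c, char *buf, size_t len)
{
	return snprintf(buf, len, "%d\n", c->size);
}

/* Generic entry point: only sees the embedded attribute. */
static int attr_show_sketch(const struct cache_sketch *c,
			    struct attr_sketch *a, char *buf, size_t len)
{
	struct cache_attr_sketch *ca =
		outer_of(a, struct cache_attr_sketch, attr);

	assert(buf != NULL && len > 0);

	if (!ca->show)
		return -1;	/* attribute is not readable */
	return ca->show(c, buf, len);
}
```

/*
 * The generic layer never needs to know which attribute it was handed; the
 * wrapper supplies the callback, and a NULL callback marks an unsupported
 * operation.
 */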
# define SLAB_ATTR_RO(_name) \
mm: restrict access to slab files under procfs and sysfs
Historically /proc/slabinfo and files under /sys/kernel/slab/* have
world read permissions and are accessible to the world. slabinfo
contains rather private information related both to the kernel and
userspace tasks. Depending on the situation, it might reveal either
private information per se or information useful to make another
targeted attack. Some examples of what can be learned by
reading/watching for /proc/slabinfo entries:
1) dentry (and different *inode*) number might reveal other processes fs
activity. The number of dentry "active objects" doesn't strictly show
file count opened/touched by a process, however, there is a good
correlation between them. The patch "proc: force dcache drop on
unauthorized access" relies on the privacy of dentry count.
2) different inode entries might reveal the same information as (1), but
these are more fine-grained counters. If a filesystem is mounted in a
private mount point (or even a private namespace) and fs type differs from
other mounted fs types, fs activity in this mount point/namespace is
revealed. If there is a single ecryptfs mount point, the whole fs
activity of a single user is revealed. The number of files in an ecryptfs
mount point is private information per se.
3) fuse_* reveals number of files / fs activity of a user in a user
private mount point. It is approx. the same severity as ecryptfs
infoleak in (2).
4) sysfs_dir_cache similar to (2) reveals devices' addition/removal,
which can be otherwise hidden by "chmod 0700 /sys/". With 0444 slabinfo
the precise number of sysfs files is known to the world.
5) buffer_head might reveal some kernel activity. With other
information leaks an attacker might identify what specific kernel
routines generate buffer_head activity.
6) *kmalloc* infoleaks are very situational. Attacker should watch for
the specific kmalloc size entry and filter the noise related to the unrelated
kernel activity. If an attacker has relatively silent victim system, he
might get rather precise counters.
Additional information sources might significantly increase the slabinfo
infoleak benefits. E.g. if an attacker knows that the processes
activity on the system is very low (only core daemons like syslog and
cron), he may run setxid binaries / trigger local daemon activity /
trigger network services activity / await sporadic cron jobs activity
/ etc. and get rather precise counters for fs and network activity of
these privileged tasks, which is unknown otherwise.
Also, hiding slabinfo and /sys/kernel/slab/* is one step to complicate
exploitation of kernel heap overflows (and possibly, other bugs). The
related discussion:
http://thread.gmane.org/gmane.linux.kernel/1108378
To keep compatibility with the old permission model, where a non-root
monitoring daemon could watch for kernel memleaks through slabinfo, one
should do:
groupadd slabinfo
usermod -a -G slabinfo $MONITOR_USER
And add the following commands to init scripts (to mountall.conf in
Ubuntu's upstart case):
chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Reviewed-by: Kees Cook <kees@ubuntu.com>
Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: David Rientjes <rientjes@google.com>
CC: Valdis.Kletnieks@vt.edu
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Alan Cox <alan@linux.intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-27 21:54:53 +04:00
static struct slab_attribute _name # # _attr = \
__ATTR ( _name , 0400 , _name # # _show , NULL )
2007-05-06 14:49:36 -07:00
# define SLAB_ATTR(_name) \
static struct slab_attribute _name # # _attr = \
2011-09-27 21:54:53 +04:00
__ATTR ( _name , 0600 , _name # # _show , _name # # _store )
2007-05-06 14:49:36 -07:00
static ssize_t slab_size_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > size ) ;
}
SLAB_ATTR_RO ( slab_size ) ;
static ssize_t align_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > align ) ;
}
SLAB_ATTR_RO ( align ) ;
static ssize_t object_size_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > objsize ) ;
}
SLAB_ATTR_RO ( object_size ) ;
static ssize_t objs_per_slab_show ( struct kmem_cache * s , char * buf )
{
2008-04-14 19:11:31 +03:00
return sprintf ( buf , " %d \n " , oo_objects ( s - > oo ) ) ;
2007-05-06 14:49:36 -07:00
}
SLAB_ATTR_RO ( objs_per_slab ) ;
2008-04-14 19:11:41 +03:00
static ssize_t order_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
2008-04-29 16:11:12 -07:00
unsigned long order ;
int err ;
err = strict_strtoul ( buf , 10 , & order ) ;
if ( err )
return err ;
2008-04-14 19:11:41 +03:00
if ( order > slub_max_order | | order < slub_min_order )
return - EINVAL ;
calculate_sizes ( s , order ) ;
return length ;
}
2007-05-06 14:49:36 -07:00
static ssize_t order_show ( struct kmem_cache * s , char * buf )
{
2008-04-14 19:11:31 +03:00
return sprintf ( buf , " %d \n " , oo_order ( s - > oo ) ) ;
2007-05-06 14:49:36 -07:00
}
2008-04-14 19:11:41 +03:00
SLAB_ATTR ( order ) ;
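/*
 * Illustration only, not part of slub.c: a userspace sketch of the
 * parse-then-range-check shape of order_store(). parse_order is a
 * hypothetical name, and strtoul() with explicit checks stands in for
 * strict_strtoul(); the exact set of inputs the kernel helper accepts is
 * not reproduced here. The sketch assumes one trailing newline is allowed,
 * as sysfs writes from a shell usually carry one.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a decimal order from buf and check it against [min, max].
 * Returns 0 and stores the value in *out on success, or a negative
 * errno value on failure.
 */
static int parse_order(const char *buf, unsigned long min, unsigned long max,
		       unsigned long *out)
{
	char *end;
	unsigned long v;

	assert(buf != NULL && out != NULL);

	/* strtoul() would skip whitespace and accept a sign; reject both. */
	if (*buf < '0' || *buf > '9')
		return -EINVAL;

	errno = 0;
	v = strtoul(buf, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;

	/* Allow a single trailing newline, nothing else. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (v > max || v < min)
		return -EINVAL;

	*out = v;
	return 0;
}
```

/*
 * Validating before touching any state means a rejected write leaves the
 * cache configuration unchanged.
 */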
2007-05-06 14:49:36 -07:00
slub: add min_partial sysfs tunable
Now that a cache's min_partial has been moved to struct kmem_cache, it's
possible to easily tune it from userspace by adding a sysfs attribute.
It may not be desirable to keep a large number of partial slabs around
if a cache is used infrequently and memory, especially when constrained
by a cgroup, is scarce. It's better to allow userspace to set the
minimum policy per cache instead of relying explicitly on
kmem_cache_shrink().
The memory savings from simply moving min_partial from struct
kmem_cache_node to struct kmem_cache is obviously not significant
(unless maybe you're from SGI or something), at the largest it's
# allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)
The true savings occur when userspace reduces the number of partial
slabs that would otherwise be wasted, especially on machines with a
large number of nodes (ia64 with CONFIG_NODES_SHIFT at 10 for default?).
However well the kernel estimates ideal values for n->min_partial and
ensures it's within a sane range, userspace has no input other
than writing to /sys/kernel/slab/cache/shrink.
There simply isn't any better heuristic to add when calculating the
partial values for a better estimate that works for all possible caches.
And since it's currently a static value, the user really has no way of
reclaiming that wasted space, which can be significant when constrained
by a cgroup (either cpusets or, later, memory controller slab limits)
without shrinking it entirely.
This also allows the user to specify that increased fragmentation and
more partial slabs are actually desired to avoid the cost of allocating
new slabs at runtime for specific caches.
There's also no reason why this should be a per-struct kmem_cache_node
value in the first place. You could argue that a machine would have
such node size asymmetries that it should be specified on a per-node
basis, but we know nobody is doing that right now since it's a purely
static value at the moment and there's no convenient way to tune that
via slub's sysfs interface.
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-02-22 17:40:09 -08:00
static ssize_t min_partial_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %lu \n " , s - > min_partial ) ;
}
static ssize_t min_partial_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
unsigned long min ;
int err ;
err = strict_strtoul ( buf , 10 , & min ) ;
if ( err )
return err ;
2009-02-25 09:16:35 +02:00
set_min_partial ( s , min ) ;
2009-02-22 17:40:09 -08:00
return length ;
}
SLAB_ATTR ( min_partial ) ;
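The upper bound quoted in the min_partial commit message above can be worked through numerically. The following is a minimal sketch, assuming MAX_NUMNODES equals 1 << CONFIG_NODES_SHIFT; the helper name min_partial_savings and the cache count passed to it are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/*
 * Upper bound from the commit message:
 *   allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)
 *
 * nodes_shift stands in for CONFIG_NODES_SHIFT, word_size for
 * sizeof(unsigned long) on the target, and nr_caches is a
 * hypothetical count chosen by the caller.
 */
static size_t min_partial_savings(size_t nr_caches, unsigned int nodes_shift,
				  size_t word_size)
{
	size_t max_numnodes = (size_t)1 << nodes_shift;

	return nr_caches * (max_numnodes - 1) * word_size;
}
```

With 100 caches, a node shift of 10 and 8-byte longs, the bound is 100 * 1023 * 8 = 818400 bytes, roughly 800 KiB, which is consistent with the message's point that the real benefit comes from tuning the value rather than from relocating the field.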
2011-08-09 16:12:27 -05:00
static ssize_t cpu_partial_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %u \n " , s - > cpu_partial ) ;
}
static ssize_t cpu_partial_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
unsigned long objects ;
int err ;
err = strict_strtoul ( buf , 10 , & objects ) ;
if ( err )
return err ;
2012-01-09 13:19:45 -08:00
if ( objects & & kmem_cache_debug ( s ) )
return - EINVAL ;
2011-08-09 16:12:27 -05:00
s - > cpu_partial = objects ;
flush_all ( s ) ;
return length ;
}
SLAB_ATTR ( cpu_partial ) ;
2007-05-06 14:49:36 -07:00
static ssize_t ctor_show ( struct kmem_cache * s , char * buf )
{
2011-01-13 15:45:52 -08:00
if ( ! s - > ctor )
return 0 ;
return sprintf ( buf , " %pS \n " , s - > ctor ) ;
2007-05-06 14:49:36 -07:00
}
SLAB_ATTR_RO ( ctor ) ;
static ssize_t aliases_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > refcount - 1 ) ;
}
SLAB_ATTR_RO ( aliases ) ;
static ssize_t partial_show ( struct kmem_cache * s , char * buf )
{
2008-02-15 15:22:21 -08:00
return show_slab_objects ( s , buf , SO_PARTIAL ) ;
2007-05-06 14:49:36 -07:00
}
SLAB_ATTR_RO ( partial ) ;
static ssize_t cpu_slabs_show ( struct kmem_cache * s , char * buf )
{
2008-02-15 15:22:21 -08:00
return show_slab_objects ( s , buf , SO_CPU ) ;
2007-05-06 14:49:36 -07:00
}
SLAB_ATTR_RO ( cpu_slabs ) ;
static ssize_t objects_show ( struct kmem_cache * s , char * buf )
{
2008-04-14 19:11:40 +03:00
return show_slab_objects ( s , buf , SO_ALL | SO_OBJECTS ) ;
2007-05-06 14:49:36 -07:00
}
SLAB_ATTR_RO ( objects ) ;
2008-04-14 19:11:40 +03:00
static ssize_t objects_partial_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_PARTIAL | SO_OBJECTS ) ;
}
SLAB_ATTR_RO ( objects_partial ) ;
2011-08-09 16:12:27 -05:00
static ssize_t slabs_cpu_partial_show ( struct kmem_cache * s , char * buf )
{
int objects = 0 ;
int pages = 0 ;
int cpu ;
int len ;
for_each_online_cpu ( cpu ) {
struct page * page = per_cpu_ptr ( s - > cpu_slab , cpu ) - > partial ;
if ( page ) {
pages + = page - > pages ;
objects + = page - > pobjects ;
}
}
len = sprintf ( buf , " %d(%d) " , objects , pages ) ;
# ifdef CONFIG_SMP
for_each_online_cpu ( cpu ) {
struct page * page = per_cpu_ptr ( s - > cpu_slab , cpu ) - > partial ;
if ( page & & len < PAGE_SIZE - 20 )
len + = sprintf ( buf + len , " C%d=%d(%d) " , cpu ,
page - > pobjects , page - > pages ) ;
}
# endif
return len + sprintf ( buf + len , " \n " ) ;
}
SLAB_ATTR_RO ( slabs_cpu_partial ) ;
2010-10-05 13:57:27 -05:00
static ssize_t reclaim_account_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_RECLAIM_ACCOUNT ) ) ;
}
static ssize_t reclaim_account_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
s - > flags & = ~ SLAB_RECLAIM_ACCOUNT ;
if ( buf [ 0 ] = = ' 1 ' )
s - > flags | = SLAB_RECLAIM_ACCOUNT ;
return length ;
}
SLAB_ATTR ( reclaim_account ) ;
static ssize_t hwcache_align_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_HWCACHE_ALIGN ) ) ;
}
SLAB_ATTR_RO ( hwcache_align ) ;
# ifdef CONFIG_ZONE_DMA
static ssize_t cache_dma_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_CACHE_DMA ) ) ;
}
SLAB_ATTR_RO ( cache_dma ) ;
# endif
static ssize_t destroy_by_rcu_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_DESTROY_BY_RCU ) ) ;
}
SLAB_ATTR_RO ( destroy_by_rcu ) ;
2011-03-10 15:21:48 +08:00
static ssize_t reserved_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , s - > reserved ) ;
}
SLAB_ATTR_RO ( reserved ) ;
2010-10-05 13:57:26 -05:00
# ifdef CONFIG_SLUB_DEBUG
2010-10-05 13:57:27 -05:00
static ssize_t slabs_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_ALL ) ;
}
SLAB_ATTR_RO ( slabs ) ;
2008-04-14 19:11:40 +03:00
static ssize_t total_objects_show ( struct kmem_cache * s , char * buf )
{
return show_slab_objects ( s , buf , SO_ALL | SO_TOTAL ) ;
}
SLAB_ATTR_RO ( total_objects ) ;
2007-05-06 14:49:36 -07:00
static ssize_t sanity_checks_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_DEBUG_FREE ) ) ;
}
static ssize_t sanity_checks_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
s - > flags & = ~ SLAB_DEBUG_FREE ;
2011-06-01 12:25:49 -05:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-06 14:49:36 -07:00
s - > flags | = SLAB_DEBUG_FREE ;
2011-06-01 12:25:49 -05:00
}
2007-05-06 14:49:36 -07:00
return length ;
}
SLAB_ATTR ( sanity_checks ) ;
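The debug flag store handlers in this part of the file share one pattern: clear the flag unconditionally, then set it again only if the first byte written is '1', dropping __CMPXCHG_DOUBLE at the same time. A userspace sketch of that pattern follows; the DEMO_ bit values and the function name are hypothetical and are not the kernel's definitions.

```c
/* Hypothetical bit values, for illustration only. */
#define DEMO_DEBUG_FREE		0x1UL
#define DEMO_CMPXCHG_DOUBLE	0x2UL

/*
 * Mirror of the store pattern: only buf[0] is inspected, so any
 * input that does not start with '1' leaves the debug flag cleared.
 */
static unsigned long demo_flag_store(unsigned long flags, const char *buf)
{
	flags &= ~DEMO_DEBUG_FREE;
	if (buf[0] == '1') {
		/* Enabling a debug flag also gives up the cmpxchg fast path. */
		flags &= ~DEMO_CMPXCHG_DOUBLE;
		flags |= DEMO_DEBUG_FREE;
	}
	return flags;
}
```

Note that in this pattern writing '0' clears the debug flag but does not set the cmpxchg bit again.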
static ssize_t trace_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_TRACE ) ) ;
}
static ssize_t trace_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
s - > flags & = ~ SLAB_TRACE ;
2011-06-01 12:25:49 -05:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-06 14:49:36 -07:00
s - > flags | = SLAB_TRACE ;
2011-06-01 12:25:49 -05:00
}
2007-05-06 14:49:36 -07:00
return length ;
}
SLAB_ATTR ( trace ) ;
static ssize_t red_zone_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_RED_ZONE ) ) ;
}
static ssize_t red_zone_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
if ( any_slab_objects ( s ) )
return - EBUSY ;
s - > flags & = ~ SLAB_RED_ZONE ;
2011-06-01 12:25:49 -05:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-06 14:49:36 -07:00
s - > flags | = SLAB_RED_ZONE ;
2011-06-01 12:25:49 -05:00
}
2008-04-14 19:11:41 +03:00
calculate_sizes ( s , - 1 ) ;
2007-05-06 14:49:36 -07:00
return length ;
}
SLAB_ATTR ( red_zone ) ;
static ssize_t poison_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_POISON ) ) ;
}
static ssize_t poison_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
if ( any_slab_objects ( s ) )
return - EBUSY ;
s - > flags & = ~ SLAB_POISON ;
2011-06-01 12:25:49 -05:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-06 14:49:36 -07:00
s - > flags | = SLAB_POISON ;
2011-06-01 12:25:49 -05:00
}
2008-04-14 19:11:41 +03:00
calculate_sizes ( s , - 1 ) ;
2007-05-06 14:49:36 -07:00
return length ;
}
SLAB_ATTR ( poison ) ;
static ssize_t store_user_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_STORE_USER ) ) ;
}
static ssize_t store_user_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
if ( any_slab_objects ( s ) )
return - EBUSY ;
s - > flags & = ~ SLAB_STORE_USER ;
2011-06-01 12:25:49 -05:00
if ( buf [ 0 ] = = ' 1 ' ) {
s - > flags & = ~ __CMPXCHG_DOUBLE ;
2007-05-06 14:49:36 -07:00
s - > flags | = SLAB_STORE_USER ;
2011-06-01 12:25:49 -05:00
}
2008-04-14 19:11:41 +03:00
calculate_sizes ( s , - 1 ) ;
2007-05-06 14:49:36 -07:00
return length ;
}
SLAB_ATTR ( store_user ) ;
2007-05-06 14:49:43 -07:00
static ssize_t validate_show ( struct kmem_cache * s , char * buf )
{
return 0 ;
}
static ssize_t validate_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
2007-07-17 04:03:30 -07:00
int ret = - EINVAL ;
if ( buf [ 0 ] = = ' 1 ' ) {
ret = validate_slab_cache ( s ) ;
if ( ret > = 0 )
ret = length ;
}
return ret ;
2007-05-06 14:49:43 -07:00
}
SLAB_ATTR ( validate ) ;
2010-10-05 13:57:27 -05:00
static ssize_t alloc_calls_show ( struct kmem_cache * s , char * buf )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return - ENOSYS ;
return list_locations ( s , buf , TRACK_ALLOC ) ;
}
SLAB_ATTR_RO ( alloc_calls ) ;
static ssize_t free_calls_show ( struct kmem_cache * s , char * buf )
{
if ( ! ( s - > flags & SLAB_STORE_USER ) )
return - ENOSYS ;
return list_locations ( s , buf , TRACK_FREE ) ;
}
SLAB_ATTR_RO ( free_calls ) ;
# endif /* CONFIG_SLUB_DEBUG */
# ifdef CONFIG_FAILSLAB
static ssize_t failslab_show ( struct kmem_cache * s , char * buf )
{
return sprintf ( buf , " %d \n " , ! ! ( s - > flags & SLAB_FAILSLAB ) ) ;
}
static ssize_t failslab_store ( struct kmem_cache * s , const char * buf ,
size_t length )
{
s - > flags & = ~ SLAB_FAILSLAB ;
if ( buf [ 0 ] = = ' 1 ' )
s - > flags | = SLAB_FAILSLAB ;
return length ;
}
SLAB_ATTR ( failslab ) ;
2010-10-05 13:57:26 -05:00
# endif
2007-05-06 14:49:43 -07:00
2007-05-06 14:49:46 -07:00
static ssize_t shrink_show ( struct kmem_cache * s , char * buf )
{
return 0 ;
}
static ssize_t shrink_store ( struct kmem_cache * s ,
const char * buf , size_t length )
{
if ( buf [ 0 ] = = ' 1 ' ) {
int rc = kmem_cache_shrink ( s ) ;
if ( rc )
return rc ;
} else
return - EINVAL ;
return length ;
}
SLAB_ATTR ( shrink ) ;
2007-05-06 14:49:36 -07:00
# ifdef CONFIG_NUMA
2008-01-07 23:20:26 -08:00
static ssize_t remote_node_defrag_ratio_show ( struct kmem_cache * s , char * buf )
2007-05-06 14:49:36 -07:00
{
2008-01-07 23:20:26 -08:00
return sprintf ( buf , " %d \n " , s - > remote_node_defrag_ratio / 10 ) ;
2007-05-06 14:49:36 -07:00
}
2008-01-07 23:20:26 -08:00
static ssize_t remote_node_defrag_ratio_store ( struct kmem_cache * s ,
2007-05-06 14:49:36 -07:00
const char * buf , size_t length )
{
2008-04-29 16:11:12 -07:00
unsigned long ratio ;
int err ;
err = strict_strtoul ( buf , 10 , & ratio ) ;
if ( err )
return err ;
2008-08-19 08:51:22 -05:00
if ( ratio < = 100 )
2008-04-29 16:11:12 -07:00
s - > remote_node_defrag_ratio = ratio * 10 ;
2007-05-06 14:49:36 -07:00
return length ;
}
2008-01-07 23:20:26 -08:00
SLAB_ATTR ( remote_node_defrag_ratio ) ;
2007-05-06 14:49:36 -07:00
# endif
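The remote_node_defrag_ratio handlers above store the value scaled by 10 and divide by 10 when showing it, and a write above 100 is accepted but silently ignored. A minimal userspace sketch of that behaviour is given below. The struct and function names are hypothetical, and strtoul() is only an approximate stand-in for strict_strtoul() (it is more lenient about leading whitespace and signs).

```c
#include <errno.h>
#include <stdlib.h>

struct demo_cache {
	int remote_node_defrag_ratio;	/* kept internally as ratio * 10 */
};

static int demo_ratio_show(const struct demo_cache *s)
{
	return s->remote_node_defrag_ratio / 10;
}

/* Returns 0 on success, -EINVAL if buf is not a decimal number. */
static int demo_ratio_store(struct demo_cache *s, const char *buf)
{
	char *end;
	unsigned long ratio;

	ratio = strtoul(buf, &end, 10);
	if (end == buf)
		return -EINVAL;
	if (*end == '\n')	/* tolerate one trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;

	/* Out-of-range values are accepted but leave the field untouched. */
	if (ratio <= 100)
		s->remote_node_defrag_ratio = (int)(ratio * 10);
	return 0;
}
```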
2008-02-07 17:47:41 -08:00
# ifdef CONFIG_SLUB_STATS
static int show_stat ( struct kmem_cache * s , char * buf , enum stat_item si )
{
unsigned long sum = 0 ;
int cpu ;
int len ;
int * data = kmalloc ( nr_cpu_ids * sizeof ( int ) , GFP_KERNEL ) ;
if ( ! data )
return - ENOMEM ;
for_each_online_cpu ( cpu ) {
2009-12-18 16:26:20 -06:00
unsigned x = per_cpu_ptr ( s - > cpu_slab , cpu ) - > stat [ si ] ;
2008-02-07 17:47:41 -08:00
data [ cpu ] = x ;
sum + = x ;
}
len = sprintf ( buf , " %lu " , sum ) ;
2008-04-14 18:52:05 +03:00
# ifdef CONFIG_SMP
2008-02-07 17:47:41 -08:00
for_each_online_cpu ( cpu ) {
if ( data [ cpu ] & & len < PAGE_SIZE - 20 )
2008-04-14 18:52:05 +03:00
len + = sprintf ( buf + len , " C%d=%u " , cpu , data [ cpu ] ) ;
2008-02-07 17:47:41 -08:00
}
2008-04-14 18:52:05 +03:00
# endif
2008-02-07 17:47:41 -08:00
kfree ( data ) ;
return len + sprintf ( buf + len , " \n " ) ;
}
2009-10-15 02:20:22 -07:00
static void clear_stat ( struct kmem_cache * s , enum stat_item si )
{
int cpu ;
for_each_online_cpu ( cpu )
2009-12-18 16:26:20 -06:00
per_cpu_ptr ( s - > cpu_slab , cpu ) - > stat [ si ] = 0 ;
2009-10-15 02:20:22 -07:00
}
2008-02-07 17:47:41 -08:00
# define STAT_ATTR(si, text) \
static ssize_t text # # _show ( struct kmem_cache * s , char * buf ) \
{ \
return show_stat ( s , buf , si ) ; \
} \
2009-10-15 02:20:22 -07:00
static ssize_t text # # _store ( struct kmem_cache * s , \
const char * buf , size_t length ) \
{ \
if ( buf [ 0 ] ! = ' 0 ' ) \
return - EINVAL ; \
clear_stat ( s , si ) ; \
return length ; \
} \
SLAB_ATTR ( text ) ; \
2008-02-07 17:47:41 -08:00
STAT_ATTR ( ALLOC_FASTPATH , alloc_fastpath ) ;
STAT_ATTR ( ALLOC_SLOWPATH , alloc_slowpath ) ;
STAT_ATTR ( FREE_FASTPATH , free_fastpath ) ;
STAT_ATTR ( FREE_SLOWPATH , free_slowpath ) ;
STAT_ATTR ( FREE_FROZEN , free_frozen ) ;
STAT_ATTR ( FREE_ADD_PARTIAL , free_add_partial ) ;
STAT_ATTR ( FREE_REMOVE_PARTIAL , free_remove_partial ) ;
STAT_ATTR ( ALLOC_FROM_PARTIAL , alloc_from_partial ) ;
STAT_ATTR ( ALLOC_SLAB , alloc_slab ) ;
STAT_ATTR ( ALLOC_REFILL , alloc_refill ) ;
2011-06-01 12:25:57 -05:00
STAT_ATTR ( ALLOC_NODE_MISMATCH , alloc_node_mismatch ) ;
2008-02-07 17:47:41 -08:00
STAT_ATTR ( FREE_SLAB , free_slab ) ;
STAT_ATTR ( CPUSLAB_FLUSH , cpuslab_flush ) ;
STAT_ATTR ( DEACTIVATE_FULL , deactivate_full ) ;
STAT_ATTR ( DEACTIVATE_EMPTY , deactivate_empty ) ;
STAT_ATTR ( DEACTIVATE_TO_HEAD , deactivate_to_head ) ;
STAT_ATTR ( DEACTIVATE_TO_TAIL , deactivate_to_tail ) ;
STAT_ATTR ( DEACTIVATE_REMOTE_FREES , deactivate_remote_frees ) ;
2011-06-01 12:25:58 -05:00
STAT_ATTR ( DEACTIVATE_BYPASS , deactivate_bypass ) ;
2008-04-14 19:11:40 +03:00
STAT_ATTR ( ORDER_FALLBACK , order_fallback ) ;
2011-06-01 12:25:49 -05:00
STAT_ATTR ( CMPXCHG_DOUBLE_CPU_FAIL , cmpxchg_double_cpu_fail ) ;
STAT_ATTR ( CMPXCHG_DOUBLE_FAIL , cmpxchg_double_fail ) ;
2011-08-09 16:12:27 -05:00
STAT_ATTR ( CPU_PARTIAL_ALLOC , cpu_partial_alloc ) ;
STAT_ATTR ( CPU_PARTIAL_FREE , cpu_partial_free ) ;
2012-02-03 23:34:56 +08:00
STAT_ATTR ( CPU_PARTIAL_NODE , cpu_partial_node ) ;
STAT_ATTR ( CPU_PARTIAL_DRAIN , cpu_partial_drain ) ;
2008-02-07 17:47:41 -08:00
# endif
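The output format produced by show_stat() above is a total followed by one " C<cpu>=<count>" entry for each CPU with a non-zero counter. The sketch below reproduces that formatting in userspace, assuming the counters have already been gathered into a plain array; DEMO_BUF_SIZE stands in for PAGE_SIZE and demo_show_stat is a hypothetical name.

```c
#include <stdio.h>

#define DEMO_BUF_SIZE 4096	/* stands in for PAGE_SIZE */

/* buf must be at least DEMO_BUF_SIZE bytes; returns the length written. */
static int demo_show_stat(char *buf, const unsigned int *data, int nr_cpus)
{
	unsigned long sum = 0;
	int cpu;
	int len;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		sum += data[cpu];

	len = sprintf(buf, "%lu", sum);

	/* Skip idle CPUs and stop appending when the buffer is nearly full. */
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (data[cpu] && len < DEMO_BUF_SIZE - 20)
			len += sprintf(buf + len, " C%d=%u", cpu, data[cpu]);
	}

	return len + sprintf(buf + len, "\n");
}
```

The 20-byte margin mirrors the original check: it leaves room for one more entry plus the trailing newline, so the result stays within a single page.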
2008-01-07 23:20:27 -08:00
static struct attribute * slab_attrs [ ] = {
2007-05-06 14:49:36 -07:00
& slab_size_attr . attr ,
& object_size_attr . attr ,
& objs_per_slab_attr . attr ,
& order_attr . attr ,
2009-02-22 17:40:09 -08:00
& min_partial_attr . attr ,
2011-08-09 16:12:27 -05:00
& cpu_partial_attr . attr ,
2007-05-06 14:49:36 -07:00
& objects_attr . attr ,
2008-04-14 19:11:40 +03:00
& objects_partial_attr . attr ,
2007-05-06 14:49:36 -07:00
& partial_attr . attr ,
& cpu_slabs_attr . attr ,
& ctor_attr . attr ,
& aliases_attr . attr ,
& align_attr . attr ,
& hwcache_align_attr . attr ,
& reclaim_account_attr . attr ,
& destroy_by_rcu_attr . attr ,
2010-10-05 13:57:27 -05:00
& shrink_attr . attr ,
2011-03-10 15:21:48 +08:00
& reserved_attr . attr ,
2011-08-09 16:12:27 -05:00
& slabs_cpu_partial_attr . attr ,
2010-10-05 13:57:26 -05:00
# ifdef CONFIG_SLUB_DEBUG
2010-10-05 13:57:27 -05:00
& total_objects_attr . attr ,
& slabs_attr . attr ,
& sanity_checks_attr . attr ,
& trace_attr . attr ,
2007-05-06 14:49:36 -07:00
& red_zone_attr . attr ,
& poison_attr . attr ,
& store_user_attr . attr ,
2007-05-06 14:49:43 -07:00
& validate_attr . attr ,
2007-05-06 14:49:45 -07:00
& alloc_calls_attr . attr ,
& free_calls_attr . attr ,
2010-10-05 13:57:26 -05:00
# endif
2007-05-06 14:49:36 -07:00
# ifdef CONFIG_ZONE_DMA
& cache_dma_attr . attr ,
# endif
# ifdef CONFIG_NUMA
2008-01-07 23:20:26 -08:00
& remote_node_defrag_ratio_attr . attr ,
2008-02-07 17:47:41 -08:00
# endif
# ifdef CONFIG_SLUB_STATS
& alloc_fastpath_attr . attr ,
& alloc_slowpath_attr . attr ,
& free_fastpath_attr . attr ,
& free_slowpath_attr . attr ,
& free_frozen_attr . attr ,
& free_add_partial_attr . attr ,
& free_remove_partial_attr . attr ,
& alloc_from_partial_attr . attr ,
& alloc_slab_attr . attr ,
& alloc_refill_attr . attr ,
2011-06-01 12:25:57 -05:00
& alloc_node_mismatch_attr . attr ,
2008-02-07 17:47:41 -08:00
& free_slab_attr . attr ,
& cpuslab_flush_attr . attr ,
& deactivate_full_attr . attr ,
& deactivate_empty_attr . attr ,
& deactivate_to_head_attr . attr ,
& deactivate_to_tail_attr . attr ,
& deactivate_remote_frees_attr . attr ,
2011-06-01 12:25:58 -05:00
& deactivate_bypass_attr . attr ,
2008-04-14 19:11:40 +03:00
& order_fallback_attr . attr ,
2011-06-01 12:25:49 -05:00
& cmpxchg_double_fail_attr . attr ,
& cmpxchg_double_cpu_fail_attr . attr ,
2011-08-09 16:12:27 -05:00
& cpu_partial_alloc_attr . attr ,
& cpu_partial_free_attr . attr ,
2012-02-03 23:34:56 +08:00
& cpu_partial_node_attr . attr ,
& cpu_partial_drain_attr . attr ,
2007-05-06 14:49:36 -07:00
# endif
2010-02-26 09:36:12 +03:00
# ifdef CONFIG_FAILSLAB
& failslab_attr . attr ,
# endif
2007-05-06 14:49:36 -07:00
NULL
} ;
static struct attribute_group slab_attr_group = {
. attrs = slab_attrs ,
} ;
static ssize_t slab_attr_show ( struct kobject * kobj ,
struct attribute * attr ,
char * buf )
{
struct slab_attribute * attribute ;
struct kmem_cache * s ;
int err ;
attribute = to_slab_attr ( attr ) ;
s = to_slab ( kobj ) ;
if ( ! attribute - > show )
return - EIO ;
err = attribute - > show ( s , buf ) ;
return err ;
}
static ssize_t slab_attr_store ( struct kobject * kobj ,
struct attribute * attr ,
const char * buf , size_t len )
{
struct slab_attribute * attribute ;
struct kmem_cache * s ;
int err ;
attribute = to_slab_attr ( attr ) ;
s = to_slab ( kobj ) ;
if ( ! attribute - > store )
return - EIO ;
err = attribute - > store ( s , buf , len ) ;
return err ;
}
2008-01-07 22:29:05 -08:00
static void kmem_cache_release ( struct kobject * kobj )
{
struct kmem_cache * s = to_slab ( kobj ) ;
2010-09-14 23:21:12 +03:00
kfree ( s - > name ) ;
2008-01-07 22:29:05 -08:00
kfree ( s ) ;
}
2010-01-19 02:58:23 +01:00
static const struct sysfs_ops slab_sysfs_ops = {
2007-05-06 14:49:36 -07:00
. show = slab_attr_show ,
. store = slab_attr_store ,
} ;
static struct kobj_type slab_ktype = {
. sysfs_ops = & slab_sysfs_ops ,
2008-01-07 22:29:05 -08:00
. release = kmem_cache_release
2007-05-06 14:49:36 -07:00
} ;
static int uevent_filter ( struct kset * kset , struct kobject * kobj )
{
struct kobj_type * ktype = get_ktype ( kobj ) ;
if ( ktype = = & slab_ktype )
return 1 ;
return 0 ;
}
2009-12-31 14:52:51 +01:00
static const struct kset_uevent_ops slab_uevent_ops = {
2007-05-06 14:49:36 -07:00
. filter = uevent_filter ,
} ;
2007-11-01 09:29:06 -06:00
static struct kset * slab_kset ;
2007-05-06 14:49:36 -07:00
# define ID_STR_LENGTH 64
/* Create a unique string id for a slab cache:
2008-02-15 23:45:26 -08:00
*
* Format : [ flags - ] size
2007-05-06 14:49:36 -07:00
*/
static char * create_unique_id ( struct kmem_cache * s )
{
char * name = kmalloc ( ID_STR_LENGTH , GFP_KERNEL ) ;
char * p = name ;
BUG_ON ( ! name ) ;
* p + + = ' : ' ;
/*
* First flags affecting slabcache operations . We will only
* get here for aliasable slabs so we do not need to support
* too many flags . The flags here must cover all flags that
* are matched during merging to guarantee that the id is
* unique .
*/
if ( s - > flags & SLAB_CACHE_DMA )
* p + + = ' d ' ;
if ( s - > flags & SLAB_RECLAIM_ACCOUNT )
* p + + = ' a ' ;
if ( s - > flags & SLAB_DEBUG_FREE )
* p + + = ' F ' ;
2008-04-04 00:54:48 +02:00
if ( ! ( s - > flags & SLAB_NOTRACK ) )
* p + + = ' t ' ;
2007-05-06 14:49:36 -07:00
if ( p ! = name + 1 )
* p + + = ' - ' ;
p + = sprintf ( p , " %07d " , s - > size ) ;
BUG_ON ( p > name + ID_STR_LENGTH - 1 ) ;
return name ;
}
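The id built by create_unique_id() above has the form ":[flags-]size", with one letter per merge-relevant flag and the size zero-padded to seven digits. A userspace sketch of the same construction follows. The DEMO_ flag bits are hypothetical values, not the kernel's, and the caller supplies the buffer instead of the function allocating it.

```c
#include <stdio.h>

#define DEMO_ID_STR_LENGTH	64

/* Hypothetical flag bits, for illustration only. */
#define DEMO_CACHE_DMA		0x1UL
#define DEMO_RECLAIM_ACCOUNT	0x2UL
#define DEMO_DEBUG_FREE		0x4UL
#define DEMO_NOTRACK		0x8UL

/* name must point to at least DEMO_ID_STR_LENGTH bytes. */
static void demo_unique_id(char *name, unsigned long flags, int size)
{
	char *p = name;

	*p++ = ':';
	if (flags & DEMO_CACHE_DMA)
		*p++ = 'd';
	if (flags & DEMO_RECLAIM_ACCOUNT)
		*p++ = 'a';
	if (flags & DEMO_DEBUG_FREE)
		*p++ = 'F';
	if (!(flags & DEMO_NOTRACK))	/* 't' marks a tracked cache */
		*p++ = 't';
	if (p != name + 1)		/* separator only if a letter was emitted */
		*p++ = '-';
	sprintf(p, "%07d", size);
}
```

Because the letter for tracking is emitted when the no-track flag is absent, a cache with no flags set still gets a "t-" prefix in this scheme.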
static int sysfs_slab_add ( struct kmem_cache * s )
{
int err ;
const char * name ;
int unmergeable ;
if ( slab_state < SYSFS )
/* Defer until later */
return 0 ;
unmergeable = slab_unmergeable ( s ) ;
if ( unmergeable ) {
/*
* Slabcache can never be merged so we can use the name proper .
* This is typically the case for debug situations . In that
* case we can catch duplicate names easily .
*/
2007-11-01 09:29:06 -06:00
sysfs_remove_link ( & slab_kset - > kobj , s - > name ) ;
2007-05-06 14:49:36 -07:00
name = s - > name ;
} else {
/*
* Create a unique name for the slab as a target
* for the symlinks .
*/
name = create_unique_id ( s ) ;
}
2007-11-01 09:29:06 -06:00
s - > kobj . kset = slab_kset ;
2007-12-17 23:05:35 -07:00
err = kobject_init_and_add ( & s - > kobj , & slab_ktype , NULL , name ) ;
if ( err ) {
kobject_put ( & s - > kobj ) ;
2007-05-06 14:49:36 -07:00
return err ;
2007-12-17 23:05:35 -07:00
}
2007-05-06 14:49:36 -07:00
err = sysfs_create_group ( & s - > kobj , & slab_attr_group ) ;
2009-07-22 11:28:53 +08:00
if ( err ) {
kobject_del ( & s - > kobj ) ;
kobject_put ( & s - > kobj ) ;
2007-05-06 14:49:36 -07:00
return err ;
2009-07-22 11:28:53 +08:00
}
2007-05-06 14:49:36 -07:00
kobject_uevent ( & s - > kobj , KOBJ_ADD ) ;
if ( ! unmergeable ) {
/* Setup first alias */
sysfs_slab_alias ( s , s - > name ) ;
kfree ( name ) ;
}
return 0 ;
}
static void sysfs_slab_remove ( struct kmem_cache * s )
{
2010-07-19 11:39:11 -05:00
if ( slab_state < SYSFS )
/*
* Sysfs has not been setup yet so no need to remove the
* cache from sysfs .
*/
return ;
2007-05-06 14:49:36 -07:00
kobject_uevent ( & s - > kobj , KOBJ_REMOVE ) ;
kobject_del ( & s - > kobj ) ;
2008-01-07 22:29:05 -08:00
kobject_put ( & s - > kobj ) ;
2007-05-06 14:49:36 -07:00
}
/*
* Need to buffer aliases during bootup until sysfs becomes
2008-12-05 14:08:08 +11:00
* available lest we lose that information .
2007-05-06 14:49:36 -07:00
*/
struct saved_alias {
struct kmem_cache * s ;
const char * name ;
struct saved_alias * next ;
} ;
2007-07-17 04:03:27 -07:00
static struct saved_alias * alias_list ;
2007-05-06 14:49:36 -07:00
static int sysfs_slab_alias ( struct kmem_cache * s , const char * name )
{
struct saved_alias * al ;
if ( slab_state = = SYSFS ) {
/*
* If we have a leftover link then remove it .
*/
2007-11-01 09:29:06 -06:00
sysfs_remove_link ( & slab_kset - > kobj , name ) ;
return sysfs_create_link ( & slab_kset - > kobj , & s - > kobj , name ) ;
2007-05-06 14:49:36 -07:00
}
al = kmalloc ( sizeof ( struct saved_alias ) , GFP_KERNEL ) ;
if ( ! al )
return - ENOMEM ;
al - > s = s ;
al - > name = name ;
al - > next = alias_list ;
alias_list = al ;
return 0 ;
}
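The alias handling above defers work until sysfs is up: before that point each alias is pushed onto a singly linked list, and afterwards the link is created directly. The sketch below models that with a ready flag and a counter in place of real sysfs links; all demo_ names are hypothetical.

```c
#include <stdlib.h>

struct demo_alias {
	const char *name;
	struct demo_alias *next;
};

static struct demo_alias *demo_alias_list;
static int demo_ready;		/* stands in for slab_state == SYSFS */
static int demo_links_created;	/* counts simulated link creations */

static int demo_alias_add(const char *name)
{
	struct demo_alias *al;

	if (demo_ready) {
		demo_links_created++;
		return 0;
	}

	/* Not ready yet: remember the alias at the head of the list. */
	al = malloc(sizeof(*al));
	if (!al)
		return -1;
	al->name = name;
	al->next = demo_alias_list;
	demo_alias_list = al;
	return 0;
}

/* Mark the subsystem ready, then replay and free every saved alias. */
static void demo_alias_drain(void)
{
	demo_ready = 1;
	while (demo_alias_list) {
		struct demo_alias *al = demo_alias_list;

		demo_alias_list = al->next;
		demo_alias_add(al->name);
		free(al);
	}
}
```

Since entries are pushed at the head, they are replayed in reverse order of registration, which is harmless here because each alias is independent of the others.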
static int __init slab_sysfs_init ( void )
{
2007-07-17 04:03:19 -07:00
struct kmem_cache * s ;
2007-05-06 14:49:36 -07:00
int err ;
2010-07-19 11:39:11 -05:00
down_write ( & slub_lock ) ;
2007-11-06 10:36:58 -08:00
slab_kset = kset_create_and_add ( " slab " , & slab_uevent_ops , kernel_kobj ) ;
2007-11-01 09:29:06 -06:00
if ( ! slab_kset ) {
2010-07-19 11:39:11 -05:00
up_write ( & slub_lock ) ;
2007-05-06 14:49:36 -07:00
printk ( KERN_ERR " Cannot register slab subsystem. \n " ) ;
return - ENOSYS ;
}
2007-05-09 02:32:39 -07:00
slab_state = SYSFS ;
2007-07-17 04:03:19 -07:00
list_for_each_entry ( s , & slab_caches , list ) {
2007-05-09 02:32:39 -07:00
err = sysfs_slab_add ( s ) ;
2007-08-30 23:56:26 -07:00
if ( err )
printk ( KERN_ERR " SLUB: Unable to add boot slab %s "
" to sysfs \n " , s - > name ) ;
2007-05-09 02:32:39 -07:00
}
2007-05-06 14:49:36 -07:00
while ( alias_list ) {
struct saved_alias * al = alias_list ;
alias_list = alias_list - > next ;
err = sysfs_slab_alias ( al - > s , al - > name ) ;
2007-08-30 23:56:26 -07:00
if ( err )
printk ( KERN_ERR " SLUB: Unable to add boot slab alias "
" %s to sysfs \n " , al - > name ) ;
2007-05-06 14:49:36 -07:00
kfree ( al ) ;
}
2010-07-19 11:39:11 -05:00
up_write ( & slub_lock ) ;
2007-05-06 14:49:36 -07:00
resiliency_test ( ) ;
return 0 ;
}
__initcall ( slab_sysfs_init ) ;
2010-10-05 13:57:26 -05:00
# endif /* CONFIG_SYSFS */
2008-01-01 17:23:28 +01:00
/*
* The / proc / slabinfo ABI
*/
2008-01-02 13:04:48 -08:00
# ifdef CONFIG_SLABINFO
2008-01-01 17:23:28 +01:00
static void print_slabinfo_header ( struct seq_file * m )
{
seq_puts ( m , " slabinfo - version: 2.1 \n " ) ;
seq_puts ( m , " # name <active_objs> <num_objs> <objsize> "
" <objperslab> <pagesperslab> " ) ;
seq_puts ( m , " : tunables <limit> <batchcount> <sharedfactor> " ) ;
seq_puts ( m , " : slabdata <active_slabs> <num_slabs> <sharedavail> " ) ;
seq_putc ( m , ' \n ' ) ;
}
static void * s_start ( struct seq_file * m , loff_t * pos )
{
loff_t n = * pos ;
down_read ( & slub_lock ) ;
if ( ! n )
print_slabinfo_header ( m ) ;
return seq_list_start ( & slab_caches , * pos ) ;
}
static void * s_next ( struct seq_file * m , void * p , loff_t * pos )
{
return seq_list_next ( p , & slab_caches , pos ) ;
}
static void s_stop ( struct seq_file * m , void * p )
{
up_read ( & slub_lock ) ;
}
static int s_show ( struct seq_file * m , void * p )
{
unsigned long nr_partials = 0 ;
unsigned long nr_slabs = 0 ;
unsigned long nr_inuse = 0 ;
2008-04-14 19:11:40 +03:00
unsigned long nr_objs = 0 ;
unsigned long nr_free = 0 ;
2008-01-01 17:23:28 +01:00
struct kmem_cache * s ;
int node ;
s = list_entry ( p , struct kmem_cache , list ) ;
for_each_online_node ( node ) {
struct kmem_cache_node * n = get_node ( s , node ) ;
if ( ! n )
continue ;
nr_partials + = n - > nr_partial ;
nr_slabs + = atomic_long_read ( & n - > nr_slabs ) ;
2008-04-14 19:11:40 +03:00
nr_objs + = atomic_long_read ( & n - > total_objects ) ;
nr_free + = count_partial ( n , count_free ) ;
2008-01-01 17:23:28 +01:00
}
2008-04-14 19:11:40 +03:00
nr_inuse = nr_objs - nr_free ;
2008-01-01 17:23:28 +01:00
seq_printf ( m , " %-17s %6lu %6lu %6u %4u %4d " , s - > name , nr_inuse ,
2008-04-14 19:11:31 +03:00
nr_objs , s - > size , oo_objects ( s - > oo ) ,
( 1 < < oo_order ( s - > oo ) ) ) ;
2008-01-01 17:23:28 +01:00
seq_printf ( m , " : tunables %4u %4u %4u " , 0 , 0 , 0 ) ;
seq_printf ( m , " : slabdata %6lu %6lu %6lu " , nr_slabs , nr_slabs ,
0UL ) ;
seq_putc ( m , ' \n ' ) ;
return 0 ;
}
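Each /proc/slabinfo row emitted by s_show() above derives the in-use count as total objects minus free objects, reports pages per slab as 1 << order, and prints zeros for the tunables that SLUB does not implement. A userspace sketch of the row layout follows, assuming the per-node totals have already been summed; demo_slabinfo_line is a hypothetical helper that writes into a caller-supplied buffer rather than a seq_file.

```c
#include <stdio.h>

static int demo_slabinfo_line(char *buf, size_t size, const char *name,
			      unsigned long nr_objs, unsigned long nr_free,
			      unsigned int objsize, unsigned int objs_per_slab,
			      int order, unsigned long nr_slabs)
{
	unsigned long nr_inuse = nr_objs - nr_free;

	/* Tunables and sharedavail are always reported as zero. */
	return snprintf(buf, size,
			"%-17s %6lu %6lu %6u %4u %4d"
			" : tunables %4u %4u %4u"
			" : slabdata %6lu %6lu %6lu\n",
			name, nr_inuse, nr_objs, objsize, objs_per_slab,
			1 << order, 0u, 0u, 0u, nr_slabs, nr_slabs, 0UL);
}
```

Reporting the same slab count for both active and total slabs matches the original, which has no separate notion of inactive slabs to report.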
2008-10-06 02:42:17 +04:00
static const struct seq_operations slabinfo_op = {
2008-01-01 17:23:28 +01:00
. start = s_start ,
. next = s_next ,
. stop = s_stop ,
. show = s_show ,
} ;
2008-10-06 02:42:17 +04:00
static int slabinfo_open ( struct inode * inode , struct file * file )
{
return seq_open ( file , & slabinfo_op ) ;
}
static const struct file_operations proc_slabinfo_operations = {
. open = slabinfo_open ,
. read = seq_read ,
. llseek = seq_lseek ,
. release = seq_release ,
} ;
static int __init slab_proc_init ( void )
{
mm: restrict access to slab files under procfs and sysfs
Historically /proc/slabinfo and files under /sys/kernel/slab/* have
world read permissions and are accessible to the world. slabinfo
contains rather private information related both to the kernel and
userspace tasks. Depending on the situation, it might reveal either
private information per se or information useful to make another
targeted attack. Some examples of what can be learned by
reading/watching for /proc/slabinfo entries:
1) dentry (and different *inode*) number might reveal other processes fs
activity. The number of dentry "active objects" doesn't strictly show
file count opened/touched by a process, however, there is a good
correlation between them. The patch "proc: force dcache drop on
unauthorized access" relies on the privacy of dentry count.
2) different inode entries might reveal the same information as (1), but
these are more fine granted counters. If a filesystem is mounted in a
private mount point (or even a private namespace) and fs type differs from
other mounted fs types, fs activity in this mount point/namespace is
revealed. If there is a single ecryptfs mount point, the whole fs
activity of a single user is revealed. The number of files in an ecryptfs
mount point is private information per se.
3) fuse_* reveals number of files / fs activity of a user in a user
private mount point. It is approx. the same severity as ecryptfs
infoleak in (2).
4) sysfs_dir_cache, similarly to (2), reveals device addition/removal,
which can be otherwise hidden by "chmod 0700 /sys/". With 0444 slabinfo
the precise number of sysfs files is known to the world.
5) buffer_head might reveal some kernel activity. With other
information leaks an attacker might identify what specific kernel
routines generate buffer_head activity.
6) *kmalloc* infoleaks are very situational. An attacker has to watch for
the specific kmalloc size entry and filter out the noise from unrelated
kernel activity. If an attacker has a relatively quiet victim system, he
might get rather precise counters.
Additional information sources might significantly increase the slabinfo
infoleak benefits. E.g. if an attacker knows that process
activity on the system is very low (only core daemons like syslog and
cron), he may run setxid binaries / trigger local daemon activity /
trigger network services activity / await sporadic cron jobs activity
/ etc. and get rather precise counters for fs and network activity of
these privileged tasks, which is unknown otherwise.
Also, hiding slabinfo and /sys/kernel/slab/* is one step toward complicating
exploitation of kernel heap overflows (and possibly, other bugs). The
related discussion:
http://thread.gmane.org/gmane.linux.kernel/1108378
To keep compatibility with the old permission model, where a non-root
monitoring daemon could watch for kernel memleaks through slabinfo, one
should do:
groupadd slabinfo
usermod -a -G slabinfo $MONITOR_USER
And add the following commands to init scripts (to mountall.conf in
Ubuntu's upstart case):
chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Reviewed-by: Kees Cook <kees@ubuntu.com>
Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: David Rientjes <rientjes@google.com>
CC: Valdis.Kletnieks@vt.edu
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Alan Cox <alan@linux.intel.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-27 21:54:53 +04:00
proc_create ( " slabinfo " , S_IRUSR , NULL , & proc_slabinfo_operations ) ;
2008-10-06 02:42:17 +04:00
return 0 ;
}
module_init ( slab_proc_init ) ;
2008-01-02 13:04:48 -08:00
# endif /* CONFIG_SLABINFO */