/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		Definitions for the AF_INET socket handler.
 *
 * Version:	@(#)sock.h	1.0.4	05/13/93
 *
 * Authors:	Ross Biro
 *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *		Corey Minyard <wf-rch!minyard@relay.EU.net>
 *		Florian La Roche <flla@stud.uni-sb.de>
 *
 * Fixes:
 *		Alan Cox	:	Volatiles in skbuff pointers. See
 *					skbuff comments. May be overdone,
 *					better to prove they can be removed
 *					than the reverse.
 *		Alan Cox	:	Added a zapped field for tcp to note
 *					a socket is reset and must stay shut up
 *		Alan Cox	:	New fields for options
 *		Pauline Middelink	:	identd support
 *		Alan Cox	:	Eliminate low level recv/recvfrom
 *		David S. Miller	:	New socket lookup architecture.
 *		Steve Whitehouse:	Default routines for sock_ops
 *		Arnaldo C. Melo	:	removed net_pinfo, tp_pinfo and made
 *					protinfo be just a void pointer, as the
 *					protocol specific parts were moved to
 *					respective headers and ipv4/v6, etc now
 *					use private slabcaches for its socks
 *		Pedro Hortas	:	New flags field for socket options
 *
 *
 *		This program is free software; you can redistribute it and/or
 *		modify it under the terms of the GNU General Public License
 *		as published by the Free Software Foundation; either version
 *		2 of the License, or (at your option) any later version.
 */
#ifndef _SOCK_H
#define _SOCK_H

#include <linux/hardirq.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/list_nulls.h>
#include <linux/timer.h>
#include <linux/cache.h>
#include <linux/bitops.h>
#include <linux/lockdep.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>	/* struct sk_buff */
#include <linux/mm.h>
#include <linux/security.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/page_counter.h>
#include <linux/memcontrol.h>
#include <linux/static_key.h>
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/filter.h>
#include <linux/rculist_nulls.h>
#include <linux/poll.h>

#include <linux/atomic.h>
#include <net/dst.h>
#include <net/checksum.h>
#include <net/tcp_states.h>
#include <linux/net_tstamp.h>
struct cgroup;
struct cgroup_subsys;
#ifdef CONFIG_NET
int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss);
void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg);
#else
static inline
int mem_cgroup_sockets_init(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
	return 0;
}
static inline
void mem_cgroup_sockets_destroy(struct mem_cgroup *memcg)
{
}
#endif

/*
 * This structure really needs to be cleaned up.
 * Most of it is for TCP, and not used by any of
 * the other protocols.
 */

/* Define this to get the SOCK_DBG debugging facility. */
#define SOCK_DEBUGGING
#ifdef SOCK_DEBUGGING
#define SOCK_DEBUG(sk, msg...) do { if ((sk) && sock_flag((sk), SOCK_DBG)) \
					printk(KERN_DEBUG msg); } while (0)
#else
/* Validate arguments and do nothing */
static inline __printf(2, 3)
void SOCK_DEBUG(const struct sock *sk, const char *msg, ...)
{
}
#endif
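
/*
 * Illustrative usage (a sketch, not part of the original header): protocol
 * code that has a fully set up socket can emit debug output gated on the
 * per-socket SOCK_DBG flag, e.g.
 *
 *	SOCK_DEBUG(sk, "%s: queue length %d\n", __func__, qlen);
 *
 * where "qlen" stands in for any value of interest. When SOCK_DEBUGGING is
 * not defined, the empty static inline above still type-checks the
 * arguments but generates no code.
 */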

/* This is the per-socket lock.  The spinlock provides a synchronization
 * between user contexts and software interrupt processing, whereas the
 * mini-semaphore synchronizes multiple users amongst themselves.
 */
typedef struct {
	spinlock_t		slock;
	int			owned;
	wait_queue_head_t	wq;
	/*
	 * We express the mutex-alike socket_lock semantics
	 * to the lock validator by explicitly managing
	 * the slock as a lock variant (in addition to
	 * the slock itself):
	 */
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map dep_map;
#endif
} socket_lock_t;
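
/*
 * How the two levels are typically used (an illustrative sketch only; the
 * real lock_sock()/release_sock() and bh_lock_sock()/bh_unlock_sock()
 * helpers are declared later in this header):
 *
 *	lock_sock(sk);		process context: takes slock briefly, marks
 *				the socket "owned", then drops slock
 *	...modify socket state...
 *	release_sock(sk);	processes any backlog queued by softirqs,
 *				clears "owned" and wakes waiters on "wq"
 *
 *	bh_lock_sock(sk);	softirq context: spinlock only; if the socket
 *	...			is owned by a user context, incoming packets
 *	bh_unlock_sock(sk);	are queued to the socket backlog instead
 */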

struct sock;
struct proto;
struct net;

typedef __u32 __bitwise __portpair;
typedef __u64 __bitwise __addrpair;

/**
 *	struct sock_common - minimal network layer representation of sockets
 *	@skc_daddr: Foreign IPv4 addr
 *	@skc_rcv_saddr: Bound local IPv4 addr
 *	@skc_hash: hash value used with various protocol lookup tables
 *	@skc_u16hashes: two u16 hash values used by UDP lookup tables
 *	@skc_dport: placeholder for inet_dport/tw_dport
 *	@skc_num: placeholder for inet_num/tw_num
 *	@skc_family: network address family
 *	@skc_state: Connection state
 *	@skc_reuse: %SO_REUSEADDR setting
 *	@skc_reuseport: %SO_REUSEPORT setting
 *	@skc_bound_dev_if: bound device index if != 0
 *	@skc_bind_node: bind hash linkage for various protocol lookup tables
 *	@skc_portaddr_node: second hash linkage for UDP/UDP-Lite protocol
 *	@skc_prot: protocol handlers inside a network family
 *	@skc_net: reference to the network namespace of this socket
 *	@skc_node: main hash linkage for various protocol lookup tables
 *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
 *	@skc_tx_queue_mapping: tx queue number for this connection
 *	@skc_flags: place holder for sk_flags
 *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
 *		%SO_OOBINLINE settings, %SO_TIMESTAMPING settings
 *	@skc_incoming_cpu: record/match cpu processing incoming packets
 *	@skc_refcnt: reference count
 *
 *	This is the minimal network layer representation of sockets, the header
 *	for struct sock and struct inet_timewait_sock.
 */
struct sock_common {
	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
	 * address on 64bit arches : cf INET_MATCH()
	 * (an illustrative sketch follows this structure)
	 */
	union {
		__addrpair	skc_addrpair;
		struct {
			__be32	skc_daddr;
			__be32	skc_rcv_saddr;
		};
	};
	union  {
		unsigned int	skc_hash;
		__u16		skc_u16hashes[2];
	};
	/* skc_dport && skc_num must be grouped as well */
	union {
		__portpair	skc_portpair;
		struct {
			__be16	skc_dport;
			__u16	skc_num;
		};
	};

	unsigned short		skc_family;
	volatile unsigned char	skc_state;
	unsigned char		skc_reuse:4;
	unsigned char		skc_reuseport:1;
	unsigned char		skc_ipv6only:1;
	unsigned char		skc_net_refcnt:1;
	int			skc_bound_dev_if;
	union {
		struct hlist_node	skc_bind_node;
		struct hlist_nulls_node skc_portaddr_node;
	};
	struct proto		*skc_prot;
	possible_net_t		skc_net;

#if IS_ENABLED(CONFIG_IPV6)
	struct in6_addr		skc_v6_daddr;
	struct in6_addr		skc_v6_rcv_saddr;
#endif

	atomic64_t		skc_cookie;

	/* following fields are padding to force
	 * offset(struct sock, sk_refcnt) == 128 on 64bit arches
	 * assuming IPV6 is enabled. We use this padding differently
	 * for different kind of 'sockets'
	 */
	union {
		unsigned long	skc_flags;
		struct sock	*skc_listener; /* request_sock */
		struct inet_timewait_death_row *skc_tw_dr; /* inet_timewait_sock */
	};
	/*
	 * fields between dontcopy_begin/dontcopy_end
	 * are not copied in sock_copy()
	 */
	/* private: */
	int			skc_dontcopy_begin[0];
	/* public: */
	union {
		struct hlist_node	skc_node;
		struct hlist_nulls_node skc_nulls_node;
	};
	int			skc_tx_queue_mapping;
	union {
		int		skc_incoming_cpu;
		u32		skc_rcv_wnd;
		u32		skc_tw_rcv_nxt; /* struct tcp_timewait_sock  */
	};

	atomic_t		skc_refcnt;
	/* private: */
	int			skc_dontcopy_end[0];
	union {
		u32		skc_rxhash;
		u32		skc_window_clamp;
		u32		skc_tw_snd_nxt; /* struct tcp_timewait_sock */
	};
	/* public: */
};
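
/*
 * Why the 8 byte grouping of skc_daddr/skc_rcv_saddr matters (an
 * illustrative sketch, not the actual INET_MATCH() code, which lives with
 * the inet hash tables): on a 64bit little-endian arch skc_daddr occupies
 * the low half of skc_addrpair and skc_rcv_saddr the high half, so a
 * lookup can fold both addresses into one 64bit cookie and test them with
 * a single comparison instead of two:
 *
 *	__addrpair cookie = (__force __addrpair)
 *		(((__u64)(__force __u32)rcv_saddr << 32) |
 *		  (__u64)(__force __u32)daddr);
 *
 *	if (sk->__sk_common.skc_addrpair == cookie &&
 *	    sk->__sk_common.skc_portpair == ports)
 *		the socket is a candidate (bound_dev_if, netns etc.
 *		are still checked separately)
 *
 * skc_dport/skc_num are grouped into skc_portpair for the same reason.
 */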

struct cg_proto;

/**
 *	struct sock - network layer representation of sockets
 *	@__sk_common: shared layout with inet_timewait_sock
 *	@sk_shutdown: mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
 *	@sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
 *	@sk_lock: synchronizer
 *	@sk_rcvbuf: size of receive buffer in bytes
 *	@sk_wq: sock wait queue and async head
 *	@sk_rx_dst: receive input route used by early demux
 *	@sk_dst_cache: destination cache
 *	@sk_policy: flow policy
 *	@sk_receive_queue: incoming packets
 *	@sk_wmem_alloc: transmit queue bytes committed
 *	@sk_write_queue: Packet sending queue
 *	@sk_omem_alloc: "o" is "option" or "other"
 *	@sk_wmem_queued: persistent queue size
 *	@sk_forward_alloc: space allocated forward
 *	@sk_napi_id: id of the last napi context to receive data for sk
 *	@sk_ll_usec: usecs to busypoll when there is no data
 *	@sk_allocation: allocation mode
 *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
 *	@sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
 *	@sk_sndbuf: size of send buffer in bytes
 *	@sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
 *	@sk_no_check_rx: allow zero checksum in RX packets
 *	@sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
 *	@sk_route_nocaps: forbidden route capabilities (e.g NETIF_F_GSO_MASK)
 *	@sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
 *	@sk_gso_max_size: Maximum GSO segment size to build
 *	@sk_gso_max_segs: Maximum number of GSO segments
 *	@sk_lingertime: %SO_LINGER l_linger setting
 *	@sk_backlog: always used with the per-socket spinlock held
 *	@sk_callback_lock: used with the callbacks in the end of this struct
 *	@sk_error_queue: rarely used
 *	@sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
 *			  IPV6_ADDRFORM for instance)
 *	@sk_err: last error
 *	@sk_err_soft: errors that don't cause failure but are the cause of a
 *		      persistent failure not just 'timed out'
 *	@sk_drops: raw/udp drops counter
 *	@sk_ack_backlog: current listen backlog
 *	@sk_max_ack_backlog: listen backlog set in listen()
 *	@sk_priority: %SO_PRIORITY setting
 *	@sk_cgrp_prioidx: socket group's priority map index
 *	@sk_type: socket type (%SOCK_STREAM, etc)
 *	@sk_protocol: which protocol this socket belongs in this network family
 *	@sk_peer_pid: &struct pid for this socket's peer
 *	@sk_peer_cred: %SO_PEERCRED setting
 *	@sk_rcvlowat: %SO_RCVLOWAT setting
 *	@sk_rcvtimeo: %SO_RCVTIMEO setting
 *	@sk_sndtimeo: %SO_SNDTIMEO setting
 *	@sk_txhash: computed flow hash for use on transmit
 *	@sk_filter: socket filtering instructions
 *	@sk_timer: sock cleanup timer
 *	@sk_stamp: time stamp of last packet received
 *	@sk_tsflags: SO_TIMESTAMPING socket options
 *	@sk_tskey: counter to disambiguate concurrent tstamp requests
 *	@sk_socket: Identd and reporting IO signals
 *	@sk_user_data: RPC layer private data
 *	@sk_frag: cached page frag
 *	@sk_peek_off: current peek_offset value
 *	@sk_send_head: front of stuff to transmit
 *	@sk_security: used by security modules
 *	@sk_mark: generic packet mark
 *	@sk_classid: this socket's cgroup classid
 *	@sk_cgrp: this socket's cgroup-specific proto data
 *	@sk_write_pending: a write to stream socket waits to start
 *	@sk_state_change: callback to indicate change in the state of the sock
 *	@sk_data_ready: callback to indicate there is data to be processed
 *	@sk_write_space: callback to indicate there is buffer sending space available
 *	@sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
 *	@sk_backlog_rcv: callback to process the backlog
 *	@sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
 */
struct sock {
	/*
	 * Now struct inet_timewait_sock also uses sock_common, so please
	 * don't add anything before this first member (__sk_common) --acme
	 */
	struct sock_common	__sk_common;
#define sk_node			__sk_common.skc_node
#define sk_nulls_node		__sk_common.skc_nulls_node
#define sk_refcnt		__sk_common.skc_refcnt
#define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping

#define sk_dontcopy_begin	__sk_common.skc_dontcopy_begin
#define sk_dontcopy_end		__sk_common.skc_dontcopy_end
#define sk_hash			__sk_common.skc_hash
#define sk_portpair		__sk_common.skc_portpair
#define sk_num			__sk_common.skc_num
#define sk_dport		__sk_common.skc_dport
#define sk_addrpair		__sk_common.skc_addrpair
#define sk_daddr		__sk_common.skc_daddr
#define sk_rcv_saddr		__sk_common.skc_rcv_saddr
#define sk_family		__sk_common.skc_family
#define sk_state		__sk_common.skc_state
#define sk_reuse		__sk_common.skc_reuse
#define sk_reuseport		__sk_common.skc_reuseport
#define sk_ipv6only		__sk_common.skc_ipv6only
#define sk_net_refcnt		__sk_common.skc_net_refcnt
#define sk_bound_dev_if		__sk_common.skc_bound_dev_if
#define sk_bind_node		__sk_common.skc_bind_node
#define sk_prot			__sk_common.skc_prot
#define sk_net			__sk_common.skc_net
#define sk_v6_daddr		__sk_common.skc_v6_daddr
#define sk_v6_rcv_saddr		__sk_common.skc_v6_rcv_saddr
#define sk_cookie		__sk_common.skc_cookie
#define sk_incoming_cpu		__sk_common.skc_incoming_cpu
#define sk_flags		__sk_common.skc_flags
#define sk_rxhash		__sk_common.skc_rxhash

	socket_lock_t		sk_lock;
struct sk_buff_head sk_receive_queue ;
2007-03-05 03:05:44 +03:00
/*
* The backlog queue is special , it is always used with
* the per - socket spinlock held and requires low latency
* access . Therefore we special case it ' s implementation .
net: reorder struct sock fields
Right now, fields in struct sock are not optimally ordered, because each
path (RX softirq, TX completion, RX user, TX user) has to touch fields
that are contained in many different cache lines.
The really critical thing is to shrink number of cache lines that are
used at RX softirq time : CPU handling softirqs for a device can receive
many frames per second for many sockets. If load is too big, we can drop
frames at NIC level. RPS or multiqueue cards can help, but better reduce
latency if possible.
This patch starts with UDP protocol, then additional patches will try to
reduce latencies of other ones as well.
At RX softirq time, fields of interest for UDP protocol are :
(not counting ones in inet struct for the lookup)
Read/Written:
sk_refcnt (atomic increment/decrement)
sk_rmem_alloc & sk_backlog.len (to check if there is room in queues)
sk_receive_queue
sk_backlog (if socket locked by user program)
sk_rxhash
sk_forward_alloc
sk_drops
Read only:
sk_rcvbuf (sk_rcvqueues_full())
sk_filter
sk_wq
sk_policy[0]
sk_flags
Additional notes :
- sk_backlog has one hole on 64bit arches. We can fill it to save 8
bytes.
- sk_backlog is used only if RX sofirq handler finds the socket while
locked by user.
- sk_rxhash is written only once per flow.
- sk_drops is written only if queues are full
Final layout :
[1] One section grouping all read/write fields, but placing rxhash and
sk_backlog at the end of this section.
[2] One section grouping all read fields in RX handler
(sk_filter, sk_rcv_buf, sk_wq)
[3] Section used by other paths
I'll post a patch on its own to put sk_refcnt at the end of struct
sock_common so that it shares same cache line than section [1]
New offsets on 64bit arch :
sizeof(struct sock)=0x268
offsetof(struct sock, sk_refcnt) =0x10
offsetof(struct sock, sk_lock) =0x48
offsetof(struct sock, sk_receive_queue)=0x68
offsetof(struct sock, sk_backlog)=0x80
offsetof(struct sock, sk_rmem_alloc)=0x80
offsetof(struct sock, sk_forward_alloc)=0x98
offsetof(struct sock, sk_rxhash)=0x9c
offsetof(struct sock, sk_rcvbuf)=0xa4
offsetof(struct sock, sk_drops) =0xa0
offsetof(struct sock, sk_filter)=0xa8
offsetof(struct sock, sk_wq)=0xb0
offsetof(struct sock, sk_policy)=0xd0
offsetof(struct sock, sk_flags) =0xe0
Instead of :
sizeof(struct sock)=0x270
offsetof(struct sock, sk_refcnt) =0x10
offsetof(struct sock, sk_lock) =0x50
offsetof(struct sock, sk_receive_queue)=0xc0
offsetof(struct sock, sk_backlog)=0x70
offsetof(struct sock, sk_rmem_alloc)=0xac
offsetof(struct sock, sk_forward_alloc)=0x10c
offsetof(struct sock, sk_rxhash)=0x128
offsetof(struct sock, sk_rcvbuf)=0x4c
offsetof(struct sock, sk_drops) =0x16c
offsetof(struct sock, sk_filter)=0x198
offsetof(struct sock, sk_wq)=0x88
offsetof(struct sock, sk_policy)=0x98
offsetof(struct sock, sk_flags) =0x130
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-16 08:56:04 +03:00
* Note : rmem_alloc is in this structure to fill a hole
* on 64 bit arches , not because its logically part of
* backlog .
2007-03-05 03:05:44 +03:00
*/
struct {
    atomic_t rmem_alloc;
    int len;
    struct sk_buff *head;
    struct sk_buff *tail;
2007-03-05 03:05:44 +03:00
} sk_backlog;
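The comment above can be made concrete with a minimal sketch of the canonical receive pattern, assuming a hypothetical protocol handler (my_proto_rcv()/my_proto_do_rcv() are illustrative names; bh_lock_sock(), sock_owned_by_user() and sk_add_backlog() are declared elsewhere in this header, and the sk_rcvbuf limit mirrors what UDP passes):

/* Sketch: deliver an skb directly when no user context owns the socket,
 * otherwise queue it on the backlog for release_sock() to replay later.
 */
static int my_proto_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    /* process skb with the socket lock held */
    return 0;
}

static int my_proto_rcv(struct sock *sk, struct sk_buff *skb)
{
    int rc = 0;

    bh_lock_sock(sk);                       /* the per-socket spinlock */
    if (!sock_owned_by_user(sk)) {
        rc = my_proto_do_rcv(sk, skb);      /* fast path */
    } else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
        kfree_skb(skb);                     /* backlog full: drop */
        rc = -ENOBUFS;
    }
    bh_unlock_sock(sk);
    return rc;
}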
#define sk_rmem_alloc sk_backlog.rmem_alloc
int sk_forward_alloc;
net: introduce SO_INCOMING_CPU
Alternative to RPS/RFS is to use hardware support for multiple
queues.
Then split a set of millions of sockets into worker threads, each
one using epoll() to manage events on its own socket pool.
Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
know after accept() or connect() on which queue/cpu a socket is managed.
We normally use one cpu per RX queue (IRQ smp_affinity being properly
set), so remembering on socket structure which cpu delivered last packet
is enough to solve the problem.
After accept(), connect(), or even file descriptor passing around
processes, applications can use :
int cpu;
socklen_t len = sizeof(cpu);
getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
And use this information to put the socket into the right silo
for optimal performance, as the whole networking stack should then run
on the appropriate cpu, without any need to send IPIs (RPS/RFS).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 16:54:28 +03:00
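For reference, a small user-space sketch of the lookup the changelog describes (the fallback #define is an assumption for older libc headers; the worker-placement step is left to the application):

#include <sys/socket.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49    /* asm-generic value; assumption for old headers */
#endif

/* Return the CPU that handled the last packet for this socket, or -1. */
static int incoming_cpu(int fd)
{
    int cpu = -1;
    socklen_t len = sizeof(cpu);

    if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0)
        return -1;
    return cpu;
}

The application can then add the fd to the epoll set of the worker thread pinned to that cpu.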
2014-07-02 08:32:17 +04:00
__u32 sk_txhash;
2013-08-01 07:10:25 +04:00
#ifdef CONFIG_NET_RX_BUSY_POLL
2013-06-10 12:39:50 +04:00
unsigned int sk_napi_id;
2013-06-14 17:33:57 +04:00
unsigned int sk_ll_usec;
#endif
atomic_t sk_drops;
int sk_rcvbuf;
struct sk_filter __rcu *sk_filter;
2015-11-30 07:03:11 +03:00
union {
    struct socket_wq __rcu *sk_wq;
    struct socket_wq *sk_wq_raw;
};
2008-10-28 23:24:06 +03:00
#ifdef CONFIG_XFRM
2005-04-17 02:20:36 +04:00
struct xfrm_policy *sk_policy[2];
2008-10-28 23:24:06 +03:00
#endif
2012-06-25 00:22:49 +04:00
struct dst_entry *sk_rx_dst;
2013-01-23 01:09:51 +04:00
struct dst_entry __rcu *sk_dst_cache;
2015-12-03 08:53:57 +03:00
/* Note: 32bit hole on 64bit arches */
2005-04-17 02:20:36 +04:00
atomic_t sk_wmem_alloc;
atomic_t sk_omem_alloc;
2007-05-30 00:17:47 +04:00
int sk_sndbuf;
2005-04-17 02:20:36 +04:00
struct sk_buff_head sk_write_queue;
kmemcheck_bitfield_begin(flags);
unsigned int sk_shutdown : 2,
2014-05-23 19:47:19 +04:00
    sk_no_check_tx : 1,
    sk_no_check_rx : 1,
    sk_userlocks : 4,
    sk_protocol : 8,
    sk_type : 16;
kmemcheck_bitfield_end(flags);
2005-04-17 02:20:36 +04:00
int sk_wmem_queued;
2005-10-21 11:20:43 +04:00
gfp_t sk_allocation;
tcp: TSO packets automatic sizing
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.
One part of the problem is that tcp_tso_should_defer() uses a heuristic
relying on upcoming ACKs instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro-bursts on the network, and the general consensus is to reduce the
amount of buffering.
This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.
This field could be set by other transports.
Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.
For other flows, this helps better packet scheduling and ACK clocking.
This patch increases performance of TCP flows in lossy environments.
A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).
A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.
This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.
sk_pacing_rate = 2 * cwnd * mss / srtt
v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-27 16:46:32 +04:00
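To make the formula concrete with purely illustrative numbers: with cwnd = 100 segments, mss = 1448 bytes and srtt = 20 ms, sk_pacing_rate = 2 * 100 * 1448 / 0.02 s = 14,480,000 bytes per second (about 14.5 MB/s). Sizing TSO packets to roughly one packet per ms then gives about 14.5 KB, i.e. around 10 segments per TSO packet instead of full ~45-segment (64 KB) bursts.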
u32 sk_pacing_rate; /* bytes per second */
2013-09-24 19:20:52 +04:00
u32 sk_max_pacing_rate;
2011-11-15 19:29:55 +04:00
netdev_features_t sk_route_caps;
netdev_features_t sk_route_nocaps;
2006-07-01 00:36:35 +04:00
int sk_gso_type;
[NET]: Add per-connection option to set max TSO frame size
Update: My mailer ate one of Jarek's feedback mails... Fixed the
parameter in netif_set_gso_max_size() to be u32, not u16. Fixed the
whitespace issue due to a patch import botch. Changed the types from
u32 to unsigned int to be more consistent with other variables in the
area. Also brought the patch up to the latest net-2.6.26 tree.
Update: Made gso_max_size container 32 bits, not 16. Moved the
location of gso_max_size within netdev to be less hotpath. Made more
consistent names between the sock and netdev layers, and added a
define for the max GSO size.
Update: Respun for net-2.6.26 tree.
Update: changed max_gso_frame_size and sk_gso_max_size from signed to
unsigned - thanks Stephen!
This patch adds the ability for device drivers to control the size of
the TSO frames being sent to them, per TCP connection. By setting the
netdevice's gso_max_size value, the socket layer will set the GSO
frame size based on that value. This will propagate into the TCP
layer, and send TSOs of that size to the hardware.
This can be desirable to help tune the bursty nature of TSO on a
per-adapter basis, where one may have 1 GbE and 10 GbE devices
coexisting in a system, one running multiqueue and the other not, etc.
This can also be desirable for devices that cannot support full 64 KB
TSO's, but still want to benefit from some level of segmentation
offloading.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-21 13:43:19 +03:00
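A minimal sketch of the driver-side knob described above (my_limit_tso() and the 32 KB figure are made-up examples; netif_set_gso_max_size() is the helper this changelog adds to netdevice.h):

/* Called from a driver's probe/setup path to cap TSO frames for this NIC. */
static void my_limit_tso(struct net_device *netdev)
{
    netif_set_gso_max_size(netdev, 32 * 1024);    /* 32 KB instead of 64 KB */
}

sk_setup_caps() then copies the device's gso_max_size into the socket's sk_gso_max_size below when the route is attached.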
unsigned int sk_gso_max_size;
2012-07-30 20:11:42 +04:00
u16 sk_gso_max_segs;
2006-03-25 02:12:37 +03:00
int sk_rcvlowat;
2005-04-17 02:20:36 +04:00
unsigned long sk_lingertime;
struct sk_buff_head sk_error_queue;
2005-05-06 00:35:15 +04:00
struct proto *sk_prot_creator;
2005-04-17 02:20:36 +04:00
rwlock_t sk_callback_lock;
int sk_err,
    sk_err_soft;
2015-03-20 05:04:21 +03:00
u32 sk_ack_backlog;
u32 sk_max_ack_backlog;
2005-04-17 02:20:36 +04:00
__u32 sk_priority;
2013-12-29 20:27:11 +04:00
#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
2011-11-22 09:10:51 +04:00
__u32 sk_cgrp_prioidx;
#endif
2010-06-13 07:30:14 +04:00
struct pid *sk_peer_pid;
const struct cred *sk_peer_cred;
2005-04-17 02:20:36 +04:00
long sk_rcvtimeo;
long sk_sndtimeo;
struct timer_list sk_timer;
2007-04-20 03:16:32 +04:00
ktime_t sk_stamp;
2014-08-05 06:11:46 +04:00
u16 sk_tsflags;
2014-08-05 06:11:47 +04:00
u32 sk_tskey;
2005-04-17 02:20:36 +04:00
struct socket *sk_socket;
void *sk_user_data;
net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
It's done to increase the probability of coalescing small write()s into
single segments in skbs still in the write queue (not yet sent).
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page.
It's also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
page allocator more than wanted.
This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, that's order-3 pages on x86)
This increases TCP stream performance by 20% on the loopback device,
but also benefits other network devices, since 8x fewer frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
It's possible some SG-enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment into sub-fragments, since some arches have PAGE_SIZE=65536.
Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 03:04:42 +04:00
struct page_frag sk_frag;
2005-04-17 02:20:36 +04:00
struct sk_buff *sk_send_head;
2012-02-21 11:31:34 +04:00
__s32 sk_peek_off;
2005-04-17 02:20:36 +04:00
int sk_write_pending;
2008-11-05 01:45:58 +03:00
#ifdef CONFIG_SECURITY
2005-04-17 02:20:36 +04:00
void *sk_security;
2008-11-05 01:45:58 +03:00
#endif
2008-01-31 06:08:16 +03:00
__u32 sk_mark;
2015-07-19 23:21:13 +03:00
#ifdef CONFIG_CGROUP_NET_CLASSID
cls_cgroup: Store classid in struct sock
Up until now cls_cgroup has relied on fetching the classid out of
the current executing thread. This runs into trouble when a packet
processing is delayed in which case it may execute out of another
thread's context.
Furthermore, even when a packet is not delayed we may fail to
classify it if soft IRQs have been disabled, because this scenario
is indistinguishable from one where a packet unrelated to the
current thread is processed by a real soft IRQ.
In fact, the current semantics is inherently broken, as a single
skb may be constructed out of the writes of two different tasks.
A different manifestation of this problem is when the TCP stack
transmits in response of an incoming ACK. This is currently
unclassified.
As we already have a concept of packet ownership for accounting
purposes in the skb->sk pointer, this is a natural place to store
the classid in a persistent manner.
This patch adds the cls_cgroup classid in struct sock, filling up
an existing hole on 64-bit :)
The value is set at socket creation time. So all sockets created
via socket(2) automatically gain the ID of the thread creating them.
Whenever another process touches the socket by either reading or
writing to it, we will change the socket classid to that of the
process if it has a valid (non-zero) classid.
For sockets created on inbound connections through accept(2), we
inherit the classid of the original listening socket through
sk_clone, possibly preceding the actual accept(2) call.
In order to minimise risks, I have not made this the authoritative
classid. For now it is only used as a backup when we execute
with soft IRQs disabled. Once we're completely happy with its
semantics we can use it as the sole classid.
Footnote: I have rearranged the error path on cls_cgroup module
creation. If we didn't do this, then there would be a window where
someone could create a tc rule using cls_cgroup before the cgroup
subsystem has been registered.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-24 11:12:34 +04:00
u32 sk_classid;
2015-07-19 23:21:13 +03:00
#endif
2011-12-12 01:47:03 +04:00
struct cg_proto *sk_cgrp;
2005-04-17 02:20:36 +04:00
void (*sk_state_change)(struct sock *sk);
2014-04-12 00:15:36 +04:00
void (*sk_data_ready)(struct sock *sk);
2005-04-17 02:20:36 +04:00
void (*sk_write_space)(struct sock *sk);
void (*sk_error_report)(struct sock *sk);
2012-05-17 02:48:15 +04:00
int (*sk_backlog_rcv)(struct sock *sk,
                      struct sk_buff *skb);
2005-04-17 02:20:36 +04:00
void (*sk_destruct)(struct sock *sk);
};
2013-09-24 21:25:40 +04:00
#define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
#define rcu_dereference_sk_user_data(sk) rcu_dereference(__sk_user_data((sk)))
#define rcu_assign_sk_user_data(sk, ptr) rcu_assign_pointer(__sk_user_data((sk)), ptr)
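A minimal sketch of how these accessors are meant to be paired, assuming a hypothetical per-socket context (struct my_ctx and the two helpers are illustrative, not an existing API):

struct my_ctx {
    int counter;
};

static void my_attach_ctx(struct sock *sk, struct my_ctx *ctx)
{
    rcu_assign_sk_user_data(sk, ctx);    /* publish for RCU readers */
}

static int my_read_counter(struct sock *sk)
{
    struct my_ctx *ctx;
    int val = 0;

    rcu_read_lock();
    ctx = rcu_dereference_sk_user_data(sk);
    if (ctx)
        val = ctx->counter;
    rcu_read_unlock();
    return val;
}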
2012-04-19 07:39:36 +04:00
/*
 * SK_CAN_REUSE and SK_NO_REUSE on a socket indicate whether or not its
 * port may be reused by someone else. SK_FORCE_REUSE on a socket means
 * that the socket will reuse everybody else's port without looking at
 * the other socket's sk_reuse value.
*/
#define SK_NO_REUSE	0
#define SK_CAN_REUSE	1
#define SK_FORCE_REUSE	2
2012-02-21 11:31:34 +04:00
static inline int sk_peek_offset(struct sock *sk, int flags)
{
    if ((flags & MSG_PEEK) && (sk->sk_peek_off >= 0))
        return sk->sk_peek_off;
    else
        return 0;
}

static inline void sk_peek_offset_bwd(struct sock *sk, int val)
{
    if (sk->sk_peek_off >= 0) {
        if (sk->sk_peek_off >= val)
            sk->sk_peek_off -= val;
        else
            sk->sk_peek_off = 0;
    }
}

static inline void sk_peek_offset_fwd(struct sock *sk, int val)
{
    if (sk->sk_peek_off >= 0)
        sk->sk_peek_off += val;
}
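As a rough illustration of the intended use, here is a simplified sketch modelled on the datagram/af_unix receive paths (my_recv_one() and the copy step are illustrative; only the helper calls above are real):

/* MSG_PEEK with SO_PEEK_OFF: each peek resumes where the previous one
 * stopped, while a normal read pulls the remembered offset back.
 */
static void my_recv_one(struct sock *sk, struct sk_buff *skb, int flags)
{
    int off = sk_peek_offset(sk, flags);    /* 0 unless peeking with sk_peek_off >= 0 */

    /* ... copy skb data starting at offset 'off' to user space ... */

    if (flags & MSG_PEEK)
        sk_peek_offset_fwd(sk, skb->len);   /* next peek starts after this data */
    else
        sk_peek_offset_bwd(sk, skb->len);   /* data consumed: rewind the offset */
}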
2005-04-17 02:20:36 +04:00
/*
* Hashed lists helper routines
*/
2010-02-09 02:18:45 +03:00
static inline struct sock *sk_entry(const struct hlist_node *node)
{
    return hlist_entry(node, struct sock, sk_node);
}
2005-08-10 07:09:46 +04:00
static inline struct sock *__sk_head(const struct hlist_head *head)
2005-04-17 02:20:36 +04:00
{
    return hlist_entry(head->first, struct sock, sk_node);
}
2005-08-10 07:09:46 +04:00
static inline struct sock *sk_head(const struct hlist_head *head)
2005-04-17 02:20:36 +04:00
{
    return hlist_empty(head) ? NULL : __sk_head(head);
}
2008-11-17 06:39:21 +03:00
static inline struct sock *__sk_nulls_head(const struct hlist_nulls_head *head)
{
    return hlist_nulls_entry(head->first, struct sock, sk_nulls_node);
}

static inline struct sock *sk_nulls_head(const struct hlist_nulls_head *head)
{
    return hlist_nulls_empty(head) ? NULL : __sk_nulls_head(head);
}
2005-08-10 07:09:46 +04:00
static inline struct sock *sk_next(const struct sock *sk)
2005-04-17 02:20:36 +04:00
{
    return sk->sk_node.next ?
        hlist_entry(sk->sk_node.next, struct sock, sk_node) : NULL;
}
2008-11-17 06:39:21 +03:00
static inline struct sock *sk_nulls_next(const struct sock *sk)
{
    return (!is_a_nulls(sk->sk_nulls_node.next)) ?
        hlist_nulls_entry(sk->sk_nulls_node.next,
                          struct sock, sk_nulls_node) :
        NULL;
}
2012-05-17 02:48:15 +04:00
static inline bool sk_unhashed(const struct sock *sk)
2005-04-17 02:20:36 +04:00
{
    return hlist_unhashed(&sk->sk_node);
}
2012-05-17 02:48:15 +04:00
static inline bool sk_hashed(const struct sock *sk)
2005-04-17 02:20:36 +04:00
{
2006-04-29 02:21:23 +04:00
    return !sk_unhashed(sk);
2005-04-17 02:20:36 +04:00
}
2012-05-17 02:48:15 +04:00
static inline void sk_node_init(struct hlist_node *node)
2005-04-17 02:20:36 +04:00
{
    node->pprev = NULL;
}
2012-05-17 02:48:15 +04:00
static inline void sk_nulls_node_init(struct hlist_nulls_node *node)
2008-11-17 06:39:21 +03:00
{
    node->pprev = NULL;
}
2012-05-17 02:48:15 +04:00
static inline void __sk_del_node(struct sock *sk)
2005-04-17 02:20:36 +04:00
{
    __hlist_del(&sk->sk_node);
}
2010-02-22 10:57:18 +03:00
/* NB: equivalent to hlist_del_init_rcu */
2012-05-17 02:48:15 +04:00
static inline bool __sk_del_node_init(struct sock *sk)
2005-04-17 02:20:36 +04:00
{
    if (sk_hashed(sk)) {
        __sk_del_node(sk);
        sk_node_init(&sk->sk_node);
2012-05-17 02:48:15 +04:00
        return true;
2005-04-17 02:20:36 +04:00
    }
2012-05-17 02:48:15 +04:00
    return false;
2005-04-17 02:20:36 +04:00
}
/* Grab socket reference count. This operation is valid only
 * when sk is ALREADY grabbed, e.g. it is found in a hash table
 * or a list and the lookup is made under a lock preventing hash table
 * modifications.
 */
static inline void sock_hold(struct sock *sk)
{
    atomic_inc(&sk->sk_refcnt);
}

/* Ungrab socket in the context which assumes that the socket refcnt
 * cannot hit zero, e.g. it is true in the context of any socketcall.
 */
static inline void __sock_put(struct sock *sk)
{
    atomic_dec(&sk->sk_refcnt);
}
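A hedged sketch of the pattern the comment above describes (my_lookup(), the lock and the port match are illustrative; sk_for_each() is defined further down in this file, and sock_put(), the counterpart that may actually free the socket, is defined later in this header):

static struct sock *my_lookup(struct hlist_head *chain, spinlock_t *lock,
                              unsigned short port)
{
    struct sock *sk;

    spin_lock(lock);            /* prevents concurrent unhash of entries */
    sk_for_each(sk, chain) {
        if (sk->sk_num == port) {
            sock_hold(sk);      /* safe: refcnt cannot be 0 while hashed */
            spin_unlock(lock);
            return sk;          /* caller eventually does sock_put(sk) */
        }
    }
    spin_unlock(lock);
    return NULL;
}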
2012-05-17 02:48:15 +04:00
static inline bool sk_del_node_init(struct sock *sk)
2005-04-17 02:20:36 +04:00
{
2012-05-17 02:48:15 +04:00
    bool rc = __sk_del_node_init(sk);
2005-04-17 02:20:36 +04:00
    if (rc) {
        /* paranoid for a while -acme */
        WARN_ON(atomic_read(&sk->sk_refcnt) == 1);
        __sock_put(sk);
    }
    return rc;
}
2010-02-22 10:57:18 +03:00
#define sk_del_node_init_rcu(sk) sk_del_node_init(sk)
2005-04-17 02:20:36 +04:00
2012-05-17 02:48:15 +04:00
static inline bool __sk_nulls_del_node_init_rcu(struct sock *sk)
udp: RCU handling for Unicast packets.
Goals are :
1) Optimizing handling of incoming Unicast UDP frames, so that no memory
writes should happen in the fast path.
Note: Multicasts and broadcasts still will need to take a lock,
because doing a full lockless lookup in this case is difficult.
2) No expensive operations in the socket bind/unhash phases :
- No expensive synchronize_rcu() calls.
- No added rcu_head in the socket structure, which would increase memory
needs and, more importantly, force us to use call_rcu() calls, which have
the bad property of making the socket structure cold.
(the rcu grace period between socket freeing and its potential reuse
makes the socket cold in the CPU cache).
David did a previous patch using call_rcu() and noticed a 20%
impact on TCP connection rates.
Quoting Christoph Lameter:
"Right. That results in cacheline cooldown. You'd want to recycle
the object as they are cache hot on a per cpu basis. That is screwed
up by the delayed regular rcu processing. We have seen multiple
regressions due to cacheline cooldown.
The only choice in cacheline hot sensitive areas is to deal with the
complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
- Because udp sockets are allocated from dedicated kmem_cache,
use of SLAB_DESTROY_BY_RCU can help here.
Theory of operation :
---------------------
As the lookup is lockless (using rcu_read_lock()/rcu_read_unlock()),
special attention must be paid by readers and writers.
Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, and inserted in a different chain, or in the worst case the same
chain, while readers are doing lookups at the same time.
In order to avoid loops, a reader must check that each socket found in a
chain really belongs to the chain the reader was traversing. If it finds
a mismatch, the lookup must start again at the beginning. This *restart*
loop is the reason we had to use the rdlock for the multicast case,
because we don't want to send the same message several times to the same
socket.
We use RCU only for the fast path.
Thus, /proc/net/udp still takes spinlocks.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29 12:11:14 +03:00
{
    if (sk_hashed(sk)) {
2008-11-17 06:39:21 +03:00
        hlist_nulls_del_init_rcu(&sk->sk_nulls_node);
2012-05-17 02:48:15 +04:00
        return true;
2008-10-29 12:11:14 +03:00
}
2012-05-17 02:48:15 +04:00
    return false;
2008-10-29 12:11:14 +03:00
}
2012-05-17 02:48:15 +04:00
static inline bool sk_nulls_del_node_init_rcu(struct sock *sk)
2008-10-29 12:11:14 +03:00
{
2012-05-17 02:48:15 +04:00
    bool rc = __sk_nulls_del_node_init_rcu(sk);
2008-10-29 12:11:14 +03:00
    if (rc) {
        /* paranoid for a while -acme */
        WARN_ON(atomic_read(&sk->sk_refcnt) == 1);
        __sock_put(sk);
    }
    return rc;
}
2012-05-17 02:48:15 +04:00
static inline void __sk_add_node(struct sock *sk, struct hlist_head *list)
2005-04-17 02:20:36 +04:00
{
    hlist_add_head(&sk->sk_node, list);
}
2012-05-17 02:48:15 +04:00
static inline void sk_add_node(struct sock *sk, struct hlist_head *list)
2005-04-17 02:20:36 +04:00
{
    sock_hold(sk);
    __sk_add_node(sk, list);
}
2012-05-17 02:48:15 +04:00
static inline void sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
2010-02-22 10:57:18 +03:00
{
    sock_hold(sk);
    hlist_add_head_rcu(&sk->sk_node, list);
}
2012-05-17 02:48:15 +04:00
static inline void __sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
2008-10-29 12:11:14 +03:00
{
2008-11-17 06:39:21 +03:00
    hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
2008-10-29 12:11:14 +03:00
}
2012-05-17 02:48:15 +04:00
static inline void sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
2008-10-29 12:11:14 +03:00
{
    sock_hold(sk);
2008-11-17 06:39:21 +03:00
    __sk_nulls_add_node_rcu(sk, list);
2008-10-29 12:11:14 +03:00
}
2012-05-17 02:48:15 +04:00
static inline void __sk_del_bind_node(struct sock *sk)
2005-04-17 02:20:36 +04:00
{
    __hlist_del(&sk->sk_bind_node);
}
2012-05-17 02:48:15 +04:00
static inline void sk_add_bind_node(struct sock *sk,
2005-04-17 02:20:36 +04:00
                                    struct hlist_head *list)
{
    hlist_add_head(&sk->sk_bind_node, list);
}
hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-28 05:06:00 +04:00
#define sk_for_each(__sk, list) \
    hlist_for_each_entry(__sk, list, sk_node)
#define sk_for_each_rcu(__sk, list) \
    hlist_for_each_entry_rcu(__sk, list, sk_node)
2008-11-17 06:39:21 +03:00
#define sk_nulls_for_each(__sk, node, list) \
    hlist_nulls_for_each_entry(__sk, node, list, sk_nulls_node)
#define sk_nulls_for_each_rcu(__sk, node, list) \
    hlist_nulls_for_each_entry_rcu(__sk, node, list, sk_nulls_node)
2013-02-28 05:06:00 +04:00
#define sk_for_each_from(__sk) \
	hlist_for_each_entry_from(__sk, sk_node)
2008-11-17 06:39:21 +03:00
#define sk_nulls_for_each_from(__sk, node) \
	if (__sk && ({ node = &(__sk)->sk_nulls_node; 1; })) \
		hlist_nulls_for_each_entry_from(__sk, node, sk_nulls_node)
2013-02-28 05:06:00 +04:00
#define sk_for_each_safe(__sk, tmp, list) \
	hlist_for_each_entry_safe(__sk, tmp, list, sk_node)
#define sk_for_each_bound(__sk, list) \
	hlist_for_each_entry(__sk, list, sk_bind_node)
2005-04-17 02:20:36 +04:00
2014-07-16 07:28:32 +04:00
/**
 * sk_nulls_for_each_entry_offset - iterate over a list at a given struct offset
 * @tpos:	the type * to use as a loop cursor.
 * @pos:	the &struct hlist_node to use as a loop cursor.
 * @head:	the head for your list.
 * @offset:	offset of hlist_node within the struct.
 *
 */
#define sk_nulls_for_each_entry_offset(tpos, pos, head, offset)	       \
	for (pos = (head)->first;					       \
	     (!is_a_nulls(pos)) &&					       \
		({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1; });     \
	     pos = pos->next)
2012-05-25 03:56:43 +04:00
static inline struct user_namespace *sk_user_ns(struct sock *sk)
{
	/* Careful only use this in a context where these parameters
	 * can not change and must all be valid, such as recvmsg from
	 * userspace.
	 */
	return sk->sk_socket->file->f_cred->user_ns;
}
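A typical use of this helper is translating a kernel uid into the namespace of the process that opened the socket, as the SCM credentials code does when filling a cmsg. The sketch below is hedged: example_sock_uid() is an illustrative helper, not part of this header.
static inline uid_t example_sock_uid(struct sock *sk, kuid_t kuid)
{
	/* Report the uid relative to the socket opener's user namespace. */
	return from_kuid_munged(sk_user_ns(sk), kuid);
}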
2005-04-17 02:20:36 +04:00
/* Sock flags */
enum sock_flags {
	SOCK_DEAD,
	SOCK_DONE,
	SOCK_URGINLINE,
	SOCK_KEEPOPEN,
	SOCK_LINGER,
	SOCK_DESTROY,
	SOCK_BROADCAST,
	SOCK_TIMESTAMP,
	SOCK_ZAPPED,
	SOCK_USE_WRITE_QUEUE, /* whether to call sk->sk_write_space in sock_wfree */
	SOCK_DBG, /* %SO_DEBUG setting */
	SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
2007-03-26 09:14:49 +04:00
	SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
2005-04-17 02:20:36 +04:00
	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
2012-08-01 03:44:16 +04:00
	SOCK_MEMALLOC, /* VM depends on this socket for swapping */
2009-02-12 08:03:38 +03:00
	SOCK_TIMESTAMPING_RX_SOFTWARE, /* %SOF_TIMESTAMPING_RX_SOFTWARE */
net: speedup sk_wake_async()
An incoming datagram must bring a *lot* of cache lines into the cpu cache,
in particular : (other parts omitted (hash chains, ip route cache...))
On 32bit arches :
offsetof(struct sock, sk_rcvbuf) =0x30 (read)
offsetof(struct sock, sk_lock) =0x34 (rw)
offsetof(struct sock, sk_sleep) =0x50 (read)
offsetof(struct sock, sk_rmem_alloc) =0x64 (rw)
offsetof(struct sock, sk_receive_queue)=0x74 (rw)
offsetof(struct sock, sk_forward_alloc)=0x98 (rw)
offsetof(struct sock, sk_callback_lock)=0xcc (rw)
offsetof(struct sock, sk_drops) =0xd8 (read if we add dropcount support, rw if frame dropped)
offsetof(struct sock, sk_filter) =0xf8 (read)
offsetof(struct sock, sk_socket) =0x138 (read)
offsetof(struct sock, sk_data_ready) =0x15c (read)
We can avoid sk->sk_socket and socket->fasync_list referencing on sockets
with no fasync() structures. (socket->fasync_list ptr is probably already in cache
because it shares a cache line with socket->wait, ie location pointed by sk->sk_sleep)
This avoids one cache line load per incoming packet for common cases (no fasync())
We can leave (or even move in a future patch) sk->sk_socket in a cold location
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-07 04:28:29 +04:00
	SOCK_FASYNC, /* fasync() active */
net: Generalize socket rx gap / receive queue overflow cmsg
Create a new socket level option to report number of queue overflows
Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. After I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch, It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the number of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.
Notes:
1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.
2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if that's
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.
3) This applies cleanly to net-next assuming that commit
977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-13 00:26:31 +04:00
	SOCK_RXQ_OVFL,
2011-07-06 16:17:30 +04:00
	SOCK_ZEROCOPY, /* buffers from userspace */
2011-11-09 13:15:42 +04:00
	SOCK_WIFI_STATUS, /* push wifi status to userspace */
2012-02-11 19:39:30 +04:00
	SOCK_NOFCS, /* Tell NIC not to do the Ethernet FCS.
		     * Will use last 4 bytes of packet sent from
		     * user-space instead.
		     */
2013-01-17 01:55:49 +04:00
	SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
2013-03-28 15:19:25 +04:00
	SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
2005-04-17 02:20:36 +04:00
};
2005-08-23 21:11:30 +04:00
static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
{
	nsk->sk_flags = osk->sk_flags;
}
2005-04-17 02:20:36 +04:00
static inline void sock_set_flag(struct sock *sk, enum sock_flags flag)
{
	__set_bit(flag, &sk->sk_flags);
}

static inline void sock_reset_flag(struct sock *sk, enum sock_flags flag)
{
	__clear_bit(flag, &sk->sk_flags);
}
2012-05-16 09:57:07 +04:00
static inline bool sock_flag(const struct sock *sk, enum sock_flags flag)
2005-04-17 02:20:36 +04:00
{
	return test_bit(flag, &sk->sk_flags);
}
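Since these helpers are thin wrappers around bit operations on sk->sk_flags, flag checks stay cheap in fast paths. A hedged sketch of typical usage, with illustrative helper names:
static inline void example_mark_done(struct sock *sk)
{
	sock_set_flag(sk, SOCK_DONE);
}

static inline bool example_wants_rx_timestamps(const struct sock *sk)
{
	/* Either SO_TIMESTAMP or SO_TIMESTAMPNS asks for receive timestamps. */
	return sock_flag(sk, SOCK_RCVTSTAMP) || sock_flag(sk, SOCK_RCVTSTAMPNS);
}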
2012-08-01 03:44:19 +04:00
#ifdef CONFIG_NET
extern struct static_key memalloc_socks;
static inline int sk_memalloc_socks(void)
{
	return static_key_false(&memalloc_socks);
}
#else
static inline int sk_memalloc_socks(void)
{
	return 0;
}
#endif
2015-11-30 19:57:28 +03:00
static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
2012-08-01 03:44:14 +04:00
{
2015-11-30 19:57:28 +03:00
	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
2012-08-01 03:44:14 +04:00
}
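A hedged sketch of how this is typically used: fold __GFP_MEMALLOC from a SOCK_MEMALLOC socket into an atomic allocation so it may dip into emergency reserves, as the TCP output path does. example_alloc() is illustrative.
static inline struct sk_buff *example_alloc(struct sock *sk, unsigned int size)
{
	return alloc_skb(size, sk_gfp_mask(sk, GFP_ATOMIC));
}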
2005-04-17 02:20:36 +04:00
static inline void sk_acceptq_removed(struct sock *sk)
{
	sk->sk_ack_backlog--;
}

static inline void sk_acceptq_added(struct sock *sk)
{
	sk->sk_ack_backlog++;
}
2012-05-17 02:48:15 +04:00
static inline bool sk_acceptq_is_full(const struct sock *sk)
2005-04-17 02:20:36 +04:00
{
2007-03-06 22:21:05 +03:00
	return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
2005-04-17 02:20:36 +04:00
}
/*
 * Compute minimal free write space needed to queue new packets.
 */
2012-05-17 02:48:15 +04:00
static inline int sk_stream_min_wspace(const struct sock *sk)
2005-04-17 02:20:36 +04:00
{
2007-12-21 14:07:41 +03:00
	return sk->sk_wmem_queued >> 1;
2005-04-17 02:20:36 +04:00
}
2012-05-17 02:48:15 +04:00
static inline int sk_stream_wspace(const struct sock *sk)
2005-04-17 02:20:36 +04:00
{
	return sk->sk_sndbuf - sk->sk_wmem_queued;
}
2013-09-22 21:32:26 +04:00
void sk_stream_write_space(struct sock *sk);
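To show how the two helpers above fit together: a stream socket is typically treated as writeable once at least half of its send buffer is free. The sketch below is a simplified, hedged version of that check; it deliberately omits the protocol-specific stream_memory_free hook that appears later in this header.
static inline bool example_stream_writeable(const struct sock *sk)
{
	/* Enough free write space to be worth waking the writer. */
	return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk);
}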
2005-04-17 02:20:36 +04:00
2010-03-04 21:01:40 +03:00
/* OOB backlog add */
2010-03-04 21:01:47 +03:00
static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
2005-11-08 20:39:42 +03:00
{
2010-05-12 03:19:48 +04:00
	/* don't let the skb dst be un-refcounted, we are going to leave the rcu lock */
	skb_dst_force(skb);

	if (!sk->sk_backlog.tail)
		sk->sk_backlog.head = skb;
	else
2005-11-08 20:39:42 +03:00
		sk->sk_backlog.tail->next = skb;
2010-05-12 03:19:48 +04:00
	sk->sk_backlog.tail = skb;
2005-11-08 20:39:42 +03:00
	skb->next = NULL;
}
2005-04-17 02:20:36 +04:00
2010-04-28 02:13:20 +04:00
/*
 * Take into account size of receive queue and backlog queue
2011-12-21 11:11:44 +04:00
 * Do not take into account this skb truesize,
 * to allow even a single big packet to come.
2010-04-28 02:13:20 +04:00
 */
2014-07-22 22:16:51 +04:00
static inline bool sk_rcvqueues_full(const struct sock *sk, unsigned int limit)
2010-04-28 02:13:20 +04:00
{
	unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);

2012-04-23 03:34:26 +04:00
	return qsize > limit;
2010-04-28 02:13:20 +04:00
}
2010-03-04 21:01:40 +03:00
/* The per-socket spinlock must be held here. */
2012-04-23 03:34:26 +04:00
static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb,
					      unsigned int limit)
2010-03-04 21:01:40 +03:00
{
2014-07-22 22:16:51 +04:00
	if (sk_rcvqueues_full(sk, limit))
2010-03-04 21:01:40 +03:00
		return -ENOBUFS;

2015-09-30 04:52:25 +03:00
	/*
	 * If the skb was allocated from pfmemalloc reserves, only
	 * allow SOCK_MEMALLOC sockets to use it as this socket is
	 * helping free memory
	 */
	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
		return -ENOMEM;

2010-03-04 21:01:47 +03:00
	__sk_add_backlog(sk, skb);
2010-03-04 21:01:40 +03:00
	sk->sk_backlog.len += skb->truesize;
	return 0;
}
2013-09-22 21:32:26 +04:00
int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
2012-08-01 03:44:26 +04:00
2008-10-08 01:18:42 +04:00
static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
{
2012-08-01 03:44:26 +04:00
	if (sk_memalloc_socks() && skb_pfmemalloc(skb))
		return __sk_backlog_rcv(sk, skb);

2008-10-08 01:18:42 +04:00
	return sk->sk_backlog_rcv(sk, skb);
}
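To show where the backlog helpers sit, here is a hedged sketch of the usual pattern in a protocol's softirq receive handler (compare tcp_v4_rcv()): process the skb directly when the socket is not owned by a user context, otherwise queue it on the backlog, bounded here by the receive plus send buffer sizes. example_rcv() and the drop policy are illustrative.
static int example_rcv(struct sock *sk, struct sk_buff *skb)
{
	int ret = 0;

	bh_lock_sock(sk);
	if (!sock_owned_by_user(sk)) {
		ret = sk_backlog_rcv(sk, skb);	/* process immediately */
	} else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf + sk->sk_sndbuf)) {
		kfree_skb(skb);			/* backlog full: drop */
		ret = -ENOBUFS;
	}
	bh_unlock_sock(sk);
	return ret;
}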
net: introduce SO_INCOMING_CPU
Alternative to RPS/RFS is to use hardware support for multiple
queues.
Then split a set of millions of sockets among worker threads, each
one using epoll() to manage events on its own socket pool.
Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
know after accept() or connect() on which queue/cpu a socket is managed.
We normally use one cpu per RX queue (IRQ smp_affinity being properly
set), so remembering on socket structure which cpu delivered last packet
is enough to solve the problem.
After accept(), connect(), or even file descriptor passing around
processes, applications can use :
int cpu;
socklen_t len = sizeof(cpu);
getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
And use this information to put the socket into the right silo
for optimal performance, as the whole networking stack should then run
on the appropriate cpu, without needing to send IPIs (RPS/RFS).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11 16:54:28 +03:00
static inline void sk_incoming_cpu_update(struct sock *sk)
{
	sk->sk_incoming_cpu = raw_smp_processor_id();
}
2013-12-22 14:54:31 +04:00
static inline void sock_rps_record_flow_hash(__u32 hash)
2010-04-28 02:05:31 +04:00
{
#ifdef CONFIG_RPS
	struct rps_sock_flow_table *sock_flow_table;

	rcu_read_lock();
	sock_flow_table = rcu_dereference(rps_sock_flow_table);
2013-12-22 14:54:31 +04:00
	rps_record_sock_flow(sock_flow_table, hash);
2010-04-28 02:05:31 +04:00
	rcu_read_unlock();
#endif
}
2013-12-22 14:54:31 +04:00
static inline void sock_rps_record_flow(const struct sock *sk)
{
2014-01-01 00:31:01 +04:00
#ifdef CONFIG_RPS
2013-12-22 14:54:31 +04:00
	sock_rps_record_flow_hash(sk->sk_rxhash);
2014-01-01 00:31:01 +04:00
#endif
2013-12-22 14:54:31 +04:00
}
2011-08-14 23:45:55 +04:00
static inline void sock_rps_save_rxhash(struct sock *sk,
					const struct sk_buff *skb)
2010-04-28 02:05:31 +04:00
{
#ifdef CONFIG_RPS
net: rfs: add hash collision detection
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.
Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)
This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.
I use a 32bit value, and dynamically split it in two parts.
For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection use low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).
This means that a packet for a non-connected flow can avoid the
IPI through an unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06 23:59:01 +03:00
	if (unlikely(sk->sk_rxhash != skb->hash))
2014-03-25 02:34:47 +04:00
		sk->sk_rxhash = skb->hash;
2010-04-28 02:05:31 +04:00
#endif
}
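The collision detection described in the commit message above relies on each rps_sock_flow_table entry packing both a cpu number and the upper bits of the flow hash into one 32-bit word. The sketch below illustrates that encoding with made-up constants (6 cpu bits, matching the commit's example for hosts with fewer than 64 cpus); the real table sizes the split dynamically, so all EXAMPLE_* names are assumptions.
/* Hedged sketch of the entry layout: low bits = cpu, high bits = hash. */
#define EXAMPLE_RPS_CPU_BITS	6
#define EXAMPLE_RPS_CPU_MASK	((1U << EXAMPLE_RPS_CPU_BITS) - 1)

static inline u32 example_rps_make_entry(u32 hash, unsigned int cpu)
{
	return (hash & ~EXAMPLE_RPS_CPU_MASK) | (cpu & EXAMPLE_RPS_CPU_MASK);
}

static inline bool example_rps_entry_matches(u32 entry, u32 hash)
{
	/* A mismatch means the slot was last written by an unrelated flow. */
	return (entry & ~EXAMPLE_RPS_CPU_MASK) == (hash & ~EXAMPLE_RPS_CPU_MASK);
}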
2011-08-14 23:45:55 +04:00
static inline void sock_rps_reset_rxhash(struct sock *sk)
{
#ifdef CONFIG_RPS
	sk->sk_rxhash = 0;
#endif
}
2007-10-09 12:59:42 +04:00
#define sk_wait_event(__sk, __timeo, __condition)			\
	({	int __rc;						\
		release_sock(__sk);					\
		__rc = __condition;					\
		if (!__rc) {						\
			*(__timeo) = schedule_timeout(*(__timeo));	\
		}							\
2014-09-24 12:18:54 +04:00
		sched_annotate_sleep();					\
2007-10-09 12:59:42 +04:00
		lock_sock(__sk);					\
		__rc = __condition;					\
		__rc;							\
	})
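A hedged sketch of the usual calling pattern (compare sk_wait_data() below): the caller holds the socket lock, registers on the socket's wait queue, and lets sk_wait_event() drop and re-take the lock around the sleep. example_wait_for_data() is illustrative.
static inline int example_wait_for_data(struct sock *sk, long *timeo)
{
	DEFINE_WAIT(wait);
	int rc;

	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
	rc = sk_wait_event(sk, timeo,
			   !skb_queue_empty(&sk->sk_receive_queue));
	finish_wait(sk_sleep(sk), &wait);
	return rc;
}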
2005-04-17 02:20:36 +04:00
2013-09-22 21:32:26 +04:00
int sk_stream_wait_connect(struct sock *sk, long *timeo_p);
int sk_stream_wait_memory(struct sock *sk, long *timeo_p);
void sk_stream_wait_close(struct sock *sk, long timeo_p);
int sk_stream_error(struct sock *sk, int flags, int err);
void sk_stream_kill_queues(struct sock *sk);
void sk_set_memalloc(struct sock *sk);
void sk_clear_memalloc(struct sock *sk);
2005-04-17 02:20:36 +04:00
2015-07-24 19:19:25 +03:00
int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb);
2005-04-17 02:20:36 +04:00
2005-06-19 09:47:21 +04:00
struct request_sock_ops;
2005-12-14 10:25:19 +03:00
struct timewait_sock_ops;
[SOCK] proto: Add hashinfo member to struct proto
This way we can remove TCP and DCCP specific versions of
sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash: inet_hash is directly used, only v6 need
a specific version to deal with mapped sockets
sk->sk_prot->unhash: both v4 and v6 use inet_hash directly
struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.
Now only the lookup routines receive as a parameter a struct inet_hashtable.
With this we further reuse code, reducing the difference among INET transport
protocols.
Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.
net-2.6/net/ipv4/inet_hashtables.c:
struct proto | +8
struct inet_connection_sock_af_ops | +8
2 structs changed
__inet_hash_nolisten | +18
__inet_hash | -210
inet_put_port | +8
inet_bind_bucket_create | +1
__inet_hash_connect | -8
5 functions changed, 27 bytes added, 218 bytes removed, diff: -191
net-2.6/net/core/sock.c:
proto_seq_show | +3
1 function changed, 3 bytes added, diff: +3
net-2.6/net/ipv4/inet_connection_sock.c:
inet_csk_get_port | +15
1 function changed, 15 bytes added, diff: +15
net-2.6/net/ipv4/tcp.c:
tcp_set_state | -7
1 function changed, 7 bytes removed, diff: -7
net-2.6/net/ipv4/tcp_ipv4.c:
tcp_v4_get_port | -31
tcp_v4_hash | -48
tcp_v4_destroy_sock | -7
tcp_v4_syn_recv_sock | -2
tcp_unhash | -179
5 functions changed, 267 bytes removed, diff: -267
net-2.6/net/ipv6/inet6_hashtables.c:
__inet6_hash | +8
1 function changed, 8 bytes added, diff: +8
net-2.6/net/ipv4/inet_hashtables.c:
inet_unhash | +190
inet_hash | +242
2 functions changed, 432 bytes added, diff: +432
vmlinux:
16 functions changed, 485 bytes added, 492 bytes removed, diff: -7
/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
tcp_v6_get_port | -31
tcp_v6_hash | -7
tcp_v6_syn_recv_sock | -9
3 functions changed, 47 bytes removed, diff: -47
/home/acme/git/net-2.6/net/dccp/proto.c:
dccp_destroy_sock | -7
dccp_unhash | -179
dccp_hash | -49
dccp_set_state | -7
dccp_done | +1
5 functions changed, 1 bytes added, 242 bytes removed, diff: -241
/home/acme/git/net-2.6/net/dccp/ipv4.c:
dccp_v4_get_port | -31
dccp_v4_request_recv_sock | -2
2 functions changed, 33 bytes removed, diff: -33
/home/acme/git/net-2.6/net/dccp/ipv6.c:
dccp_v6_get_port | -31
dccp_v6_hash | -7
dccp_v6_request_recv_sock | +5
3 functions changed, 5 bytes added, 38 bytes removed, diff: -33
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-02-03 15:06:04 +03:00
struct inet_hashinfo;
2008-03-23 02:56:51 +03:00
struct raw_hashinfo;
2011-05-26 21:46:22 +04:00
struct module;
[NET] Generalise TCP's struct open_request minisock infrastructure
Kept this first changeset minimal, without changing existing names to
ease peer review.
Basically tcp_openreq_alloc now receives the or_calltable, which in turn
has two new members:
->slab, that replaces tcp_openreq_cachep
->obj_size, to inform the size of the openreq descendant for
a specific protocol
The protocol specific fields in struct open_request were moved to a
class hierarchy, with the things that are common to all connection
oriented PF_INET protocols in struct inet_request_sock, the TCP ones
in tcp_request_sock, that is an inet_request_sock, that is an
open_request.
I.e. this uses the same approach used for the struct sock class
hierarchy, with sk_prot indicating if the protocol wants to use the
open_request infrastructure by filling in sk_prot->rsk_prot with an
or_calltable.
Results? Performance is improved and TCP v4 now uses only 64 bytes per
open request minisock, down from 96 without this patch :-)
Next changeset will rename some of the structs, fields and functions
mentioned above, struct or_calltable is way unclear, better name it
struct request_sock_ops, s/struct open_request/struct request_sock/g,
etc.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-06-19 09:46:52 +04:00
2013-05-09 14:28:16 +04:00
/*
 * Caches using SLAB_DESTROY_BY_RCU should leave the .next pointer of nulls
 * nodes unmodified. Special care is taken when initializing the object to zero.
 */
static inline void sk_prot_clear_nulls(struct sock *sk, int size)
{
	if (offsetof(struct sock, sk_node.next) != 0)
		memset(sk, 0, offsetof(struct sock, sk_node.next));
	memset(&sk->sk_node.pprev, 0,
	       size - offsetof(struct sock, sk_node.pprev));
}
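The point of preserving the nulls marker shows up in how lockless lookups are written. Below is a hedged sketch of the restart pattern described in the UDP RCU commit further down: if the nulls value at the end of the traversal does not name the bucket we started in, the socket we followed was recycled into another chain and the walk must restart. example_nulls_lookup(), its match condition, and the slot numbering are illustrative; a real lookup would also take a reference on the socket before returning it.
static struct sock *example_nulls_lookup(struct hlist_nulls_head *bucket,
					 unsigned int slot, unsigned int hash)
{
	struct sock *sk;
	struct hlist_nulls_node *node;

	rcu_read_lock();
begin:
	sk_nulls_for_each_rcu(sk, node, bucket) {
		if (sk->sk_hash == hash)	/* illustrative match condition */
			goto found;
	}
	/* The nulls value encodes the bucket; a mismatch means a concurrent
	 * free/reuse moved us to another chain, so restart the walk.
	 */
	if (get_nulls_value(node) != slot)
		goto begin;
	sk = NULL;
found:
	rcu_read_unlock();
	return sk;
}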
2005-04-17 02:20:36 +04:00
/* Networking protocol blocks we attach to sockets.
 * socket layer -> transport layer interface
 */
struct proto {
2012-05-17 02:48:15 +04:00
	void			(*close)(struct sock *sk,
2005-04-17 02:20:36 +04:00
					long timeout);
	int			(*connect)(struct sock *sk,
2012-05-17 02:48:15 +04:00
					struct sockaddr *uaddr,
2005-04-17 02:20:36 +04:00
					int addr_len);
	int			(*disconnect)(struct sock *sk, int flags);
2012-05-17 02:48:15 +04:00
	struct sock *		(*accept)(struct sock *sk, int flags, int *err);
2005-04-17 02:20:36 +04:00
	int			(*ioctl)(struct sock *sk, int cmd,
					 unsigned long arg);
	int			(*init)(struct sock *sk);
2008-06-15 04:04:49 +04:00
	void			(*destroy)(struct sock *sk);
2005-04-17 02:20:36 +04:00
	void			(*shutdown)(struct sock *sk, int how);
2012-05-17 02:48:15 +04:00
	int			(*setsockopt)(struct sock *sk, int level,
2005-04-17 02:20:36 +04:00
					int optname, char __user *optval,
2009-10-01 03:12:20 +04:00
					unsigned int optlen);
2012-05-17 02:48:15 +04:00
	int			(*getsockopt)(struct sock *sk, int level,
					int optname, char __user *optval,
					int __user *option);
2008-08-28 13:53:51 +04:00
#ifdef CONFIG_COMPAT
2006-03-21 09:45:21 +03:00
	int			(*compat_setsockopt)(struct sock *sk,
					int level,
					int optname, char __user *optval,
2009-10-01 03:12:20 +04:00
					unsigned int optlen);
2006-03-21 09:45:21 +03:00
	int			(*compat_getsockopt)(struct sock *sk,
					int level,
					int optname, char __user *optval,
					int __user *option);
2011-01-29 19:15:56 +03:00
	int			(*compat_ioctl)(struct sock *sk,
					unsigned int cmd, unsigned long arg);
2008-08-28 13:53:51 +04:00
#endif
2015-03-02 10:37:48 +03:00
	int			(*sendmsg)(struct sock *sk, struct msghdr *msg,
					   size_t len);
	int			(*recvmsg)(struct sock *sk, struct msghdr *msg,
2012-05-17 02:48:15 +04:00
					   size_t len, int noblock, int flags,
					   int *addr_len);
2005-04-17 02:20:36 +04:00
	int			(*sendpage)(struct sock *sk, struct page *page,
					int offset, size_t size, int flags);
2012-05-17 02:48:15 +04:00
	int			(*bind)(struct sock *sk,
2005-04-17 02:20:36 +04:00
					struct sockaddr *uaddr, int addr_len);
2012-05-17 02:48:15 +04:00
	int			(*backlog_rcv)(struct sock *sk,
2005-04-17 02:20:36 +04:00
					struct sk_buff *skb);
tcp: TCP Small Queues
This introduce TSQ (TCP Small Queues)
TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.
sk->sk_wmem_alloc not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.
TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.
As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.
This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.
Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.
Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)
I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.
As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.
If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.
[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
but some drivers call it in their start_xmit() handler.
These drivers should at least use BQL, or else a single TCP
session can still fill the whole NIC TX ring, since TSQ will
have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11 09:50:31 +04:00
	void			(*release_cb)(struct sock *sk);
2005-04-17 02:20:36 +04:00
	/* Keeping track of sk's, looking them up, and port selection methods. */
	void			(*hash)(struct sock *sk);
	void			(*unhash)(struct sock *sk);
udp: add rehash on connect()
commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
added a secondary hash on UDP, hashed on (local addr, local port).
Problem is that following sequence :
fd = socket(...)
connect(fd, &remote, ...)
not only selects the remote end point (address and port), but also sets the
local address, while the UDP stack has stored the socket in the secondary hash
table with its local address still INADDR_ANY (or the ipv6 equivalent)
Sequence is :
- autobind() : choose a random local port, insert socket in hash tables
[while local address is INADDR_ANY]
- connect() : set remote address and port, change local address to IP
given by a route lookup.
When an incoming UDP frame comes, if more than 10 sockets are found in
primary hash table, we switch to secondary table, and fail to find
socket because its local address changed.
One solution to this problem is to rehash datagram socket if needed.
We add a new rehash(struct socket *) method in "struct proto", and
implement this method for UDP v4 & v6, using a common helper.
This rehashing only takes care of secondary hash table, since primary
hash (based on local port only) is not changed.
Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-09-08 09:08:44 +04:00
	void			(*rehash)(struct sock *sk);
2005-04-17 02:20:36 +04:00
	int			(*get_port)(struct sock *sk, unsigned short snum);
2010-12-17 01:26:56 +03:00
	void			(*clear_sk)(struct sock *sk, int size);
2005-04-17 02:20:36 +04:00
[NET]: Define infrastructure to keep 'inuse' changes in an efficient SMP/NUMA way.
"struct proto" currently uses an array stats[NR_CPUS] to track change on
'inuse' sockets per protocol.
If NR_CPUS is big, this means we use a big memory area for this.
Moreover, all this memory area is located on a single node on NUMA
machines, increasing memory pressure on the boot node.
In this patch, I tried to :
- Keep a fast !CONFIG_SMP implementation
- Keep a fast CONFIG_SMP implementation for often used protocols
(tcp,udp,raw,...)
- Introduce a NUMA efficient implementation
Some helper macros are defined in include/net/sock.h
These macros take into account CONFIG_SMP
If a "struct proto" is declared without using DEFINE_PROTO_INUSE /
REF_PROTO_INUSE
macros, it will automatically use a default implementation, using a
dynamically allocated percpu zone.
This default implementation will be NUMA efficient, but might use 32/64
bytes per possible cpu
because of current alloc_percpu() implementation.
However it still should be better than previous implementation based on
stats[NR_CPUS] field.
When a "struct proto" is changed to use the new macros, we use a single
static "int" percpu variable,
lowering the memory and cpu costs, still preserving NUMA efficiency.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-06 10:38:39 +03:00
	/* Keeping track of sockets in use */
2008-01-04 07:46:48 +03:00
#ifdef CONFIG_PROC_FS
2008-03-29 02:38:17 +03:00
	unsigned int		inuse_idx;
2008-01-04 07:46:48 +03:00
#endif
2007-11-21 17:08:50 +03:00
tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.
TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :
Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)
For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awakened in a blocking sendmsg())
This patch adds two ways to set the limit :
1) Per socket option TCP_NOTSENT_LOWAT
2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
This changes poll()/select()/epoll() to report POLLOUT
only if the number of unsent bytes is below tp->notsent_lowat
Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.
Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
Specify the minimum number of bytes in the buffer until
the socket layer will pass the data to the protocol)
Tested:
netperf sessions, and watching /proc/net/protocols "memory" column for TCP
With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.
A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.
lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
412,514 context-switches
200.034645535 seconds time elapsed
lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
Size Size Size (sec) Util Util Util Util Demand Demand Units
Final Final % Method % Method
1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB
Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
2,675,818 context-switches
200.029651391 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-23 07:27:07 +04:00
	bool			(*stream_memory_free)(const struct sock *sk);
2005-04-17 02:20:36 +04:00
	/* Memory pressure */
2008-07-17 07:28:10 +04:00
	void			(*enter_memory_pressure)(struct sock *sk);
2010-11-10 02:24:26 +03:00
	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
2008-11-26 08:16:35 +03:00
	struct percpu_counter	*sockets_allocated;	/* Current number of sockets. */
2005-04-17 02:20:36 +04:00
	/*
	 * Pressure flag: try to collapse.
	 * Technical note: it is used by multiple contexts non atomically.
2007-12-31 11:11:19 +03:00
	 * All the __sk_mem_schedule() is of this nature: accounting
2005-04-17 02:20:36 +04:00
	 * is strict, actions are advisory and have some latency.
	 */
	int			*memory_pressure;
2010-11-10 02:24:26 +03:00
	long			*sysctl_mem;
2005-04-17 02:20:36 +04:00
	int			*sysctl_wmem;
	int			*sysctl_rmem;
	int			max_header;
2010-07-11 00:41:55 +04:00
	bool			no_autobind;
2005-04-17 02:20:36 +04:00
udp: RCU handling for Unicast packets.
Goals are :
1) Optimizing handling of incoming Unicast UDP frames, so that no memory
writes should happen in the fast path.
Note: Multicasts and broadcasts still will need to take a lock,
because doing a full lockless lookup in this case is difficult.
2) No expensive operations in the socket bind/unhash phases :
- No expensive synchronize_rcu() calls.
- No added rcu_head in socket structure, increasing memory needs,
but more important, forcing us to use call_rcu() calls,
that have the bad property of making sockets structure cold.
(rcu grace period between socket freeing and its potential reuse
make this socket being cold in CPU cache).
David did a previous patch using call_rcu() and noticed a 20%
impact on TCP connection rates.
Quoting Christoph Lameter:
"Right. That results in cacheline cooldown. You'd want to recycle
the object as they are cache hot on a per cpu basis. That is screwed
up by the delayed regular rcu processing. We have seen multiple
regressions due to cacheline cooldown.
The only choice in cacheline hot sensitive areas is to deal with the
complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
- Because udp sockets are allocated from dedicated kmem_cache,
use of SLAB_DESTROY_BY_RCU can help here.
Theory of operation :
---------------------
As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
special attention must be taken by readers and writers.
Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, inserted in a different chain or in worst case in the same chain
while readers could do lookups in the same time.
In order to avoid loops, a reader must check each socket found in a chain
really belongs to the chain the reader was traversing. If it finds a
mismatch, lookup must start again at the beginning. This *restart* loop
is the reason we had to use rdlock for the multicast case, because
we don't want to send the same message several times to the same socket.
We use RCU only for fast path.
Thus, /proc/net/udp still takes spinlocks.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29 12:11:14 +03:00
	struct kmem_cache	*slab;
2005-04-17 02:20:36 +04:00
	unsigned int		obj_size;
2008-10-29 12:11:14 +03:00
	int			slab_flags;
2005-04-17 02:20:36 +04:00
2008-11-26 08:17:14 +03:00
	struct percpu_counter	*orphan_count;
2005-08-10 07:09:30 +04:00
2005-06-19 09:47:21 +04:00
	struct request_sock_ops	*rsk_prot;
2005-12-14 10:25:19 +03:00
	struct timewait_sock_ops *twsk_prot;
2005-06-19 09:46:52 +04:00
2008-03-23 02:50:58 +03:00
	union {
		struct inet_hashinfo *hashinfo;
2008-10-29 11:41:45 +03:00
		struct udp_table *udp_table;
2008-03-23 02:56:51 +03:00
		struct raw_hashinfo *raw_hash;
2008-03-23 02:50:58 +03:00
	} h;
2008-02-03 15:06:04 +03:00
2005-04-17 02:20:36 +04:00
	struct module		*owner;

	char			name[32];

	struct list_head	node;
2005-08-10 06:45:38 +04:00
#ifdef SOCK_REFCNT_DEBUG
	atomic_t		socks;
#endif
2012-08-01 03:43:02 +04:00
#ifdef CONFIG_MEMCG_KMEM
2011-12-12 01:47:03 +04:00
	/*
	 * cgroup specific init/deinit functions. Called once for all
	 * protocols that implement it, from the cgroups populate function.
	 * This function has to set up any files the protocol wants to
	 * appear in the kmem cgroup filesystem.
	 */
2012-04-10 02:36:33 +04:00
	int			(*init_cgroup)(struct mem_cgroup *memcg,
2011-12-12 01:47:03 +04:00
					       struct cgroup_subsys *ss);
2012-04-10 02:36:33 +04:00
	void			(*destroy_cgroup)(struct mem_cgroup *memcg);
2011-12-12 01:47:03 +04:00
	struct cg_proto		*(*proto_cgroup)(struct mem_cgroup *memcg);
#endif
};
2013-09-22 21:32:26 +04:00
int proto_register(struct proto *prot, int alloc_slab);
void proto_unregister(struct proto *prot);
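For orientation, here is a hedged sketch of how a protocol typically fills in and registers a struct proto. Only a few fields are shown and every name is illustrative; a real protocol also wires up the sendmsg/recvmsg/hash/unhash and other operations declared above.
/* Hedged sketch: minimal registration of an illustrative protocol. */
static struct proto example_proto = {
	.name		= "EXAMPLE",
	.owner		= THIS_MODULE,
	.obj_size	= sizeof(struct sock),
};

static int __init example_proto_init(void)
{
	/* The second argument asks proto_register() to create a slab cache
	 * of obj_size bytes for this protocol's sockets.
	 */
	return proto_register(&example_proto, 1);
}

static void __exit example_proto_exit(void)
{
	proto_unregister(&example_proto);
}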
2005-04-17 02:20:36 +04:00
2005-08-10 06:45:38 +04:00
#ifdef SOCK_REFCNT_DEBUG
static inline void sk_refcnt_debug_inc(struct sock *sk)
{
	atomic_inc(&sk->sk_prot->socks);
}

static inline void sk_refcnt_debug_dec(struct sock *sk)
{
	atomic_dec(&sk->sk_prot->socks);
	printk(KERN_DEBUG "%s socket %p released, %d are still alive\n",
	       sk->sk_prot->name, sk, atomic_read(&sk->sk_prot->socks));
}
2013-02-16 02:28:25 +04:00
static inline void sk_refcnt_debug_release(const struct sock *sk)
2005-08-10 06:45:38 +04:00
{
	if (atomic_read(&sk->sk_refcnt) != 1)
		printk(KERN_DEBUG "Destruction of the %s socket %p delayed, refcnt=%d\n",
		       sk->sk_prot->name, sk, atomic_read(&sk->sk_refcnt));
}
#else /* SOCK_REFCNT_DEBUG */
#define sk_refcnt_debug_inc(sk) do { } while (0)
#define sk_refcnt_debug_dec(sk) do { } while (0)
#define sk_refcnt_debug_release(sk) do { } while (0)
#endif /* SOCK_REFCNT_DEBUG */
2012-08-01 03:43:02 +04:00
#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_NET)
2012-02-24 11:31:31 +04:00
extern struct static_key memcg_socket_limit_enabled;
2011-12-12 01:47:03 +04:00
static inline struct cg_proto *parent_cg_proto(struct proto *proto,
					       struct cg_proto *cg_proto)
{
	return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg));
}
2012-02-24 11:31:31 +04:00
#define mem_cgroup_sockets_enabled static_key_false(&memcg_socket_limit_enabled)
2011-12-12 01:47:03 +04:00
#else
#define mem_cgroup_sockets_enabled 0
static inline struct cg_proto *parent_cg_proto(struct proto *proto,
					       struct cg_proto *cg_proto)
{
	return NULL;
}
#endif
2013-07-23 07:27:07 +04:00
static inline bool sk_stream_memory_free(const struct sock *sk)
{
	if (sk->sk_wmem_queued >= sk->sk_sndbuf)
		return false;

	return sk->sk_prot->stream_memory_free ?
		sk->sk_prot->stream_memory_free(sk) : true;
}
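A protocol can tighten this test by supplying the stream_memory_free hook in its struct proto. The hedged sketch below is modelled loosely on the TCP_NOTSENT_LOWAT behaviour described earlier; the fixed limit and the use of sk_wmem_queued as a stand-in for "not yet sent bytes" are illustrative assumptions, not the real TCP implementation.
static bool example_stream_memory_free(const struct sock *sk)
{
	/* Illustrative fixed limit; TCP instead compares unsent bytes
	 * against its per-socket notsent_lowat value.
	 */
	return sk->sk_wmem_queued < 128 * 1024;
}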
2013-07-23 07:26:31 +04:00
static inline bool sk_stream_is_writeable(const struct sock *sk)
{
2013-07-23 07:27:07 +04:00
	return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) &&
	       sk_stream_memory_free(sk);
2013-07-23 07:26:31 +04:00
}
2011-12-12 01:47:03 +04:00
2011-12-12 01:47:02 +04:00
static inline bool sk_has_memory_pressure(const struct sock *sk)
{
	return sk->sk_prot->memory_pressure != NULL;
}

static inline bool sk_under_memory_pressure(const struct sock *sk)
{
	if (!sk->sk_prot->memory_pressure)
		return false;
2011-12-12 01:47:03 +04:00
	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
2013-10-20 03:26:19 +04:00
		return !!sk->sk_cgrp->memory_pressure;
2011-12-12 01:47:03 +04:00
2013-10-23 23:49:21 +04:00
	return !!*sk->sk_prot->memory_pressure;
2011-12-12 01:47:02 +04:00
}
static inline void sk_leave_memory_pressure(struct sock *sk)
{
	int *memory_pressure = sk->sk_prot->memory_pressure;
2011-12-12 01:47:03 +04:00
	if (!memory_pressure)
		return;

	if (*memory_pressure)
2011-12-12 01:47:02 +04:00
		*memory_pressure = 0;
2011-12-12 01:47:03 +04:00
	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
		struct cg_proto *cg_proto = sk->sk_cgrp;
		struct proto *prot = sk->sk_prot;

		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
2013-12-05 08:12:04 +04:00
			cg_proto->memory_pressure = 0;
2011-12-12 01:47:03 +04:00
	}
2011-12-12 01:47:02 +04:00
}
static inline void sk_enter_memory_pressure(struct sock *sk)
{
2011-12-12 01:47:03 +04:00
	if (!sk->sk_prot->enter_memory_pressure)
		return;

	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
		struct cg_proto *cg_proto = sk->sk_cgrp;
		struct proto *prot = sk->sk_prot;

		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
2013-12-05 08:12:04 +04:00
			cg_proto->memory_pressure = 1;
2011-12-12 01:47:03 +04:00
	}

	sk->sk_prot->enter_memory_pressure(sk);
2011-12-12 01:47:02 +04:00
}
static inline long sk_prot_mem_limits(const struct sock *sk, int index)
{
	long *prot = sk->sk_prot->sysctl_mem;
2011-12-12 01:47:03 +04:00
	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
		prot = sk->sk_cgrp->sysctl_mem;
2011-12-12 01:47:02 +04:00
	return prot[index];
}
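A hedged note on the index argument above: the three-slot sysctl_mem[] array (e.g. tcp_mem) conventionally holds the low watermark, the pressure threshold and the hard limit, expressed in pages. The helper below is hypothetical and only illustrates that convention:

/* Hypothetical helper: check the protocol's hard limit (slot 2 of sysctl_mem[]). */
static inline bool sk_prot_mem_over_hard_limit(const struct sock *sk,
					       long allocated_pages)
{
	/* index 0: low watermark, 1: pressure threshold, 2: hard limit */
	return allocated_pages > sk_prot_mem_limits(sk, 2);
}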
2011-12-12 01:47:03 +04:00
static inline void memcg_memory_allocated_add(struct cg_proto *prot,
					      unsigned long amt,
					      int *parent_status)
{
mm: memcontrol: lockless page counters
Memory is internally accounted in bytes, using spinlock-protected 64-bit
counters, even though the smallest accounting delta is a page. The
counter interface is also convoluted and does too many things.
Introduce a new lockless word-sized page counter API, then change all
memory accounting over to it. The translation from and to bytes then only
happens when interfacing with userspace.
The removed locking overhead is noticable when scaling beyond the per-cpu
charge caches - on a 4-socket machine with 144-threads, the following test
shows the performance differences of 288 memcgs concurrently running a
page fault benchmark:
vanilla:
18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% )
1,380,638 context-switches # 0.074 K/sec ( +- 0.75% )
24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% )
1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% )
50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% )
1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% )
1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% )
132.474343877 seconds time elapsed ( +- 0.21% )
lockless:
12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% )
832,850 context-switches # 0.068 K/sec ( +- 0.54% )
15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% )
1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% )
32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% )
2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% )
1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% )
91.369330729 seconds time elapsed ( +- 0.45% )
On top of improved scalability, this also gets rid of the icky long long
types in the very heart of memcg, which is great for 32 bit and also makes
the code a lot more readable.
Notable differences between the old and new API:
- res_counter_charge() and res_counter_charge_nofail() become
page_counter_try_charge() and page_counter_charge() resp. to match
the more common kernel naming scheme of try_do()/do()
- res_counter_uncharge_until() is only ever used to cancel a local
counter and never to uncharge bigger segments of a hierarchy, so
it's replaced by the simpler page_counter_cancel()
- res_counter_set_limit() is replaced by page_counter_limit(), which
expects its callers to serialize against themselves
- res_counter_memparse_write_strategy() is replaced by
page_counter_limit(), which rounds down to the nearest page size -
rather than up. This is more reasonable for explicitely requested
hard upper limits.
- to keep charging light-weight, page_counter_try_charge() charges
speculatively, only to roll back if the result exceeds the limit.
Because of this, a failing bigger charge can temporarily lock out
smaller charges that would otherwise succeed. The error is bounded
to the difference between the smallest and the biggest possible
charge size, so for memcg, this means that a failing THP charge can
send base page charges into reclaim upto 2MB (4MB) before the limit
would have been reached. This should be acceptable.
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
[akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-11 02:42:31 +03:00
	page_counter_charge(&prot->memory_allocated, amt);
2011-12-12 01:47:03 +04:00
2014-12-11 02:42:31 +03:00
	if (page_counter_read(&prot->memory_allocated) >
	    prot->memory_allocated.limit)
2011-12-12 01:47:03 +04:00
		*parent_status = OVER_LIMIT;
}
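For orientation, a hedged sketch that condenses what memcg_memory_allocated_add() above does with the page_counter API described in the changelog; charge_pages_and_check() is a hypothetical helper, not part of this header:

/* Hypothetical helper: force-charge (cannot fail), then test against the limit. */
static inline bool charge_pages_and_check(struct page_counter *counter,
					  unsigned long nr_pages)
{
	page_counter_charge(counter, nr_pages);	/* may push the count past the limit */
	return page_counter_read(counter) <= counter->limit;
}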
static inline void memcg_memory_allocated_sub(struct cg_proto *prot,
					      unsigned long amt)
{
2014-12-11 02:42:31 +03:00
	page_counter_uncharge(&prot->memory_allocated, amt);
2011-12-12 01:47:03 +04:00
}
2011-12-12 01:47:02 +04:00
static inline long
sk_memory_allocated(const struct sock *sk)
{
	struct proto *prot = sk->sk_prot;
2014-12-11 02:42:31 +03:00
2011-12-12 01:47:03 +04:00
	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
2014-12-11 02:42:31 +03:00
		return page_counter_read(&sk->sk_cgrp->memory_allocated);
2011-12-12 01:47:03 +04:00
2011-12-12 01:47:02 +04:00
	return atomic_long_read(prot->memory_allocated);
}

static inline long
2011-12-12 01:47:03 +04:00
sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
2011-12-12 01:47:02 +04:00
{
	struct proto *prot = sk->sk_prot;
2011-12-12 01:47:03 +04:00
	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
		memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status);
		/* update the root cgroup regardless */
		atomic_long_add_return(amt, prot->memory_allocated);
2014-12-11 02:42:31 +03:00
		return page_counter_read(&sk->sk_cgrp->memory_allocated);
2011-12-12 01:47:03 +04:00
}
2011-12-12 01:47:02 +04:00
	return atomic_long_add_return(amt, prot->memory_allocated);
}
static inline void
net: introduce res_counter_charge_nofail() for socket allocations
There is a case in __sk_mem_schedule(), where an allocation
is beyond the maximum, but yet we are allowed to proceed.
It happens under the following condition:
sk->sk_wmem_queued + size >= sk->sk_sndbuf
The network code won't revert the allocation in this case,
meaning that at some point later it'll try to do it. Since
this is never communicated to the underlying res_counter
code, there is an imbalance in res_counter uncharge operation.
I see two ways of fixing this:
1) storing the information about those allocations somewhere
in memcg, and then deducting from that first, before
we start draining the res_counter,
2) providing a slightly different allocation function for
the res_counter, that matches the original behavior of
the network code more closely.
I decided to go for #2 here, believing it to be more elegant,
since #1 would require us to do basically that, but in a more
obscure way.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: Li Zefan <lizf@cn.fujitsu.com>
CC: Laurent Chavey <chavey@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-20 08:57:16 +04:00
sk_memory_allocated_sub(struct sock *sk, int amt)
2011-12-12 01:47:02 +04:00
{
	struct proto *prot = sk->sk_prot;
2011-12-12 01:47:03 +04:00
2012-01-20 08:57:16 +04:00
	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
2011-12-12 01:47:03 +04:00
		memcg_memory_allocated_sub(sk->sk_cgrp, amt);
2011-12-12 01:47:02 +04:00
	atomic_long_sub(amt, prot->memory_allocated);
}

static inline void sk_sockets_allocated_dec(struct sock *sk)
{
	struct proto *prot = sk->sk_prot;
2011-12-12 01:47:03 +04:00
	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
		struct cg_proto *cg_proto = sk->sk_cgrp;

		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
2013-10-20 03:26:19 +04:00
			percpu_counter_dec(&cg_proto->sockets_allocated);
2011-12-12 01:47:03 +04:00
}
2011-12-12 01:47:02 +04:00
	percpu_counter_dec(prot->sockets_allocated);
}

static inline void sk_sockets_allocated_inc(struct sock *sk)
{
	struct proto *prot = sk->sk_prot;
2011-12-12 01:47:03 +04:00
	if (mem_cgroup_sockets_enabled && sk->sk_cgrp) {
		struct cg_proto *cg_proto = sk->sk_cgrp;

		for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto))
2013-10-20 03:26:19 +04:00
			percpu_counter_inc(&cg_proto->sockets_allocated);
2011-12-12 01:47:03 +04:00
}
2011-12-12 01:47:02 +04:00
	percpu_counter_inc(prot->sockets_allocated);
}

static inline int
sk_sockets_allocated_read_positive(struct sock *sk)
{
	struct proto *prot = sk->sk_prot;
2011-12-12 01:47:03 +04:00
	if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
2013-10-20 03:26:19 +04:00
		return percpu_counter_read_positive(&sk->sk_cgrp->sockets_allocated);
2011-12-12 01:47:03 +04:00
2012-04-29 03:21:56 +04:00
	return percpu_counter_read_positive(prot->sockets_allocated);
2011-12-12 01:47:02 +04:00
}
static inline int
proto_sockets_allocated_sum_positive(struct proto *prot)
{
	return percpu_counter_sum_positive(prot->sockets_allocated);
}

static inline long
proto_memory_allocated(struct proto *prot)
{
	return atomic_long_read(prot->memory_allocated);
}

static inline bool
proto_memory_pressure(struct proto *prot)
{
	if (!prot->memory_pressure)
		return false;
	return !!*prot->memory_pressure;
}
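To make the indirection in the accessors above concrete, here is a hedged sketch of how a protocol typically wires these struct proto fields; the MYPROTO names are hypothetical and percpu_counter initialisation is omitted, but the shape loosely follows what TCP does:

atomic_long_t		myproto_memory_allocated;	/* pages charged to the protocol */
struct percpu_counter	myproto_sockets_allocated;	/* live socket count */
int			myproto_memory_pressure;	/* flag read via *memory_pressure */
long			myproto_sysctl_mem[3];		/* min, pressure, max (pages) */

static void myproto_enter_memory_pressure(struct sock *sk)
{
	myproto_memory_pressure = 1;
}

struct proto myproto_prot = {
	.name			= "MYPROTO",
	.memory_allocated	= &myproto_memory_allocated,
	.sockets_allocated	= &myproto_sockets_allocated,
	.memory_pressure	= &myproto_memory_pressure,
	.enter_memory_pressure	= myproto_enter_memory_pressure,
	.sysctl_mem		= myproto_sysctl_mem,
};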
2008-01-04 07:46:48 +03:00
# ifdef CONFIG_PROC_FS
2005-04-17 02:20:36 +04:00
/* Called with local bh disabled */
2013-09-22 21:32:26 +04:00
void sock_prot_inuse_add(struct net *net, struct proto *prot, int inc);
int sock_prot_inuse_get(struct net *net, struct proto *proto);
2008-01-04 07:46:48 +03:00
# else
2012-05-17 02:48:15 +04:00
static inline void sock_prot_inuse_add(struct net *net, struct proto *prot,
2008-04-01 06:41:46 +04:00
				       int inc)
2008-01-04 07:46:48 +03:00
{
}
# endif
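A hedged sketch of the usual call pattern for the /proc accounting hook above: protocols bump the per-netns counter by one when a socket is hashed and drop it again on unhash (myproto_hash/myproto_unhash are hypothetical):

static void myproto_hash(struct sock *sk)
{
	/* ... insert sk into the protocol's lookup table ... */
	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
}

static void myproto_unhash(struct sock *sk)
{
	/* ... remove sk from the lookup table ... */
	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
}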
2005-04-17 02:20:36 +04:00
2005-08-10 06:47:37 +04:00
/* With per-bucket locks this operation is not-atomic, so that
 * this version is not worse.
 */
static inline void __sk_prot_rehash(struct sock *sk)
{
	sk->sk_prot->unhash(sk);
	sk->sk_prot->hash(sk);
}
2010-12-17 01:26:56 +03:00
void sk_prot_clear_portaddr_nulls(struct sock *sk, int size);
2005-04-17 02:20:36 +04:00
/* About 10 seconds */
# define SOCK_DESTROY_TIME (10*HZ)
/* Sockets 0-1023 can't be bound to unless you are superuser */
# define PROT_SOCK 1024
# define SHUTDOWN_MASK 3
# define RCV_SHUTDOWN 1
# define SEND_SHUTDOWN 2
# define SOCK_SNDBUF_LOCK 1
# define SOCK_RCVBUF_LOCK 2
# define SOCK_BINDADDR_LOCK 4
# define SOCK_BINDPORT_LOCK 8
struct socket_alloc {
	struct socket socket;
	struct inode vfs_inode;
};

static inline struct socket *SOCKET_I(struct inode *inode)
{
	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
}

static inline struct inode *SOCK_INODE(struct socket *socket)
{
	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
}
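Because the socket and its inode are embedded in the same socket_alloc, the two helpers above are inverses; a hedged illustration (the helper is hypothetical):

static inline bool socket_inode_roundtrip(struct socket *sock)
{
	struct inode *inode = SOCK_INODE(sock);

	/* true for any socket whose inode came from the sockfs socket_alloc */
	return SOCKET_I(inode) == sock;
}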
2007-12-31 11:11:19 +03:00
/*
* Functions for memory accounting
*/
2013-09-22 21:32:26 +04:00
int __sk_mem_schedule(struct sock *sk, int size, int kind);
2015-05-15 22:39:25 +03:00
void __sk_mem_reclaim(struct sock *sk, int amount);
2005-04-17 02:20:36 +04:00
2007-12-31 11:11:19 +03:00
# define SK_MEM_QUANTUM ((int)PAGE_SIZE)
# define SK_MEM_QUANTUM_SHIFT ilog2(SK_MEM_QUANTUM)
# define SK_MEM_SEND 0
# define SK_MEM_RECV 1
2005-04-17 02:20:36 +04:00
2007-12-31 11:11:19 +03:00
static inline int sk_mem_pages(int amt)
2005-04-17 02:20:36 +04:00
{
2007-12-31 11:11:19 +03:00
	return (amt + SK_MEM_QUANTUM - 1) >> SK_MEM_QUANTUM_SHIFT;
2005-04-17 02:20:36 +04:00
}
2012-05-17 02:48:15 +04:00
static inline bool sk_has_account(struct sock *sk)
2005-04-17 02:20:36 +04:00
{
2007-12-31 11:11:19 +03:00
	/* return true if protocol supports memory accounting */
	return !!sk->sk_prot->memory_allocated;
2005-04-17 02:20:36 +04:00
}
2012-05-17 02:48:15 +04:00
static inline bool sk_wmem_schedule(struct sock *sk, int size)
2005-04-17 02:20:36 +04:00
{
2007-12-31 11:11:19 +03:00
	if (!sk_has_account(sk))
2012-05-17 02:48:15 +04:00
		return true;
2007-12-31 11:11:19 +03:00
	return size <= sk->sk_forward_alloc ||
		__sk_mem_schedule(sk, size, SK_MEM_SEND);
2005-04-17 02:20:36 +04:00
}
netvm: prevent a stream-specific deadlock
This patch series is based on top of "Swap-over-NBD without deadlocking
v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.
When a user or administrator requires swap for their application, they
create a swap partition and file, format it with mkswap and activate it
with swapon. In diskless systems this is not an option so if swap is
required then swapping over the network is considered. The two likely
scenarios are when blade servers are used as part of a cluster where the
form factor or maintenance costs do not allow the use of disks and thin
clients.
The Linux Terminal Server Project recommends the use of the Network Block
Device (NBD) for swap but this is not always an option. There is no
guarantee that the network attached storage (NAS) device is running Linux
or supports NBD. However, it is likely that it supports NFS so there are
users that want support for swapping over NFS despite any performance
concern. Some distributions currently carry patches that support swapping
over NFS but it would be preferable to support it in the mainline kernel.
Patch 1 avoids a stream-specific deadlock that potentially affects TCP.
Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
reserves.
Patch 3 adds three helpers for filesystems to handle swap cache pages.
For example, page_file_mapping() returns page->mapping for
file-backed pages and the address_space of the underlying
swap file for swap cache pages.
Patch 4 adds two address_space_operations to allow a filesystem
to pin all metadata relevant to a swapfile in memory. Upon
successful activation, the swapfile is marked SWP_FILE and
the address space operation ->direct_IO is used for writing
and ->readpage for reading in swap pages.
Patch 5 notes that patch 3 is bolting
filesystem-specific-swapfile-support onto the side and that
the default handlers have different information to what
is available to the filesystem. This patch refactors the
code so that there are generic handlers for each of the new
address_space operations.
Patch 6 adds an API to allow a vector of kernel addresses to be
translated to struct pages and pinned for IO.
Patch 7 adds support for using highmem pages for swap by kmapping
the pages before calling the direct_IO handler.
Patch 8 updates NFS to use the helpers from patch 3 where necessary.
Patch 9 avoids setting PF_private on PG_swapcache pages within NFS.
Patch 10 implements the new swapfile-related address_space operations
for NFS and teaches the direct IO handler how to manage
kernel addresses.
Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
where appropriate.
Patch 12 fixes a NULL pointer dereference that occurs when using
swap-over-NFS.
With the patches applied, it is possible to mount a swapfile that is on an
NFS filesystem. Swap performance is not great with a swap stress test
taking roughly twice as long to complete than if the swap device was
backed by NBD.
This patch: netvm: prevent a stream-specific deadlock
It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
that we're over the global rmem limit. This will prevent SOCK_MEMALLOC
buffers from receiving data, which will prevent userspace from running,
which is needed to reduce the buffered data.
Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once
this change is applied, it is important that sockets that set
SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
If this happens, a warning is generated and the tokens reclaimed to avoid
accounting errors until the bug is fixed.
[davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01 03:44:41 +04:00
static inline bool
2012-09-18 01:09:11 +04:00
sk_rmem_schedule(struct sock *sk, struct sk_buff *skb, int size)
2005-09-02 04:48:23 +04:00
{
2007-12-31 11:11:19 +03:00
	if (!sk_has_account(sk))
2012-05-17 02:48:15 +04:00
		return true;
2012-08-01 03:44:41 +04:00
	return size <= sk->sk_forward_alloc ||
		__sk_mem_schedule(sk, size, SK_MEM_RECV) ||
		skb_pfmemalloc(skb);
2007-12-31 11:11:19 +03:00
}

static inline void sk_mem_reclaim(struct sock *sk)
{
	if (!sk_has_account(sk))
		return;
	if (sk->sk_forward_alloc >= SK_MEM_QUANTUM)
2015-05-15 22:39:25 +03:00
		__sk_mem_reclaim(sk, sk->sk_forward_alloc);
2007-12-31 11:11:19 +03:00
}

2008-01-11 08:56:38 +03:00
static inline void sk_mem_reclaim_partial(struct sock *sk)
{
	if (!sk_has_account(sk))
		return;
	if (sk->sk_forward_alloc > SK_MEM_QUANTUM)
2015-05-15 22:39:25 +03:00
		__sk_mem_reclaim(sk, sk->sk_forward_alloc - 1);
2008-01-11 08:56:38 +03:00
}

2007-12-31 11:11:19 +03:00
static inline void sk_mem_charge(struct sock *sk, int size)
{
	if (!sk_has_account(sk))
		return;
	sk->sk_forward_alloc -= size;
}

static inline void sk_mem_uncharge(struct sock *sk, int size)
{
	if (!sk_has_account(sk))
		return;
	sk->sk_forward_alloc += size;
}

static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
{
	sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
	sk->sk_wmem_queued -= skb->truesize;
	sk_mem_uncharge(sk, skb->truesize);
	__kfree_skb(skb);
2005-09-02 04:48:23 +04:00
}
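A hedged sketch of how the helpers above compose on the send side; queue_skb_accounted() is hypothetical, and real callers such as the TCP sendmsg path are considerably more involved:

static int queue_skb_accounted(struct sock *sk, struct sk_buff *skb)
{
	/* reserve forward_alloc in SK_MEM_QUANTUM units, may call __sk_mem_schedule() */
	if (!sk_wmem_schedule(sk, skb->truesize))
		return -ENOBUFS;

	sk_mem_charge(sk, skb->truesize);	/* consume sk_forward_alloc */
	sk->sk_wmem_queued += skb->truesize;

	/* ... transmit; once the data is no longer needed ... */
	sk_wmem_free_skb(sk, skb);		/* shrinks wmem_queued, uncharges, frees */
	return 0;
}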
2005-04-17 02:20:36 +04:00
/* Used by processes to "lock" a socket state, so that
 * interrupts and bottom half handlers won't change it
 * from under us. It essentially blocks any incoming
 * packets, so that we won't get any new data or any
 * packets that change the state of the socket.
 *
 * While locked, BH processing will add new packets to
 * the backlog queue. This queue is processed by the
 * owner of the socket lock right before it is released.
 *
 * Since ~2.3.5 it is also exclusive sleep lock serializing
 * accesses from user process context.
*/
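A hedged sketch of the process-context pattern this comment describes; the helper is hypothetical, but the lock/modify/release shape is the one used by the protocol sendmsg and setsockopt paths:

static void set_sndbuf_locked(struct sock *sk, int val)
{
	lock_sock(sk);				/* exclusive owner lock; may sleep */
	sk->sk_userlocks |= SOCK_SNDBUF_LOCK;	/* state only touched under the lock */
	sk->sk_sndbuf = val;
	release_sock(sk);			/* processes the backlog queued by BH */
}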
2007-09-12 12:44:19 +04:00
# define sock_owned_by_user(sk) ((sk)->sk_lock.owned)
2005-04-17 02:20:36 +04:00
tcp: tcp_release_cb() should release socket ownership
Lars Persson reported following deadlock :
-000 |M:0x0:0x802B6AF8(asm) <-- arch_spin_lock
-001 |tcp_v4_rcv(skb = 0x8BD527A0) <-- sk = 0x8BE6B2A0
-002 |ip_local_deliver_finish(skb = 0x8BD527A0)
-003 |__netif_receive_skb_core(skb = 0x8BD527A0, ?)
-004 |netif_receive_skb(skb = 0x8BD527A0)
-005 |elk_poll(napi = 0x8C770500, budget = 64)
-006 |net_rx_action(?)
-007 |__do_softirq()
-008 |do_softirq()
-009 |local_bh_enable()
-010 |tcp_rcv_established(sk = 0x8BE6B2A0, skb = 0x87D3A9E0, th = 0x814EBE14, ?)
-011 |tcp_v4_do_rcv(sk = 0x8BE6B2A0, skb = 0x87D3A9E0)
-012 |tcp_delack_timer_handler(sk = 0x8BE6B2A0)
-013 |tcp_release_cb(sk = 0x8BE6B2A0)
-014 |release_sock(sk = 0x8BE6B2A0)
-015 |tcp_sendmsg(?, sk = 0x8BE6B2A0, ?, ?)
-016 |sock_sendmsg(sock = 0x8518C4C0, msg = 0x87D8DAA8, size = 4096)
-017 |kernel_sendmsg(?, ?, ?, ?, size = 4096)
-018 |smb_send_kvec()
-019 |smb_send_rqst(server = 0x87C4D400, rqst = 0x87D8DBA0)
-020 |cifs_call_async()
-021 |cifs_async_writev(wdata = 0x87FD6580)
-022 |cifs_writepages(mapping = 0x852096E4, wbc = 0x87D8DC88)
-023 |__writeback_single_inode(inode = 0x852095D0, wbc = 0x87D8DC88)
-024 |writeback_sb_inodes(sb = 0x87D6D800, wb = 0x87E4A9C0, work = 0x87D8DD88)
-025 |__writeback_inodes_wb(wb = 0x87E4A9C0, work = 0x87D8DD88)
-026 |wb_writeback(wb = 0x87E4A9C0, work = 0x87D8DD88)
-027 |wb_do_writeback(wb = 0x87E4A9C0, force_wait = 0)
-028 |bdi_writeback_workfn(work = 0x87E4A9CC)
-029 |process_one_work(worker = 0x8B045880, work = 0x87E4A9CC)
-030 |worker_thread(__worker = 0x8B045880)
-031 |kthread(_create = 0x87CADD90)
-032 |ret_from_kernel_thread(asm)
Bug occurs because __tcp_checksum_complete_user() enables BH, assuming
it is running from softirq context.
Lars trace involved a NIC without RX checksum support but other points
are problematic as well, like the prequeue stuff.
Problem is triggered by a timer, that found socket being owned by user.
tcp_release_cb() should call tcp_write_timer_handler() or
tcp_delack_timer_handler() in the appropriate context :
BH disabled and socket lock held, but 'owned' field cleared,
as if they were running from timer handlers.
Fixes: 6f458dfb4092 ("tcp: improve latencies of timer triggered events")
Reported-by: Lars Persson <lars.persson@axis.com>
Tested-by: Lars Persson <lars.persson@axis.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10 20:50:11 +04:00
static inline void sock_release_ownership(struct sock *sk)
{
	sk->sk_lock.owned = 0;
}
2006-12-07 07:35:24 +03:00
/*
* Macro so as to not evaluate some arguments when
* lockdep is not enabled .
*
* Mark both the sk_lock and the sk_lock . slock as a
* per - address - family lock class .
*/
2012-05-17 02:48:15 +04:00
#define sock_lock_init_class_and_name(sk, sname, skey, name, key)	\
do {									\
	sk->sk_lock.owned = 0;						\
	init_waitqueue_head(&sk->sk_lock.wq);				\
	spin_lock_init(&(sk)->sk_lock.slock);				\
	debug_check_no_locks_freed((void *)&(sk)->sk_lock,		\
			sizeof((sk)->sk_lock));				\
	lockdep_set_class_and_name(&(sk)->sk_lock.slock,		\
			(skey), (sname));				\
	lockdep_init_map(&(sk)->sk_lock.dep_map, (name), (key), 0);	\
} while (0)
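/*
 * A minimal usage sketch (hypothetical PF_FOO protocol, not part of this
 * header): each address family supplies its own lockdep keys and names so
 * that sk_lock classes are not conflated across families.
 */
static struct lock_class_key foo_slock_key, foo_sk_lock_key;

static void foo_sock_lock_init(struct sock *sk)
{
	sock_lock_init_class_and_name(sk, "slock-AF_FOO", &foo_slock_key,
				      "sk_lock-AF_FOO", &foo_sk_lock_key);
}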
2013-09-22 21:32:26 +04:00
void lock_sock_nested(struct sock *sk, int subclass);

static inline void lock_sock(struct sock *sk)
{
	lock_sock_nested(sk, 0);
}

void release_sock(struct sock *sk);
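/*
 * A minimal sketch of the process-context locking pattern these helpers
 * serve (hypothetical caller, not part of this header): lock_sock() may
 * sleep and takes ownership; release_sock() processes the backlog first.
 */
static void example_set_sndbuf_locked(struct sock *sk, int val)
{
	lock_sock(sk);
	sk->sk_sndbuf = val;	/* socket state may be modified safely here */
	release_sock(sk);
}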
2005-04-17 02:20:36 +04:00
/* BH context may only use the following locking interface. */
#define bh_lock_sock(__sk)	spin_lock(&((__sk)->sk_lock.slock))
#define bh_lock_sock_nested(__sk) \
				spin_lock_nested(&((__sk)->sk_lock.slock), \
				SINGLE_DEPTH_NESTING)
#define bh_unlock_sock(__sk)	spin_unlock(&((__sk)->sk_lock.slock))
2013-09-22 21:32:26 +04:00
bool lock_sock_fast(struct sock *sk);
2010-05-26 23:20:18 +04:00
/**
 * unlock_sock_fast - complement of lock_sock_fast
 * @sk: socket
 * @slow: slow mode
 *
 * fast unlock socket for user context.
 * If slow mode is on, we call regular release_sock()
 */
static inline void unlock_sock_fast(struct sock *sk, bool slow)
2010-04-29 01:35:48 +04:00
{
	if (slow)
		release_sock(sk);
	else
		spin_unlock_bh(&sk->sk_lock.slock);
}
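/*
 * A minimal sketch of the fast-lock pattern (hypothetical caller):
 * lock_sock_fast() returns true when it had to fall back to the slow,
 * sleeping lock, and that value must be passed back to unlock_sock_fast().
 */
static int example_read_state_fast(struct sock *sk)
{
	bool slow = lock_sock_fast(sk);
	int state = sk->sk_state;

	unlock_sock_fast(sk, slow);
	return state;
}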
2013-09-22 21:32:26 +04:00
struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
		      struct proto *prot, int kern);
void sk_free(struct sock *sk);
void sk_destruct(struct sock *sk);
struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority);

struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
			     gfp_t priority);
void sock_wfree(struct sk_buff *skb);
void skb_orphan_partial(struct sk_buff *skb);
void sock_rfree(struct sk_buff *skb);
void sock_efree(struct sk_buff *skb);
#ifdef CONFIG_INET
void sock_edemux(struct sk_buff *skb);
#else
#define sock_edemux(skb) sock_efree(skb)
#endif

int sock_setsockopt(struct socket *sock, int level, int op,
		    char __user *optval, unsigned int optlen);

int sock_getsockopt(struct socket *sock, int level, int op,
		    char __user *optval, int __user *optlen);
struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
				    int noblock, int *errcode);
struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
				     unsigned long data_len, int noblock,
				     int *errcode, int max_page_order);
void *sock_kmalloc(struct sock *sk, int size, gfp_t priority);
void sock_kfree_s(struct sock *sk, void *mem, int size);
void sock_kzfree_s(struct sock *sk, void *mem, int size);
void sk_send_sigurg(struct sock *sk);
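/*
 * A minimal sketch, assuming a hypothetical protocol that keeps a small
 * per-socket option blob: sock_kmalloc() charges the allocation against
 * the socket's option memory, so the matching free must go through
 * sock_kfree_s() (or sock_kzfree_s() when the buffer held key material).
 */
static int example_stash_secret(struct sock *sk, const void *data, int len)
{
	void *buf = sock_kmalloc(sk, len, GFP_KERNEL);

	if (!buf)
		return -ENOMEM;
	memcpy(buf, data, len);
	/* ... use the buffer, then release and uncharge it ... */
	sock_kzfree_s(sk, buf, len);
	return 0;
}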
2005-04-17 02:20:36 +04:00
2015-10-09 00:56:48 +03:00
struct sockcm_cookie {
	u32 mark;
};

int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
		   struct sockcm_cookie *sockc);
2005-04-17 02:20:36 +04:00
/*
* Functions to fill in entries in struct proto_ops when a protocol
 * does not implement a particular function.
*/
2013-09-22 21:32:26 +04:00
int sock_no_bind(struct socket *, struct sockaddr *, int);
int sock_no_connect(struct socket *, struct sockaddr *, int, int);
int sock_no_socketpair(struct socket *, struct socket *);
int sock_no_accept(struct socket *, struct socket *, int);
int sock_no_getname(struct socket *, struct sockaddr *, int *, int);
unsigned int sock_no_poll(struct file *, struct socket *,
			  struct poll_table_struct *);
int sock_no_ioctl(struct socket *, unsigned int, unsigned long);
int sock_no_listen(struct socket *, int);
int sock_no_shutdown(struct socket *, int);
int sock_no_getsockopt(struct socket *, int, int, char __user *, int __user *);
int sock_no_setsockopt(struct socket *, int, int, char __user *, unsigned int);
int sock_no_sendmsg(struct socket *, struct msghdr *, size_t);
int sock_no_recvmsg(struct socket *, struct msghdr *, size_t, int);
int sock_no_mmap(struct file *file, struct socket *sock,
		 struct vm_area_struct *vma);
ssize_t sock_no_sendpage(struct socket *sock, struct page *page, int offset,
			 size_t size, int flags);
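/*
 * A minimal sketch (hypothetical PF_FOO datagram protocol) of wiring the
 * sock_no_* stubs into a proto_ops for operations the protocol does not
 * support; the remaining handlers would point at real implementations.
 */
static const struct proto_ops foo_dgram_ops = {
	.family		= PF_UNSPEC,		/* placeholder family */
	.listen		= sock_no_listen,	/* datagrams: no listen() */
	.accept		= sock_no_accept,	/* datagrams: no accept() */
	.socketpair	= sock_no_socketpair,
	.mmap		= sock_no_mmap,
	.sendpage	= sock_no_sendpage,
};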
2005-04-17 02:20:36 +04:00
/*
* Functions to fill in entries in struct proto_ops when a protocol
 * uses the inet style.
*/
2013-09-22 21:32:26 +04:00
int sock_common_getsockopt(struct socket *sock, int level, int optname,
			   char __user *optval, int __user *optlen);
int sock_common_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
			int flags);
int sock_common_setsockopt(struct socket *sock, int level, int optname,
			   char __user *optval, unsigned int optlen);
int compat_sock_common_getsockopt(struct socket *sock, int level,
				  int optname, char __user *optval, int __user *optlen);
int compat_sock_common_setsockopt(struct socket *sock, int level,
				  int optname, char __user *optval, unsigned int optlen);

void sk_common_release(struct sock *sk);
2005-04-17 02:20:36 +04:00
/*
* Default socket callbacks and setup code
*/
2012-05-17 02:48:15 +04:00
2005-04-17 02:20:36 +04:00
/* Initialise core socket variables */
2013-09-22 21:32:26 +04:00
void sock_init_data(struct socket *sock, struct sock *sk);
2005-04-17 02:20:36 +04:00
/*
 * Socket reference counting postulates.
 *
 * * Each user of socket SHOULD hold a reference count.
 * * Each access point to socket (a hash table bucket, reference from a list,
 *   running timer, skb in flight) MUST hold a reference count.
 * * When reference count hits 0, it means it will never increase back.
 * * When reference count hits 0, it means that no references from
 *   outside exist to this socket and current process on current CPU
 *   is last user and may/should destroy this socket.
 * * sk_free is called from any context: process, BH, IRQ. When
 *   it is called, socket has no references from outside -> sk_free
 *   may release descendant resources allocated by the socket, but
 *   by the time it is called, socket is NOT referenced by any
 *   hash tables, lists etc.
 * * Packets, delivered from outside (from network or from another process)
 *   and enqueued on receive/error queues SHOULD NOT grab reference count,
 *   while they sit in queue. Otherwise, packets will leak to hole, when
 *   socket is looked up by one cpu and unhashing is made by another CPU.
 *   It is true for udp/raw, netlink (leak to receive and error queues), tcp
 *   (leak to backlog). Packet socket does all the processing inside
 *   BR_NETPROTO_LOCK, so that it has not this race condition. UNIX sockets
 *   use a separate SMP lock, so that they are prone too.
 */
/* Ungrab socket and destroy it, if it was the last reference. */
static inline void sock_put(struct sock *sk)
{
	if (atomic_dec_and_test(&sk->sk_refcnt))
		sk_free(sk);
}
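/*
 * A minimal sketch of the hold/put pattern described above (hypothetical
 * container, not part of this header): any code that stashes a struct sock
 * pointer pins it with sock_hold() and drops it again with sock_put().
 */
struct example_entry {
	struct sock *sk;
};

static void example_entry_attach(struct example_entry *e, struct sock *sk)
{
	sock_hold(sk);		/* the entry now owns a reference */
	e->sk = sk;
}

static void example_entry_detach(struct example_entry *e)
{
	sock_put(e->sk);	/* may free the socket if this was the last ref */
	e->sk = NULL;
}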
tcp/dccp: remove twchain
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 11:22:02 +04:00
/* Generic version of sock_put(), dealing with all sockets
 * (TCP_TIMEWAIT, TCP_NEW_SYN_RECV, ESTABLISHED...)
*/
void sock_gen_put(struct sock *sk);
2005-04-17 02:20:36 +04:00
2013-09-22 21:32:26 +04:00
int sk_receive_skb(struct sock *sk, struct sk_buff *skb, const int nested);
2005-12-27 07:42:22 +03:00
2009-10-20 03:46:20 +04:00
static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
{
	sk->sk_tx_queue_mapping = tx_queue;
}

static inline void sk_tx_queue_clear(struct sock *sk)
{
	sk->sk_tx_queue_mapping = -1;
}

static inline int sk_tx_queue_get(const struct sock *sk)
{
	return sk ? sk->sk_tx_queue_mapping : -1;
}
2008-06-18 09:41:38 +04:00
static inline void sk_set_socket(struct sock *sk, struct socket *sock)
{
	sk_tx_queue_clear(sk);
	sk->sk_socket = sock;
}
2010-04-20 17:03:51 +04:00
static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
	BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
	return &rcu_dereference_raw(sk->sk_wq)->wait;
}
2005-04-17 02:20:36 +04:00
/* Detach socket from process context.
 * Announce socket dead, detach it from wait queue and inode.
 * Note that parent inode held reference count on this struct sock,
 * we do not release it in this function, because protocol
 * probably wants some additional cleanups or even continuing
 * to work with this socket (TCP).
 */
static inline void sock_orphan(struct sock *sk)
{
	write_lock_bh(&sk->sk_callback_lock);
	sock_set_flag(sk, SOCK_DEAD);
	sk_set_socket(sk, NULL);
	sk->sk_wq = NULL;
	write_unlock_bh(&sk->sk_callback_lock);
}

static inline void sock_graft(struct sock *sk, struct socket *parent)
{
	write_lock_bh(&sk->sk_callback_lock);
	sk->sk_wq = parent->wq;
	parent->sk = sk;
	sk_set_socket(sk, parent);
	security_sock_graft(sk, parent);
	write_unlock_bh(&sk->sk_callback_lock);
}
2013-09-22 21:32:26 +04:00
kuid_t sock_i_uid(struct sock *sk);
unsigned long sock_i_ino(struct sock *sk);
2005-04-17 02:20:36 +04:00
2015-09-16 01:24:20 +03:00
static inline u32 net_tx_rndhash(void)
{
	u32 v = prandom_u32();

	return v ?: 1;
}

static inline void sk_set_txhash(struct sock *sk)
{
	sk->sk_txhash = net_tx_rndhash();
}

static inline void sk_rethink_txhash(struct sock *sk)
{
	if (sk->sk_txhash)
		sk_set_txhash(sk);
}
2005-04-17 02:20:36 +04:00
static inline struct dst_entry *
__sk_dst_get(struct sock *sk)
{
	return rcu_dereference_check(sk->sk_dst_cache, sock_owned_by_user(sk) ||
						       lockdep_is_held(&sk->sk_lock.slock));
}

static inline struct dst_entry *
sk_dst_get(struct sock *sk)
{
	struct dst_entry *dst;

	rcu_read_lock();
	dst = rcu_dereference(sk->sk_dst_cache);
	if (dst && !atomic_inc_not_zero(&dst->__refcnt))
		dst = NULL;
	rcu_read_unlock();
	return dst;
}
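/*
 * A minimal sketch of consuming the cached route (hypothetical helper,
 * not part of this header): sk_dst_get() returns a referenced dst or NULL,
 * so every call is balanced with dst_release().
 */
static u32 example_cached_mtu(struct sock *sk)
{
	struct dst_entry *dst = sk_dst_get(sk);
	u32 mtu = dst ? dst_mtu(dst) : 0;

	dst_release(dst);	/* dst_release(NULL) is a no-op */
	return mtu;
}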
2010-04-09 03:03:29 +04:00
static inline void dst_negative_advice(struct sock *sk)
{
	struct dst_entry *ndst, *dst = __sk_dst_get(sk);

	sk_rethink_txhash(sk);

	if (dst && dst->ops->negative_advice) {
		ndst = dst->ops->negative_advice(dst);

		if (ndst != dst) {
			rcu_assign_pointer(sk->sk_dst_cache, ndst);
			sk_tx_queue_clear(sk);
		}
	}
}
2005-04-17 02:20:36 +04:00
static inline void
__sk_dst_set(struct sock *sk, struct dst_entry *dst)
{
	struct dst_entry *old_dst;

	sk_tx_queue_clear(sk);
	/*
	 * This can be called while sk is owned by the caller only,
	 * with no state that can be checked in a rcu_dereference_check() cond
	 */
	old_dst = rcu_dereference_raw(sk->sk_dst_cache);
	rcu_assign_pointer(sk->sk_dst_cache, dst);
	dst_release(old_dst);
}

static inline void
sk_dst_set(struct sock *sk, struct dst_entry *dst)
{
	struct dst_entry *old_dst;

	sk_tx_queue_clear(sk);
	old_dst = xchg((__force struct dst_entry **)&sk->sk_dst_cache, dst);
	dst_release(old_dst);
}

static inline void
__sk_dst_reset(struct sock *sk)
{
	__sk_dst_set(sk, NULL);
}

static inline void
sk_dst_reset(struct sock *sk)
{
	sk_dst_set(sk, NULL);
}
2013-09-22 21:32:26 +04:00
struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie);

struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie);

bool sk_mc_loop(struct sock *sk);
2012-05-17 02:48:15 +04:00
static inline bool sk_can_gso(const struct sock *sk)
{
	return net_gso_ok(sk->sk_route_caps, sk->sk_gso_type);
}
2013-09-22 21:32:26 +04:00
void sk_setup_caps(struct sock *sk, struct dst_entry *dst);
2005-08-10 06:49:02 +04:00
2011-11-15 19:29:55 +04:00
static inline void sk_nocaps_add(struct sock *sk, netdev_features_t flags)
{
	sk->sk_route_nocaps |= flags;
	sk->sk_route_caps &= ~flags;
}
2011-04-05 09:30:30 +04:00
static inline int skb_do_copy_data_nocache(struct sock *sk, struct sk_buff *skb,
					   struct iov_iter *from, char *to,
					   int copy, int offset)
{
	if (skb->ip_summed == CHECKSUM_NONE) {
		__wsum csum = 0;
		if (csum_and_copy_from_iter(to, copy, &csum, from) != copy)
			return -EFAULT;
		skb->csum = csum_block_add(skb->csum, csum, offset);
	} else if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY) {
		if (copy_from_iter_nocache(to, copy, from) != copy)
			return -EFAULT;
	} else if (copy_from_iter(to, copy, from) != copy)
		return -EFAULT;

	return 0;
}
static inline int skb_add_data_nocache(struct sock *sk, struct sk_buff *skb,
				       struct iov_iter *from, int copy)
{
	int err, offset = skb->len;

	err = skb_do_copy_data_nocache(sk, skb, from, skb_put(skb, copy),
				       copy, offset);
	if (err)
		__skb_trim(skb, offset);

	return err;
}
2014-11-28 21:40:20 +03:00
static inline int skb_copy_to_page_nocache(struct sock *sk, struct iov_iter *from,
					   struct sk_buff *skb,
					   struct page *page,
					   int off, int copy)
{
	int err;

	err = skb_do_copy_data_nocache(sk, skb, from, page_address(page) + off,
				       copy, skb->len);
	if (err)
		return err;

	skb->len	   += copy;
	skb->data_len	   += copy;
	skb->truesize	   += copy;
	sk->sk_wmem_queued += copy;
	sk_mem_charge(sk, copy);
	return 0;
}
2009-06-16 14:12:03 +04:00
/**
 * sk_wmem_alloc_get - returns write allocations
 * @sk: socket
 *
 * Returns sk_wmem_alloc minus initial offset of one
 */
static inline int sk_wmem_alloc_get(const struct sock *sk)
{
	return atomic_read(&sk->sk_wmem_alloc) - 1;
}

/**
 * sk_rmem_alloc_get - returns read allocations
 * @sk: socket
 *
 * Returns sk_rmem_alloc
 */
static inline int sk_rmem_alloc_get(const struct sock *sk)
{
	return atomic_read(&sk->sk_rmem_alloc);
}

/**
 * sk_has_allocations - check if allocations are outstanding
 * @sk: socket
 *
 * Returns true if socket has write or read allocations
 */
static inline bool sk_has_allocations(const struct sock *sk)
{
	return sk_wmem_alloc_get(sk) || sk_rmem_alloc_get(sk);
}
2009-07-08 16:09:13 +04:00
/**
 * skwq_has_sleeper - check if there are any waiting processes
 * @wq: struct socket_wq
 *
 * Returns true if socket_wq has waiting processes
 *
 * The purpose of the skwq_has_sleeper and sock_poll_wait is to wrap the memory
 * barrier call. They were added due to the race found within the tcp code.
 *
 * Consider following tcp code paths:
 *
 * CPU1                    CPU2
 *
 * sys_select              receive packet
 *   ...                   ...
 *   __add_wait_queue      update tp->rcv_nxt
 *   ...                   ...
 *   tp->rcv_nxt check     sock_def_readable
 *   ...                   {
 *   schedule                rcu_read_lock();
 *                           wq = rcu_dereference(sk->sk_wq);
 *                           if (wq && waitqueue_active(&wq->wait))
 *                               wake_up_interruptible(&wq->wait)
 *                           ...
 *                         }
 *
 * The race for tcp fires when the __add_wait_queue changes done by CPU1 stay
 * in its cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1
 * could then end up calling schedule and sleep forever if there are no more
 * data on the socket.
 *
 */
static inline bool skwq_has_sleeper(struct socket_wq *wq)
{
	return wq && wq_has_sleeper(&wq->wait);
}
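/*
 * A minimal sketch of the wakeup side this helper is meant for
 * (hypothetical sk_data_ready-style callback, not part of this header):
 * the barrier implied by wq_has_sleeper() pairs with the one taken in
 * sock_poll_wait() below.
 */
static void example_data_ready(struct sock *sk)
{
	struct socket_wq *wq;

	rcu_read_lock();
	wq = rcu_dereference(sk->sk_wq);
	if (skwq_has_sleeper(wq))
		wake_up_interruptible_poll(&wq->wait, POLLIN | POLLRDNORM);
	rcu_read_unlock();
}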
/**
 * sock_poll_wait - place memory barrier behind the poll_wait call.
 * @filp:          file
 * @wait_address:  socket wait queue
 * @p:             poll_table
 *
 * See the comments in the wq_has_sleeper function.
 */
static inline void sock_poll_wait(struct file *filp,
		wait_queue_head_t *wait_address, poll_table *p)
{
poll: add poll_requested_events() and poll_does_not_wait() functions
In some cases the poll() implementation in a driver has to do different
things depending on the events the caller wants to poll for. An example
is when a driver needs to start a DMA engine if the caller polls for
POLLIN, but doesn't want to do that if POLLIN is not requested but instead
only POLLOUT or POLLPRI is requested. This is something that can happen
in the video4linux subsystem among others.
Unfortunately, the current epoll/poll/select implementation doesn't
provide that information reliably. The poll_table_struct does have it: it
has a key field with the event mask. But once a poll() call matches one
or more bits of that mask any following poll() calls are passed a NULL
poll_table pointer.
Also, the eventpoll implementation always left the key field at ~0 instead
of using the requested events mask.
This was changed in eventpoll.c so the key field now contains the actual
events that should be polled for as set by the caller.
The solution to the NULL poll_table pointer is to set the qproc field to
NULL in poll_table once poll() matches the events, not the poll_table
pointer itself. That way drivers can obtain the mask through a new
poll_requested_events inline.
The poll_table_struct can still be NULL since some kernel code calls it
internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h). In
that case poll_requested_events() returns ~0 (i.e. all events).
Very rarely drivers might want to know whether poll_wait will actually
wait. If another earlier file descriptor in the set already matched the
events the caller wanted to wait for, then the kernel will return from the
select() call without waiting. This might be useful information in order
to avoid doing expensive work.
A new helper function poll_does_not_wait() is added that drivers can use
to detect this situation. This is now used in sock_poll_wait() in
include/net/sock.h. This was the only place in the kernel that needed
this information.
Drivers should no longer access any of the poll_table internals, but use
the poll_requested_events() and poll_does_not_wait() access functions
instead. In order to enforce that the poll_table fields are now prepended
with an underscore and a comment was added warning against using them
directly.
This required a change in unix_dgram_poll() in unix/af_unix.c which used
the key field to get the requested events. It's been replaced by a call
to poll_requested_events().
For qproc it was especially important to change its name since the
behavior of that field changes with this patch since this function pointer
can now be NULL when that wasn't possible in the past.
Any driver accessing the qproc or key fields directly will now fail to compile.
Some notes regarding the correctness of this patch: the driver's poll()
function is called with a 'struct poll_table_struct *wait' argument. This
pointer may or may not be NULL, drivers can never rely on it being one or
the other as that depends on whether or not an earlier file descriptor in
the select()'s fdset matched the requested events.
There are only three things a driver can do with the wait argument:
1) obtain the key field:
events = wait ? wait->key : ~0;
This will still work although it should be replaced with the new
poll_requested_events() function (which does exactly the same).
This will now even work better, since wait is no longer set to NULL
unnecessarily.
2) use the qproc callback. This could be deadly since qproc can now be
NULL. Renaming qproc should prevent this from happening. There are no
kernel drivers that actually access this callback directly, BTW.
3) test whether wait == NULL to determine whether poll would return without
waiting. This is no longer sufficient as the correct test is now
wait == NULL || wait->_qproc == NULL.
However, the worst that can happen here is a slight performance hit in
the case where wait != NULL and wait->_qproc == NULL. In that case the
driver will assume that poll_wait() will actually add the fd to the set
of waiting file descriptors. Of course, poll_wait() will not do that
since it tests for wait->_qproc. This will not break anything, though.
There is only one place in the whole kernel where this happens
(sock_poll_wait() in include/net/sock.h) and that code will be replaced
by a call to poll_does_not_wait() in the next patch.
Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
actually waiting. The next file descriptor from the set might match the
event mask and thus any possible waits will never happen.
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Reviewed-by: Jonathan Corbet <corbet@lwn.net>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-24 02:02:27 +04:00
	if (!poll_does_not_wait(p) && wait_address) {
		poll_wait(filp, wait_address, p);
		/* We need to be sure we are in sync with the
		 * socket flags modification.
		 *
		 * This memory barrier is paired in the wq_has_sleeper.
		 */
		smp_mb();
	}
}
2014-07-02 08:32:17 +04:00
static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk)
{
	if (sk->sk_txhash) {
		skb->l4_hash = 1;
		skb->hash = sk->sk_txhash;
	}
}
2015-11-02 02:36:55 +03:00
void skb_set_owner_w(struct sk_buff *skb, struct sock *sk);
2005-04-17 02:20:36 +04:00
/*
 *	Queue a received datagram if it will fit. Stream and sequenced
 *	protocols can't normally use this as they need to fit buffers in
 *	and play with them.
 *
 *	Inlined as it's very short and called for pretty much every
 *	packet ever received.
 */
static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
	skb_orphan(skb);
	skb->sk = sk;
	skb->destructor = sock_rfree;
	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
	sk_mem_charge(sk, skb->truesize);
}
2013-09-22 21:32:26 +04:00
void sk_reset_timer(struct sock *sk, struct timer_list *timer,
		    unsigned long expires);

void sk_stop_timer(struct sock *sk, struct timer_list *timer);

int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);

int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb);

struct sk_buff *sock_dequeue_err_skb(struct sock *sk);
2005-04-17 02:20:36 +04:00
/*
 *	Recover an error report and clear atomically
 */

static inline int sock_error(struct sock *sk)
{
	int err;

	if (likely(!sk->sk_err))
		return 0;

	err = xchg(&sk->sk_err, 0);
	return -err;
}
static inline unsigned long sock_wspace(struct sock *sk)
{
	int amt = 0;

	if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
		amt = sk->sk_sndbuf - atomic_read(&sk->sk_wmem_alloc);
		if (amt < 0)
			amt = 0;
	}
	return amt;
}
2015-11-30 07:03:11 +03:00
/* Note:
 * We use sk->sk_wq_raw, from contexts knowing this
 * pointer is not NULL and cannot disappear/change.
 */
static inline void sk_set_bit(int nr, struct sock *sk)
{
	set_bit(nr, &sk->sk_wq_raw->flags);
}

static inline void sk_clear_bit(int nr, struct sock *sk)
{
	clear_bit(nr, &sk->sk_wq_raw->flags);
}
2015-11-30 07:03:11 +03:00
static inline void sk_wake_async(const struct sock *sk, int how, int band)
{
	if (sock_flag(sk, SOCK_FASYNC)) {
		rcu_read_lock();
		sock_wake_async(rcu_dereference(sk->sk_wq), how, band);
		rcu_read_unlock();
	}
}
net: sock: adapt SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF
The current situation is that SOCK_MIN_RCVBUF is 2048 + sizeof(struct sk_buff))
while SOCK_MIN_SNDBUF is 2048. Since in both cases, skb->truesize is used for
sk_{r,w}mem_alloc accounting, we should have both sizes adjusted via defining a
TCP_SKB_MIN_TRUESIZE.
Further, as Eric Dumazet points out, the minimal skb truesize in transmit path is
SKB_TRUESIZE(2048) after commit f07d960df33c5 ("tcp: avoid frag allocation for
small frames"), and tcp_sendmsg() tries to limit skb size to half the congestion
window, meaning we try to build two skbs at minimum. Thus, having SOCK_MIN_SNDBUF
as 2048 can hit a small regression for some applications setting too low
SO_SNDBUF / SO_RCVBUF. Note that we define a TCP_SKB_MIN_TRUESIZE, because
SKB_TRUESIZE(2048) adds SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), but in
case of TCP skbs, the skb_shared_info is part of the 2048 bytes allocation for
skb->head.
The minor adaption in sk_stream_moderate_sndbuf() is to silence a warning by
using a typed max macro, as similarly done in SOCK_MIN_RCVBUF occurences, that
would appear otherwise.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 14:51:20 +04:00
/* Since sk_{r,w}mem_alloc sums skb->truesize, even a small frame might
 * need sizeof(sk_buff) + MTU + padding, unless net driver performs copybreak.
 * Note: for send buffers, TCP works better if we can build two skbs at
 * minimum.
 */
#define TCP_SKB_MIN_TRUESIZE	(2048 + SKB_DATA_ALIGN(sizeof(struct sk_buff)))
#define SOCK_MIN_SNDBUF		(TCP_SKB_MIN_TRUESIZE * 2)
#define SOCK_MIN_RCVBUF		 TCP_SKB_MIN_TRUESIZE
2005-04-17 02:20:36 +04:00
static inline void sk_stream_moderate_sndbuf(struct sock *sk)
{
	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK)) {
		sk->sk_sndbuf = min(sk->sk_sndbuf, sk->sk_wmem_queued >> 1);
		sk->sk_sndbuf = max_t(u32, sk->sk_sndbuf, SOCK_MIN_SNDBUF);
	}
}
2015-05-19 23:26:55 +03:00
struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
				    bool force_schedule);
2005-04-17 02:20:36 +04:00
net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
It's done to increase the probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page
It's also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.
This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, that's order-3 pages on x86)
This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
It's possible some SG enabled hardware can't cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536
Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 03:04:42 +04:00
/**
 * sk_page_frag - return an appropriate page_frag
 * @sk: socket
 *
 * If socket allocation mode allows current thread to sleep, it means it is
 * safe to use the per task page_frag instead of the per socket one.
 */
static inline struct page_frag *sk_page_frag(struct sock *sk)
{
	if (gfpflags_allow_blocking(sk->sk_allocation))
		return &current->task_frag;
	return &sk->sk_frag;
2005-04-17 02:20:36 +04:00
}
2013-09-22 21:32:26 +04:00
bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag);
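/*
 * A minimal sketch of the intended use (hypothetical sender path, not part
 * of this header): pick the frag with sk_page_frag(), top it up with
 * sk_page_frag_refill(), then copy payload at pfrag->offset.
 */
static int example_reserve_frag(struct sock *sk, int bytes,
				struct page_frag **out)
{
	struct page_frag *pfrag = sk_page_frag(sk);

	if (!sk_page_frag_refill(sk, pfrag))
		return -ENOMEM;		/* TCP would instead wait for memory */
	if (pfrag->size - pfrag->offset < bytes)
		return -EMSGSIZE;	/* caller would split the copy */

	*out = pfrag;
	return 0;
}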
2005-04-17 02:20:36 +04:00
/*
 *	Default write policy as shown to user space via poll/select/SIGIO
 */
static inline bool sock_writeable(const struct sock *sk)
{
	return atomic_read(&sk->sk_wmem_alloc) < (sk->sk_sndbuf >> 1);
}
2005-10-07 10:46:04 +04:00
static inline gfp_t gfp_any(void)
{
	return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
}
2012-05-17 02:48:15 +04:00
static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
{
	return noblock ? 0 : sk->sk_rcvtimeo;
}

static inline long sock_sndtimeo(const struct sock *sk, bool noblock)
{
	return noblock ? 0 : sk->sk_sndtimeo;
}

static inline int sock_rcvlowat(const struct sock *sk, int waitall, int len)
{
	return (waitall ? len : min_t(int, sk->sk_rcvlowat, len)) ? : 1;
}

/* Alas, with timeout socket operations are not restartable.
 * Compare this to poll().
 */
static inline int sock_intr_errno(long timeo)
{
	return timeo == MAX_SCHEDULE_TIMEOUT ? -ERESTARTSYS : -EINTR;
}
2015-03-01 15:58:31 +03:00
struct sock_skb_cb {
	u32 dropcount;
};

/* Store sock_skb_cb at the end of skb->cb[] so protocol families
 * using skb->cb[] would keep using it directly and utilize its
 * alignment guarantee.
 */
#define SOCK_SKB_CB_OFFSET ((FIELD_SIZEOF(struct sk_buff, cb) - \
			    sizeof(struct sock_skb_cb)))

#define SOCK_SKB_CB(__skb) ((struct sock_skb_cb *)((__skb)->cb + \
			    SOCK_SKB_CB_OFFSET))

#define sock_skb_cb_check_size(size) \
	BUILD_BUG_ON((size) > SOCK_SKB_CB_OFFSET)

static inline void
sock_skb_set_dropcount(const struct sock *sk, struct sk_buff *skb)
{
	SOCK_SKB_CB(skb)->dropcount = atomic_read(&sk->sk_drops);
}
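/*
 * A minimal sketch (hypothetical "foo" protocol, not part of this header):
 * the protocol keeps its own control block at the front of skb->cb[] and
 * uses sock_skb_cb_check_size() to guarantee it cannot overlap the
 * sock_skb_cb that the macros above place at the end of the array.
 */
struct foo_skb_cb {
	u32 seq;
	u32 flags;
};

static inline void foo_skb_cb_check(void)
{
	sock_skb_cb_check_size(sizeof(struct foo_skb_cb));
}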
2013-09-22 21:32:26 +04:00
void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
			   struct sk_buff *skb);
void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
			     struct sk_buff *skb);
2007-03-26 09:14:49 +04:00
2012-05-17 02:48:15 +04:00
static inline void
sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
{
	ktime_t kt = skb->tstamp;
	struct skb_shared_hwtstamps *hwtstamps = skb_hwtstamps(skb);

	/*
	 * generate control messages if
	 * - receive time stamping in software requested
	 * - software time stamp available and wanted
	 * - hardware time stamps available and wanted
	 */
	if (sock_flag(sk, SOCK_RCVTSTAMP) ||
	    (sk->sk_tsflags & SOF_TIMESTAMPING_RX_SOFTWARE) ||
	    (kt.tv64 && sk->sk_tsflags & SOF_TIMESTAMPING_SOFTWARE) ||
	    (hwtstamps->hwtstamp.tv64 &&
	     (sk->sk_tsflags & SOF_TIMESTAMPING_RAW_HARDWARE)))
		__sock_recv_timestamp(msg, sk, skb);
	else
		sk->sk_stamp = kt;

	if (sock_flag(sk, SOCK_WIFI_STATUS) && skb->wifi_acked_valid)
		__sock_recv_wifi_status(msg, sk, skb);
}
2013-09-22 21:32:26 +04:00
void __sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
			      struct sk_buff *skb);

static inline void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
					  struct sk_buff *skb)
{
#define FLAGS_TS_OR_DROPS ((1UL << SOCK_RXQ_OVFL) | \
			   (1UL << SOCK_RCVTSTAMP))
#define TSFLAGS_ANY	  (SOF_TIMESTAMPING_SOFTWARE | \
			   SOF_TIMESTAMPING_RAW_HARDWARE)

	if (sk->sk_flags & FLAGS_TS_OR_DROPS || sk->sk_tsflags & TSFLAGS_ANY)
		__sock_recv_ts_and_drops(msg, sk, skb);
	else
		sk->sk_stamp = skb->tstamp;
}
net: Generalize socket rx gap / receive queue overflow cmsg
Create a new socket level option to report number of queue overflows
Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. After I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch, It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the number of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.
Notes:
1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.
2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if that's
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.
3) This applies cleanly to net-next assuming that commit
977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-13 00:26:31 +04:00
2014-09-09 03:58:58 +04:00
void __sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags);
2009-02-12 08:03:38 +03:00
/**
 * sock_tx_timestamp - checks whether the outgoing packet is to be time stamped
 * @sk:		socket sending this packet
 * @tx_flags:	completed with instructions for time stamping
 *
 * Note: callers should take care of initial *tx_flags value (usually 0)
 */
static inline void sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags)
{
	if (unlikely(sk->sk_tsflags))
		__sock_tx_timestamp(sk, tx_flags);
	if (unlikely(sock_flag(sk, SOCK_WIFI_STATUS)))
		*tx_flags |= SKBTX_WIFI_STATUS;
}
2009-02-12 08:03:38 +03:00
2005-04-17 02:20:36 +04:00
/**
 * sk_eat_skb - Release a skb if it is no longer needed
 * @sk: socket to eat this skb from
 * @skb: socket buffer to eat
 *
 * This routine must be called with interrupts disabled or with the socket
 * locked so that the sk_buff queue operation is ok.
 */
static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
{
	__skb_unlink(skb, &sk->sk_receive_queue);
	__kfree_skb(skb);
}
2008-03-25 20:26:21 +03:00
static inline
struct net *sock_net(const struct sock *sk)
{
	return read_pnet(&sk->sk_net);
}

static inline
void sock_net_set(struct sock *sk, struct net *net)
{
	write_pnet(&sk->sk_net, net);
}
2008-10-07 23:41:01 +04:00
static inline struct sock *skb_steal_sock(struct sk_buff *skb)
{
	if (skb->sk) {
		struct sock *sk = skb->sk;

		skb->destructor = NULL;
		skb->sk = NULL;
		return sk;
	}
	return NULL;
}
2015-03-16 07:12:12 +03:00
/* This helper checks if a socket is a full socket,
 * ie _not_ a timewait or request socket.
 */
static inline bool sk_fullsock(const struct sock *sk)
{
	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
}
2015-10-08 15:01:55 +03:00
/* This helper checks if a socket is a LISTEN or NEW_SYN_RECV
 * SYNACK messages can be attached to either ones (depending on SYNCOOKIE)
 */
static inline bool sk_listener(const struct sock *sk)
{
	return (1 << sk->sk_state) & (TCPF_LISTEN | TCPF_NEW_SYN_RECV);
}
2015-11-12 19:43:18 +03:00
/**
 * sk_state_load - read sk->sk_state for lockless contexts
 * @sk: socket pointer
 *
 * Paired with sk_state_store(). Used in places we do not hold socket lock:
 * tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ...
 */
static inline int sk_state_load(const struct sock *sk)
{
	return smp_load_acquire(&sk->sk_state);
}

/**
 * sk_state_store - update sk->sk_state
 * @sk: socket pointer
 * @newstate: new state
 *
 * Paired with sk_state_load(). Should be used in contexts where
 * state change might impact lockless readers.
 */
static inline void sk_state_store(struct sock *sk, int newstate)
{
	smp_store_release(&sk->sk_state, newstate);
}
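/*
 * A minimal sketch of a lockless reader (hypothetical diag-style helper,
 * not part of this header): the acquire in sk_state_load() pairs with the
 * release in sk_state_store(), so the reader sees a consistent state value.
 */
static inline bool example_sk_is_established(const struct sock *sk)
{
	return sk_state_load(sk) == TCP_ESTABLISHED;
}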
2013-09-22 21:32:26 +04:00
void sock_enable_timestamp(struct sock *sk, int flag);
int sock_get_timestamp(struct sock *, struct timeval __user *);
int sock_get_timestampns(struct sock *, struct timespec __user *);
int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
		       int type);

bool sk_ns_capable(const struct sock *sk,
		   struct user_namespace *user_ns, int cap);
bool sk_capable(const struct sock *sk, int cap);
bool sk_net_capable(const struct sock *sk, int cap);
2005-04-17 02:20:36 +04:00
extern __u32 sysctl_wmem_max;
extern __u32 sysctl_rmem_max;
extern int sysctl_tstamp_allow_data;
extern int sysctl_optmem_max;
extern __u32 sysctl_wmem_default;
extern __u32 sysctl_rmem_default;

#endif	/* _SOCK_H */