2005-04-16 15:20:36 -07:00
/*
 * net/sched/sch_netem.c	Network emulator
 *
 *		This program is free software; you can redistribute it and/or
 *		modify it under the terms of the GNU General Public License
 *		as published by the Free Software Foundation; either version
2006-10-22 20:16:57 -07:00
 *		2 of the License.
2005-04-16 15:20:36 -07:00
 *
 *		Many of the algorithms and ideas for this came from
2007-02-09 23:25:16 +09:00
 *		NIST Net which is not copyrighted.
2005-04-16 15:20:36 -07:00
 *
 * Authors:	Stephen Hemminger <shemminger@osdl.org>
 *		Catalin(ux aka Dino) BOIE <catab at umbrella dot ro>
 */
2011-06-16 11:01:34 +00:00
# include <linux/mm.h>
2005-04-16 15:20:36 -07:00
# include <linux/module.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
# include <linux/slab.h>
2005-04-16 15:20:36 -07:00
# include <linux/types.h>
# include <linux/kernel.h>
# include <linux/errno.h>
# include <linux/skbuff.h>
2011-02-24 22:48:13 -08:00
# include <linux/vmalloc.h>
2005-04-16 15:20:36 -07:00
# include <linux/rtnetlink.h>
2011-12-12 14:30:00 +00:00
# include <linux/reciprocal_div.h>
2013-06-28 07:40:57 -07:00
# include <linux/rbtree.h>
2005-04-16 15:20:36 -07:00
2007-03-25 23:06:12 -07:00
# include <net/netlink.h>
2005-04-16 15:20:36 -07:00
# include <net/pkt_sched.h>
2012-04-30 23:11:05 +00:00
# include <net/inet_ecn.h>
2005-04-16 15:20:36 -07:00
2011-02-23 13:04:22 +00:00
# define VERSION "1.3"
2005-11-03 13:49:01 -08:00
2005-04-16 15:20:36 -07:00
/*	Network Emulation Queuing algorithm.
	====================================

	Sources: [1] Mark Carson, Darrin Santay, "NIST Net - A Linux-based
		 Network Emulation Tool"
		 [2] Luigi Rizzo, DummyNet for FreeBSD

	 ----------------------------------------------------------------

	 This started out as a simple way to delay outgoing packets to
	 test TCP but has grown to include most of the functionality
	 of a full blown network emulator like NISTnet. It can delay
	 packets and add random jitter (and correlation). The random
	 distribution can be loaded from a table as well to provide
	 normal, Pareto, or experimental curves. Packet loss,
	 duplication, and reordering can also be emulated.

	 This qdisc does not do classification; that can be handled by
	 layering other disciplines. It does not need to do bandwidth
	 control either since that can be handled by using token
	 bucket or other rate control.
2011-02-23 13:04:21 +00:00
	 Correlated Loss Generator models

	 Added generation of correlated loss according to the
	 "Gilbert-Elliot" model, a 4-state markov model.

	 References:
	 [1] NetemCLG Home http://netgroup.uniroma2.it/NetemCLG
	 [2] S. Salsano, F. Ludovici, A. Ordine, "Definition of a general
	 and intuitive loss model for packet networks and its implementation
	 in the Netem module in the Linux kernel", available in [1]

	 Authors: Stefano Salsano <stefano.salsano at uniroma2.it>
		  Fabio Ludovici <fabio.ludovici at yahoo.it>
2005-04-16 15:20:36 -07:00
*/
2018-06-27 10:32:19 -07:00
struct disttable {
u32 size ;
s16 table [ 0 ] ;
} ;
2005-04-16 15:20:36 -07:00
struct netem_sched_data {
2013-06-28 07:40:57 -07:00
/* internal t(ime)fifo qdisc uses t_root and sch->limit */
struct rb_root t_root ;
2011-12-28 23:12:02 +00:00
2018-12-04 11:55:56 -08:00
/* a linear queue; reduces rbtree rebalancing when jitter is low */
struct sk_buff * t_head ;
struct sk_buff * t_tail ;
2011-12-28 23:12:02 +00:00
/* optional qdisc for classful handling (NULL at netem init) */
2005-04-16 15:20:36 -07:00
struct Qdisc * qdisc ;
2011-12-28 23:12:02 +00:00
2007-03-16 01:20:31 -07:00
struct qdisc_watchdog watchdog ;
2005-04-16 15:20:36 -07:00
2017-11-08 15:12:26 -08:00
s64 latency ;
s64 jitter ;
2007-03-22 12:16:21 -07:00
2005-04-16 15:20:36 -07:00
u32 loss ;
2012-04-30 23:11:05 +00:00
u32 ecn ;
2005-04-16 15:20:36 -07:00
u32 limit ;
u32 counter ;
u32 gap ;
u32 duplicate ;
2005-05-26 12:55:48 -07:00
u32 reorder ;
2005-12-21 19:03:44 -08:00
u32 corrupt ;
2013-12-25 17:35:15 +08:00
u64 rate ;
2011-12-12 14:30:00 +00:00
s32 packet_overhead ;
u32 cell_size ;
reciprocal_divide: update/correction of the algorithm
Jakub Zawadzki noticed that some divisions by reciprocal_divide()
were not correct [1][2], which he could also show with BPF code
after divisions are transformed into reciprocal_value() for runtime
invariance which can be passed to reciprocal_divide() later on;
reverse in BPF dump ended up with a different, off-by-one K in
some situations.
This has been fixed by Eric Dumazet in commit aee636c4809fa5
("bpf: do not use reciprocal divide"). This follow-up patch
improves reciprocal_value() and reciprocal_divide() to work in
all cases by using Granlund and Montgomery method, so that also
future use is safe and without any non-obvious side-effects.
Known problems with the old implementation were that division by 1
always returned 0 and some off-by-ones when the dividend and divisor
were very large. This seemed to not be problematic with its
current users, as far as we can tell. Eric Dumazet checked for
the slab usage, we cannot surely say so in the case of flex_array.
Still, in order to fix that, we propose an extension from the
original implementation from commit 6a2d7a955d8d resp. [3][4],
by using the algorithm proposed in "Division by Invariant Integers
Using Multiplication" [5], Torbjörn Granlund and Peter L.
Montgomery, that is, pseudocode for q = n/d where q, n, d is in
u32 universe:
1) Initialization:
int l = ceil(log_2 d)
uword m' = floor((1<<32)*((1<<l)-d)/d)+1
int sh_1 = min(l,1)
int sh_2 = max(l-1,0)
2) For q = n/d, all uword:
uword t = (n*m')>>32
q = (t+((n-t)>>sh_1))>>sh_2
The assembler implementation from Agner Fog [6] also helped a lot
while implementing. We have tested the implementation on x86_64,
ppc64, i686, s390x; on x86_64/haswell we're still half the latency
compared to normal divide.
Joint work with Daniel Borkmann.
[1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
[2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
[3] https://gmplib.org/~tege/division-paper.pdf
[4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
[5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
[6] http://www.agner.org/optimize/asmlib.zip
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: Jesse Gross <jesse@nicira.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
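/*
 * Illustration only, not the kernel implementation: the two-step
 * pseudocode in the commit message above can be sketched in userspace C
 * roughly as follows. The names sketch_reciprocal, sketch_ceil_log2,
 * sketch_reciprocal_value and sketch_reciprocal_divide are hypothetical
 * stand-ins chosen for this example; only the arithmetic is taken from
 * the pseudocode (m', sh_1, sh_2, then q = n/d).
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a precomputed reciprocal. */
struct sketch_reciprocal {
	uint32_t m;		/* magic multiplier m' */
	uint8_t sh1, sh2;	/* post-multiply shifts */
};

/* ceil(log2(d)) for d >= 1, computed as the bit length of d - 1. */
static int sketch_ceil_log2(uint32_t d)
{
	int l = 0;
	uint32_t v = d - 1;

	while (v) {
		l++;
		v >>= 1;
	}
	return l;
}

/* Step 1 of the pseudocode: precompute m', sh_1 and sh_2 for divisor d. */
static struct sketch_reciprocal sketch_reciprocal_value(uint32_t d)
{
	struct sketch_reciprocal r;
	uint64_t m;
	int l;

	assert(d != 0);	/* a zero divisor has no reciprocal */
	l = sketch_ceil_log2(d);
	m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;
	r.m = (uint32_t)m;
	r.sh1 = (uint8_t)(l < 1 ? l : 1);	/* min(l, 1) */
	r.sh2 = (uint8_t)(l > 1 ? l - 1 : 0);	/* max(l - 1, 0) */
	return r;
}

/* Step 2: q = n / d using one widening multiply and two shifts. */
static uint32_t sketch_reciprocal_divide(uint32_t n, struct sketch_reciprocal r)
{
	uint32_t t = (uint32_t)(((uint64_t)n * r.m) >> 32);

	return (t + ((n - t) >> r.sh1)) >> r.sh2;
}
```

/*
 * Usage under the same assumptions: compute the reciprocal once when the
 * divisor is configured (here, the cell size), then reuse it per packet,
 * e.g. sketch_reciprocal_divide(len, r) in place of len / cell_size.
 */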
2014-01-22 02:29:41 +01:00
struct reciprocal_value cell_size_reciprocal ;
2011-12-12 14:30:00 +00:00
s32 cell_overhead ;
2005-04-16 15:20:36 -07:00
struct crndstate {
2007-03-22 12:16:21 -07:00
u32 last ;
u32 rho ;
2005-12-21 19:03:44 -08:00
} delay_cor , loss_cor , dup_cor , reorder_cor , corrupt_cor ;
2005-04-16 15:20:36 -07:00
2018-06-27 10:32:19 -07:00
struct disttable * delay_dist ;
2011-02-23 13:04:21 +00:00
enum {
CLG_RANDOM ,
CLG_4_STATES ,
CLG_GILB_ELL ,
} loss_model ;
2014-01-18 18:13:31 +08:00
enum {
TX_IN_GAP_PERIOD = 1 ,
TX_IN_BURST_PERIOD ,
LOST_IN_GAP_PERIOD ,
LOST_IN_BURST_PERIOD ,
} _4_state_model ;
2014-02-14 10:30:43 +08:00
enum {
GOOD_STATE = 1 ,
BAD_STATE ,
} GE_state_model ;
2011-02-23 13:04:21 +00:00
/* Correlated Loss Generation models */
struct clgstate {
/* state of the Markov chain */
u8 state ;
/* 4-states and Gilbert-Elliot models */
u32 a1 ; /* p13 for 4-states or p for GE */
u32 a2 ; /* p31 for 4-states or r for GE */
u32 a3 ; /* p32 for 4-states or h for GE */
u32 a4 ; /* p14 for 4-states or 1-k for GE */
u32 a5 ; /* p23 used only in 4-states */
} clg ;
netem: support delivering packets in delayed time slots
Slotting is a crude approximation of the behaviors of shared media such
as cable, wifi, and LTE, which gather up a bunch of packets within a
varying delay window and deliver them, relative to that, nearly all at
once.
It works within the existing loss, duplication, jitter and delay
parameters of netem. Some amount of inherent latency must be specified,
regardless.
The new "slot" parameter specifies a minimum and maximum delay between
transmission attempts.
The "bytes" and "packets" parameters can be used to limit the amount of
information transferred per slot.
Examples of use:
tc qdisc add dev eth0 root netem delay 200us \
slot 800us 10ms bytes 64k packets 42
A more correct example, using stacked netem instances and a packet limit
to emulate a tail drop wifi queue with slots and variable packet
delivery, with a 200Mbit isochronous underlying rate, and 20ms path
delay:
tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
limit 10000
tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
slot 800us 10ms bytes 64k packets 42 limit 512
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 15:12:28 -08:00
struct tc_netem_slot slot_config ;
struct slotstate {
u64 slot_next ;
s32 packets_left ;
s32 bytes_left ;
} slot ;
2018-06-27 10:32:19 -07:00
struct disttable * slot_dist ;
2005-04-16 15:20:36 -07:00
} ;
2011-12-28 23:12:02 +00:00
/* Time stamp put into socket buffer control block
 * Only valid when skbs are in our internal t(ime)fifo queue.
2014-11-03 08:19:53 -08:00
 *
 * As skb->rbnode uses the same storage as skb->next, skb->prev and skb->tstamp,
 * and skb->next & skb->prev are scratch space for a qdisc,
 * we save skb->tstamp value in skb->cb[] before destroying it.
2011-12-28 23:12:02 +00:00
 */
2005-04-16 15:20:36 -07:00
struct netem_skb_cb {
2017-11-08 15:12:26 -08:00
u64 time_to_send ;
2005-04-16 15:20:36 -07:00
} ;
2008-07-20 00:08:04 -07:00
static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
{
2013-06-28 07:40:57 -07:00
	/* we assume we can use skb next/prev/tstamp as storage for rb_node */
2012-02-06 15:14:37 -05:00
	qdisc_cb_private_validate(skb, sizeof(struct netem_skb_cb));
2008-07-20 00:08:47 -07:00
	return (struct netem_skb_cb *)qdisc_skb_cb(skb)->data;
2008-07-20 00:08:04 -07:00
}
2005-04-16 15:20:36 -07:00
/* init_crandom - initialize correlated random number generator
 * Use entropy source for initial seed.
 */
static void init_crandom(struct crndstate *state, unsigned long rho)
{
	state->rho = rho;
2014-01-11 07:15:59 -05:00
	state->last = prandom_u32();
2005-04-16 15:20:36 -07:00
}
/* get_crandom - correlated random number generator
 * Next number depends on last value.
 * rho is scaled to avoid floating point.
 */
2007-03-22 12:16:21 -07:00
static u32 get_crandom(struct crndstate *state)
2005-04-16 15:20:36 -07:00
{
	u64 value, rho;
	unsigned long answer;
2018-06-27 10:32:19 -07:00
	if (!state || state->rho == 0)	/* no correlation */
2014-01-11 07:15:59 -05:00
		return prandom_u32();
2005-04-16 15:20:36 -07:00
2014-01-11 07:15:59 -05:00
	value = prandom_u32();
2005-04-16 15:20:36 -07:00
	rho = (u64)state->rho + 1;
	answer = (value * ((1ull << 32) - rho) + state->last * rho) >> 32;
	state->last = answer;
	return answer;
}
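/*
 * Illustration only: a userspace sketch of the fixed-point blend that
 * get_crandom() performs, with the fresh random sample passed in as a
 * parameter so the arithmetic can be checked by hand. The names
 * sketch_crnd and sketch_crnd_next are hypothetical. rho is treated as
 * a correlation scaled to the u32 range: 0 means no correlation, and
 * 0xFFFFFFFF makes every output repeat the previous one.
 */

```c
#include <stdint.h>

/* Hypothetical mirror of the correlated generator state. */
struct sketch_crnd {
	uint32_t last;	/* previous output */
	uint32_t rho;	/* correlation, scaled to 0..2^32-1 */
};

/* Blend a fresh uniform sample with the previous output.
 * answer = (fresh * (2^32 - (rho + 1)) + last * (rho + 1)) / 2^32
 * The sum cannot overflow 64 bits because the two weights add up to 2^32.
 */
static uint32_t sketch_crnd_next(struct sketch_crnd *s, uint32_t fresh)
{
	uint64_t rho;
	uint32_t answer;

	if (!s || s->rho == 0)	/* no correlation: pass the sample through */
		return fresh;

	rho = (uint64_t)s->rho + 1;
	answer = (uint32_t)(((uint64_t)fresh * ((1ull << 32) - rho) +
			     (uint64_t)s->last * rho) >> 32);
	s->last = answer;
	return answer;
}
```

/*
 * With rho = 0x7FFFFFFF the two weights are equal, so a previous value of
 * 1000 and a fresh sample of 3000 blend to 2000.
 */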
2011-02-23 13:04:21 +00:00
/* loss_4state - 4-state model loss generator
 * Generates losses according to the 4-state Markov chain adopted in
 * the GI (General and Intuitive) loss model.
 */
static bool loss_4state(struct netem_sched_data *q)
{
	struct clgstate *clg = &q->clg;
2014-01-11 07:15:59 -05:00
	u32 rnd = prandom_u32();
2011-02-23 13:04:21 +00:00
	/*
2011-03-30 22:57:33 -03:00
	 * Makes a comparison between rnd and the transition
2011-02-23 13:04:21 +00:00
	 * probabilities outgoing from the current state, then decides the
	 * next state and if the next packet has to be transmitted or lost.
	 * The four states correspond to:
2014-01-18 18:13:31 +08:00
	 *   TX_IN_GAP_PERIOD => successfully transmitted packets within a gap period
	 *   LOST_IN_BURST_PERIOD => isolated losses within a gap period
	 *   LOST_IN_GAP_PERIOD => lost packets within a burst period
	 *   TX_IN_BURST_PERIOD => successfully transmitted packets within a burst period
2011-02-23 13:04:21 +00:00
	 */
	switch (clg->state) {
2014-01-18 18:13:31 +08:00
	case TX_IN_GAP_PERIOD:
2011-02-23 13:04:21 +00:00
		if (rnd < clg->a4) {
2014-01-18 18:13:31 +08:00
			clg->state = LOST_IN_BURST_PERIOD;
2011-02-23 13:04:21 +00:00
			return true;
2013-11-29 11:03:35 -08:00
		} else if (clg->a4 < rnd && rnd < clg->a1 + clg->a4) {
2014-01-18 18:13:31 +08:00
			clg->state = LOST_IN_GAP_PERIOD;
2011-02-23 13:04:21 +00:00
			return true;
2014-01-18 18:13:31 +08:00
		} else if (clg->a1 + clg->a4 < rnd) {
			clg->state = TX_IN_GAP_PERIOD;
		}
2011-02-23 13:04:21 +00:00
		break;
2014-01-18 18:13:31 +08:00
	case TX_IN_BURST_PERIOD:
2011-02-23 13:04:21 +00:00
		if (rnd < clg->a5) {
2014-01-18 18:13:31 +08:00
			clg->state = LOST_IN_GAP_PERIOD;
2011-02-23 13:04:21 +00:00
			return true;
2014-01-18 18:13:31 +08:00
		} else {
			clg->state = TX_IN_BURST_PERIOD;
		}
2011-02-23 13:04:21 +00:00
		break;
2014-01-18 18:13:31 +08:00
	case LOST_IN_GAP_PERIOD:
2011-02-23 13:04:21 +00:00
		if (rnd < clg->a3)
2014-01-18 18:13:31 +08:00
			clg->state = TX_IN_BURST_PERIOD;
2011-02-23 13:04:21 +00:00
		else if (clg->a3 < rnd && rnd < clg->a2 + clg->a3) {
2014-01-18 18:13:31 +08:00
			clg->state = TX_IN_GAP_PERIOD;
2011-02-23 13:04:21 +00:00
		} else if (clg->a2 + clg->a3 < rnd) {
2014-01-18 18:13:31 +08:00
			clg->state = LOST_IN_GAP_PERIOD;
2011-02-23 13:04:21 +00:00
			return true;
		}
		break;
2014-01-18 18:13:31 +08:00
	case LOST_IN_BURST_PERIOD:
		clg->state = TX_IN_GAP_PERIOD;
2011-02-23 13:04:21 +00:00
		break;
	}
	return false;
}
/* loss_gilb_ell - Gilbert-Elliot model loss generator
 * Generates losses according to the Gilbert-Elliot loss model or
 * its special cases (Gilbert or Simple Gilbert)
 *
2011-03-30 22:57:33 -03:00
 * Makes a comparison between random number and the transition
2011-02-23 13:04:21 +00:00
 * probabilities outgoing from the current state, then decides the
2011-03-30 22:57:33 -03:00
 * next state. A second random number is extracted and the comparison
2011-02-23 13:04:21 +00:00
 * with the loss probability of the current state decides if the next
 * packet will be transmitted or lost.
 */
static bool loss_gilb_ell(struct netem_sched_data *q)
{
	struct clgstate *clg = &q->clg;
	switch (clg->state) {
2014-02-14 10:30:43 +08:00
	case GOOD_STATE:
2014-01-11 07:15:59 -05:00
		if (prandom_u32() < clg->a1)
2014-02-14 10:30:43 +08:00
			clg->state = BAD_STATE;
2014-01-11 07:15:59 -05:00
		if (prandom_u32() < clg->a4)
2011-02-23 13:04:21 +00:00
			return true;
2013-11-29 11:02:43 -08:00
		break;
2014-02-14 10:30:43 +08:00
	case BAD_STATE:
2014-01-11 07:15:59 -05:00
		if (prandom_u32() < clg->a2)
2014-02-14 10:30:43 +08:00
			clg->state = GOOD_STATE;
2014-01-11 07:15:59 -05:00
		if (prandom_u32() > clg->a3)
2011-02-23 13:04:21 +00:00
			return true;
	}
	return false;
}
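/*
 * Illustration only: a deterministic userspace sketch of one step of the
 * two-state generator above. The two random draws are passed in as
 * parameters (rnd_move, rnd_loss) so a transition can be traced by hand.
 * The names sketch_ge, sketch_ge_state and sketch_ge_step are
 * hypothetical; the field comments map them to a1..a4 as annotated in
 * struct clgstate, with every probability scaled to the u32 range.
 */

```c
#include <stdbool.h>
#include <stdint.h>

enum sketch_ge_state { SKETCH_GOOD = 1, SKETCH_BAD };

struct sketch_ge {
	enum sketch_ge_state state;
	uint32_t p;		/* a1: chance of moving good -> bad */
	uint32_t r;		/* a2: chance of moving bad -> good */
	uint32_t h;		/* a3: in the bad state, lose when draw > h */
	uint32_t one_minus_k;	/* a4: in the good state, lose when draw < 1-k */
};

/* One packet decision. The loss test uses the state held on entry,
 * even if the first draw has just moved the chain to the other state.
 * Returns true when the packet should be dropped.
 */
static bool sketch_ge_step(struct sketch_ge *g, uint32_t rnd_move,
			   uint32_t rnd_loss)
{
	switch (g->state) {
	case SKETCH_GOOD:
		if (rnd_move < g->p)
			g->state = SKETCH_BAD;
		return rnd_loss < g->one_minus_k;
	case SKETCH_BAD:
		if (rnd_move < g->r)
			g->state = SKETCH_GOOD;
		return rnd_loss > g->h;
	}
	return false;
}
```

/*
 * Setting one_minus_k to 0 and h to 0 gives a chain that drops nothing
 * while good and nearly everything while bad, which makes the state
 * transitions easy to observe in a trace.
 */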
static bool loss_event(struct netem_sched_data *q)
{
	switch (q->loss_model) {
	case CLG_RANDOM:
		/* Random packet drop 0 => none, ~0 => all */
		return q->loss && q->loss >= get_crandom(&q->loss_cor);
	case CLG_4_STATES:
		/* 4state loss model algorithm (used also for GI model)
		 * Extracts a value from the markov 4 state loss generator,
		 * if it is 1 drops a packet and if needed writes the event in
		 * the kernel logs
		 */
		return loss_4state(q);
	case CLG_GILB_ELL:
		/* Gilbert-Elliot loss model algorithm
		 * Extracts a value from the Gilbert-Elliot loss generator,
		 * if it is 1 drops a packet and if needed writes the event in
		 * the kernel logs
		 */
		return loss_gilb_ell(q);
	}
	return false;	/* not reached */
}
2005-04-16 15:20:36 -07:00
/* tabledist - return a pseudo-randomly distributed value with mean mu and
 * std deviation sigma.  Uses table lookup to approximate the desired
 * distribution, and a uniformly-distributed pseudo-random source.
 */
2017-11-14 11:27:02 -08:00
static s64 tabledist(s64 mu, s32 sigma,
2017-11-08 15:12:26 -08:00
		     struct crndstate *state,
2017-11-14 11:27:02 -08:00
		     const struct disttable *dist)
2005-04-16 15:20:36 -07:00
{
2017-11-08 15:12:26 -08:00
	s64 x;
2007-03-22 12:16:21 -07:00
	long t;
	u32 rnd;
2005-04-16 15:20:36 -07:00
	if (sigma == 0)
		return mu;
	rnd = get_crandom(state);
	/* default uniform distribution */
2007-02-09 23:25:16 +09:00
	if (dist == NULL)
2018-02-06 23:14:18 -05:00
		return ((rnd % (2 * sigma)) + mu) - sigma;
2005-04-16 15:20:36 -07:00
	t = dist->table[rnd % dist->size];
	x = (sigma % NETEM_DIST_SCALE) * t;
	if (x >= 0)
		x += NETEM_DIST_SCALE / 2;
	else
		x -= NETEM_DIST_SCALE / 2;
	return x / NETEM_DIST_SCALE + (sigma / NETEM_DIST_SCALE) * t + mu;
}
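/*
 * Illustration only: the scaling at the end of tabledist() computes
 * mu + sigma * t / NETEM_DIST_SCALE with rounding, splitting sigma into
 * a remainder and a quotient part. A userspace sketch of just that
 * arithmetic follows. SKETCH_DIST_SCALE and sketch_tabledist_scale are
 * hypothetical names, and the scale of 8192 is an assumption made for
 * this example; a table entry t equal to the scale then stands for one
 * standard deviation.
 */

```c
#include <stdint.h>

#define SKETCH_DIST_SCALE 8192	/* assumed value of NETEM_DIST_SCALE */

/* Return mu + sigma * t / SKETCH_DIST_SCALE, rounded half away from zero
 * on the remainder part, using the same split as the code above.
 */
static int64_t sketch_tabledist_scale(int64_t mu, int32_t sigma, int16_t t)
{
	/* low part: (sigma mod scale) * t, which is what needs rounding */
	int64_t x = (int64_t)(sigma % SKETCH_DIST_SCALE) * t;

	if (x >= 0)
		x += SKETCH_DIST_SCALE / 2;
	else
		x -= SKETCH_DIST_SCALE / 2;

	/* high part: (sigma div scale) * t is exact and needs no rounding */
	return x / SKETCH_DIST_SCALE +
	       (int64_t)(sigma / SKETCH_DIST_SCALE) * t + mu;
}
```

/*
 * For mu = 100000 and sigma = 10000, a table entry of 8192 yields 110000
 * (mu plus one sigma) and an entry of -8192 yields 90000.
 */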
2017-11-14 11:27:01 -08:00
static u64 packet_time_ns(u64 len, const struct netem_sched_data *q)
2011-11-30 12:20:26 +00:00
{
2011-12-12 14:30:00 +00:00
	len += q->packet_overhead;
	if (q->cell_size) {
		u32 cells = reciprocal_divide(len, q->cell_size_reciprocal);
		if (len > cells * q->cell_size)	/* extra cell needed for remainder */
			cells++;
		len = cells * (q->cell_size + q->cell_overhead);
	}
2017-11-14 11:27:01 -08:00
	return div64_u64(len * NSEC_PER_SEC, q->rate);
2011-11-30 12:20:26 +00:00
}
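/*
 * Illustration only: a userspace sketch of the same transmission-time
 * arithmetic, using a plain division where the kernel code uses the
 * precomputed reciprocal. sketch_packet_time_ns and its flat parameter
 * list are hypothetical, and the rate is assumed to be in bytes per
 * second.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds needed to send len bytes at rate bytes/second, after adding
 * per-packet overhead and rounding up to whole cells when cell_size != 0.
 */
static uint64_t sketch_packet_time_ns(uint64_t len, int32_t packet_overhead,
				      uint32_t cell_size, int32_t cell_overhead,
				      uint64_t rate)
{
	assert(rate != 0);	/* a zero rate has no finite send time */

	len += (uint64_t)(int64_t)packet_overhead;
	if (cell_size) {
		uint64_t cells = len / cell_size;

		if (len > cells * cell_size)	/* partial cell needs a full one */
			cells++;
		len = cells * (uint64_t)((int64_t)cell_size + cell_overhead);
	}
	return len * 1000000000ull / rate;
}
```

/*
 * Worked example with ATM-like framing: a 100 byte packet with 48 byte
 * cells and 5 bytes of per-cell overhead needs 3 cells, so 159 bytes go on
 * the wire. At 125000 bytes/second (1 Mbit/s) that takes 1272000 ns.
 */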
2013-10-06 15:16:49 -07:00
static void tfifo_reset(struct Qdisc *sch)
{
	struct netem_sched_data *q = qdisc_priv(sch);
sch_netem: faster rb tree removal
While running TCP tests involving netem storing millions of packets,
I had the idea to speed up tfifo_reset() and did experiments.
I tried the rbtree_postorder_for_each_entry_safe() method that is
used in skb_rbtree_purge() but discovered it was slower than the
current tfifo_reset() method.
I measured time taken to release skbs with three occupation levels :
10^4, 10^5 and 10^6 skbs with three methods :
1) (current 'naive' method)
while ((p = rb_first(&q->t_root))) {
struct sk_buff *skb = netem_rb_to_skb(p);
rb_erase(p, &q->t_root);
rtnl_kfree_skbs(skb, skb);
}
2) Use rb_next() instead of rb_first() in the loop :
p = rb_first(&q->t_root);
while (p) {
struct sk_buff *skb = netem_rb_to_skb(p);
p = rb_next(p);
rb_erase(&skb->rbnode, &q->t_root);
rtnl_kfree_skbs(skb, skb);
}
3) "optimized" method using rbtree_postorder_for_each_entry_safe()
struct sk_buff *skb, *next;
rbtree_postorder_for_each_entry_safe(skb, next,
&q->t_root, rbnode) {
rtnl_kfree_skbs(skb, skb);
}
q->t_root = RB_ROOT;
Results :
method_1:while (rb_first()) rb_erase() 10000 skbs in 690378 ns (69 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...} 10000 skbs in 541846 ns (54 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 10000 skbs in 868307 ns (86 ns per skb)
method_1:while (rb_first()) rb_erase() 99996 skbs in 7804021 ns (78 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...} 100000 skbs in 5942456 ns (59 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 100000 skbs in 11584940 ns (115 ns per skb)
method_1:while (rb_first()) rb_erase() 1000000 skbs in 108577838 ns (108 ns per skb)
method_2:rb_first; while (p) { p = rb_next(p); ...} 1000000 skbs in 82619635 ns (82 ns per skb)
method_3:rbtree_postorder_for_each_entry_safe() 1000000 skbs in 127328743 ns (127 ns per skb)
Method 2) is simply faster, probably because it maintains a smaller
working size set.
Note that this is the method we use in tcp_ofo_queue() already.
I will also change skb_rbtree_purge() in a second patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-23 11:07:28 -07:00
	struct rb_node *p = rb_first(&q->t_root);
2013-10-06 15:16:49 -07:00
2017-09-23 11:07:28 -07:00
	while (p) {
2017-10-05 22:21:21 -07:00
		struct sk_buff *skb = rb_to_skb(p);
2013-10-06 15:16:49 -07:00
2017-09-23 11:07:28 -07:00
		p = rb_next(p);
		rb_erase(&skb->rbnode, &q->t_root);
2016-06-13 20:21:57 -07:00
		rtnl_kfree_skbs(skb, skb);
2013-10-06 15:16:49 -07:00
	}
2018-12-04 11:55:56 -08:00
	rtnl_kfree_skbs(q->t_head, q->t_tail);
	q->t_head = NULL;
	q->t_tail = NULL;
2013-10-06 15:16:49 -07:00
}
2012-07-03 20:55:21 +00:00
static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
2011-12-28 23:12:02 +00:00
{
2013-06-28 07:40:57 -07:00
	struct netem_sched_data *q = qdisc_priv(sch);
2017-11-08 15:12:26 -08:00
	u64 tnext = netem_skb_cb(nskb)->time_to_send;
2011-12-28 23:12:02 +00:00
2018-12-04 11:55:56 -08:00
	if (!q->t_tail || tnext >= netem_skb_cb(q->t_tail)->time_to_send) {
		if (q->t_tail)
			q->t_tail->next = nskb;
2013-06-28 07:40:57 -07:00
		else
2018-12-04 11:55:56 -08:00
			q->t_head = nskb;
		q->t_tail = nskb;
	} else {
		struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
		while (*p) {
			struct sk_buff *skb;
			parent = *p;
			skb = rb_to_skb(parent);
			if (tnext >= netem_skb_cb(skb)->time_to_send)
				p = &parent->rb_right;
			else
				p = &parent->rb_left;
		}
		rb_link_node(&nskb->rbnode, parent, p);
		rb_insert_color(&nskb->rbnode, &q->t_root);
2011-12-28 23:12:02 +00:00
	}
2013-06-28 07:40:57 -07:00
	sch->q.qlen++;
2011-12-28 23:12:02 +00:00
}
netem: Segment GSO packets on enqueue
This was recently reported to me, and reproduced on the latest net kernel,
when attempting to run netperf from a host that had a netem qdisc attached
to the egress interface:
[ 788.073771] ---------------------[ cut here ]---------------------------
[ 788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
[ 788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962
data_len=0 gso_size=1448 gso_type=1 ip_summed=3
[ 788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
dm_mod
[ 788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W
------------ 3.10.0-327.el7.x86_64 #1
[ 788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
[ 788.542260] ffff880437c036b8 f7afc56532a53db9 ffff880437c03670
ffffffff816351f1
[ 788.576332] ffff880437c036a8 ffffffff8107b200 ffff880633e74200
ffff880231674000
[ 788.611943] 0000000000000001 0000000000000003 0000000000000000
ffff880437c03710
[ 788.647241] Call Trace:
[ 788.658817] <IRQ> [<ffffffff816351f1>] dump_stack+0x19/0x1b
[ 788.686193] [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[ 788.713803] [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
[ 788.741314] [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100
[ 788.767018] [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda
[ 788.796117] [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190
[ 788.823392] [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem]
[ 788.854487] [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570
[ 788.880870] [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0
...
The problem occurs because netem is not prepared to handle GSO packets (as it
uses skb_checksum_help in its enqueue path, which cannot manipulate these
frames).
The solution, I think, is to simply segment the skb in a similar fashion to the
way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes.
When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt
the first segment, and enqueue the remaining ones.
Tested successfully by myself on the latest net kernel, to which this applies.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netem@lists.linux-foundation.org
CC: eric.dumazet@gmail.com
CC: stephen@networkplumber.org
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 12:20:15 -04:00
/* netem can't properly corrupt a megapacket (like we get from GSO), so
* when we statistically choose to corrupt one, we segment it instead, returning
* the first packet to be corrupted, and re-enqueue the remaining frames.
*/
2016-06-21 23:16:49 -07:00
static struct sk_buff * netem_segment ( struct sk_buff * skb , struct Qdisc * sch ,
struct sk_buff * * to_free )
2016-05-02 12:20:15 -04:00
{
struct sk_buff * segs ;
netdev_features_t features = netif_skb_features ( skb ) ;
segs = skb_gso_segment ( skb , features & ~ NETIF_F_GSO_MASK ) ;
if ( IS_ERR_OR_NULL ( segs ) ) {
2016-06-21 23:16:49 -07:00
qdisc_drop ( skb , sch , to_free ) ;
2016-05-02 12:20:15 -04:00
return NULL ;
}
consume_skb ( skb ) ;
return segs ;
}
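The segmentation step can be pictured outside the kernel with a small userspace sketch. It is not the kernel implementation: `struct seg`, `segment_payload` and `seg_list_free` are hypothetical names standing in for `struct sk_buff` and `skb_gso_segment()`, and header handling is ignored. It only shows one large buffer being split into a linked list of fixed-size pieces, with NULL returned on failure, as `netem_segment()` does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct sk_buff: one segment in a singly linked list. */
struct seg {
	struct seg *next;
	size_t len;
	unsigned char data[];
};

/* Free every segment in the list. */
static void seg_list_free(struct seg *head)
{
	while (head) {
		struct seg *next = head->next;

		free(head);
		head = next;
	}
}

/*
 * Split buf[0..len) into segments of at most seg_size bytes each.
 * Returns the list head, or NULL if the input is empty or an
 * allocation fails (the caller then treats the packet as dropped).
 */
static struct seg *segment_payload(const unsigned char *buf, size_t len,
				   size_t seg_size)
{
	struct seg *head = NULL, **tail = &head;
	size_t off = 0;

	assert(seg_size > 0);
	while (off < len) {
		size_t n = len - off < seg_size ? len - off : seg_size;
		struct seg *s = malloc(sizeof(*s) + n);

		if (!s) {
			seg_list_free(head);
			return NULL;
		}
		s->next = NULL;
		s->len = n;
		memcpy(s->data, buf + off, n);
		*tail = s;		/* append to the end of the list */
		tail = &s->next;
		off += n;
	}
	return head;
}
```

In this sketch the caller would corrupt only the head of the returned list and queue the remaining segments unchanged, which is the behaviour the commit message describes.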
2005-05-26 12:53:49 -07:00
/*
* Insert one skb into qdisc.
* Note: parent depends on return value to account for queue length.
* NET_XMIT_DROP: queue length didn't change.
* NET_XMIT_SUCCESS: one skb was queued.
*/
2016-06-21 23:16:49 -07:00
static int netem_enqueue ( struct sk_buff * skb , struct Qdisc * sch ,
struct sk_buff * * to_free )
2005-04-16 15:20:36 -07:00
{
struct netem_sched_data * q = qdisc_priv ( sch ) ;
2006-07-21 14:45:25 -07:00
/* We don't fill cb now as skb_unshare() may invalidate it */
struct netem_skb_cb * cb ;
2005-05-26 12:53:49 -07:00
struct sk_buff * skb2 ;
2016-05-02 12:20:15 -04:00
struct sk_buff * segs = NULL ;
unsigned int len = 0 , last_len , prev_len = qdisc_pkt_len ( skb ) ;
int nb = 0 ;
2005-05-26 12:53:49 -07:00
int count = 1 ;
2016-05-02 12:20:15 -04:00
int rc = NET_XMIT_SUCCESS ;
2019-02-28 18:47:58 +08:00
int rc_drop = NET_XMIT_DROP ;
2005-04-16 15:20:36 -07:00
2018-11-29 16:01:04 -08:00
/* Do not fool qdisc_drop_all() */
skb - > prev = NULL ;
2005-05-26 12:53:49 -07:00
/* Random duplication */
if ( q - > duplicate & & q - > duplicate > = get_crandom ( & q - > dup_cor ) )
+ + count ;
2011-02-23 13:04:21 +00:00
/* Drop packet? */
2012-04-30 23:11:05 +00:00
if ( loss_event ( q ) ) {
if ( q - > ecn & & INET_ECN_set_ce ( skb ) )
2014-09-28 11:53:29 -07:00
qdisc_qstats_drop ( sch ) ; /* mark packet */
2012-04-30 23:11:05 +00:00
else
- - count ;
}
2005-05-26 12:53:49 -07:00
if ( count = = 0 ) {
2014-09-28 11:53:29 -07:00
qdisc_qstats_drop ( sch ) ;
2016-06-21 23:16:49 -07:00
__qdisc_drop ( skb , to_free ) ;
2008-08-04 22:39:11 -07:00
return NET_XMIT_SUCCESS | __NET_XMIT_BYPASS ;
2005-04-16 15:20:36 -07:00
}
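The duplicate/loss bookkeeping above reduces to a small counter. The following is a minimal sketch of that decision only, assuming the random draws and the ECN marking attempt have already been evaluated; `copies_to_queue` and its boolean parameters are hypothetical names, not part of the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide how many copies of a packet should be queued:
 *   0 = drop, 1 = queue once, 2 = queue the packet plus one duplicate.
 *
 * duplicate_hit: the duplication draw succeeded.
 * loss_hit:      the loss model signalled a loss event.
 * ecn_marked:    ECN is enabled and the packet was CE-marked, so the
 *                loss event is recorded as a mark rather than a drop.
 */
static int copies_to_queue(bool duplicate_hit, bool loss_hit, bool ecn_marked)
{
	int count = 1;

	if (duplicate_hit)
		++count;
	if (loss_hit && !ecn_marked)
		--count;

	assert(count >= 0 && count <= 2);
	return count;
}
```

Note that a packet hit by both duplication and loss ends up with a count of 1, so it is queued exactly once.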
netem: refine early skb orphaning
netem does an early orphaning of skbs. Doing so breaks TCP Small Queue
or any mechanism relying on socket sk_wmem_alloc feedback.
Ideally, we should perform this orphaning after the rate module and
before the delay module, to mimic what happens on a real link:
skb orphaning is indeed normally done at TX completion, before the
transit on the link.
+-------+ +--------+ +---------------+ +-----------------+
+ Qdisc +---> Device +--> TX completion +--> links / hops +->
+ + + xmit + + skb orphaning + + propagation +
+-------+ +--------+ +---------------+ +-----------------+
< rate limiting > < delay, drops, reorders >
If netem is used without delay feature (drops, reorders, rate
limiting), then we should avoid early skb orphaning, to keep pressure
on sockets as long as packets are still in qdisc queue.
Ideally, netem should be refactored to implement delay module
as the last stage. Current algorithm merges the two phases
(rate limiting + delay) so it's not correct.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Mark Gordon <msg@google.com>
Cc: Andreas Terzis <aterzis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-14 03:16:27 +00:00
/* If a delay is expected, orphan the skb. (orphaning usually takes
* place at TX completion time, so _before_ the link transit delay)
*/
netem: apply correct delay when rate throttling
I recently reported on the netem list that iperf network benchmarks
show unexpected results when a bandwidth throttling rate has been
configured for netem. Specifically:
1) The measured link bandwidth *increases* when a higher delay is added
2) The measured link bandwidth appears higher than the specified limit
3) The measured link bandwidth for the same very slow settings varies significantly across
machines
The issue can be reproduced by using tc to configure netem with a
512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a
veth pair between network namespaces, and then using iperf (or any
other network benchmarking tool) to test throughput. Complete detailed
instructions are in the original email chain here:
https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html
There appear to be two underlying bugs causing these effects:
- The first issue causes long delays when the rate is slow and no
delay is configured (e.g., "rate 512kbit"). This is because SKBs are
not orphaned when no delay is configured, so orphaning does not
occur until *after* the rate-induced delay has been applied. For
this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us")
dramatically increases the measured bandwidth.
- The second issue is that rate-induced delays are not correctly
applied, allowing SKB delays to occur in parallel. The intended
approach is to compute the delay for an SKB and to add this delay to
the end of the current queue. However, the code does not detect
existing SKBs in the queue due to improperly testing sch->q.qlen,
which is nonzero even when packets exist only in the
rbtree. Consequently, new SKBs do not wait for the current queue to
empty. When packet delays vary significantly (e.g., if packet sizes
are different), then this also causes unintended reordering.
I modified the code to expect a delay (and orphan the SKB) when a rate
is configured. I also added some defensive tests that correctly find
the latest scheduled delivery time, even if it is (unexpectedly) for a
packet in sch->q. I have tested these changes on the latest kernel
(4.11.0-rc1+) and the iperf / ping test results are as expected.
Signed-off-by: Nik Unger <njunger@uwaterloo.ca>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 10:16:58 -07:00
if ( q - > latency | | q - > jitter | | q - > rate )
2013-07-30 17:55:08 -07:00
skb_orphan_partial ( skb ) ;
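The rate-throttling behaviour described in the commit message (each packet's transmission delay is appended to the end of the current queue instead of running in parallel) can be sketched in userspace as follows. This is an illustration under simplifying assumptions, not the kernel code: `tx_time_ns` and `schedule_send` are hypothetical helpers, time is a plain nanosecond counter, and overhead and cell-size adjustments are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds needed to serialize len_bytes at rate_bytes_per_sec. */
static uint64_t tx_time_ns(uint32_t len_bytes, uint64_t rate_bytes_per_sec)
{
	assert(rate_bytes_per_sec > 0);
	return (uint64_t)len_bytes * 1000000000ULL / rate_bytes_per_sec;
}

/*
 * Schedule one packet behind everything already queued.
 * *last_done_ns holds the finish time of the previously scheduled
 * packet; the new packet starts at that time or at now_ns, whichever
 * is later, so transmission delays add up rather than overlap.
 * Returns the finish time of the new packet and stores it back.
 */
static uint64_t schedule_send(uint64_t now_ns, uint64_t *last_done_ns,
			      uint32_t len_bytes, uint64_t rate_bytes_per_sec)
{
	uint64_t start = *last_done_ns > now_ns ? *last_done_ns : now_ns;

	*last_done_ns = start + tx_time_ns(len_bytes, rate_bytes_per_sec);
	return *last_done_ns;
}
```

With the 512kbit rate mentioned in the commit message (64000 bytes per second), a 1500-byte packet takes 23437500 ns to serialize, so two such packets arriving back to back finish at 23437500 ns and 46875000 ns respectively.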
2006-10-22 21:00:33 -07:00
2005-05-26 12:53:49 -07:00
/*
* If we need to duplicate packet, then re-insert at top of the
* qdisc tree, since parent queuer expects that only one
* skb will be queued.
*/
if ( count > 1 & & ( skb2 = skb_clone ( skb , GFP_ATOMIC ) ) ! = NULL ) {
2008-07-16 01:42:40 -07:00
struct Qdisc * rootq = qdisc_root ( sch ) ;
2005-05-26 12:53:49 -07:00
u32 dupsave = q - > duplicate ; /* prevent duplicating a dup... */
2015-05-11 09:06:56 -07:00
q - > duplicate = 0 ;
2016-06-21 23:16:49 -07:00
rootq - > enqueue ( skb2 , rootq , to_free ) ;
2005-05-26 12:53:49 -07:00
q - > duplicate = dupsave ;
2019-02-28 18:47:58 +08:00
rc_drop = NET_XMIT_SUCCESS ;
2005-04-16 15:20:36 -07:00
}
2005-12-21 19:03:44 -08:00
/*
* Randomized packet corruption.
* Make copy if needed since we are modifying.
* If packet is going to be hardware checksummed, then
* do it now in software before we mangle it.
*/
if ( q - > corrupt & & q - > corrupt > = get_crandom ( & q - > corrupt_cor ) ) {
2016-05-02 12:20:15 -04:00
if ( skb_is_gso ( skb ) ) {
2016-06-21 23:16:49 -07:00
segs = netem_segment ( skb , sch , to_free ) ;
2016-05-02 12:20:15 -04:00
if ( ! segs )
2019-02-28 18:47:58 +08:00
return rc_drop ;
2016-05-02 12:20:15 -04:00
} else {
segs = skb ;
}
skb = segs ;
segs = segs - > next ;
2016-06-28 10:30:08 +02:00
skb = skb_unshare ( skb , GFP_ATOMIC ) ;
if ( unlikely ( ! skb ) ) {
qdisc_qstats_drop ( sch ) ;
goto finish_segs ;
}
if ( skb - > ip_summed = = CHECKSUM_PARTIAL & &
skb_checksum_help ( skb ) ) {
qdisc_drop ( skb , sch , to_free ) ;
netem: Segment GSO packets on enqueue
This was recently reported to me, and reproduced on the latest net kernel,
when attempting to run netperf from a host that had a netem qdisc attached
to the egress interface:
[ 788.073771] ---------------------[ cut here ]---------------------------
[ 788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
[ 788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962
data_len=0 gso_size=1448 gso_type=1 ip_summed=3
[ 788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
dm_mod
[ 788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G W
------------ 3.10.0-327.el7.x86_64 #1
[ 788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
[ 788.542260] ffff880437c036b8 f7afc56532a53db9 ffff880437c03670
ffffffff816351f1
[ 788.576332] ffff880437c036a8 ffffffff8107b200 ffff880633e74200
ffff880231674000
[ 788.611943] 0000000000000001 0000000000000003 0000000000000000
ffff880437c03710
[ 788.647241] Call Trace:
[ 788.658817] <IRQ> [<ffffffff816351f1>] dump_stack+0x19/0x1b
[ 788.686193] [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
[ 788.713803] [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
[ 788.741314] [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100
[ 788.767018] [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda
[ 788.796117] [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190
[ 788.823392] [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem]
[ 788.854487] [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570
[ 788.880870] [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0
...
The problem occurs because netem is not prepared to handle GSO packets (as it
uses skb_checksum_help in its enqueue path, which cannot manipulate these
frames).
The solution, I think, is to simply segment the skb in a similar fashion to the
way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes.
When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt
the first segment, and enqueue the remaining ones.
Tested successfully by myself on the latest net kernel, to which this applies.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netem@lists.linux-foundation.org
CC: eric.dumazet@gmail.com
CC: stephen@networkplumber.org
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
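The segment-then-corrupt approach described in this commit message can be sketched in userspace C. This is only an illustration under stated assumptions: `struct seg`, `segment_frame` and `corrupt_one_bit` are hypothetical names, not kernel APIs, and the sketch ignores the header replication and checksum fixups that real GSO segmentation performs. The byte and bit selection uses the same modulo arithmetic as the corruption step in the enqueue path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for one segment; not a kernel type. */
struct seg {
	uint8_t *data;	/* start of this segment inside the frame */
	size_t len;	/* number of bytes in this segment */
};

/*
 * Split a frame of `len` bytes into pieces of at most `gso_size` bytes.
 * Fills `out` and returns the number of segments, or 0 if `out` has
 * fewer than the required number of entries.
 */
static size_t segment_frame(uint8_t *frame, size_t len, size_t gso_size,
			    struct seg *out, size_t max_segs)
{
	size_t n = 0;

	assert(gso_size > 0);
	while (len > 0) {
		size_t chunk = len < gso_size ? len : gso_size;

		if (n == max_segs)
			return 0;
		out[n].data = frame;
		out[n].len = chunk;
		n++;
		frame += chunk;
		len -= chunk;
	}
	return n;
}

/*
 * Flip a single bit. The caller supplies two random values; the byte
 * index is taken modulo the buffer length and the bit index modulo 8.
 */
static void corrupt_one_bit(uint8_t *buf, size_t len,
			    uint32_t r_byte, uint32_t r_bit)
{
	assert(len > 0);
	buf[r_byte % len] ^= (uint8_t)(1u << (r_bit % 8));
}
```

A caller would segment the frame, pass only the first segment to `corrupt_one_bit`, and queue the remaining segments unchanged.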
2016-05-02 12:20:15 -04:00
goto finish_segs ;
}
2005-12-21 19:03:44 -08:00
2014-01-11 07:15:59 -05:00
skb - > data [ prandom_u32 ( ) % skb_headlen ( skb ) ] ^ =
1 < < ( prandom_u32 ( ) % 8 ) ;
2005-12-21 19:03:44 -08:00
}
2019-02-28 18:47:58 +08:00
if ( unlikely ( sch - > q . qlen > = sch - > limit ) ) {
qdisc_drop_all ( skb , sch , to_free ) ;
return rc_drop ;
}
2012-07-03 20:55:21 +00:00
2014-09-28 11:53:29 -07:00
qdisc_qstats_backlog_inc ( sch , skb ) ;
2012-07-03 20:55:21 +00:00
2008-07-20 00:08:04 -07:00
cb = netem_skb_cb ( skb ) ;
2011-01-19 19:26:56 +00:00
if ( q - > gap = = 0 | | /* not doing reordering */
2012-01-19 10:20:59 +00:00
q - > counter < q - > gap - 1 | | /* inside last reordering gap */
2009-11-29 16:55:45 -08:00
q - > reorder < get_crandom ( & q - > reorder_cor ) ) {
2017-11-08 15:12:26 -08:00
u64 now ;
s64 delay ;
2005-11-03 13:43:07 -08:00
delay = tabledist ( q - > latency , q - > jitter ,
& q - > delay_cor , q - > delay_dist ) ;
2017-11-08 15:12:26 -08:00
now = ktime_get_ns ( ) ;
2011-11-30 12:20:26 +00:00
if ( q - > rate ) {
netem: apply correct delay when rate throttling
I recently reported on the netem list that iperf network benchmarks
show unexpected results when a bandwidth throttling rate has been
configured for netem. Specifically:
1) The measured link bandwidth *increases* when a higher delay is added
2) The measured link bandwidth appears higher than the specified limit
3) The measured link bandwidth for the same very slow settings varies significantly across
machines
The issue can be reproduced by using tc to configure netem with a
512kbit rate and various (none, 1us, 50ms, 100ms, 200ms) delays on a
veth pair between network namespaces, and then using iperf (or any
other network benchmarking tool) to test throughput. Complete detailed
instructions are in the original email chain here:
https://lists.linuxfoundation.org/pipermail/netem/2017-February/001672.html
There appear to be two underlying bugs causing these effects:
- The first issue causes long delays when the rate is slow and no
delay is configured (e.g., "rate 512kbit"). This is because SKBs are
not orphaned when no delay is configured, so orphaning does not
occur until *after* the rate-induced delay has been applied. For
this reason, adding a tiny delay (e.g., "rate 512kbit delay 1us")
dramatically increases the measured bandwidth.
- The second issue is that rate-induced delays are not correctly
applied, allowing SKB delays to occur in parallel. The intended
approach is to compute the delay for an SKB and to add this delay to
the end of the current queue. However, the code does not detect
existing SKBs in the queue due to improperly testing sch->q.qlen,
which is nonzero even when packets exist only in the
rbtree. Consequently, new SKBs do not wait for the current queue to
empty. When packet delays vary significantly (e.g., if packet sizes
are different), then this also causes unintended reordering.
I modified the code to expect a delay (and orphan the SKB) when a rate
is configured. I also added some defensive tests that correctly find
the latest scheduled delivery time, even if it is (unexpectedly) for a
packet in sch->q. I have tested these changes on the latest kernel
(4.11.0-rc1+) and the iperf / ping test results are as expected.
Signed-off-by: Nik Unger <njunger@uwaterloo.ca>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
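The "defensive tests" mentioned above amount to taking the maximum delivery time over every place a queued packet could live. A minimal userspace sketch of that selection follows; `struct pkt_cb` and `latest_scheduled` are hypothetical names introduced for illustration, and the real code works on `struct netem_skb_cb` obtained from skbs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for netem_skb_cb; only the field used here. */
struct pkt_cb {
	uint64_t time_to_send;	/* scheduled delivery time in ns */
};

/*
 * Return the candidate with the latest time_to_send, or NULL when every
 * candidate is NULL. In the source the candidates are the tail of
 * sch->q, the last node of the rbtree and q->t_tail.
 */
static const struct pkt_cb *latest_scheduled(const struct pkt_cb *const *cands,
					     size_t n)
{
	const struct pkt_cb *last = NULL;
	size_t i;

	assert(cands != NULL);
	for (i = 0; i < n; i++) {
		if (!cands[i])
			continue;	/* that container is empty */
		if (!last || cands[i]->time_to_send > last->time_to_send)
			last = cands[i];
	}
	return last;
}
```

A NULL result means nothing is queued, so the new packet's delay is measured from the current time rather than from an earlier packet's delivery time.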
2017-03-13 10:16:58 -07:00
struct netem_skb_cb * last = NULL ;
if ( sch - > q . tail )
last = netem_skb_cb ( sch - > q . tail ) ;
if ( q - > t_root . rb_node ) {
struct sk_buff * t_skb ;
struct netem_skb_cb * t_last ;
2017-10-05 22:21:21 -07:00
t_skb = skb_rb_last ( & q - > t_root ) ;
2017-03-13 10:16:58 -07:00
t_last = netem_skb_cb ( t_skb ) ;
if ( ! last | |
2018-12-04 11:55:56 -08:00
t_last - > time_to_send > last - > time_to_send )
last = t_last ;
}
if ( q - > t_tail ) {
struct netem_skb_cb * t_last =
netem_skb_cb ( q - > t_tail ) ;
if ( ! last | |
t_last - > time_to_send > last - > time_to_send )
2017-03-13 10:16:58 -07:00
last = t_last ;
}
2011-11-30 12:20:26 +00:00
2013-06-28 07:40:57 -07:00
if ( last ) {
2011-11-30 12:20:26 +00:00
/*
netem: fix delay calculation in rate extension
The delay calculation with the rate extension introduced in v3.3 does
not work properly if other packets are still queued for transmission.
For the delay calculation to work, both delay types (latency and delay
introduced by rate limitation) have to be handled differently. The
latency delay for a packet can overlap with the delay of other packets.
The delay introduced by the rate however is separate, and can only
start, once all other rate-introduced delays finished.
Latency delay is from same distribution for each packet, rate delay
depends on the packet size.
.: latency delay
-: rate delay
x: additional delay we have to wait since another packet is currently
transmitted
.....----              Packet 1
  .....xx------        Packet 2
          .....------  Packet 3
  ^^^^^
  latency stacks
       ^^
       rate delay doesn't stack
          ^^
          latency stacks
-----> time
When a packet is enqueued, we first consider the latency delay. If other
packets are already queued, we can reduce the latency delay until the
last packet in the queue is sent; however, the latency delay cannot be
<0, since this would mean that the rate is overcommitted. The new
reference point is the time at which the last packet will be sent. To
find the time when the packet should be sent, the rate-introduced delay
has to be added on top of that.
Signed-off-by: Johannes Naab <jn@stusta.de>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
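The calculation described above can be written as a small userspace sketch. The names `tx_time_ns` and `schedule_packet` are hypothetical, the rate is assumed to be given in bytes per second, and the sketch leaves out the overhead and cell-size handling that the kernel's own rate helper may apply.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Nanoseconds needed to transmit `len` bytes at `rate` bytes per second. */
static uint64_t tx_time_ns(uint64_t len, uint64_t rate)
{
	assert(rate > 0);
	return len * NSEC_PER_SEC / rate;
}

/*
 * Compute the delivery time of a new packet.
 *   now       current time in ns
 *   latency   latency delay drawn for this packet in ns
 *   have_last nonzero if another packet is still queued
 *   last_tts  delivery time of that last queued packet
 */
static uint64_t schedule_packet(uint64_t now, int64_t latency,
				int have_last, uint64_t last_tts,
				uint64_t len, uint64_t rate)
{
	int64_t delay = latency;

	if (have_last) {
		/* Latency overlaps with the wait for the queue to drain. */
		delay -= (int64_t)(last_tts - now);
		if (delay < 0)
			delay = 0;	/* rate is overcommitted */
		now = last_tts;		/* new reference point */
	}
	/* The rate delay never overlaps; it is added on top. */
	delay += (int64_t)tx_time_ns(len, rate);
	return now + (uint64_t)delay;
}
```

With a rate of one byte per nanosecond, a 4-byte packet enqueued at time 0 with latency 5 is delivered at 9. A 6-byte packet enqueued at time 2 has its latency fully absorbed by the wait and is delivered at 15.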
2013-01-23 11:36:51 +00:00
* Last packet in queue is reference point ( now ) ,
* calculate this time bonus and subtract
2011-11-30 12:20:26 +00:00
* from delay .
*/
2017-03-13 10:16:58 -07:00
delay - = last - > time_to_send - now ;
2017-11-08 15:12:26 -08:00
delay = max_t ( s64 , 0 , delay ) ;
2017-03-13 10:16:58 -07:00
now = last - > time_to_send ;
2011-11-30 12:20:26 +00:00
}
2013-01-23 11:36:51 +00:00
2017-11-14 11:27:01 -08:00
delay + = packet_time_ns ( qdisc_pkt_len ( skb ) , q ) ;
2011-11-30 12:20:26 +00:00
}
2007-03-23 11:27:45 -07:00
cb - > time_to_send = now + delay ;
2005-04-16 15:20:36 -07:00
+ + q - > counter ;
2012-07-03 20:55:21 +00:00
tfifo_enqueue ( skb , sch ) ;
2005-04-16 15:20:36 -07:00
} else {
2007-02-09 23:25:16 +09:00
/*
2005-05-26 12:55:48 -07:00
* Do re - ordering by putting one out of N packets at the front
* of the queue .
*/
2017-11-08 15:12:26 -08:00
cb - > time_to_send = ktime_get_ns ( ) ;
2005-05-26 12:55:48 -07:00
q - > counter = 0 ;
2008-11-02 00:36:03 -07:00
2018-07-29 16:33:28 -07:00
__qdisc_enqueue_head ( skb , & sch - > q ) ;
2012-01-04 17:35:26 +00:00
sch - > qstats . requeues + + ;
2008-08-04 22:31:03 -07:00
}
2005-04-16 15:20:36 -07:00
2016-05-02 12:20:15 -04:00
finish_segs :
if ( segs ) {
while ( segs ) {
skb2 = segs - > next ;
2018-07-29 20:42:53 -07:00
skb_mark_not_on_list ( segs ) ;
2016-05-02 12:20:15 -04:00
qdisc_skb_cb ( segs ) - > pkt_len = segs - > len ;
last_len = segs - > len ;
2016-06-21 23:16:49 -07:00
rc = qdisc_enqueue ( segs , sch , to_free ) ;
2016-05-02 12:20:15 -04:00
if ( rc ! = NET_XMIT_SUCCESS ) {
if ( net_xmit_drop_count ( rc ) )
qdisc_qstats_drop ( sch ) ;
} else {
nb + + ;
len + = last_len ;
}
segs = skb2 ;
}
sch - > q . qlen + = nb ;
if ( nb > 1 )
qdisc_tree_reduce_backlog ( sch , 1 - nb , prev_len - len ) ;
}
2011-02-23 13:04:20 +00:00
return NET_XMIT_SUCCESS ;
2005-04-16 15:20:36 -07:00
}
netem: support delivering packets in delayed time slots
Slotting is a crude approximation of the behaviors of shared media such
as cable, wifi, and LTE, which gather up a bunch of packets within a
varying delay window and deliver them, relative to that, nearly all at
once.
It works within the existing loss, duplication, jitter and delay
parameters of netem. Some amount of inherent latency must be specified,
regardless.
The new "slot" parameter specifies a minimum and maximum delay between
transmission attempts.
The "bytes" and "packets" parameters can be used to limit the amount of
information transferred per slot.
Examples of use:
tc qdisc add dev eth0 root netem delay 200us \
slot 800us 10ms bytes 64k packets 42
A more correct example, using stacked netem instances and a packet limit
to emulate a tail drop wifi queue with slots and variable packet
delivery, with a 200Mbit isochronous underlying rate, and 20ms path
delay:
tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
limit 10000
tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
slot 800us 10ms bytes 64k packets 42 limit 512
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
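How a slot with a minimum and maximum delay and per-slot budgets might be set up can be sketched in userspace C. The struct and function names below are hypothetical; the scaling of a 32-bit random value into the delay range follows the multiply-and-shift used when no slot distribution table is configured, and the sketch assumes the range fits in 32 bits so the product cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical configuration: delay bounds in ns plus per-slot budgets. */
struct slot_cfg {
	int64_t min_delay;
	int64_t max_delay;
	int32_t max_packets;
	int32_t max_bytes;
};

/* Hypothetical running state of the current slot. */
struct slot_state {
	uint64_t slot_next;	/* time at which the next slot opens */
	int32_t packets_left;
	int32_t bytes_left;
};

/* Map a 32-bit random value onto [min_delay, max_delay). */
static int64_t slot_delay_ns(int64_t min_delay, int64_t max_delay,
			     uint32_t rnd)
{
	uint64_t range;

	assert(max_delay >= min_delay);
	range = (uint64_t)(max_delay - min_delay);
	assert(range <= UINT32_MAX);	/* keeps rnd * range within 64 bits */
	return min_delay + (int64_t)(((uint64_t)rnd * range) >> 32);
}

/* Open a new slot: pick its start time and refill both budgets. */
static void start_next_slot(struct slot_state *s, const struct slot_cfg *c,
			    uint64_t now, uint32_t rnd)
{
	s->slot_next = now + (uint64_t)slot_delay_ns(c->min_delay,
						     c->max_delay, rnd);
	s->packets_left = c->max_packets;
	s->bytes_left = c->max_bytes;
}
```

With bounds of 800us and 10ms, a random value of 0 yields the minimum delay and a value of 0x80000000 lands at the midpoint of the range.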
2017-11-08 15:12:28 -08:00
/* Delay the next round with a new future slot with a
* correct number of bytes and packets .
*/
static void get_slot_next ( struct netem_sched_data * q , u64 now )
{
2018-06-27 10:32:19 -07:00
s64 next_delay ;
if ( ! q - > slot_dist )
next_delay = q - > slot_config . min_delay +
( prandom_u32 ( ) *
( q - > slot_config . max_delay -
q - > slot_config . min_delay ) > > 32 ) ;
else
next_delay = tabledist ( q - > slot_config . dist_delay ,
( s32 ) ( q - > slot_config . dist_jitter ) ,
NULL , q - > slot_dist ) ;
q - > slot . slot_next = now + next_delay ;
2017-11-08 15:12:28 -08:00
q - > slot . packets_left = q - > slot_config . max_packets ;
q - > slot . bytes_left = q - > slot_config . max_bytes ;
}
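The uniform branch of get_slot_next() maps a 32-bit random value onto the range between the minimum and maximum slot delay with a multiply and a shift rather than a modulo. The following is a minimal userspace sketch of that arithmetic, not the kernel code: `scale_to_range` is a hypothetical helper name, the random value is passed in explicitly so results are reproducible, and the span is assumed to fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Map a 32-bit random value r onto [min_delay, max_delay).
 * r / 2^32 is a fraction in [0, 1); multiplying by the span and
 * shifting right by 32 scales it without a division.
 * Hypothetical helper for illustration, not part of sch_netem.c.
 */
static int64_t scale_to_range(uint32_t r, int64_t min_delay, int64_t max_delay)
{
	uint64_t span;

	assert(max_delay >= min_delay);
	span = (uint64_t)(max_delay - min_delay);
	/* Assumption: span < 2^32, so the 64-bit product cannot overflow. */
	assert(span <= UINT32_MAX);
	return min_delay + (int64_t)(((uint64_t)r * span) >> 32);
}
```

With the "slot 800us 10ms" example from the commit message expressed in nanoseconds, r = 0 yields the minimum delay and the largest r yields one nanosecond less than the maximum, so the upper bound is exclusive.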
2018-12-04 11:55:56 -08:00
static struct sk_buff * netem_peek ( struct netem_sched_data * q )
{
struct sk_buff * skb = skb_rb_first ( & q - > t_root ) ;
u64 t1 , t2 ;
if ( ! skb )
return q - > t_head ;
if ( ! q - > t_head )
return skb ;
t1 = netem_skb_cb ( skb ) - > time_to_send ;
t2 = netem_skb_cb ( q - > t_head ) - > time_to_send ;
if ( t1 < t2 )
return skb ;
return q - > t_head ;
}
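netem_peek() chooses between two candidates, the first node of the rbtree and the head of the linear list, and returns whichever is due first. A simplified sketch of that selection, assuming a stand-in `struct pkt` that carries only the send time (both the struct and `peek_earliest` are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an skb; only the scheduled send time matters here. */
struct pkt {
	uint64_t time_to_send;
};

/* Return whichever candidate is due first. Ties go to the list head,
 * mirroring the strict "<" comparison in netem_peek().
 */
static const struct pkt *peek_earliest(const struct pkt *tree_first,
				       const struct pkt *list_head)
{
	if (!tree_first)
		return list_head;
	if (!list_head)
		return tree_first;
	return tree_first->time_to_send < list_head->time_to_send ?
		tree_first : list_head;
}
```

When both containers are empty the function returns NULL, which the caller can treat as "nothing queued".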
static void netem_erase_head ( struct netem_sched_data * q , struct sk_buff * skb )
{
if ( skb = = q - > t_head ) {
q - > t_head = skb - > next ;
if ( ! q - > t_head )
q - > t_tail = NULL ;
} else {
rb_erase ( & skb - > rbnode , & q - > t_root ) ;
}
}
2005-04-16 15:20:36 -07:00
static struct sk_buff * netem_dequeue ( struct Qdisc * sch )
{
struct netem_sched_data * q = qdisc_priv ( sch ) ;
struct sk_buff * skb ;
2011-12-28 23:12:02 +00:00
tfifo_dequeue :
2016-09-18 00:57:33 +02:00
skb = __qdisc_dequeue_head ( & sch - > q ) ;
2005-05-03 16:24:32 -07:00
if ( skb ) {
2014-09-28 11:53:29 -07:00
qdisc_qstats_backlog_dec ( sch , skb ) ;
2015-04-06 18:00:56 +00:00
deliver :
2013-06-28 07:40:57 -07:00
qdisc_bstats_update ( sch , skb ) ;
return skb ;
}
2018-12-04 11:55:56 -08:00
skb = netem_peek ( q ) ;
if ( skb ) {
2017-11-08 15:12:26 -08:00
u64 time_to_send ;
2017-11-08 15:12:28 -08:00
u64 now = ktime_get_ns ( ) ;
2013-07-03 14:04:14 -07:00
2005-05-26 12:55:01 -07:00
/* is there more time remaining? */
2013-07-03 14:04:14 -07:00
time_to_send = netem_skb_cb ( skb ) - > time_to_send ;
2017-11-08 15:12:28 -08:00
if ( q - > slot . slot_next & & q - > slot . slot_next < time_to_send )
get_slot_next ( q , now ) ;
2013-06-28 07:40:57 -07:00
2018-12-04 11:55:56 -08:00
if ( time_to_send < = now & & q - > slot . slot_next < = now ) {
netem_erase_head ( q , skb ) ;
2013-06-28 07:40:57 -07:00
sch - > q . qlen - - ;
2015-04-06 18:00:56 +00:00
qdisc_qstats_backlog_dec ( sch , skb ) ;
2013-06-28 07:40:57 -07:00
skb - > next = NULL ;
skb - > prev = NULL ;
2017-09-19 05:14:24 -07:00
/* skb->dev shares skb->rbnode area,
* we need to restore its value .
*/
skb - > dev = qdisc_dev ( sch ) ;
2008-10-31 00:46:19 -07:00
2017-11-08 15:12:28 -08:00
if ( q - > slot . slot_next ) {
q - > slot . packets_left - - ;
q - > slot . bytes_left - = qdisc_pkt_len ( skb ) ;
if ( q - > slot . packets_left < = 0 | |
q - > slot . bytes_left < = 0 )
get_slot_next ( q , now ) ;
}
2011-12-28 23:12:02 +00:00
if ( q - > qdisc ) {
2016-06-20 15:00:43 -07:00
unsigned int pkt_len = qdisc_pkt_len ( skb ) ;
2016-06-21 23:16:49 -07:00
struct sk_buff * to_free = NULL ;
int err ;
2011-12-28 23:12:02 +00:00
2016-06-21 23:16:49 -07:00
err = qdisc_enqueue ( skb , q - > qdisc , & to_free ) ;
kfree_skb_list ( to_free ) ;
2016-06-20 15:00:43 -07:00
if ( err ! = NET_XMIT_SUCCESS & &
net_xmit_drop_count ( err ) ) {
qdisc_qstats_drop ( sch ) ;
qdisc_tree_reduce_backlog ( sch , 1 ,
pkt_len ) ;
2011-12-28 23:12:02 +00:00
}
goto tfifo_dequeue ;
}
2013-06-28 07:40:57 -07:00
goto deliver ;
2005-11-03 13:43:07 -08:00
}
2007-03-22 12:17:42 -07:00
2011-12-28 23:12:02 +00:00
if ( q - > qdisc ) {
skb = q - > qdisc - > ops - > dequeue ( q - > qdisc ) ;
if ( skb )
goto deliver ;
}
2017-11-08 15:12:28 -08:00
qdisc_watchdog_schedule_ns ( & q - > watchdog ,
max ( time_to_send ,
q - > slot . slot_next ) ) ;
2005-05-26 12:55:01 -07:00
}
2011-12-28 23:12:02 +00:00
if ( q - > qdisc ) {
skb = q - > qdisc - > ops - > dequeue ( q - > qdisc ) ;
if ( skb )
goto deliver ;
}
2005-05-26 12:55:01 -07:00
return NULL ;
2005-04-16 15:20:36 -07:00
}
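The slot handling in netem_dequeue() can be read as a gate: a packet leaves only when both its own send time and the current slot have arrived, and each release consumes part of the slot's packet and byte budget. The sketch below is a simplified userspace model of that gate under stated assumptions: it uses a fixed `interval` where the kernel draws a random or table-driven delay, it omits the early re-slotting step the kernel performs when slot_next is older than time_to_send, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct slot_state {
	uint64_t slot_next;	/* time the current slot opens; 0 = slotting off */
	int32_t packets_left;
	int32_t bytes_left;
};

struct slot_limits {
	uint64_t interval;	/* assumption: fixed gap between slots */
	int32_t max_packets;
	int32_t max_bytes;
};

/* Schedule the next slot and refill both budgets. */
static void open_next_slot(struct slot_state *s, const struct slot_limits *c,
			   uint64_t now)
{
	s->slot_next = now + c->interval;
	s->packets_left = c->max_packets;
	s->bytes_left = c->max_bytes;
}

/* Return 1 if a packet of len bytes may be released at time now,
 * 0 if it has to wait for its send time or for the next slot.
 */
static int try_release(struct slot_state *s, const struct slot_limits *c,
		       uint64_t now, uint64_t time_to_send, int32_t len)
{
	assert(len > 0);
	if (time_to_send > now || s->slot_next > now)
		return 0;
	if (s->slot_next) {
		s->packets_left--;
		s->bytes_left -= len;
		/* Budget exhausted: later packets wait for the next slot. */
		if (s->packets_left <= 0 || s->bytes_left <= 0)
			open_next_slot(s, c, now);
	}
	return 1;
}
```

With a budget of two packets per slot, two back-to-back calls succeed and the third is held until `slot_next`, which is the bursty delivery the "slot" option is meant to approximate.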
static void netem_reset ( struct Qdisc * sch )
{
struct netem_sched_data * q = qdisc_priv ( sch ) ;
2011-12-28 23:12:02 +00:00
qdisc_reset_queue ( sch ) ;
2013-10-06 15:16:49 -07:00
tfifo_reset ( sch ) ;
2011-12-28 23:12:02 +00:00
if ( q - > qdisc )
qdisc_reset ( q - > qdisc ) ;
2007-03-16 01:20:31 -07:00
qdisc_watchdog_cancel ( & q - > watchdog ) ;
2005-04-16 15:20:36 -07:00
}
2011-02-23 13:04:18 +00:00
static void dist_free ( struct disttable * d )
{
2014-06-02 15:55:22 -07:00
kvfree ( d ) ;
2011-02-23 13:04:18 +00:00
}
2005-04-16 15:20:36 -07:00
/*
* Distribution data is a variable size payload containing
* signed 16 bit values .
*/
2017-11-08 15:12:28 -08:00
2018-06-27 10:32:19 -07:00
static int get_dist_table ( struct Qdisc * sch , struct disttable * * tbl ,
const struct nlattr * attr )
2005-04-16 15:20:36 -07:00
{
2011-02-23 13:04:18 +00:00
size_t n = nla_len ( attr ) / sizeof ( __s16 ) ;
2008-01-22 22:11:17 -08:00
const __s16 * data = nla_data ( attr ) ;
2008-07-16 01:42:40 -07:00
spinlock_t * root_lock ;
2005-04-16 15:20:36 -07:00
struct disttable * d ;
int i ;
2011-02-23 13:04:19 +00:00
if ( n > NETEM_DIST_MAX )
2005-04-16 15:20:36 -07:00
return - EINVAL ;
2017-05-08 15:57:27 -07:00
d = kvmalloc ( sizeof ( struct disttable ) + n * sizeof ( s16 ) , GFP_KERNEL ) ;
2005-04-16 15:20:36 -07:00
if ( ! d )
return - ENOMEM ;
d - > size = n ;
for ( i = 0 ; i < n ; i + + )
d - > table [ i ] = data [ i ] ;
2007-02-09 23:25:16 +09:00
2008-08-29 14:21:52 -07:00
root_lock = qdisc_root_sleeping_lock ( sch ) ;
2008-07-16 01:42:40 -07:00
spin_lock_bh ( root_lock ) ;
2018-06-27 10:32:19 -07:00
swap ( * tbl , d ) ;
2008-07-16 01:42:40 -07:00
spin_unlock_bh ( root_lock ) ;
2011-12-23 19:28:51 +00:00
dist_free ( d ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
2017-11-08 15:12:28 -08:00
static void get_slot ( struct netem_sched_data * q , const struct nlattr * attr )
{
const struct tc_netem_slot * c = nla_data ( attr ) ;
q - > slot_config = * c ;
if ( q - > slot_config . max_packets = = 0 )
q - > slot_config . max_packets = INT_MAX ;
if ( q - > slot_config . max_bytes = = 0 )
q - > slot_config . max_bytes = INT_MAX ;
q - > slot . packets_left = q - > slot_config . max_packets ;
q - > slot . bytes_left = q - > slot_config . max_bytes ;
2018-06-27 10:32:19 -07:00
if ( q - > slot_config . min_delay | q - > slot_config . max_delay |
q - > slot_config . dist_jitter )
2017-11-08 15:12:28 -08:00
q - > slot . slot_next = ktime_get_ns ( ) ;
else
q - > slot . slot_next = 0 ;
}
2014-02-14 10:30:42 +08:00
static void get_correlation ( struct netem_sched_data * q , const struct nlattr * attr )
2005-04-16 15:20:36 -07:00
{
2008-01-22 22:11:17 -08:00
const struct tc_netem_corr * c = nla_data ( attr ) ;
2005-04-16 15:20:36 -07:00
init_crandom ( & q - > delay_cor , c - > delay_corr ) ;
init_crandom ( & q - > loss_cor , c - > loss_corr ) ;
init_crandom ( & q - > dup_cor , c - > dup_corr ) ;
}
2014-02-14 10:30:42 +08:00
static void get_reorder ( struct netem_sched_data * q , const struct nlattr * attr )
2005-05-26 12:55:48 -07:00
{
2008-01-22 22:11:17 -08:00
const struct tc_netem_reorder * r = nla_data ( attr ) ;
2005-05-26 12:55:48 -07:00
q - > reorder = r - > probability ;
init_crandom ( & q - > reorder_cor , r - > correlation ) ;
}
2014-02-14 10:30:42 +08:00
static void get_corrupt ( struct netem_sched_data * q , const struct nlattr * attr )
2005-12-21 19:03:44 -08:00
{
2008-01-22 22:11:17 -08:00
const struct tc_netem_corrupt * r = nla_data ( attr ) ;
2005-12-21 19:03:44 -08:00
q - > corrupt = r - > probability ;
init_crandom ( & q - > corrupt_cor , r - > correlation ) ;
}
2014-02-14 10:30:42 +08:00
static void get_rate ( struct netem_sched_data * q , const struct nlattr * attr )
2011-11-30 12:20:26 +00:00
{
const struct tc_netem_rate * r = nla_data ( attr ) ;
q - > rate = r - > rate ;
2011-12-12 14:30:00 +00:00
q - > packet_overhead = r - > packet_overhead ;
q - > cell_size = r - > cell_size ;
reciprocal_divide: update/correction of the algorithm
Jakub Zawadzki noticed that some divisions by reciprocal_divide()
were not correct [1][2], which he could also show with BPF code
after divisions are transformed into reciprocal_value() for runtime
invariance which can be passed to reciprocal_divide() later on;
reverse in BPF dump ended up with a different, off-by-one K in
some situations.
This has been fixed by Eric Dumazet in commit aee636c4809fa5
("bpf: do not use reciprocal divide"). This follow-up patch
improves reciprocal_value() and reciprocal_divide() to work in
all cases by using Granlund and Montgomery method, so that also
future use is safe and without any non-obvious side-effects.
Known problems with the old implementation were that division by 1
always returned 0 and some off-by-ones when the dividend and divisor
were very large. This seemed to not be problematic with its
current users, as far as we can tell. Eric Dumazet checked for
the slab usage, we cannot surely say so in the case of flex_array.
Still, in order to fix that, we propose an extension from the
original implementation from commit 6a2d7a955d8d resp. [3][4],
by using the algorithm proposed in "Division by Invariant Integers
Using Multiplication" [5], Torbjörn Granlund and Peter L.
Montgomery, that is, pseudocode for q = n/d where q, n, d is in
u32 universe:
1) Initialization:
int l = ceil(log_2 d)
uword m' = floor((1<<32)*((1<<l)-d)/d)+1
int sh_1 = min(l,1)
int sh_2 = max(l-1,0)
2) For q = n/d, all uword:
uword t = (n*m')>>32
q = (t+((n-t)>>sh_1))>>sh_2
The assembler implementation from Agner Fog [6] also helped a lot
while implementing. We have tested the implementation on x86_64,
ppc64, i686, s390x; on x86_64/haswell we're still half the latency
compared to normal divide.
Joint work with Daniel Borkmann.
[1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
[2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
[3] https://gmplib.org/~tege/division-paper.pdf
[4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
[5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
[6] http://www.agner.org/optimize/asmlib.zip
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: Jesse Gross <jesse@nicira.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 02:29:41 +01:00
q - > cell_overhead = r - > cell_overhead ;
2011-12-12 14:30:00 +00:00
if ( q - > cell_size )
q - > cell_size_reciprocal = reciprocal_value ( q - > cell_size ) ;
2014-01-22 02:29:41 +01:00
else
q - > cell_size_reciprocal = ( struct reciprocal_value ) { 0 } ;
2011-11-30 12:20:26 +00:00
}
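get_rate() precomputes a reciprocal of cell_size so that later divisions can be replaced by a multiply and two shifts. The pseudocode in the reciprocal_divide commit message can be sketched in portable userspace C as follows. This is an illustration under assumptions, not the kernel implementation: `struct recip`, `ceil_log2`, `recip_value` and `recip_divide` are hypothetical names, and the divisor is assumed to be non-zero.

```c
#include <assert.h>
#include <stdint.h>

struct recip {
	uint32_t m;		/* m' from the pseudocode */
	uint8_t sh1, sh2;
};

/* ceil(log2(d)) for d >= 1. */
static int ceil_log2(uint32_t d)
{
	int l = 0;

	while (l < 32 && ((uint64_t)1 << l) < d)
		l++;
	return l;
}

/* Initialization step: derive the multiplier and shifts for divisor d. */
static struct recip recip_value(uint32_t d)
{
	struct recip r;
	uint64_t m;
	int l;

	assert(d != 0);
	l = ceil_log2(d);
	/* m' = floor(2^32 * (2^l - d) / d) + 1, computed in 64 bits. */
	m = (((uint64_t)1 << 32) * (((uint64_t)1 << l) - d)) / d + 1;
	r.m = (uint32_t)m;
	r.sh1 = (uint8_t)(l < 1 ? l : 1);	/* min(l, 1) */
	r.sh2 = (uint8_t)(l > 1 ? l - 1 : 0);	/* max(l - 1, 0) */
	return r;
}

/* Division step: q = n / d using only a multiply, a subtract and shifts. */
static uint32_t recip_divide(uint32_t n, struct recip r)
{
	uint32_t t = (uint32_t)(((uint64_t)n * r.m) >> 32);

	return (t + ((n - t) >> r.sh1)) >> r.sh2;
}
```

The divisor-of-1 case, which the commit message lists as broken in the old implementation, comes out as l = 0, m' = 1 and both shifts 0, so the result is n itself.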
2014-02-14 10:30:42 +08:00
static int get_loss_clg ( struct netem_sched_data * q , const struct nlattr * attr )
2011-02-23 13:04:21 +00:00
{
const struct nlattr * la ;
int rem ;
nla_for_each_nested ( la , attr , rem ) {
u16 type = nla_type ( la ) ;
2013-12-10 20:55:32 +08:00
switch ( type ) {
2011-02-23 13:04:21 +00:00
case NETEM_LOSS_GI : {
const struct tc_netem_gimodel * gi = nla_data ( la ) ;
2011-12-23 09:16:30 +00:00
if ( nla_len ( la ) < sizeof ( struct tc_netem_gimodel ) ) {
2011-02-23 13:04:21 +00:00
pr_info ( " netem: incorrect gi model size \n " ) ;
return - EINVAL ;
}
q - > loss_model = CLG_4_STATES ;
2014-02-17 16:48:21 +08:00
q - > clg . state = TX_IN_GAP_PERIOD ;
2011-02-23 13:04:21 +00:00
q - > clg . a1 = gi - > p13 ;
q - > clg . a2 = gi - > p31 ;
q - > clg . a3 = gi - > p32 ;
q - > clg . a4 = gi - > p14 ;
q - > clg . a5 = gi - > p23 ;
break ;
}
case NETEM_LOSS_GE : {
const struct tc_netem_gemodel * ge = nla_data ( la ) ;
2011-12-23 09:16:30 +00:00
if ( nla_len ( la ) < sizeof ( struct tc_netem_gemodel ) ) {
pr_info ( " netem: incorrect ge model size \n " ) ;
2011-02-23 13:04:21 +00:00
return - EINVAL ;
}
q - > loss_model = CLG_GILB_ELL ;
2014-02-17 16:48:21 +08:00
q - > clg . state = GOOD_STATE ;
2011-02-23 13:04:21 +00:00
q - > clg . a1 = ge - > p ;
q - > clg . a2 = ge - > r ;
q - > clg . a3 = ge - > h ;
q - > clg . a4 = ge - > k1 ;
break ;
}
default :
pr_info ( " netem: unknown loss type %u \n " , type ) ;
return - EINVAL ;
}
}
return 0 ;
}
2008-01-23 20:35:39 -08:00
static const struct nla_policy netem_policy [ TCA_NETEM_MAX + 1 ] = {
[ TCA_NETEM_CORR ] = { . len = sizeof ( struct tc_netem_corr ) } ,
[ TCA_NETEM_REORDER ] = { . len = sizeof ( struct tc_netem_reorder ) } ,
[ TCA_NETEM_CORRUPT ] = { . len = sizeof ( struct tc_netem_corrupt ) } ,
2011-11-30 12:20:26 +00:00
[ TCA_NETEM_RATE ] = { . len = sizeof ( struct tc_netem_rate ) } ,
2011-02-23 13:04:21 +00:00
[ TCA_NETEM_LOSS ] = { . type = NLA_NESTED } ,
2012-04-30 23:11:05 +00:00
[ TCA_NETEM_ECN ] = { . type = NLA_U32 } ,
2013-12-25 17:35:15 +08:00
[ TCA_NETEM_RATE64 ] = { . type = NLA_U64 } ,
2017-11-08 15:12:27 -08:00
[ TCA_NETEM_LATENCY64 ] = { . type = NLA_S64 } ,
[ TCA_NETEM_JITTER64 ] = { . type = NLA_S64 } ,
2017-11-08 15:12:28 -08:00
[ TCA_NETEM_SLOT ] = { . len = sizeof ( struct tc_netem_slot ) } ,
2008-01-23 20:35:39 -08:00
} ;
2008-09-02 17:30:27 -07:00
static int parse_attr ( struct nlattr * tb [ ] , int maxtype , struct nlattr * nla ,
const struct nla_policy * policy , int len )
{
int nested_len = nla_len ( nla ) - NLA_ALIGN ( len ) ;
2011-02-23 13:04:21 +00:00
if ( nested_len < 0 ) {
pr_info ( " netem: invalid attributes len %d \n " , nested_len ) ;
2008-09-02 17:30:27 -07:00
return - EINVAL ;
2011-02-23 13:04:21 +00:00
}
2008-09-02 17:30:27 -07:00
if ( nested_len > = nla_attr_size ( 0 ) )
return nla_parse ( tb , maxtype , nla_data ( nla ) + NLA_ALIGN ( len ) ,
2017-04-12 14:34:07 +02:00
nested_len , policy , NULL ) ;
2011-02-23 13:04:21 +00:00
2008-09-02 17:30:27 -07:00
memset ( tb , 0 , sizeof ( struct nlattr * ) * ( maxtype + 1 ) ) ;
return 0 ;
}
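parse_attr() treats the netem option payload as a fixed header followed by optional nested attributes: it skips the header, rounded up to the netlink alignment, and only parses what remains if at least one attribute header fits. A small sketch of that length arithmetic follows. The SKETCH_ macros mirror the 4-byte netlink attribute alignment and header size but are defined locally, and the helper names are hypothetical.

```c
#include <assert.h>

#define SKETCH_NLA_ALIGNTO	4
#define SKETCH_NLA_ALIGN(len) \
	(((len) + SKETCH_NLA_ALIGNTO - 1) & ~(SKETCH_NLA_ALIGNTO - 1))
#define SKETCH_NLA_HDRLEN	4

/* Bytes of nested attributes that follow a fixed header of hdr_len
 * bytes inside a payload of payload_len bytes, or -1 if the payload
 * is too short to hold the aligned header.
 */
static int nested_attr_len(int payload_len, int hdr_len)
{
	int nested;

	assert(payload_len >= 0 && hdr_len >= 0);
	nested = payload_len - SKETCH_NLA_ALIGN(hdr_len);
	return nested < 0 ? -1 : nested;
}

/* True when the remainder can hold at least one attribute header. */
static int has_nested_attrs(int payload_len, int hdr_len)
{
	return nested_attr_len(payload_len, hdr_len) >= SKETCH_NLA_HDRLEN;
}
```

A payload that contains only the fixed header gives a nested length of 0, which corresponds to the branch in parse_attr() that clears the attribute table and returns success.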
2005-12-21 19:03:44 -08:00
/* Parse netlink message to set options */
2017-12-20 12:35:14 -05:00
static int netem_change ( struct Qdisc * sch , struct nlattr * opt ,
struct netlink_ext_ack * extack )
2005-04-16 15:20:36 -07:00
{
struct netem_sched_data * q = qdisc_priv ( sch ) ;
2008-01-23 20:32:21 -08:00
struct nlattr * tb [ TCA_NETEM_MAX + 1 ] ;
2005-04-16 15:20:36 -07:00
struct tc_netem_qopt * qopt ;
2014-02-14 10:30:41 +08:00
struct clgstate old_clg ;
int old_loss_model = CLG_RANDOM ;
2005-04-16 15:20:36 -07:00
int ret ;
2007-02-09 23:25:16 +09:00
2008-01-23 20:32:21 -08:00
if ( opt = = NULL )
2005-04-16 15:20:36 -07:00
return - EINVAL ;
2008-09-02 17:30:27 -07:00
qopt = nla_data ( opt ) ;
ret = parse_attr ( tb , TCA_NETEM_MAX , opt , netem_policy , sizeof ( * qopt ) ) ;
2008-01-23 20:32:21 -08:00
if ( ret < 0 )
return ret ;
2014-02-14 10:30:41 +08:00
/* backup q->clg and q->loss_model */
old_clg = q - > clg ;
old_loss_model = q - > loss_model ;
if ( tb [ TCA_NETEM_LOSS ] ) {
2014-02-14 10:30:42 +08:00
ret = get_loss_clg ( q , tb [ TCA_NETEM_LOSS ] ) ;
2014-02-14 10:30:41 +08:00
if ( ret ) {
q - > loss_model = old_loss_model ;
return ret ;
}
} else {
q - > loss_model = CLG_RANDOM ;
}
if ( tb [ TCA_NETEM_DELAY_DIST ] ) {
2018-06-27 10:32:19 -07:00
ret = get_dist_table ( sch , & q - > delay_dist ,
tb [ TCA_NETEM_DELAY_DIST ] ) ;
if ( ret )
goto get_table_failure ;
}
if ( tb [ TCA_NETEM_SLOT_DIST ] ) {
ret = get_dist_table ( sch , & q - > slot_dist ,
tb [ TCA_NETEM_SLOT_DIST ] ) ;
if ( ret )
goto get_table_failure ;
2014-02-14 10:30:41 +08:00
}
2011-12-28 23:12:02 +00:00
sch - > limit = qopt - > limit ;
2007-02-09 23:25:16 +09:00
2017-11-08 15:12:26 -08:00
q - > latency = PSCHED_TICKS2NS ( qopt - > latency ) ;
q - > jitter = PSCHED_TICKS2NS ( qopt - > jitter ) ;
2005-04-16 15:20:36 -07:00
q - > limit = qopt - > limit ;
q - > gap = qopt - > gap ;
2005-05-26 12:55:48 -07:00
q - > counter = 0 ;
2005-04-16 15:20:36 -07:00
q - > loss = qopt - > loss ;
q - > duplicate = qopt - > duplicate ;
2007-03-23 00:12:09 -07:00
/* for compatibility with earlier versions.
* if gap is set , need to assume 100 % probability
2005-05-26 12:55:48 -07:00
*/
2007-03-22 12:15:45 -07:00
if ( q - > gap )
q - > reorder = ~ 0 ;
2005-05-26 12:55:48 -07:00
2008-11-03 21:13:26 -08:00
if ( tb [ TCA_NETEM_CORR ] )
2014-02-14 10:30:42 +08:00
get_correlation ( q , tb [ TCA_NETEM_CORR ] ) ;
2005-04-16 15:20:36 -07:00
2008-11-03 21:13:26 -08:00
if ( tb [ TCA_NETEM_REORDER ] )
2014-02-14 10:30:42 +08:00
get_reorder ( q , tb [ TCA_NETEM_REORDER ] ) ;
2005-04-16 15:20:36 -07:00
2008-11-03 21:13:26 -08:00
if ( tb [ TCA_NETEM_CORRUPT ] )
2014-02-14 10:30:42 +08:00
get_corrupt ( q , tb [ TCA_NETEM_CORRUPT ] ) ;
2005-04-16 15:20:36 -07:00
2011-11-30 12:20:26 +00:00
if ( tb [ TCA_NETEM_RATE ] )
2014-02-14 10:30:42 +08:00
get_rate ( q , tb [ TCA_NETEM_RATE ] ) ;
2011-11-30 12:20:26 +00:00
2013-12-25 17:35:15 +08:00
if ( tb [ TCA_NETEM_RATE64 ] )
q - > rate = max_t ( u64 , q - > rate ,
nla_get_u64 ( tb [ TCA_NETEM_RATE64 ] ) ) ;
2017-11-08 15:12:27 -08:00
if ( tb [ TCA_NETEM_LATENCY64 ] )
q - > latency = nla_get_s64 ( tb [ TCA_NETEM_LATENCY64 ] ) ;
if ( tb [ TCA_NETEM_JITTER64 ] )
q - > jitter = nla_get_s64 ( tb [ TCA_NETEM_JITTER64 ] ) ;
2012-04-30 23:11:05 +00:00
if ( tb [ TCA_NETEM_ECN ] )
q - > ecn = nla_get_u32 ( tb [ TCA_NETEM_ECN ] ) ;
2017-11-08 15:12:28 -08:00
if ( tb [ TCA_NETEM_SLOT ] )
get_slot ( q , tb [ TCA_NETEM_SLOT ] ) ;
2011-02-23 13:04:21 +00:00
return ret ;
2018-06-27 10:32:19 -07:00
get_table_failure :
/* restore clg and loss_model, in case
* q - > clg and q - > loss_model were modified
* in get_loss_clg ( )
*/
q - > clg = old_clg ;
q - > loss_model = old_loss_model ;
return ret ;
2005-04-16 15:20:36 -07:00
}
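The get_table_failure path above puts q->clg and q->loss_model back from saved copies when a later parsing step fails. Below is a minimal user-space sketch of that save-then-restore pattern. All names in it (loss_state, parse_loss, load_table, apply_change) are hypothetical stand-ins for illustration and are not the kernel's own helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the loss-model state kept in the qdisc. */
enum loss_model { MODEL_RANDOM, MODEL_4_STATES };

struct loss_state {
	enum loss_model model;
	unsigned int a1;	/* one model parameter, for illustration */
};

/* Step 1: overwrites *s before any later step has had a chance to fail. */
static void parse_loss(struct loss_state *s, unsigned int a1)
{
	s->model = MODEL_4_STATES;
	s->a1 = a1;
}

/* Step 2: hypothetical table load that fails on demand. */
static int load_table(int should_fail)
{
	return should_fail ? -ENOMEM : 0;
}

/* Apply both steps; on failure, put the saved state back. */
static int apply_change(struct loss_state *s, unsigned int a1, int fail_table)
{
	struct loss_state old;
	int ret;

	assert(s != NULL);
	old = *s;		/* snapshot before modifying anything */

	parse_loss(s, a1);
	ret = load_table(fail_table);
	if (ret)
		goto table_failure;
	return 0;

table_failure:
	*s = old;		/* undo the partial update */
	return ret;
}
```

Calling apply_change() with fail_table set leaves the caller's state exactly as it was, which is the property the recovery label is there to provide.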
2017-12-20 12:35:13 -05:00
static int netem_init ( struct Qdisc * sch , struct nlattr * opt ,
struct netlink_ext_ack * extack )
2005-04-16 15:20:36 -07:00
{
struct netem_sched_data * q = qdisc_priv ( sch ) ;
int ret ;
2017-08-30 12:49:03 +03:00
qdisc_watchdog_init ( & q - > watchdog , sch ) ;
2005-04-16 15:20:36 -07:00
if ( ! opt )
return - EINVAL ;
2011-02-23 13:04:21 +00:00
q - > loss_model = CLG_RANDOM ;
2017-12-20 12:35:14 -05:00
ret = netem_change ( sch , opt , extack ) ;
2011-12-28 23:12:02 +00:00
if ( ret )
2011-02-23 13:04:22 +00:00
pr_info ( " netem: change failed \n " ) ;
2005-04-16 15:20:36 -07:00
return ret ;
}
static void netem_destroy ( struct Qdisc * sch )
{
struct netem_sched_data * q = qdisc_priv ( sch ) ;
2007-03-16 01:20:31 -07:00
qdisc_watchdog_cancel ( & q - > watchdog ) ;
2011-12-28 23:12:02 +00:00
if ( q - > qdisc )
2018-09-24 19:22:50 +03:00
qdisc_put ( q - > qdisc ) ;
2011-02-23 13:04:18 +00:00
dist_free ( q - > delay_dist ) ;
2018-06-27 10:32:19 -07:00
dist_free ( q - > slot_dist ) ;
2005-04-16 15:20:36 -07:00
}
2011-02-23 13:04:21 +00:00
static int dump_loss_model ( const struct netem_sched_data * q ,
struct sk_buff * skb )
{
struct nlattr * nest ;
nest = nla_nest_start ( skb , TCA_NETEM_LOSS ) ;
if ( nest = = NULL )
goto nla_put_failure ;
switch ( q - > loss_model ) {
case CLG_RANDOM :
/* legacy loss model */
nla_nest_cancel ( skb , nest ) ;
return 0 ; /* no data */
case CLG_4_STATES : {
struct tc_netem_gimodel gi = {
. p13 = q - > clg . a1 ,
. p31 = q - > clg . a2 ,
. p32 = q - > clg . a3 ,
. p14 = q - > clg . a4 ,
. p23 = q - > clg . a5 ,
} ;
2012-03-29 05:11:39 -04:00
if ( nla_put ( skb , NETEM_LOSS_GI , sizeof ( gi ) , & gi ) )
goto nla_put_failure ;
2011-02-23 13:04:21 +00:00
break ;
}
case CLG_GILB_ELL : {
struct tc_netem_gemodel ge = {
. p = q - > clg . a1 ,
. r = q - > clg . a2 ,
. h = q - > clg . a3 ,
. k1 = q - > clg . a4 ,
} ;
2012-03-29 05:11:39 -04:00
if ( nla_put ( skb , NETEM_LOSS_GE , sizeof ( ge ) , & ge ) )
goto nla_put_failure ;
2011-02-23 13:04:21 +00:00
break ;
}
}
nla_nest_end ( skb , nest ) ;
return 0 ;
nla_put_failure :
nla_nest_cancel ( skb , nest ) ;
return - 1 ;
}
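dump_loss_model opens a nested attribute, cancels it when the legacy model has nothing to report or when a put fails, and closes it otherwise. The following is a rough user-space sketch of that start/cancel/end discipline over a flat byte buffer. The msgbuf type, the 2-byte length header, and every function name are assumptions made for this example; they do not reproduce the netlink attribute layout or API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_CAP 64

/* Hypothetical message buffer; not the kernel's sk_buff. */
struct msgbuf {
	unsigned char data[BUF_CAP];
	size_t len;
};

/* Reserve a 2-byte length header; return its offset, or -1 if full. */
static long nest_start(struct msgbuf *m)
{
	long off;

	assert(m != NULL);
	if (m->len + 2 > BUF_CAP)
		return -1;
	off = (long)m->len;
	m->data[m->len++] = 0;
	m->data[m->len++] = 0;
	return off;
}

/* Append n payload bytes; return -1 if they do not fit. */
static int put_bytes(struct msgbuf *m, const void *src, size_t n)
{
	if (n > BUF_CAP - m->len)
		return -1;
	memcpy(m->data + m->len, src, n);
	m->len += n;
	return 0;
}

/* Close the nest: record its total length, header included. */
static void nest_end(struct msgbuf *m, long off)
{
	size_t total = m->len - (size_t)off;

	m->data[off] = (unsigned char)(total & 0xff);
	m->data[off + 1] = (unsigned char)(total >> 8);
}

/* Abandon the nest: drop everything written since nest_start(). */
static void nest_cancel(struct msgbuf *m, long off)
{
	m->len = (size_t)off;
}

/* Emit nothing for the default model, a nested payload otherwise. */
static int dump_model(struct msgbuf *m, int is_default, unsigned int param)
{
	long nest = nest_start(m);

	if (nest < 0)
		return -1;
	if (is_default) {
		nest_cancel(m, nest);	/* no data to report */
		return 0;
	}
	if (put_bytes(m, &param, sizeof(param))) {
		nest_cancel(m, nest);	/* roll back the partial nest */
		return -1;
	}
	nest_end(m, nest);
	return 0;
}
```

In this sketch, cancelling simply truncates the buffer to the offset recorded at nest start, so a reader never sees a half-written nest.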
2005-04-16 15:20:36 -07:00
static int netem_dump ( struct Qdisc * sch , struct sk_buff * skb )
{
const struct netem_sched_data * q = qdisc_priv ( sch ) ;
2011-02-23 13:04:17 +00:00
struct nlattr * nla = ( struct nlattr * ) skb_tail_pointer ( skb ) ;
2005-04-16 15:20:36 -07:00
struct tc_netem_qopt qopt ;
struct tc_netem_corr cor ;
2005-05-26 12:55:48 -07:00
struct tc_netem_reorder reorder ;
2005-12-21 19:03:44 -08:00
struct tc_netem_corrupt corrupt ;
2011-11-30 12:20:26 +00:00
struct tc_netem_rate rate ;
2017-11-08 15:12:28 -08:00
struct tc_netem_slot slot ;
2005-04-16 15:20:36 -07:00
2017-11-08 15:12:26 -08:00
qopt . latency = min_t ( psched_tdiff_t , PSCHED_NS2TICKS ( q - > latency ) ,
UINT_MAX ) ;
qopt . jitter = min_t ( psched_tdiff_t , PSCHED_NS2TICKS ( q - > jitter ) ,
UINT_MAX ) ;
2005-04-16 15:20:36 -07:00
qopt . limit = q - > limit ;
qopt . loss = q - > loss ;
qopt . gap = q - > gap ;
qopt . duplicate = q - > duplicate ;
2012-03-29 05:11:39 -04:00
if ( nla_put ( skb , TCA_OPTIONS , sizeof ( qopt ) , & qopt ) )
goto nla_put_failure ;
2005-04-16 15:20:36 -07:00
2017-11-08 15:12:27 -08:00
if ( nla_put ( skb , TCA_NETEM_LATENCY64 , sizeof ( q - > latency ) , & q - > latency ) )
goto nla_put_failure ;
if ( nla_put ( skb , TCA_NETEM_JITTER64 , sizeof ( q - > jitter ) , & q - > jitter ) )
goto nla_put_failure ;
2005-04-16 15:20:36 -07:00
cor . delay_corr = q - > delay_cor . rho ;
cor . loss_corr = q - > loss_cor . rho ;
cor . dup_corr = q - > dup_cor . rho ;
2012-03-29 05:11:39 -04:00
if ( nla_put ( skb , TCA_NETEM_CORR , sizeof ( cor ) , & cor ) )
goto nla_put_failure ;
2005-05-26 12:55:48 -07:00
reorder . probability = q - > reorder ;
reorder . correlation = q - > reorder_cor . rho ;
2012-03-29 05:11:39 -04:00
if ( nla_put ( skb , TCA_NETEM_REORDER , sizeof ( reorder ) , & reorder ) )
goto nla_put_failure ;
2005-05-26 12:55:48 -07:00
2005-12-21 19:03:44 -08:00
corrupt . probability = q - > corrupt ;
corrupt . correlation = q - > corrupt_cor . rho ;
2012-03-29 05:11:39 -04:00
if ( nla_put ( skb , TCA_NETEM_CORRUPT , sizeof ( corrupt ) , & corrupt ) )
goto nla_put_failure ;
2005-12-21 19:03:44 -08:00
2013-12-25 17:35:15 +08:00
if ( q - > rate > = ( 1ULL < < 32 ) ) {
2016-04-25 10:25:15 +02:00
if ( nla_put_u64_64bit ( skb , TCA_NETEM_RATE64 , q - > rate ,
TCA_NETEM_PAD ) )
2013-12-25 17:35:15 +08:00
goto nla_put_failure ;
rate . rate = ~ 0U ;
} else {
rate . rate = q - > rate ;
}
2011-12-12 14:30:00 +00:00
rate . packet_overhead = q - > packet_overhead ;
rate . cell_size = q - > cell_size ;
rate . cell_overhead = q - > cell_overhead ;
2012-03-29 05:11:39 -04:00
if ( nla_put ( skb , TCA_NETEM_RATE , sizeof ( rate ) , & rate ) )
goto nla_put_failure ;
2011-11-30 12:20:26 +00:00
2012-04-30 23:11:05 +00:00
if ( q - > ecn & & nla_put_u32 ( skb , TCA_NETEM_ECN , q - > ecn ) )
goto nla_put_failure ;
2011-02-23 13:04:21 +00:00
if ( dump_loss_model ( q , skb ) ! = 0 )
goto nla_put_failure ;
2018-06-27 10:32:19 -07:00
if ( q - > slot_config . min_delay | q - > slot_config . max_delay |
q - > slot_config . dist_jitter ) {
2017-11-08 15:12:28 -08:00
slot = q - > slot_config ;
if ( slot . max_packets = = INT_MAX )
slot . max_packets = 0 ;
if ( slot . max_bytes = = INT_MAX )
slot . max_bytes = 0 ;
if ( nla_put ( skb , TCA_NETEM_SLOT , sizeof ( slot ) , & slot ) )
goto nla_put_failure ;
}
2011-02-23 13:04:17 +00:00
return nla_nest_end ( skb , nla ) ;
2005-04-16 15:20:36 -07:00
2008-01-22 22:11:17 -08:00
nla_put_failure :
2011-02-23 13:04:17 +00:00
nlmsg_trim ( skb , nla ) ;
2005-04-16 15:20:36 -07:00
return - 1 ;
}
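netem_dump reports the rate through the 32-bit field of tc_netem_rate and adds a separate 64-bit attribute only when the value does not fit, saturating the 32-bit field to all ones in that case. The parsing code at the top of this section takes the larger of the two values. A small user-space sketch of that round trip follows. The wire_rate struct and the encode_rate/decode_rate names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire form: legacy 32-bit field plus optional 64-bit value. */
struct wire_rate {
	uint32_t rate32;
	int has_rate64;
	uint64_t rate64;
};

/* Dump side: use the 64-bit slot only when the rate overflows 32 bits. */
static void encode_rate(uint64_t rate, struct wire_rate *w)
{
	assert(w != NULL);
	if (rate >= (1ULL << 32)) {
		w->has_rate64 = 1;
		w->rate64 = rate;
		w->rate32 = UINT32_MAX;	/* saturate the legacy field */
	} else {
		w->has_rate64 = 0;
		w->rate64 = 0;
		w->rate32 = (uint32_t)rate;
	}
}

/* Parse side: start from the 32-bit field, then take the larger value. */
static uint64_t decode_rate(const struct wire_rate *w)
{
	uint64_t rate;

	assert(w != NULL);
	rate = w->rate32;
	if (w->has_rate64 && w->rate64 > rate)
		rate = w->rate64;
	return rate;
}
```

A reader that only understands the 32-bit field sees a saturated value for large rates, while one that understands both recovers the exact rate.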
2011-02-23 13:04:20 +00:00
static int netem_dump_class ( struct Qdisc * sch , unsigned long cl ,
struct sk_buff * skb , struct tcmsg * tcm )
{
struct netem_sched_data * q = qdisc_priv ( sch ) ;
2011-12-28 23:12:02 +00:00
if ( cl ! = 1 | | ! q - > qdisc ) /* only one class */
2011-02-23 13:04:20 +00:00
return - ENOENT ;
tcm - > tcm_handle | = TC_H_MIN ( 1 ) ;
tcm - > tcm_info = q - > qdisc - > handle ;
return 0 ;
}
static int netem_graft ( struct Qdisc * sch , unsigned long arg , struct Qdisc * new ,
2017-12-20 12:35:17 -05:00
struct Qdisc * * old , struct netlink_ext_ack * extack )
2011-02-23 13:04:20 +00:00
{
struct netem_sched_data * q = qdisc_priv ( sch ) ;
2016-02-25 14:55:00 -08:00
* old = qdisc_replace ( sch , new , & q - > qdisc ) ;
2011-02-23 13:04:20 +00:00
return 0 ;
}
static struct Qdisc * netem_leaf ( struct Qdisc * sch , unsigned long arg )
{
struct netem_sched_data * q = qdisc_priv ( sch ) ;
return q - > qdisc ;
}
net_sched: remove tc class reference counting
For TC classes, their ->get() and ->put() are always paired, and the
reference counting is completely useless, because:
1) For class modification and dumping paths, we already hold RTNL lock,
so all of these ->get(),->change(),->put() are atomic.
2) For filter binding/unbinding, we use other reference counter than
this one, and they should have RTNL lock too.
3) For ->qlen_notify(), it is special because it is called on ->enqueue()
path, but we already hold qdisc tree lock there, and we hold this
tree lock when graft or delete the class too, so it should not be gone
or changed until we release the tree lock.
Therefore, this patch removes ->get() and ->put(), but:
1) Adds a new ->find() to find the pointer to a class by classid, no
refcnt.
2) Move the original class destroy upon the last refcnt into ->delete(),
right after releasing tree lock. This is fine because the class is
already removed from hash when holding the lock.
For those who also use ->put() as ->unbind(), just rename them to reflect
this change.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 16:51:29 -07:00
static unsigned long netem_find ( struct Qdisc * sch , u32 classid )
2011-02-23 13:04:20 +00:00
{
return 1 ;
}
static void netem_walk ( struct Qdisc * sch , struct qdisc_walker * walker )
{
if ( ! walker - > stop ) {
if ( walker - > count > = walker - > skip )
if ( walker - > fn ( sch , 1 , walker ) < 0 ) {
walker - > stop = 1 ;
return ;
}
walker - > count + + ;
}
}
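netem_walk visits the single class (id 1) while honouring the walker's skip, count, and stop fields. The sketch below restates that protocol in user space. The walker struct here is a simplified, hypothetical version of struct qdisc_walker with the Qdisc argument dropped, and the two callbacks are made up for demonstration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical walker; the real one also passes the Qdisc. */
struct walker {
	int stop;
	unsigned long skip;	/* classes to pass over before calling fn */
	unsigned long count;	/* classes seen so far */
	int (*fn)(unsigned long cl, struct walker *w);
};

/* Visit the one and only class, id 1, unless skipped or already stopped. */
static void walk_single_class(struct walker *w)
{
	assert(w != NULL && w->fn != NULL);
	if (w->stop)
		return;
	if (w->count >= w->skip && w->fn(1, w) < 0) {
		w->stop = 1;	/* callback asked to abort the walk */
		return;
	}
	w->count++;
}

static int visits;

/* Demo callback: count visits to class 1 and keep going. */
static int count_visit(unsigned long cl, struct walker *w)
{
	(void)w;
	visits += (cl == 1);
	return 0;
}

/* Demo callback: always abort. */
static int refuse_visit(unsigned long cl, struct walker *w)
{
	(void)cl;
	(void)w;
	return -1;
}
```

Note that count advances even when the class is skipped, so a caller resuming a dump can pass the previous count as the next skip value.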
static const struct Qdisc_class_ops netem_class_ops = {
. graft = netem_graft ,
. leaf = netem_leaf ,
2017-08-24 16:51:29 -07:00
. find = netem_find ,
2011-02-23 13:04:20 +00:00
. walk = netem_walk ,
. dump = netem_dump_class ,
} ;
2007-11-14 01:44:41 -08:00
static struct Qdisc_ops netem_qdisc_ops __read_mostly = {
2005-04-16 15:20:36 -07:00
. id = " netem " ,
2011-02-23 13:04:20 +00:00
. cl_ops = & netem_class_ops ,
2005-04-16 15:20:36 -07:00
. priv_size = sizeof ( struct netem_sched_data ) ,
. enqueue = netem_enqueue ,
. dequeue = netem_dequeue ,
2008-10-31 00:47:01 -07:00
. peek = qdisc_peek_dequeued ,
2005-04-16 15:20:36 -07:00
. init = netem_init ,
. reset = netem_reset ,
. destroy = netem_destroy ,
. change = netem_change ,
. dump = netem_dump ,
. owner = THIS_MODULE ,
} ;
static int __init netem_module_init ( void )
{
2005-11-03 13:49:01 -08:00
pr_info ( " netem: version " VERSION " \n " ) ;
2005-04-16 15:20:36 -07:00
return register_qdisc ( & netem_qdisc_ops ) ;
}
static void __exit netem_module_exit ( void )
{
unregister_qdisc ( & netem_qdisc_ops ) ;
}
module_init ( netem_module_init )
module_exit ( netem_module_exit )
MODULE_LICENSE ( " GPL " ) ;