2019-05-19 13:07:45 +01:00
# SPDX-License-Identifier: GPL-2.0-only
2005-04-16 15:20:36 -07:00
#
# Traffic control configuration.
2018-07-24 12:29:01 -07:00
#
2005-07-11 21:13:56 -07:00
2007-10-18 21:56:38 -07:00
menuconfig NET_SCHED
2005-07-11 21:13:56 -07:00
bool "QoS and/or fair queueing"
2006-11-09 16:16:21 -08:00
select NET_SCH_FIFO
2020-06-14 01:50:22 +09:00
help
2005-07-11 21:13:56 -07:00
When the kernel has several packets to send out over a network
device, it has to decide which ones to send first, which ones to
2005-11-01 15:13:02 +01:00
delay, and which ones to drop. This is the job of the queueing
disciplines, several different algorithms for how to do this
2005-07-11 21:13:56 -07:00
"fairly" have been proposed.
If you say N here, you will get the standard packet scheduler, which
is a FIFO (first come, first served). If you say Y here, you will be
able to choose from among several alternative algorithms which can
then be attached to different network devices. This is useful for
example if some of your network devices are real time devices that
need a certain minimum data flow rate, or if you need to limit the
maximum data flow rate for traffic which matches specified criteria.
This code is considered to be experimental.
To administer these schedulers, you'll need the user-level utilities
2014-12-03 14:07:31 -08:00
from the package iproute2+tc at
<https://www.kernel.org/pub/linux/utils/net/iproute2/>. That package
also contains some documentation; for more, check out
2010-11-15 19:55:34 +00:00
<http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2>.
2005-07-11 21:13:56 -07:00
This Quality of Service (QoS) support will enable you to use
Differentiated Services (diffserv) and Resource Reservation Protocol
2005-11-01 15:13:02 +01:00
(RSVP) on your Linux router if you also say Y to the corresponding
classifiers below. Documentation and software is at
<http://diffserv.sourceforge.net/>.
2005-07-11 21:13:56 -07:00
If you say Y here and to "/proc file system" below, you will be able
to read status information about packet schedulers from the file
/proc/net/psched.
The available schedulers are listed in the following questions; you
can say Y to as many as you like. If unsure, say N now.
2005-11-17 15:22:39 -08:00
if NET_SCHED
2005-11-01 15:13:02 +01:00
comment "Queueing/Scheduling"
2005-04-16 15:20:36 -07:00
config NET_SCH_HTB
2005-11-01 15:13:02 +01:00
tristate "Hierarchical Token Bucket (HTB)"
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to use the Hierarchical Token Buckets (HTB)
2005-11-01 15:13:02 +01:00
packet scheduling algorithm. See
2005-04-16 15:20:36 -07:00
<http://luxik.cdi.cz/~devik/qos/htb/> for complete manual and
in-depth articles.
2005-11-01 15:13:02 +01:00
HTB is very similar to CBQ regarding its goals however is has
2005-04-16 15:20:36 -07:00
different properties and different algorithm.
To compile this code as a module, choose M here: the
module will be called sch_htb.
config NET_SCH_HFSC
2005-11-01 15:13:02 +01:00
tristate "Hierarchical Fair Service Curve (HFSC)"
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to use the Hierarchical Fair Service Curve
2005-11-01 15:13:02 +01:00
(HFSC) packet scheduling algorithm.
2005-04-16 15:20:36 -07:00
To compile this code as a module, choose M here: the
module will be called sch_hfsc.
config NET_SCH_PRIO
2005-11-01 15:13:02 +01:00
tristate "Multi Band Priority Queueing (PRIO)"
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to use an n-band priority queue packet
2005-11-01 15:13:02 +01:00
scheduler.
2005-04-16 15:20:36 -07:00
To compile this code as a module, choose M here: the
module will be called sch_prio.
2008-09-12 16:29:34 -07:00
config NET_SCH_MULTIQ
tristate "Hardware Multiqueue-aware Multi Band Queuing (MULTIQ)"
2020-06-14 01:50:22 +09:00
help
2008-09-12 16:29:34 -07:00
Say Y here if you want to use an n-band queue packet scheduler
to support devices that have multiple hardware transmit queues.
To compile this code as a module, choose M here: the
module will be called sch_multiq.
2005-04-16 15:20:36 -07:00
config NET_SCH_RED
2005-11-01 15:13:02 +01:00
tristate "Random Early Detection (RED)"
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to use the Random Early Detection (RED)
2005-11-01 15:13:02 +01:00
packet scheduling algorithm.
See the top of <file:net/sched/sch_red.c> for more details.
2005-04-16 15:20:36 -07:00
To compile this code as a module, choose M here: the
module will be called sch_red.
net_sched: SFB flow scheduler
This is the Stochastic Fair Blue scheduler, based on work from :
W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue
Management Algorithms. U. Michigan CSE-TR-387-99, April 1999.
http://www.thefengs.com/wuchang/blue/CSE-TR-387-99.pdf
This implementation is based on work done by Juliusz Chroboczek
General SFB algorithm can be found in figure 14, page 15:
B[l][n] : L x N array of bins (L levels, N bins per level)
enqueue()
Calculate hash function values h{0}, h{1}, .. h{L-1}
Update bins at each level
for i = 0 to L - 1
if (B[i][h{i}].qlen > bin_size)
B[i][h{i}].p_mark += p_increment;
else if (B[i][h{i}].qlen == 0)
B[i][h{i}].p_mark -= p_decrement;
p_min = min(B[0][h{0}].p_mark ... B[L-1][h{L-1}].p_mark);
if (p_min == 1.0)
ratelimit();
else
mark/drop with probabilty p_min;
I did the adaptation of Juliusz code to meet current kernel standards,
and various changes to address previous comments :
http://thread.gmane.org/gmane.linux.network/90225
http://thread.gmane.org/gmane.linux.network/90375
Default flow classifier is the rxhash introduced by RPS in 2.6.35, but
we can use an external flow classifier if wanted.
tc qdisc add dev $DEV parent 1:11 handle 11: \
est 0.5sec 2sec sfb limit 128
tc filter add dev $DEV protocol ip parent 11: handle 3 \
flow hash keys dst divisor 1024
Notes:
1) SFB default child qdisc is pfifo_fast. It can be changed by another
qdisc but a child qdisc MUST not drop a packet previously queued. This
is because SFB needs to handle a dequeued packet in order to maintain
its virtual queue states. pfifo_head_drop or CHOKe should not be used.
2) ECN is enabled by default, unlike RED/CHOKe/GRED
With help from Patrick McHardy & Andi Kleen
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Juliusz Chroboczek <Juliusz.Chroboczek@pps.jussieu.fr>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 10:56:17 +00:00
config NET_SCH_SFB
tristate "Stochastic Fair Blue (SFB)"
2020-06-14 01:50:22 +09:00
help
net_sched: SFB flow scheduler
This is the Stochastic Fair Blue scheduler, based on work from :
W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue
Management Algorithms. U. Michigan CSE-TR-387-99, April 1999.
http://www.thefengs.com/wuchang/blue/CSE-TR-387-99.pdf
This implementation is based on work done by Juliusz Chroboczek
General SFB algorithm can be found in figure 14, page 15:
B[l][n] : L x N array of bins (L levels, N bins per level)
enqueue()
Calculate hash function values h{0}, h{1}, .. h{L-1}
Update bins at each level
for i = 0 to L - 1
if (B[i][h{i}].qlen > bin_size)
B[i][h{i}].p_mark += p_increment;
else if (B[i][h{i}].qlen == 0)
B[i][h{i}].p_mark -= p_decrement;
p_min = min(B[0][h{0}].p_mark ... B[L-1][h{L-1}].p_mark);
if (p_min == 1.0)
ratelimit();
else
mark/drop with probabilty p_min;
I did the adaptation of Juliusz code to meet current kernel standards,
and various changes to address previous comments :
http://thread.gmane.org/gmane.linux.network/90225
http://thread.gmane.org/gmane.linux.network/90375
Default flow classifier is the rxhash introduced by RPS in 2.6.35, but
we can use an external flow classifier if wanted.
tc qdisc add dev $DEV parent 1:11 handle 11: \
est 0.5sec 2sec sfb limit 128
tc filter add dev $DEV protocol ip parent 11: handle 3 \
flow hash keys dst divisor 1024
Notes:
1) SFB default child qdisc is pfifo_fast. It can be changed by another
qdisc but a child qdisc MUST not drop a packet previously queued. This
is because SFB needs to handle a dequeued packet in order to maintain
its virtual queue states. pfifo_head_drop or CHOKe should not be used.
2) ECN is enabled by default, unlike RED/CHOKe/GRED
With help from Patrick McHardy & Andi Kleen
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Juliusz Chroboczek <Juliusz.Chroboczek@pps.jussieu.fr>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Andi Kleen <andi@firstfloor.org>
CC: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23 10:56:17 +00:00
Say Y here if you want to use the Stochastic Fair Blue (SFB)
packet scheduling algorithm.
See the top of <file:net/sched/sch_sfb.c> for more details.
To compile this code as a module, choose M here: the
module will be called sch_sfb.
2005-04-16 15:20:36 -07:00
config NET_SCH_SFQ
2005-11-01 15:13:02 +01:00
tristate "Stochastic Fairness Queueing (SFQ)"
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to use the Stochastic Fairness Queueing (SFQ)
2007-07-18 02:00:04 -07:00
packet scheduling algorithm.
2005-11-01 15:13:02 +01:00
See the top of <file:net/sched/sch_sfq.c> for more details.
2005-04-16 15:20:36 -07:00
To compile this code as a module, choose M here: the
module will be called sch_sfq.
config NET_SCH_TEQL
2005-11-01 15:13:02 +01:00
tristate "True Link Equalizer (TEQL)"
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to use the True Link Equalizer (TLE) packet
2005-11-01 15:13:02 +01:00
scheduling algorithm. This queueing discipline allows the combination
of several physical devices into one virtual device.
See the top of <file:net/sched/sch_teql.c> for more details.
2005-04-16 15:20:36 -07:00
To compile this code as a module, choose M here: the
module will be called sch_teql.
config NET_SCH_TBF
2005-11-01 15:13:02 +01:00
tristate "Token Bucket Filter (TBF)"
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
Say Y here if you want to use the Token Bucket Filter (TBF) packet
scheduling algorithm.
See the top of <file:net/sched/sch_tbf.c> for more details.
2005-04-16 15:20:36 -07:00
To compile this code as a module, choose M here: the
module will be called sch_tbf.
2017-10-16 18:01:26 -07:00
config NET_SCH_CBS
tristate "Credit Based Shaper (CBS)"
2020-06-14 01:50:22 +09:00
help
2017-10-16 18:01:26 -07:00
Say Y here if you want to use the Credit Based Shaper (CBS) packet
scheduling algorithm.
See the top of <file:net/sched/sch_cbs.c> for more details.
To compile this code as a module, choose M here: the
module will be called sch_cbs.
net/sched: Introduce the ETF Qdisc
The ETF (Earliest TxTime First) qdisc uses the information added
earlier in this series (the socket option SO_TXTIME and the new
role of sk_buff->tstamp) to schedule packets transmission based
on absolute time.
For some workloads, just bandwidth enforcement is not enough, and
precise control of the transmission of packets is necessary.
Example:
$ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
$ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \
clockid CLOCK_TAI
In this example, the Qdisc will provide SW best-effort for the control
of the transmission time to the network adapter, the time stamp in the
socket will be in reference to the clockid CLOCK_TAI and packets
will leave the qdisc "delta" (100000) nanoseconds before its transmission
time.
The ETF qdisc will buffer packets sorted by their txtime. It will drop
packets on enqueue() if their skbuff clockid does not match the clock
reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
if it expires while being enqueued.
The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
will dequeue a packet as soon as possible and change the skb timestamp
to 'now' during etf_dequeue().
Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
to be configured, but it's been decided that usage of CLOCK_TAI should
be enforced until we decide to allow for other clockids to be used.
The rationale here is that PTP times are usually in the TAI scale, thus
no other clocks should be necessary. For now, the qdisc will return
EINVAL if any clocks other than CLOCK_TAI are used.
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-03 15:42:53 -07:00
config NET_SCH_ETF
tristate "Earliest TxTime First (ETF)"
help
Say Y here if you want to use the Earliest TxTime First (ETF) packet
scheduling algorithm.
See the top of <file:net/sched/sch_etf.c> for more details.
To compile this code as a module, choose M here: the
module will be called sch_etf.
2017-10-16 18:01:26 -07:00
2023-02-04 15:53:01 +02:00
config NET_SCH_MQPRIO_LIB
tristate
help
Common library for manipulating mqprio queue configurations.
tc: Add support for configuring the taprio scheduler
This traffic scheduler allows traffic classes states (transmission
allowed/not allowed, in the simplest case) to be scheduled, according
to a pre-generated time sequence. This is the basis of the IEEE
802.1Qbv specification.
Example configuration:
tc qdisc replace dev enp3s0 parent root handle 100 taprio \
num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
queues 1@0 1@1 2@2 \
base-time 1528743495910289987 \
sched-entry S 01 300000 \
sched-entry S 02 300000 \
sched-entry S 04 300000 \
clockid CLOCK_TAI
The configuration format is similar to mqprio. The main difference is
the presence of a schedule, built by multiple "sched-entry"
definitions, each entry has the following format:
sched-entry <CMD> <GATE MASK> <INTERVAL>
The only supported <CMD> is "S", which means "SetGateStates",
following the IEEE 802.1Qbv-2015 definition (Table 8-6). <GATE MASK>
is a bitmask where each bit is a associated with a traffic class, so
bit 0 (the least significant bit) being "on" means that traffic class
0 is "active" for that schedule entry. <INTERVAL> is a time duration
in nanoseconds that specifies for how long that state defined by <CMD>
and <GATE MASK> should be held before moving to the next entry.
This schedule is circular, that is, after the last entry is executed
it starts from the first one, indefinitely.
The other parameters can be defined as follows:
- base-time: specifies the instant when the schedule starts, if
'base-time' is a time in the past, the schedule will start at
base-time + (N * cycle-time)
where N is the smallest integer so the resulting time is greater
than "now", and "cycle-time" is the sum of all the intervals of the
entries in the schedule;
- clockid: specifies the reference clock to be used;
The parameters should be similar to what the IEEE 802.1Q family of
specification defines.
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-28 17:59:43 -07:00
config NET_SCH_TAPRIO
tristate "Time Aware Priority (taprio) Scheduler"
2023-02-04 15:53:01 +02:00
select NET_SCH_MQPRIO_LIB
tc: Add support for configuring the taprio scheduler
This traffic scheduler allows traffic classes states (transmission
allowed/not allowed, in the simplest case) to be scheduled, according
to a pre-generated time sequence. This is the basis of the IEEE
802.1Qbv specification.
Example configuration:
tc qdisc replace dev enp3s0 parent root handle 100 taprio \
num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
queues 1@0 1@1 2@2 \
base-time 1528743495910289987 \
sched-entry S 01 300000 \
sched-entry S 02 300000 \
sched-entry S 04 300000 \
clockid CLOCK_TAI
The configuration format is similar to mqprio. The main difference is
the presence of a schedule, built by multiple "sched-entry"
definitions, each entry has the following format:
sched-entry <CMD> <GATE MASK> <INTERVAL>
The only supported <CMD> is "S", which means "SetGateStates",
following the IEEE 802.1Qbv-2015 definition (Table 8-6). <GATE MASK>
is a bitmask where each bit is a associated with a traffic class, so
bit 0 (the least significant bit) being "on" means that traffic class
0 is "active" for that schedule entry. <INTERVAL> is a time duration
in nanoseconds that specifies for how long that state defined by <CMD>
and <GATE MASK> should be held before moving to the next entry.
This schedule is circular, that is, after the last entry is executed
it starts from the first one, indefinitely.
The other parameters can be defined as follows:
- base-time: specifies the instant when the schedule starts, if
'base-time' is a time in the past, the schedule will start at
base-time + (N * cycle-time)
where N is the smallest integer so the resulting time is greater
than "now", and "cycle-time" is the sum of all the intervals of the
entries in the schedule;
- clockid: specifies the reference clock to be used;
The parameters should be similar to what the IEEE 802.1Q family of
specification defines.
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-28 17:59:43 -07:00
help
Say Y here if you want to use the Time Aware Priority (taprio) packet
scheduling algorithm.
See the top of <file:net/sched/sch_taprio.c> for more details.
To compile this code as a module, choose M here: the
module will be called sch_taprio.
2005-04-16 15:20:36 -07:00
config NET_SCH_GRED
2005-11-01 15:13:02 +01:00
tristate "Generic Random Early Detection (GRED)"
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to use the Generic Random Early Detection
2005-05-03 14:34:20 -07:00
(GRED) packet scheduling algorithm for some of your network devices
2005-04-16 15:20:36 -07:00
(see the top of <file:net/sched/sch_red.c> for details and
references about the algorithm).
To compile this code as a module, choose M here: the
module will be called sch_gred.
config NET_SCH_NETEM
2005-11-01 15:13:02 +01:00
tristate "Network emulator (NETEM)"
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y if you want to emulate network delay, loss, and packet
re-ordering. This is often useful to simulate networks when
testing applications or protocols.
To compile this driver as a module, choose M here: the module
will be called sch_netem.
If unsure, say N.
2008-11-20 04:10:00 -08:00
config NET_SCH_DRR
tristate "Deficit Round Robin scheduler (DRR)"
help
Say Y here if you want to use the Deficit Round Robin (DRR) packet
scheduling algorithm.
To compile this driver as a module, choose M here: the module
will be called sch_drr.
If unsure, say N.
2011-01-17 08:06:09 +00:00
config NET_SCH_MQPRIO
tristate "Multi-queue priority scheduler (MQPRIO)"
2023-02-04 15:53:01 +02:00
select NET_SCH_MQPRIO_LIB
2011-01-17 08:06:09 +00:00
help
Say Y here if you want to use the Multi-queue Priority scheduler.
This scheduler allows QOS to be offloaded on NICs that have support
for offloading QOS schedulers.
To compile this driver as a module, choose M here: the module will
be called sch_mqprio.
If unsure, say N.
2018-07-23 19:37:41 +05:30
config NET_SCH_SKBPRIO
tristate "SKB priority queue scheduler (SKBPRIO)"
help
Say Y here if you want to use the SKB priority queue
scheduler. This schedules packets according to skb->priority,
which is useful for request packets in DoS mitigation systems such
as Gatekeeper.
To compile this driver as a module, choose M here: the module will
be called sch_skbprio.
If unsure, say N.
2011-02-02 15:21:10 +00:00
config NET_SCH_CHOKE
tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
help
Say Y here if you want to use the CHOKe packet scheduler (CHOose
and Keep for responsive flows, CHOose and Kill for unresponsive
2020-12-07 16:12:31 +00:00
flows). This is a variation of RED which tries to penalize flows
2011-02-02 15:21:10 +00:00
that monopolize the queue.
To compile this code as a module, choose M here: the
module will be called sch_choke.
2011-04-04 05:30:58 +00:00
config NET_SCH_QFQ
tristate "Quick Fair Queueing scheduler (QFQ)"
help
Say Y here if you want to use the Quick Fair Queueing Scheduler (QFQ)
packet scheduling algorithm.
To compile this driver as a module, choose M here: the module
will be called sch_qfq.
If unsure, say N.
codel: Controlled Delay AQM
An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.
http://queue.acm.org/detail.cfm?id=2209336
This AQM main input is no longer queue size in bytes or packets, but the
delay packets stay in (FIFO) queue.
As we don't have infinite memory, we still can drop packets in enqueue()
in case of massive load, but mean of CoDel is to drop packets in
dequeue(), using a control law based on two simple parameters :
target : target sojourn time (default 5ms)
interval : width of moving time window (default 100ms)
Based on initial work from Dave Taht.
Refactored to help future codel inclusion as a plugin for other linux
qdisc (FQ_CODEL, ...), like RED.
include/net/codel.h contains codel algorithm as close as possible than
Kathleen reference.
net/sched/sch_codel.c contains the linux qdisc specific glue.
Separate structures permit a memory efficient implementation of fq_codel
(to be sent as a separate work) : Each flow has its own struct
codel_vars.
timestamps are taken at enqueue() time with 1024 ns precision, allowing
a range of 2199 seconds in queue, and 100Gb links support. iproute2 uses
usec as base unit.
Selected packets are dropped, unless ECN is enabled and packets can get
ECN mark instead.
Tested from 2Mb to 10Gb speeds with no particular problems, on ixgbe and
tg3 drivers (BQL enabled).
Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
[ interval TIME ] [ ecn ]
qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
maxpacket 1514 ecn_mark 84399 drop_overlimit 0
CoDel must be seen as a base module, and should be used keeping in mind
there is still a FIFO queue. So a typical setup will probably need a
hierarchy of several qdiscs and packet classifiers to be able to meet
whatever constraints a user might have.
One possible example would be to use fq_codel, which combines Fair
Queueing and CoDel, in replacement of sfq / sfq_red.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Van Jacobson <van@pollere.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10 07:51:25 +00:00
config NET_SCH_CODEL
tristate "Controlled Delay AQM (CODEL)"
help
Say Y here if you want to use the Controlled Delay (CODEL)
packet scheduling algorithm.
To compile this driver as a module, choose M here: the module
will be called sch_codel.
If unsure, say N.
fq_codel: Fair Queue Codel AQM
Fair Queue Codel packet scheduler
Principles :
- Packets are classified (internal classifier or external) on flows.
- This is a Stochastic model (as we use a hash, several flows might
be hashed on same slot)
- Each flow has a CoDel managed queue.
- Flows are linked onto two (Round Robin) lists,
so that new flows have priority on old ones.
- For a given flow, packets are not reordered (CoDel uses a FIFO)
- head drops only.
- ECN capability is on by default.
- Very low memory footprint (64 bytes per flow)
tc qdisc ... fq_codel [ limit PACKETS ] [ flows number ]
[ target TIME ] [ interval TIME ] [ noecn ]
[ quantum BYTES ]
defaults : 1024 flows, 10240 packets limit, quantum : device MTU
target : 5ms (CoDel default)
interval : 100ms (CoDel default)
Impressive results on load :
class htb 1:1 root leaf 10: prio 0 quantum 1514 rate 200000Kbit ceil 200000Kbit burst 1475b/8 mpu 0b overhead 0b cburst 1475b/8 mpu 0b overhead 0b level 0
Sent 43304920109 bytes 33063109 pkt (dropped 0, overlimits 0 requeues 0)
rate 201691Kbit 28595pps backlog 0b 312p requeues 0
lended: 33063109 borrowed: 0 giants: 0
tokens: -912 ctokens: -912
class fq_codel 10:1735 parent 10:
(dropped 1292, overlimits 0 requeues 0)
backlog 15140b 10p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:4524 parent 10:
(dropped 1291, overlimits 0 requeues 0)
backlog 16654b 11p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:4e74 parent 10:
(dropped 1290, overlimits 0 requeues 0)
backlog 6056b 4p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 6.4ms dropping drop_next 92.0ms
class fq_codel 10:628a parent 10:
(dropped 1289, overlimits 0 requeues 0)
backlog 7570b 5p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 5.4ms dropping drop_next 90.9ms
class fq_codel 10:a4b3 parent 10:
(dropped 302, overlimits 0 requeues 0)
backlog 16654b 11p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:c3c2 parent 10:
(dropped 1284, overlimits 0 requeues 0)
backlog 13626b 9p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 5.9ms
class fq_codel 10:d331 parent 10:
(dropped 299, overlimits 0 requeues 0)
backlog 15140b 10p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.0ms
class fq_codel 10:d526 parent 10:
(dropped 12160, overlimits 0 requeues 0)
backlog 35870b 211p requeues 0
deficit 1508 count 12160 lastcount 1 ldelay 15.3ms dropping drop_next 247us
class fq_codel 10:e2c6 parent 10:
(dropped 1288, overlimits 0 requeues 0)
backlog 15140b 10p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:eab5 parent 10:
(dropped 1285, overlimits 0 requeues 0)
backlog 16654b 11p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 5.9ms
class fq_codel 10:f220 parent 10:
(dropped 1289, overlimits 0 requeues 0)
backlog 15140b 10p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.1ms
qdisc htb 1: root refcnt 6 r2q 10 default 1 direct_packets_stat 0 ver 3.17
Sent 43331086547 bytes 33092812 pkt (dropped 0, overlimits 66063544 requeues 71)
rate 201697Kbit 28602pps backlog 0b 260p requeues 71
qdisc fq_codel 10: parent 1:1 limit 10240p flows 65536 target 5.0ms interval 100.0ms ecn
Sent 43331086547 bytes 33092812 pkt (dropped 949359, overlimits 0 requeues 0)
rate 201697Kbit 28602pps backlog 189352b 260p requeues 0
maxpacket 1514 drop_overlimit 0 new_flow_count 5582 ecn_mark 125593
new_flows_len 0 old_flows_len 11
PING 172.30.42.18 (172.30.42.18) 56(84) bytes of data.
64 bytes from 172.30.42.18: icmp_req=1 ttl=64 time=0.227 ms
64 bytes from 172.30.42.18: icmp_req=2 ttl=64 time=0.165 ms
64 bytes from 172.30.42.18: icmp_req=3 ttl=64 time=0.166 ms
64 bytes from 172.30.42.18: icmp_req=4 ttl=64 time=0.151 ms
64 bytes from 172.30.42.18: icmp_req=5 ttl=64 time=0.164 ms
64 bytes from 172.30.42.18: icmp_req=6 ttl=64 time=0.172 ms
64 bytes from 172.30.42.18: icmp_req=7 ttl=64 time=0.175 ms
64 bytes from 172.30.42.18: icmp_req=8 ttl=64 time=0.183 ms
64 bytes from 172.30.42.18: icmp_req=9 ttl=64 time=0.158 ms
64 bytes from 172.30.42.18: icmp_req=10 ttl=64 time=0.200 ms
10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.151/0.176/0.227/0.022 ms
Much better than SFQ because of priority given to new flows, and fast
path dirtying less cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-11 09:30:50 +00:00
config NET_SCH_FQ_CODEL
tristate "Fair Queue Controlled Delay AQM (FQ_CODEL)"
help
Say Y here if you want to use the FQ Controlled Delay (FQ_CODEL)
packet scheduling algorithm.
To compile this driver as a module, choose M here: the module
will be called sch_fq_codel.
If unsure, say N.
sched: Add Common Applications Kept Enhanced (cake) qdisc
sch_cake targets the home router use case and is intended to squeeze the
most bandwidth and latency out of even the slowest ISP links and routers,
while presenting an API simple enough that even an ISP can configure it.
Example of use on a cable ISP uplink:
tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
To shape a cable download link (ifb and tc-mirred setup elided)
tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
CAKE is filled with:
* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
and per-flow FQ even through NAT.
* An deficit based shaper, that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Extensive support for DSL framing types.
* Support for ack filtering.
* Extensive statistics for measuring, loss, ecn markings, latency
variation.
A paper describing the design of CAKE is available at
https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
International Symposium on Local and Metropolitan Area Networks (LANMAN).
This patch adds the base shaper and packet scheduler, while subsequent
commits add the optional (configurable) features. The full userspace API
and most data structures are included in this commit, but options not
understood in the base version will be ignored.
Various versions baking have been available as an out of tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.
sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.
CAKE's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.
Testing from Pete Heist, Georgios Amanakis, and the many other members of
the cake@lists.bufferbloat.net mailing list.
tc -s qdisc show dev eth2
qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
backlog 0b 0p requeues 12
memory used: 1053008b of 15140Kb
capacity estimate: 970Mbit
min/max network layer size: 28 / 1500
min/max overhead-adjusted size: 84 / 1538
average network hdr offset: 14
Bulk Best Effort Voice
thresh 62500Kbit 1Gbit 250Mbit
target 5.0ms 5.0ms 5.0ms
interval 100.0ms 100.0ms 100.0ms
pk_delay 5us 5us 6us
av_delay 3us 2us 2us
sp_delay 2us 1us 1us
backlog 0b 0b 0b
pkts 3164050 25030267 9530280
bytes 3227519915 35396974782 12879808898
way_inds 0 8 0
way_miss 21 366 25
way_cols 0 0 0
drops 5 0 1
marks 0 0 0
ack_drop 0 0 0
sp_flows 1 3 0
bk_flows 0 1 1
un_flows 0 0 0
max_len 68130 68130 68130
Tested-by: Pete Heist <peteheist@gmail.com>
Tested-by: Georgios Amanakis <gamanakis@gmail.com>
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-06 17:37:19 +02:00
config NET_SCH_CAKE
tristate "Common Applications Kept Enhanced (CAKE)"
help
Say Y here if you want to use the Common Applications Kept Enhanced
2019-09-23 17:52:42 +02:00
(CAKE) queue management algorithm.
sched: Add Common Applications Kept Enhanced (cake) qdisc
sch_cake targets the home router use case and is intended to squeeze the
most bandwidth and latency out of even the slowest ISP links and routers,
while presenting an API simple enough that even an ISP can configure it.
Example of use on a cable ISP uplink:
tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
To shape a cable download link (ifb and tc-mirred setup elided)
tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
CAKE is filled with:
* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
and per-flow FQ even through NAT.
* An deficit based shaper, that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Extensive support for DSL framing types.
* Support for ack filtering.
* Extensive statistics for measuring, loss, ecn markings, latency
variation.
A paper describing the design of CAKE is available at
https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
International Symposium on Local and Metropolitan Area Networks (LANMAN).
This patch adds the base shaper and packet scheduler, while subsequent
commits add the optional (configurable) features. The full userspace API
and most data structures are included in this commit, but options not
understood in the base version will be ignored.
Various versions baking have been available as an out of tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.
sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.
CAKE's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.
Testing from Pete Heist, Georgios Amanakis, and the many other members of
the cake@lists.bufferbloat.net mailing list.
tc -s qdisc show dev eth2
qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
backlog 0b 0p requeues 12
memory used: 1053008b of 15140Kb
capacity estimate: 970Mbit
min/max network layer size: 28 / 1500
min/max overhead-adjusted size: 84 / 1538
average network hdr offset: 14
Bulk Best Effort Voice
thresh 62500Kbit 1Gbit 250Mbit
target 5.0ms 5.0ms 5.0ms
interval 100.0ms 100.0ms 100.0ms
pk_delay 5us 5us 6us
av_delay 3us 2us 2us
sp_delay 2us 1us 1us
backlog 0b 0b 0b
pkts 3164050 25030267 9530280
bytes 3227519915 35396974782 12879808898
way_inds 0 8 0
way_miss 21 366 25
way_cols 0 0 0
drops 5 0 1
marks 0 0 0
ack_drop 0 0 0
sp_flows 1 3 0
bk_flows 0 1 1
un_flows 0 0 0
max_len 68130 68130 68130
Tested-by: Pete Heist <peteheist@gmail.com>
Tested-by: Georgios Amanakis <gamanakis@gmail.com>
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-06 17:37:19 +02:00
To compile this driver as a module, choose M here: the module
will be called sch_cake.
If unsure, say N.
pkt_sched: fq: Fair Queue packet scheduler
- Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
- Uses the new_flow/old_flow separation from FQ_codel
- New flows get an initial credit allowing IW10 without added delay.
- Special FIFO queue for high prio packets (no need for PRIO + FQ)
- Uses a hash table of RB trees to locate the flows at enqueue() time
- Smart on demand gc (at enqueue() time, RB tree lookup evicts old
unused flows)
- Dynamic memory allocations.
- Designed to allow millions of concurrent flows per Qdisc.
- Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
- Single high resolution timer for throttled flows (if any).
- One RB tree to link throttled flows.
- Ability to have a max rate per flow. We might add a socket option
to add per socket limitation.
Attempts have been made to add TCP pacing in TCP stack, but this
seems to add complex code to an already complex stack.
TCP pacing is welcomed for flows having idle times, as the cwnd
permits TCP stack to queue a possibly large number of packets.
This removes the 'slow start after idle' choice, hitting badly
large BDP flows, and applications delivering chunks of data
as video streams.
Nicely spaced packets :
Here interface is 10Gbit, but flow bottleneck is ~20Mbit
cwin is big, yet FQ avoids the typical bursts generated by TCP
(as in netperf TCP_RR -- -r 100000,100000)
15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115>
15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115>
15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115>
15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115>
15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115>
15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115>
15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115>
15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
TCP timestamps show that most packets from B were queued in the same ms
timeframe (TSval 1159799{3,4}), but FQ managed to send them right
in time to avoid a big burst.
In slow start or steady state, very few packets are throttled [1]
FQ gets a bunch of tunables as :
limit : max number of packets on whole Qdisc (default 10000)
flow_limit : max number of packets per flow (default 100)
quantum : the credit per RR round (default is 2 MTU)
initial_quantum : initial credit for new flows (default is 10 MTU)
maxrate : max per flow rate (default : unlimited)
buckets : number of RB trees (default : 1024) in hash table.
(consumes 8 bytes per bucket)
[no]pacing : disable/enable pacing (default is enable)
All of them can be changed on a live qdisc.
$ tc qd add dev eth0 root fq help
Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
[ quantum BYTES ] [ initial_quantum BYTES ]
[ maxrate RATE ] [ buckets NUMBER ]
[ [no]pacing ]
$ tc -s -d qd
qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
backlog 0b 0p requeues 14
511 flows, 511 inactive, 0 throttled
110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit
[1] Except if initial srtt is overestimated, as if using
cached srtt in tcp metrics. We'll provide a fix for this issue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 15:49:55 -07:00
config NET_SCH_FQ
tristate "Fair Queue"
help
Say Y here if you want to use the FQ packet scheduling algorithm.
FQ does flow separation, and is able to respect pacing requirements
2023-01-24 10:17:24 -08:00
set by TCP stack into sk->sk_pacing_rate (for locally generated
pkt_sched: fq: Fair Queue packet scheduler
- Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
- Uses the new_flow/old_flow separation from FQ_codel
- New flows get an initial credit allowing IW10 without added delay.
- Special FIFO queue for high prio packets (no need for PRIO + FQ)
- Uses a hash table of RB trees to locate the flows at enqueue() time
- Smart on demand gc (at enqueue() time, RB tree lookup evicts old
unused flows)
- Dynamic memory allocations.
- Designed to allow millions of concurrent flows per Qdisc.
- Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
- Single high resolution timer for throttled flows (if any).
- One RB tree to link throttled flows.
- Ability to have a max rate per flow. We might add a socket option
to add per socket limitation.
Attempts have been made to add TCP pacing in TCP stack, but this
seems to add complex code to an already complex stack.
TCP pacing is welcomed for flows having idle times, as the cwnd
permits TCP stack to queue a possibly large number of packets.
This removes the 'slow start after idle' choice, hitting badly
large BDP flows, and applications delivering chunks of data
as video streams.
Nicely spaced packets :
Here interface is 10Gbit, but flow bottleneck is ~20Mbit
cwin is big, yet FQ avoids the typical bursts generated by TCP
(as in netperf TCP_RR -- -r 100000,100000)
15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115>
15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115>
15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115>
15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115>
15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115>
15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115>
15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115>
15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
TCP timestamps show that most packets from B were queued in the same ms
timeframe (TSval 1159799{3,4}), but FQ managed to send them right
in time to avoid a big burst.
In slow start or steady state, very few packets are throttled [1]
FQ gets a bunch of tunables as :
limit : max number of packets on whole Qdisc (default 10000)
flow_limit : max number of packets per flow (default 100)
quantum : the credit per RR round (default is 2 MTU)
initial_quantum : initial credit for new flows (default is 10 MTU)
maxrate : max per flow rate (default : unlimited)
buckets : number of RB trees (default : 1024) in hash table.
(consumes 8 bytes per bucket)
[no]pacing : disable/enable pacing (default is enable)
All of them can be changed on a live qdisc.
$ tc qd add dev eth0 root fq help
Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
[ quantum BYTES ] [ initial_quantum BYTES ]
[ maxrate RATE ] [ buckets NUMBER ]
[ [no]pacing ]
$ tc -s -d qd
qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
backlog 0b 0p requeues 14
511 flows, 511 inactive, 0 throttled
110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit
[1] Except if initial srtt is overestimated, as if using
cached srtt in tcp metrics. We'll provide a fix for this issue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 15:49:55 -07:00
traffic)
To compile this driver as a module, choose M here: the module
will be called sch_fq.
If unsure, say N.
net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc
This patch implements the first size-based qdisc that attempts to
differentiate between small flows and heavy-hitters. The goal is to
catch the heavy-hitters and move them to a separate queue with less
priority so that bulk traffic does not affect the latency of critical
traffic. Currently "less priority" means less weight (2:1 in
particular) in a Weighted Deficit Round Robin (WDRR) scheduler.
In essence, this patch addresses the "delay-bloat" problem due to
bloated buffers. In some systems, large queues may be necessary for
obtaining CPU efficiency, or due to the presence of unresponsive
traffic like UDP, or just a large number of connections with each
having a small amount of outstanding traffic. In these circumstances,
HHF aims to reduce the HoL blocking for latency sensitive traffic,
while not impacting the queues built up by bulk traffic. HHF can also
be used in conjunction with other AQM mechanisms such as CoDel.
To capture heavy-hitters, we implement the "multi-stage filter" design
in the following paper:
C. Estan and G. Varghese, "New Directions in Traffic Measurement and
Accounting", in ACM SIGCOMM, 2002.
Some configurable qdisc settings through 'tc':
- hhf_reset_timeout: period to reset counter values in the multi-stage
filter (default 40ms)
- hhf_admit_bytes: threshold to classify heavy-hitters
(default 128KB)
- hhf_evict_timeout: threshold to evict idle heavy-hitters
(default 1s)
- hhf_non_hh_weight: Weighted Deficit Round Robin (WDRR) weight for
non-heavy-hitters (default 2)
- hh_flows_limit: max number of heavy-hitter flow entries
(default 2048)
Note that the ratio between hhf_admit_bytes and hhf_reset_timeout
reflects the bandwidth of heavy-hitters that we attempt to capture
(25Mbps with the above default settings).
The false negative rate (heavy-hitter flows getting away unclassified)
is zero by the design of the multi-stage filter algorithm.
With 100 heavy-hitter flows, using four hashes and 4000 counters yields
a false positive rate (non-heavy-hitters mistakenly classified as
heavy-hitters) of less than 1e-4.
Signed-off-by: Terry Lam <vtlam@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-15 00:30:21 -08:00
config NET_SCH_HHF
tristate "Heavy-Hitter Filter (HHF)"
help
Say Y here if you want to use the Heavy-Hitter Filter (HHF)
packet scheduling algorithm.
To compile this driver as a module, choose M here: the module
will be called sch_hhf.
net: pkt_sched: PIE AQM scheme
Proportional Integral controller Enhanced (PIE) is a scheduler to address the
bufferbloat problem.
>From the IETF draft below:
" Bufferbloat is a phenomenon where excess buffers in the network cause high
latency and jitter. As more and more interactive applications (e.g. voice over
IP, real time video streaming and financial transactions) run in the Internet,
high latency and jitter degrade application performance. There is a pressing
need to design intelligent queue management schemes that can control latency and
jitter; and hence provide desirable quality of service to users.
We present here a lightweight design, PIE(Proportional Integral controller
Enhanced) that can effectively control the average queueing latency to a target
value. Simulation results, theoretical analysis and Linux testbed results have
shown that PIE can ensure low latency and achieve high link utilization under
various congestion situations. The design does not require per-packet
timestamp, so it incurs very small overhead and is simple enough to implement
in both hardware and software. "
Many thanks to Dave Taht for extensive feedback, reviews, testing and
suggestions. Thanks also to Stephen Hemminger and Eric Dumazet for reviews and
suggestions. Naeem Khademi and Dave Taht independently contributed to ECN
support.
For more information, please see technical paper about PIE in the IEEE
Conference on High Performance Switching and Routing 2013. A copy of the paper
can be found at ftp://ftpeng.cisco.com/pie/.
Please also refer to the IETF draft submission at
http://tools.ietf.org/html/draft-pan-tsvwg-pie-00
All relevant code, documents and test scripts and results can be found at
ftp://ftpeng.cisco.com/pie/.
For problems with the iproute2/tc or Linux kernel code, please contact Vijay
Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) Mythili Prabhu
(mysuryan@cisco.com)
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Mythili Prabhu <mysuryan@cisco.com>
CC: Dave Taht <dave.taht@bufferbloat.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-04 17:33:55 -08:00
config NET_SCH_PIE
tristate "Proportional Integral controller Enhanced (PIE) scheduler"
help
Say Y here if you want to use the Proportional Integral controller
Enhanced scheduler packet scheduling algorithm.
2019-03-23 19:11:33 +05:30
For more information, please see https://tools.ietf.org/html/rfc8033
net: pkt_sched: PIE AQM scheme
Proportional Integral controller Enhanced (PIE) is a scheduler to address the
bufferbloat problem.
>From the IETF draft below:
" Bufferbloat is a phenomenon where excess buffers in the network cause high
latency and jitter. As more and more interactive applications (e.g. voice over
IP, real time video streaming and financial transactions) run in the Internet,
high latency and jitter degrade application performance. There is a pressing
need to design intelligent queue management schemes that can control latency and
jitter; and hence provide desirable quality of service to users.
We present here a lightweight design, PIE(Proportional Integral controller
Enhanced) that can effectively control the average queueing latency to a target
value. Simulation results, theoretical analysis and Linux testbed results have
shown that PIE can ensure low latency and achieve high link utilization under
various congestion situations. The design does not require per-packet
timestamp, so it incurs very small overhead and is simple enough to implement
in both hardware and software. "
Many thanks to Dave Taht for extensive feedback, reviews, testing and
suggestions. Thanks also to Stephen Hemminger and Eric Dumazet for reviews and
suggestions. Naeem Khademi and Dave Taht independently contributed to ECN
support.
For more information, please see technical paper about PIE in the IEEE
Conference on High Performance Switching and Routing 2013. A copy of the paper
can be found at ftp://ftpeng.cisco.com/pie/.
Please also refer to the IETF draft submission at
http://tools.ietf.org/html/draft-pan-tsvwg-pie-00
All relevant code, documents and test scripts and results can be found at
ftp://ftpeng.cisco.com/pie/.
For problems with the iproute2/tc or Linux kernel code, please contact Vijay
Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) Mythili Prabhu
(mysuryan@cisco.com)
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Mythili Prabhu <mysuryan@cisco.com>
CC: Dave Taht <dave.taht@bufferbloat.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-04 17:33:55 -08:00
To compile this driver as a module, choose M here: the module
will be called sch_pie.
If unsure, say N.
2020-01-22 23:52:33 +05:30
config NET_SCH_FQ_PIE
depends on NET_SCH_PIE
tristate "Flow Queue Proportional Integral controller Enhanced (FQ-PIE)"
help
Say Y here if you want to use the Flow Queue Proportional Integral
controller Enhanced (FQ-PIE) packet scheduling algorithm.
For more information, please see https://tools.ietf.org/html/rfc8033
To compile this driver as a module, choose M here: the module
will be called sch_fq_pie.
If unsure, say N.
2005-04-16 15:20:36 -07:00
config NET_SCH_INGRESS
net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).
Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.
Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.
Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.
Example, adding qdisc:
# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
Adding filters (deleting, etc works analogous by specifying ingress/egress):
# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.
Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
[1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 22:29:47 +01:00
tristate "Ingress/classifier-action Qdisc"
2008-01-31 16:57:15 -08:00
depends on NET_CLS_ACT
bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.
Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:
- From Meta: "It's especially important for applications that are deployed
fleet-wide and that don't "control" hosts they are deployed to. If such
application crashes and no one notices and does anything about that, BPF
program will keep running draining resources or even just, say, dropping
packets. We at FB had outages due to such permanent BPF attachment
semantics. With fd-based BPF link we are getting a framework, which allows
safe, auto-detachable behavior by default, unless application explicitly
opts in by pinning the BPF link." [1]
- From Cilium-side the tc BPF programs we attach to host-facing veth devices
and phys devices build the core datapath for Kubernetes Pods, and they
implement forwarding, load-balancing, policy, EDT-management, etc, within
BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
experienced hard-to-debug issues in a user's staging environment where
another Kubernetes application using tc BPF attached to the same prio/handle
of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
it. The goal is to establish a clear/safe ownership model via links which
cannot accidentally be overridden. [0,2]
BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.
We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.
For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.
For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.
The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.
tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.
[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 16:08:52 +02:00
select NET_XGRESS
2020-06-14 01:50:22 +09:00
help
net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).
Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.
Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.
Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.
Example, adding qdisc:
# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
Adding filters (deleting, etc works analogous by specifying ingress/egress):
# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.
Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
[1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 22:29:47 +01:00
Say Y here if you want to use classifiers for incoming and/or outgoing
packets. This qdisc doesn't do anything else besides running classifiers,
which can also have actions attached to them. In case of outgoing packets,
classifiers that this qdisc holds are executed in the transmit path
before real enqueuing to an egress qdisc happens.
2005-04-16 15:20:36 -07:00
If unsure, say Y.
net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).
Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.
Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.
Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.
Example, adding qdisc:
# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
Adding filters (deleting, etc works analogous by specifying ingress/egress):
# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.
Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
[1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 22:29:47 +01:00
To compile this code as a module, choose M here: the module will be
called sch_ingress with alias of sch_clsact.
2005-04-16 15:20:36 -07:00
2012-02-05 13:51:32 +00:00
config NET_SCH_PLUG
tristate "Plug network traffic until release (PLUG)"
2020-06-14 01:50:22 +09:00
help
2012-02-05 13:51:32 +00:00
This queuing discipline allows userspace to plug/unplug a network
output queue, using the netlink interface. When it receives an
enqueue command it inserts a plug into the outbound queue that
causes following packets to enqueue until a dequeue command arrives
over netlink, causing the plug to be removed and resuming the normal
packet flow.
This module also provides a generic "network output buffering"
functionality (aka output commit), wherein upon arrival of a dequeue
command, only packets up to the first plug are released for delivery.
The Remus HA project uses this module to enable speculative execution
of virtual machines by allowing the generated network output to be rolled
back if needed.
2014-12-03 14:07:31 -08:00
For more information, please refer to <http://wiki.xenproject.org/wiki/Remus>
2012-02-05 13:51:32 +00:00
Say Y here if you are using this kernel for Xen dom0 and
want to protect Xen guests with Remus.
To compile this code as a module, choose M here: the
module will be called sch_plug.
2019-12-18 14:55:13 +00:00
config NET_SCH_ETS
tristate "Enhanced transmission selection scheduler (ETS)"
help
The Enhanced Transmission Selection scheduler is a classful
queuing discipline that merges functionality of PRIO and DRR
qdiscs in one scheduler. ETS makes it easy to configure a set of
strict and bandwidth-sharing bands to implement the transmission
selection described in 802.1Qaz.
Say Y here if you want to use the ETS packet scheduling
algorithm.
To compile this driver as a module, choose M here: the module
will be called sch_ets.
If unsure, say N.
2017-04-13 08:40:53 -07:00
menuconfig NET_SCH_DEFAULT
bool "Allow override default queue discipline"
2020-06-14 01:50:22 +09:00
help
2017-04-13 08:40:53 -07:00
Support for selection of default queuing discipline.
Nearly all users can safely say no here, and the default
of pfifo_fast will be used. Many distributions already set
the default value via /proc/sys/net/core/default_qdisc.
If unsure, say N.
if NET_SCH_DEFAULT
choice
prompt "Default queuing discipline"
default DEFAULT_PFIFO_FAST
help
Select the queueing discipline that will be used by default
for all network devices.
config DEFAULT_FQ
bool "Fair Queue" if NET_SCH_FQ
config DEFAULT_CODEL
bool "Controlled Delay" if NET_SCH_CODEL
config DEFAULT_FQ_CODEL
bool "Fair Queue Controlled Delay" if NET_SCH_FQ_CODEL
2020-07-01 16:01:52 -07:00
config DEFAULT_FQ_PIE
bool "Flow Queue Proportional Integral controller Enhanced" if NET_SCH_FQ_PIE
2017-04-13 08:40:53 -07:00
config DEFAULT_SFQ
bool "Stochastic Fair Queue" if NET_SCH_SFQ
config DEFAULT_PFIFO_FAST
bool "Priority FIFO Fast"
endchoice
config DEFAULT_NET_SCH
string
default "pfifo_fast" if DEFAULT_PFIFO_FAST
default "fq" if DEFAULT_FQ
default "fq_codel" if DEFAULT_FQ_CODEL
2020-07-01 16:01:52 -07:00
default "fq_pie" if DEFAULT_FQ_PIE
2017-04-13 08:40:53 -07:00
default "sfq" if DEFAULT_SFQ
default "pfifo_fast"
endif
2005-11-01 15:13:02 +01:00
comment "Classification"
2005-04-16 15:20:36 -07:00
config NET_CLS
2014-12-20 15:41:11 -05:00
bool
2005-04-16 15:20:36 -07:00
config NET_CLS_BASIC
2005-11-01 15:13:02 +01:00
tristate "Elementary classification (BASIC)"
select NET_CLS
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to be able to classify packets using
only extended matches and actions.
To compile this code as a module, choose M here: the
module will be called cls_basic.
config NET_CLS_ROUTE4
2005-11-01 15:13:02 +01:00
tristate "Routing decision (ROUTE)"
2011-05-19 19:23:28 -04:00
depends on INET
2011-01-14 13:36:42 +01:00
select IP_ROUTE_CLASSID
2005-11-01 15:13:02 +01:00
select NET_CLS
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
If you say Y here, you will be able to classify packets
according to the route table entry they matched.
2005-04-16 15:20:36 -07:00
To compile this code as a module, choose M here: the
module will be called cls_route.
config NET_CLS_FW
2005-11-01 15:13:02 +01:00
tristate "Netfilter mark (FW)"
select NET_CLS
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
If you say Y here, you will be able to classify packets
according to netfilter/firewall marks.
2005-04-16 15:20:36 -07:00
To compile this code as a module, choose M here: the
module will be called cls_fw.
config NET_CLS_U32
2005-11-01 15:13:02 +01:00
tristate "Universal 32bit comparisons w/ hashing (U32)"
select NET_CLS
2020-06-14 01:50:22 +09:00
help
2006-06-30 18:53:46 +02:00
Say Y here to be able to classify packets using a universal
2005-11-01 15:13:02 +01:00
32bit pieces based comparison scheme.
2005-04-16 15:20:36 -07:00
To compile this code as a module, choose M here: the
module will be called cls_u32.
config CLS_U32_PERF
2005-11-01 15:13:02 +01:00
bool "Performance counters support"
2005-04-16 15:20:36 -07:00
depends on NET_CLS_U32
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
Say Y here to make u32 gather additional statistics useful for
fine tuning u32 classifiers.
2005-04-16 15:20:36 -07:00
config CLS_U32_MARK
2005-11-01 15:13:02 +01:00
bool "Netfilter marks support"
2006-11-09 15:19:14 -08:00
depends on NET_CLS_U32
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
Say Y here to be able to use netfilter marks as u32 key.
2005-04-16 15:20:36 -07:00
[NET_SCHED]: Add flow classifier
Add new "flow" classifier, which is meant to extend the SFQ hashing
capabilities without hard-coding new hash functions and also allows
deterministic mappings of keys to classes, replacing some out of tree
iptables patches like IPCLASSIFY (maps IPs to classes), IPMARK (maps
IPs to marks, with fw filters to classes), ...
Some examples:
- Classic SFQ hash:
tc filter add ... flow hash \
keys src,dst,proto,proto-src,proto-dst divisor 1024
- Classic SFQ hash, but using information from conntrack to work properly in
combination with NAT:
tc filter add ... flow hash \
keys nfct-src,nfct-dst,proto,nfct-proto-src,nfct-proto-dst divisor 1024
- Map destination IPs of 192.168.0.0/24 to classids 1-257:
tc filter add ... flow map \
key dst addend -192.168.0.0 divisor 256
- alternatively:
tc filter add ... flow map \
key dst and 0xff
- similar, but reverse ordered:
tc filter add ... flow map \
key dst and 0xff xor 0xff
Perturbation is currently not supported because we can't reliable kill the
timer on destruction.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31 18:37:42 -08:00
config NET_CLS_FLOW
tristate "Flow classifier"
select NET_CLS
2020-06-14 01:50:22 +09:00
help
[NET_SCHED]: Add flow classifier
Add new "flow" classifier, which is meant to extend the SFQ hashing
capabilities without hard-coding new hash functions and also allows
deterministic mappings of keys to classes, replacing some out of tree
iptables patches like IPCLASSIFY (maps IPs to classes), IPMARK (maps
IPs to marks, with fw filters to classes), ...
Some examples:
- Classic SFQ hash:
tc filter add ... flow hash \
keys src,dst,proto,proto-src,proto-dst divisor 1024
- Classic SFQ hash, but using information from conntrack to work properly in
combination with NAT:
tc filter add ... flow hash \
keys nfct-src,nfct-dst,proto,nfct-proto-src,nfct-proto-dst divisor 1024
- Map destination IPs of 192.168.0.0/24 to classids 1-257:
tc filter add ... flow map \
key dst addend -192.168.0.0 divisor 256
- alternatively:
tc filter add ... flow map \
key dst and 0xff
- similar, but reverse ordered:
tc filter add ... flow map \
key dst and 0xff xor 0xff
Perturbation is currently not supported because we can't reliable kill the
timer on destruction.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31 18:37:42 -08:00
If you say Y here, you will be able to classify packets based on
a configurable combination of packet keys. This is mostly useful
in combination with SFQ.
To compile this code as a module, choose M here: the
module will be called cls_flow.
2008-11-07 22:56:00 -08:00
config NET_CLS_CGROUP
2010-03-23 05:24:03 +00:00
tristate "Control Group Classifier"
2008-11-07 22:56:00 -08:00
select NET_CLS
2013-12-29 18:27:10 +01:00
select CGROUP_NET_CLASSID
2008-11-07 22:56:00 -08:00
depends on CGROUPS
2020-06-14 01:50:22 +09:00
help
2008-11-07 22:56:00 -08:00
Say Y here if you want to classify packets based on the control
cgroup of their process.
2010-03-23 05:24:03 +00:00
To compile this code as a module, choose M here: the
module will be called cls_cgroup.
net: sched: cls_bpf: add BPF-based classifier
This work contains a lightweight BPF-based traffic classifier that can
serve as a flexible alternative to ematch-based tree classification, i.e.
now that BPF filter engine can also be JITed in the kernel. Naturally, tc
actions and policies are supported as well with cls_bpf. Multiple BPF
programs/filter can be attached for a class, or they can just as well be
written within a single BPF program, that's really up to the user how he
wishes to run/optimize the code, e.g. also for inversion of verdicts etc.
The notion of a BPF program's return/exit codes is being kept as follows:
0: No match
-1: Select classid given in "tc filter ..." command
else: flowid, overwrite the default one
As a minimal usage example with iproute2, we use a 3 band prio root qdisc
on a router with sfq each as leave, and assign ssh and icmp bpf-based
filters to band 1, http traffic to band 2 and the rest to band 3. For the
first two bands we load the bytecode from a file, in the 2nd we load it
inline as an example:
echo 1 > /proc/sys/net/core/bpf_jit_enable
tc qdisc del dev em1 root
tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
tc qdisc add dev em1 parent 1:1 sfq perturb 16
tc qdisc add dev em1 parent 1:2 sfq perturb 16
tc qdisc add dev em1 parent 1:3 sfq perturb 16
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3
BPF programs can be easily created and passed to tc, either as inline
'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
compile opcodes, for example:
1) People familiar with tcpdump-like filters:
tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf
2) People that want to low-level program their filters or use BPF
extensions that lack support by libpcap's compiler:
bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf
ssh.ops example code:
ldh [12]
jne #0x800, drop
ldb [23]
jneq #6, drop
ldh [20]
jset #0x1fff, drop
ldxb 4 * ([14] & 0xf)
ldh [%x + 14]
jeq #0x16, pass
ldh [%x + 16]
jne #0x16, drop
pass: ret #-1
drop: ret #0
It was chosen to load bytecode into tc, since the reverse operation,
tc filter list dev em1, is then able to show the exact commands again.
Possible follow-up work could also include a small expression compiler
for iproute2. Tested with the help of bmon. This idea came up during
the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
Eric Dumazet!
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-28 16:43:02 +01:00
config NET_CLS_BPF
tristate "BPF-based classifier"
select NET_CLS
2020-06-14 01:50:22 +09:00
help
net: sched: cls_bpf: add BPF-based classifier
This work contains a lightweight BPF-based traffic classifier that can
serve as a flexible alternative to ematch-based tree classification, i.e.
now that BPF filter engine can also be JITed in the kernel. Naturally, tc
actions and policies are supported as well with cls_bpf. Multiple BPF
programs/filter can be attached for a class, or they can just as well be
written within a single BPF program, that's really up to the user how he
wishes to run/optimize the code, e.g. also for inversion of verdicts etc.
The notion of a BPF program's return/exit codes is being kept as follows:
0: No match
-1: Select classid given in "tc filter ..." command
else: flowid, overwrite the default one
As a minimal usage example with iproute2, we use a 3 band prio root qdisc
on a router with sfq each as leave, and assign ssh and icmp bpf-based
filters to band 1, http traffic to band 2 and the rest to band 3. For the
first two bands we load the bytecode from a file, in the 2nd we load it
inline as an example:
echo 1 > /proc/sys/net/core/bpf_jit_enable
tc qdisc del dev em1 root
tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
tc qdisc add dev em1 parent 1:1 sfq perturb 16
tc qdisc add dev em1 parent 1:2 sfq perturb 16
tc qdisc add dev em1 parent 1:3 sfq perturb 16
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3
BPF programs can be easily created and passed to tc, either as inline
'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
compile opcodes, for example:
1) People familiar with tcpdump-like filters:
tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf
2) People that want to low-level program their filters or use BPF
extensions that lack support by libpcap's compiler:
bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf
ssh.ops example code:
ldh [12]
jne #0x800, drop
ldb [23]
jneq #6, drop
ldh [20]
jset #0x1fff, drop
ldxb 4 * ([14] & 0xf)
ldh [%x + 14]
jeq #0x16, pass
ldh [%x + 16]
jne #0x16, drop
pass: ret #-1
drop: ret #0
It was chosen to load bytecode into tc, since the reverse operation,
tc filter list dev em1, is then able to show the exact commands again.
Possible follow-up work could also include a small expression compiler
for iproute2. Tested with the help of bmon. This idea came up during
the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
Eric Dumazet!
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-28 16:43:02 +01:00
If you say Y here, you will be able to classify packets based on
programmable BPF (JIT'ed) filters as an alternative to ematches.
To compile this code as a module, choose M here: the module will
be called cls_bpf.
2015-05-12 14:56:21 +02:00
config NET_CLS_FLOWER
tristate "Flower classifier"
select NET_CLS
2020-06-14 01:50:22 +09:00
help
2015-05-12 14:56:21 +02:00
If you say Y here, you will be able to classify packets based on
a configurable combination of packet keys and masks.
To compile this code as a module, choose M here: the module will
be called cls_flower.
2016-07-21 12:03:11 +02:00
config NET_CLS_MATCHALL
tristate "Match-all classifier"
select NET_CLS
2020-06-14 01:50:22 +09:00
help
2016-07-21 12:03:11 +02:00
If you say Y here, you will be able to classify packets based on
nothing. Every packet will match.
To compile this code as a module, choose M here: the module will
be called cls_matchall.
2005-04-16 15:20:36 -07:00
config NET_EMATCH
bool "Extended Matches"
2005-11-01 15:13:02 +01:00
select NET_CLS
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to use extended matches on top of classifiers
and select the extended matches below.
Extended matches are small classification helpers not worth writing
2005-11-01 15:13:02 +01:00
a separate classifier for.
2005-04-16 15:20:36 -07:00
2005-11-01 15:13:02 +01:00
A recent version of the iproute2 package is required to use
2005-04-16 15:20:36 -07:00
extended matches.
config NET_EMATCH_STACK
int "Stack size"
depends on NET_EMATCH
default "32"
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Size of the local stack variable used while evaluating the tree of
ematches. Limits the depth of the tree, i.e. the number of
2005-06-08 15:10:22 -07:00
encapsulated precedences. Every level requires 4 bytes of additional
2005-04-16 15:20:36 -07:00
stack space.
config NET_EMATCH_CMP
tristate "Simple packet data comparison"
depends on NET_EMATCH
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to be able to classify packets based on
simple packet data comparisons for 8, 16, and 32bit values.
To compile this code as a module, choose M here: the
module will be called em_cmp.
config NET_EMATCH_NBYTE
tristate "Multi byte comparison"
depends on NET_EMATCH
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to be able to classify packets based on
multiple byte comparisons mainly useful for IPv6 address comparisons.
To compile this code as a module, choose M here: the
module will be called em_nbyte.
config NET_EMATCH_U32
2005-11-01 15:13:02 +01:00
tristate "U32 key"
2005-04-16 15:20:36 -07:00
depends on NET_EMATCH
2020-06-14 01:50:22 +09:00
help
2005-04-16 15:20:36 -07:00
Say Y here if you want to be able to classify packets using
the famous u32 key in combination with logic relations.
To compile this code as a module, choose M here: the
module will be called em_u32.
config NET_EMATCH_META
tristate "Metadata"
depends on NET_EMATCH
2020-06-14 01:50:22 +09:00
help
2006-01-11 16:40:30 -08:00
Say Y here if you want to be able to classify packets based on
2005-04-16 15:20:36 -07:00
metadata such as load average, netfilter attributes, socket
attributes and routing decisions.
To compile this code as a module, choose M here: the
module will be called em_meta.
2005-06-23 21:00:58 -07:00
config NET_EMATCH_TEXT
tristate "Textsearch"
depends on NET_EMATCH
2005-06-23 23:55:41 -07:00
select TEXTSEARCH
2005-06-24 17:39:03 -07:00
select TEXTSEARCH_KMP
2005-08-25 16:23:11 -07:00
select TEXTSEARCH_BM
2005-06-24 17:39:03 -07:00
select TEXTSEARCH_FSM
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
Say Y here if you want to be able to classify packets based on
2005-06-24 17:39:03 -07:00
textsearch comparisons.
2005-06-23 21:00:58 -07:00
To compile this code as a module, choose M here: the
module will be called em_text.
2012-07-04 05:32:03 +02:00
config NET_EMATCH_CANID
tristate "CAN Identifier"
2012-11-23 00:44:57 +00:00
depends on NET_EMATCH && (CAN=y || CAN=m)
2020-06-14 01:50:22 +09:00
help
2012-07-04 05:32:03 +02:00
Say Y here if you want to be able to classify CAN frames based
on CAN Identifier.
To compile this code as a module, choose M here: the
module will be called em_canid.
2012-07-11 10:56:57 +00:00
config NET_EMATCH_IPSET
tristate "IPset"
depends on NET_EMATCH && IP_SET
2020-06-14 01:50:22 +09:00
help
2012-07-11 10:56:57 +00:00
Say Y here if you want to be able to classify packets based on
ipset membership.
To compile this code as a module, choose M here: the
module will be called em_ipset.
2018-02-15 19:42:43 +02:00
config NET_EMATCH_IPT
tristate "IPtables Matches"
depends on NET_EMATCH && NETFILTER && NETFILTER_XTABLES
2020-06-14 01:50:22 +09:00
help
2018-02-15 19:42:43 +02:00
Say Y here to be able to classify packets based on iptables
matches.
Current supported match is "policy" which allows packet classification
based on IPsec policy that was used during decapsulation
To compile this code as a module, choose M here: the
module will be called em_ipt.
2005-04-16 15:20:36 -07:00
config NET_CLS_ACT
2005-11-01 15:13:02 +01:00
bool "Actions"
2017-06-04 18:49:28 +02:00
select NET_CLS
bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.
Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each others toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:
- From Meta: "It's especially important for applications that are deployed
fleet-wide and that don't "control" hosts they are deployed to. If such
application crashes and no one notices and does anything about that, BPF
program will keep running draining resources or even just, say, dropping
packets. We at FB had outages due to such permanent BPF attachment
semantics. With fd-based BPF link we are getting a framework, which allows
safe, auto-detachable behavior by default, unless application explicitly
opts in by pinning the BPF link." [1]
- From Cilium-side the tc BPF programs we attach to host-facing veth devices
and phys devices build the core datapath for Kubernetes Pods, and they
implement forwarding, load-balancing, policy, EDT-management, etc, within
BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
experienced hard-to-debug issues in a user's staging environment where
another Kubernetes application using tc BPF attached to the same prio/handle
of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
it. The goal is to establish a clear/safe ownership model via links which
cannot accidentally be overridden. [0,2]
BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.
We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.
For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.
For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.
The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given its all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.
tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.
Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.
[0] https://lpc.events/event/16/contributions/1353/
[1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
[2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
[3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
[4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 16:08:52 +02:00
select NET_XGRESS
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
Say Y here if you want to use traffic control actions. Actions
get attached to classifiers and are invoked after a successful
classification. They are used to overwrite the classification
result, instantly drop or redirect packets, etc.
A recent version of the iproute2 package is required to use
extended matches.
2005-04-16 15:20:36 -07:00
config NET_ACT_POLICE
2005-11-01 15:13:02 +01:00
tristate "Traffic Policing"
2019-09-23 17:52:42 +02:00
depends on NET_CLS_ACT
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
Say Y here if you want to do traffic policing, i.e. strict
bandwidth limiting. This action replaces the existing policing
module.
To compile this code as a module, choose M here: the
2010-02-08 22:41:44 -08:00
module will be called act_police.
2005-04-16 15:20:36 -07:00
config NET_ACT_GACT
2019-09-23 17:52:42 +02:00
tristate "Generic actions"
depends on NET_CLS_ACT
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
Say Y here to take generic actions such as dropping and
accepting packets.
To compile this code as a module, choose M here: the
2010-02-08 22:41:44 -08:00
module will be called act_gact.
2005-04-16 15:20:36 -07:00
config GACT_PROB
2019-09-23 17:52:42 +02:00
bool "Probability support"
depends on NET_ACT_GACT
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
Say Y here to use the generic action randomly or deterministically.
2005-04-16 15:20:36 -07:00
config NET_ACT_MIRRED
2019-09-23 17:52:42 +02:00
tristate "Redirecting and Mirroring"
depends on NET_CLS_ACT
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
Say Y here to allow packets to be mirrored or redirected to
other devices.
To compile this code as a module, choose M here: the
2010-02-08 22:41:44 -08:00
module will be called act_mirred.
2005-04-16 15:20:36 -07:00
2017-01-23 11:07:09 +01:00
config NET_ACT_SAMPLE
2019-09-23 17:52:42 +02:00
tristate "Traffic Sampling"
depends on NET_CLS_ACT
select PSAMPLE
2020-06-14 01:50:22 +09:00
help
2017-01-23 11:07:09 +01:00
Say Y here to allow packet sampling tc action. The packet sample
action consists of statistically choosing packets and sampling
them using the psample module.
To compile this code as a module, choose M here: the
module will be called act_sample.
2005-04-16 15:20:36 -07:00
config NET_ACT_IPT
2019-09-23 17:52:42 +02:00
tristate "IPtables targets"
2020-12-08 21:47:07 +01:00
depends on NET_CLS_ACT && NETFILTER && NETFILTER_XTABLES
2020-06-14 01:50:22 +09:00
help
2006-06-30 18:53:46 +02:00
Say Y here to be able to invoke iptables targets after successful
2005-11-01 15:13:02 +01:00
classification.
To compile this code as a module, choose M here: the
2010-02-08 22:41:44 -08:00
module will be called act_ipt.
2005-04-16 15:20:36 -07:00
2007-09-27 12:48:05 -07:00
config NET_ACT_NAT
2019-09-23 17:52:42 +02:00
tristate "Stateless NAT"
depends on NET_CLS_ACT
2020-06-14 01:50:22 +09:00
help
2007-09-27 12:48:05 -07:00
Say Y here to do stateless NAT on IPv4 packets. You should use
netfilter for NAT unless you know what you are doing.
To compile this code as a module, choose M here: the
2010-02-08 22:41:44 -08:00
module will be called act_nat.
2007-09-27 12:48:05 -07:00
2005-04-16 15:20:36 -07:00
config NET_ACT_PEDIT
2019-09-23 17:52:42 +02:00
tristate "Packet Editing"
depends on NET_CLS_ACT
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
Say Y here if you want to mangle the content of packets.
2005-04-16 15:20:36 -07:00
2005-11-01 15:13:02 +01:00
To compile this code as a module, choose M here: the
2010-02-08 22:41:44 -08:00
module will be called act_pedit.
2005-04-16 15:20:36 -07:00
2005-04-24 20:10:16 -07:00
config NET_ACT_SIMP
2019-09-23 17:52:42 +02:00
tristate "Simple Example (Debug)"
depends on NET_CLS_ACT
2020-06-14 01:50:22 +09:00
help
2005-11-01 15:13:02 +01:00
Say Y here to add a simple action for demonstration purposes.
It is meant as an example and for debugging purposes. It will
print a configured policy string followed by the packet count
to the console for every packet that passes by.
If unsure, say N.
To compile this code as a module, choose M here: the
2010-02-08 22:41:44 -08:00
module will be called act_simple.
2005-11-01 15:13:02 +01:00
2008-09-12 16:30:20 -07:00
config NET_ACT_SKBEDIT
2019-09-23 17:52:42 +02:00
tristate "SKB Editing"
depends on NET_CLS_ACT
2020-06-14 01:50:22 +09:00
help
2008-09-12 16:30:20 -07:00
Say Y here to change skb priority or queue_mapping settings.
If unsure, say N.
To compile this code as a module, choose M here: the
2010-02-08 22:41:44 -08:00
module will be called act_skbedit.
2008-09-12 16:30:20 -07:00
2010-08-18 13:10:35 +00:00
config NET_ACT_CSUM
2019-09-23 17:52:42 +02:00
tristate "Checksum Updating"
depends on NET_CLS_ACT && INET
select LIBCRC32C
2020-06-14 01:50:22 +09:00
help
2010-08-18 13:10:35 +00:00
Say Y here to update some common checksum after some direct
packet alterations.
To compile this code as a module, choose M here: the
module will be called act_csum.
2019-07-07 15:01:57 +01:00
config NET_ACT_MPLS
tristate "MPLS manipulation"
depends on NET_CLS_ACT
help
Say Y here to push or pop MPLS headers.
If unsure, say N.
To compile this code as a module, choose M here: the
module will be called act_mpls.
2014-11-19 14:05:03 +01:00
config NET_ACT_VLAN
2019-09-23 17:52:42 +02:00
tristate "Vlan manipulation"
depends on NET_CLS_ACT
2020-06-14 01:50:22 +09:00
help
2014-11-19 14:05:03 +01:00
Say Y here to push or pop vlan headers.
If unsure, say N.
To compile this code as a module, choose M here: the
module will be called act_vlan.
2015-01-15 09:52:39 +01:00
config NET_ACT_BPF
2019-09-23 17:52:42 +02:00
tristate "BPF based action"
depends on NET_CLS_ACT
2020-06-14 01:50:22 +09:00
help
2015-01-15 09:52:39 +01:00
Say Y here to execute BPF code on packets. The BPF code will decide
if the packet should be dropped or not.
If unsure, say N.
To compile this code as a module, choose M here: the
module will be called act_bpf.
2015-01-18 16:35:14 -05:00
config NET_ACT_CONNMARK
2019-09-23 17:52:42 +02:00
tristate "Netfilter Connection Mark Retriever"
2020-12-08 21:47:07 +01:00
depends on NET_CLS_ACT && NETFILTER
2019-09-23 17:52:42 +02:00
depends on NF_CONNTRACK && NF_CONNTRACK_MARK
2020-06-14 01:50:22 +09:00
help
2015-01-18 16:35:14 -05:00
Say Y here to allow retrieving of conn mark
If unsure, say N.
To compile this code as a module, choose M here: the
module will be called act_connmark.
2019-05-28 17:03:50 +00:00
config NET_ACT_CTINFO
2019-09-23 17:52:42 +02:00
tristate "Netfilter Connection Mark Actions"
2020-12-08 21:47:07 +01:00
depends on NET_CLS_ACT && NETFILTER
2019-09-23 17:52:42 +02:00
depends on NF_CONNTRACK && NF_CONNTRACK_MARK
help
2019-05-28 17:03:50 +00:00
Say Y here to allow transfer of a connmark stored information.
Current actions transfer connmark stored DSCP into
ipv4/v6 diffserv and/or to transfer connmark to packet
mark. Both are useful for restoring egress based marks
back onto ingress connections for qdisc priority mapping
purposes.
If unsure, say N.
To compile this code as a module, choose M here: the
module will be called act_ctinfo.
2016-09-12 20:13:09 -04:00
config NET_ACT_SKBMOD
2019-09-23 17:52:42 +02:00
tristate "skb data modification action"
depends on NET_CLS_ACT
2020-06-14 01:50:22 +09:00
help
2019-09-23 17:52:42 +02:00
Say Y here to allow modification of skb data
2016-09-12 20:13:09 -04:00
2019-09-23 17:52:42 +02:00
If unsure, say N.
2016-09-12 20:13:09 -04:00
2019-09-23 17:52:42 +02:00
To compile this code as a module, choose M here: the
module will be called act_skbmod.
2016-09-12 20:13:09 -04:00
2016-02-27 08:08:54 -05:00
config NET_ACT_IFE
2019-09-23 17:52:42 +02:00
tristate "Inter-FE action based on IETF ForCES InterFE LFB"
depends on NET_CLS_ACT
select NET_IFE
2020-06-14 01:50:22 +09:00
help
2016-02-27 08:08:54 -05:00
Say Y here to allow for sourcing and terminating metadata
For details refer to netdev01 paper:
"Distributing Linux Traffic Control Classifier-Action Subsystem"
Authors: Jamal Hadi Salim and Damascene M. Joachimpillai
To compile this code as a module, choose M here: the
module will be called act_ife.
2016-09-08 16:23:48 +03:00
config NET_ACT_TUNNEL_KEY
2019-09-23 17:52:42 +02:00
tristate "IP tunnel metadata manipulation"
depends on NET_CLS_ACT
2020-06-14 01:50:22 +09:00
help
2016-09-08 16:23:48 +03:00
Say Y here to set/release ip tunnel metadata.
If unsure, say N.
To compile this code as a module, choose M here: the
module will be called act_tunnel_key.
2019-07-09 10:30:48 +03:00
config NET_ACT_CT
2019-09-23 17:52:42 +02:00
tristate "connection tracking tc action"
2022-11-18 16:33:03 -05:00
depends on NET_CLS_ACT && NF_CONNTRACK && (!NF_NAT || NF_NAT) && NF_FLOW_TABLE
2023-02-07 17:52:06 -05:00
select NF_CONNTRACK_OVS
2022-12-08 11:56:12 -05:00
select NF_NAT_OVS if NF_NAT
2019-09-23 17:52:42 +02:00
help
2019-07-09 10:30:48 +03:00
Say Y here to allow sending the packets to conntrack module.
If unsure, say N.
To compile this code as a module, choose M here: the
module will be called act_ct.
2020-05-01 08:53:15 +08:00
config NET_ACT_GATE
tristate "Frame gate entry list control tc action"
depends on NET_CLS_ACT
help
Say Y here to allow to control the ingress flow to be passed at
specific time slot and be dropped at other specific time slot by
the gate entry list.
If unsure, say N.
To compile this code as a module, choose M here: the
module will be called act_gate.
2016-02-27 08:08:55 -05:00
config NET_IFE_SKBMARK
2019-09-23 17:52:42 +02:00
tristate "Support to encoding decoding skb mark on IFE action"
depends on NET_ACT_IFE
2016-02-27 08:08:55 -05:00
2016-02-27 08:08:56 -05:00
config NET_IFE_SKBPRIO
2019-09-23 17:52:42 +02:00
tristate "Support to encoding decoding skb prio on IFE action"
depends on NET_ACT_IFE
2016-02-27 08:08:56 -05:00
2016-09-18 07:31:43 -04:00
config NET_IFE_SKBTCINDEX
2019-09-23 17:52:42 +02:00
tristate "Support to encoding decoding skb tcindex on IFE action"
depends on NET_ACT_IFE
2016-09-18 07:31:43 -04:00
net: openvswitch: Set OvS recirc_id from tc chain index
Offloaded OvS datapath rules are translated one to one to tc rules,
for example the following simplified OvS rule:
recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)
Will be translated to the following tc rule:
$ tc filter add dev dev1 ingress \
prio 1 chain 0 proto ip \
flower tcp ct_state -trk \
action ct pipe \
action goto chain 2
Received packets will first travel though tc, and if they aren't stolen
by it, like in the above rule, they will continue to OvS datapath.
Since we already did some actions (action ct in this case) which might
modify the packets, and updated action stats, we would like to continue
the proccessing with the correct recirc_id in OvS (here recirc_id(2))
where we left off.
To support this, introduce a new skb extension for tc, which
will be used for translating tc chain to ovs recirc_id to
handle these miss cases. Last tc chain index will be set
by tc goto chain action and read by OvS datapath.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-04 16:56:37 +03:00
config NET_TC_SKB_EXT
bool "TC recirculation support"
depends on NET_CLS_ACT
select SKB_EXTENSIONS
help
Say Y here to allow tc chain misses to continue in OvS datapath in
the correct recirc_id, and hardware chain misses to continue in
the correct chain in tc software datapath.
Say N here if you won't be using tc<->ovs offload or tc chains offload.
2005-11-17 15:22:39 -08:00
endif # NET_SCHED
2007-10-18 21:56:38 -07:00
config NET_SCH_FIFO
bool