linux/include/net
Eric Dumazet 218af599fa tcp: internal implementation for pacing
BBR congestion control depends on pacing, and pacing is
currently handled by sch_fq packet scheduler for performance reasons,
and also because implemening pacing with FQ was convenient to truly
avoid bursts.

However there are many cases where this packet scheduler constraint
is not practical.
- Many linux hosts are not focusing on handling thousands of TCP
  flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
  to use BBR for the few TCP flows they initiate/terminate.

This patch implements an automatic fallback to internal pacing.

Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.

If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.

One advantage of pacing from TCP stack is to get more precise rtt
estimations, and less work done from TX completion, since TCP Small
queue limits are not generally hit. Setups with single TX queue but
many cpus might even benefit from this.

Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.

Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.

If MQ+pfifo_fast is used on the NIC :

$ sar -n DEV 1 5 | grep eth
14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
 4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
 0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
 1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
 1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0

Now use MQ+FQ :

lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq

$ sar -n DEV 1 5 | grep eth
14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
 1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
 0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
 4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
 3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0

As expected, number of interrupts per second is very different.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:43:31 -04:00
..
9p 9p: constify ->d_name handling 2017-01-12 04:01:17 -05:00
bluetooth Bluetooth: L2CAP: Fix L2CAP_CR_SCID_IN_USE value 2017-04-12 22:02:37 +02:00
caif
irda scripts/spelling.txt: add "overide" pattern and fix typo instances 2017-03-09 17:01:09 -08:00
iucv s390/iucv: do not use arrays as argument 2015-09-21 16:03:04 -07:00
netfilter netfilter: nf_ct_ext: invoke destroy even when ext is not attached 2017-05-01 11:48:49 +02:00
netns can: network namespace support for CAN gateway 2017-04-25 09:04:30 +02:00
nfc NFC: Add nfc_dbg() macro 2017-04-05 10:15:20 +02:00
phonet sock: struct proto hash function may error 2016-02-11 03:54:14 -05:00
sctp sctp: process duplicated strreset out and addstrm out requests correctly 2017-04-18 13:39:50 -04:00
tc_act net/sched: Removed unused vlan actions definition 2017-04-06 13:28:35 -07:00
6lowpan.h 6lowpan: Fix IID format for Bluetooth 2017-04-12 22:02:36 +02:00
act_api.h net sched actions: Add support for user cookies 2017-01-25 12:37:04 -05:00
addrconf.h ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf 2017-05-08 17:31:24 -04:00
af_ieee802154.h ieee802154: af_ieee802154: fix typo in comment. 2015-09-17 13:20:05 +02:00
af_rxrpc.h rxrpc: Note a successfully aborted kernel operation 2017-04-06 10:11:59 +01:00
af_unix.h af_unix: split 'u->readlock' into two: 'iolock' and 'bindlock' 2016-09-04 13:29:29 -07:00
af_vsock.h VSOCK: Add vsockmon tap functions 2017-04-24 12:35:56 -04:00
ah.h
arp.h net: add confirm_neigh method to dst_ops 2017-02-07 13:07:46 -05:00
atmclip.h
ax25.h
ax88796.h
bond_3ad.h bonding: 3ad: apply ad_actor settings changes immediately 2016-02-09 04:45:49 -05:00
bond_alb.h
bond_options.h bonding: convert num_grat_arp to the new bonding option API 2015-07-27 01:05:24 -07:00
bonding.h bonding: fix wq initialization for links created via netlink 2017-04-21 15:28:37 -04:00
busy_poll.h net: Commonize busy polling code to focus on napi_id instead of socket 2017-03-24 20:49:31 -07:00
calipso.h calipso: Add a label cache. 2016-06-27 15:06:17 -04:00
cfg80211-wext.h
cfg80211.h cfg80211: fix multi scheduled scan kernel-doc 2017-05-08 13:09:38 +02:00
cfg802154.h ieee802154: add netns support 2016-07-08 12:20:57 +02:00
checksum.h csum: eliminate sparse warning in remcsum_unadjust() 2017-01-20 12:12:13 -05:00
cipso_ipv4.h netlabel: out of bound access in cipso_v4_validate() 2017-02-04 19:44:22 -05:00
cls_cgroup.h cls_cgroup: get sk_classid only from full sockets 2016-04-19 20:09:25 -04:00
codel_impl.h codel: split into multiple files 2016-04-25 16:44:27 -04:00
codel_qdisc.h net_sched: fq_codel: cache skb->truesize into skb->cb 2016-06-25 12:19:35 -04:00
codel.h codel: split into multiple files 2016-04-25 16:44:27 -04:00
compat.h packet: compat support for sock_fprog 2016-06-09 23:41:03 -07:00
datalink.h
dcbevent.h
dcbnl.h
devlink.h net/devlink: Add E-Switch encapsulation control 2017-04-22 20:26:37 +03:00
dn_dev.h
dn_fib.h
dn_neigh.h netfilter: Pass net into okfn 2015-09-17 17:18:37 -07:00
dn_nsp.h
dn_route.h
dn.h
dsa.h net: dsa: add support for the SMSC-LAN9303 tagging format 2017-04-20 13:48:54 -04:00
dsfield.h
dst_cache.h net: add dst_cache support 2016-02-16 20:21:48 -05:00
dst_metadata.h net/dst: Add dst port to dst_metadata utility functions 2016-11-09 13:41:54 -05:00
dst_ops.h net: add confirm_neigh method to dst_ops 2017-02-07 13:07:46 -05:00
dst.h net: rename dst_neigh_output back to neigh_output 2017-02-11 21:25:18 -05:00
esp.h esp6: Reorganize esp_output 2017-04-14 10:06:42 +02:00
ethoc.h net/ethoc: support big-endian register layout 2015-09-23 15:33:15 -07:00
fib_rules.h net: rtnetlink: plumb extended ack to doit function 2017-04-17 15:35:38 -04:00
firewire.h
flow_dissector.h flow_dissector: add mpls support (v2) 2017-04-24 14:30:46 -04:00
flow.h flowcache: make flow_key_size() return "unsigned int" 2017-04-03 19:04:48 -07:00
flowcache.h flowcache: more "unsigned int" 2017-04-03 19:04:48 -07:00
fou.h fou: Add encap ops for IPv6 tunnels 2016-05-20 18:03:16 -04:00
fq_impl.h fq.h: Port memory limit mechanism from fq_codel 2016-09-30 13:29:21 +02:00
fq.h fq.h: Port memory limit mechanism from fq_codel 2016-09-30 13:29:21 +02:00
garp.h
gen_stats.h net_sched: gen_estimator: complete rewrite of rate estimators 2016-12-05 15:21:59 -05:00
genetlink.h netlink: pass extended ACK struct to parsing functions 2017-04-13 13:58:22 -04:00
geneve.h net: Remove deprecated tunnel specific UDP offload functions 2016-06-17 20:23:32 -07:00
gre.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-08-18 01:17:32 -04:00
gro_cells.h gro_cells: move to net/core/gro_cells.c 2017-02-08 14:38:18 -05:00
gtp.h gtp: #define #define _GTP_H_ and not #define _GTP_H 2016-07-25 17:55:43 -07:00
gue.h
hwbm.h net: add a hardware buffer management helper API 2016-03-14 12:19:46 -04:00
icmp.h net: snmp: kill STATS_BH macros 2016-04-27 22:48:25 -04:00
ieee80211_radiotap.h wireless: radiotap: rewrite the radiotap header file 2017-01-25 16:00:33 +01:00
ieee802154_netdev.h mac802154: constify ieee802154_llsec_ops structure 2016-01-04 20:40:41 +01:00
if_inet6.h net/ipv6: allow sysctl to change link-local address generation mode 2017-01-27 10:25:34 -05:00
ife.h net: Introduce ife encapsulation module 2017-02-03 15:16:45 -05:00
ila.h ila: Add generic ILA translation facility 2015-12-15 23:25:20 -05:00
inet6_connection_sock.h inet: drop ->bind_conflict 2017-01-18 13:04:28 -05:00
inet6_hashtables.h tcp/dccp: do not touch listener sk_refcnt under synflood 2016-04-04 22:11:20 -04:00
inet_common.h net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
inet_connection_sock.h net: Work around lockdep limitation in sockets that use sockets 2017-03-09 18:23:27 -08:00
inet_ecn.h ipv6: suppress sparse warnings in IP6_ECN_set_ce() 2016-08-13 15:08:00 -07:00
inet_frag.h net: remove bh disabling around percpu_counter accesses 2017-01-20 11:27:22 -05:00
inet_hashtables.h inet: reset tb->fastreuseport when adding a reuseport sk 2017-01-18 13:04:29 -05:00
inet_sock.h net/tcp-fastopen: Add new API support 2017-01-25 14:04:38 -05:00
inet_timewait_sock.h ipv4: Namespaceify tcp_tw_recycle and tcp_max_tw_buckets knob 2016-12-29 11:38:31 -05:00
inetpeer.h inet: tcp: fix inetpeer_set_addr_v4() 2015-12-16 00:14:12 -05:00
ip6_checksum.h ipv6: Pass proto to csum_ipv6_magic as __u8 instead of unsigned short 2016-03-13 23:55:13 -04:00
ip6_fib.h net: ipv6: Allow shorthand delete of all nexthops in multipath route 2017-02-04 19:58:14 -05:00
ip6_route.h ipv6: initialize route null entry in addrconf_init() 2017-05-04 12:51:24 -04:00
ip6_tunnel.h ip6_tunnel: Allow policy-based routing through tunnels 2017-04-21 13:21:30 -04:00
ip_fib.h net: ipv4: add support for ECMP hash policy choice 2017-03-21 15:27:19 -07:00
ip_tunnels.h ip_tunnel: Allow policy-based routing through tunnels 2017-04-21 13:21:31 -04:00
ip_vs.h ipvs: remove unused function ip_vs_set_state_timeout 2017-04-28 12:00:10 +02:00
ip.h net: ipv4: Refine the ipv4_default_advmss 2017-04-13 13:19:48 -04:00
ipcomp.h
ipconfig.h
ipv6.h ipv6: fix flow labels when the traffic class is non-0 2017-01-31 13:16:59 -05:00
ipx.h
iw_handler.h wext: uninline stream addition functions 2017-01-13 09:38:42 +01:00
kcm.h kcm: Use stream parser 2016-08-17 19:36:23 -04:00
l3mdev.h net: ipv4: Do not drop to make_route if oif is l3mdev 2016-10-13 12:05:26 -04:00
lapb.h
lib80211.h
llc_c_ac.h
llc_c_ev.h
llc_c_st.h
llc_conn.h
llc_if.h
llc_pdu.h
llc_s_ac.h
llc_s_ev.h
llc_s_st.h
llc_sap.h
llc.h
lwtunnel.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-11 02:31:11 -05:00
mac80211.h mac80211: properly remove RX_ENC_FLAG_40MHZ 2017-05-08 11:11:56 +02:00
mac802154.h ieee802154: cleanup WARN_ON for fc fetch 2016-07-08 13:23:12 +02:00
mip6.h
mld.h
mpls_iptunnel.h net: mpls: Increase max number of labels for lwt encap 2017-04-01 20:21:44 -07:00
mpls.h openvswitch: use mpls_hdr 2016-10-03 02:00:22 -04:00
mrp.h
ncsi.h net/ncsi: Introduce ncsi_stop_dev() 2016-10-04 02:11:51 -04:00
ndisc.h ipv6: add support for NETDEV_RESEND_IGMP event 2017-03-28 22:02:21 -07:00
neighbour.h net: neigh: make ->hh_len 32-bit 2017-04-12 13:59:21 -04:00
net_namespace.h can: initial support for network namespaces 2017-04-04 17:35:58 +02:00
net_ratelimit.h
netevent.h neigh: Send a notification when DELAY_PROBE_TIME changes 2016-07-05 09:06:29 -07:00
netlabel.h netlabel: Implement CALIPSO config functions for SMACK. 2016-06-27 15:06:18 -04:00
netlink.h netlink: pass extended ACK struct to parsing functions 2017-04-13 13:58:22 -04:00
netprio_cgroup.h net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct 2015-12-08 22:02:33 -05:00
netrom.h
nexthop.h
nl802154.h ieee802154: add netns support 2016-07-08 12:20:57 +02:00
p8022.h
ping.h net: ping: make ping_v6_sendmsg static 2016-03-23 22:09:58 -04:00
pkt_cls.h net/sched: Reflect HW offload status 2017-02-17 12:08:05 -05:00
pkt_sched.h net: sched: make default fifo qdiscs appear in the dump 2017-03-12 22:53:02 -07:00
pptp.h pptp: Refactor the struct and macros of PPTP codes 2016-08-15 10:55:53 -07:00
protocol.h net: Add sysctl to toggle early demux for tcp and udp 2017-03-24 13:17:07 -07:00
psample.h net: Introduce psample, a new genetlink channel for packet sampling 2017-01-24 13:44:28 -05:00
psnap.h
raw.h net: ip, diag -- Add diag interface for raw sockets 2016-10-23 19:35:24 -04:00
rawv6.h net: ip, diag -- Add diag interface for raw sockets 2016-10-23 19:35:24 -04:00
red.h ktime: Get rid of the union 2016-12-25 17:21:22 +01:00
regulatory.h
request_sock.h ipv4: Namespaceify tcp_max_syn_backlog knob 2016-12-29 11:38:31 -05:00
rose.h
route.h Revert "ipv4: restore rt->fi for reference counting" 2017-05-08 22:35:32 -04:00
rtnetlink.h net: rtnetlink: plumb extended ack to doit function 2017-04-17 15:35:38 -04:00
sch_generic.h net_sched: move the empty tp check from ->destroy() to ->delete() 2017-04-21 13:58:15 -04:00
scm.h sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h> 2017-03-02 08:42:31 +01:00
secure_seq.h tcp: randomize timestamps on syncookies 2017-05-05 12:00:11 -04:00
seg6_hmac.h ipv6: sr: add core files for SR HMAC support 2016-11-09 20:40:06 -05:00
seg6.h ipv6: sr: add core files for SR HMAC support 2016-11-09 20:40:06 -05:00
slhc_vj.h
smc.h smc: netlink interface for SMC sockets 2017-01-09 16:07:41 -05:00
snmp.h net: snmp: fix 64bit stats on 32bit arches 2016-04-28 11:49:45 -04:00
sock_reuseport.h soreuseport: fix NULL ptr dereference SO_REUSEPORT after bind 2016-01-19 14:44:23 -05:00
sock.h tcp: internal implementation for pacing 2017-05-16 15:43:31 -04:00
Space.h
stp.h
strparser.h kcm: Remove TCP specific references from kcm and strparser 2016-08-28 23:32:41 -04:00
switchdev.h switchdev: bridge: Offload mc router ports 2017-02-10 11:46:39 -05:00
tcp_states.h
tcp.h tcp: internal implementation for pacing 2017-05-16 15:43:31 -04:00
timewait_sock.h
transp_v6.h ipv6: add new struct ipcm6_cookie 2016-05-03 16:08:14 -04:00
tso.h net: tso: add support for IPv6 2015-10-26 22:24:22 -07:00
udp_tunnel.h vxlan: Add new UDP encapsulation offload type for VXLAN-GPE 2016-06-17 20:23:32 -07:00
udp.h udp: use a separate rx queue for packet reception 2017-05-16 15:41:29 -04:00
udplite.h udp: use a separate rx queue for packet reception 2017-05-16 15:41:29 -04:00
vsock_addr.h
vxlan.h vxlan: remove unsed vxlan_dev_dst_port() 2016-11-15 12:16:13 -05:00
wext.h
wimax.h
x25.h
x25device.h
xfrm.h net: Add a xfrm validate function to validate_xmit_skb 2017-04-14 10:07:28 +02:00