10551 Commits

Author SHA1 Message Date
Neal Cardwell
29a949325c tcp: simplify tcp_set_congestion_control(): Always reinitialize
Now that the previous patches ensure that all call sites for
tcp_set_congestion_control() want to initialize congestion control, we
can simplify tcp_set_congestion_control() by removing the reinit
argument and the code to support it.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-10 20:53:01 -07:00
Neal Cardwell
8919a9b31e tcp: Only init congestion control if not initialized already
Change tcp_init_transfer() to only initialize congestion control if it
has not been initialized already.

With this new approach, we can arrange things so that if the EBPF code
sets the congestion control by calling setsockopt(TCP_CONGESTION) then
tcp_init_transfer() will not re-initialize the CC module.

This is an approach that has the following beneficial properties:

(1) This allows CC module customizations made by the EBPF called in
    tcp_init_transfer() to persist, and not be wiped out by a later
    call to tcp_init_congestion_control() in tcp_init_transfer().

(2) Does not flip the order of EBPF and CC init, to avoid causing bugs
    for existing code upstream that depends on the current order.

(3) Does not cause 2 initializations for for CC in the case where the
    EBPF called in tcp_init_transfer() wants to set the CC to a new CC
    algorithm.

(4) Allows follow-on simplifications to the code in net/core/filter.c
    and net/ipv4/tcp_cong.c, which currently both have some complexity
    to special-case CC initialization to avoid double CC
    initialization if EBPF sets the CC.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-10 20:53:01 -07:00
Wei Wang
ac8f1710c1 tcp: reflect tos value received in SYN to the socket
This commit adds a new TCP feature to reflect the tos value received in
SYN, and send it out on the SYN-ACK, and eventually set the tos value of
the established socket with this reflected tos value. This provides a
way to set the traffic class/QoS level for all traffic in the same
connection to be the same as the incoming SYN request. It could be
useful in data centers to provide equivalent QoS according to the
incoming request.
This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by
default turned off.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:15:40 -07:00
Wei Wang
de033b7d15 ip: pass tos into ip_build_and_send_pkt()
This commit adds tos as a new passed in parameter to
ip_build_and_send_pkt() which will be used in the later commit.
This is a pure restructure and does not have any functional change.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:15:40 -07:00
Wei Wang
e9b12edc13 tcp: record received TOS value in the request socket
A new field is added to the request sock to record the TOS value
received on the listening socket during 3WHS:
When not under syn flood, it is recording the TOS value sent in SYN.
When under syn flood, it is recording the TOS value sent in the ACK.
This is a preparation patch in order to do TOS reflection in the later
commit.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:15:40 -07:00
Paul Davey
bb82067c57 ipmr: Use full VIF ID in netlink cache reports
Insert the full 16 bit VIF ID into ipmr Netlink cache reports.

The VIF_ID attribute has 32 bits of space so can store the full VIF ID
extracted from the high and low byte fields in the igmpmsg.

Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:25:51 -07:00
Paul Davey
c8715a8e9f ipmr: Add high byte of VIF ID to igmpmsg
Use the unused3 byte in struct igmpmsg to hold the high 8 bits of the
VIF ID.

If using more than 255 IPv4 multicast interfaces it is necessary to have
access to a VIF ID for cache reports that is wider than 8 bits, the VIF
ID present in the igmpmsg reports sent to mroute_sk was only 8 bits wide
in the igmpmsg header.  Adding the high 8 bits of the 16 bit VIF ID in
the unused byte allows use of more than 255 IPv4 multicast interfaces.

Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:25:51 -07:00
Paul Davey
501cb00890 ipmr: Add route table ID to netlink cache reports
Insert the multicast route table ID as a Netlink attribute to Netlink
cache report notifications.

When multiple route tables are in use it is necessary to have a way to
determine which route table a given cache report belongs to when
receiving the cache report.

Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:25:51 -07:00
Wei Wang
ba9e04a7dd ip: fix tos reflection in ack and reset packets
Currently, in tcp_v4_reqsk_send_ack() and tcp_v4_send_reset(), we
echo the TOS value of the received packets in the response.
However, we do not want to echo the lower 2 ECN bits in accordance
with RFC 3168 6.1.5 robustness principles.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:19:08 -07:00
Wang Hai
7edce63666 cipso: fix 'audit_secid' kernel-doc warning in cipso_ipv4.c
Fixes the following W=1 kernel build warning(s):

net/ipv4/cipso_ipv4.c:510: warning: Excess function parameter 'audit_secid' description in 'cipso_v4_doi_remove'

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:03:36 -07:00
Jakub Kicinski
44a8c4f33c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.

Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444a4 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-04 21:28:59 -07:00
Wei Wang
c107761614 ip: expose inet sockopts through inet_diag
Expose all exisiting inet sockopt bits through inet_diag for debug purpose.
Corresponding changes in iproute2 ss will be submitted to output all
these values.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 15:17:28 -07:00
David S. Miller
150f29f5e6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-09-01

The following pull-request contains BPF updates for your *net-next* tree.

There are two small conflicts when pulling, resolve as follows:

1) Merge conflict in tools/lib/bpf/libbpf.c between 88a82120282b ("libbpf: Factor
   out common ELF operations and improve logging") in bpf-next and 1e891e513e16
   ("libbpf: Fix map index used in error message") in net-next. Resolve by taking
   the hunk in bpf-next:

        [...]
        scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
        data = elf_sec_data(obj, scn);
        if (!scn || !data) {
                pr_warn("elf: failed to get %s map definitions for %s\n",
                        MAPS_ELF_SEC, obj->path);
                return -EINVAL;
        }
        [...]

2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between
   9647c57b11e5 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for
   better performance") in bpf-next and e20f0dbf204f ("net/mlx5e: RX, Add a prefetch
   command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining
   net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:

        [...]
        xdp_set_data_meta_invalid(xdp);
        xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
        net_prefetch(xdp->data);
        [...]

We've added 133 non-merge commits during the last 14 day(s) which contain
a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).

The main changes are:

1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper
   for tracing to reliably access user memory, from Alexei Starovoitov.

2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.

3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.

4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.

5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.

6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.

7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.

8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.

9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.

10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.

11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.

12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.

13) Various smaller cleanups and improvements all over the place.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-01 13:22:59 -07:00
Miaohe Lin
34e1ec319e net: ipv4: remove unused arg exact_dif in compute_score
The arg exact_dif is not used anymore, remove it. inet_exact_dif_match()
is no longer needed after the above is removed, so remove it too.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 13:08:29 -07:00
Miaohe Lin
5af68891dc net: clean up codestyle
This is a pure codestyle cleanup patch. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:33:34 -07:00
Miaohe Lin
cbc08a3312 net: Use helper macro IP_MAX_MTU in __ip_append_data()
What 0xFFFF means here is actually the max mtu of a ip packet. Use help
macro IP_MAX_MTU here.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:33:16 -07:00
Randy Dunlap
4b7ddc58e6 netfilter: delete repeated words
Drop duplicated words in net/netfilter/ and net/ipv4/netfilter/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 20:11:38 +02:00
Miaohe Lin
645f08975f net: Fix some comments
Fix some comments, including wrong function name, duplicated word and so
on.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-27 07:55:59 -07:00
Ido Schimmel
885a3b1579 ipv4: nexthop: Correctly update nexthop group when replacing a nexthop
Each nexthop group contains an indication if it has IPv4 nexthops
('has_v4'). Its purpose is to prevent IPv6 routes from using groups with
IPv4 nexthops.

However, the indication is not updated when a nexthop is replaced. This
results in the kernel wrongly rejecting IPv6 routes from pointing to
groups that only contain IPv6 nexthops. Example:

# ip nexthop replace id 1 via 192.0.2.2 dev dummy10
# ip nexthop replace id 10 group 1
# ip nexthop replace id 1 via 2001:db8:1::2 dev dummy10
# ip route replace 2001:db8:10::/64 nhid 10
Error: IPv6 routes can not use an IPv4 nexthop.

Solve this by iterating over all the nexthop groups that the replaced
nexthop is a member of and potentially update their IPv4 indication
according to the new set of member nexthops.

Avoid wasting cycles by only performing the update in case an IPv4
nexthop is replaced by an IPv6 nexthop.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:00:51 -07:00
Ido Schimmel
863b25581c ipv4: nexthop: Correctly update nexthop group when removing a nexthop
Each nexthop group contains an indication if it has IPv4 nexthops
('has_v4'). Its purpose is to prevent IPv6 routes from using groups with
IPv4 nexthops.

However, the indication is not updated when a nexthop is removed. This
results in the kernel wrongly rejecting IPv6 routes from pointing to
groups that only contain IPv6 nexthops. Example:

# ip nexthop replace id 1 via 192.0.2.2 dev dummy10
# ip nexthop replace id 2 via 2001:db8:1::2 dev dummy10
# ip nexthop replace id 10 group 1/2
# ip nexthop del id 1
# ip route replace 2001:db8:10::/64 nhid 10
Error: IPv6 routes can not use an IPv4 nexthop.

Solve this by updating the indication according to the new set of
member nexthops.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:00:51 -07:00
Ido Schimmel
233c63785c ipv4: nexthop: Remove unnecessary rtnl_dereference()
The pointer is not RCU protected, so remove the unnecessary
rtnl_dereference(). This suppresses the following warning:

net/ipv4/nexthop.c:1101:24: error: incompatible types in comparison expression (different address spaces):
net/ipv4/nexthop.c:1101:24:    struct rb_node [noderef] __rcu *
net/ipv4/nexthop.c:1101:24:    struct rb_node *

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:00:51 -07:00
Ido Schimmel
33d80996b8 ipv4: nexthop: Use nla_put_be32() for NHA_GATEWAY
The code correctly uses nla_get_be32() to get the payload of the
attribute, but incorrectly uses nla_put_u32() to add the attribute to
the payload. This results in the following warning:

net/ipv4/nexthop.c:279:59: warning: incorrect type in argument 3 (different base types)
net/ipv4/nexthop.c:279:59:    expected unsigned int [usertype] value
net/ipv4/nexthop.c:279:59:    got restricted __be32 [usertype] ipv4

Suppress the warning by using nla_put_be32().

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:00:51 -07:00
Ido Schimmel
d7d49dc77c ipv4: nexthop: Reduce allocation size of 'struct nh_group'
The struct looks as follows:

struct nh_group {
	struct nh_group		*spare; /* spare group for removals */
	u16			num_nh;
	bool			mpath;
	bool			fdb_nh;
	bool			has_v4;
	struct nh_grp_entry	nh_entries[];
};

But its offset within 'struct nexthop' is also taken into account to
determine the allocation size.

Instead, use struct_size() to allocate only the required number of
bytes.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:00:51 -07:00
Ido Schimmel
7f6f32bb7d ipv4: Silence suspicious RCU usage warning
fib_info_notify_update() is always called with RTNL held, but not from
an RCU read-side critical section. This leads to the following warning
[1] when the FIB table list is traversed with
hlist_for_each_entry_rcu(), but without a proper lockdep expression.

Since modification of the list is protected by RTNL, silence the warning
by adding a lockdep expression which verifies RTNL is held.

[1]
 =============================
 WARNING: suspicious RCU usage
 5.9.0-rc1-custom-14233-g2f26e122d62f #129 Not tainted
 -----------------------------
 net/ipv4/fib_trie.c:2124 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 1 lock held by ip/834:
  #0: ffffffff85a3b6b0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0

 stack backtrace:
 CPU: 0 PID: 834 Comm: ip Not tainted 5.9.0-rc1-custom-14233-g2f26e122d62f #129
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
 Call Trace:
  dump_stack+0x100/0x184
  lockdep_rcu_suspicious+0x143/0x14d
  fib_info_notify_update+0x8d1/0xa60
  __nexthop_replace_notify+0xd2/0x290
  rtm_new_nexthop+0x35e2/0x5946
  rtnetlink_rcv_msg+0x4f7/0xbd0
  netlink_rcv_skb+0x17a/0x480
  rtnetlink_rcv+0x22/0x30
  netlink_unicast+0x5ae/0x890
  netlink_sendmsg+0x98a/0xf40
  ____sys_sendmsg+0x879/0xa00
  ___sys_sendmsg+0x122/0x190
  __sys_sendmsg+0x103/0x1d0
  __x64_sys_sendmsg+0x7d/0xb0
  do_syscall_64+0x32/0x50
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fde28c3be57
 Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51
c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
RSP: 002b:00007ffc09330028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fde28c3be57
RDX: 0000000000000000 RSI: 00007ffc09330090 RDI: 0000000000000003
RBP: 000000005f45f911 R08: 0000000000000001 R09: 00007ffc0933012c
R10: 0000000000000076 R11: 0000000000000246 R12: 0000000000000001
R13: 00007ffc09330290 R14: 00007ffc09330eee R15: 00005610e48ed020

Fixes: 1bff1a0c9bbd ("ipv4: Add function to send route updates")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 15:58:48 -07:00
Miaohe Lin
343d8c6014 net: clean up codestyle for net/ipv4
This is a pure codestyle cleanup patch. Also add a blank line after
declarations as warned by checkpatch.pl.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 06:28:02 -07:00
Miaohe Lin
fdf1923bf9 net: Remove duplicated midx check against 0
Check midx against 0 is always equal to check midx against sk_bound_dev_if
when sk_bound_dev_if is known not equal to 0 in these case.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 06:23:59 -07:00
Miaohe Lin
0ce779a9f5 net: Avoid unnecessary inet_addr_type() call when addr is INADDR_ANY
We can avoid unnecessary inet_addr_type() call by check addr against
INADDR_ANY first.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 06:20:10 -07:00
Miaohe Lin
0316a21116 net: Set ping saddr after we successfully get the ping port
We can defer set ping saddr until we successfully get the ping port. So we
can avoid clear saddr when failed. Since ping_clear_saddr() is not used
anymore now, remove it.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 06:18:13 -07:00
Miaohe Lin
8b4510d76c net: gain ipv4 mtu when mtu is not locked
When mtu is locked, we should not obtain ipv4 mtu as we return immediately
in this case and leave acquired ipv4 mtu unused.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 06:04:39 -07:00
Miaohe Lin
373c15c2e9 net: Use helper macro RT_TOS() in __icmp_send()
Use helper macro RT_TOS() to get tos in __icmp_send().

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 18:12:36 -07:00
Miaohe Lin
7551144978 net: Avoid access icmp_err_convert when icmp code is ICMP_FRAG_NEEDED
There is no need to fetch errno and fatal info from icmp_err_convert when
icmp code is ICMP_FRAG_NEEDED.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 18:11:43 -07:00
Randy Dunlap
2bdcc73c88 net: ipv4: delete repeated words
Drop duplicate words in comments in net/ipv4/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 17:31:20 -07:00
Luke Hsiao
583bbf0624 io_uring: allow tcp ancillary data for __sys_recvmsg_sock()
For TCP tx zero-copy, the kernel notifies the process of completions by
queuing completion notifications on the socket error queue. This patch
allows reading these notifications via recvmsg to support TCP tx
zero-copy.

Ancillary data was originally disallowed due to privilege escalation
via io_uring's offloading of sendmsg() onto a kernel thread with kernel
credentials (https://crbug.com/project-zero/1975). So, we must ensure
that the socket type is one where the ancillary data types that are
delivered on recvmsg are plain data (no file descriptors or values that
are translated based on the identity of the calling process).

This was tested by using io_uring to call recvmsg on the MSG_ERRQUEUE
with tx zero-copy enabled. Before this patch, we received -EINVALID from
this specific code path. After this patch, we could read tcp tx
zero-copy completion notifications from the MSG_ERRQUEUE.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Arjun Roy <arjunroy@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Luke Hsiao <lukehsiao@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:16:06 -07:00
Martin KaFai Lau
267cf9fa43 tcp: bpf: Optionally store mac header in TCP_SAVE_SYN
This patch is adapted from Eric's patch in an earlier discussion [1].

The TCP_SAVE_SYN currently only stores the network header and
tcp header.  This patch allows it to optionally store
the mac header also if the setsockopt's optval is 2.

It requires one more bit for the "save_syn" bit field in tcp_sock.
This patch achieves this by moving the syn_smc bit next to the is_mptcp.
The syn_smc is currently used with the TCP experimental option.  Since
syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
with "IS_ENABLED(CONFIG_MPTCP)".

The mac_hdrlen is also stored in the "struct saved_syn"
to allow a quick offset from the bpf prog if it chooses to start
getting from the network header or the tcp header.

[1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
0813a84156 bpf: tcp: Allow bpf prog to write and parse TCP header option
[ Note: The TCP changes here is mainly to implement the bpf
  pieces into the bpf_skops_*() functions introduced
  in the earlier patches. ]

The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.

The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance.  Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal only use.

For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].

This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options.  It currently supports most of
the TCP packet except RST.

Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can write its own option by calling the new helper
bpf_store_hdr_opt().  The helper will ensure there is no duplicated
option in the header.

By allowing bpf-prog to write any option kind, this gives a lot of
flexibility to the bpf-prog.  Different bpf-prog can write its
own option kind.  It could also allow the bpf-prog to support a
recently standardized option on an older kernel.

Sockops Callback Flags:
──────────────────────
The bpf program will only be called to parse/write tcp header option
if the following newly added callback flags are enabled
in tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG

A few words on the PARSE CB flags.  When the above PARSE CB flags are
turned on, the bpf-prog will be called on packets received
at a sk that has at least reached the ESTABLISHED state.
The parsing of the SYN-SYNACK-ACK will be discussed in the
"3 Way HandShake" section.

The default is off for all of the above new CB flags, i.e. the bpf prog
will not be called to parse or write bpf hdr option.  There are
details comment on these new cb flags in the UAPI bpf.h.

sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end covers the whole
TCP header and its options.  They are read only.

The new bpf_load_hdr_opt() helps to read a particular option "kind"
from the skb_data.

Please refer to the comment in UAPI bpf.h.  It has details
on what skb_data contains under different sock_ops->op.

3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.

* Passive side

When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog.  The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb.  The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
More on this later.  Also, the bpf prog can learn if it is in syncookie
mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).

The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN).  The example in a later patch does it.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
  is not very useful here since the listen sk will be shared
  by many concurrent connection requests.

  Extending bpf_sk_storage support to request_sock will add weight
  to the minisock and it is not necessary better than storing the
  whole ~100 bytes SYN pkt. ]

When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
header and set the RTO of this newly established connection
as an example.

The received ACK (that concludes the 3WHS) will also be available to
the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
It could be useful in syncookie scenario.  More on this later.

There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
start getting from, e.g. starting from TCP header, or from IP[46] header.

The new getsockopt(TCP_BPF_SYN*) will also know where it can get
the SYN's packet from:
  - (a) the just received syn (available when the bpf prog is writing SYNACK)
        and it is the only way to get SYN during syncookie mode.
  or
  - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
        existing CB).

The bpf prog does not need to know where the SYN pkt is coming from.
The getsockopt(TCP_BPF_SYN*) will hide this details.

Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to read a particular header option from the SYN packet.

* Fastopen

Fastopen should work the same as the regular non fastopen case.
This is a test in a later patch.

* Syncookie

For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in ACK.  The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.

* Active side

The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().

* Turn off header CB flags after 3WHS

If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is option that
the kernel cannot handle.

[1]: draft-wang-tcpm-low-latency-opt-00
     https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
331fca4315 bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt()
The bpf prog needs to parse the SYN header to learn what options have
been sent by the peer's bpf-prog before writing its options into SYNACK.
This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
This syn_skb will eventually be made available (as read-only) to the
bpf prog.  This will be the only SYN packet available to the bpf
prog during syncookie.  For other regular cases, the bpf prog can
also use the saved_syn.

When writing options, the bpf prog will first be called to tell the
kernel its required number of bytes.  It is done by the new
bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
When the bpf prog returns, the kernel will know how many bytes are needed
and then update the "*remaining" arg accordingly.  4 byte alignment will
be included in the "*remaining" before this function returns.  The 4 byte
aligned number of bytes will also be stored into the opts->bpf_opt_len.
"bpf_opt_len" is a newly added member to the struct tcp_out_options.

Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
header options.  The bpf prog is only called if it has reserved spaces
before (opts->bpf_opt_len > 0).

The bpf prog is the last one getting a chance to reserve header space
and writing the header option.

These two functions are half implemented to highlight the changes in
TCP stack.  The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
00d211a4ea bpf: tcp: Add bpf_skops_parse_hdr()
The patch adds a function bpf_skops_parse_hdr().
It will call the bpf prog to parse the TCP header received at
a tcp_sock that has at least reached the ESTABLISHED state.

For the packets received during the 3WHS (SYN, SYNACK and ACK),
the received skb will be available to the bpf prog during the callback
in bpf_skops_established() introduced in the previous patch and
in the bpf_skops_write_hdr_opt() that will be added in the
next patch.

Calling bpf prog to parse header is controlled by two new flags in
tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG and
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG.

When BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG is set,
the bpf prog will only be called when there is unknown
option in the TCP header.

When BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG is set,
the bpf prog will be called on all received TCP header.

This function is half implemented to highlight the changes in
TCP stack.  The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190046.2885054-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
72be0fe6ba bpf: tcp: Add bpf_skops_established()
In tcp_init_transfer(), it currently calls the bpf prog to give it a
chance to handle the just "ESTABLISHED" event (e.g. do setsockopt
on the newly established sk).  Right now, it is done by calling the
general purpose tcp_call_bpf().

In the later patch, it also needs to pass the just-received skb which
concludes the 3 way handshake. E.g. the SYNACK received at the active side.
The bpf prog can then learn some specific header options written by the
peer's bpf-prog and potentially do setsockopt on the newly established sk.
Thus, instead of reusing the general purpose tcp_call_bpf(), a new function
bpf_skops_established() is added to allow passing the "skb" to the bpf
prog.  The actual skb passing from bpf_skops_established() to the bpf prog
will happen together in a later patch which has the necessary bpf pieces.

A "skb" arg is also added to tcp_init_transfer() such that
it can then be passed to bpf_skops_established().

Calling the new bpf_skops_established() instead of tcp_call_bpf()
should be a noop in this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190039.2884750-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
7656d68455 tcp: Add saw_unknown to struct tcp_options_received
In a later patch, the bpf prog only wants to be called to handle
a header option if that particular header option cannot be handled by
the kernel.  This unknown option could be written by the peer's bpf-prog.
It could also be a new standard option that the running kernel does not
support it while a bpf-prog can handle it.

This patch adds a "saw_unknown" bit to "struct tcp_options_received"
and it uses an existing one byte hole to do that.  "saw_unknown" will
be set in tcp_parse_options() if it sees an option that the kernel
cannot handle.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190033.2884430-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
ca584ba070 tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt
This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog
to set the min rto of a connection.  It could be used together
with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX).

A later selftest patch will communicate the max delay ack in a
bpf tcp header option and then the receiving side can use
bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
2b8ee4f05d tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt
This change is mostly from an internal patch and adapts it from sysctl
config to the bpf_setsockopt setup.

The bpf_prog can set the max delay ack by using
bpf_setsockopt(TCP_BPF_DELACK_MAX).  This max delay ack can be communicated
to its peer through bpf header option.  The receiving peer can then use
this max delay ack and set a potentially lower rto by using
bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
in the next patch.

Another later selftest patch will also use it like the above to show
how to write and parse bpf tcp header option.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com
2020-08-24 14:34:59 -07:00
Martin KaFai Lau
70a217f197 tcp: Use a struct to represent a saved_syn
The TCP_SAVE_SYN has both the network header and tcp header.
The total length of the saved syn packet is currently stored in
the first 4 bytes (u32) of an array and the actual packet data is
stored after that.

A later patch will add a bpf helper that allows to get the tcp header
alone from the saved syn without the network header.  It will be more
convenient to have a direct offset to a specific header instead of
re-parsing it.  This requires to separately store the network hdrlen.
The total header length (i.e. network + tcp) is still needed for the
current usage in getsockopt.  Although this total length can be obtained
by looking into the tcphdr and then get the (th->doff << 2), this patch
chooses to directly store the tcp hdrlen in the second four bytes of
this newly created "struct saved_syn".  By using a new struct, it can
give a readable name to each individual header length.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com
2020-08-24 14:34:59 -07:00
Nikolay Aleksandrov
eeaac3634e net: nexthop: don't allow empty NHA_GROUP
Currently the nexthop code will use an empty NHA_GROUP attribute, but it
requires at least 1 entry in order to function properly. Otherwise we
end up derefencing null or random pointers all over the place due to not
having any nh_grp_entry members allocated, nexthop code relies on having at
least the first member present. Empty NHA_GROUP doesn't make any sense so
just disallow it.
Also add a WARN_ON for any future users of nexthop_create_group().

 BUG: kernel NULL pointer dereference, address: 0000000000000080
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP
 CPU: 0 PID: 558 Comm: ip Not tainted 5.9.0-rc1+ #93
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
 RIP: 0010:fib_check_nexthop+0x4a/0xaa
 Code: 0f 84 83 00 00 00 48 c7 02 80 03 f7 81 c3 40 80 fe fe 75 12 b8 ea ff ff ff 48 85 d2 74 6b 48 c7 02 40 03 f7 81 c3 48 8b 40 10 <48> 8b 80 80 00 00 00 eb 36 80 78 1a 00 74 12 b8 ea ff ff ff 48 85
 RSP: 0018:ffff88807983ba00 EFLAGS: 00010213
 RAX: 0000000000000000 RBX: ffff88807983bc00 RCX: 0000000000000000
 RDX: ffff88807983bc00 RSI: 0000000000000000 RDI: ffff88807bdd0a80
 RBP: ffff88807983baf8 R08: 0000000000000dc0 R09: 000000000000040a
 R10: 0000000000000000 R11: ffff88807bdd0ae8 R12: 0000000000000000
 R13: 0000000000000000 R14: ffff88807bea3100 R15: 0000000000000001
 FS:  00007f10db393700(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000080 CR3: 000000007bd0f004 CR4: 00000000003706f0
 Call Trace:
  fib_create_info+0x64d/0xaf7
  fib_table_insert+0xf6/0x581
  ? __vma_adjust+0x3b6/0x4d4
  inet_rtm_newroute+0x56/0x70
  rtnetlink_rcv_msg+0x1e3/0x20d
  ? rtnl_calcit.isra.0+0xb8/0xb8
  netlink_rcv_skb+0x5b/0xac
  netlink_unicast+0xfa/0x17b
  netlink_sendmsg+0x334/0x353
  sock_sendmsg_nosec+0xf/0x3f
  ____sys_sendmsg+0x1a0/0x1fc
  ? copy_msghdr_from_user+0x4c/0x61
  ___sys_sendmsg+0x63/0x84
  ? handle_mm_fault+0xa39/0x11b5
  ? sockfd_lookup_light+0x72/0x9a
  __sys_sendmsg+0x50/0x6e
  do_syscall_64+0x54/0xbe
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f10dacc0bb7
 Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 8b 05 9a 4b 2b 00 85 c0 75 2e 48 63 ff 48 63 d2 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 f2 2a 00 f7 d8 64 89 02 48
 RSP: 002b:00007ffcbe628bf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007ffcbe628f80 RCX: 00007f10dacc0bb7
 RDX: 0000000000000000 RSI: 00007ffcbe628c60 RDI: 0000000000000003
 RBP: 000000005f41099c R08: 0000000000000001 R09: 0000000000000008
 R10: 00000000000005e9 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000000 R14: 00007ffcbe628d70 R15: 0000563a86c6e440
 Modules linked in:
 CR2: 0000000000000080

CC: David Ahern <dsahern@gmail.com>
Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
Reported-by: syzbot+a61aa19b0c14c8770bd9@syzkaller.appspotmail.com
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-22 12:39:55 -07:00
Lorenz Bauer
7b219da43f net: sk_msg: Simplify sk_psock initialization
Initializing psock->sk_proto and other saved callbacks is only
done in sk_psock_update_proto, after sk_psock_init has returned.
The logic for this is difficult to follow, and needlessly complex.

Instead, initialize psock->sk_proto whenever we allocate a new
psock. Additionally, assert the following invariants:

* The SK has no ULP: ULP does it's own finagling of sk->sk_prot
* sk_user_data is unused: we need it to store sk_psock

Protect our access to sk_user_data with sk_callback_lock, which
is what other users like reuseport arrays, etc. do.

The result is that an sk_psock is always fully initialized, and
that psock->sk_proto is always the "original" struct proto.
The latter allows us to use psock->sk_proto when initializing
IPv6 TCP / UDP callbacks for sockmap.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200821102948.21918-2-lmb@cloudflare.com
2020-08-21 15:16:11 -07:00
Al Viro
cc44c17baf csum_partial_copy_nocheck(): drop the last argument
It's always 0.  Note that we theoretically could use ~0U as well -
result will be the same modulo 0xffff, _if_ the damn thing did the
right thing for any value of initial sum; later we'll make use of
that when convenient.

However, unlike csum_and_copy_..._user(), there are instances that
did not work for arbitrary initial sums; c6x is one such.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:14 -04:00
Al Viro
3ea7ca80d9 icmp_push_reply(): reorder adding the checksum up
do csum_partial_copy_nocheck() on the first fragment, then
add the rest to it.  Equivalent transformation.

That was the only caller of csum_partial_copy_nocheck() that
might pass it non-zero as the last argument.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:13 -04:00
Al Viro
8d5930dfb7 skb_copy_and_csum_bits(): don't bother with the last argument
it's always 0

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:13 -04:00
Colin Ian King
ad6641189c net: ipv4: remove duplicate "the the" phrase in Kconfig text
The Kconfig help text contains the phrase "the the" in the help
text. Fix this and reformat the block of help text.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 16:02:16 -07:00
Linus Torvalds
a1d21081a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "Some merge window fallout, some longer term fixes:

   1) Handle headroom properly in lapbether and x25_asy drivers, from
      Xie He.

   2) Fetch MAC address from correct r8152 device node, from Thierry
      Reding.

   3) In the sw kTLS path we should allow MSG_CMSG_COMPAT in sendmsg,
      from Rouven Czerwinski.

   4) Correct fdputs in socket layer, from Miaohe Lin.

   5) Revert troublesome sockptr_t optimization, from Christoph Hellwig.

   6) Fix TCP TFO key reading on big endian, from Jason Baron.

   7) Missing CAP_NET_RAW check in nfc, from Qingyu Li.

   8) Fix inet fastreuse optimization with tproxy sockets, from Tim
      Froidcoeur.

   9) Fix 64-bit divide in new SFC driver, from Edward Cree.

  10) Add a tracepoint for prandom_u32 so that we can more easily
      perform usage analysis. From Eric Dumazet.

  11) Fix rwlock imbalance in AF_PACKET, from John Ogness"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (49 commits)
  net: openvswitch: introduce common code for flushing flows
  af_packet: TPACKET_V3: fix fill status rwlock imbalance
  random32: add a tracepoint for prandom_u32()
  Revert "ipv4: tunnel: fix compilation on ARCH=um"
  net: accept an empty mask in /sys/class/net/*/queues/rx-*/rps_cpus
  net: ethernet: stmmac: Disable hardware multicast filter
  net: stmmac: dwmac1000: provide multicast filter fallback
  ipv4: tunnel: fix compilation on ARCH=um
  vsock: fix potential null pointer dereference in vsock_poll()
  sfc: fix ef100 design-param checking
  net: initialize fastreuse on inet_inherit_port
  net: refactor bind_bucket fastreuse into helper
  net: phy: marvell10g: fix null pointer dereference
  net: Fix potential memory leak in proto_register()
  net: qcom/emac: add missed clk_disable_unprepare in error path of emac_clks_phase1_init
  ionic_lif: Use devm_kcalloc() in ionic_qcq_alloc()
  net/nfc/rawsock.c: add CAP_NET_RAW check.
  hinic: fix strncpy output truncated compile warnings
  drivers/net/wan/x25_asy: Added needed_headroom and a skb->len check
  net/tls: Fix kmap usage
  ...
2020-08-13 20:03:11 -07:00
David S. Miller
9643609423 Revert "ipv4: tunnel: fix compilation on ARCH=um"
This reverts commit 06a7a37be55e29961c9ba2abec4d07c8e0e21861.

The bug was already fixed, this added a dup include.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-12 13:26:37 -07:00