50111 Commits

Author SHA1 Message Date
Kirill Tkhai
989d9812b7 net: Convert sit_net_ops
These pernet_operations are similar to ip6_tnl_net_ops. The exit method
unregisters all of the net's sit devices, and it looks like no other
pernet_operations are interested in a foreign net's sit list.
The init method registers a netdevice. So, it's possible to mark them async.
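
A minimal userland sketch of what "marking async" could look like, assuming
the bool async field on struct pernet_operations that this series relies on.
The structs below are simplified stand-ins, not the kernel definitions, and
all demo_* names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct net. */
struct net {
	int ndevs;
};

/* Simplified stand-in for struct pernet_operations. */
struct pernet_operations {
	int  (*init)(struct net *net);
	void (*exit)(struct net *net);
	bool async;	/* may run in parallel with other pernet ops */
};

/* Init only touches state owned by the net being set up. */
static int demo_init_net(struct net *net)
{
	assert(net != NULL);
	net->ndevs = 1;
	return 0;
}

/* Exit only tears down devices that belong to this net. */
static void demo_exit_net(struct net *net)
{
	net->ndevs = 0;
}

static struct pernet_operations demo_net_ops = {
	.init  = demo_init_net,
	.exit  = demo_exit_net,
	.async = true,
};
```

The flag is only a declaration of intent: the reasoning in the message (no
access to a foreign net's state) is what would make it safe.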

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:38 -05:00
Kirill Tkhai
5ecc29550a net: Convert vti6_net_ops
These pernet_operations are similar to ip6_tnl_net_ops. The exit method
unregisters all of the net's vti6 tunnels, and it looks like no other
pernet_operations are interested in a foreign net's vti6 list.
The init method registers a netdevice. So, it's possible to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:38 -05:00
Kirill Tkhai
66997ba083 net: Convert ip6_tnl_net_ops
These pernet_operations are similar to ip6gre_net_ops. The exit method
unregisters all of the net's ip6_tnl tunnels, and it looks like no other
pernet_operations are interested in a foreign net's tunnels list.
So, it's possible to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:38 -05:00
Kirill Tkhai
5c155c5024 net: Convert ip6gre_net_ops
These pernet_operations are similar to bond_net_ops. The exit method
unregisters all of the net's ip6gre devices, and it looks like no other
pernet_operations are interested in a foreign net's ip6gre list
or net_generic()->tunnels_wc. The init method registers a net device.
So, it's possible to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:38 -05:00
Kirill Tkhai
31502104b3 net: Convert ipgre_net_ops, ipgre_tap_net_ops, erspan_net_ops, vti_net_ops and ipip_net_ops
These pernet_operations are similar to bond_net_ops. The exit methods
unregister all of the net's ipgre/ipgre_tap/erspan/vti/ipip devices, and it
looks like no other pernet_operations are interested in a foreign
net's ipgre/ipgre_tap/erspan/vti/ipip list. The init methods also do not
intersect with anything pernet-specific. So, it's possible
to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:37 -05:00
Kirill Tkhai
3cec5fb347 net: Convert br_net_ops
These pernet_operations are similar to bond_net_ops. The exit method
unregisters all of the net's bridge devices, and it looks like no other
pernet_operations are interested in a foreign net's bridge list.
So, it's possible to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:37 -05:00
Kirill Tkhai
685ecfb198 net: Convert tc_action_net_init() and tc_action_net_exit() based pernet_operations
These pernet_operations are from net/sched directory, and they call only
tc_action_net_init() and tc_action_net_exit():

bpf_net_ops
connmark_net_ops
csum_net_ops
gact_net_ops
ife_net_ops
ipt_net_ops
xt_net_ops
mirred_net_ops
nat_net_ops
pedit_net_ops
police_net_ops
sample_net_ops
simp_net_ops
skbedit_net_ops
skbmod_net_ops
tunnel_key_net_ops
vlan_net_ops

1) tc_action_net_init() just allocates and initializes per-net memory.
2) There should be no in-flight packets at the time of the tc_action_net_exit()
call, and no other pernet_operations should send packets to a dying net (except
via netlink). So, it seems they can be marked as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:36 -05:00
Kirill Tkhai
5fcc85843d net: Convert sysctl creating and destroying pernet_operations
These pernet_operations create and destroy sysctl tables,
and they are able to be executed in parallel with any others:

ip_vs_lblc_ops
ip_vs_lblcr_ops

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:36 -05:00
Kirill Tkhai
02df428ca2 net: Convert simple pernet_operations
These pernet_operations perform pretty simple actions
like variable initialization on init, debug checks
on exit, and so on, and they obviously can
be executed in parallel with any others:

vrf_net_ops
lockd_net_ops
grace_net_ops
xfrm6_tunnel_net_ops
kcm_net_ops
tcf_net_ops

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:35 -05:00
Kirill Tkhai
f0aad8e340 net: Convert synproxy_net_ops
These pernet_operations create and destroy /proc entries
and allocate extents to template ct, which depend on global
nf_ct_ext_types[] array. So, we are able to mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:35 -05:00
Kirill Tkhai
47d63a0179 net: Convert hashlimit_net_ops and recent_net_ops
These pernet_operations just create and destroy /proc entries.
New /proc entries may also appear after new nf rules
are added, but this is not possible when the net isn't alive.
So, they are safe to be marked as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:35 -05:00
Kirill Tkhai
c80afa026a net: Convert /proc creating and destroying pernet_operations
These pernet_operations just create and destroy /proc entries,
and they can safely be marked as async:

pppoe_net_ops
vlan_net_ops
canbcm_pernet_ops
kcm_net_ops
pfkey_net_ops
pppol2tp_net_ops
phonet_net_ops

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-27 11:01:35 -05:00
Alexey Dobriyan
08009a7602 net: make kmem caches as __ro_after_init
None of these kmem caches are reallocated once set up.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-26 15:11:48 -05:00
David S. Miller
f74290fdb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-24 00:04:20 -05:00
Donald Sharp
1b71af6053 net: fib_rules: Add new attribute to set protocol
For ages iproute2 has used `struct rtmsg` as the ancillary header for
FIB rules and in the process set the protocol value to RTPROT_BOOT.
Until cac56209a66 ("net: Allow a rule to track originating protocol")
the kernel rules code ignored the protocol value sent from userspace
and always returned 0 in notifications. To avoid incompatibility with
existing iproute2, send the protocol as a new attribute.

Fixes: cac56209a66 ("net: Allow a rule to track originating protocol")
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-23 15:47:20 -05:00
Eric Dumazet
a5f7add332 net_sched: gen_estimator: fix broken estimators based on percpu stats
pfifo_fast got percpu stats lately, uncovering a bug I introduced last
year in linux-4.10.

I missed the fact that we have to clear our temporary storage
before calling __gnet_stats_copy_basic() in the case of percpu stats.
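
A minimal userland sketch of the pitfall, assuming the copy helper
accumulates per-CPU counters into its destination rather than overwriting
it. The struct and the stats_* functions are hypothetical stand-ins, not the
gen_estimator code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS 4

struct basic_stats {
	uint64_t bytes;
	uint64_t packets;
};

/* Adds every per-CPU counter into *sum; it never resets *sum itself. */
static void stats_add_percpu(struct basic_stats *sum,
			     const struct basic_stats cpu[NR_CPUS])
{
	assert(sum != NULL);
	for (int i = 0; i < NR_CPUS; i++) {
		sum->bytes += cpu[i].bytes;
		sum->packets += cpu[i].packets;
	}
}

/* Reads a consistent total: clear the temporary first, then accumulate. */
static struct basic_stats stats_read(const struct basic_stats cpu[NR_CPUS])
{
	struct basic_stats b;

	memset(&b, 0, sizeof(b));	/* skipping this leaves stale values in the sum */
	stats_add_percpu(&b, cpu);
	return b;
}
```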

Without this fix, rate estimators (tc qd replace dev xxx root est 1sec
4sec pfifo_fast) are utterly broken.

Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-23 12:35:46 -05:00
David S. Miller
2217009443 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2018-02-22

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) two urgent fixes for bpf_tail_call logic for x64 and arm64 JITs, from Daniel.

2) cond_resched points in percpu array alloc/free paths, from Eric.

3) lockdep and other minor fixes, from Yonghong, Arnd, Anders, Li.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-23 12:34:18 -05:00
Sowmini Varadhan
79a5b9727a rds: rds_msg_zcopy should return error of null rm->data.op_mmp_znotifier
If either or both of MSG_ZEROCOPY and SOCK_ZEROCOPY have not been
specified, the rm->data.op_mmp_znotifier allocation will be skipped.
In this case, it is invalid to pass down a cmsghdr with
RDS_CMSG_ZCOPY_COOKIE, so return EINVAL from rds_msg_zcopy for this
case.

Reported-by: syzbot+f893ae7bb2f6456dfbc3@syzkaller.appspotmail.com
Fixes: 0cebaccef3ac ("rds: zerocopy Tx support.")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-23 12:30:52 -05:00
Arnd Bergmann
ca79bec237 ipv6 sit: work around bogus gcc-8 -Wrestrict warning
gcc-8 has a new warning that detects overlapping input and output arguments
in memcpy(). It triggers for sit_init_net() calling ipip6_tunnel_clone_6rd(),
which is actually correct:

net/ipv6/sit.c: In function 'sit_init_net':
net/ipv6/sit.c:192:3: error: 'memcpy' source argument is the same as destination [-Werror=restrict]

The problem here is that the logic detecting the memcpy() arguments finds them
to be the same, but the conditional that tests for the input and output of
ipip6_tunnel_clone_6rd() to be identical is not a compile-time constant.

We know that netdev_priv(t->dev) is the same as t for a tunnel device,
and comparing "dev" directly here lets the compiler figure out as well
that 'dev == sitn->fb_tunnel_dev' when called from sit_init_net(), so
it no longer warns.
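
For illustration only, a userland sketch of the kind of guard involved:
memcpy() must not be given overlapping source and destination, so a clone
helper skips the copy when both pointers name the same object. The struct
and clone_params() are hypothetical and do not reproduce the sit.c code:

```c
#include <assert.h>
#include <string.h>

struct demo_6rd_params {
	unsigned int prefix;
	unsigned int relay_prefix;
};

/* Copy src into dst, skipping the copy when both are the same object. */
static void clone_params(struct demo_6rd_params *dst,
			 const struct demo_6rd_params *src)
{
	assert(dst != NULL && src != NULL);
	if (dst == src)
		return;		/* memcpy(p, p, n) would violate its restrict contract */
	memcpy(dst, src, sizeof(*dst));
}
```

The warning discussed above arises when the compiler can see that both
arguments are identical but cannot see that such a guard is taken.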

This code is old, so Cc stable to make sure that we don't get the warning
for older kernels built with new gcc.

Cc: Martin Sebor <msebor@gmail.com>
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83456
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-23 10:53:26 -05:00
David Howells
93c62c45ed rxrpc: Fix send in rxrpc_send_data_packet()
All the kernel_sendmsg() calls in rxrpc_send_data_packet() need to send
both parts of the iov[] buffer, but one of them does not.  Fix it so that
it does.

Without this, short IPv6 rxrpc DATA packets may be seen that have the rxrpc
header included, but no payload.
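
A small userland sketch of the invariant, assuming a two-element iov[] with
the header in iov[0] and the payload in iov[1]. iov_total_len() is a
hypothetical helper, and the buffer sizes in any caller are arbitrary:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Total bytes a send has to cover when given iovcnt buffer parts. */
static size_t iov_total_len(const struct iovec *iov, int iovcnt)
{
	size_t len = 0;

	assert(iov != NULL && iovcnt >= 0);
	for (int i = 0; i < iovcnt; i++)
		len += iov[i].iov_len;
	return len;
}
```

Passing a count of 1 instead of 2 yields only the header length, which is
the "header but no payload" symptom described above.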

Fixes: 5a924b8951f8 ("rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22 15:37:47 -05:00
David S. Miller
60772e48ec Various updates across wireless.

Merge tag 'mac80211-next-for-davem-2018-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Various updates across wireless.

One thing to note: I've included a new ethertype
that wireless uses (ETH_P_PREAUTH) in if_ether.h.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22 15:18:28 -05:00
David S. Miller
ed04c46d4e Various fixes across the tree, the shortlog basically says it all:

Merge tag 'mac80211-for-davem-2018-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Various fixes across the tree, the shortlog basically says it all:

  cfg80211: fix cfg80211_beacon_dup
  -> old bug in this code

  cfg80211: clear wep keys after disconnection
  -> certain ways of disconnecting left the keys

  mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4
  -> alignment issues with using 14 bytes

  mac80211: Do not disconnect on invalid operating class
  -> if the AP has a bogus operating class, let it be

  mac80211: Fix sending ADDBA response for an ongoing session
  -> don't send the same frame twice

  cfg80211: use only 1Mbps for basic rates in mesh
  -> interop issue with old versions of our code

  mac80211_hwsim: don't use WQ_MEM_RECLAIM
  -> it causes splats because it flushes work on a non-reclaim WQ

  regulatory: add NUL to request alpha2
  -> nla_put_string() issue from Kees

  mac80211: mesh: fix wrong mesh TTL offset calculation
  -> protocol issue

  mac80211: fix a possible leak of station stats
  -> error path might leak memory

  mac80211: fix calling sleeping function in atomic context
  -> percpu allocations need to be made with gfp flags
====================
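
As a worked example of the "round up to multiple of 4" item in the shortlog,
here is a userland sketch of the arithmetic. round_up4() is a hypothetical
helper and not the macro used in mac80211:

```c
#include <assert.h>
#include <limits.h>

/* Round x up to the next multiple of 4 (x itself if already aligned). */
static unsigned int round_up4(unsigned int x)
{
	assert(x <= UINT_MAX - 3u);	/* avoid wrap-around in the addition */
	return (x + 3u) & ~3u;
}
```

With the 14 bytes mentioned in the shortlog this gives 16.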

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22 15:17:01 -05:00
Ilan Peer
94ba92713f mac80211: Call mgd_prep_tx before transmitting deauthentication
In multi channel scenarios, when disassociating from the AP before a
beacon was heard from the AP, it is not guaranteed that the virtual
interface is granted air time for the transmission of the
deauthentication frame. This in turn can lead to various issues as
the AP might never get the deauthentication frame.

To mitigate such possible issues, add a HW flag indicating that the
driver requires mac80211 to call the mgd_prep_tx() driver callback
to make sure that the virtual interface is granted immediate airtime
to be able to transmit the frame, in case that no beacon was heard
from the AP.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-02-22 21:13:04 +01:00
Sara Sharon
a1f2ba04cc mac80211: add get TID helper
Extracting the TID from the QoS header is common enough
to justify a helper.
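
A userland sketch of what such a helper computes, assuming the TID occupies
the low four bits of the first QoS Control byte. The demo_* names are
hypothetical and this is not the helper added by the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_QOS_CTL_TID_MASK 0x0f

/* Extract the TID from a pointer to the 2-byte QoS Control field. */
static uint8_t demo_get_tid(const uint8_t *qos_ctl)
{
	assert(qos_ctl != NULL);
	return qos_ctl[0] & DEMO_QOS_CTL_TID_MASK;
}
```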

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-02-22 21:13:03 +01:00
Johannes Berg
7299d6f7bf mac80211: support reporting A-MPDU EOF bit value/known
Support getting the EOF bit value reported from hardware
and writing it out to radiotap.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-02-22 21:13:02 +01:00
David Ahern
1fe4b1184c net: ipv4: Set addr_type in hash_keys for forwarded case
The result of the skb flow dissect is copied from keys to hash_keys to
ensure only the intended data is hashed. The original L4 hash patch
overlooked setting the addr_type for this case; add it.

Fixes: bf4e0a3db97eb ("net: ipv4: add support for ECMP hash policy choice")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22 14:30:51 -05:00
Eric Dumazet
350c9f484b tcp_bbr: better deal with suboptimal GSO
BBR uses tcp_tso_autosize() in an attempt to probe what the
burst sizes would be, and to adjust cwnd in bbr_target_cwnd() with the
following gold formula:

/* Allow enough full-sized skbs in flight to utilize end systems. */
cwnd += 3 * bbr->tso_segs_goal;

But GSO can be lacking or be constrained to very small
units (ip link set dev ... gso_max_segs 2)

What we really want is to have enough packets in flight so that both
GSO and GRO are efficient.

So in the case GSO is off or downgraded, we still want to have the same
number of packets in flight as if GSO/TSO was fully operational, so
that GRO can hopefully be working efficiently.

To fix this issue, we make tcp_tso_autosize() unaware of
sk->sk_gso_max_segs

Only tcp_tso_segs() has to enforce the gso_max_segs limit.
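
A simplified userland sketch of the split described above: the goal used for
cwnd sizing ignores the device limit, and only the per-skb segment count is
clamped. The demo_* functions are hypothetical and much simpler than the TCP
code:

```c
#include <assert.h>

/* cwnd sizing uses the unclamped goal, as in the formula quoted above. */
static unsigned int demo_target_cwnd(unsigned int cwnd,
				     unsigned int tso_segs_goal)
{
	return cwnd + 3 * tso_segs_goal;
}

/* Only the per-skb segment count honours the device's gso_max_segs. */
static unsigned int demo_tso_segs(unsigned int autosize_segs,
				  unsigned int gso_max_segs)
{
	assert(gso_max_segs > 0);
	return autosize_segs < gso_max_segs ? autosize_segs : gso_max_segs;
}
```

With gso_max_segs set to 2, each skb is limited to 2 segments while the cwnd
target is still computed from the larger autosize goal.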

Tested:

ethtool -K eth0 tso off gso off
tc qd replace dev eth0 root pfifo_fast

Before patch:
for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
    691  (ss -temoi shows cwnd is stuck around 6 )
    667
    651
    631
    517

After patch :
# for f in {1..5}; do ./super_netperf 1 -H lpaa24 -- -K bbr; done
   1733 (ss -temoi shows cwnd is around 386 )
   1778
   1746
   1781
   1718

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22 14:15:23 -05:00
Jason A. Donenfeld
b87b6194be netlink: put module reference if dump start fails
Before, if cb->start() failed, the module reference would never be put,
because cb->cb_running is intentionally false at this point. Users are
generally annoyed by this because they can no longer unload modules that
leak references. Also, it may be possible to tediously wrap a reference
counter back to zero, especially since module.c still uses atomic_inc
instead of refcount_inc.

This patch expands the error path to simply call module_put if
cb->start() fails.
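
A userland sketch of the error path, under the assumption that the reference
is taken before start() runs. The structs and demo_* names are hypothetical
stand-ins, not the netlink code:

```c
#include <assert.h>
#include <stddef.h>

struct demo_module {
	int refcnt;
};

struct demo_cb {
	struct demo_module *module;
	int (*start)(struct demo_cb *cb);
	int running;
};

static void demo_module_get(struct demo_module *m)
{
	m->refcnt++;
}

static void demo_module_put(struct demo_module *m)
{
	assert(m->refcnt > 0);
	m->refcnt--;
}

/* Begin a dump; on start() failure the reference is dropped right here. */
static int demo_dump_start(struct demo_cb *cb)
{
	int ret;

	demo_module_get(cb->module);
	ret = cb->start ? cb->start(cb) : 0;
	if (ret) {
		/* running stays 0, so no later path would put the module */
		demo_module_put(cb->module);
		return ret;
	}
	cb->running = 1;
	return 0;
}

/* A start() callback that always fails, for demonstration. */
static int demo_start_fail(struct demo_cb *cb)
{
	(void)cb;
	return -22;
}
```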

Fixes: 41c87425a1ac ("netlink: do not set cb_running if dump's start() errs")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-22 14:01:38 -05:00
Arnd Bergmann
a7dcdf6ea1 bpf: clean up unused-variable warning
The only user of this variable is inside of an #ifdef, causing
a warning without CONFIG_INET:

net/core/filter.c: In function '____bpf_sock_ops_cb_flags_set':
net/core/filter.c:3382:6: error: unused variable 'val' [-Werror=unused-variable]
  int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;

This replaces the #ifdef with a nicer IS_ENABLED() check that
makes the code more readable and avoids the warning.
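
A userland sketch of why a plain conditional avoids the warning: the variable
stays referenced in code the compiler parses even when the feature is off.
DEMO_CONFIG_INET and the other demo_* names are hypothetical stand-ins for
the Kconfig machinery:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_CONFIG_INET  0	/* pretend the option is disabled */
#define DEMO_ALL_CB_FLAGS 0x7

/* Store the supported flags and return the unsupported ones. */
static int demo_cb_flags_set(int *stored, int argval)
{
	int val = argval & DEMO_ALL_CB_FLAGS;

	assert(stored != NULL);
	if (!DEMO_CONFIG_INET)
		return -EINVAL;	/* 'val' is still used below, so no warning */

	if (val)
		*stored = val;
	return argval & ~DEMO_ALL_CB_FLAGS;
}
```

The disabled branch is still type-checked and then discarded by the
optimizer, which is the property the message attributes to IS_ENABLED().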

Fixes: b13d88072172 ("bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-02-22 01:11:20 +01:00
Donald Sharp
cac56209a6 net: Allow a rule to track originating protocol
Allow a rule that is being added/deleted/modified or
dumped to contain the originating protocol's id.

The protocol is handled just like a route's originating
protocol is.  This is especially useful because there
is starting to be a plethora of different user space
programs adding rules.

Allow the vrf device to specify that the kernel is the originator
of the rule created for this device.

Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 17:49:24 -05:00
David S. Miller
943a0d4a9b Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains large batch with Netfilter fixes for
your net tree, mostly due to syzbot report fixups and pr_err()
ratelimiting, more specifically, they are:

1) Get rid of superfluous check in x_tables before vmalloc(),
   we don't hit BUG there anymore, patch from Michal Hocko, suggested by
   Andrew Morton.

2) Race condition in proc file creation in ipt_CLUSTERIP, from Cong Wang.

3) Drop socket lock that results in circular locking dependency, patch
   from Paolo Abeni.

4) Drop packet in case of malformed blob that makes backpointer jump
   in x_tables, from Florian Westphal.

5) Fix refcount leak due to race in ipt_CLUSTERIP in
   clusterip_config_find_get(), from Cong Wang.

6) Several patches to ratelimit pr_err() for x_tables since this can be
   a problem where CAP_NET_ADMIN semantics can protect us in untrusted
   namespace, from Florian Westphal.

7) Missing .gitignore update for new autogenerated asn1 state machine
   for the SNMP NAT helper, from Zhu Lingshan.

8) Missing timer initialization in xt_LED, from Paolo Abeni.

9) Do not allow negative port range in NAT, also from Paolo.

10) Lock imbalance in the xt_hashlimit rate match mode, patch from
    Eric Dumazet.

11) Initialize workqueue before timer in the idletimer match,
    from Eric Dumazet.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 14:49:55 -05:00
Eric Dumazet
98be9b1209 tcp: remove dead code after CHECKSUM_PARTIAL adoption
Since all skbs in write/rtx queues have CHECKSUM_PARTIAL,
we can remove dead code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 14:24:14 -05:00
Eric Dumazet
4a64fd6ccf tcp: remove dead code from tcp_set_skb_tso_segs()
We no longer have skbs with skb->ip_summed == CHECKSUM_NONE
in TCP write queues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 14:24:14 -05:00
Eric Dumazet
65ec60973a tcp: tcp_sendmsg() only deals with CHECKSUM_PARTIAL
We no longer have skbs with skb->ip_summed == CHECKSUM_NONE
in TCP write queues.

We can remove dead code in tcp_sendmsg().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 14:24:14 -05:00
Eric Dumazet
dead7cdb0d tcp: remove sk_check_csum_caps()
Since TCP relies on GSO, we do not need this helper anymore.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 14:24:14 -05:00
Eric Dumazet
74d4a8f8d3 tcp: remove sk_can_gso() use
After previous commit, sk_can_gso() is always true.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 14:24:14 -05:00
Eric Dumazet
0a6b2a1dc2 tcp: switch to GSO being always on
Oleksandr Natalenko reported performance issues with BBR without FQ
packet scheduler that were root caused to lack of SG and GSO/TSO on
his configuration.

In this mode, TCP internal pacing has to setup a high resolution timer
for each MSS sent.

We could implement in TCP a strategy similar to the one adopted
in commit fefa569a9d4b ("net_sched: sch_fq: account for schedule/timers drifts")
or decide to finally switch TCP stack to a GSO only mode.

This has many benefits :

1) Most TCP developments are done with TSO in mind.
2) Fewer high-resolution timers need to be armed for TCP pacing
3) GSO can benefit from the xmit_more hint
4) Receiver GRO is more effective (as if TSO was used for real on sender)
   -> Lower ACK traffic
5) Write queues have less overhead (one skb holds about 64KB of payload)
6) SACK coalescing just works.
7) rtx rb-tree contains fewer packets, SACK is cheaper.

This patch implements the minimum patch, but we can remove some legacy
code as follow ups.

Tested:

On 40Gbit link, one netperf -t TCP_STREAM

BBR+fq:
sg on:  26 Gbits/sec
sg off: 15.7 Gbits/sec   (was 2.3 Gbit before patch)

BBR+pfifo_fast:
sg on:  24.2 Gbits/sec
sg off: 14.9 Gbits/sec  (was 0.66 Gbit before patch !!! )

BBR+fq_codel:
sg on:  24.4 Gbits/sec
sg off: 15 Gbits/sec  (was 0.66 Gbit before patch !!! )

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 14:24:13 -05:00
Gustavo A. R. Silva
f905311356 rds: send: mark expected switch fall-through in rds_rm_size
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 1465362 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by:  Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 14:18:18 -05:00
Eyal Birger
ccc007e4a7 net: sched: add em_ipt ematch for calling xtables matches
This commit adds a new tc ematch for using netfilter xtables matches.

This allows early classification as well as mirroring/redirecting traffic
based on logic implemented in netfilter extensions.

The currently supported use case is classification based on the incoming IPsec
state used during decapsulation, using the 'policy' iptables extension
(xt_policy).

The module dynamically fetches the netfilter match module and calls
it using a fake xt_action_param structure based on validated userspace
provided parameters.

As the xt_policy match does not access skb->data, no skb modifications
are needed on match.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21 13:15:33 -05:00
Roman Kapl
5ae437ad5a net: sched: report if filter is too large to dump
So far, if the filter was too large to fit in the allocated skb, the
kernel did not return any error and stopped dumping. Modify the dumper
so that it returns -EMSGSIZE when a filter fails to dump and it is the
first filter in the skb. If we are not first, we will get a next chance
with more room.
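
A userland sketch of the rule described above, with filters reduced to their
encoded sizes. dump_filters() is a hypothetical model, not the tc dumper:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Returns the number of filters that fit in a buffer of 'room' bytes, or
 * -EMSGSIZE if even the first filter does not fit.
 */
static int dump_filters(const size_t *sizes, int n, size_t room)
{
	int dumped = 0;

	assert(sizes != NULL || n == 0);
	for (int i = 0; i < n; i++) {
		if (sizes[i] > room) {
			if (dumped == 0)
				return -EMSGSIZE;	/* no progress is possible */
			break;	/* retried later with a fresh buffer */
		}
		room -= sizes[i];
		dumped++;
	}
	return dumped;
}
```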

I understand this is pretty near to being an API change, but the
original design (silent truncation) can be considered a bug.

Note: The error case can happen pretty easily if you create a filter
with 32 actions and have 4kb pages. Also recent versions of iproute try
to be clever with their buffer allocation size, which in turn leads to

Signed-off-by: Roman Kapl <code@rkapl.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-20 21:57:17 -05:00
Antonio Cardace
3fef2b6290 x25: use %*ph to print small buffer
Use %*ph format to print small buffer as hex string.
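
%*ph is a kernel printk extension with no userland printf equivalent. As a
rough illustration of the output shape (space-separated hex bytes), here is a
userland imitation; format_hex() is a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write buf as "xx xx xx" into out; returns the number of characters written. */
static size_t format_hex(char *out, size_t outlen,
			 const unsigned char *buf, size_t n)
{
	size_t pos = 0;

	assert(out != NULL && outlen > 0);
	out[0] = '\0';
	/* each byte needs at most a space, two hex digits and the NUL */
	for (size_t i = 0; i < n && pos + 4 <= outlen; i++) {
		if (i)
			out[pos++] = ' ';
		pos += (size_t)snprintf(out + pos, outlen - pos, "%02x",
					(unsigned int)buf[i]);
	}
	return pos;
}
```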

Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Antonio Cardace <anto.cardace@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-20 13:51:47 -05:00
Arkadi Sharshevsky
cc944ead83 devlink: Move size validation to core
Currently the size validation is done via a cb, which is unneeded. The
size validation can be moved to core. The next patch will perform cleanup.

Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-20 13:38:53 -05:00
Kirill Tkhai
8349efd903 net: Queue net_cleanup_work only if there is first net added
When llist_add() returns false, cleanup_net() hasn't made its
llist_del_all(), while the work has already been scheduled
by the first queuer. So, we may skip queue_work() in this case.
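
A userland sketch of the idea using C11 atomics: a lock-free push reports
whether the list was empty, and only that first pusher schedules the work.
The types and function names are hypothetical; the kernel's llist
implementation differs in detail:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct lnode {
	struct lnode *next;
};

struct lhead {
	_Atomic(struct lnode *) first;
};

/* Push n onto h; return true if the list was empty before the push. */
static bool lpush(struct lhead *h, struct lnode *n)
{
	struct lnode *first = atomic_load(&h->first);

	assert(n != NULL);
	do {
		n->next = first;	/* 'first' is refreshed on CAS failure */
	} while (!atomic_compare_exchange_weak(&h->first, &first, n));
	return first == NULL;
}

static int queued_works;	/* stands in for queue_work() calls */

/* Queue a net for cleanup; schedule the worker only for the first entry. */
static void queue_cleanup(struct lhead *h, struct lnode *n)
{
	if (lpush(h, n))
		queued_works++;
}
```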

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-20 13:23:49 -05:00
Kirill Tkhai
65b7b5b90f net: Make cleanup_list and net::cleanup_list of llist type
This simplifies cleanup queueing and makes the cleanup lists
use llist primitives. Since llist has its own cmpxchg()
ordering, cleanup_list_lock is no longer needed.

Also, struct llist_node is smaller than struct list_head,
so we save some bytes in struct net with this patch.
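
The size difference can be checked with simplified userland copies of the
two node types (one pointer versus two). These definitions mirror the usual
shapes but are stand-ins, and bytes_saved_per_node() is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

struct llist_node {
	struct llist_node *next;
};

static_assert(sizeof(struct llist_node) < sizeof(struct list_head),
	      "singly linked node should be smaller");

/* Bytes saved for each embedded node that switches types. */
static size_t bytes_saved_per_node(void)
{
	return sizeof(struct list_head) - sizeof(struct llist_node);
}
```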

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-20 13:23:27 -05:00
Kirill Tkhai
19efbd93e6 net: Kill net_mutex
We take net_mutex when there are !async pernet_operations
registered, and read locking of net_sem is not enough. But
we can get rid of taking the mutex, and just change the logic
to write lock net_sem in such cases. This obviously reduces
the number of lock operations we do.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-20 13:23:13 -05:00
David S. Miller
f5c0c6f429 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-02-19 18:46:11 -05:00
Linus Torvalds
79c0ef3e85 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Prevent index integer overflow in ptr_ring, from Jason Wang.

 2) Program mvpp2 multicast filter properly, from Mikulas Patocka.

 3) The bridge brport attribute file is write only and doesn't have a
    ->show() method, don't blindly invoke it. From Xin Long.

 4) Inverted mask used in genphy_setup_forced(), from Ingo van Lil.

 5) Fix multiple definition issue with if_ether.h UAPI header, from
    Hauke Mehrtens.

 6) Fix GFP_KERNEL usage in atomic in RDS protocol code, from Sowmini
    Varadhan.

 7) Revert XDP redirect support from thunderx driver, it is not
    implemented properly. From Jesper Dangaard Brouer.

 8) Fix missing RTNL protection across some tipc operations, from Ying
    Xue.

 9) Return the correct IV bytes in the TLS getsockopt code, from Boris
    Pismenny.

10) Take tclassid into consideration properly when doing FIB rule
    matching. From Stefano Brivio.

11) cxgb4 device needs more PCI VPD quirks, from Casey Leedom.

12) TUN driver doesn't align frags properly, and we can end up doing
    unaligned atomics on misaligned metadata. From Eric Dumazet.

13) Fix various crashes found using DEBUG_PREEMPT in rmnet driver, from
    Subash Abhinov Kasiviswanathan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (56 commits)
  tg3: APE heartbeat changes
  mlxsw: spectrum_router: Do not unconditionally clear route offload indication
  net: qualcomm: rmnet: Fix possible null dereference in command processing
  net: qualcomm: rmnet: Fix warning seen with 64 bit stats
  net: qualcomm: rmnet: Fix crash on real dev unregistration
  sctp: remove the left unnecessary check for chunk in sctp_renege_events
  rxrpc: Work around usercopy check
  tun: fix tun_napi_alloc_frags() frag allocator
  udplite: fix partial checksum initialization
  skbuff: Fix comment mis-spelling.
  dn_getsockoptdecnet: move nf_{get/set}sockopt outside sock lock
  PCI/cxgb4: Extend T3 PCI quirk to T4+ devices
  cxgb4: fix trailing zero in CIM LA dump
  cxgb4: free up resources of pf 0-3
  fib_semantics: Don't match route with mismatching tclassid
  NFC: llcp: Limit size of SDP URI
  tls: getsockopt return record sequence number
  tls: reset the crypto info if copy_from_user fails
  tls: retrun the correct IV in getsockopt
  docs: segmentation-offloads.txt: add SCTP info
  ...
2018-02-19 11:58:19 -08:00
Paolo Abeni
26736a08ee tipc: don't call sock_release() in atomic context
syzbot reported a scheduling while atomic issue at netns
destruction time:

BUG: sleeping function called from invalid context at net/core/sock.c:2769
in_atomic(): 1, irqs_disabled(): 0, pid: 85, name: kworker/u4:3
5 locks held by kworker/u4:3/85:
  #0:  ((wq_completion)"%s""netns"){+.+.}, at: [<00000000c9792deb>]
process_one_work+0xaaf/0x1af0 kernel/workqueue.c:2084
  #1:  (net_cleanup_work){+.+.}, at: [<00000000adc12e2a>]
process_one_work+0xb01/0x1af0 kernel/workqueue.c:2088
  #2:  (net_sem){++++}, at: [<000000009ccb5669>] cleanup_net+0x23f/0xd20
net/core/net_namespace.c:494
  #3:  (net_mutex){+.+.}, at: [<00000000a92767d9>] cleanup_net+0xa7d/0xd20
net/core/net_namespace.c:496
  #4:  (&(&srv->idr_lock)->rlock){+...}, at: [<000000001343e568>]
spin_lock_bh include/linux/spinlock.h:315 [inline]
  #4:  (&(&srv->idr_lock)->rlock){+...}, at: [<000000001343e568>]
tipc_topsrv_stop+0x231/0x610 net/tipc/topsrv.c:685
CPU: 0 PID: 85 Comm: kworker/u4:3 Not tainted 4.16.0-rc1+ #230
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Workqueue: netns cleanup_net
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x257 lib/dump_stack.c:53
  ___might_sleep+0x2b2/0x470 kernel/sched/core.c:6128
  __might_sleep+0x95/0x190 kernel/sched/core.c:6081
  lock_sock_nested+0x37/0x110 net/core/sock.c:2769
  lock_sock include/net/sock.h:1463 [inline]
  tipc_release+0x103/0xff0 net/tipc/socket.c:572
  sock_release+0x8d/0x1e0 net/socket.c:594
  tipc_topsrv_stop+0x3c0/0x610 net/tipc/topsrv.c:696
  tipc_exit_net+0x15/0x40 net/tipc/core.c:96
  ops_exit_list.isra.6+0xae/0x150 net/core/net_namespace.c:148
  cleanup_net+0x6ba/0xd20 net/core/net_namespace.c:529
  process_one_work+0xbbf/0x1af0 kernel/workqueue.c:2113
  worker_thread+0x223/0x1990 kernel/workqueue.c:2247
  kthread+0x33c/0x400 kernel/kthread.c:238
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:429

This is caused by tipc_topsrv_stop() releasing the listener socket
with the idr lock held. This changeset addresses the issue by moving
the release operation outside that lock.
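
A userland sketch of the pattern: detach the pointer while holding the lock,
then do the possibly sleeping release after unlocking. The lock here is just
a depth counter and every demo_* name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

struct demo_server {
	int lock_depth;		/* stands in for the idr spinlock */
	void *listener;
};

static int released_under_lock;	/* counts releases done with the lock held */

static void demo_lock(struct demo_server *s)
{
	s->lock_depth++;
}

static void demo_unlock(struct demo_server *s)
{
	assert(s->lock_depth > 0);
	s->lock_depth--;
}

/* Stand-in for a release that may sleep. */
static void demo_release(struct demo_server *s, void *sock)
{
	(void)sock;
	if (s->lock_depth)
		released_under_lock++;
}

static void demo_server_stop(struct demo_server *s)
{
	void *lsock;

	demo_lock(s);
	lsock = s->listener;	/* detach under the lock */
	s->listener = NULL;
	demo_unlock(s);

	if (lsock)
		demo_release(s, lsock);	/* safe: no lock held here */
}
```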

Reported-and-tested-by: syzbot+749d9d87c294c00ca856@syzkaller.appspotmail.com
Fixes: 0ef897be12b8 ("tipc: separate topology server listener socket from subcsriber sockets")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by:  ///jon
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:38:50 -05:00
Jon Maloy
96c252bf1c tipc: fix bug on error path in tipc_topsrv_kern_subscr()
In commit cc1ea9ffadf7 ("tipc: eliminate struct tipc_subscriber") we
re-introduced an old bug on the error path in the function
tipc_topsrv_kern_subscr(). We now re-introduce the correction too.

Reported-by: syzbot+f62e0f2a0ef578703946@syzkaller.appspotmail.com
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:38:08 -05:00
Kirill Tkhai
da349fad80 net: Convert iptable_filter_net_ops
These pernet_operations register and unregister the
net::ipv4.iptable_filter table. Since there are
no packets in flight while the exit method
is running, iptables rules should not be touched.
Also, pernet_operations should not send ipv4
packets to each other. So, it's safe to mark them
async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:12 -05:00