65014 Commits

Author SHA1 Message Date
Chuck Lever
e4b52ca013 xprtrdma: Do not recycle MR after FastReg/LocalInv flushes
Better not to touch MRs involved in a flush or post error until the
Send and Receive Queues are drained and the transport is fully
quiescent. Simply don't insert such MRs back onto the free list.
They remain on mr_all and will be released when the connection is
torn down.

I had thought that recycling would prevent hardware resources from
being tied up for a long time. However, since v5.7, a transport
disconnect destroys the QP and other hardware-owned resources. The
MRs get cleaned up nicely at that point.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:25:12 -04:00
Chuck Lever
44438ad9ae xprtrdma: Clarify use of barrier in frwr_wc_localinv_done()
Clean up: The comment and the placement of the memory barrier is
confusing. Humans want to read the function statements from head
to tail.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:24:47 -04:00
Chuck Lever
f912af77e2 xprtrdma: Rename frwr_release_mr()
Clean up: To be consistent with other functions in this source file,
follow the naming convention of putting the object being acted upon
before the action itself.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:24:29 -04:00
Chuck Lever
1363e6388c xprtrdma: rpcrdma_mr_pop() already does list_del_init()
The rpcrdma_mr_pop() earlier in the function has already cleared
out mr_list, so it must not be done again in the error path.

Fixes: 847568942f93 ("xprtrdma: Remove fr_state")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:24:22 -04:00
Chuck Lever
c35ca60d49 xprtrdma: Delete rpcrdma_recv_buffer_put()
Clean up: The name recv_buffer_put() is a vestige of older code,
and the function is just a wrapper for the newer rpcrdma_rep_put().
In most of the existing call sites, a pointer to the owning
rpcrdma_buffer is already available.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:24:12 -04:00
Chuck Lever
35d8b10a25 xprtrdma: Fix cwnd update ordering
After a reconnect, the reply handler is opening the cwnd (and thus
enabling more RPC Calls to be sent) /before/ rpcrdma_post_recvs()
can post enough Receive WRs to receive their replies. This causes an
RNR and the new connection is lost immediately.

The race is most clearly exposed when KASAN and disconnect injection
are enabled. This slows down rpcrdma_rep_create() enough to allow
the send side to post a bunch of RPC Calls before the Receive
completion handler can invoke ib_post_recv().

Fixes: 2ae50ad68cd7 ("xprtrdma: Close window between waking RPC senders and posting Receives")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:24:05 -04:00
Chuck Lever
9e3ca33b62 xprtrdma: Improve locking around rpcrdma_rep creation
Defensive clean up: Protect the rb_all_reps list during rep
creation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:23:40 -04:00
Chuck Lever
8b5292be68 xprtrdma: Improve commentary around rpcrdma_reps_unmap()
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:23:19 -04:00
Chuck Lever
eaf86e8cc8 xprtrdma: Improve locking around rpcrdma_rep destruction
Currently rpcrdma_reps_destroy() assumes that, at transport
tear-down, the content of the rb_free_reps list is the same as the
content of the rb_all_reps list. Although that is usually true,
using the rb_all_reps list should be more reliable because of
the way it's managed. And, rpcrdma_reps_unmap() uses rb_all_reps;
these two functions should both traverse the "all" list.

Ensure that all rpcrdma_reps are always destroyed whether they are
on the rep free list or not.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:22:29 -04:00
Chuck Lever
5030c9a938 xprtrdma: Put flushed Receives on free list instead of destroying them
Defer destruction of an rpcrdma_rep until transport tear-down to
preserve the rb_all_reps list while Receives flush.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:21:49 -04:00
Chuck Lever
15788d1d10 xprtrdma: Do not refresh Receive Queue while it is draining
Currently the Receive completion handler refreshes the Receive Queue
whenever a successful Receive completion occurs.

On disconnect, xprtrdma drains the Receive Queue. The first few
Receive completions after a disconnect are typically successful,
until the first flushed Receive.

This means the Receive completion handler continues to post more
Receive WRs after the drain sentinel has been posted. The late-
posted Receives flush after the drain sentinel has completed,
leading to a crash later in rpcrdma_xprt_disconnect().

To prevent this crash, xprtrdma has to ensure that the Receive
handler stops posting Receives before ib_drain_rq() posts its
drain sentinel.

Suggested-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:21:44 -04:00
Chuck Lever
32e6b68167 xprtrdma: Avoid Receive Queue wrapping
Commit e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)")
increased the number of Receive WRs that are posted by the client,
but did not increase the size of the Receive Queue allocated during
transport set-up.

This is usually not an issue because RPCRDMA_BACKWARD_WRS is defined
as (32) when SUNRPC_BACKCHANNEL is defined. In cases where it isn't,
there is a real risk of Receive Queue wrapping.

Fixes: e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-04-26 09:21:36 -04:00
Pablo Neira Ayuso
a655536571 netfilter: nfnetlink: add struct nfnl_info and pass it to callbacks
Add a new structure to reduce callback footprint and to facilite
extensions of the nfnetlink callback interface in the future.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:58:17 +02:00
Pablo Neira Ayuso
d59d2f82f9 netfilter: nftables: add nft_pernet() helper function
Consolidate call to net_generic(net, nf_tables_net_id) in this
wrapper function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:58:17 +02:00
David S. Miller
1e5e4acb66 mlx5-updates-2021-04-21
devlink external port attribute for SF (Sub-Function) port flavour
 
 This adds the support to instantiate Sub-Functions on external hosts
 E.g when Eswitch manager is enabled on the ARM SmarNic SoC CPU, users
 are now able to spawn new Sub-Functions on the Host server CPU.
 
 Parav Pandit Says:
 ==================
 
 This series introduces and uses external attribute for the SF port to
 indicate that a SF port belongs to an external controller.
 
 This is needed to generate unique phys_port_name when PF and SF numbers
 are overlapping between local and external controllers.
 For example two controllers 0 and 1, both of these controller have a SF.
 having PF number 0, SF number 77. Here, phys_port_name has duplicate
 entry which doesn't have controller number in it.
 
 Hence, add controller number optionally when a SF port is for an
 external controller. This extension is similar to existing PF and VF
 eswitch ports of the external controller.
 
 When a SF is for external controller an example view of external SF
 port and config sequence:
 
 On eswitch system:
 $ devlink dev eswitch set pci/0033:01:00.0 mode switchdev
 
 $ devlink port show
 pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
 pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
   function:
     hw_addr 00:00:00:00:00:00
 
 $ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
 pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
   function:
     hw_addr 00:00:00:00:00:00 state inactive opstate detached
 
 phys_port_name construction:
 $ cat /sys/class/net/eth1/phys_port_name
 c1pf0sf77
 
 Patch summary:
 First 3 patches prepares the eswitch to handle vports in more generic
 way using xarray to lookup vport from its unique vport number.
 Patch-1 returns maximum eswitch ports only when eswitch is enabled
 Patch-2 prepares eswitch to return eswitch max ports from a struct
 Patch-3 uses xarray for vport and representor lookup
 Patch-4 considers SF for an additioanl range of SF vports
 Patch-5 relies on SF hw table to check SF support
 Patch-6 extends SF devlink port attribute for external flag
 Patch-7 stores the per controller SF allocation attributes
 Patch-8 uses SF function id for filtering events
 Patch-9 uses helper for allocation and free
 Patch-10 splits hw table into per controller table and generic one
 Patch-11 extends sf table for additional range
 
 ==================
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmCDz80ACgkQSD+KveBX
 +j4LmggAwS9otYoo639Kmow/wlMZ6yyLsH02zVMFLEJ2AE4VbL73i4iiQ67ZWygL
 yQ8HawPAnythx4RsN/M6/WjSKRpdqTC27C9CpdM78zhXb1vnOrlzba7rYngqmo7N
 5fIkGyjsUGHNqq+15SftK7JYbXFTe1b5RdWawXkQoyBlXTTBamyxD7C5NMpoDots
 /e88Bs8Zy5nVPZqPchIId8TZEKKuO/heTz8ks6q6s/t1MGj7QP+ddxVMgNg00NR5
 OpNTr7YYdpHxpfLSUZgdHaptwwKOx+nou8LdJkIKWPs7SHX6HDggyZJjGBOEWtE7
 qG7oSip4olOTM0w9PZrAewLwSYhq7Q==
 =oqBr
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2021-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2021-04-21

devlink external port attribute for SF (Sub-Function) port flavour

This adds the support to instantiate Sub-Functions on external hosts
E.g when Eswitch manager is enabled on the ARM SmarNic SoC CPU, users
are now able to spawn new Sub-Functions on the Host server CPU.

Parav Pandit Says:
==================

This series introduces and uses external attribute for the SF port to
indicate that a SF port belongs to an external controller.

This is needed to generate unique phys_port_name when PF and SF numbers
are overlapping between local and external controllers.
For example two controllers 0 and 1, both of these controller have a SF.
having PF number 0, SF number 77. Here, phys_port_name has duplicate
entry which doesn't have controller number in it.

Hence, add controller number optionally when a SF port is for an
external controller. This extension is similar to existing PF and VF
eswitch ports of the external controller.

When a SF is for external controller an example view of external SF
port and config sequence:

On eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev

$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

phys_port_name construction:
$ cat /sys/class/net/eth1/phys_port_name
c1pf0sf77

Patch summary:
First 3 patches prepares the eswitch to handle vports in more generic
way using xarray to lookup vport from its unique vport number.
Patch-1 returns maximum eswitch ports only when eswitch is enabled
Patch-2 prepares eswitch to return eswitch max ports from a struct
Patch-3 uses xarray for vport and representor lookup
Patch-4 considers SF for an additioanl range of SF vports
Patch-5 relies on SF hw table to check SF support
Patch-6 extends SF devlink port attribute for external flag
Patch-7 stores the per controller SF allocation attributes
Patch-8 uses SF function id for filtering events
Patch-9 uses helper for allocation and free
Patch-10 splits hw table into per controller table and generic one
Patch-11 extends sf table for additional range

==================

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:31:35 -07:00
Phil Sutter
593268ddf3 netfilter: nf_log_syslog: Unset bridge logger in pernet exit
Without this, a stale pointer remains in pernet loggers after module
unload causing a kernel oops during dereference. Easily reproduced by:

| # modprobe nf_log_syslog
| # rmmod nf_log_syslog
| # cat /proc/net/netfilter/nf_log

Fixes: 77ccee96a6742 ("netfilter: nf_log_bridge: merge with nf_log_syslog")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:47 +02:00
Florian Westphal
ee177a5441 netfilter: ip6_tables: pass table pointer via nf_hook_ops
Same patch as the ip_tables one: removal of all accesses to ip6_tables
xt_table pointers.  After this patch the struct net xt_table anchors
can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:47 +02:00
Florian Westphal
f9006acc8d netfilter: arp_tables: pass table pointer via nf_hook_ops
Same change as previous patch.  Only difference:
no need to handle NULL template_ops parameter, the only caller
(arptable_filter) always passes non-NULL argument.

This removes all remaining accesses to net->ipv4.arptable_filter.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
ae68933422 netfilter: ip_tables: pass table pointer via nf_hook_ops
iptable_x modules rely on 'struct net' to contain a pointer to the
table that should be evaluated.

In order to remove these pointers from struct net, pass them via
the 'priv' pointer in a similar fashion as nf_tables passes the
rule data.

To do that, duplicate the nf_hook_info array passed in from the
iptable_x modules, update the ops->priv pointers of the copy to
refer to the table and then change the hookfn implementations to
just pass the 'priv' argument to the traverser.

After this patch, the xt_table pointers can already be removed
from struct net.

However, changes to struct net result in re-compile of the entire
network stack, so do the removal after arptables and ip6tables
have been converted as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
a4aeafa28c netfilter: xt_nat: pass table to hookfn
This changes how ip(6)table nat passes the ruleset/table to the
evaluation loop.

At the moment, it will fetch the table from struct net.

This change stores the table in the hook_ops 'priv' argument
instead.

This requires to duplicate the hook_ops for each netns, so
they can store the (per-net) xt_table structure.

The dupliated nat hook_ops get stored in net_generic data area.
They are free'd in the namespace exit path.

This is a pre-requisite to remove the xt_table/ruleset pointers
from struct net.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
f68772ed67 netfilter: x_tables: remove paranoia tests
No need for these.
There is only one caller, the xtables core, when the table is registered
for the first time with a particular network namespace.

After ->table_init() call, the table is linked into the tables[af] list,
so next call to that function will skip the ->table_init().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
4d70539919 netfilter: arptables: unregister the tables by name
and again, this time for arptables.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
6c0717545f netfilter: ip6tables: unregister the tables by name
Same as the previous patch, but for ip6tables.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
20a9df3359 netfilter: iptables: unregister the tables by name
xtables stores the xt_table structs in the struct net.  This isn't
needed anymore, the structures could be passed via the netfilter hook
'private' pointer to the hook functions, which would allow us to remove
those pointers from struct net.

As a first step, reduce the number of accesses to the
net->ipv4.ip6table_{raw,filter,...} pointers.
This allows the tables to get unregistered by name instead of having to
pass the raw address.

The xt_table structure cane looked up by name+address family instead.

This patch is useless as-is (the backends still have the raw pointer
address), but it lowers the bar to remove those.

It also allows to put the 'was table registered in the first place' check
into ip_tables.c rather than have it in each table sub module.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:46 +02:00
Florian Westphal
1ef4d6d1af netfilter: x_tables: add xt_find_table
This will be used to obtain the xt_table struct given address family and
table name.

Followup patches will reduce the number of direct accesses to the xt_table
structures via net->ipv{4,6}.ip(6)table_{nat,mangle,...} pointers, then
remove them.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:39 +02:00
Florian Westphal
7716bf090e netfilter: x_tables: remove ipt_unregister_table
Its the same function as ipt_unregister_table_exit.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:08 +02:00
Florian Westphal
4c95e0728e netfilter: ebtables: remove the 3 ebtables pointers from struct net
ebtables stores the table internal data (what gets passed to the
ebt_do_table() interpreter) in struct net.

nftables keeps the internal interpreter format in pernet lists
and passes it via the netfilter core infrastructure (priv pointer).

Do the same for ebtables: the nf_hook_ops are duplicated via kmemdup,
then the ops->priv pointer is set to the table that is being registered.

After that, the netfilter core passes this table info to the hookfn.

This allows to remove the pointers from struct net.

Same pattern can be applied to ip/ip6/arptables.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
Florian Westphal
de8c12110a netfilter: disable defrag once its no longer needed
When I changed defrag hooks to no longer get registered by default I
intentionally made it so that registration can only be un-done by unloading
the nf_defrag_ipv4/6 module.

In hindsight this was too conservative; there is no reason to keep defrag
on while there is no feature dependency anymore.

Moreover, this won't work if user isn't allowed to remove nf_defrag module.

This adds the disable() functions for both ipv4 and ipv6 and calls them
from conntrack, TPROXY and the xtables socket module.

ipvs isn't converted here, it will behave as before this patch and
will need module removal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
Pablo Neira Ayuso
e0bb96db96 netfilter: nft_socket: add support for cgroupsv2
Allow to match on the cgroupsv2 id from ancestor level.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
Florian Westphal
885e8c6824 netfilter: nat: move nf_xfrm_me_harder to where it is used
remove the export and make it static.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-04-26 03:20:07 +02:00
David S. Miller
5f6c2f536d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).

The main changes are:

1) Add BPF static linker support for extern resolution of global, from Andrii.

2) Refine retval for bpf_get_task_stack helper, from Dave.

3) Add a bpf_snprintf helper, from Florent.

4) A bunch of miscellaneous improvements from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:02:32 -07:00
Erik Flodin
e6b031d3c3 can: proc: fix rcvlist_* header alignment on 64-bit system
Before this fix, the function and userdata columns weren't aligned:
  device   can_id   can_mask  function  userdata   matches  ident
   vcan0  92345678  9fffffff  0000000000000000  0000000000000000         0  raw
   vcan0     123    00000123  0000000000000000  0000000000000000         0  raw

After the fix they are:
  device   can_id   can_mask      function          userdata       matches  ident
   vcan0  92345678  9fffffff  0000000000000000  0000000000000000         0  raw
   vcan0     123    00000123  0000000000000000  0000000000000000         0  raw

Link: Link: https://lore.kernel.org/r/20210425141440.229653-1-erik@flodin.me
Signed-off-by: Erik Flodin <erik@flodin.me>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-04-25 19:43:00 +02:00
Linus Torvalds
799bac5512 Revert "net/rds: Avoid potential use after free in rds_send_remove_from_sock"
This reverts commit 0c85a7e87465f2d4cbc768e245f4f45b2f299b05.

The games with 'rm' are on (two separate instances) of a local variable,
and make no difference.

Quoting Aditya Pakki:
 "I was the author of the patch and it was the cause of the giant UMN
  revert.

  The patch is garbage and I was unaware of the steps involved in
  retracting it. I *believed* the maintainers would pull it, given it
  was already under Greg's list. The patch does not introduce any bugs
  but is pointless and is stupid. I accept my incompetence and for not
  requesting a revert earlier."

Link: https://lwn.net/Articles/854319/
Requested-by: Aditya Pakki <pakki001@umn.edu>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-24 09:32:35 -07:00
Parav Pandit
a1ab3e4554 devlink: Extend SF port attributes to have external attribute
Extended SF port attributes to have optional external flag similar to
PCI PF and VF port attributes.

External atttibute is required to generate unique phys_port_name when PF number
and SF number are overlapping between two controllers similar to SR-IOV
VFs.

When a SF is for external controller an example view of external SF
port and config sequence.

On eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev

$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

phys_port_name construction:
$ cat /sys/class/net/eth1/phys_port_name
c1pf0sf77

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24 00:58:53 -07:00
Yonglong Li
ca4fb89257 mptcp: add MSG_PEEK support
This patch adds support for MSG_PEEK flag. Packets are not removed
from the receive_queue if MSG_PEEK set in recv() system call.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 14:06:32 -07:00
Paolo Abeni
987858e5d0 mptcp: ignore unsupported msg flags
Currently mptcp_sendmsg() fails with EOPNOTSUPP if the
user-space provides some unsupported flag. That is unexpected
and may foul existing applications migrated to MPTCP, which
expect a different behavior.

Change the mentioned function to silently ignore the unsupported
flags except MSG_FASTOPEN. This is the only flags currently not
supported by MPTCP with user-space visible side-effects.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/162
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 14:06:32 -07:00
Paolo Abeni
d976092ce1 mptcp: implement MSG_TRUNC support
The mentioned flag is currently silenlty ignored. This
change implements the TCP-like behaviour, dropping the
pending data up to the specified length.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Sigend-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 14:06:32 -07:00
Paolo Abeni
cb9d80f494 mptcp: implement dummy MSG_ERRQUEUE support
mptcp_recvmsg() currently silently ignores MSG_ERRQUEUE, returning
input data instead of error cmsg.

This change provides a dummy implementation for MSG_ERRQUEUE - always
returns no data. That is consistent with the current lack of a suitable
IP_RECVERR setsockopt() support.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 14:06:32 -07:00
Mat Martineau
6477dd39e6 mptcp: Retransmit DATA_FIN
With this change, the MPTCP-level retransmission timer is used to resend
DATA_FIN. The retranmit timer is not stopped while waiting for a
MPTCP-level ACK of DATA_FIN, and retransmitted DATA_FINs are sent on all
subflows. The retry interval starts at TCP_RTO_MIN and then doubles on
each attempt, up to TCP_RTO_MAX.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/146
Fixes: 43b54c6ee382 ("mptcp: Use full MPTCP-level disconnect state machine")
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 14:03:02 -07:00
David S. Miller
7679f864a0 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2021-04-23

1) The SPI flow key in struct flowi has no consumers,
   so remove it. From Florian Westphal.

2) Remove stray synchronize_rcu from xfrm_init.
   From Florian Westphal.

3) Use the new exit_pre hook to reset the netlink socket
   on net namespace destruction. From Florian Westphal.

4) Remove an unnecessary get_cpu() in ipcomp, that
   code is always called with BHs off.
   From Sabrina Dubroca.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 13:37:47 -07:00
Tonghao Zhang
ed744d8193 net: sock: remove the unnecessary check in proto_register
tw_prot_cleanup will check the twsk_prot.

Fixes: 0f5907af3913 ("net: Fix potential memory leak in proto_register()")
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 13:10:03 -07:00
Ilya Maximets
7d742b509d openvswitch: meter: remove rate from the bucket size calculation
Implementation of meters supposed to be a classic token bucket with 2
typical parameters: rate and burst size.

Burst size in this schema is the maximum number of bytes/packets that
could pass without being rate limited.

Recent changes to userspace datapath made meter implementation to be
in line with the kernel one, and this uncovered several issues.

The main problem is that maximum bucket size for unknown reason
accounts not only burst size, but also the numerical value of rate.
This creates a lot of confusion around behavior of meters.

For example, if rate is configured as 1000 pps and burst size set to 1,
this should mean that meter will tolerate bursts of 1 packet at most,
i.e. not a single packet above the rate should pass the meter.
However, current implementation calculates maximum bucket size as
(rate + burst size), so the effective bucket size will be 1001.  This
means that first 1000 packets will not be rate limited and average
rate might be twice as high as the configured rate.  This also makes
it practically impossible to configure meter that will have burst size
lower than the rate, which might be a desirable configuration if the
rate is high.

Inability to configure low values of a burst size and overall inability
for a user to predict what will be a maximum and average rate from the
configured parameters of a meter without looking at the OVS and kernel
code might be also classified as a security issue, because drop meters
are frequently used as a way of protection from DoS attacks.

This change removes rate from the calculation of a bucket size, making
it in line with the classic token bucket algorithm and essentially
making the rate and burst tolerance being predictable from a users'
perspective.

Same change proposed for the userspace implementation.

Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23 13:08:47 -07:00
Yunjian Wang
b9f83ffaa0 SUNRPC: Fix null pointer dereference in svc_rqst_free()
When alloc_pages_node() returns null in svc_rqst_alloc(), the
null rq_scratch_page pointer will be dereferenced when calling
put_page() in svc_rqst_free(). Fix it by adding a null check.

Addresses-Coverity: ("Dereference after null check")
Fixes: 5191955d6fc6 ("SUNRPC: Prepare for xdr_stream-style decoding on the server-side")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-04-23 10:43:05 -04:00
Lin Ma
e2cb6b891a bluetooth: eliminate the potential race condition when removing the HCI controller
There is a possible race condition vulnerability between issuing a HCI
command and removing the cont.  Specifically, functions hci_req_sync()
and hci_dev_do_close() can race each other like below:

thread-A in hci_req_sync()      |   thread-B in hci_dev_do_close()
                                |   hci_req_sync_lock(hdev);
test_bit(HCI_UP, &hdev->flags); |
...                             |   test_and_clear_bit(HCI_UP, &hdev->flags)
hci_req_sync_lock(hdev);        |
                                |
In this commit we alter the sequence in function hci_req_sync(). Hence,
the thread-A cannot issue th.

Signed-off-by: Lin Ma <linma@zju.edu.cn>
Cc: Marcel Holtmann <marcel@holtmann.org>
Fixes: 7c6a329e4447 ("[Bluetooth] Fix regression from using default link policy")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-23 13:25:04 +02:00
David Howells
f105da1a79 afs: Don't truncate iter during data fetch
Don't truncate the iterator to correspond to the actual data size when
fetching the data from the server - rather, pass the length we want to read
to rxrpc.

This will allow the clear-after-read code in future to simply clear the
remaining iterator capacity rather than having to reinitialise the
iterator.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/158861249201.340223.13035445866976590375.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465825061.1377938.14403904452300909320.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588531418.3465195.10712005940763063144.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118148567.1232039.13380313332292947956.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161044610.2537118.17908520793806837792.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340407907.1303470.6501394859511712746.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539551721.286939.14655713136572200716.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653807790.2770958.14034599989374173734.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789090823.6155.15673999934535049102.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:26 +01:00
Li RongQing
e7a1c13008 xsk: Align XDP socket batch size with DPDK
DPDK default burst size is 32, however, kernel xsk sendto
syscall can not handle all 32 at one time, and return with
error.

So make kernel XDP socket batch size larger to avoid
unnecessary syscall fail and context switch which will help
to increase performance.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/1618378752-4191-1-git-send-email-lirongqing@baidu.com
2021-04-23 09:50:35 +02:00
Martin Willi
22b6034323 net, xdp: Update pkt_type if generic XDP changes unicast MAC
If a generic XDP program changes the destination MAC address from/to
multicast/broadcast, the skb->pkt_type is updated to properly handle
the packet when passed up the stack. When changing the MAC from/to
the NICs MAC, PACKET_HOST/OTHERHOST is not updated, though, making
the behavior different from that of native XDP.

Remember the PACKET_HOST/OTHERHOST state before calling the program
in generic XDP, and update pkt_type accordingly if the destination
MAC address has changed. As eth_type_trans() assumes a default
pkt_type of PACKET_HOST, restore that before calling it.

The use case for this is when a XDP program wants to push received
packets up the stack by rewriting the MAC to the NICs MAC, for
example by cluster nodes sharing MAC addresses.

Fixes: 297249569932 ("net: fix generic XDP to handle if eth header was mangled")
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210419141559.8611-1-martin@strongswan.org
2021-04-22 23:18:02 +02:00
Dan Carpenter
cb57908653 SUNRPC: fix ternary sign expansion bug in tracing
This code is supposed to pass negative "err" values for tracing but it
passes positive values instead.  The problem is that the
trace_svcsock_tcp_send() function takes a long but "err" is an int and
"sent" is a u32.  The negative is first type promoted to u32 so it
becomes a high positive then it is promoted to long and it stays
positive.

Fix this by casting "err" directly to long.

Fixes: 998024dee197 ("SUNRPC: Add more svcsock tracepoints")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-04-22 11:02:28 -04:00
Chinmay Agarwal
eefb45eef5 neighbour: Prevent Race condition in neighbour subsytem
Following Race Condition was detected:

<CPU A, t0>: Executing: __netif_receive_skb() ->__netif_receive_skb_core()
-> arp_rcv() -> arp_process().arp_process() calls __neigh_lookup() which
takes a reference on neighbour entry 'n'.
Moves further along, arp_process() and calls neigh_update()->
__neigh_update(). Neighbour entry is unlocked just before a call to
neigh_update_gc_list.

This unlocking paves way for another thread that may take a reference on
the same and mark it dead and remove it from gc_list.

<CPU B, t1> - neigh_flush_dev() is under execution and calls
neigh_mark_dead(n) marking the neighbour entry 'n' as dead. Also n will be
removed from gc_list.
Moves further along neigh_flush_dev() and calls
neigh_cleanup_and_release(n), but since reference count increased in t1,
'n' couldn't be destroyed.

<CPU A, t3>- Code hits neigh_update_gc_list, with neighbour entry
set as dead.

<CPU A, t4> - arp_process() finally calls neigh_release(n), destroying
the neighbour entry and we have a destroyed ntry still part of gc_list.

Fixes: eb4e8fac00d1("neighbour: Prevent a dead entry from updating gc_list")
Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-21 14:47:43 -07:00
Vladimir Oltean
68f5c12abb net: bridge: fix error in br_multicast_add_port when CONFIG_NET_SWITCHDEV=n
When CONFIG_NET_SWITCHDEV is disabled, the shim for switchdev_port_attr_set
inside br_mc_disabled_update returns -EOPNOTSUPP. This is not caught,
and propagated to the caller of br_multicast_add_port, preventing ports
from joining the bridge.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Fixes: ae1ea84b33da ("net: bridge: propagate error code and extack from br_mc_disabled_update")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-21 13:13:33 -07:00