67922 Commits

Author SHA1 Message Date
Michael S. Tsirkin
d9679d0013 virtio: wrap config->reset calls
This will enable cleanups down the road.
The idea is to disable cbs, then add "flush_queued_cbs" callback
as a parameter, this way drivers can flush any work
queued after callbacks have been disabled.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20211013105226.20225-1-mst@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-01-14 18:50:52 -05:00
Chuck Lever
c0f26167dd xprtrdma: Remove definitions of RPCDBG_FACILITY
Deprecated. dprintk is no longer used in xprtrdma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-14 10:35:08 -05:00
Chuck Lever
c03061e7a2 xprtrdma: Remove final dprintk call sites from xprtrdma
Deprecated. This information is available via tracepoints.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-14 10:33:31 -05:00
Linus Torvalds
3bad80dab9 Char/Misc and other driver changes for 5.17-rc1
Here is the large set of char, misc, and other "small" driver subsystem
 changes for 5.17-rc1.
 
 Lots of different things are in here for char/misc drivers such as:
 	- habanalabs driver updates
 	- mei driver updates
 	- lkdtm driver updates
 	- vmw_vmci driver updates
 	- android binder driver updates
 	- other small char/misc driver updates
 
 Also smaller driver subsystems have also been updated, including:
 	- fpga subsystem updates
 	- iio subsystem updates
 	- soundwire subsystem updates
 	- extcon subsystem updates
 	- gnss subsystem updates
 	- phy subsystem updates
 	- coresight subsystem updates
 	- firmware subsystem updates
 	- comedi subsystem updates
 	- mhi subsystem updates
 	- speakup subsystem updates
 	- rapidio subsystem updates
 	- spmi subsystem updates
 	- virtual driver updates
 	- counter subsystem updates
 
 Too many individual changes to summarize, the shortlog contains the full
 details.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYeGNAQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymoVgCg1CPjMu8/SDj3Sm3a1UMQJn9jnl8AnjQcEp3z
 hMr9mISG4r6g4PvjrJBj
 =9May
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc and other driver updates from Greg KH:
 "Here is the large set of char, misc, and other "small" driver
  subsystem changes for 5.17-rc1.

  Lots of different things are in here for char/misc drivers such as:

   - habanalabs driver updates

   - mei driver updates

   - lkdtm driver updates

   - vmw_vmci driver updates

   - android binder driver updates

   - other small char/misc driver updates

  Also smaller driver subsystems have also been updated, including:

   - fpga subsystem updates

   - iio subsystem updates

   - soundwire subsystem updates

   - extcon subsystem updates

   - gnss subsystem updates

   - phy subsystem updates

   - coresight subsystem updates

   - firmware subsystem updates

   - comedi subsystem updates

   - mhi subsystem updates

   - speakup subsystem updates

   - rapidio subsystem updates

   - spmi subsystem updates

   - virtual driver updates

   - counter subsystem updates

  Too many individual changes to summarize, the shortlog contains the
  full details.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (406 commits)
  counter: 104-quad-8: Fix use-after-free by quad8_irq_handler
  dt-bindings: mux: Document mux-states property
  dt-bindings: ti-serdes-mux: Add defines for J721S2 SoC
  counter: remove old and now unused registration API
  counter: ti-eqep: Convert to new counter registration
  counter: stm32-lptimer-cnt: Convert to new counter registration
  counter: stm32-timer-cnt: Convert to new counter registration
  counter: microchip-tcb-capture: Convert to new counter registration
  counter: ftm-quaddec: Convert to new counter registration
  counter: intel-qep: Convert to new counter registration
  counter: interrupt-cnt: Convert to new counter registration
  counter: 104-quad-8: Convert to new counter registration
  counter: Update documentation for new counter registration functions
  counter: Provide alternative counter registration functions
  counter: stm32-timer-cnt: Convert to counter_priv() wrapper
  counter: stm32-lptimer-cnt: Convert to counter_priv() wrapper
  counter: ti-eqep: Convert to counter_priv() wrapper
  counter: ftm-quaddec: Convert to counter_priv() wrapper
  counter: intel-qep: Convert to counter_priv() wrapper
  counter: microchip-tcb-capture: Convert to counter_priv() wrapper
  ...
2022-01-14 16:02:28 +01:00
Kevin Bracey
fb80445c43 net_sched: restore "mpu xxx" handling
commit 56b765b79e9a ("htb: improved accuracy at high rates") broke
"overhead X", "linklayer atm" and "mpu X" attributes.

"overhead X" and "linklayer atm" have already been fixed. This restores
the "mpu X" handling, as might be used by DOCSIS or Ethernet shaping:

    tc class add ... htb rate X overhead 4 mpu 64

The code being fixed is used by htb, tbf and act_police. Cake has its
own mpu handling. qdisc_calculate_pkt_len still uses the size table
containing values adjusted for mpu by user space.

iproute2 tc has always passed mpu into the kernel via a tc_ratespec
structure, but the kernel never directly acted on it, merely stored it
so that it could be read back by `tc class show`.

Rather, tc would generate length-to-time tables that included the mpu
(and linklayer) in their construction, and the kernel used those tables.

Since v3.7, the tables were no longer used. Along with "mpu", this also
broke "overhead" and "linklayer" which were fixed in 01cb71d2d47b
("net_sched: restore "overhead xxx" handling", v3.10) and 8a8e3d84b171
("net_sched: restore "linklayer atm" handling", v3.11).

"overhead" was fixed by simply restoring use of tc_ratespec::overhead -
this had originally been used by the kernel but was initially omitted
from the new non-table-based calculations.

"linklayer" had been handled in the table like "mpu", but the mode was
not originally passed in tc_ratespec. The new implementation was made to
handle it by getting new versions of tc to pass the mode in an extended
tc_ratespec, and for older versions of tc the table contents were analysed
at load time to deduce linklayer.

As "mpu" has always been given to the kernel in tc_ratespec,
accompanying the mpu-based table, we can restore system functionality
with no userspace change by making the kernel act on the tc_ratespec
value.

Fixes: 56b765b79e9a ("htb: improved accuracy at high rates")
Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Vimalkumar <j.vimal@gmail.com>
Link: https://lore.kernel.org/r/20220112170210.1014351-1-kevin@bracey.fi
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-13 11:06:42 -08:00
Anna Schumaker
1a48db3fef sunrpc: Fix potential race conditions in rpc_sysfs_xprt_state_change()
We need to use test_and_set_bit() when changing xprt state flags to
avoid potentially getting xps->xps_nactive out of sync.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-13 09:36:58 -05:00
Xiyu Yang
776d794f28 net/sunrpc: fix reference count leaks in rpc_sysfs_xprt_state_change
The refcount leak issues take place in an error handling path. When the
3rd argument buf doesn't match with "offline", "online" or "remove", the
function simply returns -EINVAL and forgets to decrease the reference
count of a rpc_xprt object and a rpc_xprt_switch object increased by
rpc_sysfs_xprt_kobj_get_xprt() and
rpc_sysfs_xprt_kobj_get_xprt_switch(), causing reference count leaks of
both unused objects.

Fix this issue by jumping to the error handling path labelled with
out_put when buf matches none of "offline", "online" or "remove".

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-13 09:36:58 -05:00
Olga Kornievskaia
b8a09619a5 SUNRPC allow for unspecified transport time in rpc_clnt_add_xprt
If the supplied argument doesn't specify the transport type, use the
type of the existing rpc clnt and its existing transport.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-13 09:36:58 -05:00
Wen Gu
20c9398d33 net/smc: Resolve the race between SMC-R link access and clear
We encountered some crashes caused by the race between SMC-R
link access and link clear that triggered by abnormal link
group termination, such as port error.

Here is an example of this kind of crashes:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 Workqueue: smc_hs_wq smc_listen_work [smc]
 RIP: 0010:smc_llc_flow_initiate+0x44/0x190 [smc]
 Call Trace:
  <TASK>
  ? __smc_buf_create+0x75a/0x950 [smc]
  smcr_lgr_reg_rmbs+0x2a/0xbf [smc]
  smc_listen_work+0xf72/0x1230 [smc]
  ? process_one_work+0x25c/0x600
  process_one_work+0x25c/0x600
  worker_thread+0x4f/0x3a0
  ? process_one_work+0x600/0x600
  kthread+0x15d/0x1a0
  ? set_kthread_struct+0x40/0x40
  ret_from_fork+0x1f/0x30
  </TASK>

smc_listen_work()                     __smc_lgr_terminate()
---------------------------------------------------------------
                                    | smc_lgr_free()
                                    |  |- smcr_link_clear()
                                    |      |- memset(lnk, 0)
smc_listen_rdma_reg()               |
 |- smcr_lgr_reg_rmbs()             |
     |- smc_llc_flow_initiate()     |
         |- access lnk->lgr (panic) |

These crashes are similarly caused by clearing SMC-R link
resources when some functions is still accessing to them.
This patch tries to fix the issue by introducing reference
count of SMC-R links and ensuring that the sensitive resources
of links won't be cleared until reference count reaches zero.

The operation to the SMC-R link reference count can be concluded
as follows:

object          [hold or initialized as 1]         [put]
--------------------------------------------------------------------
links           smcr_link_init()                   smcr_link_clear()
connections     smc_conn_create()                  smc_conn_free()

Through this way, the clear of SMC-R links is later than the
free of all the smc connections above it, thus avoiding the
unsafe reference to SMC-R links.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-13 13:14:53 +00:00
Wen Gu
ea89c6c098 net/smc: Introduce a new conn->lgr validity check helper
It is no longer suitable to identify whether a smc connection
is registered in a link group through checking if conn->lgr
is NULL, because conn->lgr won't be reset even the connection
is unregistered from a link group.

So this patch introduces a new helper smc_conn_lgr_valid() and
replaces all the check of conn->lgr in original implementation
with the new helper to judge if conn->lgr is valid to use.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-13 13:14:53 +00:00
Eric Dumazet
91341fa000 inet: frags: annotate races around fqdir->dead and fqdir->high_thresh
Both fields can be read/written without synchronization,
add proper accessors and documentation.

Fixes: d5dd88794a13 ("inet: fix various use-after-free in defrags units")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-13 13:06:05 +00:00
Wen Gu
61f434b028 net/smc: Resolve the race between link group access and termination
We encountered some crashes caused by the race between the access
and the termination of link groups.

Here are some of panic stacks we met:

1) Race between smc_clc_wait_msg() and __smc_lgr_terminate()

 BUG: kernel NULL pointer dereference, address: 00000000000002f0
 Workqueue: smc_hs_wq smc_listen_work [smc]
 RIP: 0010:smc_clc_wait_msg+0x3eb/0x5c0 [smc]
 Call Trace:
  <TASK>
  ? smc_clc_send_accept+0x45/0xa0 [smc]
  ? smc_clc_send_accept+0x45/0xa0 [smc]
  smc_listen_work+0x783/0x1220 [smc]
  ? finish_task_switch+0xc4/0x2e0
  ? process_one_work+0x1ad/0x3c0
  process_one_work+0x1ad/0x3c0
  worker_thread+0x4c/0x390
  ? rescuer_thread+0x320/0x320
  kthread+0x149/0x190
  ? set_kthread_struct+0x40/0x40
  ret_from_fork+0x1f/0x30
  </TASK>

smc_listen_work()                abnormal case like port error
---------------------------------------------------------------
                                | __smc_lgr_terminate()
                                |  |- smc_conn_kill()
                                |      |- smc_lgr_unregister_conn()
                                |          |- set conn->lgr = NULL
smc_clc_wait_msg()              |
 |- access conn->lgr (panic)    |

2) Race between smc_setsockopt() and __smc_lgr_terminate()

 BUG: kernel NULL pointer dereference, address: 00000000000002e8
 RIP: 0010:smc_setsockopt+0x17a/0x280 [smc]
 Call Trace:
  <TASK>
  __sys_setsockopt+0xfc/0x190
  __x64_sys_setsockopt+0x20/0x30
  do_syscall_64+0x34/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae
  </TASK>

smc_setsockopt()                 abnormal case like port error
--------------------------------------------------------------
                                | __smc_lgr_terminate()
                                |  |- smc_conn_kill()
                                |      |- smc_lgr_unregister_conn()
                                |          |- set conn->lgr = NULL
mod_delayed_work()              |
 |- access conn->lgr (panic)    |

There are some other panic places and they are caused by the
similar reason as described above, which is accessing link
group after termination, thus getting a NULL pointer or invalid
resource.

Currently, there seems to be no synchronization between the
link group access and a sudden termination of it. This patch
tries to fix this by introducing reference count of link group
and not freeing link group until reference count is zero.

Link group might be referred to by links or smc connections. So
the operation to the link group reference count can be concluded
as follows:

object          [hold or initialized as 1]       [put]
-------------------------------------------------------------------
link group      smc_lgr_create()                 smc_lgr_free()
connections     smc_conn_create()                smc_conn_free()
links           smcr_link_init()                 smcr_link_clear()

Througth this way, we extend the life cycle of link group and
ensure it is longer than the life cycle of connections and links
above it, so that avoid invalid access to link group after its
termination.

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-13 12:55:40 +00:00
Venky Shankar
4153c7fc93 libceph: rename parse_fsid() to ceph_parse_fsid() and export
... as it is too generic. also, use __func__ when logging
rather than hardcoding the function name.

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-01-13 13:40:06 +01:00
Venky Shankar
2d7c86a8f9 libceph: generalize addr/ip parsing based on delimiter
... and remove hardcoded function name in ceph_parse_ips().

[ idryomov: delim parameter, drop CEPH_ADDR_PARSE_DEFAULT_DELIM ]

Signed-off-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-01-13 13:40:05 +01:00
Maxim Mikityanskiy
de2d807b29 sch_api: Don't skip qdisc attach on ingress
The attach callback of struct Qdisc_ops is used by only a few qdiscs:
mq, mqprio and htb. qdisc_graft() contains the following logic
(pseudocode):

    if (!qdisc->ops->attach) {
        if (ingress)
            do ingress stuff;
        else
            do egress stuff;
    }
    if (!ingress) {
        ...
        if (qdisc->ops->attach)
            qdisc->ops->attach(qdisc);
    } else {
        ...
    }

As we see, the attach callback is not called if the qdisc is being
attached to ingress (TC_H_INGRESS). That wasn't a problem for mq and
mqprio, since they contain a check that they are attached to TC_H_ROOT,
and they can't be attached to TC_H_INGRESS anyway.

However, the commit cited below added the attach callback to htb. It is
needed for the hardware offload, but in the non-offload mode it
simulates the "do egress stuff" part of the pseudocode above. The
problem is that when htb is attached to ingress, neither "do ingress
stuff" nor attach() is called. It results in an inconsistency, and the
following message is printed to dmesg:

unregister_netdevice: waiting for lo to become free. Usage count = 2

This commit addresses the issue by running "do ingress stuff" in the
ingress flow even in the attach callback is present, which is fine,
because attach isn't going to be called afterwards.

The bug was found by syzbot and reported by Eric.

Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-13 12:34:59 +00:00
Pablo Neira Ayuso
7d70984a1a netfilter: nft_connlimit: memleak if nf_ct_netns_get() fails
Check if nf_ct_netns_get() fails then release the limit object
previously allocated via kmalloc().

Fixes: 37f319f37d90 ("netfilter: nft_connlimit: move stateful fields out of expression data")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-13 12:26:04 +01:00
Ignat Korchagin
ed6ae5ca43 sit: allow encapsulated IPv6 traffic to be delivered locally
While experimenting with FOU encapsulation Amir noticed that encapsulated IPv6
traffic fails to be delivered, if the peer IP address is configured locally.

It can be easily verified by creating a sit interface like below:

$ sudo ip link add name fou_test type sit remote 127.0.0.1 encap fou encap-sport auto encap-dport 1111
$ sudo ip link set fou_test up

and sending some IPv4 and IPv6 traffic to it

$ ping -I fou_test -c 1 1.1.1.1
$ ping6 -I fou_test -c 1 fe80::d0b0:dfff:fe4c:fcbc

"tcpdump -i any udp dst port 1111" will confirm that only the first IPv4 ping
was encapsulated and attempted to be delivered.

This seems like a limitation: for example, in a cloud environment the "peer"
service may be arbitrarily scheduled on any server within the cluster, where all
nodes are trying to send encapsulated traffic. And the unlucky node will not be
able to. Moreover, delivering encapsulated IPv4 traffic locally is allowed.

But I may not have all the context about this restriction and this code predates
the observable git history.

Reported-by: Amir Razmjou <arazmjou@cloudflare.com>
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220107123842.211335-1-ignat@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-12 13:56:07 -08:00
Linus Torvalds
342465f533 TTY/Serial driver updates for 5.17-rc1
Here is the big set of tty/serial driver updates for 5.17-rc1.
 
 Nothing major in here, just lots of good updates and fixes, including:
 	- more tty core cleanups from Jiri as well as mxser driver
 	  cleanups.  This is the majority of the core diffstat
 	- tty documentation updates from Jiri
 	- platform_get_irq() updates
 	- various serial driver updates for new features and hardware
 	- fifo usage for 8250 console, reducing cpu load a lot
 	- LED fix for keyboards, long-time bugfix that went through many
 	  revisions
 	- minor cleanups
 
 All have been in linux-next for a while with no reported problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYd7Q0g8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yn3FACgoFZEFY04TU+Cd9mrlRq/mazZm/IAniJfPxOF
 U0s57L5o1dlnmawh8mmV
 =HSOB
 -----END PGP SIGNATURE-----

Merge tag 'tty-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial driver updates from Greg KH:
 "Here is the big set of tty/serial driver updates for 5.17-rc1.

  Nothing major in here, just lots of good updates and fixes, including:

   - more tty core cleanups from Jiri as well as mxser driver cleanups.
     This is the majority of the core diffstat

   - tty documentation updates from Jiri

   - platform_get_irq() updates

   - various serial driver updates for new features and hardware

   - fifo usage for 8250 console, reducing cpu load a lot

   - LED fix for keyboards, long-time bugfix that went through many
     revisions

   - minor cleanups

  All have been in linux-next for a while with no reported problems"

* tag 'tty-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (119 commits)
  serial: core: Keep mctrl register state and cached copy in sync
  serial: stm32: correct loop for dma error handling
  serial: stm32: fix flow control transfer in DMA mode
  serial: stm32: rework TX DMA state condition
  serial: stm32: move tx dma terminate DMA to shutdown
  serial: pl011: Drop redundant DTR/RTS preservation on close/open
  serial: pl011: Drop CR register reset on set_termios
  serial: pl010: Drop CR register reset on set_termios
  serial: liteuart: fix MODULE_ALIAS
  serial: 8250_bcm7271: Fix return error code in case of dma_alloc_coherent() failure
  Revert "serdev: BREAK/FRAME/PARITY/OVERRUN notification prototype V2"
  tty: goldfish: Use platform_get_irq() to get the interrupt
  serdev: BREAK/FRAME/PARITY/OVERRUN notification prototype V2
  tty: serial: meson: Drop the legacy compatible strings and clock code
  serial: pmac_zilog: Use platform_get_irq() to get the interrupt
  serial: bcm63xx: Use platform_get_irq() to get the interrupt
  serial: ar933x: Use platform_get_irq() to get the interrupt
  serial: vt8500: Use platform_get_irq() to get the interrupt
  serial: altera_jtaguart: Use platform_get_irq_optional() to get the interrupt
  serial: pxa: Use platform_get_irq() to get the interrupt
  ...
2022-01-12 11:21:52 -08:00
Eric Dumazet
7b9b1d449a net/smc: fix possible NULL deref in smc_pnet_add_eth()
I missed that @ndev value can be NULL.

I prefer not factorizing this NULL check, and instead
clearly document where a NULL might be expected.

general protection fault, probably for non-canonical address 0xdffffc00000000ba: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000005d0-0x00000000000005d7]
CPU: 0 PID: 19875 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__lock_acquire+0xd7a/0x5470 kernel/locking/lockdep.c:4897
Code: 14 0e 41 bf 01 00 00 00 0f 86 c8 00 00 00 89 05 5c 20 14 0e e9 bd 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 9f 2e 00 00 49 81 3e 20 c5 1a 8f 0f 84 52 f3 ff
RSP: 0018:ffffc900057071d0 EFLAGS: 00010002
RAX: dffffc0000000000 RBX: 1ffff92000ae0e65 RCX: 1ffff92000ae0e4c
RDX: 00000000000000ba RSI: 0000000000000000 RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
R10: fffffbfff1b24ae2 R11: 000000000008808a R12: 0000000000000000
R13: ffff888040ca4000 R14: 00000000000005d0 R15: 0000000000000000
FS:  00007fbd683e0700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2be22000 CR3: 0000000013fea000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 lock_acquire kernel/locking/lockdep.c:5637 [inline]
 lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
 _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
 ref_tracker_alloc+0x182/0x440 lib/ref_tracker.c:84
 netdev_tracker_alloc include/linux/netdevice.h:3859 [inline]
 smc_pnet_add_eth net/smc/smc_pnet.c:372 [inline]
 smc_pnet_enter net/smc/smc_pnet.c:492 [inline]
 smc_pnet_add+0x49a/0x14d0 net/smc/smc_pnet.c:555
 genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
 genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
 genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
 netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:725
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2413
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: b60645248af3 ("net/smc: add net device tracker to struct smc_pnetentry")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-12 14:45:29 +00:00
Eric Dumazet
fcfb894d59 net: bridge: fix net device refcount tracking issue in error path
I left one dev_put() in br_add_if() error path and sure enough
syzbot found its way.

As the tracker is allocated in new_nbp(), we must make sure
to properly free it.

We have to call dev_put_track(dev, &p->dev_tracker) before
@p object is freed, of course. This is not an issue because
br_add_if() owns a reference on @dev.

Fixes: b2dcdc7f731d ("net: bridge: add net device refcount tracker")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-12 14:44:18 +00:00
Miroslav Lichvar
2a4d75bfe4 net: fix sock_timestamping_bind_phc() to release device
Don't forget to release the device in sock_timestamping_bind_phc() after
it was used to get the vclock indices.

Fixes: d463126e23f1 ("net: sock: extend SO_TIMESTAMPING for PHC binding")
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-12 14:16:15 +00:00
Michael Walle
3486eb774f Revert "of: net: support NVMEM cells with MAC in text format"
This reverts commit 9ed319e411915e882bb4ed99be3ae78667a70022.

We can already post process a nvmem cell value in a particular driver.
Instead of having yet another place to convert the values, the post
processing hook of the nvmem provider should be used in this case.

Signed-off-by: Michael Walle <michael@walle.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-12 14:14:36 +00:00
Pablo Neira Ayuso
fe75e84a8f netfilter: nf_tables: set last expression in register tracking area
nft_rule_for_each_expr() sets on last to nft_rule_last(), however, this
is coming after track.last field is set on.

Use nft_expr_last() to set track.last accordingly.

Fixes: 12e4ecfa244b ("netfilter: nf_tables: add register tracking infrastructure")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-12 12:30:29 +01:00
Guillaume Nault
f7716b3185 gre: Don't accidentally set RTO_ONLINK in gre_fill_metadata_dst()
Mask the ECN bits before initialising ->flowi4_tos. The tunnel key may
have the last ECN bit set, which will interfere with the route lookup
process as ip_route_output_key_hash() interpretes this bit specially
(to restrict the route scope).

Found by code inspection, compile tested only.

Fixes: 962924fa2b7a ("ip_gre: Refactor collect metatdata mode tunnel xmit to ip_md_tunnel_xmit")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-11 20:36:08 -08:00
Guillaume Nault
23e7b1bfed xfrm: Don't accidentally set RTO_ONLINK in decode_session4()
Similar to commit 94e2238969e8 ("xfrm4: strip ECN bits from tos field"),
clear the ECN bits from iph->tos when setting ->flowi4_tos.
This ensures that the last bit of ->flowi4_tos is cleared, so
ip_route_output_key_hash() isn't going to restrict the scope of the
route lookup.

Use ~INET_ECN_MASK instead of IPTOS_RT_MASK, because we have no reason
to clear the high order bits.

Found by code inspection, compile tested only.

Fixes: 4da3089f2b58 ("[IPSEC]: Use TOS when doing tunnel lookups")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-11 20:36:08 -08:00
Matt Johnston
284a4d94e8 mctp: test: zero out sockaddr
MCTP now requires that padding bytes are zero.

Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Fixes: 1e4b50f06d97 ("mctp: handle the struct sockaddr_mctp padding fields")
Link: https://lore.kernel.org/r/20220110021806.2343023-1-matt@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-11 20:26:36 -08:00
Linus Torvalds
a135ce4400 selinux/stable-5.17 PR 20220110
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmHcekAUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNMAQ//TlHhQPEkPYZwD0EPXQIkl5IxGZfR
 DSg24OMzrh+x/jKVYhMXuW6BlnF10wd2ocG0DNAt9weF22PLrbDsxbzAAnv+MLXM
 wCzWLmgKpPpbaG57oa17LMcswDjVr8YbxXPOvyZJ/G3YsWDarH8Ezot8iNlVUkOh
 oqjjXTy0dyLUB/FsW7sIGWa3O18SkbI4pDQxL2vcpdDbvPtAY94kNG3bKTONK/kn
 OxLrjKscTtu6EuSdFhMqwcUxpfqvqUDrEfiSdNruMlT0/DixHgFlJA8WHXjn7Wpq
 XTbqdFrClfpmIiTrPSSszLrZAceegTdefDAf5wgfTxmcfYKiPDF9kPu5UPX1rAgO
 hSoXvGvPjez+RzyXmv9S292lkhV4tz/Wg0YlxZ0LUWif/CSZEyeGGoFZs8080Ukl
 1C/oNrByriaGGRLVwKNGOg5x/UkP1ipnorNnAiiIhw+xkzfXm12RdisUyUgLh9+w
 WsfsC9BJkXMpuvfkgya5PJ677wVNou4ZtJX+2MV8CfGEKTAJp/X/HKh3tWkQFYKx
 35QLVO1LD6gJZlSZZTsjZDxUyPwSd8e55GSFn1qKIHuC2jKjSkwCE7JfErIrI/W0
 js6HKO2ak7oMTWjxRizANyKI/yva/DeMl+mKCC4QV0xmwlm5JsOe7mYSCNGws7AV
 A2qYX7S1xKbLTGI=
 =8H+p
 -----END PGP SIGNATURE-----

Merge tag 'selinux-pr-20220110' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux updates from Paul Moore:
 "Nothing too significant, but five SELinux patches for v5.17 that do
  the following:

   - Harden the code through additional use of the struct_size() macro

   - Plug some memory leaks

   - Clean up the code via removal of the security_add_mnt_opt() LSM
     hook and minor tweaks to selinux_add_opt()

   - Rename security_task_getsecid_subj() to better reflect its actual
     behavior/use - now called security_current_getsecid_subj()"

* tag 'selinux-pr-20220110' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: minor tweaks to selinux_add_opt()
  selinux: fix potential memleak in selinux_add_opt()
  security,selinux: remove security_add_mnt_opt()
  selinux: Use struct_size() helper in kmalloc()
  lsm: security_task_getsecid_subj() -> security_current_getsecid_subj()
2022-01-11 13:03:06 -08:00
Toke Høiland-Jørgensen
382778edc8 xdp: check prog type before updating BPF link
The bpf_xdp_link_update() function didn't check the program type before
updating the program, which made it possible to install any program type as
an XDP program, which is obviously not good. Syzbot managed to trigger this
by swapping in an LWT program on the XDP hook which would crash in a helper
call.

Fix this by adding a check and bailing out if the types don't match.

Fixes: 026a4c28e1db ("bpf, xdp: Implement LINK_UPDATE for BPF XDP link")
Reported-by: syzbot+983941aa85af6ded1fd9@syzkaller.appspotmail.com
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20220107221115.326171-1-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-11 09:44:06 -08:00
Pablo Neira Ayuso
cf46eacbc1 netfilter: nf_tables: remove unused variable
> Remove unused variable and fix missing initialization.
>
> >> net/netfilter/nf_tables_api.c:8266:6: warning: variable 'i' set but not used [-Wunused-but-set-variable]
>            int i;
>                ^

Fixes: 2c865a8a28a1 ("netfilter: nf_tables: add rule blob layout")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-11 10:41:44 +01:00
Florian Westphal
0e906607b9 netfilter: nf_conntrack_netbios_ns: fix helper module alias
The helper gets registered as 'netbios-ns', not netbios_ns.
Intentionally not adding a fixes-tag because i don't want this to go to
stable. This wasn't noticed for a very long time so no so no need to risk
regressions.

Reported-by: Yi Chen <yiche@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-11 10:41:44 +01:00
Pablo Neira Ayuso
51edb2ff1c netfilter: nf_tables: typo NULL check in _clone() function
This should check for NULL in case memory allocation fails.

Reported-by: Julian Wiedmann <jwiedmann.dev@gmail.com>
Fixes: 3b9e2ea6c11b ("netfilter: nft_limit: move stateful fields out of expression data")
Fixes: 37f319f37d90 ("netfilter: nft_connlimit: move stateful fields out of expression data")
Fixes: 33a24de37e81 ("netfilter: nft_last: move stateful fields out of expression data")
Fixes: ed0a0c60f0e5 ("netfilter: nft_quota: move stateful fields out of expression data")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20220110194817.53481-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-10 21:09:43 -08:00
Linus Torvalds
63045bfd3c netfilter: nf_tables: don't use 'data_size' uninitialized
Commit 2c865a8a28a1 ("netfilter: nf_tables: add rule blob layout") never
initialized the new 'data_size' variable.

I'm not sure how it ever worked, but it might have worked almost by
accident - gcc seems to occasionally miss these kinds of 'variable used
uninitialized' situations, but I've seen it do so because it ended up
zero-initializing them due to some other simplification.

But clang is very unhappy about it all, and correctly reports

    net/netfilter/nf_tables_api.c:8278:4: error: variable 'data_size' is uninitialized when used here [-Werror,-Wuninitialized]
                            data_size += sizeof(*prule) + rule->dlen;
                            ^~~~~~~~~
    net/netfilter/nf_tables_api.c:8263:30: note: initialize the variable 'data_size' to silence this warning
            unsigned int size, data_size;
                                        ^
                                         = 0
    1 error generated.

and this fix just initializes 'data_size' to zero before the loop.

Fixes: 2c865a8a28a1 ("netfilter: nf_tables: add rule blob layout")
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-10 19:33:36 -08:00
Chuck Lever
dc6c6fb3d6 SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point
While testing, I got an unexpected KASAN splat:

Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: BUG: KASAN: stack-out-of-bounds in trace_event_raw_event_svc_xprt_create_err+0x190/0x210 [sunrpc]
Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: Read of size 28 at addr ffffc9000008f728 by task mount.nfs/4628

The memcpy() in the TP_fast_assign section of this trace point
copies the size of the destination buffer in order that the buffer
won't be overrun.

In other similar trace points, the source buffer for this memcpy is
a "struct sockaddr_storage" so the actual length of the source
buffer is always long enough to prevent the memcpy from reading
uninitialized or unallocated memory.

However, for this trace point, the source buffer can be as small as
a "struct sockaddr_in". For AF_INET sockaddrs, the memcpy() reads
memory that follows the source buffer, which is not always valid
memory.

To avoid copying past the end of the passed-in sockaddr, make the
source address's length available to the memcpy(). It would be a
little nicer if the tracing infrastructure was more friendly about
storing socket addresses that are not AF_INET, but I could not find
a way to make printk("%pIS") work with a dynamic array.

Reported-by: KASAN
Fixes: 4b8f380e46e4 ("SUNRPC: Tracepoint to record errors in svc_xpo_create()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-01-10 10:57:33 -05:00
Jakub Kicinski
8aaaf2f3af Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in fixes directly in prep for the 5.17 merge window.
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09 17:00:17 -08:00
Christian Schoenebeck
15e2721b19 net/9p: show error message if user 'msize' cannot be satisfied
If user supplied a large value with the 'msize' option, then
client would silently limit that 'msize' value to the maximum
value supported by transport. That's a bit confusing for users
of not having any indication why the preferred 'msize' value
could not be satisfied.

Link: https://lkml.kernel.org/r/783ba37c1566dd715b9a67d437efa3b77e3cd1a7.1640870037.git.linux_oss@crudebyte.com
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-01-10 10:00:09 +09:00
Thomas Weißschuh
019641d1b5 net/p9: load default transports
Now that all transports are split into modules it may happen that no
transports are registered when v9fs_get_default_trans() is called.
When that is the case try to load more transports from modules.

Link: https://lkml.kernel.org/r/20211103193823.111007-5-linux@weissschuh.net
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
[Dominique: constify v9fs_get_trans_by_name argument as per patch1v2]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-01-10 10:00:09 +09:00
Thomas Weißschuh
99aa673e29 9p/xen: autoload when xenbus service is available
Link: https://lkml.kernel.org/r/20211103193823.111007-4-linux@weissschuh.net
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-01-10 10:00:09 +09:00
Thomas Weißschuh
1c582c6dc4 9p/trans_fd: split into dedicated module
This allows these transports only to be used when needed.

Link: https://lkml.kernel.org/r/20211103193823.111007-3-linux@weissschuh.net
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
[Dominique: Kconfig NET_9P_FD: -depends VIRTIO, +default NET_9P]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-01-10 09:58:30 +09:00
Benjamin Yim
208dd45d8d tcp: tcp_send_challenge_ack delete useless param skb
After this parameter is passed in, there is no usage, and deleting it will
 not bring any impact.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Benjamin Yim <yan2228598786@gmail.com>
Link: https://lore.kernel.org/r/20220109130824.2776-1-yan2228598786@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09 16:52:21 -08:00
Yunsheng Lin
07b17f0f74 page_pool: remove spinlock in page_pool_refill_alloc_cache()
As page_pool_refill_alloc_cache() is only called by
__page_pool_get_cached(), which assumes non-concurrent access
as suggested by the comment in __page_pool_get_cached(), and
ptr_ring allows concurrent access between consumer and producer,
so remove the spinlock in page_pool_refill_alloc_cache().

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/20220107090042.13605-1-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09 16:45:27 -08:00
Menglong Dong
1c7fab70df net: skb: use kfree_skb_reason() in __udp4_lib_rcv()
Replace kfree_skb() with kfree_skb_reason() in __udp4_lib_rcv.
New drop reason 'SKB_DROP_REASON_UDP_CSUM' is added for udp csum
error.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09 16:30:35 -08:00
Menglong Dong
8512559741 net: skb: use kfree_skb_reason() in tcp_v4_rcv()
Replace kfree_skb() with kfree_skb_reason() in tcp_v4_rcv(). Following
drop reasons are added:

SKB_DROP_REASON_NO_SOCKET
SKB_DROP_REASON_PKT_TOO_SMALL
SKB_DROP_REASON_TCP_CSUM
SKB_DROP_REASON_TCP_FILTER

After this patch, 'kfree_skb' event will print message like this:

$           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
$              | |         |   |||||     |         |
          <idle>-0       [000] ..s1.    36.113438: kfree_skb: skbaddr=(____ptrval____) protocol=2048 location=(____ptrval____) reason: NO_SOCKET

The reason of skb drop is printed too.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09 16:30:34 -08:00
Menglong Dong
c504e5c2f9 net: skb: introduce kfree_skb_reason()
Introduce the interface kfree_skb_reason(), which is able to pass
the reason why the skb is dropped to 'kfree_skb' tracepoint.

Add the 'reason' field to 'trace_kfree_skb', therefor user can get
more detail information about abnormal skb with 'drop_monitor' or
eBPF.

All drop reasons are defined in the enum 'skb_drop_reason', and
they will be print as string in 'kfree_skb' tracepoint in format
of 'reason: XXX'.

( Maybe the reasons should be defined in a uapi header file, so that
user space can use them? )

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09 16:30:34 -08:00
Paul Blakey
6f022c2ddb net: openvswitch: Fix ct_state nat flags for conns arriving from tc
Netfilter conntrack maintains NAT flags per connection indicating
whether NAT was configured for the connection. Openvswitch maintains
NAT flags on the per packet flow key ct_state field, indicating
whether NAT was actually executed on the packet.

When a packet misses from tc to ovs the conntrack NAT flags are set.
However, NAT was not necessarily executed on the packet because the
connection's state might still be in NEW state. As such, openvswitch
wrongly assumes that NAT was executed and sets an incorrect flow key
NAT flags.

Fix this, by flagging to openvswitch which NAT was actually done in
act_ct via tc_skb_ext and tc_skb_cb to the openvswitch module, so
the packet flow key NAT flags will be correctly set.

Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20220106153804.26451-1-paulb@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09 16:24:12 -08:00
Jakub Kicinski
77bbcb60f7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next. This
includes one patch to update ovs and act_ct to use nf_ct_put() instead
of nf_conntrack_put().

1) Add netns_tracker to nfnetlink_log and masquerade, from Eric Dumazet.

2) Remove redundant rcu read-size lock in nf_tables packet path.

3) Replace BUG() by WARN_ON_ONCE() in nft_payload.

4) Consolidate rule verdict tracing.

5) Replace WARN_ON() by WARN_ON_ONCE() in nf_tables core.

6) Make counter support built-in in nf_tables.

7) Add new field to conntrack object to identify locally generated
   traffic, from Florian Westphal.

8) Prevent NAT from shadowing well-known ports, from Florian Westphal.

9) Merge nf_flow_table_{ipv4,ipv6} into nf_flow_table_inet, also from
   Florian.

10) Remove redundant pointer in nft_pipapo AVX2 support, from Colin Ian King.

11) Replace opencoded max() in conntrack, from Jiapeng Chong.

12) Update conntrack to use refcount_t API, from Florian Westphal.

13) Move ip_ct_attach indirection into the nf_ct_hook structure.

14) Constify several pointer object in the netfilter codebase,
    from Florian Westphal.

15) Tree-wide replacement of nf_conntrack_put() by nf_ct_put(), also
    from Florian.

16) Fix egress splat due to incorrect rcu notation, from Florian.

17) Move stateful fields of connlimit, last, quota, numgen and limit
    out of the expression data area.

18) Build a blob to represent the ruleset in nf_tables, this is a
    requirement of the new register tracking infrastructure.

19) Add NFT_REG32_NUM to define the maximum number of 32-bit registers.

20) Add register tracking infrastructure to skip redundant
    store-to-register operations, this includes support for payload,
    meta and bitwise expresssions.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: (32 commits)
  netfilter: nft_meta: cancel register tracking after meta update
  netfilter: nft_payload: cancel register tracking after payload update
  netfilter: nft_bitwise: track register operations
  netfilter: nft_meta: track register operations
  netfilter: nft_payload: track register operations
  netfilter: nf_tables: add register tracking infrastructure
  netfilter: nf_tables: add NFT_REG32_NUM
  netfilter: nf_tables: add rule blob layout
  netfilter: nft_limit: move stateful fields out of expression data
  netfilter: nft_limit: rename stateful structure
  netfilter: nft_numgen: move stateful fields out of expression data
  netfilter: nft_quota: move stateful fields out of expression data
  netfilter: nft_last: move stateful fields out of expression data
  netfilter: nft_connlimit: move stateful fields out of expression data
  netfilter: egress: avoid a lockdep splat
  net: prefer nf_ct_put instead of nf_conntrack_put
  netfilter: conntrack: avoid useless indirection during conntrack destruction
  netfilter: make function op structures const
  netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook
  netfilter: conntrack: convert to refcount_t api
  ...
====================

Link: https://lore.kernel.org/r/20220109231640.104123-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09 15:59:23 -08:00
Pablo Neira Ayuso
4a80e02698 netfilter: nft_meta: cancel register tracking after meta update
The meta expression might mangle the packet metadata, cancel register
tracking since any metadata in the registers is stale.

Finer grain register tracking cancellation by inspecting the meta type
on the register is also possible.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09 23:35:17 +01:00
Pablo Neira Ayuso
cc003c7ee6 netfilter: nft_payload: cancel register tracking after payload update
The payload expression might mangle the packet, cancel register tracking
since any payload data in the registers is stale.

Finer grain register tracking cancellation by inspecting the payload
base, offset and length on the register is also possible.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09 23:35:17 +01:00
Pablo Neira Ayuso
be5650f8f4 netfilter: nft_bitwise: track register operations
Check if the destination register already contains the data that this
bitwise expression performs. This allows to skip this redundant
operation.

If the destination contains a different bitwise operation, cancel the
register tracking information. If the destination contains no bitwise
operation, update the register tracking information.

Update the payload and meta expression to check if this bitwise
operation has been already performed on the register. Hence, both the
payload/meta and the bitwise expressions are reduced.

There is also a special case: If source register != destination register
and source register is not updated by a previous bitwise operation, then
transfer selector from the source register to the destination register.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09 23:35:17 +01:00
Pablo Neira Ayuso
9b17afb2c8 netfilter: nft_meta: track register operations
Check if the destination register already contains the data that this
meta store expression performs. This allows to skip this redundant
operation. If the destination contains a different selector, update
the register tracking information.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09 23:35:17 +01:00
Pablo Neira Ayuso
a7c176bf9f netfilter: nft_payload: track register operations
Check if the destination register already contains the data that this
payload store expression performs. This allows to skip this redundant
operation. If the destination contains a different selector, update
the register tracking information.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09 23:35:17 +01:00