linux/include/net/netfilter
Pablo Neira Ayuso 5f68718b34 netfilter: nf_tables: GC transaction API to avoid race with control plane
The set types rhashtable and rbtree use a GC worker to reclaim memory.
From system work queue, in periodic intervals, a scan of the table is
done.

The major caveat here is that the nft transaction mutex is not held.
This causes a race between control plane and GC when they attempt to
delete the same element.

We cannot grab the netlink mutex from the work queue, because the
control plane has to wait for the GC work queue in case the set is to be
removed, so we get following deadlock:

   cpu 1                                cpu2
     GC work                            transaction comes in , lock nft mutex
       `acquire nft mutex // BLOCKS
                                        transaction asks to remove the set
                                        set destruction calls cancel_work_sync()

cancel_work_sync will now block forever, because it is waiting for the
mutex the caller already owns.

This patch adds a new API that deals with garbage collection in two
steps:

1) Lockless GC of expired elements sets on the NFT_SET_ELEM_DEAD_BIT
   so they are not visible via lookup. Annotate current GC sequence in
   the GC transaction. Enqueue GC transaction work as soon as it is
   full. If ruleset is updated, then GC transaction is aborted and
   retried later.

2) GC work grabs the mutex. If GC sequence has changed then this GC
   transaction lost race with control plane, abort it as it contains
   stale references to objects and let GC try again later. If the
   ruleset is intact, then this GC transaction deactivates and removes
   the elements and it uses call_rcu() to destroy elements.

Note that no elements are removed from GC lockless path, the _DEAD bit
is set and pointers are collected. GC catchall does not remove the
elements anymore too. There is a new set->dead flag that is set on to
abort the GC transaction to deal with set->ops->destroy() path which
removes the remaining elements in the set from commit_release, where no
mutex is held.

To deal with GC when mutex is held, which allows safe deactivate and
removal, add sync GC API which releases the set element object via
call_rcu(). This is used by rbtree and pipapo backends which also
perform garbage collection from control plane path.

Since element removal from sets can happen from control plane and
element garbage collection/timeout, it is necessary to keep the set
structure alive until all elements have been deactivated and destroyed.

We cannot do a cancel_work_sync or flush_work in nft_set_destroy because
its called with the transaction mutex held, but the aforementioned async
work queue might be blocked on the very mutex that nft_set_destroy()
callchain is sitting on.

This gives us the choice of ABBA deadlock or UaF.

To avoid both, add set->refs refcount_t member. The GC API can then
increment the set refcount and release it once the elements have been
free'd.

Set backends are adapted to use the GC transaction API in a follow up
patch entitled:

  ("netfilter: nf_tables: use gc transaction API in set backends")

This is joint work with Florian Westphal.

Fixes: cfed7e1b1f ("netfilter: nf_tables: add set garbage collection helpers")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-10 08:25:16 +02:00
..
ipv4 netfilter: disable defrag once its no longer needed 2021-04-26 03:20:07 +02:00
ipv6 netfilter: conntrack: fix boot failure with nf_conntrack.enable_hooks=1 2021-09-28 13:04:55 +02:00
br_netfilter.h netfilter: remove CONFIG_NETFILTER checks from headers. 2019-09-13 12:47:36 +02:00
nf_bpf_link.h bpf: minimal support for programs hooked into netfilter framework 2023-04-21 11:34:14 -07:00
nf_conntrack_acct.h netfilter: conntrack: remove extension register api 2022-02-04 06:30:28 +01:00
nf_conntrack_act_ct.h net/sched: act_ct: Fill offloading tuple iifidx 2022-01-04 12:12:55 +00:00
nf_conntrack_bpf.h net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c 2022-10-03 09:17:32 -07:00
nf_conntrack_bridge.h netfilter: remove CONFIG_NETFILTER checks from headers. 2019-09-13 12:47:36 +02:00
nf_conntrack_core.h netfilter: conntrack: fix wrong ct->timeout value 2023-04-19 12:08:38 +02:00
nf_conntrack_count.h netfilter: nf_conncount: reduce unnecessary GC 2022-05-16 13:05:40 +02:00
nf_conntrack_ecache.h netfilter: prefer extension check to pointer check 2022-05-13 18:56:28 +02:00
nf_conntrack_expect.h netfilter: Reorder fields in 'struct nf_conntrack_expect' 2023-05-18 08:48:54 +02:00
nf_conntrack_extend.h netfilter: extensions: introduce extension genid count 2022-05-13 18:52:16 +02:00
nf_conntrack_helper.h net: move add ct helper function to nf_conntrack_helper for ovs and tc 2022-11-08 12:15:19 +01:00
nf_conntrack_l4proto.h netfilter: conntrack: pass hook state to log functions 2021-06-18 14:47:43 +02:00
nf_conntrack_labels.h netfilter: extensions: introduce extension genid count 2022-05-13 18:52:16 +02:00
nf_conntrack_seqadj.h netfilter: conntrack: remove extension register api 2022-02-04 06:30:28 +01:00
nf_conntrack_synproxy.h netfilter: conntrack: wrap two inline functions in config checks. 2019-09-13 12:47:10 +02:00
nf_conntrack_timeout.h netfilter: nf_conntrack: add missing __rcu annotations 2022-07-11 16:25:15 +02:00
nf_conntrack_timestamp.h netfilter: conntrack: remove extension register api 2022-02-04 06:30:28 +01:00
nf_conntrack_tuple.h netfilter: conntrack: don't fold port numbers into addresses before hashing 2023-07-05 14:42:16 +02:00
nf_conntrack_zones.h netfilter: conntrack: remove CONFIG_NF_CONNTRACK checks from nf_conntrack_zones.h. 2019-09-13 12:47:41 +02:00
nf_conntrack.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next 2023-02-20 10:53:56 +00:00
nf_dup_netdev.h
nf_flow_table.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2023-06-15 22:19:41 -07:00
nf_hooks_lwtunnel.h netfilter: add netfilter hooks to SRv6 data plane 2021-08-30 01:51:36 +02:00
nf_log.h netfilter: nf_log_common: merge with nf_log_syslog 2021-03-31 22:34:10 +02:00
nf_nat_helper.h netfilter: nat: move repetitive nat port reserve loop to a helper 2022-09-07 16:46:04 +02:00
nf_nat_masquerade.h
nf_nat_redirect.h netfilter: nft_redir: use struct nf_nat_range2 throughout and deduplicate eval call-backs 2023-03-22 21:48:59 +01:00
nf_nat.h net: move the nat function to nf_nat_ovs for ovs and tc 2022-12-12 10:14:03 +00:00
nf_queue.h treewide: use get_random_u32() when possible 2022-10-11 17:42:58 -06:00
nf_reject.h netfilter: conntrack: skip verification of zero UDP checksum 2022-05-13 18:56:28 +02:00
nf_socket.h
nf_synproxy.h netfilter: remove CONFIG_NETFILTER checks from headers. 2019-09-13 12:47:36 +02:00
nf_tables_core.h netfilter: nf_tables: avoid retpoline overhead for some ct expression calls 2023-01-18 13:05:25 +01:00
nf_tables_ipv4.h netfilter: use skb_ip_totlen and iph_totlen 2023-02-01 20:54:27 -08:00
nf_tables_ipv6.h netfilter: nf_tables: reduce nft_pktinfo by 8 bytes 2022-10-25 13:44:14 +02:00
nf_tables_offload.h netfilter: nf_tables: bail out early if hardware offload is not supported 2022-06-06 19:19:15 +02:00
nf_tables.h netfilter: nf_tables: GC transaction API to avoid race with control plane 2023-08-10 08:25:16 +02:00
nf_tproxy.h netfilter: tproxy: fix deadlock due to missing BH disable 2023-03-06 12:09:48 +01:00
nft_fib.h netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters 2022-11-15 10:46:34 +01:00
nft_meta.h netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters 2022-11-15 10:46:34 +01:00
nft_reject.h netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters 2022-11-15 10:46:34 +01:00
xt_rateest.h net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data types 2021-10-18 12:54:41 +01:00