IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Timewait sockets have included mark since approx 4.18.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jon Maxwell <jmaxwell37@gmail.com>
Fixes: 0048369055 ("tcp: Add mark for TIMEWAIT sockets")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix some spelling mistakes in comments:
Dont ==> Don't
timout ==> timeout
incomming ==> incoming
necesarry ==> necessary
substract ==> subtract
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MPTCP is builtin, so no need to add EXPORT_SYMBOL()s.
It will be used to support SO_TIMESTAMP(NS) ancillary
messages in the mptcp receive path.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Including <linux/in.h> and <netinet/in.h> in the dependencies breaks
compilation of trinity due to multiple definitions. <linux/in.h> is only
used in <linux/icmp.h> to provide the definition of the struct in_addr,
but this can be substituted out by using the datatype __be32.
Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch aims to improve the situation when reordering and loss are
ocurring in the same flight of packets.
Previously the reordering would first induce a spurious recovery, then
the subsequent ACK may undo the cwnd (based on the timestamps e.g.).
However the current loss recovery does not proceed to invoke
RACK to install a reordering timer. If some packets are also lost, this
may lead to a long RTO-based recovery. An example is
https://groups.google.com/g/bbr-dev/c/OFHADvJbTEI
The solution is to after reverting the recovery, always invoke RACK
to either mount the RACK timer to fast retransmit after the reordering
window, or restarts the recovery if new loss is identified. Hence
it is possible the sender may go from Recovery to Disorder/Open to
Recovery again in one ACK.
Reported-by: mingkun bian <bianmingkun@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the user specifies a hostname or domain name as part of the ip=
command-line option, preserve it and don't overwrite it with one
supplied by DHCP/BOOTP.
For instance, ip=::::myhostname::dhcp will use "myhostname" rather than
ignoring and overwriting it.
Fix the comment on ic_bootp_string that suggests it only copies a string
"if not already set"; it doesn't have any such logic.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Support for SCTP chunks matching on nf_tables, from Phil Sutter.
2) Skip LDMXCSR, we don't need a valid MXCSR state. From Stefano Brivio.
3) CONFIG_RETPOLINE for nf_tables set lookups, from Florian Westphal.
4) A few Kconfig leading spaces removal, from Juerg Haefliger.
5) Remove spinlock from xt_limit, from Jason Baron.
6) Remove useless initialization in xt_CT, oneliner from Yang Li.
7) Tree-wide replacement of netlink_unicast() by nfnetlink_unicast().
8) Reduce footprint of several structures: xt_action_param,
nft_pktinfo and nf_hook_state, from Florian.
10) Add nft_thoff() and nft_sk() helpers and use them, also from Florian.
11) Fix documentation in nf_tables pipapo avx2, from Florian Westphal.
12) Fix clang-12 fmt string warnings, also from Florian.
====================
When kalloc or kmemdup failed, should return ENOMEM rather than ENOBUF.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows to change storage placement later on without changing readers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-05-19
The following pull-request contains BPF updates for your *net-next* tree.
We've added 43 non-merge commits during the last 11 day(s) which contain
a total of 74 files changed, 3717 insertions(+), 578 deletions(-).
The main changes are:
1) syscall program type, fd array, and light skeleton, from Alexei.
2) Stop emitting static variables in skeleton, from Andrii.
3) Low level tc-bpf api, from Kumar.
4) Reduce verifier kmalloc/kfree churn, from Lorenz.
====================
In-kernel notifications are already sent when the multipath hash policy
itself changes, but not when the multipath hash fields change.
Add these notifications, so that interested listeners (e.g., switch ASIC
drivers) could perform the necessary configuration.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since cipso_v4_cache_invalidate() has no return value, so drop
related descriptions in its comments.
Fixes: 446fda4f26 ("[NetLabel]: CIPSOv4 engine")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new multipath hash policy where the packet fields used for hash
calculation are determined by user space via the
fib_multipath_hash_fields sysctl that was introduced in the previous
patch.
The current set of available packet fields includes both outer and inner
fields, which requires two invocations of the flow dissector. Avoid
unnecessary dissection of the outer or inner flows by skipping
dissection if none of the outer or inner fields are required.
In accordance with the existing policies, when an skb is not available,
packet fields are extracted from the provided flow key. In which case,
only outer fields are considered.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A subsequent patch will add a new multipath hash policy where the packet
fields used for multipath hash calculation are determined by user space.
This patch adds a sysctl that allows user space to set these fields.
The packet fields are represented using a bitmask and are common between
IPv4 and IPv6 to allow user space to use the same numbering across both
protocols. For example, to hash based on standard 5-tuple:
# sysctl -w net.ipv4.fib_multipath_hash_fields=0x0037
net.ipv4.fib_multipath_hash_fields = 0x0037
The kernel rejects unknown fields, for example:
# sysctl -w net.ipv4.fib_multipath_hash_fields=0x1000
sysctl: setting key "net.ipv4.fib_multipath_hash_fields": Invalid argument
More fields can be added in the future, if needed.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A subsequent patch will add another multipath hash policy where the
multipath hash is calculated directly by the policy specific code and
not outside of the switch statement.
Prepare for this change by moving the multipath hash calculation inside
the switch statement.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
'err' and 'flags' are not used, we can just get rid of them.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210517022348.50555-1-xiyou.wangcong@gmail.com
Every protocol has the 'netns_ok' member and it is euqal to 1. The
'if (!prot->netns_ok)' always false in inet_add_protocol().
Signed-off-by: Yejune Deng <yejunedeng@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Treat only the highest, not the lowest, IPv4 address within a local
subnet as a broadcast address.
Signed-off-by: Seth David Schoen <schoen@loyalty.org>
Suggested-by: John Gilmore <gnu@toad.com>
Acked-by: Dave Taht <dave.taht@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a tracepoint for capturing TCP segments with
a bad checksum. This makes it easy to identify
sources of bad frames in the fleet (e.g. machines
with faulty NICs).
It should also help tools like IOvisor's tcpdrop.py
which are used today to get detailed information
about such packets.
We don't have a socket in many cases so we must
open code the address extraction based just on
the skb.
v2: add missing export for ipv6=m
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2021-05-11
The following pull-request contains BPF updates for your *net* tree.
We've added 13 non-merge commits during the last 8 day(s) which contain
a total of 21 files changed, 817 insertions(+), 382 deletions(-).
The main changes are:
1) Fix multiple ringbuf bugs in particular to prevent writable mmap of
read-only pages, from Andrii Nakryiko & Thadeu Lima de Souza Cascardo.
2) Fix verifier alu32 known-const subregister bound tracking for bitwise
operations and/or/xor, from Daniel Borkmann.
3) Reject trampoline attachment for functions with variable arguments,
and also add a deny list of other forbidden functions, from Jiri Olsa.
4) Fix nested bpf_bprintf_prepare() calls used by various helpers by
switching to per-CPU buffers, from Florent Revest.
5) Fix kernel compilation with BTF debug info on ppc64 due to pahole
missing TCP-CC functions like cubictcp_init, from Martin KaFai Lau.
6) Add a kconfig entry to provide an option to disallow unprivileged
BPF by default, from Daniel Borkmann.
7) Fix libbpf compilation for older libelf when GELF_ST_VISIBILITY()
macro is not available, from Arnaldo Carvalho de Melo.
8) Migrate test_tc_redirect to test_progs framework as prep work
for upcoming skb_change_head() fix & selftest, from Jussi Maki.
9) Fix a libbpf segfault in add_dummy_ksym_var() if BTF is not
present, from Ian Rogers.
10) Fix tx_only micro-benchmark in xdpsock BPF sample with proper frame
size, from Magnus Karlsson.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
During the discussion in [0]. It was pointed out that static functions
in ppc64 is prefixed with ".". For example, the 'readelf -s vmlinux.ppc':
89326: c000000001383280 24 NOTYPE LOCAL DEFAULT 31 cubictcp_init
89327: c000000000c97c50 168 FUNC LOCAL DEFAULT 2 .cubictcp_init
The one with FUNC type is ".cubictcp_init" instead of "cubictcp_init".
The "." seems to be done by arch/powerpc/include/asm/ppc_asm.h.
This caused that pahole cannot generate the BTF for these tcp-cc kernel
functions because pahole only captures the FUNC type and "cubictcp_init"
is not. It then failed the kernel compilation in ppc64.
This behavior is only reported in ppc64 so far. I tried arm64, s390,
and sparc64 and did not observe this "." prefix and NOTYPE behavior.
Since the kfunc call is only supported in the x86_64 and x86_32 JIT,
this patch limits those tcp-cc functions to x86 only to avoid unnecessary
compilation issue in other ARCHs. In the future, we can examine if it
is better to change all those functions from static to extern.
[0] https://lore.kernel.org/bpf/4e051459-8532-7b61-c815-f3435767f8a0@kernel.org/
Fixes: e78aea8b21 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Michal Suchánek <msuchanek@suse.de>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/bpf/20210508005011.3863757-1-kafai@fb.com
When we call af_ops->set_link_af() we hold a RCU read lock
as we retrieve af_ops from the RCU protected list, but this
is unnecessary because we already hold RTNL lock, which is
the writer lock for protecting rtnl_af_ops, so it is safer
than RCU read lock. Similar for af_ops->validate_link_af().
This was not a problem until we begin to take mutex lock
down the path of ->set_link_af() in __ipv6_dev_mc_dec()
recently. We can just drop the RCU read lock there and
assert RTNL lock.
Reported-and-tested-by: syzbot+7d941e89dd48bcf42573@syzkaller.appspotmail.com
Fixes: 63ed8de4be ("mld: add mc_lock for protecting per-interface mld data")
Tested-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Add SECMARK revision 1 to fix incorrect layout that prevents
from remove rule with this target, from Phil Sutter.
2) Fix pernet exit path spat in arptables, from Florian Westphal.
3) Missing rcu_read_unlock() for unknown nfnetlink callbacks,
reported by syzbot, from Eric Dumazet.
4) Missing check for skb_header_pointer() NULL pointer in
nfnetlink_osf.
5) Remove BUG_ON() after skb_header_pointer() from packet path
in several conntrack helper and the TCP tracker.
6) Fix memleak in the new object error path of userdata.
7) Avoid overflows in nft_hash_buckets(), reported by syzbot,
also from Eric.
8) Avoid overflows in 32bit arches, from Eric.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nftables: avoid potential overflows on 32bit arches
netfilter: nftables: avoid overflows in nft_hash_buckets()
netfilter: nftables: Fix a memleak from userdata error path in new objects
netfilter: remove BUG_ON() after skb_header_pointer()
netfilter: nfnetlink_osf: Fix a missing skb_header_pointer() NULL check
netfilter: nfnetlink: add a missing rcu_read_unlock()
netfilter: arptables: use pernet ops struct during unregister
netfilter: xt_SECMARK: add new revision to fix structure layout
====================
Link: https://lore.kernel.org/r/20210507174739.1850-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A prior change (1f466e1f15) introduces separate handling for
->msg_control depending on whether the pointer is a kernel or user
pointer. However, while tcp receive zerocopy is using this field, it
is not properly annotating that the buffer in this case is a user
pointer. This can cause faults when the improper mechanism is used
within put_cmsg().
This patch simply annotates tcp receive zerocopy's use as explicitly
being a user pointer.
Fixes: 7eeba1706e ("tcp: Add receive timestamp support for receive zerocopy.")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210506223530.2266456-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_set_default_congestion_control() is netns-safe in that it writes
to &net->ipv4.tcp_congestion_control, but it also sets
ca->flags |= TCP_CONG_NON_RESTRICTED which is not namespaced.
This has the unintended side-effect of changing the global
net.ipv4.tcp_allowed_congestion_control sysctl, despite the fact that it
is read-only: 97684f0970 ("net: Make tcp_allowed_congestion_control
readonly in non-init netns")
Resolve this netns "leak" by only allowing the init netns to set the
default algorithm to one that is restricted. This restriction could be
removed if tcp_allowed_congestion_control were namespace-ified in the
future.
This bug was uncovered with
https://github.com/JonathonReinhart/linux-netns-sysctl-verify
Fixes: 6670e15244 ("tcp: Namespace-ify sysctl_tcp_default_congestion_control")
Signed-off-by: Jonathon Reinhart <jonathon.reinhart@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like with iptables and ebtables, hook unregistration has to use the
pernet ops struct, not the template.
This triggered following splat:
hook not found, pf 3 num 0
WARNING: CPU: 0 PID: 224 at net/netfilter/core.c:480 __nf_unregister_net_hook+0x1eb/0x610 net/netfilter/core.c:480
[..]
nf_unregister_net_hook net/netfilter/core.c:502 [inline]
nf_unregister_net_hooks+0x117/0x160 net/netfilter/core.c:576
arpt_unregister_table_pre_exit+0x67/0x80 net/ipv4/netfilter/arp_tables.c:1565
Fixes: f9006acc8d ("netfilter: arp_tables: pass table pointer via nf_hook_ops")
Reported-by: syzbot+dcccba8a1e41a38cb9df@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The current definitions of constants for PROBE, currently defined only
in the net-next kernel branch, are inconsistent, with
some beginning with ICMP and others with simply EXT. This patch
attempts to standardize the naming conventions of the constants for
PROBE before their release into a stable Kernel, and to update the
relevant definitions in net/ipv4/icmp.c.
Similarly, the definitions for the code field (previously
ICMP_EXT_MAL_QUERY, etc) use the same prefixes as the type field. This
patch adds _CODE_ to the prefix to clarify the distinction of these
constants.
Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Acked-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210427153635.2591-1-andreas.a.roeseler@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) The various ip(6)table_foo incarnations are updated to expect
that the table is passed as 'void *priv' argument that netfilter core
passes to the hook functions. This reduces the struct net size by 2
cachelines on x86_64. From Florian Westphal.
2) Add cgroupsv2 support for nftables.
3) Fix bridge log family merge into nf_log_syslog: Missing
unregistration from netns exit path, from Phil Sutter.
4) Add nft_pernet() helper to access nftables pernet area.
5) Add struct nfnl_info to reduce nfnetlink callback footprint and
to facilite future updates. Consolidate nfnetlink callbacks.
6) Add CONFIG_NETFILTER_XTABLES_COMPAT Kconfig knob, also from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The compat layer needs to parse untrusted input (the ruleset)
to translate it to a 64bit compatible format.
We had a number of bugs in this department in the past, so allow users
to turn this feature off.
Add CONFIG_NETFILTER_XTABLES_COMPAT kconfig knob and make it default to y
to keep existing behaviour.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Same change as previous patch. Only difference:
no need to handle NULL template_ops parameter, the only caller
(arptable_filter) always passes non-NULL argument.
This removes all remaining accesses to net->ipv4.arptable_filter.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
iptable_x modules rely on 'struct net' to contain a pointer to the
table that should be evaluated.
In order to remove these pointers from struct net, pass them via
the 'priv' pointer in a similar fashion as nf_tables passes the
rule data.
To do that, duplicate the nf_hook_info array passed in from the
iptable_x modules, update the ops->priv pointers of the copy to
refer to the table and then change the hookfn implementations to
just pass the 'priv' argument to the traverser.
After this patch, the xt_table pointers can already be removed
from struct net.
However, changes to struct net result in re-compile of the entire
network stack, so do the removal after arptables and ip6tables
have been converted as well.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This changes how ip(6)table nat passes the ruleset/table to the
evaluation loop.
At the moment, it will fetch the table from struct net.
This change stores the table in the hook_ops 'priv' argument
instead.
This requires to duplicate the hook_ops for each netns, so
they can store the (per-net) xt_table structure.
The dupliated nat hook_ops get stored in net_generic data area.
They are free'd in the namespace exit path.
This is a pre-requisite to remove the xt_table/ruleset pointers
from struct net.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need for these.
There is only one caller, the xtables core, when the table is registered
for the first time with a particular network namespace.
After ->table_init() call, the table is linked into the tables[af] list,
so next call to that function will skip the ->table_init().
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
xtables stores the xt_table structs in the struct net. This isn't
needed anymore, the structures could be passed via the netfilter hook
'private' pointer to the hook functions, which would allow us to remove
those pointers from struct net.
As a first step, reduce the number of accesses to the
net->ipv4.ip6table_{raw,filter,...} pointers.
This allows the tables to get unregistered by name instead of having to
pass the raw address.
The xt_table structure cane looked up by name+address family instead.
This patch is useless as-is (the backends still have the raw pointer
address), but it lowers the bar to remove those.
It also allows to put the 'was table registered in the first place' check
into ip_tables.c rather than have it in each table sub module.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its the same function as ipt_unregister_table_exit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When I changed defrag hooks to no longer get registered by default I
intentionally made it so that registration can only be un-done by unloading
the nf_defrag_ipv4/6 module.
In hindsight this was too conservative; there is no reason to keep defrag
on while there is no feature dependency anymore.
Moreover, this won't work if user isn't allowed to remove nf_defrag module.
This adds the disable() functions for both ipv4 and ipv6 and calls them
from conntrack, TPROXY and the xtables socket module.
ipvs isn't converted here, it will behave as before this patch and
will need module removal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-04-23
The following pull-request contains BPF updates for your *net-next* tree.
We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).
The main changes are:
1) Add BPF static linker support for extern resolution of global, from Andrii.
2) Refine retval for bpf_get_task_stack helper, from Dave.
3) Add a bpf_snprintf helper, from Florent.
4) A bunch of miscellaneous improvements from many developers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Add vlan match and pop actions to the flowtable offload,
patches from wenxu.
2) Reduce size of the netns_ct structure, which itself is
embedded in struct net Make netns_ct a read-mostly structure.
Patches from Florian Westphal.
3) Add FLOW_OFFLOAD_XMIT_UNSPEC to skip dst check from garbage
collector path, as required by the tc CT action. From Roi Dayan.
4) VLAN offload fixes for nftables: Allow for matching on both s-vlan
and c-vlan selectors. Fix match of VLAN id due to incorrect
byteorder. Add a new routine to properly populate flow dissector
ethertypes.
5) Missing keys in ip{6}_route_me_harder() results in incorrect
routes. This includes an update for selftest infra. Patches
from Ido Schimmel.
6) Add counter hardware offload support through FLOW_CLS_STATS.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, a multi-part nexthop dump is restarted based on the number of
nexthops that have been dumped so far. This can result in a lot of
nexthops not being dumped when nexthops are simultaneously deleted:
# ip nexthop | wc -l
65536
# ip nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 36040 nexthops
# ip nexthop | wc -l
29496
Instead, restart the dump based on the nexthop identifier (fixed number)
of the last successfully dumped nexthop:
# ip nexthop | wc -l
65536
# ip nexthop flush
Dump was interrupted and may be inconsistent.
Flushed 65536 nexthops
# ip nexthop | wc -l
0
Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
Tested-by: Maksym Yaremchuk <maksymy@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netfilter tries to reroute mangled packets as a different route might
need to be used following the mangling. When this happens, netfilter
does not populate the IP protocol, the source port and the destination
port in the flow key. Therefore, FIB rules that match on these fields
are ignored and packets can be misrouted.
Solve this by dissecting the outer flow and populating the flow key
before rerouting the packet. Note that flow dissection only happens when
FIB rules that match on these fields are installed, so in the common
case there should not be a penalty.
Reported-by: Michal Soltys <msoltyspl@yandex.pl>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
- keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
- fix build after move to net_generic
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2021-04-14
Not much this time:
1) Simplification of some variable calculations in esp4 and esp6.
From Jiapeng Chong and Junlin Yang.
2) Fix a clang Wformat warning in esp6 and ah6.
From Arnd Bergmann.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, tcp_allowed_congestion_control is global and writable;
writing to it in any net namespace will leak into all other net
namespaces.
tcp_available_congestion_control and tcp_allowed_congestion_control are
the only sysctls in ipv4_net_table (the per-netns sysctl table) with a
NULL data pointer; their handlers (proc_tcp_available_congestion_control
and proc_allowed_congestion_control) have no other way of referencing a
struct net. Thus, they operate globally.
Because ipv4_net_table does not use designated initializers, there is no
easy way to fix up this one "bad" table entry. However, the data pointer
updating logic shouldn't be applied to NULL pointers anyway, so we
instead force these entries to be read-only.
These sysctls used to exist in ipv4_table (init-net only), but they were
moved to the per-net ipv4_net_table, presumably without realizing that
tcp_allowed_congestion_control was writable and thus introduced a leak.
Because the intent of that commit was only to know (i.e. read) "which
congestion algorithms are available or allowed", this read-only solution
should be sufficient.
The logic added in recent commit
31c4d2f160: ("net: Ensure net namespace isolation of sysctls")
does not and cannot check for NULL data pointers, because
other table entries (e.g. /proc/sys/net/netfilter/nf_log/) have
.data=NULL but use other methods (.extra2) to access the struct net.
Fixes: 9cb8e048e5 ("net/ipv4/sysctl: show tcp_{allowed, available}_congestion_control in non-initial netns")
Signed-off-by: Jonathon Reinhart <jonathon.reinhart@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current icmp_rcv function drops all unknown ICMP types, including
ICMP_EXT_ECHOREPLY (type 43). In order to parse Extended Echo Reply messages, we have
to pass these packets to the ping_rcv function, which does not do any
other filtering and passes the packet to the designated socket.
Pass incoming RFC 8335 ICMP Extended Echo Reply packets to the ping_rcv
handler instead of discarding the packet.
Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix NAT IPv6 offload in the flowtable.
2) icmpv6 is printed as unknown in /proc/net/nf_conntrack.
3) Use div64_u64() in nft_limit, from Eric Dumazet.
4) Use pre_exit to unregister ebtables and arptables hooks,
from Florian Westphal.
5) Fix out-of-bound memset in x_tables compat match/target,
also from Florian.
6) Clone set elements expression to ensure proper initialization.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
xt_compat_match/target_from_user doesn't check that zeroing the area
to start of next rule won't write past end of allocated ruleset blob.
Remove this code and zero the entire blob beforehand.
Reported-by: syzbot+cfc0247ac173f597aaaa@syzkaller.appspotmail.com
Reported-by: Andy Nguyen <theflow@google.com>
Fixes: 9fa492cdc1 ("[NETFILTER]: x_tables: simplify compat API")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>