IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Blamed commit changed:
ptr = kmalloc(size);
if (ptr)
size = ksize(ptr);
to:
size = kmalloc_size_roundup(size);
ptr = kmalloc(size);
This allowed various crash as reported by syzbot [1]
and Kyle Zeng.
Problem is that if @size is bigger than 0x80000001,
kmalloc_size_roundup(size) returns 2^32.
kmalloc_reserve() uses a 32bit variable (obj_size),
so 2^32 is truncated to 0.
kmalloc(0) returns ZERO_SIZE_PTR which is not handled by
skb allocations.
Following trace can be triggered if a netdev->mtu is set
close to 0x7fffffff
We might in the future limit netdev->mtu to more sensible
limit (like KMALLOC_MAX_SIZE).
This patch is based on a syzbot report, and also a report
and tentative fix from Kyle Zeng.
[1]
BUG: KASAN: user-memory-access in __build_skb_around net/core/skbuff.c:294 [inline]
BUG: KASAN: user-memory-access in __alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
Write of size 32 at addr 00000000fffffd10 by task syz-executor.4/22554
CPU: 1 PID: 22554 Comm: syz-executor.4 Not tainted 6.1.39-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
Call trace:
dump_backtrace+0x1c8/0x1f4 arch/arm64/kernel/stacktrace.c:279
show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:286
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x120/0x1a0 lib/dump_stack.c:106
print_report+0xe4/0x4b4 mm/kasan/report.c:398
kasan_report+0x150/0x1ac mm/kasan/report.c:495
kasan_check_range+0x264/0x2a4 mm/kasan/generic.c:189
memset+0x40/0x70 mm/kasan/shadow.c:44
__build_skb_around net/core/skbuff.c:294 [inline]
__alloc_skb+0x3c4/0x6e8 net/core/skbuff.c:527
alloc_skb include/linux/skbuff.h:1316 [inline]
igmpv3_newpack+0x104/0x1088 net/ipv4/igmp.c:359
add_grec+0x81c/0x1124 net/ipv4/igmp.c:534
igmpv3_send_cr net/ipv4/igmp.c:667 [inline]
igmp_ifc_timer_expire+0x1b0/0x1008 net/ipv4/igmp.c:810
call_timer_fn+0x1c0/0x9f0 kernel/time/timer.c:1474
expire_timers kernel/time/timer.c:1519 [inline]
__run_timers+0x54c/0x710 kernel/time/timer.c:1790
run_timer_softirq+0x28/0x4c kernel/time/timer.c:1803
_stext+0x380/0xfbc
____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:79
call_on_irq_stack+0x24/0x4c arch/arm64/kernel/entry.S:891
do_softirq_own_stack+0x20/0x2c arch/arm64/kernel/irq.c:84
invoke_softirq kernel/softirq.c:437 [inline]
__irq_exit_rcu+0x1c0/0x4cc kernel/softirq.c:683
irq_exit_rcu+0x14/0x78 kernel/softirq.c:695
el0_interrupt+0x7c/0x2e0 arch/arm64/kernel/entry-common.c:717
__el0_irq_handler_common+0x18/0x24 arch/arm64/kernel/entry-common.c:724
el0t_64_irq_handler+0x10/0x1c arch/arm64/kernel/entry-common.c:729
el0t_64_irq+0x1a0/0x1a4 arch/arm64/kernel/entry.S:584
Fixes: 12d6c1d3a2 ("skbuff: Proactively round up to kmalloc bucket size")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is the big set of tty and serial driver changes for 6.6-rc1.
Lots of cleanups in here this cycle, and some driver updates. Short
summary is:
- Jiri's continued work to make the tty code and apis be a bit more
sane with regards to modern kernel coding style and types
- cpm_uart driver updates
- n_gsm updates and fixes
- meson driver updates
- sc16is7xx driver updates
- 8250 driver updates for different hardware types
- qcom-geni driver fixes
- tegra serial driver change
- stm32 driver updates
- synclink_gt driver cleanups
- tty structure size reduction
All of these have been in linux-next this week with no reported issues.
The last bit of cleanups from Jiri and the tty structure size reduction
came in last week, a bit late but as they were just style changes and
size reductions, I figured they should get into this merge cycle so that
others can work on top of them with no merge conflicts.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZPH+jA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykKyACgldt6QeenTN+6dXIHS/eQHtTKZwMAn3arSeXI
QrUUnLFjOWyoX87tbMBQ
=LVw0
-----END PGP SIGNATURE-----
Merge tag 'tty-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here is the big set of tty and serial driver changes for 6.6-rc1.
Lots of cleanups in here this cycle, and some driver updates. Short
summary is:
- Jiri's continued work to make the tty code and apis be a bit more
sane with regards to modern kernel coding style and types
- cpm_uart driver updates
- n_gsm updates and fixes
- meson driver updates
- sc16is7xx driver updates
- 8250 driver updates for different hardware types
- qcom-geni driver fixes
- tegra serial driver change
- stm32 driver updates
- synclink_gt driver cleanups
- tty structure size reduction
All of these have been in linux-next this week with no reported
issues. The last bit of cleanups from Jiri and the tty structure size
reduction came in last week, a bit late but as they were just style
changes and size reductions, I figured they should get into this merge
cycle so that others can work on top of them with no merge conflicts"
* tag 'tty-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (199 commits)
tty: shrink the size of struct tty_struct by 40 bytes
tty: n_tty: deduplicate copy code in n_tty_receive_buf_real_raw()
tty: n_tty: extract ECHO_OP processing to a separate function
tty: n_tty: unify counts to size_t
tty: n_tty: use u8 for chars and flags
tty: n_tty: simplify chars_in_buffer()
tty: n_tty: remove unsigned char casts from character constants
tty: n_tty: move newline handling to a separate function
tty: n_tty: move canon handling to a separate function
tty: n_tty: use MASK() for masking out size bits
tty: n_tty: make n_tty_data::num_overrun unsigned
tty: n_tty: use time_is_before_jiffies() in n_tty_receive_overrun()
tty: n_tty: use 'num' for writes' counts
tty: n_tty: use output character directly
tty: n_tty: make flow of n_tty_receive_buf_common() a bool
Revert "tty: serial: meson: Add a earlycon for the T7 SoC"
Documentation: devices.txt: Fix minors for ttyCPM*
Documentation: devices.txt: Remove ttySIOC*
Documentation: devices.txt: Remove ttyIOC*
serial: 8250_bcm7271: improve bcm7271 8250 port
...
Route hints when the nexthop is part of a multipath group causes packets
in the same receive batch to be sent to the same nexthop irrespective of
the multipath hash of the packet. So, do not extract route hint for
packets whose destination is part of a multipath group.
A new SKB flag IP6SKB_MULTIPATH is introduced for this purpose, set the
flag when route is looked up in fib6_select_path() and use it in
ip6_can_use_hint() to check for the existence of the flag.
Fixes: 197dbf24e3 ("ipv6: introduce and uses route look hints for list input.")
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Route hints when the nexthop is part of a multipath group causes packets
in the same receive batch to be sent to the same nexthop irrespective of
the multipath hash of the packet. So, do not extract route hint for
packets whose destination is part of a multipath group.
A new SKB flag IPSKB_MULTIPATH is introduced for this purpose, set the
flag when route is looked up in ip_mkroute_input() and use it in
ip_extract_route_hint() to check for the existence of the flag.
Fixes: 02b2494161 ("ipv4: use dst hint for ipv4 list receive")
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit bf5c25d608 ("skbuff: in skb_segment, call zerocopy functions
once per nskb") added the call to zero copy functions in skb_segment().
The change introduced a bug in skb_segment() because skb_orphan_frags()
may possibly change the number of fragments or allocate new fragments
altogether leaving nrfrags and frag to point to the old values. This can
cause a panic with stacktrace like the one below.
[ 193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc
[ 193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G O 5.15.123+ #26
[ 193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0
[ 194.021892] Call Trace:
[ 194.027422] <TASK>
[ 194.072861] tcp_gso_segment+0x107/0x540
[ 194.082031] inet_gso_segment+0x15c/0x3d0
[ 194.090783] skb_mac_gso_segment+0x9f/0x110
[ 194.095016] __skb_gso_segment+0xc1/0x190
[ 194.103131] netem_enqueue+0x290/0xb10 [sch_netem]
[ 194.107071] dev_qdisc_enqueue+0x16/0x70
[ 194.110884] __dev_queue_xmit+0x63b/0xb30
[ 194.121670] bond_start_xmit+0x159/0x380 [bonding]
[ 194.128506] dev_hard_start_xmit+0xc3/0x1e0
[ 194.131787] __dev_queue_xmit+0x8a0/0xb30
[ 194.138225] macvlan_start_xmit+0x4f/0x100 [macvlan]
[ 194.141477] dev_hard_start_xmit+0xc3/0x1e0
[ 194.144622] sch_direct_xmit+0xe3/0x280
[ 194.147748] __dev_queue_xmit+0x54a/0xb30
[ 194.154131] tap_get_user+0x2a8/0x9c0 [tap]
[ 194.157358] tap_sendmsg+0x52/0x8e0 [tap]
[ 194.167049] handle_tx_zerocopy+0x14e/0x4c0 [vhost_net]
[ 194.173631] handle_tx+0xcd/0xe0 [vhost_net]
[ 194.176959] vhost_worker+0x76/0xb0 [vhost]
[ 194.183667] kthread+0x118/0x140
[ 194.190358] ret_from_fork+0x1f/0x30
[ 194.193670] </TASK>
In this case calling skb_orphan_frags() updated nr_frags leaving nrfrags
local variable in skb_segment() stale. This resulted in the code hitting
i >= nrfrags prematurely and trying to move to next frag_skb using
list_skb pointer, which was NULL, and caused kernel panic. Move the call
to zero copy functions before using frags and nr_frags.
Fixes: bf5c25d608 ("skbuff: in skb_segment, call zerocopy functions once per nskb")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reported-by: Amit Goyal <agoyal@purestorage.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_bind_phc is read locklessly. Add corresponding annotations.
Fixes: d463126e23 ("net: sock: extend SO_TIMESTAMPING for PHC binding")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_tsflags can be read locklessly, add corresponding annotations.
Fixes: b9f40e21ef ("net-timestamp: move timestamp flags out of sk_flags")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
msk->rmem_fwd_alloc can be read locklessly.
Add mptcp_rmem_fwd_alloc_add(), similar to sk_forward_alloc_add(),
and appropriate READ_ONCE()/WRITE_ONCE() annotations.
Fixes: 6511882cdd ("mptcp: allocate fwd memory separately on the rx and tx path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every time sk->sk_forward_alloc is read locklessly,
add a READ_ONCE().
Add sk_forward_alloc_add() helper to centralize updates,
to reduce number of WRITE_ONCE().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_sk_diag_fill() has been changed to use sk_forward_alloc_get(),
but sk_get_meminfo() was forgotten.
Fixes: 292e6077b0 ("net: introduce sk_forward_alloc_get()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZPD2qwAKCRDbK58LschI
gzy9APoCsV3B0rJCX2PnxoKmx7ZwAbEhWRHN3iDAGgEOwuAdLQEAi1Mafivr/4Rr
WLi6AQOy+Erv7dAQRq2KbR2yE8rkEgg=
=BJ9X
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-08-31
We've added 15 non-merge commits during the last 3 day(s) which contain
a total of 17 files changed, 468 insertions(+), 97 deletions(-).
The main changes are:
1) BPF selftest fixes: one flake and one related to clang18 testing,
from Yonghong Song.
2) Fix a d_path BPF selftest failure after fast-forward from Linus'
tree, from Jiri Olsa.
3) Fix a preempt_rt splat in sockmap when using raw_spin_lock_t,
from John Fastabend.
4) Fix a xsk_diag_fill use-after-free race during socket cleanup,
from Magnus Karlsson.
5) Fix xsk_build_skb to address a buggy dereference of an ERR_PTR(),
from Tirthendu Sarkar.
6) Fix a bpftool build warning when compiled with -Wtype-limits,
from Yafang Shao.
7) Several misc fixes and cleanups in standardization docs,
from David Vernet.
8) Fix BPF selftest install to consider no_alu32/cpuv4/bpf-gcc flavors,
from Björn Töpel.
9) Annotate a data race in bpf_long_memcpy for KCSAN, from Daniel Borkmann.
10) Extend documentation with a description for CO-RE relocations,
from Eduard Zingerman.
11) Fix several invalid escape sequence warnings in bpf_doc.py script,
from Vishal Chourasia.
12) Fix the instruction set doc wrt offset of BPF-to-BPF call,
from Will Hawkins.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Include build flavors for install target
bpf: Annotate bpf_long_memcpy with data_race
selftests/bpf: Fix d_path test
bpf, docs: Fix invalid escape sequence warnings in bpf_doc.py
xsk: Fix xsk_diag use-after-free error during socket cleanup
bpf, docs: s/eBPF/BPF in standards documents
bpf, docs: Add abi.rst document to standardization subdirectory
bpf, docs: Move linux-notes.rst to root bpf docs tree
bpf, sockmap: Fix preempt_rt splat when using raw_spin_lock_t
docs/bpf: Add description for CO-RE relocations
bpf, docs: Correct source of offset for program-local call
selftests/bpf: Fix flaky cgroup_iter_sleepable subtest
xsk: Fix xsk_build_skb() error: 'skb' dereferencing possible ERR_PTR()
bpftool: Fix build warnings with -Wtype-limits
bpf: Prevent inlining of bpf_fentry_test7()
====================
Link: https://lore.kernel.org/r/20230831210019.14417-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
New Features:
* Enable the NFS v4.2 READ_PLUS operation by default
Stable Fixes:
* NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info
* NFS: Fix a potential data corruption
Bugfixes:
* Fix various READ_PLUS issues including:
* smatch warnings
* xdr size calculations
* scratch buffer handling
* 32bit / highmem xdr page handling
* Fix checkpatch errors in file.c
* Fix redundant readdir request after an EOF
* Fix handling of COPY ERR_OFFLOAD_NO_REQ
* Fix assignment of xprtdata.cred
Cleanups:
* Remove unused xprtrdma function declarations
* Clean up an integer overflow check to avoid a warning
* Clean up #includes in dns_resolve.c
* Clean up nfs4_get_device_info so we don't pass a NULL pointer to __free_page()
* Clean up sunrpc TCP socket timeout configuration
* Guard against READDIR loops when entry names are too long
* Use EXCHID4_FLAG_USE_PNFS_DS for DS servers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmTwzwYACgkQ18tUv7Cl
QOtIhBAA+BOh7MB6yjlyctFxABJiXz2x2Dehxy7Ox15LfnyStqQAUEpk35CXWvjC
iNxpZJ486+WrzM76WGEaRbECK9nTQLK1yacR3V1zpnDwHWIJA6VHN6qU4JAfSMu7
XbhWkHWry6d7PXhvqHlaiYvPX2pF39wUzfH+vLlzS2QLIkpT6LnG0zVRJTQvLCmq
zE5xD+NCQ1Dpo9VnouuzW7VVfm532hI7GQNrpo0E0vWKgeQD+/fOpDu23MW8A1Ua
ZgVMAc7vScgDZH8/20Ze5PH4jAEB4gwEIzjreQlIXr7Tf+mE7qn435lgOuvdMQCW
eHhdNriZ2X6HMLhNFFpup8bkRKGCCTooTHC1W66n9CuxIAuVT5DNwBbakpagHSZf
J4ho81hEgBfc5zppISVjV6eFK4brM0rF9AliaIw8r/qGcMmO1CILi9tLGiheiJcT
LuId7U2sE/vfIa6SiBt7rx37/MkrgLlAgjpk4dCRJQW+gKVBi09sMGnDlgaRvCZz
T0WCsK4DgI9q2rScpwJYJbNWbC2Q8qUtYWW9LSRvwhbNdm/VbRnEHWA7eOwqqm8r
KkkF4chyoTJqpnF3SjxT/lyFk6GwsD+wXafOmEeuFA6Si3dHDU9i3aUf+cCXhwRI
uUzCUHYcCKnv4QVGPuAbIdxMgueNCuLoeWgTClVlqidv7GRyz7Y=
=rjmq
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Enable the NFS v4.2 READ_PLUS operation by default
Stable Fixes:
- NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info
- NFS: Fix a potential data corruption
Bugfixes:
- Fix various READ_PLUS issues including:
- smatch warnings
- xdr size calculations
- scratch buffer handling
- 32bit / highmem xdr page handling
- Fix checkpatch errors in file.c
- Fix redundant readdir request after an EOF
- Fix handling of COPY ERR_OFFLOAD_NO_REQ
- Fix assignment of xprtdata.cred
Cleanups:
- Remove unused xprtrdma function declarations
- Clean up an integer overflow check to avoid a warning
- Clean up #includes in dns_resolve.c
- Clean up nfs4_get_device_info so we don't pass a NULL pointer
to __free_page()
- Clean up sunrpc TCP socket timeout configuration
- Guard against READDIR loops when entry names are too long
- Use EXCHID4_FLAG_USE_PNFS_DS for DS servers"
* tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (22 commits)
pNFS: Fix assignment of xprtdata.cred
NFSv4.2: fix handling of COPY ERR_OFFLOAD_NO_REQ
NFS: Guard against READDIR loop when entry names exceed MAXNAMELEN
NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server
NFS/pNFS: Set the connect timeout for the pNFS flexfiles driver
SUNRPC: Don't override connect timeouts in rpc_clnt_add_xprt()
SUNRPC: Allow specification of TCP client connect timeout at setup
SUNRPC: Refactor and simplify connect timeout
SUNRPC: Set the TCP_SYNCNT to match the socket timeout
NFS: Fix a potential data corruption
nfs: fix redundant readdir request after get eof
nfs/blocklayout: Use the passed in gfp flags
filemap: Fix errors in file.c
NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info
NFS: Move common includes outside ifdef
SUNRPC: clean up integer overflow check
xprtrdma: Remove unused function declaration rpcrdma_bc_post_recv()
NFS: Enable the READ_PLUS operation by default
SUNRPC: kmap() the xdr pages during decode
NFSv4.2: Rework scratch handling for READ_PLUS (again)
...
I'm thrilled to announce that the Linux in-kernel NFS server now
offers NFSv4 write delegations. A write delegation enables a client
to cache data and metadata for a single file more aggressively,
reducing network round trips and server workload. Many thanks to Dai
Ngo for contributing this facility, and to Jeff Layton and Neil
Brown for reviewing and testing it.
This release also sees the removal of all support for DES- and
triple-DES-based Kerberos encryption types in the kernel's SunRPC
implementation. These encryption types have been deprecated by the
Internet community for years and are considered insecure. This
change affects both the in-kernel NFS client and server.
The server's UDP and TCP socket transports have now fully adopted
David Howells' new bio_vec iterator so that no more than one
sendmsg() call is needed to transmit each RPC message. In
particular, this helps kTLS optimize record boundaries when sending
RPC-with-TLS replies, and it takes the server a baby step closer to
handling file I/O via folios.
We've begun work on overhauling the SunRPC thread scheduler to
remove a costly linked-list walk when looking for an idle RPC
service thread to wake. The pre-requisites are included in this
release. Thanks to Neil Brown for his ongoing work on this
improvement.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmTwoa0ACgkQM2qzM29m
f5cZvw/8CmFVNC27aMrJEhRRhwwrXbLzUkWh9GCYkG98PHYiLxLTvZ6qELXAax/a
UjSgIDSRcWl4z8M/tyBQtgsw7NADr+7XWqEoXR7HZ5pEEC/KNGM0oQWQ92ojjKYy
JmHdB02uaDJfcd9ioFTU13cw7q2BQfoe2xLI8yqis2vcVSu92AM7aIw+cvJIpwQB
inA3TIIsYTV/gPByXSfEtvmYACadoFiMvfvYwaWhjFS9MdSzFmcVG0Dp3EFIig29
odmWEofcz6uIvUWvUswWEGdoSu7uOKIztSuAI4PlTwaofUaSKG6e5kmtpr3cLERD
Uhg2lm5JgqkXBd7QHObNimJ4DtQzFwHmhA08qo8rd/zba75mn/Hr5IF0q3Rxs99J
SRYHcAeP8afKn5Ge0yzoTgCNcqhfz8KLRfoCQX49mljr+muNxld4nMklD2KdUwJi
XEB512/q3E4nUgopXZiSJYQYAq1CfdR5WpGipZ9X0XK9HZBDF/qhXGtk1YQuNWyj
ZxJS3bfBza4oVIvP5/ehjCIQwOvqkcrC5zZGDIgDvw9Q6L3L1wqmVntsdCLCLRcJ
jB4IOsj+DECfJ6w2vP2SZ3GeMtFnyuTQjhUTkjPuAKTBBiKo4Tj0o/agpfDYbWZy
1l3a2yH5jqJgkm4MaVh3YHRJGc0ub0ccpIrs3QQ4jvjMLQ/3Gcs=
=XGHs
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"I'm thrilled to announce that the Linux in-kernel NFS server now
offers NFSv4 write delegations. A write delegation enables a client to
cache data and metadata for a single file more aggressively, reducing
network round trips and server workload. Many thanks to Dai Ngo for
contributing this facility, and to Jeff Layton and Neil Brown for
reviewing and testing it.
This release also sees the removal of all support for DES- and
triple-DES-based Kerberos encryption types in the kernel's SunRPC
implementation. These encryption types have been deprecated by the
Internet community for years and are considered insecure. This change
affects both the in-kernel NFS client and server.
The server's UDP and TCP socket transports have now fully adopted
David Howells' new bio_vec iterator so that no more than one sendmsg()
call is needed to transmit each RPC message. In particular, this helps
kTLS optimize record boundaries when sending RPC-with-TLS replies, and
it takes the server a baby step closer to handling file I/O via
folios.
We've begun work on overhauling the SunRPC thread scheduler to remove
a costly linked-list walk when looking for an idle RPC service thread
to wake. The pre-requisites are included in this release. Thanks to
Neil Brown for his ongoing work on this improvement"
* tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (56 commits)
Documentation: Add missing documentation for EXPORT_OP flags
SUNRPC: Remove unused declaration rpc_modcount()
SUNRPC: Remove unused declarations
NFSD: da_addr_body field missing in some GETDEVICEINFO replies
SUNRPC: Remove return value of svc_pool_wake_idle_thread()
SUNRPC: make rqst_should_sleep() idempotent()
SUNRPC: Clean up svc_set_num_threads
SUNRPC: Count ingress RPC messages per svc_pool
SUNRPC: Deduplicate thread wake-up code
SUNRPC: Move trace_svc_xprt_enqueue
SUNRPC: Add enum svc_auth_status
SUNRPC: change svc_xprt::xpt_flags bits to enum
SUNRPC: change svc_rqst::rq_flags bits to enum
SUNRPC: change svc_pool::sp_flags bits to enum
SUNRPC: change cache_head.flags bits to enum
SUNRPC: remove timeout arg from svc_recv()
SUNRPC: change svc_recv() to return void.
SUNRPC: call svc_process() from svc_recv().
nfsd: separate nfsd_last_thread() from nfsd_put()
nfsd: Simplify code around svc_exit_thread() call in nfsd()
...
Fix a use-after-free error that is possible if the xsk_diag interface
is used after the socket has been unbound from the device. This can
happen either due to the socket being closed or the device
disappearing. In the early days of AF_XDP, the way we tested that a
socket was not bound to a device was to simply check if the netdevice
pointer in the xsk socket structure was NULL. Later, a better system
was introduced by having an explicit state variable in the xsk socket
struct. For example, the state of a socket that is on the way to being
closed and has been unbound from the device is XSK_UNBOUND.
The commit in the Fixes tag below deleted the old way of signalling
that a socket is unbound, setting dev to NULL. This in the belief that
all code using the old way had been exterminated. That was
unfortunately not true as the xsk diagnostics code was still using the
old way and thus does not work as intended when a socket is going
down. Fix this by introducing a test against the state variable. If
the socket is in the state XSK_UNBOUND, simply abort the diagnostic's
netlink operation.
Fixes: 18b1ab7aa7 ("xsk: Fix race at socket teardown")
Reported-by: syzbot+822d1359297e2694f873@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+822d1359297e2694f873@syzkaller.appspotmail.com
Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230831100119.17408-1-magnus.karlsson@gmail.com
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmTv1cgACgkQ1V2XiooU
IOSkexAAr7A9dsFLmP5gESXxkMIeJ+ZdwPeKd+XYm1qe+3AFAPd6fo1UxNKIoCD2
aEff/MJy+SImc3khnjRJ1swXZNI1FFOQMNhywqVvWdpT/Z4SB+yt3bYIXM9CG7qQ
X1s5n8zlP1WntCk716LGdwEcaA+hcmMTgVPaVPPJOZpNBQsZaB9TjuPMcI05zTBz
xkNwRAZFxssCwS0/bDqHH8guC95AxDdzgZRjGL9y1786y10/qNet9/WWxcx+MwTp
K8xA3WUPWiBMcY1N1amYb44tzMLWaLedGGzcuDFMth8s+pyxGOJM/QsNUyNG0qPr
9I4bIZWjgsi6OAFXLJQacXH0hXohChIyXFTq3yq09M5AG5EKQi0I9Da1olOtYER7
6yvaEFQICyGWcY9eg1tlqr6ZioKx/3g9Xa54jPcldXl3U0+qhuUj6qIkXrGbnahy
yizTpozEmMxFevdMbjCEZ6dRWixjFmB66KeVLuoyBpZXHoXvyGbCKkz/GJ45bFW8
gVgSTLZYtQYYCWA2CIdr3ucXlzibWnhh6b+yB3IYncHUitXu029IA51eeOuuEL9N
XAgExjAUU5GwMo6iPFEnIsPM5scYNkwnt90SaW6DaPVqAahqyZ2e84IYzw7zod2v
z1IGGp+tjUXIsBMXzLX96JBXWlK6w5nMDnYag4UUQia3X3fUZmc=
=2mzW
-----END PGP SIGNATURE-----
Merge tag 'nf-23-08-31' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix mangling of TCP options with non-linear skbuff, from Xiao Liang.
2) OOB read in xt_sctp due to missing sanitization of array length field.
From Wander Lairson Costa.
3) OOB read in xt_u32 due to missing sanitization of array length field.
Also from Wander Lairson Costa.
All of them above, always broken for several releases.
4) Missing audit log for set element reset command, from Phil Sutter.
5) Missing audit log for rule reset command, also from Phil.
These audit log support are missing in 6.5.
* tag 'nf-23-08-31' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: Audit log rule reset
netfilter: nf_tables: Audit log setelem reset
netfilter: xt_u32: validate user space input
netfilter: xt_sctp: validate the flag_info count
netfilter: nft_exthdr: Fix non-linear header modification
====================
Link: https://lore.kernel.org/r/20230830235935.465690-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Resetting rules' stateful data happens outside of the transaction logic,
so 'get' and 'dump' handlers have to emit audit log entries themselves.
Fixes: 8daa8fde3f ("netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since set element reset is not integrated into nf_tables' transaction
logic, an explicit log call is needed, similar to NFT_MSG_GETOBJ_RESET
handling.
For the sake of simplicity, catchall element reset will always generate
a dedicated log entry. This relieves nf_tables_dump_set() from having to
adjust the logged element count depending on whether a catchall element
was found or not.
Fixes: 079cd63321 ("netfilter: nf_tables: Introduce NFT_MSG_GETSETELEM_RESET")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The xt_u32 module doesn't validate the fields in the xt_u32 structure.
An attacker may take advantage of this to trigger an OOB read by setting
the size fields with a value beyond the arrays boundaries.
Add a checkentry function to validate the structure.
This was originally reported by the ZDI project (ZDI-CAN-18408).
Fixes: 1b50b8a371 ("[NETFILTER]: Add u32 match")
Cc: stable@vger.kernel.org
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
sctp_mt_check doesn't validate the flag_count field. An attacker can
take advantage of that to trigger a OOB read and leak memory
information.
Add the field validation in the checkentry function.
Fixes: 2e4e6a17af ("[NETFILTER] x_tables: Abstraction layer for {ip,ip6,arp}_tables")
Cc: stable@vger.kernel.org
Reported-by: Lucas Leong <wmliang@infosec.exchange>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix skb_ensure_writable() size. Don't use nft_tcp_header_pointer() to
make it explicit that pointers point to the packet (not local buffer).
Fixes: 99d1712bc4 ("netfilter: exthdr: tcp option set support")
Fixes: 7890cbea66 ("netfilter: exthdr: add support for tcp option removal")
Cc: stable@vger.kernel.org
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
commit edf391ff17 ("snmp: add missing counters for RFC 4293") had
already added OutOctets for RFC 4293. In commit 2d8dbb04c6 ("snmp: fix
OutOctets counter to include forwarded datagrams"), OutOctets was
counted again, but not removed from ip_output().
According to RFC 4293 "3.2.3. IP Statistics Tables",
ipipIfStatsOutTransmits is not equal to ipIfStatsOutForwDatagrams. So
"IPSTATS_MIB_OUTOCTETS must be incremented when incrementing" is not
accurate. And IPSTATS_MIB_OUTOCTETS should be counted after fragment.
This patch reverts commit 2d8dbb04c6 ("snmp: fix OutOctets counter to
include forwarded datagrams") and move IPSTATS_MIB_OUTOCTETS to
ip_finish_output2 for ipv4.
Reviewed-by: Filip Pudak <filip.pudak@windriver.com>
Signed-off-by: Heng Guo <heng.guo@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sockmap and sockhash maps are a collection of psocks that are
objects representing a socket plus a set of metadata needed
to manage the BPF programs associated with the socket. These
maps use the stab->lock to protect from concurrent operations
on the maps, e.g. trying to insert to objects into the array
at the same time in the same slot. Additionally, a sockhash map
has a bucket lock to protect iteration and insert/delete into
the hash entry.
Each psock has a psock->link which is a linked list of all the
maps that a psock is attached to. This allows a psock (socket)
to be included in multiple sockmap and sockhash maps. This
linked list is protected the psock->link_lock.
They _must_ be nested correctly to avoid deadlock:
lock(stab->lock)
: do BPF map operations and psock insert/delete
lock(psock->link_lock)
: add map to psock linked list of maps
unlock(psock->link_lock)
unlock(stab->lock)
For non PREEMPT_RT kernels both raw_spin_lock_t and spin_lock_t
are guaranteed to not sleep. But, with PREEMPT_RT kernels the
spin_lock_t variants may sleep. In the current code we have
many patterns like this:
rcu_critical_section:
raw_spin_lock(stab->lock)
spin_lock(psock->link_lock) <- may sleep ouch
spin_unlock(psock->link_lock)
raw_spin_unlock(stab->lock)
rcu_critical_section
Nesting spin_lock() inside a raw_spin_lock() violates locking
rules for PREEMPT_RT kernels. And additionally we do alloc(GFP_ATOMICS)
inside the stab->lock, but those might sleep on PREEMPT_RT kernels.
The result is splats like this:
./test_progs -t sockmap_basic
[ 33.344330] bpf_testmod: loading out-of-tree module taints kernel.
[ 33.441933]
[ 33.442089] =============================
[ 33.442421] [ BUG: Invalid wait context ]
[ 33.442763] 6.5.0-rc5-01731-gec0ded2e0282 #4958 Tainted: G O
[ 33.443320] -----------------------------
[ 33.443624] test_progs/2073 is trying to lock:
[ 33.443960] ffff888102a1c290 (&psock->link_lock){....}-{3:3}, at: sock_map_update_common+0x2c2/0x3d0
[ 33.444636] other info that might help us debug this:
[ 33.444991] context-{5:5}
[ 33.445183] 3 locks held by test_progs/2073:
[ 33.445498] #0: ffff88811a208d30 (sk_lock-AF_INET){+.+.}-{0:0}, at: sock_map_update_elem_sys+0xff/0x330
[ 33.446159] #1: ffffffff842539e0 (rcu_read_lock){....}-{1:3}, at: sock_map_update_elem_sys+0xf5/0x330
[ 33.446809] #2: ffff88810d687240 (&stab->lock){+...}-{2:2}, at: sock_map_update_common+0x177/0x3d0
[ 33.447445] stack backtrace:
[ 33.447655] CPU: 10 PID
To fix observe we can't readily remove the allocations (for that
we would need to use/create something similar to bpf_map_alloc). So
convert raw_spin_lock_t to spin_lock_t. We note that sock_map_update
that would trigger the allocate and potential sleep is only allowed
through sys_bpf ops and via sock_ops which precludes hw interrupts
and low level atomic sections in RT preempt kernel. On non RT
preempt kernel there are no changes here and spin locks sections
and alloc(GFP_ATOMIC) are still not sleepable.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230830053517.166611-1-john.fastabend@gmail.com
Currently, xsk_build_skb() is a function that builds skb in two possible
ways and then is ended with common error handling.
We can distinguish four possible error paths and handling in xsk_build_skb():
1. sock_alloc_send_skb fails: Retry (skb is NULL).
2. skb_store_bits fails : Free skb and retry.
3. MAX_SKB_FRAGS exceeded: Free skb, cleanup and drop packet.
4. alloc_page fails for frag: Retry page allocation w/o freeing skb
1] and 3] can happen in xsk_build_skb_zerocopy(), which is one of the
two code paths responsible for building skb. Common error path in
xsk_build_skb() assumes that in case errno != -EAGAIN, skb is a valid
pointer, which is wrong as kernel test robot reports that in
xsk_build_skb_zerocopy() other errno values are returned for skb being
NULL.
To fix this, set -EOVERFLOW as error when MAX_SKB_FRAGS are exceeded
and packet needs to be dropped in both xsk_build_skb() and
xsk_build_skb_zerocopy() and use this to distinguish against all other
error cases. Also, add explicit kfree_skb() for 3] so that handling
of 1], 2], and 3] becomes identical where allocation needs to be retried.
Fixes: cf24f5a5fe ("xsk: add support for AF_XDP multi-buffer on Tx path")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Closes: https://lore.kernel.org/r/202307210434.OjgqFcbB-lkp@intel.com
Link: https://lore.kernel.org/bpf/20230823144713.2231808-1-tirthendu.sarkar@intel.com
With latest clang18, I hit test_progs failures for the following test:
#13/2 bpf_cookie/multi_kprobe_link_api:FAIL
#13/3 bpf_cookie/multi_kprobe_attach_api:FAIL
#13 bpf_cookie:FAIL
#75 fentry_fexit:FAIL
#76/1 fentry_test/fentry:FAIL
#76 fentry_test:FAIL
#80/1 fexit_test/fexit:FAIL
#80 fexit_test:FAIL
#110/1 kprobe_multi_test/skel_api:FAIL
#110/2 kprobe_multi_test/link_api_addrs:FAIL
#110/3 kprobe_multi_test/link_api_syms:FAIL
#110/4 kprobe_multi_test/attach_api_pattern:FAIL
#110/5 kprobe_multi_test/attach_api_addrs:FAIL
#110/6 kprobe_multi_test/attach_api_syms:FAIL
#110 kprobe_multi_test:FAIL
For example, for #13/2, the error messages are:
[...]
kprobe_multi_test_run:FAIL:kprobe_test7_result unexpected kprobe_test7_result: actual 0 != expected 1
[...]
kprobe_multi_test_run:FAIL:kretprobe_test7_result unexpected kretprobe_test7_result: actual 0 != expected 1
clang17 does not have this issue.
Further investigation shows that kernel func bpf_fentry_test7(), used in
the above tests, is inlined by the compiler although it is marked as
noinline.
int noinline bpf_fentry_test7(struct bpf_fentry_test_t *arg)
{
return (long)arg;
}
It is known that for simple functions like the above (e.g. just returning
a constant or an input argument), the clang compiler may still do inlining
for a noinline function. Adding 'asm volatile ("")' in the beginning of the
bpf_fentry_test7() can prevent inlining.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20230826200843.2210074-1-yonghong.song@linux.dev
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTs06gQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpq2jEACE+A8T1/teQmMA9PcIhG2gWSltlSmAKkVe
6Yy6FaXBc7M/1yINGWU+dam5xhTshEuGulgris5Yt8VcX7eas9KQvT1NGDa2KzP4
D44i4UbMD4z9E7arx67Eyvql0sd9LywQTa2nJB+o9yXmQUhTbkWPmMaIWmZi5QwP
KcOqzdve+xhWSwdzNsPltB3qnma/WDtguaauh41GksKpUVe+aDi8EDEnFyRTM2Be
HTTJEH2ZyxYwDzemR8xxr82eVz7KdTVDvn+ZoA/UM6I3SGj6SBmtj5mDLF1JKetc
I+IazsSCAE1kF7vmDbmxYiIvmE6d6I/3zwgGrKfH4dmysqClZvJQeG3/53XzSO5N
k0uJebB/S610EqQhAIZ1/KgjRVmrd75+2af+AxsC2pFeTaQRE4plCJgWhMbJleAE
XBnnC7TbPOWCKaZ6a5UKu8CE1FJnEROmOASPP6tRM30ah5JhlyLU4/4ce+uUPUQo
bhzv0uAQr5Ezs9JPawj+BTa3A4iLCpLzE+aSB46Czl166Eg57GR5tr5pQLtcwjus
pN5BO7o4y5BeWu1XeQQObQ5OiOBXqmbcl8aQepBbU4W8qRSCMPXYWe2+8jbxk3VV
3mZZ9iXxs71ntVDL8IeZYXH7NZK4MLIrqdeM5YwgpSioYAUjqxTm7a8I5I9NCvUx
DZIBNk2sAA==
=XS+G
-----END PGP SIGNATURE-----
Merge tag 'for-6.6/io_uring-2023-08-28' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"Fairly quiet round in terms of features, mostly just improvements all
over the map for existing code. In detail:
- Initial support for socket operations through io_uring. Latter half
of this will likely land with the 6.7 kernel, then allowing things
like get/setsockopt (Breno)
- Cleanup of the cancel code, and then adding support for canceling
requests with the opcode as the key (me)
- Improvements for the io-wq locking (me)
- Fix affinity setting for SQPOLL based io-wq (me)
- Remove the io_uring userspace code. These were added initially as
copies from liburing, but all of them have since bitrotted and are
way out of date at this point. Rather than attempt to keep them in
sync, just get rid of them. People will have liburing available
anyway for these examples. (Pavel)
- Series improving the CQ/SQ ring caching (Pavel)
- Misc fixes and cleanups (Pavel, Yue, me)"
* tag 'for-6.6/io_uring-2023-08-28' of git://git.kernel.dk/linux: (47 commits)
io_uring: move iopoll ctx fields around
io_uring: move multishot cqe cache in ctx
io_uring: separate task_work/waiting cache line
io_uring: banish non-hot data to end of io_ring_ctx
io_uring: move non aligned field to the end
io_uring: add option to remove SQ indirection
io_uring: compact SQ/CQ heads/tails
io_uring: force inline io_fill_cqe_req
io_uring: merge iopoll and normal completion paths
io_uring: reorder cqring_flush and wakeups
io_uring: optimise extra io_get_cqe null check
io_uring: refactor __io_get_cqe()
io_uring: simplify big_cqe handling
io_uring: cqe init hardening
io_uring: improve cqe !tracing hot path
io_uring/rsrc: Annotate struct io_mapped_ubuf with __counted_by
io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used
io_uring: simplify io_run_task_work_sig return
io_uring/rsrc: keep one global dummy_ubuf
io_uring: never overflow io_aux_cqe
...
Long ago we set out to remove the kitchen sink on kernel/sysctl.c arrays and
placings sysctls to their own sybsystem or file to help avoid merge conflicts.
Matthew Wilcox pointed out though that if we're going to do that we might as
well also *save* space while at it and try to remove the extra last sysctl
entry added at the end of each array, a sentintel, instead of bloating the
kernel by adding a new sentinel with each array moved.
Doing that was not so trivial, and has required slowing down the moves of
kernel/sysctl.c arrays and measuring the impact on size by each new move.
The complex part of the effort to help reduce the size of each sysctl is being
done by the patient work of el señor Don Joel Granados. A lot of this is truly
painful code refactoring and testing and then trying to measure the savings of
each move and removing the sentinels. Although Joel already has code which does
most of this work, experience with sysctl moves in the past shows is we need to
be careful due to the slew of odd build failures that are possible due to the
amount of random Kconfig options sysctls use.
To that end Joel's work is split by first addressing the major housekeeping
needed to remove the sentinels, which is part of this merge request. The rest
of the work to actually remove the sentinels will be done later in future
kernel releases.
At first I was only going to send his first 7 patches of his patch series,
posted 1 month ago, but in retrospect due to the testing the changes have
received in linux-next and the minor changes they make this goes with the
entire set of patches Joel had planned: just sysctl house keeping. There are
networking changes but these are part of the house keeping too.
The preliminary math is showing this will all help reduce the overall build
time size of the kernel and run time memory consumed by the kernel by about
~64 bytes per array where we are able to remove each sentinel in the future.
That also means there is no more bloating the kernel with the extra ~64 bytes
per array moved as no new sentinels are created.
Most of this has been in linux-next for about a month, the last 7 patches took
a minor refresh 2 week ago based on feedback.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmTuVnMSHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinIckP/imvRlfkO6L0IP7MmJBRPtwY01rsTAKO
Q14dZ//bG4DVQeGl1FdzrF6hhuLgekU0qW1YDFIWiCXO7CbaxaNBPSUkeW6ReVoC
R/VHNUPxSR1PWQy1OTJV2t4XKri2sB7ijmUsfsATtISwhei9bggTHEysShtP4tv+
U87DzhoqMnbYIsfMo49KCqOa1Qm7TmjC1a7WAp6Fph3GJuXAzZR5pXpsd0NtOZ9x
Ud5RT22icnQpMl7K+yPsqY6XcS5JkgBe/WbSzMAUkYZvBZFBq9t2D+OW5h9TZMhw
piJWQ9X0Rm7qI2D15mJfXwaOhhyDhWci391hzdJmS6DI0prf6Ma2NFdAWOt/zomI
uiRujS4bGeBUaK5F4TX2WQ1+jdMtAZ+0FncFnzt4U8q7dzUc91uVCm6iHW3gcfAb
N7OEg2ZL0gkkgCZHqKxN8wpNQiC2KwnNk+HLAbnL2a/oJYfBtdopQmlxWfrN2hpF
xxROiENqk483BRdMXDq6DR/gyDZmZWCobXIglSzlqCOjCOcLbDziIJ7pJk83ok09
h/QnXTYHf9protBq9OIQesgh2pwNzBBLifK84KZLKcb7IbdIKjpQrW5STp04oNGf
wcGJzEz8tXUe0UKyMM47AcHQGzIy6cdXNLjyF8a+m7rnZzr1ndnMqZyRStZzuQin
AUg2VWHKPmW9
=sq2p
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"Long ago we set out to remove the kitchen sink on kernel/sysctl.c
arrays and placings sysctls to their own sybsystem or file to help
avoid merge conflicts. Matthew Wilcox pointed out though that if we're
going to do that we might as well also *save* space while at it and
try to remove the extra last sysctl entry added at the end of each
array, a sentintel, instead of bloating the kernel by adding a new
sentinel with each array moved.
Doing that was not so trivial, and has required slowing down the moves
of kernel/sysctl.c arrays and measuring the impact on size by each new
move.
The complex part of the effort to help reduce the size of each sysctl
is being done by the patient work of el señor Don Joel Granados. A lot
of this is truly painful code refactoring and testing and then trying
to measure the savings of each move and removing the sentinels.
Although Joel already has code which does most of this work,
experience with sysctl moves in the past shows is we need to be
careful due to the slew of odd build failures that are possible due to
the amount of random Kconfig options sysctls use.
To that end Joel's work is split by first addressing the major
housekeeping needed to remove the sentinels, which is part of this
merge request. The rest of the work to actually remove the sentinels
will be done later in future kernel releases.
The preliminary math is showing this will all help reduce the overall
build time size of the kernel and run time memory consumed by the
kernel by about ~64 bytes per array where we are able to remove each
sentinel in the future. That also means there is no more bloating the
kernel with the extra ~64 bytes per array moved as no new sentinels
are created"
* tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
sysctl: Use ctl_table_size as stopping criteria for list macro
sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl
vrf: Update to register_net_sysctl_sz
networking: Update to register_net_sysctl_sz
netfilter: Update to register_net_sysctl_sz
ax.25: Update to register_net_sysctl_sz
sysctl: Add size to register_net_sysctl function
sysctl: Add size arg to __register_sysctl_init
sysctl: Add size to register_sysctl
sysctl: Add a size arg to __register_sysctl_table
sysctl: Add size argument to init_header
sysctl: Add ctl_table_size to ctl_table_header
sysctl: Use ctl_table_header in list_for_each_table_entry
sysctl: Prefer ctl_table_header in proc_sysctl
The returned value is not used (any more), so don't return it.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Based on its name you would think that rqst_should_sleep() would be
read-only, not changing anything. But in fact it will clear
SP_TASK_PENDING if that was set. This is surprising, and it blurs the
line between "check for work to do" and "dequeue work to do".
So change the "test_and_clear" to simple "test" and clear the bit once
the thread has decided to wake up and return to the caller.
With this, it makes sense to *always* set SP_TASK_PENDING when asked,
rather than to set it only if no thread could be woken up.
[ cel: Previously TASK_PENDING indicated there is work waiting but no
idle threads were found to pick up that work. After this patch, it acts
as an XPT_BUSY flag for wake-ups that have no associated xprt. ]
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Document the API contract and remove stale or obvious comments.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_xprt_enqueue() can be costly, since it involves selecting and
waking up a process.
More than one enqueue is done per incoming RPC. For example,
svc_data_ready() enqueues, and so does svc_xprt_receive(). Also, if
an RPC message requires more than one call to ->recvfrom() to
receive it fully, each one of those calls does an enqueue.
To get a sense of the average number of transport enqueue operations
needed to process an incoming RPC message, re-use the "packets" pool
stat. Track the number of complete RPC messages processed by each
thread pool.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: Extract the loop that finds an idle service thread from
svc_xprt_enqueue() and svc_wake_up(). Both functions do just about
the same thing.
Note that svc_wake_up() currently does not hold the RCU read lock
while waking the target thread. It indeed should hold the lock, just
as svc_xprt_enqueue() does, to ensure the rqstp does not vanish
during the wake-up. This patch adds the RCU lock for svc_wake_up().
Note that shrinking the pool thread count is rare, and calls to
svc_wake_up() are also quite infrequent. In practice, this race is
very unlikely to be hit, so we are not marking the lock fix for
stable backport at this time.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The xpt_flags field frequently changes between the time that
svc_xprt_ready() grabs a copy and execution flow arrives at the
tracepoint at the tail of svc_xprt_enqueue(). In fact, there's
usually a sleep/wake-up in there, so those flags are almost
guaranteed to be different.
It would be more useful to record the exact flags that were used to
decide whether the transport is ready, so move the tracepoint.
Moving it means the tracepoint can't pick up the waker's pid. That
can be added to struct svc_rqst if it turns out that is important.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
In addition to the benefits of using an enum rather than a set of
macros, we now have a named type that can improve static type
checking of function return values.
As part of this change, I removed a stale comment from svcauth.h;
the return values from current implementations of the
auth_ops::release method are all zero/negative errno, not the SVC_OK
enum values as the old comment suggested.
Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Most svc threads have no interest in a timeout.
nfsd sets it to 1 hour, but this is a wart of no significance.
lockd uses the timeout so that it can call nlmsvc_retry_blocked().
It also sometimes calls svc_wake_up() to ensure this is called.
So change lockd to be consistent and always use svc_wake_up() to trigger
nlmsvc_retry_blocked() - using a timer instead of a timeout to
svc_recv().
And change svc_recv() to not take a timeout arg.
This makes the sp_threads_timedout counter always zero.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_recv() currently returns a 0 on success or one of two errors:
- -EAGAIN means no message was successfully received
- -EINTR means the thread has been told to stop
Previously nfsd would stop as the result of a signal as well as
following kthread_stop(). In that case the difference was useful: EINTR
means stop unconditionally. EAGAIN means stop if kthread_should_stop(),
continue otherwise.
Now threads only exit when kthread_should_stop() so we don't need the
distinction.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
All callers of svc_recv() go on to call svc_process() on success.
Simplify callers by having svc_recv() do that for them.
This loses one call to validate_process_creds() in nfsd. That was
debugging code added 14 years ago. I don't think we need to keep it.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The original implementation of nfsd used signals to stop threads during
shutdown.
In Linux 2.3.46pre5 nfsd gained the ability to shutdown threads
internally it if was asked to run "0" threads. After this user-space
transitioned to using "rpc.nfsd 0" to stop nfsd and sending signals to
threads was no longer an important part of the API.
In commit 3ebdbe5203 ("SUNRPC: discard svo_setup and rename
svc_set_num_threads_sync()") (v5.17-rc1~75^2~41) we finally removed the
use of signals for stopping threads, using kthread_stop() instead.
This patch makes the "obvious" next step and removes the ability to
signal nfsd threads - or any svc threads. nfsd stops allowing signals
and we don't check for their delivery any more.
This will allow for some simplification in later patches.
A change worth noting is in nfsd4_ssc_setup_dul(). There was previously
a signal_pending() check which would only succeed when the thread was
being shut down. It should really have tested kthread_should_stop() as
well. Now it just does the latter, not the former.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
With large NFS WRITE requests on TCP, I measured 5-10 thread wake-
ups to receive each request. This is because the socket layer
calls ->sk_data_ready() frequently, and each call triggers a
thread wake-up. Each recvmsg() seems to pull in less than 100KB.
Have the socket layer hold ->sk_data_ready() calls until the full
incoming message has arrived to reduce the wake-up rate.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Flamegraph analysis showed that the cork/uncork calls consume
nearly a third of the CPU time spent in svc_tcp_sendto(). The
other two consumers are mutex lock/unlock and svc_tcp_sendmsg().
Now that svc_tcp_sendto() coalesces RPC messages properly, there
is no need to introduce artificial delays to prevent sending
partial messages.
After applying this change, I measured a 1.2K read IOPS increase
for 8KB random I/O (several percent) on 56Gb IP over IB.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Commit da1661b93b ("SUNRPC: Teach server to use xprt_sock_sendmsg
for socket sends") modified svc_udp_sendto() to use xprt_sock_sendmsg()
because we originally believed xprt_sock_sendmsg() would be needed
for TLS support. That does not actually appear to be the case.
In addition, the linkage between the client and server send code has
been a bit of a maintenance headache because of the distinct ways
that the client and server handle memory allocation.
Going forward, eventually the XDR layer will deal with its buffers
in the form of bio_vec arrays, so convert this function accordingly.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
There is now enough infrastructure in place to combine the stream
record marker into the biovec array used to send each outgoing RPC
message on TCP. The whole message can be more efficiently sent with
a single call to sock_sendmsg() using a bio_vec iterator.
Note that this also helps with RPC-with-TLS: the TLS implementation
can now clearly see where the upper layer message boundaries are.
Before, it would send each component of the xdr_buf (record marker,
head, page payload, tail) in separate TLS records.
Suggested-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add a helper to convert a whole xdr_buf directly into an array of
bio_vecs, then send this array instead of iterating piecemeal over
the xdr_buf containing the outbound RPC message.
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
All supported encryption types now use the same context import
function.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This code is now always on, so the ifdef can be removed.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Make it impossible to enable support for the DES or DES3 Kerberos
encryption types in SunRPC. These enctypes were deprecated by RFCs
6649 and 8429 because they are known to be insecure.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Jeff confirmed his original fix addressed his pynfs test failure,
but this same bug also impacted qemu: accessing qcow2 virtual disks
using direct I/O was failing. Jeff's fix missed that you have to
shorten the bio_vec element by the same amount as you increased
the page offset.
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: c96e2a695e ("sunrpc: set the bv_offset of first bvec in svc_tcp_sendmsg")
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support tracking
KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the GENERIC_IOREMAP
ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take
GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency improvements
("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation, from
Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code ("Two
minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most
file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap
on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table range
API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM subsystem
documentation ("Improve mm documentation").
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA
jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6
Nfio7afS1ncD+hPYT8947UnLxTgn+ww=
=Afws
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Some swap cleanups from Ma Wupeng ("fix WARN_ON in
add_to_avail_list")
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.
- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").
- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").
- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").
- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support
tracking KSM-placed zero-pages").
- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").
- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with
UFFD").
- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").
- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").
- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").
- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").
- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").
- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").
- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
- Baoquan He has converted some architectures to use the
GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
architectures to take GENERIC_IOREMAP way").
- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").
- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency
improvements ("Reduce preallocations for maple tree").
- Cleanup and optimization to the secondary IOMMU TLB invalidation,
from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").
- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").
- Kemeng Shi provides some maintenance work on the compaction code
("Two minor cleanups for compaction").
- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
most file-backed faults under the VMA lock").
- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").
- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").
- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").
- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").
- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").
- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").
- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
memmap on memory feature on ppc64").
- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock
migratetype").
- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").
- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").
- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").
- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").
- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").
- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").
- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").
- pagetable handling cleanups from Matthew Wilcox ("New page table
range API").
- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").
- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").
- Matthew Wilcox has also done some maintenance work on the MM
subsystem documentation ("Improve mm documentation").
* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
maple_tree: shrink struct maple_tree
maple_tree: clean up mas_wr_append()
secretmem: convert page_is_secretmem() to folio_is_secretmem()
nios2: fix flush_dcache_page() for usage from irq context
hugetlb: add documentation for vma_kernel_pagesize()
mm: add orphaned kernel-doc to the rst files.
mm: fix clean_record_shared_mapping_range kernel-doc
mm: fix get_mctgt_type() kernel-doc
mm: fix kernel-doc warning from tlb_flush_rmaps()
mm: remove enum page_entry_size
mm: allow ->huge_fault() to be called without the mmap_lock held
mm: move PMD_ORDER to pgtable.h
mm: remove checks for pte_index
memcg: remove duplication detection for mem_cgroup_uncharge_swap
mm/huge_memory: work on folio->swap instead of page->private when splitting folio
mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
mm/swap: use dedicated entry for swap in folio
mm/swap: stop using page->private on tail pages for THP_SWAP
selftests/mm: fix WARNING comparing pointer to 0
selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
...
Core
----
- Increase size limits for to-be-sent skb frag allocations. This
allows tun, tap devices and packet sockets to better cope with large
writes operations.
- Store netdevs in an xarray, to simplify iterating over netdevs.
- Refactor nexthop selection for multipath routes.
- Improve sched class lifetime handling.
- Add backup nexthop ID support for bridge.
- Implement drop reasons support in openvswitch.
- Several data races annotations and fixes.
- Constify the sk parameter of routing functions.
- Prepend kernel version to netconsole message.
Protocols
---------
- Implement support for TCP probing the peer being under memory
pressure.
- Remove hard coded limitation on IPv6 specific info placement
inside the socket struct.
- Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated
per socket scaling factor.
- Scaling-up the IPv6 expired route GC via a separated list of
expiring routes.
- In-kernel support for the TLS alert protocol.
- Better support for UDP reuseport with connected sockets.
- Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
header size.
- Get rid of additional ancillary per MPTCP connection struct socket.
- Implement support for BPF-based MPTCP packet schedulers.
- Format MPTCP subtests selftests results in TAP.
- Several new SMC 2.1 features including unique experimental options,
max connections per lgr negotiation, max links per lgr negotiation.
BPF
---
- Multi-buffer support in AF_XDP.
- Add multi uprobe BPF links for attaching multiple uprobes
and usdt probes, which is significantly faster and saves extra fds.
- Implement an fd-based tc BPF attach API (TCX) and BPF link support on
top of it.
- Add SO_REUSEPORT support for TC bpf_sk_assign.
- Support new instructions from cpu v4 to simplify the generated code and
feature completeness, for x86, arm64, riscv64.
- Support defragmenting IPv(4|6) packets in BPF.
- Teach verifier actual bounds of bpf_get_smp_processor_id()
and fix perf+libbpf issue related to custom section handling.
- Introduce bpf map element count and enable it for all program types.
- Add a BPF hook in sys_socket() to change the protocol ID
from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy.
- Introduce bpf_me_mcache_free_rcu() and fix OOM under stress.
- Add uprobe support for the bpf_get_func_ip helper.
- Check skb ownership against full socket.
- Support for up to 12 arguments in BPF trampoline.
- Extend link_info for kprobe_multi and perf_event links.
Netfilter
---------
- Speed-up process exit by aborting ruleset validation if a
fatal signal is pending.
- Allow NLA_POLICY_MASK to be used with BE16/BE32 types.
Driver API
----------
- Page pool optimizations, to improve data locality and cache usage.
- Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the need
for raw ioctl() handling in drivers.
- Simplify genetlink dump operations (doit/dumpit) providing them
the common information already populated in struct genl_info.
- Extend and use the yaml devlink specs to [re]generate the split ops.
- Introduce devlink selective dumps, to allow SF filtering SF based on
handle and other attributes.
- Add yaml netlink spec for netlink-raw families, allow route, link and
address related queries via the ynl tool.
- Remove phylink legacy mode support.
- Support offload LED blinking to phy.
- Add devlink port function attributes for IPsec.
New hardware / drivers
----------------------
- Ethernet:
- Broadcom ASP 2.0 (72165) ethernet controller
- MediaTek MT7988 SoC
- Texas Instruments AM654 SoC
- Texas Instruments IEP driver
- Atheros qca8081 phy
- Marvell 88Q2110 phy
- NXP TJA1120 phy
- WiFi:
- MediaTek mt7981 support
- Can:
- Kvaser SmartFusion2 PCI Express devices
- Allwinner T113 controllers
- Texas Instruments tcan4552/4553 chips
- Bluetooth:
- Intel Gale Peak
- Qualcomm WCN3988 and WCN7850
- NXP AW693 and IW624
- Mediatek MT2925
Drivers
-------
- Ethernet NICs:
- nVidia/Mellanox:
- mlx5:
- support UDP encapsulation in packet offload mode
- IPsec packet offload support in eswitch mode
- improve aRFS observability by adding new set of counters
- extends MACsec offload support to cover RoCE traffic
- dynamic completion EQs
- mlx4:
- convert to use auxiliary bus instead of custom interface logic
- Intel
- ice:
- implement switchdev bridge offload, even for LAG interfaces
- implement SRIOV support for LAG interfaces
- igc:
- add support for multiple in-flight TX timestamps
- Broadcom:
- bnxt:
- use the unified RX page pool buffers for XDP and non-XDP
- use the NAPI skb allocation cache
- OcteonTX2:
- support Round Robin scheduling HTB offload
- TC flower offload support for SPI field
- Freescale:
- add XDP_TX feature support
- AMD:
- ionic: add support for PCI FLR event
- sfc:
- basic conntrack offload
- introduce eth, ipv4 and ipv6 pedit offloads
- ST Microelectronics:
- stmmac: maximze PTP timestamping resolution
- Virtual NICs:
- Microsoft vNIC:
- batch ringing RX queue doorbell on receiving packets
- add page pool for RX buffers
- Virtio vNIC:
- add per queue interrupt coalescing support
- Google vNIC:
- add queue-page-list mode support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add port range matching tc-flower offload
- permit enslavement to netdevices with uppers
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- convert to phylink_pcs
- Renesas:
- r8A779fx: add speed change support
- rzn1: enables vlan support
- Ethernet PHYs:
- convert mv88e6xxx to phylink_pcs
- WiFi:
- Qualcomm Wi-Fi 7 (ath12k):
- extremely High Throughput (EHT) PHY support
- RealTek (rtl8xxxu):
- enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
RTL8192EU and RTL8723BU
- RealTek (rtw89):
- Introduce Time Averaged SAR (TAS) support
- Connector:
- support for event filtering
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmTt1ZoSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkgFUP/REFaYWdWUvAzmWeezyx9dqgZMfSOjWq
9QvySiA94OAOcjIYkb7wfzQ5BBAZqaBQ/f8XqWwS1EDDDEBs8sP1cxmABKwW7Hsr
qFRu2sOqLzKBk223d0jIgEocfQaFpGbF71gXoTlDivBjBi5UxWm9bF0XnbYWcKgO
/QEvzNosi9uNdi85Fzmv62J6YzAdidEpwGsM7X2CfejwNRmStxAEg/NwvRR0Hyiq
OJCo97omEgTRaUle8nc64PDx33u4h5kQ1BkaeHEv0rbE3hftFC2YPKn/InmqSFGz
6ew2xnrGPR37LCuAiCcIIv6yR7K0eu0iYJ7jXwZxBDqxGavEPuwWGBoCP6qFiitH
ZLWhIrAUrdmSbySkTOCONhJ475qFAuQoYHYpZnX/bJZUHlSsb/9lwDJYJQGpVfd1
/daqJVSb7lhaifmNO1iNd/ibCIXq9zapwtkRwA897M8GkZBTsnVvazFld1Em+Se3
Bx6DSDUVBqVQ9fpZG2IAGD6odDwOzC1lF2IoceFvK9Ff6oE0psI+A0qNLMkHxZbW
Qlo7LsNe53hpoCC+yHTfXX7e/X8eNt0EnCGOQJDusZ0Nr3K7H4LKFA0i8UBUK05n
4lKnnaSQW7GQgdofLWt103OMDR9GoDxpFsm7b1X9+AEk6Fz6tq50wWYeMZETUKYP
DCW8VGFOZjZM
=9CsR
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Increase size limits for to-be-sent skb frag allocations. This
allows tun, tap devices and packet sockets to better cope with
large writes operations
- Store netdevs in an xarray, to simplify iterating over netdevs
- Refactor nexthop selection for multipath routes
- Improve sched class lifetime handling
- Add backup nexthop ID support for bridge
- Implement drop reasons support in openvswitch
- Several data races annotations and fixes
- Constify the sk parameter of routing functions
- Prepend kernel version to netconsole message
Protocols:
- Implement support for TCP probing the peer being under memory
pressure
- Remove hard coded limitation on IPv6 specific info placement inside
the socket struct
- Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per
socket scaling factor
- Scaling-up the IPv6 expired route GC via a separated list of
expiring routes
- In-kernel support for the TLS alert protocol
- Better support for UDP reuseport with connected sockets
- Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR
header size
- Get rid of additional ancillary per MPTCP connection struct socket
- Implement support for BPF-based MPTCP packet schedulers
- Format MPTCP subtests selftests results in TAP
- Several new SMC 2.1 features including unique experimental options,
max connections per lgr negotiation, max links per lgr negotiation
BPF:
- Multi-buffer support in AF_XDP
- Add multi uprobe BPF links for attaching multiple uprobes and usdt
probes, which is significantly faster and saves extra fds
- Implement an fd-based tc BPF attach API (TCX) and BPF link support
on top of it
- Add SO_REUSEPORT support for TC bpf_sk_assign
- Support new instructions from cpu v4 to simplify the generated code
and feature completeness, for x86, arm64, riscv64
- Support defragmenting IPv(4|6) packets in BPF
- Teach verifier actual bounds of bpf_get_smp_processor_id() and fix
perf+libbpf issue related to custom section handling
- Introduce bpf map element count and enable it for all program types
- Add a BPF hook in sys_socket() to change the protocol ID from
IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy
- Introduce bpf_me_mcache_free_rcu() and fix OOM under stress
- Add uprobe support for the bpf_get_func_ip helper
- Check skb ownership against full socket
- Support for up to 12 arguments in BPF trampoline
- Extend link_info for kprobe_multi and perf_event links
Netfilter:
- Speed-up process exit by aborting ruleset validation if a fatal
signal is pending
- Allow NLA_POLICY_MASK to be used with BE16/BE32 types
Driver API:
- Page pool optimizations, to improve data locality and cache usage
- Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the
need for raw ioctl() handling in drivers
- Simplify genetlink dump operations (doit/dumpit) providing them the
common information already populated in struct genl_info
- Extend and use the yaml devlink specs to [re]generate the split ops
- Introduce devlink selective dumps, to allow SF filtering SF based
on handle and other attributes
- Add yaml netlink spec for netlink-raw families, allow route, link
and address related queries via the ynl tool
- Remove phylink legacy mode support
- Support offload LED blinking to phy
- Add devlink port function attributes for IPsec
New hardware / drivers:
- Ethernet:
- Broadcom ASP 2.0 (72165) ethernet controller
- MediaTek MT7988 SoC
- Texas Instruments AM654 SoC
- Texas Instruments IEP driver
- Atheros qca8081 phy
- Marvell 88Q2110 phy
- NXP TJA1120 phy
- WiFi:
- MediaTek mt7981 support
- Can:
- Kvaser SmartFusion2 PCI Express devices
- Allwinner T113 controllers
- Texas Instruments tcan4552/4553 chips
- Bluetooth:
- Intel Gale Peak
- Qualcomm WCN3988 and WCN7850
- NXP AW693 and IW624
- Mediatek MT2925
Drivers:
- Ethernet NICs:
- nVidia/Mellanox:
- mlx5:
- support UDP encapsulation in packet offload mode
- IPsec packet offload support in eswitch mode
- improve aRFS observability by adding new set of counters
- extends MACsec offload support to cover RoCE traffic
- dynamic completion EQs
- mlx4:
- convert to use auxiliary bus instead of custom interface
logic
- Intel
- ice:
- implement switchdev bridge offload, even for LAG
interfaces
- implement SRIOV support for LAG interfaces
- igc:
- add support for multiple in-flight TX timestamps
- Broadcom:
- bnxt:
- use the unified RX page pool buffers for XDP and non-XDP
- use the NAPI skb allocation cache
- OcteonTX2:
- support Round Robin scheduling HTB offload
- TC flower offload support for SPI field
- Freescale:
- add XDP_TX feature support
- AMD:
- ionic: add support for PCI FLR event
- sfc:
- basic conntrack offload
- introduce eth, ipv4 and ipv6 pedit offloads
- ST Microelectronics:
- stmmac: maximze PTP timestamping resolution
- Virtual NICs:
- Microsoft vNIC:
- batch ringing RX queue doorbell on receiving packets
- add page pool for RX buffers
- Virtio vNIC:
- add per queue interrupt coalescing support
- Google vNIC:
- add queue-page-list mode support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add port range matching tc-flower offload
- permit enslavement to netdevices with uppers
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- convert to phylink_pcs
- Renesas:
- r8A779fx: add speed change support
- rzn1: enables vlan support
- Ethernet PHYs:
- convert mv88e6xxx to phylink_pcs
- WiFi:
- Qualcomm Wi-Fi 7 (ath12k):
- extremely High Throughput (EHT) PHY support
- RealTek (rtl8xxxu):
- enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU),
RTL8192EU and RTL8723BU
- RealTek (rtw89):
- Introduce Time Averaged SAR (TAS) support
- Connector:
- support for event filtering"
* tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1806 commits)
net: ethernet: mtk_wed: minor change in wed_{tx,rx}info_show
net: ethernet: mtk_wed: add some more info in wed_txinfo_show handler
net: stmmac: clarify difference between "interface" and "phy_interface"
r8152: add vendor/device ID pair for D-Link DUB-E250
devlink: move devlink_notify_register/unregister() to dev.c
devlink: move small_ops definition into netlink.c
devlink: move tracepoint definitions into core.c
devlink: push linecard related code into separate file
devlink: push rate related code into separate file
devlink: push trap related code into separate file
devlink: use tracepoint_enabled() helper
devlink: push region related code into separate file
devlink: push param related code into separate file
devlink: push resource related code into separate file
devlink: push dpipe related code into separate file
devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper
devlink: push shared buffer related code into separate file
devlink: push port related code into separate file
devlink: push object register/unregister notifications into separate helpers
inet: fix IP_TRANSPARENT error handling
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
=eJz5
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
xfs, ext4, and btrfs to use them. This carries acks from all relevant
filesystems.
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per
jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache.
Even with NFSv4, a lot of exported filesystems don't properly support
a change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps
(e.g., backup applications).
If we were to always use fine-grained timestamps, that would improve
the situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
This introduces fine-grained timestamps that are used when they are
actively queried.
This uses the 31st bit of the ctime tv_nsec field to indicate that
something has queried the inode for the mtime or ctime. When this flag
is set, on the next mtime or ctime update, the kernel will fetch a
fine-grained timestamp instead of the usual coarse-grained one.
As POSIX generally mandates that when the mtime changes, the ctime
must also change the kernel always stores normalized ctime values, so
only the first 30 bits of the tv_nsec field are ever used.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.
Various preparatory changes, fixes and cleanups are included:
- Fixup all relevant places where POSIX requires updating ctime
together with mtime. This is a wide-range of places and all
maintainers provided necessary Acks.
- Add new accessors for inode->i_ctime directly and change all
callers to rely on them. Plain accesses to inode->i_ctime are now
gone and it is accordingly rename to inode->__i_ctime and commented
as requiring accessors.
- Extend generic_fillattr() to pass in a request mask mirroring in a
sense the statx() uapi. This allows callers to pass in a request
mask to only get a subset of attributes filled in.
- Rework timestamp updates so it's possible to drop the @now
parameter the update_time() inode operation and associated helpers.
- Add inode_update_timestamps() and convert all filesystems to it
removing a bunch of open-coding"
* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
tmpfs: add support for multigrain timestamps
fs: add infrastructure for multigrain timestamps
fs: drop the timespec64 argument from update_time
xfs: have xfs_vn_update_time gets its own timestamp
fat: make fat_update_time get its own timestamp
fat: remove i_version handling from fat_update_time
ubifs: have ubifs_update_time use inode_update_timestamps
btrfs: have it use inode_update_timestamps
fs: drop the timespec64 arg from generic_update_time
fs: pass the request_mask to generic_fillattr
fs: remove silly warning from current_time
gfs2: fix timestamp handling on quota inodes
fs: rename i_ctime field to __i_ctime
selinux: convert to ctime accessor functions
security: convert to ctime accessor functions
apparmor: convert to ctime accessor functions
sunrpc: convert to ctime accessor functions
...
At last, move the last bits out of leftover.c,
the devlink_notify_register/unregister() functions to dev.c
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-16-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for the trap code move, use tracepoint_enabled() helper
instead of trace_devlink_trap_report_enabled() which would not be
defined in that scope.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-10-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since both dpipe and resource code is using this helper, in preparation
for code split to separate files, move
devlink_dpipe_send_and_alloc_skb() helper into netlink.c. Rename it on
the way.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-5-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparations of leftover.c split to individual files, avoid need to
have object structures exposed in devl_internal.h and allow to have them
maintained in object files.
The register/unregister notifications need to know the structures
to iterate lists. To avoid the need, introduce per-object
register/unregister notification helpers and use them.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230828061657.300667-2-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
My recent patch forgot to change error handling for IP_TRANSPARENT
socket option.
WARNING: bad unlock balance detected!
6.5.0-rc7-syzkaller-01717-g59da9885767a #0 Not tainted
-------------------------------------
syz-executor151/5028 is trying to release lock (sk_lock-AF_INET) at:
[<ffffffff88213983>] sockopt_release_sock+0x53/0x70 net/core/sock.c:1073
but there are no more locks to release!
other info that might help us debug this:
1 lock held by syz-executor151/5028:
stack backtrace:
CPU: 0 PID: 5028 Comm: syz-executor151 Not tainted 6.5.0-rc7-syzkaller-01717-g59da9885767a #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106
__lock_release kernel/locking/lockdep.c:5438 [inline]
lock_release+0x4b5/0x680 kernel/locking/lockdep.c:5781
sock_release_ownership include/net/sock.h:1824 [inline]
release_sock+0x175/0x1b0 net/core/sock.c:3527
sockopt_release_sock+0x53/0x70 net/core/sock.c:1073
do_ip_setsockopt+0x12c1/0x3640 net/ipv4/ip_sockglue.c:1364
ip_setsockopt+0x59/0xe0 net/ipv4/ip_sockglue.c:1419
raw_setsockopt+0x218/0x290 net/ipv4/raw.c:833
__sys_setsockopt+0x2cd/0x5b0 net/socket.c:2305
__do_sys_setsockopt net/socket.c:2316 [inline]
__se_sys_setsockopt net/socket.c:2313 [inline]
Fixes: 4bd0623f04 ("inet: move inet->transparent to inet->inet_flags")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at TC_ACT_* handling, the TC_ACT_CONSUMED is only handled in
sch_handle_ingress but not sch_handle_egress. This was added via cd11b16407
("net/tc: introduce TC_ACT_REINSERT.") and e5cf1baf92 ("act_mirred: use
TC_ACT_REINSERT when possible") and later got renamed into TC_ACT_CONSUMED
via 720f22fed8 ("net: sched: refactor reinsert action").
The initial work was targeted for ovs back then and only needed on ingress,
and the mirred action module also restricts it to only that. However, given
it's an API contract it would still make sense to make this consistent to
sch_handle_ingress and handle it on egress side in the same way, that is,
setting return code to "success" and returning NULL back to the caller as
otherwise an action module sitting on egress returning TC_ACT_CONSUMED could
lead to an UAF when untreated.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a memory leak for the tc egress path with TC_ACT_{STOLEN,QUEUED,TRAP}:
[...]
unreferenced object 0xffff88818bcb4f00 (size 232):
comm "softirq", pid 0, jiffies 4299085078 (age 134.028s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 80 70 61 81 88 ff ff 00 41 31 14 81 88 ff ff ..pa.....A1.....
backtrace:
[<ffffffff9991b938>] kmem_cache_alloc_node+0x268/0x400
[<ffffffff9b3d9231>] __alloc_skb+0x211/0x2c0
[<ffffffff9b3f0c7e>] alloc_skb_with_frags+0xbe/0x6b0
[<ffffffff9b3bf9a9>] sock_alloc_send_pskb+0x6a9/0x870
[<ffffffff9b6b3f00>] __ip_append_data+0x14d0/0x3bf0
[<ffffffff9b6ba24e>] ip_append_data+0xee/0x190
[<ffffffff9b7e1496>] icmp_push_reply+0xa6/0x470
[<ffffffff9b7e4030>] icmp_reply+0x900/0xa00
[<ffffffff9b7e42e3>] icmp_echo.part.0+0x1a3/0x230
[<ffffffff9b7e444d>] icmp_echo+0xcd/0x190
[<ffffffff9b7e9566>] icmp_rcv+0x806/0xe10
[<ffffffff9b699bd1>] ip_protocol_deliver_rcu+0x351/0x3d0
[<ffffffff9b699f14>] ip_local_deliver_finish+0x2b4/0x450
[<ffffffff9b69a234>] ip_local_deliver+0x174/0x1f0
[<ffffffff9b69a4b2>] ip_sublist_rcv_finish+0x1f2/0x420
[<ffffffff9b69ab56>] ip_sublist_rcv+0x466/0x920
[...]
I was able to reproduce this via:
ip link add dev dummy0 type dummy
ip link set dev dummy0 up
tc qdisc add dev eth0 clsact
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip protocol 1 0xff action mirred egress redirect dev dummy0
ping 1.1.1.1
<stolen>
After the fix, there are no kmemleak reports with the reproducer. This is
in line with what is also done on the ingress side, and from debugging the
skb_unref(skb) on dummy xmit and sch_handle_egress() side, it is visible
that these are two different skbs with both skb_unref(skb) as true. The two
seen skbs are due to mirred doing a skb_clone() internally as use_reinsert
is false in tcf_mirred_act() for egress. This was initially reported by Gal.
Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/bdfc2640-8f65-5b56-4472-db8e2b161aab@nvidia.com
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There was a previous attempt to fix an out-of-bounds access in the DCCP
error handlers, but that fix assumed that the error handlers only want
to access the first 8 bytes of the DCCP header. Actually, they also look
at the DCCP sequence number, which is stored beyond 8 bytes, so an
explicit pskb_may_pull() is required.
Fixes: 6706a97fec ("dccp: fix out of bound access in dccp_v4_err()")
Fixes: 1aa9d1a0e7 ("ipv6: dccp: fix out of bound access in dccp_v6_err()")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tls_cipher_desc also contains the algorithm name needed by
crypto_alloc_aead, use it.
Finally, use get_cipher_desc to check if the cipher_type coming from
userspace is valid, and remove the cipher_type switch.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/53d021d80138aa125a9cef4468aa5ce531975a7b.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We can get rid of some local variables, but we have to keep nonce_size
because tls1.3 uses nonce_size = 0 for all ciphers.
We can also drop the runtime sanity checks on iv/rec_seq/tag size,
since we have compile time checks on those values.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/deed9c4430a62c31751a72b8c03ad66ffe710717.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Every cipher uses the same code to update its crypto_info struct based
on the values contained in the cctx, with only the struct type and
size/offset changing. We can get those from tls_cipher_desc, and use
a single pair of memcpy and final copy_to_user.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/c21a904b91e972bdbbf9d1c6d2731ccfa1eedf72.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tls_sw_fallback_init already gets the key and tag size from
tls_cipher_desc. We can now also check that the cipher type is valid,
and stop hard-coding the algorithm name passed to crypto_alloc_aead.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/c8c94b8fcafbfb558e09589c1f1ad48dbdf92f76.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tls_set_device_offload is already getting iv and rec_seq sizes from
tls_cipher_desc. We can now also check if the cipher_type coming from
userspace is valid and can be offloaded.
We can also remove the runtime check on rec_seq, since we validate it
at compile time.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/8ab71b8eca856c7aaf981a45fe91ac649eb0e2e9.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- add nonce, usually equal to iv_size but not for chacha
- add offsets into the crypto_info for each field
- add algorithm name
- add offloadable flag
Also add helpers to access each field of a crypto_info struct
described by a tls_cipher_desc.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/39d5f476d63c171097764e8d38f6f158b7c109ae.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tls_cipher_size_desc indexes ciphers by their type, but we're not
using indices 0..50 of the array. Each struct tls_cipher_size_desc is
20B, so that's a lot of unused memory. We can reindex the array
starting at the lowest used cipher_type.
Introduce the get_cipher_size_desc helper to find the right item and
avoid out-of-bounds accesses, and make tls_cipher_size_desc's size
explicit so that gcc reminds us to update TLS_CIPHER_MIN/MAX when we
add a new cipher.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/5e054e370e240247a5d37881a1cd93a67c15f4ca.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Expose port function commands to enable / disable IPsec packet offloads,
this is used to control the port IPsec capabilities.
When IPsec packet is disabled for a function of the port (default),
function cannot offload IPsec packet operations (encapsulation and XFRM
policy offload). When enabled, IPsec packet operations can be offloaded
by the function of the port, which includes crypto operation
(Encrypt/Decrypt), IPsec encapsulation and XFRM state and policy
offload.
Example of a PCI VF port which supports IPsec packet offloads:
$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
function:
hw_addr 00:00:00:00:00:00 roce enable ipsec_packet disable
$ devlink port function set pci/0000:06:00.0/1 ipsec_packet enable
$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
function:
hw_addr 00:00:00:00:00:00 roce enable ipsec_packet enable
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-3-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Expose port function commands to enable / disable IPsec crypto offloads,
this is used to control the port IPsec capabilities.
When IPsec crypto is disabled for a function of the port (default),
function cannot offload any IPsec crypto operations (Encrypt/Decrypt and
XFRM state offloading). When enabled, IPsec crypto operations can be
offloaded by the function of the port.
Example of a PCI VF port which supports IPsec crypto offloads:
$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
function:
hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto disable
$ devlink port function set pci/0000:06:00.0/1 ipsec_crypto enable
$ devlink port show pci/0000:06:00.0/1
pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
function:
hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto enable
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230825062836.103744-2-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
HFSC assumes that inner classes have an fsc curve, but it is currently
possible for classes without an fsc curve to become parents. This leads
to bugs including a use-after-free.
Don't allow non-root classes without HFSC_FSC to become parents.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Budimir Markovic <markovicbudimir@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230824084905.422-1-markovicbudimir@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZOjkTAAKCRDbK58LschI
gx32AP9gaaHFBtOYBfoenKTJfMgv1WhtQHIBas+WN9ItmBx9MAEA4gm/VyQ6oD7O
EBjJKJQ2CZ/QKw7cNacXw+l5jF7/+Q0=
=8P7g
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-08-25
We've added 87 non-merge commits during the last 8 day(s) which contain
a total of 104 files changed, 3719 insertions(+), 4212 deletions(-).
The main changes are:
1) Add multi uprobe BPF links for attaching multiple uprobes
and usdt probes, which is significantly faster and saves extra fds,
from Jiri Olsa.
2) Add support BPF cpu v4 instructions for arm64 JIT compiler,
from Xu Kuohai.
3) Add support BPF cpu v4 instructions for riscv64 JIT compiler,
from Pu Lehui.
4) Fix LWT BPF xmit hooks wrt their return values where propagating
the result from skb_do_redirect() would trigger a use-after-free,
from Yan Zhai.
5) Fix a BPF verifier issue related to bpf_kptr_xchg() with local kptr
where the map's value kptr type and locally allocated obj type
mismatch, from Yonghong Song.
6) Fix BPF verifier's check_func_arg_reg_off() function wrt graph
root/node which bypassed reg->off == 0 enforcement,
from Kumar Kartikeya Dwivedi.
7) Lift BPF verifier restriction in networking BPF programs to treat
comparison of packet pointers not as a pointer leak,
from Yafang Shao.
8) Remove unmaintained XDP BPF samples as they are maintained
in xdp-tools repository out of tree, from Toke Høiland-Jørgensen.
9) Batch of fixes for the tracing programs from BPF samples in order
to make them more libbpf-aware, from Daniel T. Lee.
10) Fix a libbpf signedness determination bug in the CO-RE relocation
handling logic, from Andrii Nakryiko.
11) Extend libbpf to support CO-RE kfunc relocations. Also follow-up
fixes for bpf_refcount shared ownership implementation,
both from Dave Marchevsky.
12) Add a new bpf_object__unpin() API function to libbpf,
from Daniel Xu.
13) Fix a memory leak in libbpf to also free btf_vmlinux
when the bpf_object gets closed, from Hao Luo.
14) Small error output improvements to test_bpf module, from Helge Deller.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (87 commits)
selftests/bpf: Add tests for rbtree API interaction in sleepable progs
bpf: Allow bpf_spin_{lock,unlock} in sleepable progs
bpf: Consider non-owning refs to refcounted nodes RCU protected
bpf: Reenable bpf_refcount_acquire
bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes
bpf: Consider non-owning refs trusted
bpf: Ensure kptr_struct_meta is non-NULL for collection insert and refcount_acquire
selftests/bpf: Enable cpu v4 tests for RV64
riscv, bpf: Support unconditional bswap insn
riscv, bpf: Support signed div/mod insns
riscv, bpf: Support 32-bit offset jmp insn
riscv, bpf: Support sign-extension mov insns
riscv, bpf: Support sign-extension load insns
riscv, bpf: Fix missing exception handling and redundant zext for LDX_B/H/W
samples/bpf: Add note to README about the XDP utilities moved to xdp-tools
samples/bpf: Cleanup .gitignore
samples/bpf: Remove the xdp_sample_pkts utility
samples/bpf: Remove the xdp1 and xdp2 utilities
samples/bpf: Remove the xdp_rxq_info utility
samples/bpf: Remove the xdp_redirect* utilities
...
====================
Link: https://lore.kernel.org/r/20230825194319.12727-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The second pull request for v6.6, this time with both stack and driver
changes. Unusually we have only one major new feature but lots of
small cleanup all over, I guess this is due to people have been on
vacation the last month.
Major changes:
rtw89
* Introduce Time Averaged SAR (TAS) support
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmToqosRHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZv9XQf9HDq9smbuWLvwzNjbbS31hHFLmnfhN8Zp
+Zzn47gpMCle9ahGLQyw8lcfNPWCMyqOu4sGQ6hyyuH+YXoxZryuq9QDwWo9L/b1
5Cpm4IaBYBMm0ZoOkWw2lQSzGyNrXgvCEKRVC+pYQMvr5V2aEWxT/kT4guiou9D5
OXPRFN2iqZP0Q3TKcfKWRnWn3S0Ok3kZCFuXcWkL0sgwjqP/wbAPO1XNI1IImKNM
xUd0zT4vK/layYq7i20y8blglI5kcp/aKCFEwYpQC2WPeZ3Wtl1G9PQ8eze5Gc2Q
NTw3xfr6tENIcAmYoLdBdKbUq6e6pwLwXlojlZ2beR6s7LHM30AinQ==
=2Hja
-----END PGP SIGNATURE-----
Merge tag 'wireless-next-2023-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.6
The second pull request for v6.6, this time with both stack and driver
changes. Unusually we have only one major new feature but lots of
small cleanup all over, I guess this is due to people have been on
vacation the last month.
Major changes:
rtw89
- Introduce Time Averaged SAR (TAS) support
* tag 'wireless-next-2023-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (114 commits)
wifi: rtlwifi: rtl8723: Remove unused function rtl8723_cmd_send_packet()
wifi: rtw88: usb: kill and free rx urbs on probe failure
wifi: rtw89: Fix clang -Wimplicit-fallthrough in rtw89_query_sar()
wifi: rtw89: phy: modify register setting of ENV_MNTR, PHYSTS and DIG
wifi: rtw89: phy: add phy_gen_def::cr_base to support WiFi 7 chips
wifi: rtw89: mac: define register address of rx_filter to generalize code
wifi: rtw89: mac: define internal memory address for WiFi 7 chip
wifi: rtw89: mac: generalize code to indirectly access WiFi internal memory
wifi: rtw89: mac: add mac_gen_def::band1_offset to map MAC band1 register address
wifi: wlcore: sdio: Use module_sdio_driver macro to simplify the code
wifi: rtw89: initialize multi-channel handling
wifi: rtw89: provide functions to configure NoA for beacon update
wifi: rtw89: call rtw89_chan_get() by vif chanctx if aware of vif
wifi: rtw89: sar: let caller decide the center frequency to query
wifi: rtw89: refine rtw89_correct_cck_chan() by rtw89_hw_to_nl80211_band()
wifi: rtw89: add function prototype for coex request duration
Fix nomenclature for USB and PCI wireless devices
wifi: ath: Use is_multicast_ether_addr() to check multicast Ether address
wifi: ath12k: Remove unused declarations
wifi: ath12k: add check max message length while scanning with extraie
...
====================
Link: https://lore.kernel.org/r/20230825132230.A0833C433C8@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>