62a6183c13
9096 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Maciej Fijalkowski
|
fbadd83a61 |
xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
XSK ZC Rx path calculates the size of data that will be posted to XSK Rx
queue via subtracting xdp_buff::data_end from xdp_buff::data.
In bpf_xdp_frags_increase_tail(), when underlying memory type of
xdp_rxq_info is MEM_TYPE_XSK_BUFF_POOL, add offset to data_end in tail
fragment, so that later on user space will be able to take into account
the amount of bytes added by XDP program.
Fixes:
|
||
Maciej Fijalkowski
|
c5114710c8 |
xsk: fix usage of multi-buffer BPF helpers for ZC XDP
Currently when packet is shrunk via bpf_xdp_adjust_tail() and memory
type is set to MEM_TYPE_XSK_BUFF_POOL, null ptr dereference happens:
[1136314.192256] BUG: kernel NULL pointer dereference, address:
0000000000000034
[1136314.203943] #PF: supervisor read access in kernel mode
[1136314.213768] #PF: error_code(0x0000) - not-present page
[1136314.223550] PGD 0 P4D 0
[1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI
[1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257
[1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT,
BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210
[1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86
[1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246
[1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX:
0000000000000000
[1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI:
ffffc9003168c000
[1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09:
0000000000010000
[1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12:
0000000000000001
[1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15:
0000000000000001
[1136314.373298] FS: 00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000)
knlGS:0000000000000000
[1136314.386105] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4:
00000000007706f0
[1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1136314.431890] PKRU: 55555554
[1136314.439143] Call Trace:
[1136314.446058] <IRQ>
[1136314.452465] ? __die+0x20/0x70
[1136314.459881] ? page_fault_oops+0x15b/0x440
[1136314.468305] ? exc_page_fault+0x6a/0x150
[1136314.476491] ? asm_exc_page_fault+0x22/0x30
[1136314.484927] ? __xdp_return+0x6c/0x210
[1136314.492863] bpf_xdp_adjust_tail+0x155/0x1d0
[1136314.501269] bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60
[1136314.511263] ice_clean_rx_irq_zc+0x206/0xc60 [ice]
[1136314.520222] ? ice_xmit_zc+0x6e/0x150 [ice]
[1136314.528506] ice_napi_poll+0x467/0x670 [ice]
[1136314.536858] ? ttwu_do_activate.constprop.0+0x8f/0x1a0
[1136314.546010] __napi_poll+0x29/0x1b0
[1136314.553462] net_rx_action+0x133/0x270
[1136314.561619] __do_softirq+0xbe/0x28e
[1136314.569303] do_softirq+0x3f/0x60
This comes from __xdp_return() call with xdp_buff argument passed as
NULL which is supposed to be consumed by xsk_buff_free() call.
To address this properly, in ZC case, a node that represents the frag
being removed has to be pulled out of xskb_list. Introduce
appropriate xsk helpers to do such node operation and use them
accordingly within bpf_xdp_adjust_tail().
Fixes:
|
||
Jakub Kicinski
|
d09486a04f |
net: fix removing a namespace with conflicting altnames
Mark reports a BUG() when a net namespace is removed.
kernel BUG at net/core/dev.c:11520!
Physical interfaces moved outside of init_net get "refunded"
to init_net when that namespace disappears. The main interface
name may get overwritten in the process if it would have
conflicted. We need to also discard all conflicting altnames.
Recent fixes addressed ensuring that altnames get moved
with the main interface, which surfaced this problem.
Reported-by: Марк Коренберг <socketpair@gmail.com>
Link: https://lore.kernel.org/all/CAEmTpZFZ4Sv3KwqFOY2WKDHeZYdi0O7N5H1nTvcGp=SAEavtDg@mail.gmail.com/
Fixes:
|
||
Eric Dumazet
|
a54d51fb2d |
udp: fix busy polling
Generic sk_busy_loop_end() only looks at sk->sk_receive_queue
for presence of packets.
Problem is that for UDP sockets after blamed commit, some packets
could be present in another queue: udp_sk(sk)->reader_queue
In some cases, a busy poller could spin until timeout expiration,
even if some packets are available in udp_sk(sk)->reader_queue.
v3: - make sk_busy_loop_end() nicer (Willem)
v2: - add a READ_ONCE(sk->sk_family) in sk_is_inet() to avoid KCSAN splats.
- add a sk_is_inet() check in sk_is_udp() (Willem feedback)
- add a sk_is_inet() check in sk_is_tcp().
Fixes:
|
||
Zhengchao Shao
|
198bc90e0e |
tcp: make sure init the accept_queue's spinlocks once
When I run syz's reproduction C program locally, it causes the following issue: pvqspinlock: lock 0xffff9d181cd5c660 has corrupted value 0x0! WARNING: CPU: 19 PID: 21160 at __pv_queued_spin_unlock_slowpath (kernel/locking/qspinlock_paravirt.h:508) Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:__pv_queued_spin_unlock_slowpath (kernel/locking/qspinlock_paravirt.h:508) Code: 73 56 3a ff 90 c3 cc cc cc cc 8b 05 bb 1f 48 01 85 c0 74 05 c3 cc cc cc cc 8b 17 48 89 fe 48 c7 c7 30 20 ce 8f e8 ad 56 42 ff <0f> 0b c3 cc cc cc cc 0f 0b 0f 1f 40 00 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffa8d200604cb8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9d1ef60e0908 RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9d1ef60e0900 RBP: ffff9d181cd5c280 R08: 0000000000000000 R09: 00000000ffff7fff R10: ffffa8d200604b68 R11: ffffffff907dcdc8 R12: 0000000000000000 R13: ffff9d181cd5c660 R14: ffff9d1813a3f330 R15: 0000000000001000 FS: 00007fa110184640(0000) GS:ffff9d1ef60c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000000 CR3: 000000011f65e000 CR4: 00000000000006f0 Call Trace: <IRQ> _raw_spin_unlock (kernel/locking/spinlock.c:186) inet_csk_reqsk_queue_add (net/ipv4/inet_connection_sock.c:1321) inet_csk_complete_hashdance (net/ipv4/inet_connection_sock.c:1358) tcp_check_req (net/ipv4/tcp_minisocks.c:868) tcp_v4_rcv (net/ipv4/tcp_ipv4.c:2260) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205) ip_local_deliver_finish (net/ipv4/ip_input.c:234) __netif_receive_skb_one_core (net/core/dev.c:5529) process_backlog (./include/linux/rcupdate.h:779) __napi_poll (net/core/dev.c:6533) net_rx_action (net/core/dev.c:6604) __do_softirq (./arch/x86/include/asm/jump_label.h:27) do_softirq (kernel/softirq.c:454 kernel/softirq.c:441) </IRQ> <TASK> __local_bh_enable_ip (kernel/softirq.c:381) __dev_queue_xmit (net/core/dev.c:4374) ip_finish_output2 (./include/net/neighbour.h:540 net/ipv4/ip_output.c:235) __ip_queue_xmit (net/ipv4/ip_output.c:535) __tcp_transmit_skb (net/ipv4/tcp_output.c:1462) tcp_rcv_synsent_state_process (net/ipv4/tcp_input.c:6469) tcp_rcv_state_process (net/ipv4/tcp_input.c:6657) tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1929) __release_sock (./include/net/sock.h:1121 net/core/sock.c:2968) release_sock (net/core/sock.c:3536) inet_wait_for_connect (net/ipv4/af_inet.c:609) __inet_stream_connect (net/ipv4/af_inet.c:702) inet_stream_connect (net/ipv4/af_inet.c:748) __sys_connect (./include/linux/file.h:45 net/socket.c:2064) __x64_sys_connect (net/socket.c:2073 net/socket.c:2070 net/socket.c:2070) do_syscall_64 (arch/x86/entry/common.c:51 arch/x86/entry/common.c:82) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) RIP: 0033:0x7fa10ff05a3d Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48 RSP: 002b:00007fa110183de8 EFLAGS: 00000202 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000020000054 RCX: 00007fa10ff05a3d RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000003 RBP: 00007fa110183e20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fa110184640 R13: 0000000000000000 R14: 00007fa10fe8b060 R15: 00007fff73e23b20 </TASK> The issue triggering process is analyzed as follows: Thread A Thread B tcp_v4_rcv //receive ack TCP packet inet_shutdown tcp_check_req tcp_disconnect //disconnect sock ... tcp_set_state(sk, TCP_CLOSE) inet_csk_complete_hashdance ... inet_csk_reqsk_queue_add inet_listen //start listen spin_lock(&queue->rskq_lock) inet_csk_listen_start ... reqsk_queue_alloc ... spin_lock_init spin_unlock(&queue->rskq_lock) //warning When the socket receives the ACK packet during the three-way handshake, it will hold spinlock. And then the user actively shutdowns the socket and listens to the socket immediately, the spinlock will be initialized. When the socket is going to release the spinlock, a warning is generated. Also the same issue to fastopenq.lock. Move init spinlock to inet_create and inet_accept to make sure init the accept_queue's spinlocks once. Fixes: |
||
Linus Torvalds
|
736b5545d3 |
Including fixes from bpf and netfilter.
Previous releases - regressions: - Revert "net: rtnetlink: Enslave device before bringing it up", breaks the case inverse to the one it was trying to fix - net: dsa: fix oob access in DSA's netdevice event handler dereference netdev_priv() before check its a DSA port - sched: track device in tcf_block_get/put_ext() only for clsact binder types - net: tls, fix WARNING in __sk_msg_free when record becomes full during splice and MORE hint set - sfp-bus: fix SFP mode detect from bitrate - drv: stmmac: prevent DSA tags from breaking COE Previous releases - always broken: - bpf: fix no forward progress in in bpf_iter_udp if output buffer is too small - bpf: reject variable offset alu on registers with a type of PTR_TO_FLOW_KEYS to prevent oob access - netfilter: tighten input validation - net: add more sanity check in virtio_net_hdr_to_skb() - rxrpc: fix use of Don't Fragment flag on RESPONSE packets, avoid infinite loop - amt: do not use the portion of skb->cb area which may get clobbered - mptcp: improve validation of the MPTCPOPT_MP_JOIN MCTCP option Misc: - spring cleanup of inactive maintainers Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmWpnvoACgkQMUZtbf5S Irvskg/+Or5tETxOmpQXxnj6ECZyrSp0Jcyd7+TIcos/7JfPdn3Kebl004SG4h/s bwKDOIIP1iSjQ+0NFsPjyYIVd6wFuCElSB7npV5uQAT6ptXx7A4Ym68/rVxodI8T 6hiYV/mlPuZF8JjRhtp/VJL8sY1qnG7RIUB4oH3y9HQNfwZX0lIWChuUilHuWfbq zQ2Iu97tMkoIBjXrkIT3Qaj0aFxYbjCOrg9zy+FZ69a7Rmrswr//7amlCH6saNTx Ku7Wl8FXhe7O23OiM6GSl7AechSM1aJ5kOS3orseej0+aSp9eH3ekYGmbsQr6sjz ix/eZ7V7SUkJK3bEH5haeymk4TDV3lHE8SziMbosK4wVbHOyPwEmqCxppADYJLZs WycHZKcTBluFBOxknAofH7m5Hh0ToXkeTfpptSSGtRB4WncAOMsMapr3yS4WXg/q AnOo/tzCBgMrnSJtD/kjqgUiCk8vYoLc8lBR9K74l0zqI1sf13OfuTHvEgqIS6z1 Ir/ewlAV6fCH8gQbyzjKUVlyjZS4+vFv19xg/2GgLf+LdyzcCOxUZkND3/DE6+OA Dgf9gtABYU4hGXMUfTfml3KCBTF65QmY8dIh17zraNylYUHEJ2lI4D+sdiqWUrXb mXPBJh4nOPwIV5t2gT80skNwF3aWPr6l4ieY2codSbP04rO74S8= =YhQQ -----END PGP SIGNATURE----- Merge tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf and netfilter. Previous releases - regressions: - Revert "net: rtnetlink: Enslave device before bringing it up", breaks the case inverse to the one it was trying to fix - net: dsa: fix oob access in DSA's netdevice event handler dereference netdev_priv() before check its a DSA port - sched: track device in tcf_block_get/put_ext() only for clsact binder types - net: tls, fix WARNING in __sk_msg_free when record becomes full during splice and MORE hint set - sfp-bus: fix SFP mode detect from bitrate - drv: stmmac: prevent DSA tags from breaking COE Previous releases - always broken: - bpf: fix no forward progress in in bpf_iter_udp if output buffer is too small - bpf: reject variable offset alu on registers with a type of PTR_TO_FLOW_KEYS to prevent oob access - netfilter: tighten input validation - net: add more sanity check in virtio_net_hdr_to_skb() - rxrpc: fix use of Don't Fragment flag on RESPONSE packets, avoid infinite loop - amt: do not use the portion of skb->cb area which may get clobbered - mptcp: improve validation of the MPTCPOPT_MP_JOIN MCTCP option Misc: - spring cleanup of inactive maintainers" * tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) i40e: Include types.h to some headers ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes selftests: mlxsw: qos_pfc: Remove wrong description mlxsw: spectrum_router: Register netdevice notifier before nexthop mlxsw: spectrum_acl_tcam: Fix stack corruption mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure ethtool: netlink: Add missing ethnl_ops_begin/complete selftests: bonding: Add more missing config options selftests: netdevsim: add a config file libbpf: warn on unexpected __arg_ctx type when rewriting BTF selftests/bpf: add tests confirming type logic in kernel for __arg_ctx bpf: enforce types for __arg_ctx-tagged arguments in global subprogs bpf: extract bpf_ctx_convert_map logic and make it more reusable libbpf: feature-detect arg:ctx tag support in kernel ipvs: avoid stat macros calls from preemptible context netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description netfilter: nf_tables: skip dead set elements in netlink dump netfilter: nf_tables: do not allow mismatch field size and set key length ... |
||
Nicolas Dichtel
|
ec4ffd100f |
Revert "net: rtnetlink: Enslave device before bringing it up"
This reverts commit |
||
Linus Torvalds
|
4c72e2b8c4 |
for-6.8/io_uring-2024-01-08
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWcOk0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpq2wEACgP1Gvfb2cR65B/6tkLJ6neKnYOPVSYx9F 4Mv6KS306vmOsE67LqynhtAe/Hgfv0NKADN+mKLLV8IdwhneLAwFxRoHl8QPO2dW DIxohDMLIqzY/SOUBqBHMS8b5lEr79UVh8vXjAuuYCBTcHBndi4hj0S0zkwf5YiK Vma6ty+TnzmOv+c6+OCgZZlwFhkxD1pQBM5iOlFRAkdt3Er/2e8FqthK0/od7Fgy Ii9xdxYZU9KTsM4R+IWhqZ1rj1qUdtMPSGzFRbzFt3b6NdANRZT9MUlxvSkuHPIE P/9anpmgu41gd2JbHuyVnw+4jdZMhhZUPUYtphmQ5n35rcFZjv1gnJalH/Nw/DTE 78bDxxP0+35bkuq5MwHfvA3imVsGnvnFx5MlZBoqvv+VK4S3Q6E2+ARQZdiG+2Is 8Uyjzzt0nq2waUO3H1JLkgiM9J9LFqaYQLE68u569m4NCfRSpoZva9KVmNpre2K0 9WdGy2gfzaYAQkGud2TULzeLiEbWfvZAL0o43jYXkVPRKmnffCEphf2X6C2IDV/3 5ZR4bandRRC6DVnE+8n1MB06AYtpJPB/w3rplON+l/V5Gnb7wRNwiUrBr/F15OOy OPbAnP6k56wY/JRpgzdbNqwGi8EvGWX6t3Kqjp2mczhs2as8M193FKS2xppl1TYl BPyTsAfbdQ== =5v/U -----END PGP SIGNATURE----- Merge tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: "Mostly just come fixes and cleanups, but one feature as well. In detail: - Harden the check for handling IOPOLL based on return (Pavel) - Various minor optimizations (Pavel) - Drop remnants of SCM_RIGHTS fd passing support, now that it's no longer supported since 6.7 (me) - Fix for a case where bytes_done wasn't initialized properly on a failure condition for read/write requests (me) - Move the register related code to a separate file (me) - Add support for returning the provided ring buffer head (me) - Add support for adding a direct descriptor to the normal file table (me, Christian Brauner) - Fix for ensuring pending task_work for a ring with DEFER_TASKRUN is run even if we timeout waiting (me)" * tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux: io_uring: ensure local task_work is run on wait timeout io_uring/kbuf: add method for returning provided buffer ring head io_uring/rw: ensure io->bytes_done is always initialized io_uring: drop any code related to SCM_RIGHTS io_uring/unix: drop usage of io_uring socket io_uring/register: move io_uring_register(2) related code to register.c io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL io_uring/cmd: inline io_uring_cmd_get_task io_uring/cmd: inline io_uring_cmd_do_in_task_lazy io_uring: split out cmd api into a separate header io_uring: optimise ltimeout for inline execution io_uring: don't check iopoll if request completes |
||
Linus Torvalds
|
3e7aeb78ab |
Networking changes for 6.8.
Core & protocols ---------------- - Analyze and reorganize core networking structs (socks, netdev, netns, mibs) to optimize cacheline consumption and set up build time warnings to safeguard against future header changes. This improves TCP performances with many concurrent connections up to 40%. - Add page-pool netlink-based introspection, exposing the memory usage and recycling stats. This helps indentify bad PP users and possible leaks. - Refine TCP/DCCP source port selection to no longer favor even source port at connect() time when IP_LOCAL_PORT_RANGE is set. This lowers the time taken by connect() for hosts having many active connections to the same destination. - Refactor the TCP bind conflict code, shrinking related socket structs. - Refactor TCP SYN-Cookie handling, as a preparation step to allow arbitrary SYN-Cookie processing via eBPF. - Tune optmem_max for 0-copy usage, increasing the default value to 128KB and namespecifying it. - Allow coalescing for cloned skbs coming from page pools, improving RX performances with some common configurations. - Reduce extension header parsing overhead at GRO time. - Add bridge MDB bulk deletion support, allowing user-space to request the deletion of matching entries. - Reorder nftables struct members, to keep data accessed by the datapath first. - Introduce TC block ports tracking and use. This allows supporting multicast-like behavior at the TC layer. - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and classifiers (RSVP and tcindex). - More data-race annotations. - Extend the diag interface to dump TCP bound-only sockets. - Conditional notification of events for TC qdisc class and actions. - Support for WPAN dynamic associations with nearby devices, to form a sub-network using a specific PAN ID. - Implement SMCv2.1 virtual ISM device support. - Add support for Batman-avd mulicast packet type. BPF --- - Tons of verifier improvements: - BPF register bounds logic and range support along with a large test suite - log improvements - complete precision tracking support for register spills - track aligned STACK_ZERO cases as imprecise spilled registers. It improves the verifier "instructions processed" metric from single digit to 50-60% for some programs - support for user's global BPF subprogram arguments with few commonly requested annotations for a better developer experience - support tracking of BPF_JNE which helps cases when the compiler transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like - several fixes - Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload. - Fix kCFI bugs in BPF all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y. - Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques. - Support uid/gid options when mounting bpffs. - Add a new kfunc which acquires the associated cgroup of a task within a specific cgroup v1 hierarchy where the latter is identified by its id. - Extend verifier to allow bpf_refcount_acquire() of a map value field obtained via direct load which is a use-case needed in sched_ext. - Add BPF link_info support for uprobe multi link along with bpftool integration for the latter. - Support for VLAN tag in XDP hints. - Remove deprecated bpfilter kernel leftovers given the project is developed in user-space (https://github.com/facebook/bpfilter). Misc ---- - Support for parellel TC self-tests execution. - Increase MPTCP self-tests coverage. - Updated the bridge documentation, including several so-far undocumented features. - Convert all the net self-tests to run in unique netns, to avoid random failures due to conflict and allow concurrent runs. - Add TCP-AO self-tests. - Add kunit tests for both cfg80211 and mac80211. - Autogenerate Netlink families documentation from YAML spec. - Add yml-gen support for fixed headers and recursive nests, the tool can now generate user-space code for all genetlink families for which we have specs. - A bunch of additional module descriptions fixes. - Catch incorrect freeing of pages belonging to a page pool. Driver API ---------- - Rust abstractions for network PHY drivers; do not cover yet the full C API, but already allow implementing functional PHY drivers in rust. - Introduce queue and NAPI support in the netdev Netlink interface, allowing complete access to the device <> NAPIs <> queues relationship. - Introduce notifications filtering for devlink to allow control application scale to thousands of instances. - Improve PHY validation, requesting rate matching information for each ethtool link mode supported by both the PHY and host. - Add support for ethtool symmetric-xor RSS hash. - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD platform. - Expose pin fractional frequency offset value over new DPLL generic netlink attribute. - Convert older drivers to platform remove callback returning void. - Add support for PHY package MMD read/write. New hardware / drivers ---------------------- - Ethernet: - Octeon CN10K devices - Broadcom 5760X P7 - Qualcomm SM8550 SoC - Texas Instrument DP83TG720S PHY - Bluetooth: - IMC Networks Bluetooth radio Removed ------- - WiFi: - libertas 16-bit PCMCIA support - Atmel at76c50x drivers - HostAP ISA/PCMCIA style 802.11b driver - zd1201 802.11b USB dongles - Orinoco ISA/PCMCIA 802.11b driver - Aviator/Raytheon driver - Planet WL3501 driver - RNDIS USB 802.11b driver Drivers ------- - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - allow one by one port representors creation and removal - add temperature and clock information reporting - add get/set for ethtool's header split ringparam - add again FW logging - adds support switchdev hardware packet mirroring - iavf: implement symmetric-xor RSS hash - igc: add support for concurrent physical and free-running timers - i40e: increase the allowable descriptors - nVidia/Mellanox: - Preparation for Socket-Direct multi-dev netdev. That will allow in future releases combining multiple PFs devices attached to different NUMA nodes under the same netdev - Broadcom (bnxt): - TX completion handling improvements - add basic ntuple filter support - reduce MSIX vectors usage for MQPRIO offload - add VXLAN support, USO offload and TX coalesce completion for P7 - Marvell Octeon EP: - xmit-more support - add PF-VF mailbox support and use it for FW notifications for VFs - Wangxun (ngbe/txgbe): - implement ethtool functions to operate pause param, ring param, coalesce channel number and msglevel - Netronome/Corigine (nfp): - add flow-steering support - support UDP segmentation offload - Ethernet NICs embedded, slower, virtual: - Xilinx AXI: remove duplicate DMA code adopting the dma engine driver - stmmac: add support for HW-accelerated VLAN stripping - TI AM654x sw: add mqprio, frame preemption & coalescing - gve: add support for non-4k page sizes. - virtio-net: support dynamic coalescing moderation - nVidia/Mellanox Ethernet datacenter switches: - allow firmware upgrade without a reboot - more flexible support for bridge flooding via the compressed FID flooding mode - Ethernet embedded switches: - Microchip: - fine-tune flow control and speed configurations in KSZ8xxx - KSZ88X3: enable setting rmii reference - Renesas: - add jumbo frames support - Marvell: - 88E6xxx: add "eth-mac" and "rmon" stats support - Ethernet PHYs: - aquantia: add firmware load support - at803x: refactor the driver to simplify adding support for more chip variants - NXP C45 TJA11xx: Add MACsec offload support - Wifi: - MediaTek (mt76): - NVMEM EEPROM improvements - mt7996 Extremely High Throughput (EHT) improvements - mt7996 Wireless Ethernet Dispatcher (WED) support - mt7996 36-bit DMA support - Qualcomm (ath12k): - support for a single MSI vector - WCN7850: support AP mode - Intel (iwlwifi): - new debugfs file fw_dbg_clear - allow concurrent P2P operation on DFS channels - Bluetooth: - QCA2066: support HFP offload - ISO: more broadcast-related improvements - NXP: better recovery in case receiver/transmitter get out of sync Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmWdamsSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkGC4P/2xjLzdw22ckSssuE9ORbGko9SNjnqHk PQh1E+26BHiCg5KB8VvzMsL78E79MRNXEattSW+1g7dhCvln3oi+Vd0WkdRkgt35 98Iv18zLbbwFAJeyKvmLAPAkQkMLtVj19QILBBRrugF+egEZgVSE3JBcTAiKv2ZQ HzkabA171Ri6LpCcEEtY5XuaKvimGnGzF8YMFf8rX0wtqd2p5kbY9aMe47WAGxvU Vf9548XvH+A5yVH2/4/gujtUOpA/RHuhuCMb+oo0cZ+VCC1x9MGzoXzj6r87OTkf k2W1whNzcGoin92f+9Lk1JYMuiGKBH4QVaDdNXJnYFSJWPTE7RvRsPzYTSD4/GzK yEZbzSJXpy/2vDQm16NoAxl7evRs8Sorzkw4LQRviZHI/5SAkK2ZQiCK5CO8QSYy C1LELcV5kn6Foe24xWnrWLjAGug9oJnYoGPMU5gvPmFJMvUMXqm5rmbBgUWL5Rxw q1M6gVzabCyWUy6z2G2vaqW2ZntNVvCkdsLtIX0XZkcTzNoP0MA+TuhyGz4wbiuo PeyQp/mbGnDgCYggqKIA0YWrTVxkhFrKN520cbO8qXBQytV9oFbM/0/+C0/r/5WX pL1JVzLrh6l5ME7EIQfha8UOF9j8q4ueSwb40P3AR2NaZiDABM0zfUZ6+sx+91WF ucqPEcZB5cRE =1bW6 -----END PGP SIGNATURE----- Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "The most interesting thing is probably the networking structs reorganization and a significant amount of changes is around self-tests. Core & protocols: - Analyze and reorganize core networking structs (socks, netdev, netns, mibs) to optimize cacheline consumption and set up build time warnings to safeguard against future header changes This improves TCP performances with many concurrent connections up to 40% - Add page-pool netlink-based introspection, exposing the memory usage and recycling stats. This helps indentify bad PP users and possible leaks - Refine TCP/DCCP source port selection to no longer favor even source port at connect() time when IP_LOCAL_PORT_RANGE is set. This lowers the time taken by connect() for hosts having many active connections to the same destination - Refactor the TCP bind conflict code, shrinking related socket structs - Refactor TCP SYN-Cookie handling, as a preparation step to allow arbitrary SYN-Cookie processing via eBPF - Tune optmem_max for 0-copy usage, increasing the default value to 128KB and namespecifying it - Allow coalescing for cloned skbs coming from page pools, improving RX performances with some common configurations - Reduce extension header parsing overhead at GRO time - Add bridge MDB bulk deletion support, allowing user-space to request the deletion of matching entries - Reorder nftables struct members, to keep data accessed by the datapath first - Introduce TC block ports tracking and use. This allows supporting multicast-like behavior at the TC layer - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and classifiers (RSVP and tcindex) - More data-race annotations - Extend the diag interface to dump TCP bound-only sockets - Conditional notification of events for TC qdisc class and actions - Support for WPAN dynamic associations with nearby devices, to form a sub-network using a specific PAN ID - Implement SMCv2.1 virtual ISM device support - Add support for Batman-avd mulicast packet type BPF: - Tons of verifier improvements: - BPF register bounds logic and range support along with a large test suite - log improvements - complete precision tracking support for register spills - track aligned STACK_ZERO cases as imprecise spilled registers. This improves the verifier "instructions processed" metric from single digit to 50-60% for some programs - support for user's global BPF subprogram arguments with few commonly requested annotations for a better developer experience - support tracking of BPF_JNE which helps cases when the compiler transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like - several fixes - Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload - Fix kCFI bugs in BPF all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y - Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques - Support uid/gid options when mounting bpffs - Add a new kfunc which acquires the associated cgroup of a task within a specific cgroup v1 hierarchy where the latter is identified by its id - Extend verifier to allow bpf_refcount_acquire() of a map value field obtained via direct load which is a use-case needed in sched_ext - Add BPF link_info support for uprobe multi link along with bpftool integration for the latter - Support for VLAN tag in XDP hints - Remove deprecated bpfilter kernel leftovers given the project is developed in user-space (https://github.com/facebook/bpfilter) Misc: - Support for parellel TC self-tests execution - Increase MPTCP self-tests coverage - Updated the bridge documentation, including several so-far undocumented features - Convert all the net self-tests to run in unique netns, to avoid random failures due to conflict and allow concurrent runs - Add TCP-AO self-tests - Add kunit tests for both cfg80211 and mac80211 - Autogenerate Netlink families documentation from YAML spec - Add yml-gen support for fixed headers and recursive nests, the tool can now generate user-space code for all genetlink families for which we have specs - A bunch of additional module descriptions fixes - Catch incorrect freeing of pages belonging to a page pool Driver API: - Rust abstractions for network PHY drivers; do not cover yet the full C API, but already allow implementing functional PHY drivers in rust - Introduce queue and NAPI support in the netdev Netlink interface, allowing complete access to the device <> NAPIs <> queues relationship - Introduce notifications filtering for devlink to allow control application scale to thousands of instances - Improve PHY validation, requesting rate matching information for each ethtool link mode supported by both the PHY and host - Add support for ethtool symmetric-xor RSS hash - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD platform - Expose pin fractional frequency offset value over new DPLL generic netlink attribute - Convert older drivers to platform remove callback returning void - Add support for PHY package MMD read/write New hardware / drivers: - Ethernet: - Octeon CN10K devices - Broadcom 5760X P7 - Qualcomm SM8550 SoC - Texas Instrument DP83TG720S PHY - Bluetooth: - IMC Networks Bluetooth radio Removed: - WiFi: - libertas 16-bit PCMCIA support - Atmel at76c50x drivers - HostAP ISA/PCMCIA style 802.11b driver - zd1201 802.11b USB dongles - Orinoco ISA/PCMCIA 802.11b driver - Aviator/Raytheon driver - Planet WL3501 driver - RNDIS USB 802.11b driver Driver updates: - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - allow one by one port representors creation and removal - add temperature and clock information reporting - add get/set for ethtool's header split ringparam - add again FW logging - adds support switchdev hardware packet mirroring - iavf: implement symmetric-xor RSS hash - igc: add support for concurrent physical and free-running timers - i40e: increase the allowable descriptors - nVidia/Mellanox: - Preparation for Socket-Direct multi-dev netdev. That will allow in future releases combining multiple PFs devices attached to different NUMA nodes under the same netdev - Broadcom (bnxt): - TX completion handling improvements - add basic ntuple filter support - reduce MSIX vectors usage for MQPRIO offload - add VXLAN support, USO offload and TX coalesce completion for P7 - Marvell Octeon EP: - xmit-more support - add PF-VF mailbox support and use it for FW notifications for VFs - Wangxun (ngbe/txgbe): - implement ethtool functions to operate pause param, ring param, coalesce channel number and msglevel - Netronome/Corigine (nfp): - add flow-steering support - support UDP segmentation offload - Ethernet NICs embedded, slower, virtual: - Xilinx AXI: remove duplicate DMA code adopting the dma engine driver - stmmac: add support for HW-accelerated VLAN stripping - TI AM654x sw: add mqprio, frame preemption & coalescing - gve: add support for non-4k page sizes. - virtio-net: support dynamic coalescing moderation - nVidia/Mellanox Ethernet datacenter switches: - allow firmware upgrade without a reboot - more flexible support for bridge flooding via the compressed FID flooding mode - Ethernet embedded switches: - Microchip: - fine-tune flow control and speed configurations in KSZ8xxx - KSZ88X3: enable setting rmii reference - Renesas: - add jumbo frames support - Marvell: - 88E6xxx: add "eth-mac" and "rmon" stats support - Ethernet PHYs: - aquantia: add firmware load support - at803x: refactor the driver to simplify adding support for more chip variants - NXP C45 TJA11xx: Add MACsec offload support - Wifi: - MediaTek (mt76): - NVMEM EEPROM improvements - mt7996 Extremely High Throughput (EHT) improvements - mt7996 Wireless Ethernet Dispatcher (WED) support - mt7996 36-bit DMA support - Qualcomm (ath12k): - support for a single MSI vector - WCN7850: support AP mode - Intel (iwlwifi): - new debugfs file fw_dbg_clear - allow concurrent P2P operation on DFS channels - Bluetooth: - QCA2066: support HFP offload - ISO: more broadcast-related improvements - NXP: better recovery in case receiver/transmitter get out of sync" * tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits) lan78xx: remove redundant statement in lan78xx_get_eee lan743x: remove redundant statement in lan743x_ethtool_get_eee bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer() bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel() bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter() tcp: Revert no longer abort SYN_SENT when receiving some ICMP Revert "mlx5 updates 2023-12-20" Revert "net: stmmac: Enable Per DMA Channel interrupt" ipvlan: Remove usage of the deprecated ida_simple_xx() API ipvlan: Fix a typo in a comment net/sched: Remove ipt action tests net: stmmac: Use interrupt mode INTM=1 for per channel irq net: stmmac: Add support for TX/RX channel interrupt net: stmmac: Make MSI interrupt routine generic dt-bindings: net: snps,dwmac: per channel irq net: phy: at803x: make read_status more generic net: phy: at803x: add support for cdt cross short test for qca808x net: phy: at803x: refactor qca808x cable test get status function net: phy: at803x: generalize cdt fault length function net: ethernet: cortina: Drop TSO support ... |
||
Linus Torvalds
|
fb46e22a9e |
Many singleton patches against the MM code. The patch series which
are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series "maple_tree: add mt_free_one() and mt_attr() helpers" "Some cleanups of maple tree" - In the series "mm: use memmap_on_memory semantics for dax/kmem" Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series "Add folio_zero_tail() and folio_fill_tail()" "Make folio_start_writeback return void" "Fix fault handler's handling of poisoned tail pages" "Convert aops->error_remove_page to ->error_remove_folio" "Finish two folio conversions" "More swap folio conversions" - Kefeng Wang has also contributed folio-related work in the series "mm: cleanup and use more folio in page fault" - Jim Cromie has improved the kmemleak reporting output in the series "tweak kmemleak report format". - In the series "stackdepot: allow evicting stack traces" Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series "mm: page_alloc: fixes for high atomic reserve caluculations". - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series "samples: introduce cgroup events listeners". - Some mapletree maintanance work from Liam Howlett in the series "maple_tree: iterator state changes". - Nhat Pham has improved zswap's approach to writeback in the series "workload-specific and memory pressure-driven zswap writeback". - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series "mm/damon: let users feed and tame/auto-tune DAMOS" "selftests/damon: add Python-written DAMON functionality tests" "mm/damon: misc updates for 6.8" - Yosry Ahmed has improved memcg's stats flushing in the series "mm: memcg: subtree stats flushing and thresholds". - In the series "Multi-size THP for anonymous memory" Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series "More buffer_head cleanups". - Suren Baghdasaryan has done work on Andrea Arcangeli's series "userfaultfd move option". UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a "KSM Advisor", in the series "mm/ksm: Add ksm advisor". This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series "mm/zswap: dstmem reuse optimizations and cleanups". - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is "Clean up the writeback paths". - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series "kasan: save mempool stack traces". - Andrey also performed some KASAN maintenance work in the series "kasan: assorted clean-ups". - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series "mm/rmap: interface overhaul". - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series "mm/mglru: Kconfig cleanup". - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series "Remove some lruvec page accounting functions". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU= =0NHs -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ... |
||
Linus Torvalds
|
c604110e66 |
vfs-6.8.misc
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc= =0bRl -----END PGP SIGNATURE----- Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ... |
||
Zhengchao Shao
|
c4a5ee9c09 |
fib: rules: remove repeated assignment in fib_nl2rule
In fib_nl2rule(), 'err' variable has been set to -EINVAL during declaration, and no need to set the 'err' variable to -EINVAL again. So, remove it. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Jakub Kicinski
|
e63c1822ac |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt.c |
||
Jakub Kicinski
|
fe1eb24bd5 |
Revert "Introduce PHY listing and link_topology tracking"
This reverts commit |
||
Thomas Lange
|
382a32018b |
net: Implement missing SO_TIMESTAMPING_NEW cmsg support
Commit |
||
Eric Dumazet
|
d3d344a1ca |
net-device: move xdp_prog to net_device_read_rx
xdp_prog is used in receive path, both from XDP enabled drivers
and from netif_elide_gro().
This patch also removes two 4-bytes holes.
Fixes:
|
||
Zhengchao Shao
|
b4c1d4d973 |
fib: remove unnecessary input parameters in fib_default_rule_add
When fib_default_rule_add is invoked, the value of the input parameter 'flags' is always 0. Rules uses kzalloc to allocate memory, so 'flags' has been initialized to 0. Therefore, remove the input parameter 'flags' in fib_default_rule_add. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240102071519.3781384-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Jörn-Thorben Hinz
|
7f6ca95d16 |
net: Implement missing getsockopt(SO_TIMESTAMPING_NEW)
Commit |
||
Eric Dumazet
|
993498e537 |
net-device: move gso_partial_features to net_device_read_tx
dev->gso_partial_features is read from tx fast path for GSO packets.
Move it to appropriate section to avoid a cache line miss.
Fixes:
|
||
Maxime Chevallier
|
02018c544e |
net: phy: Introduce ethernet link topology representation
Link topologies containing multiple network PHYs attached to the same net_device can be found when using a PHY as a media converter for use with an SFP connector, on which an SFP transceiver containing a PHY can be used. With the current model, the transceiver's PHY can't be used for operations such as cable testing, timestamping, macsec offload, etc. The reason being that most of the logic for these configuration, coming from either ethtool netlink or ioctls tend to use netdev->phydev, which in multi-phy systems will reference the PHY closest to the MAC. Introduce a numbering scheme allowing to enumerate PHY devices that belong to any netdev, which can in turn allow userspace to take more precise decisions with regard to each PHY's configuration. The numbering is maintained per-netdev, in a phy_device_list. The numbering works similarly to a netdevice's ifindex, with identifiers that are only recycled once INT_MAX has been reached. This prevents races that could occur between PHY listing and SFP transceiver removal/insertion. The identifiers are assigned at phy_attach time, as the numbering depends on the netdevice the phy is attached to. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Andrey Konovalov
|
74e831af16 |
skbuff: use mempool KASAN hooks
Instead of using slab-internal KASAN hooks for poisoning and unpoisoning cached objects, use the proper mempool KASAN hooks. Also check the return value of kasan_mempool_poison_object to prevent double-free and invali-free bugs. Link: https://lkml.kernel.org/r/a3482c41395c69baa80eb59dbb06beef213d2a14.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Andrey Konovalov
|
1ce9a05239 |
kasan: rename and document kasan_(un)poison_object_data
Rename kasan_unpoison_object_data to kasan_unpoison_new_object and add a documentation comment. Do the same for kasan_poison_object_data. The new names and the comments should suggest the users that these hooks are intended for internal use by the slab allocator. The following patch will remove non-slab-internal uses of these hooks. No functional changes. [andreyknvl@google.com: update references to renamed functions in comments] Link: https://lkml.kernel.org/r/20231221180637.105098-1-andrey.konovalov@linux.dev Link: https://lkml.kernel.org/r/eab156ebbd635f9635ef67d1a4271f716994e628.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Ido Schimmel
|
cd4d7263d5 |
genetlink: Use internal flags for multicast groups
As explained in commit |
||
Kevin Hao
|
3fb65f6bc7 |
net: pktgen: Use wait_event_freezable_timeout() for freezable kthread
A freezable kernel thread can enter frozen state during freezing by either calling try_to_freeze() or using wait_event_freezable() and its variants. So for the following snippet of code in a kernel thread loop: wait_event_interruptible_timeout(); try_to_freeze(); We can change it to a simple wait_event_freezable_timeout() and then eliminate a function call. Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Radu Pirea (NXP OSS)
|
90abde49ea |
net: rename dsa_realloc_skb to skb_ensure_writable_head_tail
Rename dsa_realloc_skb to skb_ensure_writable_head_tail and move it to skbuff.c to use it as helper. Signed-off-by: Radu Pirea (NXP OSS) <radu-nicolae.pirea@oss.nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Paolo Abeni
|
56794e5358 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |
||
Paolo Abeni
|
74769d810e |
bpf-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZYQVjgAKCRDbK58LschI g/NfAP9xMBCASd22+KPu44FtPPO5DKcdG7hATXZMpb/cygF8GQEAojcZ4jztx42S F1+4RPEoxrn31oVYdtEGFY9q85ruzgA= =2XhN -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-12-21 Hi David, hi Jakub, hi Paolo, hi Eric, The following pull-request contains BPF updates for your *net* tree. We've added 3 non-merge commits during the last 5 day(s) which contain a total of 4 files changed, 45 insertions(+). The main changes are: 1) Fix a syzkaller splat which triggered an oob issue in bpf_link_show_fdinfo(), from Jiri Olsa. 2) Fix another syzkaller-found issue which triggered a NULL pointer dereference in BPF sockmap for unconnected unix sockets, from John Fastabend. bpf-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Add missing BPF_LINK_TYPE invocations bpf: sockmap, test for unconnected af_unix sock bpf: syzkaller found null ptr deref in unix_bpf proto add ==================== Link: https://lore.kernel.org/r/20231221104844.1374-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
Eric Dumazet
|
24ab059d2e |
net: check dev->gso_max_size in gso_features_check()
Some drivers might misbehave if TSO packets get too big.
GVE for instance uses a 16bit field in its TX descriptor,
and will do bad things if a packet is bigger than 2^16 bytes.
Linux TCP stack honors dev->gso_max_size, but there are
other ways for too big packets to reach an ndo_start_xmit()
handler : virtio_net, af_packet, GRO...
Add a generic check in gso_features_check() and fallback
to GSO when needed.
gso_max_size was added in the blamed commit.
Fixes:
|
||
Thomas Weißschuh
|
d6e5794b06 |
net: avoid build bug in skb extension length calculation
GCC seems to incorrectly fail to evaluate skb_ext_total_length() at
compile time under certain conditions.
The issue even occurs if all values in skb_ext_type_len[] are "0",
ruling out the possibility of an actual overflow.
As the patch has been in mainline since v6.6 without triggering the
problem it seems to be a very uncommon occurrence.
As the issue only occurs when -fno-tree-loop-im is specified as part of
CFLAGS_GCOV, disable the BUILD_BUG_ON() only when building with coverage
reporting enabled.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312171924.4FozI5FG-lkp@intel.com/
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/lkml/487cfd35-fe68-416f-9bfd-6bb417f98304@app.fastmail.com/
Fixes:
|
||
Victor Nogueira
|
b6a3c6066a |
net: sched: Make tc-related drop reason more flexible for remaining qdiscs
Incrementing on Daniel's patch[1], make tc-related drop reason more flexible for remaining qdiscs - that is, all qdiscs aside from clsact. In essence, the drop reason will be set by cls_api and act_api in case any error occurred in the data path. With that, we can give the user more detailed information so that they can distinguish between a policy drop or an error drop. [1] https://lore.kernel.org/all/20231009092655.22025-1-daniel@iogearbox.net Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Victor Nogueira
|
fb2780721c |
net: sched: Move drop_reason to struct tc_skb_cb
Move drop_reason from struct tcf_result to skb cb - more specifically to struct tc_skb_cb. With that, we'll be able to also set the drop reason for the remaining qdiscs (aside from clsact) that do not have access to tcf_result when time comes to set the skb drop reason. Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Ido Schimmel
|
2601e9c4b1 |
rtnetlink: bridge: Enable MDB bulk deletion
Now that both the common code as well as individual drivers support MDB bulk deletion, allow user space to make such requests. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Ido Schimmel
|
d8e81f1311 |
rtnetlink: bridge: Invoke MDB bulk deletion when needed
Invoke the new MDB bulk deletion device operation when the 'NLM_F_BULK' flag is set in the netlink message header. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Ido Schimmel
|
e0cd06f7fc |
rtnetlink: bridge: Use a different policy for MDB bulk delete
For MDB bulk delete we will need to validate 'MDBA_SET_ENTRY' differently compared to regular delete. Specifically, allow the ifindex to be zero (in case not filtering on bridge port) and force the address to be zero as bulk delete based on address is not supported. Do that by introducing a new policy and choosing the correct policy based on the presence of the 'NLM_F_BULK' flag in the netlink message header. Use nlmsg_parse() for strict validation. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Jens Axboe
|
a4104821ad |
io_uring/unix: drop usage of io_uring socket
Since we no longer allow sending io_uring fds over SCM_RIGHTS, move to using io_is_uring_fops() to detect whether this is a io_uring fd or not. With that done, kill off io_uring_get_socket() as nobody calls it anymore. This is in preparation to yanking out the rest of the core related to unix gc with io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
Andrii Nakryiko
|
d17aff807f |
Revert BPF token-related functionality
This patch includes the following revert (one conflicting BPF FS patch and three token patch sets, represented by merge commits): - revert |
||
Jakub Kicinski
|
2130c519a4 |
bpf: Use nla_ok() instead of checking nla_len directly
nla_len may also be too short to be sane, in which case after
recent changes nla_len() will return a wrapped value.
Fixes:
|
||
Jakub Kicinski
|
c49b292d03 |
netdev
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmWAz2EACgkQ6rmadz2v bToqrw/9EwroZCc8GEHOKAlb/fzrMvn92rLo0ZW/cGN84QJPnx4zM6Zo0+fgLaaN oqqztwMUwdzGC3uX3FfVXaaLKbJ/MeHeL9BXFZNW8zkRHciw4R7kIBhOdPnHyET7 uT+rQ4xPe1Mt7e9PjepKlSL5mEsxWfBkdUgsdn19Z2Vjdfr9mZMhYWYMJGcfTCD1 TwxHKBPhq5fN3IsshmMBB8IrRp1HStUKb65MgZ4dI22LJXxTsFkx5XMFXcmuqvkH NhKj8jDcPEEh31bYcb6aG2Z4onw5F2lquygjk1Qyy5cyw45m/ipJKAXKdAyvJG+R VZCWOET/9wbRwFSK5wxwihCuKghFiofK52i2PcGtXZh0PCouyZZneSJOKM0yVWKO BvuJBxK4ETRnQyN6ZxhuJiEXG3/YMBBhyR2TX1LntVK9ct/k7qFVzATG49J39/sR SYMbptBRj4a5oMJ1qn0nFVEDFkg0jTnTDNnsEpcz60Ayt6EsJ1XosO5yz2huf861 xgRMTKMseyG1/uV45tQ8ZPzbSPpBxjUi9Dl3coYsIm1a+y6clWUXcarONY5KVrpS CR98DuFgl+E7dXuisd/Kz2p2KxxSPq8nytsmLlgOvrUqhwiXqB+TKN8EHgIapVOt l1A5LrzXFTcGlT9MlaWBqEIy83Bu1nqQqbxrAFOE0k8A5jomXaw= =stU2 -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-12-18 This PR is larger than usual and contains changes in various parts of the kernel. The main changes are: 1) Fix kCFI bugs in BPF, from Peter Zijlstra. End result: all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y. 2) Introduce BPF token object, from Andrii Nakryiko. It adds an ability to delegate a subset of BPF features from privileged daemon (e.g., systemd) through special mount options for userns-bound BPF FS to a trusted unprivileged application. The design accommodates suggestions from Christian Brauner and Paul Moore. Example: $ sudo mkdir -p /sys/fs/bpf/token $ sudo mount -t bpf bpffs /sys/fs/bpf/token \ -o delegate_cmds=prog_load:MAP_CREATE \ -o delegate_progs=kprobe \ -o delegate_attachs=xdp 3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei. - Complete precision tracking support for register spills - Fix verification of possibly-zero-sized stack accesses - Fix access to uninit stack slots - Track aligned STACK_ZERO cases as imprecise spilled registers. It improves the verifier "instructions processed" metric from single digit to 50-60% for some programs. - Fix verifier retval logic 4) Support for VLAN tag in XDP hints, from Larysa Zaremba. 5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu. End result: better memory utilization and lower I$ miss for calls to BPF via BPF trampoline. 6) Fix race between BPF prog accessing inner map and parallel delete, from Hou Tao. 7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu. It allows BPF interact with IPSEC infra. The intent is to support software RSS (via XDP) for the upcoming ipsec pcpu work. Experiments on AWS demonstrate single tunnel pcpu ipsec reaching line rate on 100G ENA nics. 8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao. 9) BPF file verification via fsverity, from Song Liu. It allows BPF progs get fsverity digest. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits) bpf: Ensure precise is reset to false in __mark_reg_const_zero() selftests/bpf: Add more uprobe multi fail tests bpf: Fail uprobe multi link with negative offset selftests/bpf: Test the release of map btf s390/bpf: Fix indirect trampoline generation selftests/bpf: Temporarily disable dummy_struct_ops test on s390 x86/cfi,bpf: Fix bpf_exception_cb() signature bpf: Fix dtor CFI cfi: Add CFI_NOSEAL() x86/cfi,bpf: Fix bpf_struct_ops CFI x86/cfi,bpf: Fix bpf_callback_t CFI x86/cfi,bpf: Fix BPF JIT call cfi: Flip headers selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment bpf: Limit the number of kprobes when attaching program to multiple kprobes bpf: Limit the number of uprobes when attaching program to multiple uprobes bpf: xdp: Register generic_kfunc_set with XDP programs selftests/bpf: utilize string values for delegate_xxx mount options ... ==================== Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Pedro Tammela
|
174523479a |
net: rtnl: use rcu_replace_pointer_rtnl in rtnl_unregister_*
With the introduction of the rcu_replace_pointer_rtnl helper, cleanup the rtnl_unregister_* functions to use the helper instead of open coding it. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Liang Chen
|
f7dc3248dc |
skbuff: Optimization of SKB coalescing for page pool
In order to address the issues encountered with commit
|
||
Liang Chen
|
8cfa2dee32 |
skbuff: Add a function to check if a page belongs to page_pool
Wrap code for checking if a page is a page_pool page into a function for better readability and ease of reuse. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Mina Almasry <almarsymina@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Liang Chen
|
aaf153aece |
page_pool: halve BIAS_MAX for multiple user references of a fragment
Up to now, we were only subtracting from the number of used page fragments to figure out when a page could be freed or recycled. A following patch introduces support for multiple users referencing the same fragment. So reduce the initial page fragments value to half to avoid overflowing. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Mina Almasry <almarsymina@google.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Eric Dumazet
|
f5769faeec |
net: Namespace-ify sysctl_optmem_max
optmem_max being used in tx zerocopy, we want to be able to control it on a netns basis. Following patch changes two tests. Tested: oqq130:~# cat /proc/sys/net/core/optmem_max 131072 oqq130:~# echo 1000000 >/proc/sys/net/core/optmem_max oqq130:~# cat /proc/sys/net/core/optmem_max 1000000 oqq130:~# unshare -n oqq130:~# cat /proc/sys/net/core/optmem_max 131072 oqq130:~# exit logout oqq130:~# cat /proc/sys/net/core/optmem_max 1000000 Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Eric Dumazet
|
4944566706 |
net: increase optmem_max default value
For many years, /proc/sys/net/core/optmem_max default value on a 64bit kernel has been 20 KB. Regular usage of TCP tx zerocopy needs a bit more. Google has used 128KB as the default value for 7 years without any problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Shigeru Yoshida
|
cac23b7d76 |
net: Return error from sk_stream_wait_connect() if sk_wait_event() fails
The following NULL pointer dereference issue occurred:
BUG: kernel NULL pointer dereference, address: 0000000000000000
<...>
RIP: 0010:ccid_hc_tx_send_packet net/dccp/ccid.h:166 [inline]
RIP: 0010:dccp_write_xmit+0x49/0x140 net/dccp/output.c:356
<...>
Call Trace:
<TASK>
dccp_sendmsg+0x642/0x7e0 net/dccp/proto.c:801
inet_sendmsg+0x63/0x90 net/ipv4/af_inet.c:846
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x83/0xe0 net/socket.c:745
____sys_sendmsg+0x443/0x510 net/socket.c:2558
___sys_sendmsg+0xe5/0x150 net/socket.c:2612
__sys_sendmsg+0xa6/0x120 net/socket.c:2641
__do_sys_sendmsg net/socket.c:2650 [inline]
__se_sys_sendmsg net/socket.c:2648 [inline]
__x64_sys_sendmsg+0x45/0x50 net/socket.c:2648
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x43/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
sk_wait_event() returns an error (-EPIPE) if disconnect() is called on the
socket waiting for the event. However, sk_stream_wait_connect() returns
success, i.e. zero, even if sk_wait_event() returns -EPIPE, so a function
that waits for a connection with sk_stream_wait_connect() may misbehave.
In the case of the above DCCP issue, dccp_sendmsg() is waiting for the
connection. If disconnect() is called in concurrently, the above issue
occurs.
This patch fixes the issue by returning error from sk_stream_wait_connect()
if sk_wait_event() fails.
Fixes:
|
||
Jakub Kicinski
|
8f674972d6 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/intel/iavf/iavf_ethtool.c |
||
Jakub Kicinski
|
c3f687d8df |
net: page_pool: factor out releasing DMA from releasing the page
Releasing the DMA mapping will be useful for other types of pages, so factor it out. Make sure compiler inlines it, to avoid any regressions. Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Liang Chen
|
0a149ab78e |
page_pool: transition to reference count management after page draining
To support multiple users referencing the same fragment, 'pp_frag_count' is renamed to 'pp_ref_count', transitioning pp pages from fragment management to reference count management after draining based on the suggestion from [1]. The idea is that the concept of fragmenting exists before the page is drained, and all related functions retain their current names. However, once the page is drained, its management shifts to being governed by 'pp_ref_count'. Therefore, all functions associated with that lifecycle stage of a pp page are renamed. [1] http://lore.kernel.org/netdev/f71d9448-70c8-8793-dc9a-0eb48a570300@huawei.com Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://lore.kernel.org/r/20231212044614.42733-2-liangchen.linux@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Eric Dumazet
|
23d05d563b |
net: prevent mss overflow in skb_segment()
Once again syzbot is able to crash the kernel in skb_segment() [1]
GSO_BY_FRAGS is a forbidden value, but unfortunately the following
computation in skb_segment() can reach it quite easily :
mss = mss * partial_segs;
65535 = 3 * 5 * 17 * 257, so many initial values of mss can lead to
a bad final result.
Make sure to limit segmentation so that the new mss value is smaller
than GSO_BY_FRAGS.
[1]
general protection fault, probably for non-canonical address 0xdffffc000000000e: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
CPU: 1 PID: 5079 Comm: syz-executor993 Not tainted 6.7.0-rc4-syzkaller-00141-g1ae4cd3cbdd0 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023
RIP: 0010:skb_segment+0x181d/0x3f30 net/core/skbuff.c:4551
Code: 83 e3 02 e9 fb ed ff ff e8 90 68 1c f9 48 8b 84 24 f8 00 00 00 48 8d 78 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 8a 21 00 00 48 8b 84 24 f8 00
RSP: 0018:ffffc900043473d0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000010046 RCX: ffffffff886b1597
RDX: 000000000000000e RSI: ffffffff886b2520 RDI: 0000000000000070
RBP: ffffc90004347578 R08: 0000000000000005 R09: 000000000000ffff
R10: 000000000000ffff R11: 0000000000000002 R12: ffff888063202ac0
R13: 0000000000010000 R14: 000000000000ffff R15: 0000000000000046
FS: 0000555556e7e380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020010000 CR3: 0000000027ee2000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
udp6_ufo_fragment+0xa0e/0xd00 net/ipv6/udp_offload.c:109
ipv6_gso_segment+0x534/0x17e0 net/ipv6/ip6_offload.c:120
skb_mac_gso_segment+0x290/0x610 net/core/gso.c:53
__skb_gso_segment+0x339/0x710 net/core/gso.c:124
skb_gso_segment include/net/gso.h:83 [inline]
validate_xmit_skb+0x36c/0xeb0 net/core/dev.c:3626
__dev_queue_xmit+0x6f3/0x3d60 net/core/dev.c:4338
dev_queue_xmit include/linux/netdevice.h:3134 [inline]
packet_xmit+0x257/0x380 net/packet/af_packet.c:276
packet_snd net/packet/af_packet.c:3087 [inline]
packet_sendmsg+0x24c6/0x5220 net/packet/af_packet.c:3119
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0xd5/0x180 net/socket.c:745
__sys_sendto+0x255/0x340 net/socket.c:2190
__do_sys_sendto net/socket.c:2202 [inline]
__se_sys_sendto net/socket.c:2198 [inline]
__x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x40/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f8692032aa9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 d1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff8d685418 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f8692032aa9
RDX: 0000000000010048 RSI: 00000000200000c0 RDI: 0000000000000003
RBP: 00000000000f4240 R08: 0000000020000540 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff8d685480
R13: 0000000000000001 R14: 00007fff8d685480 R15: 0000000000000003
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:skb_segment+0x181d/0x3f30 net/core/skbuff.c:4551
Code: 83 e3 02 e9 fb ed ff ff e8 90 68 1c f9 48 8b 84 24 f8 00 00 00 48 8d 78 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 8a 21 00 00 48 8b 84 24 f8 00
RSP: 0018:ffffc900043473d0 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000010046 RCX: ffffffff886b1597
RDX: 000000000000000e RSI: ffffffff886b2520 RDI: 0000000000000070
RBP: ffffc90004347578 R08: 0000000000000005 R09: 000000000000ffff
R10: 000000000000ffff R11: 0000000000000002 R12: ffff888063202ac0
R13: 0000000000010000 R14: 000000000000ffff R15: 0000000000000046
FS: 0000555556e7e380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020010000 CR3: 0000000027ee2000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Fixes:
|
||
John Fastabend
|
8d6650646c |
bpf: syzkaller found null ptr deref in unix_bpf proto add
I added logic to track the sock pair for stream_unix sockets so that we
ensure lifetime of the sock matches the time a sockmap could reference
the sock (see fixes tag). I forgot though that we allow af_unix unconnected
sockets into a sock{map|hash} map.
This is problematic because previous fixed expected sk_pair() to exist
and did not NULL check it. Because unconnected sockets have a NULL
sk_pair this resulted in the NULL ptr dereference found by syzkaller.
BUG: KASAN: null-ptr-deref in unix_stream_bpf_update_proto+0x72/0x430 net/unix/unix_bpf.c:171
Write of size 4 at addr 0000000000000080 by task syz-executor360/5073
Call Trace:
<TASK>
...
sock_hold include/net/sock.h:777 [inline]
unix_stream_bpf_update_proto+0x72/0x430 net/unix/unix_bpf.c:171
sock_map_init_proto net/core/sock_map.c:190 [inline]
sock_map_link+0xb87/0x1100 net/core/sock_map.c:294
sock_map_update_common+0xf6/0x870 net/core/sock_map.c:483
sock_map_update_elem_sys+0x5b6/0x640 net/core/sock_map.c:577
bpf_map_update_value+0x3af/0x820 kernel/bpf/syscall.c:167
We considered just checking for the null ptr and skipping taking a ref
on the NULL peer sock. But, if the socket is then connected() after
being added to the sockmap we can cause the original issue again. So
instead this patch blocks adding af_unix sockets that are not in the
ESTABLISHED state.
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+e8030702aefd3444fb9e@syzkaller.appspotmail.com
Fixes:
|