1265548 Commits

Author SHA1 Message Date
Jakub Kicinski
cf1ca1f66d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/ipv4/ip_gre.c
  17af420545a7 ("erspan: make sure erspan_base_hdr is present in skb->head")
  5832c4a77d69 ("ip_tunnel: convert __be16 tunnel flags to bitmaps")
https://lore.kernel.org/all/20240402103253.3b54a1cf@canb.auug.org.au/

Adjacent changes:

net/ipv6/ip6_fib.c
  d21d40605bca ("ipv6: Fix infinite recursion in fib6_dump_done().")
  5fc68320c1fb ("ipv6: remove RTNL protection from inet6_dump_fib()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 18:01:07 -07:00
Linus Torvalds
c88b9b4cde Including fixes from netfilter, bluetooth and bpf.
Fairly usual collection of driver and core fixes. The large selftest
 accompanying one of the fixes is also becoming a common occurrence.
 
 Current release - regressions:
 
  - ipv6: fix infinite recursion in fib6_dump_done()
 
  - net/rds: fix possible null-deref in newly added error path
 
 Current release - new code bugs:
 
  - net: do not consume a full cacheline for system_page_pool
 
  - bpf: fix bpf_arena-related file descriptor leaks in the verifier
 
  - drv: ice: fix freeing uninitialized pointers, fixing misuse of
    the newfangled __free() auto-cleanup
 
 Previous releases - regressions:
 
  - x86/bpf: fixes the BPF JIT with retbleed=stuff
 
  - xen-netfront: add missing skb_mark_for_recycle, fix page pool
    accounting leaks, revealed by recently added explicit warning
 
  - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
    non-wildcard addresses
 
  - Bluetooth:
    - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT"
      with better workarounds to un-break some buggy Qualcomm devices
    - set conn encrypted before conn establishes, fix re-connecting
      to some headsets which use slightly unusual sequence of msgs
 
  - mptcp:
    - prevent BPF accessing lowat from a subflow socket
    - don't account accept() of non-MPC client as fallback to TCP
 
  - drv: mana: fix Rx DMA datasize and skb_over_panic
 
  - drv: i40e: fix VF MAC filter removal
 
 Previous releases - always broken:
 
  - gro: various fixes related to UDP tunnels - netns crossing problems,
    incorrect checksum conversions, and incorrect packet transformations
    which may lead to panics
 
  - bpf: support deferring bpf_link dealloc to after RCU grace period
 
  - nf_tables:
    - release batch on table validation from abort path
    - release mutex after nft_gc_seq_end from abort path
    - flush pending destroy work before exit_net release
 
  - drv: r8169: skip DASH fw status checks when DASH is disabled
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmYO91wACgkQMUZtbf5S
 IrvHBQ/+PH/hobI+o3aLqwtdVlyxhmA31bVQ0I3aTIZV7c3ideMBcfgYa8TiZM2g
 pLiBiWoJXCN0h33wgUmlUee+sBvpoPCdPjGD/g99OJyKWjVt2D7ObnSwxMfjHUoq
 dtcN2JupqHP0SHz6wPPCmnWtTLxSGUsDdKjmkHQcCRhQIGTYFkYyHcOmPgNbBjaB
 6jvmH1kE9WQTFD8QcOMaZmXQ5omoafpxxQLsgundtOWxPWHL7XNvk0B5k/ESDRG1
 ujbxwtNnOESzpxZMQ6OyZlsnN/1tWfnEvLJFYVwf9BMrOlahJT/f5b/EJ9/Xy4dC
 zkAp7Tul3uAvNRKhBNhVBTWQbnIykmiNMp1VBFmiScQAy8hcnX+6d4LKTIHxbXZK
 V3AqcUS6YU2nyMdLRkhvq9f3uxD6hcY19gQdyqgCUPOtyUAs/JPv7lXQjCuuEqkq
 urEZkigUApnEqPIrIqANJ7nXUy3U0K8qU6evOZoGZ5OdiKeNKC3+tIr+g2f1ZUZq
 a7Dkat7JH9WQ7IG8Geody6Z30K9EpSqYMTKzB5wTfmuqw6cV8bl9OAW9UOSRK0GL
 pyG8GwpkpFPkNiZdu9Zt44Pno5xdLIa1+C3QZR0r5CJWYAzCbI80MppP5veF9Mw+
 v+2v8iBWuh9iv0AUj9KJOwG5QQ+EXLUuSlhtx/DFnmn2CJ9plXI=
 =6bQI
 -----END PGP SIGNATURE-----

Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, bluetooth and bpf.

  Fairly usual collection of driver and core fixes. The large selftest
  accompanying one of the fixes is also becoming a common occurrence.

  Current release - regressions:

   - ipv6: fix infinite recursion in fib6_dump_done()

   - net/rds: fix possible null-deref in newly added error path

  Current release - new code bugs:

   - net: do not consume a full cacheline for system_page_pool

   - bpf: fix bpf_arena-related file descriptor leaks in the verifier

   - drv: ice: fix freeing uninitialized pointers, fixing misuse of the
     newfangled __free() auto-cleanup

  Previous releases - regressions:

   - x86/bpf: fixes the BPF JIT with retbleed=stuff

   - xen-netfront: add missing skb_mark_for_recycle, fix page pool
     accounting leaks, revealed by recently added explicit warning

   - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
     non-wildcard addresses

   - Bluetooth:
      - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with
        better workarounds to un-break some buggy Qualcomm devices
      - set conn encrypted before conn establishes, fix re-connecting to
        some headsets which use slightly unusual sequence of msgs

   - mptcp:
      - prevent BPF accessing lowat from a subflow socket
      - don't account accept() of non-MPC client as fallback to TCP

   - drv: mana: fix Rx DMA datasize and skb_over_panic

   - drv: i40e: fix VF MAC filter removal

  Previous releases - always broken:

   - gro: various fixes related to UDP tunnels - netns crossing
     problems, incorrect checksum conversions, and incorrect packet
     transformations which may lead to panics

   - bpf: support deferring bpf_link dealloc to after RCU grace period

   - nf_tables:
      - release batch on table validation from abort path
      - release mutex after nft_gc_seq_end from abort path
      - flush pending destroy work before exit_net release

   - drv: r8169: skip DASH fw status checks when DASH is disabled"

* tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  netfilter: validate user input for expected length
  net/sched: act_skbmod: prevent kernel-infoleak
  net: usb: ax88179_178a: avoid the interface always configured as random address
  net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
  net: ravb: Always update error counters
  net: ravb: Always process TX descriptor ring
  netfilter: nf_tables: discard table flag update with pending basechain deletion
  netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
  netfilter: nf_tables: reject new basechain after table flag update
  netfilter: nf_tables: flush pending destroy work before exit_net release
  netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
  netfilter: nf_tables: release batch on table validation from abort path
  Revert "tg3: Remove residual error handling in tg3_suspend"
  tg3: Remove residual error handling in tg3_suspend
  net: mana: Fix Rx DMA datasize and skb_over_panic
  net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()
  net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping
  net: stmmac: fix rx queue priority assignment
  net: txgbe: fix i2c dev name cannot match clkdev
  net: fec: Set mac_managed_pm during probe
  ...
2024-04-04 14:49:10 -07:00
Linus Torvalds
ec25bd8d98 bcachefs repair code for 6.9-rc3
A couple more small fixes, and new repair code.
 
 We can now automatically recover from arbitrary corrupted interior btree
 nodes by scanning, and we can reconstruct metadata as needed to bring a
 filesystem back into a working, consistent, read-write state and
 preserve access to whatevver wasn't corrupted.
 
 Meaning - you can blow away all metadata except for extents and dirents
 leaf nodes, and repair will reconstruct everything else and give you
 your data, and under the correct paths. If inodes are missing i_size
 will be slightly off and permissions/ownership/timestamps will be gone,
 and we do still need the snapshots btree if snapshots were in use - in
 the future we'll be able to guess the snapshot tree structure in some
 situations.
 
 IOW - aside from shaking out remaining bugs (fuzz testing is still
 coming), repair code should be complete and if repair ever doesn't work
 that's the highest priority bug that I want to know about immediately.
 
 This patchset was kindly tested by a user from India who accidentally
 wiped one drive out of a three drive filesystem with no replication on
 the family computer - it took a couple weeks but we got everything
 important back.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmYNq9IACgkQE6szbY3K
 bnaG9w/+Od0iq4Nqx62Mf8+O5DLnZZNu3c9aUOEiuzdXlNrpUr4S9j4WwDxTb/EN
 2a3ldXY5AhauZqEW7Qv+WBZvVVbm3GYH+oOYQo8V+yf1oGNB3+AGxBCCmruHJGLk
 5nmwsRyVm1ihAKxn1oxwrDDPtOlxbGOlc4peR+nCY/b5QnlXegGkGfRAHO/z9bul
 4JdBYBqR4KBGdevIV8EG2WVa6ASA6mF1QOboeB6INekD4klDpm41gK/0S9Uf2oXm
 q1PiN655YHquXbJTT9k/HtVX4WhlcaHv+R4KeZ5TEReJjB57ot/M8Rx57lgsYHP6
 TeyR4Y5VYGLYqlwMK5RiKyGLB92qNFcSlg5inASyTCUNi1KKu12SpqS3+Nel6+tF
 gu4F4ElSvAcsmJ6LrfsfP9B8u0ULDkIyq9xBFFbLTIpLuDOqz8FcgFpZrpiO445w
 F6FcYXqt2/fP7gxA3GzdFjeUojIjWNMJapgpsePg/HGNArBsoAsBL8rAhAyetG3Z
 EOJlrJ8m59/QoPgXBpScfbS7cxk3JgrUzfSI/oKaEr2lS0YNlYjQANYHoEHTFaxA
 bMWKXwMkvqz49MMm5WLaMIOYDJRDtrt0qpnW7x+qU7ik/VkHeUTJr07bSRIKT0z1
 yNCynYtdbeQVfekZQS6JwsyTs/ehbI1OVN8MGwVRCrQTonYz+BA=
 =7/rR
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-04-03' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs repair code from Kent Overstreet:
 "A couple more small fixes, and new repair code.

  We can now automatically recover from arbitrary corrupted interior
  btree nodes by scanning, and we can reconstruct metadata as needed to
  bring a filesystem back into a working, consistent, read-write state
  and preserve access to whatevver wasn't corrupted.

  Meaning - you can blow away all metadata except for extents and
  dirents leaf nodes, and repair will reconstruct everything else and
  give you your data, and under the correct paths. If inodes are missing
  i_size will be slightly off and permissions/ownership/timestamps will
  be gone, and we do still need the snapshots btree if snapshots were in
  use - in the future we'll be able to guess the snapshot tree structure
  in some situations.

  IOW - aside from shaking out remaining bugs (fuzz testing is still
  coming), repair code should be complete and if repair ever doesn't
  work that's the highest priority bug that I want to know about
  immediately.

  This patchset was kindly tested by a user from India who accidentally
  wiped one drive out of a three drive filesystem with no replication on
  the family computer - it took a couple weeks but we got everything
  important back"

* tag 'bcachefs-2024-04-03' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: reconstruct_inode()
  bcachefs: Subvolume reconstruction
  bcachefs: Check for extents that point to same space
  bcachefs: Reconstruct missing snapshot nodes
  bcachefs: Flag btrees with missing data
  bcachefs: Topology repair now uses nodes found by scanning to fill holes
  bcachefs: Repair pass for scanning for btree nodes
  bcachefs: Don't skip fake btree roots in fsck
  bcachefs: bch2_btree_root_alloc() -> bch2_btree_root_alloc_fake()
  bcachefs: Etyzinger cleanups
  bcachefs: bch2_shoot_down_journal_keys()
  bcachefs: Clear recovery_passes_required as they complete without errors
  bcachefs: ratelimit informational fsck errors
  bcachefs: Check for bad needs_discard before doing discard
  bcachefs: Improve bch2_btree_update_to_text()
  mean_and_variance: Drop always failing tests
  bcachefs: fix nocow lock deadlock
  bcachefs: BCH_WATERMARK_interior_updates
  bcachefs: Fix btree node reserve
2024-04-04 14:36:32 -07:00
Jakub Kicinski
1cfa2f10f4 bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZg7vEAAKCRDbK58LschI
 gxAeAQD18uHqGbcCCnOVnaETdi3bQcSBpobuclThpN1WdMILcwEA+5Vz9Lxmt6BH
 qF/vbVHAOeCZt3LTOJ/jx1GRVstVWQk=
 =+QKv
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-04-04

We've added 7 non-merge commits during the last 5 day(s) which contain
a total of 9 files changed, 75 insertions(+), 24 deletions(-).

The main changes are:

1) Fix x86 BPF JIT under retbleed=stuff which causes kernel panics due to
   incorrect destination IP calculation and incorrect IP for relocations,
   from Uros Bizjak and Joan Bruguera Micó.

2) Fix BPF arena file descriptor leaks in the verifier,
   from Anton Protopopov.

3) Defer bpf_link deallocation to after RCU grace period as currently
   running multi-{kprobes,uprobes} programs might still access cookie
   information from the link, from Andrii Nakryiko.

4) Fix a BPF sockmap lock inversion deadlock in map_delete_elem reported
   by syzkaller, from Jakub Sitnicki.

5) Fix resolve_btfids build with musl libc due to missing linux/types.h
   include, from Natanael Copa.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, sockmap: Prevent lock inversion deadlock in map delete elem
  x86/bpf: Fix IP for relocating call depth accounting
  x86/bpf: Fix IP after emitting call depth accounting
  bpf: fix possible file descriptor leaks in verifier
  tools/resolve_btfids: fix build with musl libc
  bpf: support deferring bpf_link dealloc to after RCU grace period
  bpf: put uprobe link's path and task in release callback
====================

Link: https://lore.kernel.org/r/20240404183258.4401-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 11:37:39 -07:00
Jakub Kicinski
1148c4098e Merge branch 'selftests-net-groundwork-for-ynl-based-tests'
Jakub Kicinski says:

====================
selftests: net: groundwork for YNL-based tests (YNL prep)

v1: https://lore.kernel.org/all/20240402010520.1209517-1-kuba@kernel.org/
====================

Merge the non-controversial YNL adjustment and spec additions.

Link: https://lore.kernel.org/r/20240403023426.1762996-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 10:33:57 -07:00
Jakub Kicinski
b269d2b4a5 tools: ynl: copy netlink error to NlError
Typing e.nl_msg.error when processing exception is a bit tedious
and counter-intuitive. Set a local .error member to the positive
value of the netlink level error.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/20240403023426.1762996-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 10:33:52 -07:00
Jakub Kicinski
1d056bf9a4 netlink: specs: define ethtool header flags
When interfacing with the ethtool commands it's handy to
be able to use the names of the flags. Example:

    ethnl.pause_get({"header": {"dev-index": cfg.ifindex,
                                "flags": {'stats'}}})

Note that not all commands accept all the flags,
but the meaning of the bits does not change command
to command.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/20240403023426.1762996-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 10:33:52 -07:00
Eric Dumazet
0c83842df4 netfilter: validate user input for expected length
I got multiple syzbot reports showing old bugs exposed
by BPF after commit 20f2505fb436 ("bpf: Try to avoid kzalloc
in cgroup/{s,g}etsockopt")

setsockopt() @optlen argument should be taken into account
before copying data.

 BUG: KASAN: slab-out-of-bounds in copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
 BUG: KASAN: slab-out-of-bounds in copy_from_sockptr include/linux/sockptr.h:55 [inline]
 BUG: KASAN: slab-out-of-bounds in do_replace net/ipv4/netfilter/ip_tables.c:1111 [inline]
 BUG: KASAN: slab-out-of-bounds in do_ipt_set_ctl+0x902/0x3dd0 net/ipv4/netfilter/ip_tables.c:1627
Read of size 96 at addr ffff88802cd73da0 by task syz-executor.4/7238

CPU: 1 PID: 7238 Comm: syz-executor.4 Not tainted 6.9.0-rc2-next-20240403-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Call Trace:
 <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
  print_address_description mm/kasan/report.c:377 [inline]
  print_report+0x169/0x550 mm/kasan/report.c:488
  kasan_report+0x143/0x180 mm/kasan/report.c:601
  kasan_check_range+0x282/0x290 mm/kasan/generic.c:189
  __asan_memcpy+0x29/0x70 mm/kasan/shadow.c:105
  copy_from_sockptr_offset include/linux/sockptr.h:49 [inline]
  copy_from_sockptr include/linux/sockptr.h:55 [inline]
  do_replace net/ipv4/netfilter/ip_tables.c:1111 [inline]
  do_ipt_set_ctl+0x902/0x3dd0 net/ipv4/netfilter/ip_tables.c:1627
  nf_setsockopt+0x295/0x2c0 net/netfilter/nf_sockopt.c:101
  do_sock_setsockopt+0x3af/0x720 net/socket.c:2311
  __sys_setsockopt+0x1ae/0x250 net/socket.c:2334
  __do_sys_setsockopt net/socket.c:2343 [inline]
  __se_sys_setsockopt net/socket.c:2340 [inline]
  __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
 do_syscall_64+0xfb/0x240
 entry_SYSCALL_64_after_hwframe+0x72/0x7a
RIP: 0033:0x7fd22067dde9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fd21f9ff0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00007fd2207abf80 RCX: 00007fd22067dde9
RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007fd2206ca47a R08: 0000000000000001 R09: 0000000000000000
R10: 0000000020000880 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007fd2207abf80 R15: 00007ffd2d0170d8
 </TASK>

Allocated by task 7238:
  kasan_save_stack mm/kasan/common.c:47 [inline]
  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
  poison_kmalloc_redzone mm/kasan/common.c:370 [inline]
  __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387
  kasan_kmalloc include/linux/kasan.h:211 [inline]
  __do_kmalloc_node mm/slub.c:4069 [inline]
  __kmalloc_noprof+0x200/0x410 mm/slub.c:4082
  kmalloc_noprof include/linux/slab.h:664 [inline]
  __cgroup_bpf_run_filter_setsockopt+0xd47/0x1050 kernel/bpf/cgroup.c:1869
  do_sock_setsockopt+0x6b4/0x720 net/socket.c:2293
  __sys_setsockopt+0x1ae/0x250 net/socket.c:2334
  __do_sys_setsockopt net/socket.c:2343 [inline]
  __se_sys_setsockopt net/socket.c:2340 [inline]
  __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
 do_syscall_64+0xfb/0x240
 entry_SYSCALL_64_after_hwframe+0x72/0x7a

The buggy address belongs to the object at ffff88802cd73da0
 which belongs to the cache kmalloc-8 of size 8
The buggy address is located 0 bytes inside of
 allocated 1-byte region [ffff88802cd73da0, ffff88802cd73da1)

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88802cd73020 pfn:0x2cd73
flags: 0xfff80000000000(node=0|zone=1|lastcpupid=0xfff)
page_type: 0xffffefff(slab)
raw: 00fff80000000000 ffff888015041280 dead000000000100 dead000000000122
raw: ffff88802cd73020 000000008080007f 00000001ffffefff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 5103, tgid 2119833701 (syz-executor.4), ts 5103, free_ts 70804600828
  set_page_owner include/linux/page_owner.h:32 [inline]
  post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1490
  prep_new_page mm/page_alloc.c:1498 [inline]
  get_page_from_freelist+0x2e7e/0x2f40 mm/page_alloc.c:3454
  __alloc_pages_noprof+0x256/0x6c0 mm/page_alloc.c:4712
  __alloc_pages_node_noprof include/linux/gfp.h:244 [inline]
  alloc_pages_node_noprof include/linux/gfp.h:271 [inline]
  alloc_slab_page+0x5f/0x120 mm/slub.c:2249
  allocate_slab+0x5a/0x2e0 mm/slub.c:2412
  new_slab mm/slub.c:2465 [inline]
  ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3615
  __slab_alloc+0x58/0xa0 mm/slub.c:3705
  __slab_alloc_node mm/slub.c:3758 [inline]
  slab_alloc_node mm/slub.c:3936 [inline]
  __do_kmalloc_node mm/slub.c:4068 [inline]
  kmalloc_node_track_caller_noprof+0x286/0x450 mm/slub.c:4089
  kstrdup+0x3a/0x80 mm/util.c:62
  device_rename+0xb5/0x1b0 drivers/base/core.c:4558
  dev_change_name+0x275/0x860 net/core/dev.c:1232
  do_setlink+0xa4b/0x41f0 net/core/rtnetlink.c:2864
  __rtnl_newlink net/core/rtnetlink.c:3680 [inline]
  rtnl_newlink+0x180b/0x20a0 net/core/rtnetlink.c:3727
  rtnetlink_rcv_msg+0x89b/0x10d0 net/core/rtnetlink.c:6594
  netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2559
  netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
  netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
page last free pid 5146 tgid 5146 stack trace:
  reset_page_owner include/linux/page_owner.h:25 [inline]
  free_pages_prepare mm/page_alloc.c:1110 [inline]
  free_unref_page+0xd3c/0xec0 mm/page_alloc.c:2617
  discard_slab mm/slub.c:2511 [inline]
  __put_partials+0xeb/0x130 mm/slub.c:2980
  put_cpu_partial+0x17c/0x250 mm/slub.c:3055
  __slab_free+0x2ea/0x3d0 mm/slub.c:4254
  qlink_free mm/kasan/quarantine.c:163 [inline]
  qlist_free_all+0x9e/0x140 mm/kasan/quarantine.c:179
  kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286
  __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:322
  kasan_slab_alloc include/linux/kasan.h:201 [inline]
  slab_post_alloc_hook mm/slub.c:3888 [inline]
  slab_alloc_node mm/slub.c:3948 [inline]
  __do_kmalloc_node mm/slub.c:4068 [inline]
  __kmalloc_node_noprof+0x1d7/0x450 mm/slub.c:4076
  kmalloc_node_noprof include/linux/slab.h:681 [inline]
  kvmalloc_node_noprof+0x72/0x190 mm/util.c:634
  bucket_table_alloc lib/rhashtable.c:186 [inline]
  rhashtable_rehash_alloc+0x9e/0x290 lib/rhashtable.c:367
  rht_deferred_worker+0x4e1/0x2440 lib/rhashtable.c:427
  process_one_work kernel/workqueue.c:3218 [inline]
  process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3299
  worker_thread+0x86d/0xd70 kernel/workqueue.c:3380
  kthread+0x2f0/0x390 kernel/kthread.c:388
  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243

Memory state around the buggy address:
 ffff88802cd73c80: 07 fc fc fc 05 fc fc fc 05 fc fc fc fa fc fc fc
 ffff88802cd73d00: fa fc fc fc fa fc fc fc fa fc fc fc fa fc fc fc
>ffff88802cd73d80: fa fc fc fc 01 fc fc fc fa fc fc fc fa fc fc fc
                               ^
 ffff88802cd73e00: fa fc fc fc fa fc fc fc 05 fc fc fc 07 fc fc fc
 ffff88802cd73e80: 07 fc fc fc 07 fc fc fc 07 fc fc fc 07 fc fc fc

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20240404122051.2303764-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:39:52 -07:00
Jakub Kicinski
d432f7bdc1 netfilter pull request 24-04-04
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmYOgt8ACgkQ1V2XiooU
 IOTX7RAAjhvbva/ussuSDVjDNlYkeZ96uwJNeJz/sHKm7tV043yWwQrYpOy74pBe
 oxgya6IbI9yR5JVPlfrsfpgjmCbiWuiZmrMBUr0v1EQ4Zb1WD1KWnoEFKRS80hMq
 qZfqCP6gWH6e7nxkD9URamBUOBemN6rg6gqDPIs9khtm9dKrUCKms8kalTlN2gKl
 x1jnQwwKUZPtWXw5gN875BHrmX+kBm0YPGr11ys5mYvtUpmvSAwE1l33BAUKLzR+
 vkd54f5zF+ZuURE2yYAsa4ZQh/Muho9OVv69r3rJM4thsJmvZLgPYqLZ7iEgxsBQ
 RnG034d9liDeNYvsAhZSVTtzT8M8ctfSH/omyAF6QtPN6RdDd5A0XWp6/EhnuAeq
 G1qKlXULwfG82uwD6hui0OBRJUTB4muFF4T28iDlCvUbFaEqftU6C/eSjcBbzJNc
 nfD7AZsmXJFbEfqamHmezrKcCoCcdkXJlw0SoUp8YgO0mn9txJ3f2hTDsh/48ibA
 Y5v6YXwQxP/UYugW8pFTsircaf3uasMJpk49/OlTuGXGj+NtHS0yiaw1kgpVsvPC
 mzc/k9iHwan+lCAXQVHPZPwY0puIXIzgcBpJxfzcsJOQVi6SQIAjjRANpqFoGYcz
 P2VGeBj1/+Xvhi62OFmvNNDw6iG4LUae4LEVj6gJuzPVqsvZ4qY=
 =PKX1
 -----END PGP SIGNATURE-----

Merge tag 'nf-24-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

Patch #1 unlike early commit path stage which triggers a call to abort,
         an explicit release of the batch is required on abort, otherwise
         mutex is released and commit_list remains in place.

Patch #2 release mutex after nft_gc_seq_end() in commit path, otherwise
         async GC worker could collect expired objects.

Patch #3 flush pending destroy work in module removal path, otherwise UaF
         is possible.

Patch #4 and #6 restrict the table dormant flag with basechain updates
	 to fix state inconsistency in the hook registration.

Patch #5 adds missing RCU read side lock to flowtable type to avoid races
	 with module removal.

* tag 'nf-24-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: discard table flag update with pending basechain deletion
  netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
  netfilter: nf_tables: reject new basechain after table flag update
  netfilter: nf_tables: flush pending destroy work before exit_net release
  netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
  netfilter: nf_tables: release batch on table validation from abort path
====================

Link: https://lore.kernel.org/r/20240404104334.1627-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:38:52 -07:00
Jakub Kicinski
a66323e4fa Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-04-03 (ice, idpf)

This series contains updates to ice and idpf drivers.

Dan Carpenter initializes some pointer declarations to NULL as needed for
resource cleanup on ice driver.

Petr Oros corrects assignment of VLAN operators to fix Rx VLAN filtering
in legacy mode for ice.

Joshua calls eth_type_trans() on unknown packets to prevent possible
kernel panic on idpf.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  idpf: fix kernel panic on unknown packet types
  ice: fix enabling RX VLAN filtering
  ice: Fix freeing uninitialized pointers
====================

Link: https://lore.kernel.org/r/20240403201929.1945116-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:34:35 -07:00
Eric Dumazet
d313eb8b77 net/sched: act_skbmod: prevent kernel-infoleak
syzbot found that tcf_skbmod_dump() was copying four bytes
from kernel stack to user space [1].

The issue here is that 'struct tc_skbmod' has a four bytes hole.

We need to clear the structure before filling fields.

[1]
BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline]
 BUG: KMSAN: kernel-infoleak in copy_to_user_iter lib/iov_iter.c:24 [inline]
 BUG: KMSAN: kernel-infoleak in iterate_ubuf include/linux/iov_iter.h:29 [inline]
 BUG: KMSAN: kernel-infoleak in iterate_and_advance2 include/linux/iov_iter.h:245 [inline]
 BUG: KMSAN: kernel-infoleak in iterate_and_advance include/linux/iov_iter.h:271 [inline]
 BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x366/0x2520 lib/iov_iter.c:185
  instrument_copy_to_user include/linux/instrumented.h:114 [inline]
  copy_to_user_iter lib/iov_iter.c:24 [inline]
  iterate_ubuf include/linux/iov_iter.h:29 [inline]
  iterate_and_advance2 include/linux/iov_iter.h:245 [inline]
  iterate_and_advance include/linux/iov_iter.h:271 [inline]
  _copy_to_iter+0x366/0x2520 lib/iov_iter.c:185
  copy_to_iter include/linux/uio.h:196 [inline]
  simple_copy_to_iter net/core/datagram.c:532 [inline]
  __skb_datagram_iter+0x185/0x1000 net/core/datagram.c:420
  skb_copy_datagram_iter+0x5c/0x200 net/core/datagram.c:546
  skb_copy_datagram_msg include/linux/skbuff.h:4050 [inline]
  netlink_recvmsg+0x432/0x1610 net/netlink/af_netlink.c:1962
  sock_recvmsg_nosec net/socket.c:1046 [inline]
  sock_recvmsg+0x2c4/0x340 net/socket.c:1068
  __sys_recvfrom+0x35a/0x5f0 net/socket.c:2242
  __do_sys_recvfrom net/socket.c:2260 [inline]
  __se_sys_recvfrom net/socket.c:2256 [inline]
  __x64_sys_recvfrom+0x126/0x1d0 net/socket.c:2256
 do_syscall_64+0xd5/0x1f0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

Uninit was stored to memory at:
  pskb_expand_head+0x30f/0x19d0 net/core/skbuff.c:2253
  netlink_trim+0x2c2/0x330 net/netlink/af_netlink.c:1317
  netlink_unicast+0x9f/0x1260 net/netlink/af_netlink.c:1351
  nlmsg_unicast include/net/netlink.h:1144 [inline]
  nlmsg_notify+0x21d/0x2f0 net/netlink/af_netlink.c:2610
  rtnetlink_send+0x73/0x90 net/core/rtnetlink.c:741
  rtnetlink_maybe_send include/linux/rtnetlink.h:17 [inline]
  tcf_add_notify net/sched/act_api.c:2048 [inline]
  tcf_action_add net/sched/act_api.c:2071 [inline]
  tc_ctl_action+0x146e/0x19d0 net/sched/act_api.c:2119
  rtnetlink_rcv_msg+0x1737/0x1900 net/core/rtnetlink.c:6595
  netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2559
  rtnetlink_rcv+0x34/0x40 net/core/rtnetlink.c:6613
  netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
  netlink_unicast+0xf4c/0x1260 net/netlink/af_netlink.c:1361
  netlink_sendmsg+0x10df/0x11f0 net/netlink/af_netlink.c:1905
  sock_sendmsg_nosec net/socket.c:730 [inline]
  __sock_sendmsg+0x30f/0x380 net/socket.c:745
  ____sys_sendmsg+0x877/0xb60 net/socket.c:2584
  ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2638
  __sys_sendmsg net/socket.c:2667 [inline]
  __do_sys_sendmsg net/socket.c:2676 [inline]
  __se_sys_sendmsg net/socket.c:2674 [inline]
  __x64_sys_sendmsg+0x307/0x4a0 net/socket.c:2674
 do_syscall_64+0xd5/0x1f0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

Uninit was stored to memory at:
  __nla_put lib/nlattr.c:1041 [inline]
  nla_put+0x1c6/0x230 lib/nlattr.c:1099
  tcf_skbmod_dump+0x23f/0xc20 net/sched/act_skbmod.c:256
  tcf_action_dump_old net/sched/act_api.c:1191 [inline]
  tcf_action_dump_1+0x85e/0x970 net/sched/act_api.c:1227
  tcf_action_dump+0x1fd/0x460 net/sched/act_api.c:1251
  tca_get_fill+0x519/0x7a0 net/sched/act_api.c:1628
  tcf_add_notify_msg net/sched/act_api.c:2023 [inline]
  tcf_add_notify net/sched/act_api.c:2042 [inline]
  tcf_action_add net/sched/act_api.c:2071 [inline]
  tc_ctl_action+0x1365/0x19d0 net/sched/act_api.c:2119
  rtnetlink_rcv_msg+0x1737/0x1900 net/core/rtnetlink.c:6595
  netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2559
  rtnetlink_rcv+0x34/0x40 net/core/rtnetlink.c:6613
  netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
  netlink_unicast+0xf4c/0x1260 net/netlink/af_netlink.c:1361
  netlink_sendmsg+0x10df/0x11f0 net/netlink/af_netlink.c:1905
  sock_sendmsg_nosec net/socket.c:730 [inline]
  __sock_sendmsg+0x30f/0x380 net/socket.c:745
  ____sys_sendmsg+0x877/0xb60 net/socket.c:2584
  ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2638
  __sys_sendmsg net/socket.c:2667 [inline]
  __do_sys_sendmsg net/socket.c:2676 [inline]
  __se_sys_sendmsg net/socket.c:2674 [inline]
  __x64_sys_sendmsg+0x307/0x4a0 net/socket.c:2674
 do_syscall_64+0xd5/0x1f0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

Local variable opt created at:
  tcf_skbmod_dump+0x9d/0xc20 net/sched/act_skbmod.c:244
  tcf_action_dump_old net/sched/act_api.c:1191 [inline]
  tcf_action_dump_1+0x85e/0x970 net/sched/act_api.c:1227

Bytes 188-191 of 248 are uninitialized
Memory access of size 248 starts at ffff888117697680
Data copied to user address 00007ffe56d855f0

Fixes: 86da71b57383 ("net_sched: Introduce skbmod action")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20240403130908.93421-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:32:29 -07:00
Jose Ignacio Tornos Martinez
2e91bb99b9 net: usb: ax88179_178a: avoid the interface always configured as random address
After the commit d2689b6a86b9 ("net: usb: ax88179_178a: avoid two
consecutive device resets"), reset is not executed from bind operation and
mac address is not read from the device registers or the devicetree at that
moment. Since the check to configure if the assigned mac address is random
or not for the interface, happens after the bind operation from
usbnet_probe, the interface keeps configured as random address, although the
address is correctly read and set during open operation (the only reset
now).

In order to keep only one reset for the device and to avoid the interface
always configured as random address, after reset, configure correctly the
suitable field from the driver, if the mac address is read successfully from
the device registers or the devicetree. Take into account if a locally
administered address (random) was previously stored.

cc: stable@vger.kernel.org # 6.6+
Fixes: d2689b6a86b9 ("net: usb: ax88179_178a: avoid two consecutive device resets")
Reported-by: Dave Stevenson  <dave.stevenson@raspberrypi.com>
Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240403132158.344838-1-jtornosm@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:30:18 -07:00
Jakub Kicinski
1dee310c26 Merge branch 'bnxt_en-update-for-net-next'
Pavan Chebbi says:

====================
bnxt_en: Update for net-next

This patchset contains the following updates to bnxt:

- Patch 1 supports handling Downstream Port Containment (DPC) AER
on older chipsets

- Patch 2 enables XPS by default on driver load

- Patch 3 optimizes page pool allocation for numa nodes

- Patch 4 & 5 add support for XDP metadata

- Patch 6 updates firmware interface

- Patch 7 adds a warning about limitations on certain transceivers
====================

Link: https://lore.kernel.org/r/20240402093753.331120-1-pavan.chebbi@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:13:24 -07:00
Sreekanth Reddy
e193f53aed bnxt_en: Add warning message about disallowed speed change
Some chips may not allow changing default speed when dual rate
transceivers modules are used. Firmware on those chips will
indicate the same to the driver.

Add a warning message when speed change is not supported
because a dual rate transceiver is detected by the NIC.

Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240402093753.331120-8-pavan.chebbi@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:13:20 -07:00
Pavan Chebbi
4e474addc0 bnxt_en: Update firmware interface to 1.10.3.39
This updated interface supports backing store APIs to
configure host FW trace buffers, updates transceivers ID
types, updates to TrueFlow features and other changes
for the basic L2 features.

Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240402093753.331120-7-pavan.chebbi@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:13:19 -07:00
Somnath Kotur
0ae1fafc8b bnxt_en: Add XDP Metadata support
- Change the last arg to xdp_prepare_buff to true from false.
- Ensure that when XDP_PASS is returned the xdp->data_meta area
is copied to the skb->data area. Account for the meta data
size on skb allocation and do a pull after to move it to the
"reserved" zone.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240402093753.331120-6-pavan.chebbi@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:13:19 -07:00
Somnath Kotur
1614f06e09 bnxt_en: Change bnxt_rx_xdp function prototype
Change bnxt_rx_xdp() to take a pointer to xdp instead of stack
variable.
This is in prepartion for the XDP metadata patch change where
the BPF program can change the value of the xdp.meta_data.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240402093753.331120-5-pavan.chebbi@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:13:19 -07:00
Somnath Kotur
fba2e4e5db bnxt_en: Allocate page pool per numa node
Driver's Page Pool allocation code looks at the node local
to the PCIe device to determine where to allocate memory.
In scenarios where the core count per NUMA node is low (< default rings)
it makes sense to exhaust page pool allocations on
Node 0 first and then moving on to allocating page pools
for the remaining rings from Node 1.

With this patch, and the following configuration on the NIC
$ ethtool -L ens1f0np0 combined 16
(core count/node = 12, first 12 rings on node#0, last 4 rings node#1)
and traffic redirected to a ring on node#1 , we see a performance
improvement of ~20%

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240402093753.331120-4-pavan.chebbi@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:13:19 -07:00
Somnath Kotur
8635ae8e99 bnxt_en: Enable XPS by default on driver load
Enable XPS on default during NIC open. The choice of
Tx queue is based on the CPU executing the thread that
submits the Tx request. The pool of Tx queues will be
spread evenly across both device-attached NUMA nodes(local)
and remote NUMA nodes.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240402093753.331120-3-pavan.chebbi@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:13:19 -07:00
Vikas Gupta
d5ab32e9b0 bnxt_en: Add delay to handle Downstream Port Containment (DPC) AER
In case of DPC, after issuing the hot reset, the
kernel waits for 100ms for the device to complete
the reset. However on some older chips, the firmware
may take up to 1 second to complete the reset, only
after which the driver can restart the card.

Introduce delay of 900ms to handle this scenario on
the older chipsets.

Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240402093753.331120-2-pavan.chebbi@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04 09:13:19 -07:00
linke li
04172043bd net: ethernet: mtk_eth_soc: Reuse value using READ_ONCE instead of re-rereading it
In mtk_flow_entry_update_l2, the hwe->ib1 is read using READ_ONCE at the
beginning of the function, checked, and then re-read from hwe->ib1,
may void all guarantees of the checks. Reuse the value that was read by
READ_ONCE to ensure the consistency of the ib1 throughout the function.

Signed-off-by: linke li <lilinke99@qq.com>
Link: https://lore.kernel.org/r/tencent_C699E9540505523424F11A9BD3D21B86840A@qq.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-04 15:46:52 +02:00
Christophe JAILLET
c120209bce net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
The definition and declaration of sja1110_pcs_mdio_write_c45() don't have
parameters in the same order.

Knowing that sja1110_pcs_mdio_write_c45() is used as a function pointer
in 'sja1105_info' structure with .pcs_mdio_write_c45, and that we have:

   int (*pcs_mdio_write_c45)(struct mii_bus *bus, int phy, int mmd,
				  int reg, u16 val);

it is likely that the definition is the one to change.

Found with cppcheck, funcArgOrderDifferent.

Fixes: ae271547bba6 ("net: dsa: sja1105: C45 only transactions for PCS")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Michael Walle <mwalle@kernel.org>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/ff2a5af67361988b3581831f7bd1eddebfb4c48f.1712082763.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-04 12:51:45 +02:00
Paul Barker
101b76418d net: ravb: Always update error counters
The error statistics should be updated each time the poll function is
called, even if the full RX work budget has been consumed. This prevents
the counts from becoming stuck when RX bandwidth usage is high.

This also ensures that error counters are not updated after we've
re-enabled interrupts as that could result in a race condition.

Also drop an unnecessary space.

Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/20240402145305.82148-2-paul.barker.ct@bp.renesas.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-04 12:46:13 +02:00
Paul Barker
596a425491 net: ravb: Always process TX descriptor ring
The TX queue should be serviced each time the poll function is called,
even if the full RX work budget has been consumed. This prevents
starvation of the TX queue when RX bandwidth usage is high.

Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper")
Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/20240402145305.82148-1-paul.barker.ct@bp.renesas.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-04 12:46:13 +02:00
Pablo Neira Ayuso
1bc83a019b netfilter: nf_tables: discard table flag update with pending basechain deletion
Hook unregistration is deferred to the commit phase, same occurs with
hook updates triggered by the table dormant flag. When both commands are
combined, this results in deleting a basechain while leaving its hook
still registered in the core.

Fixes: 179d9ba5559a ("netfilter: nf_tables: fix table flag updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-04 11:38:35 +02:00
Ziyang Xuan
24225011d8 netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
nft_unregister_flowtable_type() within nf_flow_inet_module_exit() can
concurrent with __nft_flowtable_type_get() within nf_tables_newflowtable().
And thhere is not any protection when iterate over nf_tables_flowtables
list in __nft_flowtable_type_get(). Therefore, there is pertential
data-race of nf_tables_flowtables list entry.

Use list_for_each_entry_rcu() to iterate over nf_tables_flowtables list
in __nft_flowtable_type_get(), and use rcu_read_lock() in the caller
nft_flowtable_type_get() to protect the entire type query process.

Fixes: 3b49e2e94e6e ("netfilter: nf_tables: add flow table netlink frontend")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-04 11:38:34 +02:00
Pablo Neira Ayuso
994209ddf4 netfilter: nf_tables: reject new basechain after table flag update
When dormant flag is toggled, hooks are disabled in the commit phase by
iterating over current chains in table (existing and new).

The following configuration allows for an inconsistent state:

  add table x
  add chain x y { type filter hook input priority 0; }
  add table x { flags dormant; }
  add chain x w { type filter hook input priority 1; }

which triggers the following warning when trying to unregister chain w
which is already unregistered.

[  127.322252] WARNING: CPU: 7 PID: 1211 at net/netfilter/core.c:50                                                                     1 __nf_unregister_net_hook+0x21a/0x260
[...]
[  127.322519] Call Trace:
[  127.322521]  <TASK>
[  127.322524]  ? __warn+0x9f/0x1a0
[  127.322531]  ? __nf_unregister_net_hook+0x21a/0x260
[  127.322537]  ? report_bug+0x1b1/0x1e0
[  127.322545]  ? handle_bug+0x3c/0x70
[  127.322552]  ? exc_invalid_op+0x17/0x40
[  127.322556]  ? asm_exc_invalid_op+0x1a/0x20
[  127.322563]  ? kasan_save_free_info+0x3b/0x60
[  127.322570]  ? __nf_unregister_net_hook+0x6a/0x260
[  127.322577]  ? __nf_unregister_net_hook+0x21a/0x260
[  127.322583]  ? __nf_unregister_net_hook+0x6a/0x260
[  127.322590]  ? __nf_tables_unregister_hook+0x8a/0xe0 [nf_tables]
[  127.322655]  nft_table_disable+0x75/0xf0 [nf_tables]
[  127.322717]  nf_tables_commit+0x2571/0x2620 [nf_tables]

Fixes: 179d9ba5559a ("netfilter: nf_tables: fix table flag updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-04 11:34:42 +02:00
Pablo Neira Ayuso
24cea96770 netfilter: nf_tables: flush pending destroy work before exit_net release
Similar to 2c9f0293280e ("netfilter: nf_tables: flush pending destroy
work before netlink notifier") to address a race between exit_net and
the destroy workqueue.

The trace below shows an element to be released via destroy workqueue
while exit_net path (triggered via module removal) has already released
the set that is used in such transaction.

[ 1360.547789] BUG: KASAN: slab-use-after-free in nf_tables_trans_destroy_work+0x3f5/0x590 [nf_tables]
[ 1360.547861] Read of size 8 at addr ffff888140500cc0 by task kworker/4:1/152465
[ 1360.547870] CPU: 4 PID: 152465 Comm: kworker/4:1 Not tainted 6.8.0+ #359
[ 1360.547882] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
[ 1360.547984] Call Trace:
[ 1360.547991]  <TASK>
[ 1360.547998]  dump_stack_lvl+0x53/0x70
[ 1360.548014]  print_report+0xc4/0x610
[ 1360.548026]  ? __virt_addr_valid+0xba/0x160
[ 1360.548040]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 1360.548054]  ? nf_tables_trans_destroy_work+0x3f5/0x590 [nf_tables]
[ 1360.548176]  kasan_report+0xae/0xe0
[ 1360.548189]  ? nf_tables_trans_destroy_work+0x3f5/0x590 [nf_tables]
[ 1360.548312]  nf_tables_trans_destroy_work+0x3f5/0x590 [nf_tables]
[ 1360.548447]  ? __pfx_nf_tables_trans_destroy_work+0x10/0x10 [nf_tables]
[ 1360.548577]  ? _raw_spin_unlock_irq+0x18/0x30
[ 1360.548591]  process_one_work+0x2f1/0x670
[ 1360.548610]  worker_thread+0x4d3/0x760
[ 1360.548627]  ? __pfx_worker_thread+0x10/0x10
[ 1360.548640]  kthread+0x16b/0x1b0
[ 1360.548653]  ? __pfx_kthread+0x10/0x10
[ 1360.548665]  ret_from_fork+0x2f/0x50
[ 1360.548679]  ? __pfx_kthread+0x10/0x10
[ 1360.548690]  ret_from_fork_asm+0x1a/0x30
[ 1360.548707]  </TASK>

[ 1360.548719] Allocated by task 192061:
[ 1360.548726]  kasan_save_stack+0x20/0x40
[ 1360.548739]  kasan_save_track+0x14/0x30
[ 1360.548750]  __kasan_kmalloc+0x8f/0xa0
[ 1360.548760]  __kmalloc_node+0x1f1/0x450
[ 1360.548771]  nf_tables_newset+0x10c7/0x1b50 [nf_tables]
[ 1360.548883]  nfnetlink_rcv_batch+0xbc4/0xdc0 [nfnetlink]
[ 1360.548909]  nfnetlink_rcv+0x1a8/0x1e0 [nfnetlink]
[ 1360.548927]  netlink_unicast+0x367/0x4f0
[ 1360.548935]  netlink_sendmsg+0x34b/0x610
[ 1360.548944]  ____sys_sendmsg+0x4d4/0x510
[ 1360.548953]  ___sys_sendmsg+0xc9/0x120
[ 1360.548961]  __sys_sendmsg+0xbe/0x140
[ 1360.548971]  do_syscall_64+0x55/0x120
[ 1360.548982]  entry_SYSCALL_64_after_hwframe+0x55/0x5d

[ 1360.548994] Freed by task 192222:
[ 1360.548999]  kasan_save_stack+0x20/0x40
[ 1360.549009]  kasan_save_track+0x14/0x30
[ 1360.549019]  kasan_save_free_info+0x3b/0x60
[ 1360.549028]  poison_slab_object+0x100/0x180
[ 1360.549036]  __kasan_slab_free+0x14/0x30
[ 1360.549042]  kfree+0xb6/0x260
[ 1360.549049]  __nft_release_table+0x473/0x6a0 [nf_tables]
[ 1360.549131]  nf_tables_exit_net+0x170/0x240 [nf_tables]
[ 1360.549221]  ops_exit_list+0x50/0xa0
[ 1360.549229]  free_exit_list+0x101/0x140
[ 1360.549236]  unregister_pernet_operations+0x107/0x160
[ 1360.549245]  unregister_pernet_subsys+0x1c/0x30
[ 1360.549254]  nf_tables_module_exit+0x43/0x80 [nf_tables]
[ 1360.549345]  __do_sys_delete_module+0x253/0x370
[ 1360.549352]  do_syscall_64+0x55/0x120
[ 1360.549360]  entry_SYSCALL_64_after_hwframe+0x55/0x5d

(gdb) list *__nft_release_table+0x473
0x1e033 is in __nft_release_table (net/netfilter/nf_tables_api.c:11354).
11349           list_for_each_entry_safe(flowtable, nf, &table->flowtables, list) {
11350                   list_del(&flowtable->list);
11351                   nft_use_dec(&table->use);
11352                   nf_tables_flowtable_destroy(flowtable);
11353           }
11354           list_for_each_entry_safe(set, ns, &table->sets, list) {
11355                   list_del(&set->list);
11356                   nft_use_dec(&table->use);
11357                   if (set->flags & (NFT_SET_MAP | NFT_SET_OBJECT))
11358                           nft_map_deactivate(&ctx, set);
(gdb)

[ 1360.549372] Last potentially related work creation:
[ 1360.549376]  kasan_save_stack+0x20/0x40
[ 1360.549384]  __kasan_record_aux_stack+0x9b/0xb0
[ 1360.549392]  __queue_work+0x3fb/0x780
[ 1360.549399]  queue_work_on+0x4f/0x60
[ 1360.549407]  nft_rhash_remove+0x33b/0x340 [nf_tables]
[ 1360.549516]  nf_tables_commit+0x1c6a/0x2620 [nf_tables]
[ 1360.549625]  nfnetlink_rcv_batch+0x728/0xdc0 [nfnetlink]
[ 1360.549647]  nfnetlink_rcv+0x1a8/0x1e0 [nfnetlink]
[ 1360.549671]  netlink_unicast+0x367/0x4f0
[ 1360.549680]  netlink_sendmsg+0x34b/0x610
[ 1360.549690]  ____sys_sendmsg+0x4d4/0x510
[ 1360.549697]  ___sys_sendmsg+0xc9/0x120
[ 1360.549706]  __sys_sendmsg+0xbe/0x140
[ 1360.549715]  do_syscall_64+0x55/0x120
[ 1360.549725]  entry_SYSCALL_64_after_hwframe+0x55/0x5d

Fixes: 0935d5588400 ("netfilter: nf_tables: asynchronous release")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-04 11:34:42 +02:00
Pablo Neira Ayuso
0d459e2ffb netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
The commit mutex should not be released during the critical section
between nft_gc_seq_begin() and nft_gc_seq_end(), otherwise, async GC
worker could collect expired objects and get the released commit lock
within the same GC sequence.

nf_tables_module_autoload() temporarily releases the mutex to load
module dependencies, then it goes back to replay the transaction again.
Move it at the end of the abort phase after nft_gc_seq_end() is called.

Cc: stable@vger.kernel.org
Fixes: 720344340fb9 ("netfilter: nf_tables: GC transaction race with abort path")
Reported-by: Kuan-Ting Chen <hexrabbit@devco.re>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-04 11:34:42 +02:00
Pablo Neira Ayuso
a45e688957 netfilter: nf_tables: release batch on table validation from abort path
Unlike early commit path stage which triggers a call to abort, an
explicit release of the batch is required on abort, otherwise mutex is
released and commit_list remains in place.

Add WARN_ON_ONCE to ensure commit_list is empty from the abort path
before releasing the mutex.

After this patch, commit_list is always assumed to be empty before
grabbing the mutex, therefore

  03c1f1ef1584 ("netfilter: Cleanup nft_net->module_list from nf_tables_exit_net()")

only needs to release the pending modules for registration.

Cc: stable@vger.kernel.org
Fixes: c0391b6ab810 ("netfilter: nf_tables: missing validation from the abort path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-04-04 11:34:41 +02:00
Paolo Abeni
72076fc9fe Revert "tg3: Remove residual error handling in tg3_suspend"
This reverts commit 9ab4ad295622a3481818856762471c1f8c830e18.

I went out of coffee and applied it to the wrong tree. Blame on me.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-04 10:51:01 +02:00
Nikita Kiryushin
d72b735712 tg3: Remove residual error handling in tg3_suspend
As of now, tg3_power_down_prepare always ends with success, but
the error handling code from former tg3_set_power_state call is still here.

This code became unreachable in commit c866b7eac073 ("tg3: Do not use
legacy PCI power management").

Remove (now unreachable) error handling code for simplification and change
tg3_power_down_prepare to a void function as its result is no more checked.

Signed-off-by: Nikita Kiryushin <kiryushin@ancud.ru>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240401191418.361747-1-kiryushin@ancud.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-04 10:49:42 +02:00
Nikita Kiryushin
9ab4ad2956 tg3: Remove residual error handling in tg3_suspend
As of now, tg3_power_down_prepare always ends with success, but
the error handling code from former tg3_set_power_state call is still here.

This code became unreachable in commit c866b7eac073 ("tg3: Do not use
legacy PCI power management").

Remove (now unreachable) error handling code for simplification and change
tg3_power_down_prepare to a void function as its result is no more checked.

Signed-off-by: Nikita Kiryushin <kiryushin@ancud.ru>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240401191418.361747-1-kiryushin@ancud.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-04 10:16:50 +02:00
Jakub Kicinski
57a03d83f2 Merge branch 'mlxsw-preparations-for-improving-performance'
Petr Machata says:

====================
mlxsw: Preparations for improving performance

Amit Cohen writes:

mlxsw driver will use NAPI for event processing in a next patch set.
Some additional improvements will be added later. This patch set
prepares the code for NAPI usage and refactor some relevant areas. See
more details in commit messages.

Patch Set overview:
Patches #1-#2 are preparations for patch #3
Patch #3 setups tasklets as part of queue initializtion
Patch #4 removes handling of unlikely scenario
Patch #5 removes unused counters
Patch #6 makes style change in mlxsw_pci_eq_tasklet()
Patch #7-#10 poll command interface instead of EQ0 usage
Patches #11-#12 make style change and break the function
mlxsw_pci_cq_tasklet()
Patches #13-#14 remove functions which can be replaced by a stored value
Patch #15 improves accessing to descriptor queue instance
====================

Link: https://lore.kernel.org/r/cover.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:44 -07:00
Amit Cohen
77c6e27df9 mlxsw: pci: Store DQ pointer as part of CQ structure
Currently, for each completion, we check the number of descriptor queue
and take it via mlxsw_pci_{sdq,rdq}_get(). This is inefficient, the
DQ should be the same for all the completions in CQ, as each CQ handles
only one DQ - SDQ or RDQ. This mapping is handled as part of DQ
initialization via mlxsw_cmd_mbox_sw2hw_dq_cq_set().

Instead, as part of DQ initialization, set DQ pointer in the appropriate
CQ structure. When we handle completions, warn in case that the DQ number
that we expect is different from the number we get in the CQE. Call
WARN_ON_ONCE() only after checking the value, to avoid calling this method
for each completion.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/a5b2559cd6d532c120f3194f89a1e257110318f1.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:41 -07:00
Amit Cohen
82238f0ddb mlxsw: pci: Remove mlxsw_pci_cq_count()
Currently, for each interrupt we call mlxsw_pci_cq_count() to determine the
number of CQs. This call makes additional two function's calls. This can
be removed by storing this value as part of structure 'mlxsw_pci', as we
already do for number of SDQs. Remove the function and
__mlxsw_pci_queue_count() which is now not used and store the value
instead.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/f08ad113e8160678f3c8d401382a696c6c7f44c7.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:41 -07:00
Amit Cohen
0cd1453b7e mlxsw: pci: Remove mlxsw_pci_sdq_count()
The number of SDQs is stored as part of 'mlxsw_pci' structure. In some
cases, the driver uses this value and in some cases it calls
mlxsw_pci_sdq_count() to get the value. Align the code to use the
stored value. This simplifies the code and makes it clearer that the
value is always the same. Rename 'mlxsw_pci->num_sdq_cqs' to
'mlxsw_pci->num_sdqs' as now it is used not only in CQ context.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/0c8788506d9af35d589dbf64be35a508fd63d681.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:41 -07:00
Amit Cohen
1df7d871e3 mlxsw: pci: Break mlxsw_pci_cq_tasklet() into tasklets per queue type
Completion queues are used for completions of RDQ or SDQ. Each
completion queue is used for one DQ. The first CQs are used for SDQs and
the rest are used for RDQs.

Currently, for each CQE (completion queue element), we check 'sr' value
(send/receive) to know if it is completion of RDQ or SDQ. Actually, we
do not really have to check it, as according to the queue number we know
if it handles completions of Rx or Tx.

Break the tasklet into two - one for Rx (RDQ) and one for Tx (SDQ). Then,
setup the appropriate tasklet for each queue as part of queue
initialization. Use 'sr' value for unlikely case that we get completion
with type that we do not expect. Call WARN_ON_ONCE() only after checking
the value, to avoid calling this method for each completion.

A next patch set will use NAPI to handle events, then we will have a
separate poll method for Rx and Tx. This change is a preparation for
NAPI usage.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/50fbc366f8de54cb5dc72a7c4f394333ef71f1d0.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:40 -07:00
Amit Cohen
a0639236d4 mlxsw: pci: Make style change in mlxsw_pci_cq_tasklet()
This function will be broken into several functions later. As preparation,
reorder variables to reverse xmas tree.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/7170a8f4429ecb5a539b0374c621697778ff8363.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:40 -07:00
Amit Cohen
2c200863fc mlxsw: pci: Remove unused wait queue
The previous patch changed the code to do not handle command interface
from event queue. With this change the wait queue is not used anymore.
Remove it and 'wait_done' variable.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/f3af6a5a9dabd97d2920cefe475c6aa57767f504.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:40 -07:00
Amit Cohen
6fc280a365 mlxsw: pci: Use only one event queue
The device supports two event queues. EQ0 is used for command interface
completion events. EQ1 is used for completion events of RDQ or SDQ.

Currently, for each EQE (event queue element), we check the queue number
and handle accordingly. More than that, for each interrupt we schedule
tasklets for both EQs. This is really ineffective, especially because of
the fact that EQ0 is used only as part of driver init/fini, when EMADs are
not available. There is no point to schedule the tasklet for it and check
each EQE.

A previous patch changed the code to poll command interface for each use of
it. It means that now there is no real reason to use EQ0, as we poll the
command interface.

Initialize only one event queue and use it as EQ1 (this is determined by
queue number). Then, for each interrupt we can schedule the tasklet only
for one queue and we do not have to check the queue number. This
simplifies the code and should improve performance. Note that polling
command interface is ok as we use it only as part of driver init/fini.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/23d764f5c032e4c363b98590b746a4b32d2bf900.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:40 -07:00
Amit Cohen
7bc6a3098c mlxsw: pci: Rename MLXSW_PCI_EQS_COUNT
Currently we use MLXSW_PCI_EQS_COUNT event queues. A next patch will
change the driver to initialize only EQ1, as EQ0 is not required anymore
when we poll command interface.

Rename the macro to MLXSW_PCI_EQS_MAX as later we will not initialize
the maximum supported EQs, this value represents the maximum and a new
macro will be added to represent the actual used queues.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/b08df430b62f23ca1aa3aaa257896d2d95aa7691.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:40 -07:00
Amit Cohen
d4b3930b19 mlxsw: pci: Poll command interface for each cmd_exec()
Command interface is used for configuring and querying FW when EMADs are
not available. During the time that the driver sets up the asynchronous
queues, it polls the command interface for getting completions. Then,
there is a short period when asynchronous queues work, but EMADs are not
available (marked in the code as nopoll = true). During this time, we
send commands via command interface, but we do not poll it, as we can get
an interrupt for the completion. Completions of command interface are
received from HW in EQ0 (event queue 0).

The usage of EQ0 instead of polling is done only 4 times during
initialization and one time during tear down, but it makes an overhead
during lifetime of the driver. For each interrupt, we have to check if
we get events in EQ0 or EQ1 and handle them. This is really ineffective,
especially because of the fact that EQ0 is used only as part of driver
init/fini.

Instead, we can poll command interface for each call of cmd_exec(). It
means that when we send a command via command interface (as EMADs are
not available), we will poll it, regardless of availability of the
asynchronous queues. This will allow us to configure later only EQ1 and
simplify the flow.

Remove 'nopoll' indication and change mlxsw_pci_cmd_exec() to poll till
answer/timeout regardless of queues' state. For now, completions are
handled also by EQ0, but it will be removed in next patch. Additional
cleanups will be added in next patches.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/e674c70380ceda953e0e45a77334c5d22e69938f.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:40 -07:00
Amit Cohen
29ad2a9906 mlxsw: pci: Make style changes in mlxsw_pci_eq_tasklet()
This function will be used later only for EQ1. As preparation, reorder
variables to reverse xmas tree and return earlier when it is possible, to
simplify the code.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/2412d6c135b2a6aedb4484f5d8baab3aecd7b9ae.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:40 -07:00
Amit Cohen
57beea8e56 mlxsw: pci: Remove unused counters
The structure 'mlxsw_pci_queue' stores several counters which were consumed
via debugfs. Since commit 9a32562becd9 ("mlxsw: Remove debugfs interface"),
these counters are not used. Remove them. This makes the 'union u' and
'struct eq' redundant. Maintain 'struct cq' as it will be extended later.

Replace increasing 'q->u.eq.ev_other_count' with WARN_ON_ONCE(), as it is
used in an unreasonable case of receiving event in EQ which is not EQ0 or
EQ1. When the queues are initialized, we check number of event queues and
fail with the print "Unsupported number of queues" in case that the driver
tries to initialize more than two queues.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/ee9e658800aa0390e08342100bc27daff4c176c0.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:39 -07:00
Amit Cohen
38b124cb4e mlxsw: pci: Arm CQ doorbell regardless of number of completions
Currently, as part of mlxsw_pci_cq_tasklet(), we check if any item
was handled, and only in such case we arm doorbell. This is unlikely case,
as we schedule tasklet only for CQs that we get an event for them, which
means that they contain completions to handle. Remove this check, which
is supposed to be true always, and even if it is false, it is not a mistake
to ring the doorbell. We can warn on such case, but it is not really worth
to add a check which will be run for each CQ handling when we do not expect
to reach it and it does not point to logic error that should be handled.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/f8efa481bfe7bebb9f93bb803f44ab7da77f53e6.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:39 -07:00
Amit Cohen
fb29028ae7 mlxsw: pci: Do not setup tasklet from operation
Currently, the structure 'mlxsw_pci_queue_ops' holds a pointer to the
callback function of tasklet. This is used only for EQ and CQ. mlxsw
driver will use NAPI in a following patch set, so CQ will not use tasklet
anymore. As preparation, remove this pointer from the shared operation
structure and setup the tasklet as part of queue initialization.
For now, setup tasklet for EQ and CQ. Later, CQ code will be changed.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/a326cae5fc1ad085a1a063c004983de6fe389414.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:39 -07:00
Amit Cohen
f46de9f0e7 mlxsw: pci: Move mlxsw_pci_cq_{init, fini}()
Move mlxsw_pci_cq_{init, fini}() after mlxsw_pci_cq_tasklet() as a next
patch will setup the tasklet as part of initialization.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/25196cb5baf5acf6ec1e956203790e018ba8e306.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:39 -07:00
Amit Cohen
d38718a525 mlxsw: pci: Move mlxsw_pci_eq_{init, fini}()
Move mlxsw_pci_eq_{init, fini}() after mlxsw_pci_eq_tasklet() as a next
patch will setup the tasklet as part of initialization.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/7ae120a02e1c490084daae7e684a0d40b7cce4e7.1712062203.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:50:39 -07:00
Jakub Kicinski
6b164687f8 Merge branch 'mlx5-misc-patches'
Tariq Toukan says:

====================
mlx5 misc patches

This patchset includes small features and misc code enhancements for the
mlx5 core and EN drivers.

Patches 1-4 by Gal improves the mlx5e ethtool stats implementation, for
example by using standard helpers ethtool_sprintf/puts.

Patch 5 by me adds a reset option for the FW command interface debugfs
stats entries. This allows explicit FW command interface stats reset
between different runs of a test case.

Patches 6 and 7 are simple cleanups.

Patch 8 by Gal adds driver support for 800Gbps link modes.

Patch 9 by Jianbo enhances the L4 steering abilities.

Patches 10-11 by Jianbo save redundant operations.
====================

Link: https://lore.kernel.org/r/20240402133043.56322-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03 19:48:01 -07:00