linux

iv/linux

Author	SHA1	Message	Date
Tasos Sahanidis	6be2e7522e	ALSA: ymfpci: Fix BUG_ON in probe function The snd_dma_buffer.bytes field now contains the aligned size, which this snd_BUG_ON() did not account for, resulting in the following: [ 9.625915] ------------[ cut here ]------------ [ 9.633440] WARNING: CPU: 0 PID: 126 at sound/pci/ymfpci/ymfpci_main.c:2168 snd_ymfpci_create+0x681/0x698 [snd_ymfpci] [ 9.648926] Modules linked in: snd_ymfpci(+) snd_intel_dspcfg kvm(+) snd_intel_sdw_acpi snd_ac97_codec snd_mpu401_uart snd_opl3_lib irqbypass snd_hda_codec gameport snd_rawmidi crct10dif_pclmul crc32_pclmul cfg80211 snd_hda_core polyval_clmulni polyval_generic gf128mul snd_seq_device ghash_clmulni_intel snd_hwdep ac97_bus sha512_ssse3 rfkill snd_pcm aesni_intel tg3 snd_timer crypto_simd snd mxm_wmi libphy cryptd k10temp fam15h_power pcspkr soundcore sp5100_tco wmi acpi_cpufreq mac_hid dm_multipath sg loop fuse dm_mod bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 sr_mod cdrom ata_generic pata_acpi firewire_ohci crc32c_intel firewire_core xhci_pci crc_itu_t pata_via xhci_pci_renesas floppy [ 9.711849] CPU: 0 PID: 126 Comm: kworker/0:2 Not tainted 6.1.21-1-lts #1 08d2e5ece03136efa7c6aeea9a9c40916b1bd8da [ 9.722200] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./990FX Extreme4, BIOS P2.70 06/05/2014 [ 9.732204] Workqueue: events work_for_cpu_fn [ 9.736580] RIP: 0010:snd_ymfpci_create+0x681/0x698 [snd_ymfpci] [ 9.742594] Code: 8c c0 4c 89 e2 48 89 df 48 c7 c6 92 c6 8c c0 e8 15 d0 e9 ff 48 83 c4 08 44 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f e9 d3 7a 33 e3 <0f> 0b e9 cb fd ff ff 41 bd fb ff ff ff eb db 41 bd f4 ff ff ff eb [ 9.761358] RSP: 0018:ffffab64804e7da0 EFLAGS: 00010287 [ 9.766594] RAX: ffff8fa2df06c400 RBX: ffff8fa3073a8000 RCX: ffff8fa303fbc4a8 [ 9.773734] RDX: ffff8fa2df06d000 RSI: 0000000000000010 RDI: 0000000000000020 [ 9.780876] RBP: ffff8fa300b5d0d0 R08: ffff8fa3073a8e50 R09: 00000000df06bf00 [ 9.788018] R10: ffff8fa2df06bf00 R11: 00000000df068200 R12: ffff8fa3073a8918 [ 9.795159] R13: 0000000000000000 R14: 0000000000000080 R15: ffff8fa2df068200 [ 9.802317] FS: 0000000000000000(0000) GS:ffff8fa9fec00000(0000) knlGS:0000000000000000 [ 9.810414] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9.816158] CR2: 000055febaf66500 CR3: 0000000101a2e000 CR4: 00000000000406f0 [ 9.823301] Call Trace: [ 9.825747] <TASK> [ 9.827889] snd_card_ymfpci_probe+0x194/0x950 [snd_ymfpci b78a5fe64b5663a6390a909c67808567e3e73615] [ 9.837030] ? finish_task_switch.isra.0+0x90/0x2d0 [ 9.841918] local_pci_probe+0x45/0x80 [ 9.845680] work_for_cpu_fn+0x1a/0x30 [ 9.849431] process_one_work+0x1c7/0x380 [ 9.853464] worker_thread+0x1af/0x390 [ 9.857225] ? rescuer_thread+0x3b0/0x3b0 [ 9.861254] kthread+0xde/0x110 [ 9.864414] ? kthread_complete_and_exit+0x20/0x20 [ 9.869210] ret_from_fork+0x22/0x30 [ 9.872792] </TASK> [ 9.874985] ---[ end trace 0000000000000000 ]--- Fixes: 5c1733e33c88 ("ALSA: memalloc: Align buffer allocations in page size") Signed-off-by: Tasos Sahanidis <tasos@tasossah.com> Link: https://lore.kernel.org/r/20230329032808.170403-1-tasos@tasossah.com Signed-off-by: Takashi Iwai <tiwai@suse.de>	2023-03-29 08:25:59 +02:00
Tasos Sahanidis	f33fc15767	ALSA: ymfpci: Create card with device-managed snd_devm_card_new() snd_card_ymfpci_remove() was removed in commit c6e6bb5eab74 ("ALSA: ymfpci: Allocate resources with device-managed APIs"), but the call to snd_card_new() was not replaced with snd_devm_card_new(). Since there was no longer a call to snd_card_free, unloading the module would eventually result in Oops: [697561.532887] BUG: unable to handle page fault for address: ffffffffc0924480 [697561.532893] #PF: supervisor read access in kernel mode [697561.532896] #PF: error_code(0x0000) - not-present page [697561.532899] PGD ae1e15067 P4D ae1e15067 PUD ae1e17067 PMD 11a8f5067 PTE 0 [697561.532905] Oops: 0000 [#1] PREEMPT SMP NOPTI [697561.532909] CPU: 21 PID: 5080 Comm: wireplumber Tainted: G W OE 6.2.7 #1 [697561.532914] Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS, BIOS 4408 10/28/2022 [697561.532916] RIP: 0010:try_module_get.part.0+0x1a/0xe0 [697561.532924] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc bf 01 00 00 00 e8 56 3c f8 ff <41> 83 3c 24 02 0f 84 96 00 00 00 41 8b 84 24 30 03 00 00 85 c0 0f [697561.532927] RSP: 0018:ffffbe9b858c3bd8 EFLAGS: 00010246 [697561.532930] RAX: ffff9815d14f1900 RBX: ffff9815c14e6000 RCX: 0000000000000000 [697561.532933] RDX: 0000000000000000 RSI: ffffffffc055092c RDI: ffffffffb3778c1a [697561.532935] RBP: ffffbe9b858c3be8 R08: 0000000000000040 R09: ffff981a1a741380 [697561.532937] R10: ffffbe9b858c3c80 R11: 00000009d56533a6 R12: ffffffffc0924480 [697561.532939] R13: ffff9823439d8500 R14: 0000000000000025 R15: ffff9815cd109f80 [697561.532942] FS: 00007f13084f1f80(0000) GS:ffff9824aef40000(0000) knlGS:0000000000000000 [697561.532945] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [697561.532947] CR2: ffffffffc0924480 CR3: 0000000145344000 CR4: 0000000000350ee0 [697561.532949] Call Trace: [697561.532951] <TASK> [697561.532955] try_module_get+0x13/0x30 [697561.532960] snd_ctl_open+0x61/0x1c0 [snd] [697561.532976] snd_open+0xb4/0x1e0 [snd] [697561.532989] chrdev_open+0xc7/0x240 [697561.532995] ? fsnotify_perm.part.0+0x6e/0x160 [697561.533000] ? __pfx_chrdev_open+0x10/0x10 [697561.533005] do_dentry_open+0x169/0x440 [697561.533009] vfs_open+0x2d/0x40 [697561.533012] path_openat+0xa9d/0x10d0 [697561.533017] ? debug_smp_processor_id+0x17/0x20 [697561.533022] ? trigger_load_balance+0x65/0x370 [697561.533026] do_filp_open+0xb2/0x160 [697561.533032] ? _raw_spin_unlock+0x19/0x40 [697561.533036] ? alloc_fd+0xa9/0x190 [697561.533040] do_sys_openat2+0x9f/0x160 [697561.533044] __x64_sys_openat+0x55/0x90 [697561.533048] do_syscall_64+0x3b/0x90 [697561.533052] entry_SYSCALL_64_after_hwframe+0x72/0xdc [697561.533056] RIP: 0033:0x7f1308a40db4 [697561.533059] Code: 24 20 eb 8f 66 90 44 89 54 24 0c e8 46 68 f8 ff 44 8b 54 24 0c 44 89 e2 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 32 44 89 c7 89 44 24 0c e8 78 68 f8 ff 8b 44 [697561.533062] RSP: 002b:00007ffcce664450 EFLAGS: 00000293 ORIG_RAX: 0000000000000101 [697561.533066] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f1308a40db4 [697561.533068] RDX: 0000000000080000 RSI: 00007ffcce664690 RDI: 00000000ffffff9c [697561.533070] RBP: 00007ffcce664690 R08: 0000000000000000 R09: 0000000000000012 [697561.533072] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000080000 [697561.533074] R13: 00007f13054b069b R14: 0000565209f83200 R15: 0000000000000000 [697561.533078] </TASK> Fixes: c6e6bb5eab74 ("ALSA: ymfpci: Allocate resources with device-managed APIs") Signed-off-by: Tasos Sahanidis <tasos@tasossah.com> Link: https://lore.kernel.org/r/20230329032422.170024-1-tasos@tasossah.com Signed-off-by: Takashi Iwai <tiwai@suse.de>	2023-03-29 08:25:51 +02:00
Felix Fietkau	07b3af42d8	net: ethernet: mtk_eth_soc: fix tx throughput regression with direct 1G links Using the QDMA tx scheduler to throttle tx to line speed works fine for switch ports, but apparently caused a regression on non-switch ports. Based on a number of tests, it seems that this throttling can be safely dropped without re-introducing the issues on switch ports that the tx scheduling changes resolved. Link: https://lore.kernel.org/netdev/trinity-92c3826f-c2c8-40af-8339-bc6d0d3ffea4-1678213958520@3c-app-gmx-bs16/ Fixes: f63959c7eec3 ("net: ethernet: mtk_eth_soc: implement multi-queue support for per-port queues") Reported-by: Frank Wunderlich <frank-w@public-files.de> Reported-by: Daniel Golle <daniel@makrotopia.org> Tested-by: Daniel Golle <daniel@makrotopia.org> Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20230324140404.95745-1-nbd@nbd.name Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-28 23:23:50 -07:00
Wolfram Sang	cdeccd13a0	Revert "sh_eth: remove open coded netif_running()" This reverts commit ce1fdb065695f49ef6f126d35c1abbfe645d62d5. It turned out this actually introduces a race condition. netif_running() is not a suitable check for get_stats. Reported-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230327152112.15635-1-wsa+renesas@sang-engineering.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-28 19:23:32 -07:00
Jakub Kicinski	2600badfea	Merge branch 'net-refcount-address-dst_entry-reference-count-scalability-issues' Thomas Gleixner says: ==================== net, refcount: Address dst_entry reference count scalability issues This is version 3 of this series. Version 2 can be found here: https://lore.kernel.org/lkml/20230307125358.772287565@linutronix.de Wangyang and Arjan reported a bottleneck in the networking code related to struct dst_entry::__refcnt. Performance tanks massively when concurrency on a dst_entry increases. This happens when there are a large amount of connections to or from the same IP address. The memtier benchmark when run on the same host as memcached amplifies this massively. But even over real network connections this issue can be observed at an obviously smaller scale (due to the network bandwith limitations in my setup, i.e. 1Gb). How to reproduce: Run memcached with -t $N and memtier_benchmark with -t $M and --ratio=1:100 on the same machine. localhost connections amplify the problem. Start with the defaults for $N and $M and increase them. Depending on your machine this will tank at some point. But even in reasonably small $N, $M scenarios the refcount operations and the resulting false sharing fallout becomes visible in perf top. At some point it becomes the dominating issue. There are two factors which make this reference count a scalability issue: 1) False sharing dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into a seperate cacheline vs. the read mostly members located at the beginning of the struct. That prevents false sharing vs. the struct members in the first 64 bytes of the structure, but there is also dst_entry::lwtstate which is located after the reference count and in the same cache line. This member is read after a reference count has been acquired. The other problem is struct rtable, which embeds a struct dst_entry at offset 0. struct dst_entry has a size of 112 bytes, which means that the struct members of rtable which follow the dst member share the same cache line as dst_entry::__refcnt. Especially rtable::rt_genid is also read by the contexts which have a reference count acquired already. When dst_entry:__refcnt is incremented or decremented via an atomic operation these read accesses stall and contribute to the performance problem. 2) atomic_inc_not_zero() A reference on dst_entry:__refcnt is acquired via atomic_inc_not_zero() and released via atomic_dec_return(). atomic_inc_not_zero() is implemted via a atomic_try_cmpxchg() loop, which exposes O(N^2) behaviour under contention with N concurrent operations. Contention scalability is degrading with even a small amount of contenders and gets worse from there. Lightweight instrumentation exposed an average of 8!! retry loops per atomic_inc_not_zero() invocation in a inc()/dec() loop running concurrently on 112 CPUs. There is nothing which can be done to make atomic_inc_not_zero() more scalable. The following series addresses these issues: 1) Reorder and pad struct dst_entry to prevent the false sharing. 2) Implement and use a reference count implementation which avoids the atomic_inc_not_zero() problem. It is slightly less performant in the case of the final 0 -> -1 transition, but the deconstruction of these objects is a low frequency event. get()/put() pairs are in the hotpath and that's what this implementation optimizes for. The algorithm of this reference count is only suitable for RCU managed objects. Therefore it cannot replace the refcount_t algorithm, which is also based on atomic_inc_not_zero(), due to a subtle race condition related to the 0 -> -1 transition and the final verdict to mark the reference count dead. See details in patch 2/3. It might be just my lack of imagination which declares this to be impossible and I'd be happy to be proven wrong. As a bonus the new rcuref implementation provides underflow/overflow detection and mitigation while being performance wise on par with open coded atomic_inc_not_zero() / atomic_dec_return() pairs even in the non-contended case. The combination of these two changes results in performance gains in micro benchmarks and also localhost and networked memtier benchmarks talking to memcached. It's hard to quantify the benchmark results as they depend heavily on the micro-architecture and the number of concurrent operations. The overall gain of both changes for localhost memtier ranges from 1.2X to 3.2X and from +2% to %5% range for networked operations on a 1Gb connection. A micro benchmark which enforces maximized concurrency shows a gain between 1.2X and 4.7X!!! Obviously this is focussed on a particular problem and therefore needs to be discussed in detail. It also requires wider testing outside of the cases which this is focussed on. Though the false sharing issue is obvious and should be addressed independent of the more focussed reference count changes. The series is also available from git: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rcuref Changes vs. V2: - Rename __refcnt to __rcuref (Linus) - Fix comments and changelogs (Mark, Qiuxu) - Fixup kernel doc of generated atomic_add_negative() variants I want to say thanks to Wangyang who analyzed the issue and provided the initial fix for the false sharing problem. Further thanks go to Arjan Peter, Marc, Will and Borislav for valuable input and providing test results on machines which I do not have access to, and to Linus and Eric, Qiuxu and Mark for helpful feedback. ==================== Link: https://lore.kernel.org/r/20230323102649.764958589@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-28 18:53:43 -07:00
Thomas Gleixner	bc9d3a9f2a	net: dst: Switch to rcuref_t reference counting Under high contention dst_entry::__refcnt becomes a significant bottleneck. atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into high retry rates on contention. Switch the reference count to rcuref_t which results in a significant performance gain. Rename the reference count member to __rcuref to reflect the change. The gain depends on the micro-architecture and the number of concurrent operations and has been measured in the range of +25% to +130% with a localhost memtier/memcached benchmark which amplifies the problem massively. Running the memtier/memcached benchmark over a real (1Gb) network connection the conversion on top of the false sharing fix for struct dst_entry::__refcnt results in a total gain in the 2%-5% range over the upstream baseline. Reported-by: Wangyang Guo <wangyang.guo@intel.com> Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de Link: https://lore.kernel.org/r/20230323102800.215027837@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-28 18:52:28 -07:00
Wangyang Guo	d288a162dd	net: dst: Prevent false sharing vs. dst_entry:: __refcnt dst_entry::__refcnt is highly contended in scenarios where many connections happen from and to the same IP. The reference count is an atomic_t, so the reference count operations have to take the cache-line exclusive. Aside of the unavoidable reference count contention there is another significant problem which is caused by that: False sharing. perf top identified two affected read accesses. dst_entry::lwtstate and rtable::rt_genid. dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into a seperate cacheline vs. the read mostly members located at the beginning of the struct. That prevents false sharing vs. the struct members in the first 64 bytes of the structure, but there is also dst_entry::lwtstate which is located after the reference count and in the same cache line. This member is read after a reference count has been acquired. struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a size of 112 bytes, which means that the struct members of rtable which follow the dst member share the same cache line as dst_entry::__refcnt. Especially rtable::rt_genid is also read by the contexts which have a reference count acquired already. When dst_entry:__refcnt is incremented or decremented via an atomic operation these read accesses stall. This was found when analysing the memtier benchmark in 1:100 mode, which amplifies the problem extremly. Move the rt[6i]_uncached[_list] members out of struct rtable and struct rt6_info into struct dst_entry to provide padding and move the lwtstate member after that so it ends up in the same cache line. The resulting improvement depends on the micro-architecture and the number of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached benchmark. [ tglx: Rearrange struct ] Signed-off-by: Wangyang Guo <wangyang.guo@intel.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230323102800.042297517@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-28 18:52:22 -07:00
Jakub Kicinski	b133fffe57	Merge branch 'locking/rcuref' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pulling rcurefs from Peter for tglx's work. Link: https://lore.kernel.org/all/20230328084534.GE4253@hirez.programming.kicks-ass.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-28 18:49:35 -07:00
Saeed Mahameed	163c2c7059	net/mlx5e: Fix build break on 32bit The cited commit caused the following build break in mlx5 due to a change in size of MAX_SKB_FRAGS. error: format '%lu' expects argument of type 'long unsigned int', but argument 7 has type 'unsigned int' [-Werror=format=] Fix this by explicit casting. Fixes: 3948b05950fd ("net: introduce a config option to tweak MAX_SKB_FRAGS") Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230328200723.125122-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-28 16:18:59 -07:00
Dragos Tatulea	3905f8d64c	net/mlx5e: RX, Remove unnecessary recycle parameter and page_cache stats The recycle parameter used during page release is no longer necessary: the page pool can detect when the page cannot be recycled to the cache or ring without any outside hint. The page pool will also take care of cleaning up after itself once all the inflight pages have been released. So no need to explicitly release pages to the system. Remove the internal page_cache stats as the mlx5e_page_cache struct no longer exists. Delete the documentation entries along with the stats. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:59 -07:00
Dragos Tatulea	cd640b0503	net/mlx5e: RX, Break the wqe bulk refill in smaller chunks To avoid overflowing the page pool's cache, don't release the whole bulk which is usually larger than the cache refill size. Group release+alloc instead into cache refill units that allow releasing to the cache and then allocating from the cache. A refill_unit variable is added as a iteration unit over the wqe_bulk when doing release+alloc. For a single ring, single core, default MTU (1500) TCP stream test the number of pages allocated from the cache directly (rx_pp_recycle_cached) increases from 0% to 52%: +---------------------------------------------+ \| Page Pool stats (/sec) \| Before \| After \| +-------------------------+---------+---------+ \|rx_pp_alloc_fast \| 2145422 \| 2193802 \| \|rx_pp_alloc_slow \| 2 \| 0 \| \|rx_pp_alloc_empty \| 2 \| 0 \| \|rx_pp_alloc_refill \| 34059 \| 16634 \| \|rx_pp_alloc_waive \| 0 \| 0 \| \|rx_pp_recycle_cached \| 0 \| 1145818 \| \|rx_pp_recycle_cache_full \| 0 \| 0 \| \|rx_pp_recycle_ring \| 2179361 \| 1064616 \| \|rx_pp_recycle_ring_full \| 121 \| 0 \| +---------------------------------------------+ With this patch, the performance for legacy rq for the above test is back to baseline. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:59 -07:00
Dragos Tatulea	4ba2b4988c	net/mlx5e: RX, Increase WQE bulk size for legacy rq Deferred page release was added to legacy rq but its desired effect (driver releases last fragment to page pool cache) is not yet visible due to the WQE bulks being too small. This patch increases the WQE bulk size to span 512 KB and clip it to one quarter of the rx queue size. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:59 -07:00
Dragos Tatulea	76238d0fbd	net/mlx5e: RX, Split off release path for xsk buffers for legacy rq Don't mix xsk buffer releases with page releases anymore. This is needed for handling of deferred page release. Add a new bulk free function for xsk buffers from wqe frags. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:59 -07:00
Dragos Tatulea	3f93f82988	net/mlx5e: RX, Defer page release in legacy rq for better recycling Currently, fragmented pages from the page pool can be released in two ways: 1) In the mlx5e driver when trimming off the unused fragments AND the associated skb fragments have been released. This path allows recycling of pages to the page pool cache (allow_direct == true). 2) On the skb release path (last fragment release), which will always release pages to the page pool ring (allow_direct == false). Whichever is releasing the last fragment will be decisive on where the page gets released: the cache or the ring. So we obviously want to maximize for doing the release from 1. This patch does that by deferring the release of page fragments right before requesting new ones from the page pool. A flag is added to make sure that there's no release before first alloc and that XDP_TX fragments are not released prematurely. This is a preparation patch that doesn't unlock the performance improvements yet. A followup patch will do that. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:59 -07:00
Dragos Tatulea	625dff29df	net/mlx5e: RX, Change wqe last_in_page field from bool to bit flags Change the bool flag to a bitfield as we'll use it in a downstream patch in the series to add signaling about skipping a fragment release. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:58 -07:00
Dragos Tatulea	4c2a132368	net/mlx5e: RX, Defer page release in striding rq for better recycling Currently, for striding RQ, fragmented pages from the page pool can get released in two ways: 1) In the mlx5e driver when trimming off the unused fragments AND the associated skb fragments have been released. This path allows recycling of pages to the page pool cache (allow_direct == true). 2) On the skb release path (last fragment release), which will always release pages to the page pool ring (allow_direct == false). Whichever is releasing the last fragment will be decisive on where the page gets released: the cache or the ring. So we obviously want to maximize for doing the release from 1. This patch does that by deferring the release of page fragments right before requesting new ones from the page pool. Extra care needs to be taken for the corner cases: * On first call, make sure that release is not called. The skip_release_bitmap is used for this purpose. * On rq shutdown, make sure that all wqes that were not in the linked list are released. For a single ring, single core, default MTU (1500) TCP stream test the number of pages allocated from the cache directly (rx_pp_recycle_cached) increases from 31 % to 98 %: +----------------------------------------------+ \| Page Pool stats (/sec) \| Before \| After \| +-------------------------+---------+----------+ \|rx_pp_alloc_fast \| 2137754 \| 2261033 \| \|rx_pp_alloc_slow \| 47 \| 9 \| \|rx_pp_alloc_empty \| 47 \| 9 \| \|rx_pp_alloc_refill \| 23230 \| 819 \| \|rx_pp_alloc_waive \| 0 \| 0 \| \|rx_pp_recycle_cached \| 672182 \| 2209015 \| \|rx_pp_recycle_cache_full \| 1789 \| 0 \| \|rx_pp_recycle_ring \| 1485848 \| 52259 \| \|rx_pp_recycle_ring_full \| 3003 \| 584 \| +----------------------------------------------+ With this patch, the performance in striding rq for the above test is back to baseline. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:58 -07:00
Dragos Tatulea	38a36efccd	net/mlx5e: RX, Rename xdp_xmit_bitmap to a more generic name The xdp_xmit_bitmap currently serves only one purpose: to avoid releasing pages that are still in use due to XDP TX. A following patch will use this bitmap in a slightly different context but for the same purpose. So rename the bitmap to a more generic name that reflects the purpose not the context. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:58 -07:00
Dragos Tatulea	6f57428460	net/mlx5e: RX, Enable skb page recycling through the page_pool Start using the page_pool skb recycling api to recycle all pages back to the page pool and stop using atomic page reference counting. The mlx5e driver used to manage in-flight pages using page refcounting: for each fragment there were 2 atomic write operations happening (one for building the skb and one on skb release). The page_pool api introduced a method to track page fragments more optimally: * The page's pp_fragment_count is set to a large bias on page alloc (1 x atomic write operation). * The driver tracks the actual page fragments in a non atomic variable. * When the skb is recycled, pp_fragment_count is decremented (atomic write operation). * When page is released in the driver, the unused number of fragments (relative to the bias) is deducted from pp_fragment_count (atomic write operation). * Last page defragmentation will only be an atomic read. So in total there are `number of fragments + 1` atomic write ops. As opposed to previously: `2 * frags` atomic writes ops. Pages are wrapped in a mlx5e_frag_page structure which also contains the number of fragments. This makes it easy to count the fragments in the driver. This change brings performance improvements for the case when the old rx page_cache had low recycling rates due to head of queue blocking. For a iperf3 TCP test with a single stream, on a single core (iperf and receive queue running on same core), the following improvements can be noticed: * Striding rq: - before (net-next baseline): bitrate = 30.1 Gbits/sec - after : bitrate = 31.4 Gbits/sec (diff: 4.14 %) * Legacy rq: - before (net-next baseline): bitrate = 30.2 Gbits/sec - after : bitrate = 33.0 Gbits/sec (diff: 8.48 %) There are 2 temporary performance degradations introduced: 1) TCP streams that had a good recycling rate with the old page_cache have a degradation for both striding and linear rq. This is due to very low page pool cache recycling: the pages are released during skb recycle which will release pages to the page pool ring for safety. The following patches in this series will tackle this problem by deferring the page release in the driver to increase the chance of having pages recycled to the cache. 2) XDP performance is now lower (4-5 %) due to the higher number of atomic operations used for fragment management. But this opens the door for supporting multiple packets per page in XDP, which will bring a big gain. Otherwise, performance is similar to baseline. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:58 -07:00
Dragos Tatulea	4a5c5e2500	net/mlx5e: RX, Enable dma map and sync from page_pool allocator Remove driver dma mapping and unmapping of pages. Let the page_pool api do it. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:58 -07:00
Dragos Tatulea	08c9b61b07	net/mlx5e: RX, Remove internal page_cache This patch removes the internal rx page_cache and uses the generic page_pool api only. It used to be that the page_pool couldn't handle all the mlx5 driver usecases, but with the introduction of skb recycling and page fragmentaton in the page_pool full switch can now be made. Some benfits of this transition: * Better page recycling in the cases when the page_cache was suffering from head of queue blocking. The page_pool doesn't have this issue. * DMA mapping/unmapping can be managed by the page_pool. * mlx5e_rq size reduced by more than 50% due to the page_cache array being deleted. This patch only removes the page_cache. Downstream patches will enable the required page_pool features and will add further fine-tuning. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:57 -07:00
Dragos Tatulea	ca6ef9f031	net/mlx5e: RX, Store SHAMPO header pages in array Save allocated SHAMPO header pages to an array to which the mlx5e_dma_info page will point to. This change is a preparation for introducing mlx5e_frag_page structure in a downstream patch. There's no new functionality introduced. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:57 -07:00
Dragos Tatulea	d39092caae	net/mlx5e: RX, Remove alloc unit layout constraint for striding rq This change removes the usage of mlx5e_alloc_unit union for striding rq. The change is more straightforward than legacy rq as the alloc units union is already in place. This patch only moves things around: instead of an array of unions make it a union of arrays. This has the effect that each mlx5e_mpw_info will allocate the largest possible size of the array member. For xsk this means that the array of xdp_buff pointers for the wqe will still be contiguous, but there will be some extra unused bytes at the end of the array. As further patch in the series will add the mlx5e_frag_page struct for which the described size constraint will no longer hold. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:57 -07:00
Dragos Tatulea	8fb1814f58	net/mlx5e: RX, Remove alloc unit layout constraint for legacy rq The mlx5e_alloc_unit union is conveniently used to store arrays of pointers to struct page or struct xdp_buff (for xsk). The union is currently expected to have the size of a pointer for xsk batch allocations to work. This is conveniet for the current state of the code but makes it impossible to add a structure of a different size to the alloc unit. A further patch in the series will add the mlx5e_frag_page struct for which the described size constraint will no longer hold. This change removes the usage of mlx5e_alloc_unit union for legacy rq: - A union of arrays is introduced (mlx5e_alloc_units) to replace the array of unions to allow structures of different sizes. - Each fragment has a pointer to a unit in the mlx5e_alloc_units array. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:57 -07:00
Dragos Tatulea	09df037017	net/mlx5e: RX, Remove mlx5e_alloc_unit argument in page allocation Change internal page cache and page pool api to use a struct page ** instead of a mlx5e_alloc_unit *. This is the first change in a series which is meant to remove the mlx5e_alloc_unit altogether. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2023-03-28 13:43:56 -07:00
Linus Torvalds	fcd476ea6a	Urgent RCU pull request for v6.3 This commit brings the rcu_torture_read event trace into line with the new trace tools by replacing this event trace's __field() with the corresponding __array(). Without this commit, the new trace tools will fail when presented wtih an rcu_torture_read event trace, which is a regression from the viewpoint of trace tools users. https://lore.kernel.org/all/20230320133650.5388a05e@gandalf.local.home/ -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmQjH24THHBhdWxtY2tA a2VybmVsLm9yZwAKCRCevxLzctn7jHhWD/9poElrFAcHjSII+grUJj2F9FxEH3eK 92c7KngRPskN2rVSbJf/iPqIujrmOWridAnTXYEAiEtcn08sse5ucmHGbi9Mid2u o6W5bjB1tLGkLm2uHB7cKAoF++a0pm/O70Di0Ui+Z7hBanCzNdKfBIlV1DbFlNwt oVTy+iqqYfzOvXZ42gtvctpj0h2q9saMb3As5fP7sUjAIB90rCH8Fy14cMvVZs0q EWXk1yBCWSetSt668YWhUgiZJdy9nXWuxdE48bIqKPLPTFcNP4lCf5z19ygcC8vN pBLL9/qJKKzAytVqXQQokYgEzel+gXvJSfeWD/rQKjmA0rL/Jue8RF/wERNyUKO7 +V0Qs2fJMvIkg0TiifqO8aEA/DfYDTnbornwFLzhyMidNO4JtludSndhc1HdIeE/ 5rKsRT+ZyOp1iEsyiPlHh9LvQY0ds90C1oCCs3lPv4KvXDnm3558jzVVe+L5QSnE l3XtOr/h7PG8YY38QDHyIyjgu2bynxVpxYMqDO8GoVWxMe019Hwj8n9ENRfo+mZ9 fEhvhMazPmiK06jemk0iz4HQickwDZ+bOFOM1SV9n+HiO9BRYN/rHZc/PU5368Ln yTcVpAGZKaYdtuQvo536xy9kKH1CQOzFcAW8e4JAu5rNsa8csIPtYcgEczm5qiaE n+hdQV0uXY+71Q== =3b8C -----END PGP SIGNATURE----- Merge tag 'urgent-rcu.2023.03.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fix from Paul McKenney: "This brings the rcu_torture_read event trace into line with the new trace tools by replacing this event trace's __field() with the corresponding __array(). Without this, the new trace tools will fail when presented wtih an rcu_torture_read event trace, which is a regression from the viewpoint of trace tools users" Link: https://lore.kernel.org/all/20230320133650.5388a05e@gandalf.local.home/ * tag 'urgent-rcu.2023.03.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: rcu: Fix rcu_torture_read ftrace event	2023-03-28 13:28:52 -07:00
Linus Torvalds	756c1a0593	linux-kselftest-fixes-6.3-rc5 This Kselftest fixes update for Linux 6.3-rc5 consists of one single fix for sigaltstack test -Wuninitialized warning found when building with clang. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmQjEZsACgkQCwJExA0N Qxzgeg//RJZvZr1iqQ0w+C21e4BRzMMPjQEwAbKEoN9opiRx4nzMdSoJKxl3x66S NTSaAST89plIaJGpHAbqC59SVWsUwBY9XdrknNV7PAwIBx4yHeQ8JlpIOK6dXoYV tyGQxKvOQOtKP+Ou8iGKP9yv81dZylrcCVDrWG34xfZRqo3FIL6cs/U3DiGF+k8D ofLKaqqENdOJtXzE6aIk4hJK7I60EvMealwF22KLnOKaNl8Lh6zWCRzf6BAV7eYY 8NO8sg4nAjszDF3HBkgxN7ON3c8Ei8Rn8DMfh5nhV6ni0wkwkd6uabcVaewixvJ2 qbVVRuBRECUqvJ1AwYizmXszizDEPTrl/1F8EpTmwe3rXd61WazN9eEXUW3chCeu bQJaBzKYVemy/d+JEEYwgHHg9kM7V80wlZE9lnPtocZp4agkkvKqInGBeaTSbyuX sEeYk1QPDBQ5kVqjzydm6eNqkG4l4g0o8GJHfKy0cO98vYt/IcJirMpP6iXI7aNt Z+VZsD8lIMoiLJ/mk1+uJqUvj8Er/BePwoaSgQDa8HEjtYC0umrpPtSErLb43cDO f8pWVLdaSF55lD+O4JC5QJcjPj1xQkHlo6J4B3vWPC2rHZX8VBfpFxuhBobOUO3S M1nDO+iODepDTxY/mm4p7lSct3a7oOoGxjj7lFUCG3DLHwqRiLg= =cSRs -----END PGP SIGNATURE----- Merge tag 'linux-kselftest-fixes-6.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull Kselftest fixes from Shuah Khan: "One single fix for sigaltstack test -Wuninitialized warning found when building with clang" * tag 'linux-kselftest-fixes-6.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests: sigaltstack: fix -Wuninitialized	2023-03-28 13:14:47 -07:00
Linus Torvalds	05c24161f4	s390 updates for 6.3-rc5 - Fix an error handling issue with PTRACE_GET_LAST_BREAK request so that an -EFAULT is returned if put_user() fails, instead of ignoring it. - Fix a build race for the modules_prepare target when CONFIG_EXPOLINE_EXTERN is enabled by reintroducing the dependence on scripts. - Fix a memory leak in vfio_ap device driver. - Adds missing earlyclobber annotations to __clear_user() inline assembly to prevent incorrect register allocation. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmQi33kACgkQjYWKoQLX FBhpNgf/adZLOVnz9bFr96O9yB62j7Bh95bjeNSs6qsLDbJNfxkf9XNajDlVaNgT XLrVKh/WDnozkaU850jqUYzIQHSiMJobCs+AVqttCAEBkZ9TiVzbd/lFDVNKTiAl DqQ1SbG4urABmREsb07IgpFMCksbKMqo+xVuXbunHSP3YzNchQirXqslugiSy6zR FD1Ee+yhLTYlphgx7MjyeGnpyikXDR58PuTYuFmvOFElgRybfzOCS5lkPUfoGV9K vXf/dMx366gNyc+8TwS3fbgKMyu1n/8BlqMAcTaaq28oyHMye1QIv2kHDWViFqFU XSVGyxR1FD3Lq04PCkT69mnt1JXLKw== =/4ms -----END PGP SIGNATURE----- Merge tag 's390-6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Vasily Gorbik: - Fix an error handling issue with PTRACE_GET_LAST_BREAK request so that -EFAULT is returned if put_user() fails, instead of ignoring it - Fix a build race for the modules_prepare target when CONFIG_EXPOLINE_EXTERN is enabled by reintroducing the dependence on scripts - Fix a memory leak in vfio_ap device driver - Add missing earlyclobber annotations to __clear_user() inline assembly to prevent incorrect register allocation * tag 's390-6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/ptrace: fix PTRACE_GET_LAST_BREAK error handling s390: reintroduce expoline dependence to scripts s390/vfio-ap: fix memory leak in vfio_ap device driver s390/uaccess: add missing earlyclobber annotations to __clear_user()	2023-03-28 10:43:04 -07:00
Jakob Koschel	e9a1cc2e4c	ice: fix invalid check for empty list in ice_sched_assoc_vsi_to_agg() The code implicitly assumes that the list iterator finds a correct handle. If 'vsi_handle' is not found the 'old_agg_vsi_info' was pointing to an bogus memory location. For safety a separate list iterator variable should be used to make the != NULL check on 'old_agg_vsi_info' correct under any circumstances. Additionally Linus proposed to avoid any use of the list iterator variable after the loop, in the attempt to move the list iterator variable declaration into the macro to avoid any potential misuse after the loop. Using it in a pointer comparison after the loop is undefined behavior and should be omitted if possible [1]. Fixes: 37c592062b16 ("ice: remove the VSI info from previous agg") Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1] Signed-off-by: Jakob Koschel <jkl820.git@gmail.com> Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-28 09:48:49 -07:00
Junfeng Guo	29486b6df3	ice: add profile conflict check for AVF FDIR Add profile conflict check while adding some FDIR rules to avoid unexpected flow behavior, rules may have conflict including: IPv4 <---> {IPv4_UDP, IPv4_TCP, IPv4_SCTP} IPv6 <---> {IPv6_UDP, IPv6_TCP, IPv6_SCTP} For example, when we create an FDIR rule for IPv4, this rule will work on packets including IPv4, IPv4_UDP, IPv4_TCP and IPv4_SCTP. But if we then create an FDIR rule for IPv4_UDP and then destroy it, the first FDIR rule for IPv4 cannot work on pkt IPv4_UDP then. To prevent this unexpected behavior, we add restriction in software when creating FDIR rules by adding necessary profile conflict check. Fixes: 1f7ea1cd6a37 ("ice: Enable FDIR Configure for AVF") Signed-off-by: Junfeng Guo <junfeng.guo@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-28 09:48:41 -07:00
Brett Creeley	d94dbdc4e0	ice: Fix ice_cfg_rdma_fltr() to only update relevant fields The current implementation causes ice_vsi_update() to update all VSI fields based on the cached VSI context. This also assumes that the ICE_AQ_VSI_PROP_Q_OPT_VALID bit is set. This can cause problems if the VSI context is not correctly synced by the driver. Fix this by only updating the fields that correspond to ICE_AQ_VSI_PROP_Q_OPT_VALID. Also, make sure to save the updated result in the cached VSI context on success. Fixes: 348048e724a0 ("ice: Implement iidc operations") Co-developed-by: Robert Malz <robertx.malz@intel.com> Signed-off-by: Robert Malz <robertx.malz@intel.com> Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Piotr Raczynski <piotr.raczynski@intel.com> Tested-by: Jakub Andrysiak <jakub.andrysiak@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-28 09:43:12 -07:00
Jesse Brandeburg	66ceaa4c45	ice: fix W=1 headers mismatch make modules W=1 returns: .../ice/ice_txrx_lib.c:448: warning: Function parameter or member 'first_idx' not described in 'ice_finalize_xdp_rx' .../ice/ice_txrx.c:948: warning: Function parameter or member 'ntc' not described in 'ice_get_rx_buf' .../ice/ice_txrx.c:1038: warning: Excess function parameter 'rx_buf' description in 'ice_construct_skb' Fix these warnings by adding and deleting the deviant arguments. Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side") Fixes: d7956d81f150 ("ice: Pull out next_to_clean bump out of ice_put_rx_buf()") CC: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Piotr Raczynski <piotr.raczynski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2023-03-28 09:42:05 -07:00
Grygorii Strashko	86e2eca4dd	net: ethernet: ti: am65-cpsw: enable p0 host port rx_vlan_remap By default, the tagged ingress packets to the switch from the host port P0 get internal switch priority assigned equal to the DMA CPPI channel number they came from, unless CPSW_P0_CONTROL_REG.RX_REMAP_VLAN is enabled. This causes issues with applying QoS policies and mapping packets on external port fifos, because the default configuration is vlan_aware and DMA CPPI channels are shared between all external ports. Hence enable CPSW_P0_CONTROL_REG.RX_REMAP_VLAN so packet will preserve internal switch priority assigned following the VLAN(priority) tag no matter through which DMA CPPI Channels packets enter the switch. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Link: https://lore.kernel.org/r/20230327092103.3256118-1-s-vadapalli@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 15:29:50 +02:00
Grygorii Strashko	5c8560c4a1	net: ethernet: ti: am65-cpsw: add .ndo to set dma per-queue rate Enable rate limiting TX DMA queues for CPSW interface by configuring the rate in absolute Mb/s units per TX queue. Example: ethtool -L eth0 tx 4 echo 100 > /sys/class/net/eth0/queues/tx-0/tx_maxrate echo 200 > /sys/class/net/eth0/queues/tx-1/tx_maxrate echo 50 > /sys/class/net/eth0/queues/tx-2/tx_maxrate echo 30 > /sys/class/net/eth0/queues/tx-3/tx_maxrate # disable echo 0 > /sys/class/net/eth0/queues/tx-0/tx_maxrate Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Link: https://lore.kernel.org/r/20230327085758.3237155-1-s-vadapalli@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 15:29:22 +02:00
Paolo Abeni	917fd7d6cd	Merge branch 'xen-netback-fix-issue-introduced-recently' Juergen Gross says: ==================== xen/netback: fix issue introduced recently The fix for XSA-423 introduced a bug which resulted in loss of network connection in some configurations. The first patch is fixing the issue, while the second one is removing a test which isn't needed. ==================== Link: https://lore.kernel.org/r/20230327083646.18690-1-jgross@suse.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 14:16:42 +02:00
Juergen Gross	8fb8ebf948	xen/netback: remove not needed test in xenvif_tx_build_gops() The tests for the number of grant mapping or copy operations reaching the array size of the operations buffer at the end of the main loop in xenvif_tx_build_gops() isn't needed. The loop can handle at maximum MAX_PENDING_REQS transfer requests, as XEN_RING_NR_UNCONSUMED_REQUESTS() is taking unsent responses into consideration, too. Remove the tests. Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 14:16:40 +02:00
Juergen Gross	05310f31ca	xen/netback: don't do grant copy across page boundary Fix xenvif_get_requests() not to do grant copy operations across local page boundaries. This requires to double the maximum number of copy operations per queue, as each copy could now be split into 2. Make sure that struct xenvif_tx_cb doesn't grow too large. Cc: stable@vger.kernel.org Fixes: ad7f402ae4f4 ("xen/netback: Ensure protocol headers don't fall in the non-linear area") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 14:16:40 +02:00
Wolfram Sang	f22c993f31	smsc911x: avoid PHY being resumed when interface is not up SMSC911x doesn't need mdiobus suspend/resume, that's why it sets 'mac_managed_pm'. However, setting it needs to be moved from init to probe, so mdiobus PM functions will really never be called (e.g. when the interface is not up yet during suspend/resume). Fixes: 3ce9f2bef755 ("net: smsc911x: Stop and start PHY during suspend and resume") Suggested-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230327083138.6044-1-wsa+renesas@sang-engineering.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 13:39:47 +02:00
Paolo Abeni	d8b0c963e9	Merge branch 'allocate-multiple-skbuffs-on-tx' Arseniy Krasnov says: ==================== allocate multiple skbuffs on tx This adds small optimization for tx path: instead of allocating single skbuff on every call to transport, allocate multiple skbuff's until credit space allows, thus trying to send as much as possible data without return to af_vsock.c. Also this patchset includes second patch which adds check and return from 'virtio_transport_get_credit()' and 'virtio_transport_put_credit()' when these functions are called with 0 argument. This is needed, because zero argument makes both functions to behave as no-effect, but both of them always tries to acquire spinlock. Moreover, first patch always calls function 'virtio_transport_put_credit()' with zero argument in case of successful packet transmission. ==================== Link: https://lore.kernel.org/r/b0d15942-65ba-3a32-ba8d-fed64332d8f6@sberdevices.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 12:03:54 +02:00
Arseniy Krasnov	e3ec366eb0	virtio/vsock: check argument to avoid no effect call Both of these functions have no effect when input argument is 0, so to avoid useless spinlock access, check argument before it. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 12:03:50 +02:00
Arseniy Krasnov	b68ffb1b3b	virtio/vsock: allocate multiple skbuffs on tx This adds small optimization for tx path: instead of allocating single skbuff on every call to transport, allocate multiple skbuff's until credit space allows, thus trying to send as much as possible data without return to af_vsock.c. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 12:03:50 +02:00
Paolo Abeni	b4c66d755e	Merge branch 'net-mvpp2-rss-fixes' Sven Auhagen says: ==================== net: mvpp2: rss fixes This patch series fixes up some rss problems in the mvpp2 driver. The classifier is missing some fragmentation flags, the parser has the QinQ headers switched and the PPPoE Layer 4 detecion is not working correctly. This is leading to no or bad rss for the default settings. ==================== Link: https://lore.kernel.org/r/20230325163903.ofefgus43x66as7i@Svens-MacBookPro.local Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 11:34:12 +02:00
Sven Auhagen	031a416c21	net: mvpp2: parser fix PPPoE In PPPoE add all IPv4 header option length to the parser and adjust the L3 and L4 offset accordingly. Currently the L4 match does not work with PPPoE and all packets are matched as L3 IP4 OPT. Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit") Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 11:34:01 +02:00
Sven Auhagen	a587a84813	net: mvpp2: parser fix QinQ The mvpp2 parser entry for QinQ has the inner and outer VLAN in the wrong order. Fix the problem by swapping them. Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit") Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Reviewed-by: Marcin Wojtas <mw@semihalf.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 11:34:01 +02:00
Sven Auhagen	9a251cae51	net: mvpp2: classifier flow fix fragmentation flags Add missing IP Fragmentation Flag. Fixes: f9358e12a0af ("net: mvpp2: split ingress traffic into multiple flows") Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Reviewed-by: Marcin Wojtas <mw@semihalf.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-03-28 11:34:01 +02:00
Thomas Gleixner	ee1ee6db07	atomics: Provide rcuref - scalable reference counting atomic_t based reference counting, including refcount_t, uses atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is implemented with a atomic_try_cmpxchg() loop. High contention of the reference count leads to retry loops and scales badly. There is nothing to improve on this implementation as the semantics have to be preserved. Provide rcuref as a scalable alternative solution which is suitable for RCU managed objects. Similar to refcount_t it comes with overflow and underflow detection and mitigation. rcuref treats the underlying atomic_t as an unsigned integer and partitions this space into zones: 0x00000000 - 0x7FFFFFFF valid zone (1 .. (INT_MAX + 1) references) 0x80000000 - 0xBFFFFFFF saturation zone 0xC0000000 - 0xFFFFFFFE dead zone 0xFFFFFFFF no reference rcuref_get() unconditionally increments the reference count with atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the reference count with atomic_add_negative_release(). This unconditional increment avoids the inc_not_zero() problem, but requires a more complex implementation on the put() side when the count drops from 0 to -1. When this transition is detected then it is attempted to mark the reference count dead, by setting it to the midpoint of the dead zone with a single atomic_cmpxchg_release() operation. This operation can fail due to a concurrent rcuref_get() elevating the reference count from -1 to 0 again. If the unconditional increment in rcuref_get() hits a reference count which is marked dead (or saturated) it will detect it after the fact and bring back the reference count to the midpoint of the respective zone. The zones provide enough tolerance which makes it practically impossible to escape from a zone. The racy implementation of rcuref_put() requires to protect rcuref_put() against a grace period ending in order to prevent a subtle use after free. As RCU is the only mechanism which allows to protect against that, it is not possible to fully replace the atomic_inc_not_zero() based implementation of refcount_t with this scheme. The final drop is slightly more expensive than the atomic_dec_return() counterpart, but that's not the case which this is optimized for. The optimization is on the high frequeunt get()/put() pairs and their scalability. The performance of an uncontended rcuref_get()/put() pair where the put() is not dropping the last reference is still on par with the plain atomic operations, while at the same time providing overflow and underflow detection and mitigation. The performance of rcuref compared to plain atomic_inc_not_zero() and atomic_dec_return() based reference counting under contention: - Micro benchmark: All CPUs running a increment/decrement loop on an elevated reference count, which means the 0 to -1 transition never happens. The performance gain depends on microarchitecture and the number of CPUs and has been observed in the range of 1.3X to 4.7X - Conversion of dst_entry::__refcnt to rcuref and testing with the localhost memtier/memcached benchmark. That benchmark shows the reference count contention prominently. The performance gain depends on microarchitecture and the number of CPUs and has been observed in the range of 1.1X to 2.6X over the previous fix for the false sharing issue vs. struct dst_entry::__refcnt. When memtier is run over a real 1Gb network connection, there is a small gain on top of the false sharing fix. The two changes combined result in a 2%-5% total gain for that networked test. Reported-by: Wangyang Guo <wangyang.guo@intel.com> Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230323102800.158429195@linutronix.de	2023-03-28 10:39:29 +02:00
Thomas Gleixner	e5ab9eff46	atomics: Provide atomic_add_negative() variants atomic_add_negative() does not provide the relaxed/acquire/release variants. Provide them in preparation for a new scalable reference count algorithm. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230323102800.101763813@linutronix.de	2023-03-28 10:39:29 +02:00
Jakub Kicinski	4cee0fb9cc	linux-can-next-for-6.4-20230327 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEDs2BvajyNKlf9TJQvlAcSiqKBOgFAmQhRSETHG1rbEBwZW5n dXRyb25peC5kZQAKCRC+UBxKKooE6Ap9B/kB0/KPmJuPw8uhCcZV0LqzRHeitJU/ iTGCNX0yxE9k2MuBZJBdk/j1mis9aD0l1K3QAHz5kUNNE1i1LR6ljvbazjJBw/5/ jVPi4zUm1bvveMYLvL7bpin1wc4UsbGHaFAfSLENfPVkB8gHMWOCOMKbdqcJ/utk bpTn6vW3YZRlw6wcKykig73lPKiVrBs5HDVKMFb40vhTtPZs41a9jNksupXOUgzL 4oQq97BeItzgOWrttocMnF68WA+8gYvS8d5Y85qr1cpEwfi4vbLHb063+TItZYfF 5JlueeoavQDwSnwH6bGuEIa/SzoGfOqrXlVg6Slse393RRrK8jTE2AoX =CY8n -----END PGP SIGNATURE----- Merge tag 'linux-can-next-for-6.4-20230327' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2023-03-27 The first 2 patches by Geert Uytterhoeven add transceiver support and improve the error messages in the rcar_canfd driver. Cai Huoqing contributes 3 patches which remove a redundant call to pci_clear_master() in the c_can, ctucanfd and kvaser_pciefd driver. Frank Jungclaus's patch replaces the struct esd_usb_msg with a union in the esd_usb driver to improve readability. Markus Schneider-Pargmann contributes 5 patches to improve the performance in the m_can driver, especially for SPI attached controllers like the tcan4x5x. * tag 'linux-can-next-for-6.4-20230327' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: can: m_can: Keep interrupts enabled during peripheral read can: m_can: Disable unused interrupts can: m_can: Remove double interrupt enable can: m_can: Always acknowledge all interrupts can: m_can: Remove repeated check for is_peripheral can: esd_usb: Improve code readability by means of replacing struct esd_usb_msg with a union can: kvaser_pciefd: Remove redundant pci_clear_master can: ctucanfd: Remove redundant pci_clear_master can: c_can: Remove redundant pci_clear_master can: rcar_canfd: Improve error messages can: rcar_canfd: Add transceiver support ==================== Link: https://lore.kernel.org/r/20230327073354.1003134-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-27 19:55:10 -07:00
Jakub Kicinski	da954ae18c	Merge branch 'add-tx-push-buf-len-param-to-ethtool' Shay Agroskin says: ==================== Add tx push buf len param to ethtool This patchset adds a new sub-configuration to ethtool get/set queue params (ethtool -g) called 'tx-push-buf-len'. This configuration specifies the maximum number of bytes of a transmitted packet a driver can push directly to the underlying device ('push' mode). The motivation for pushing some of the bytes to the device has the advantages of - Allowing a smart device to take fast actions based on the packet's header - Reducing latency for small packets that can be copied completely into the device This new param is practically similar to tx-copybreak value that can be set using ethtool's tunable but conceptually serves a different purpose. While tx-copybreak is used to reduce the overhead of DMA mapping and makes no sense to use if less than the whole segment gets copied, tx-push-buf-len allows to improve performance by analyzing the packet's data (usually headers) before performing the DMA operation. The configuration can be queried and set using the commands: $ ethtool -g [interface] # ethtool -G [interface] tx-push-buf-len [number of bytes] This patchset also adds support for the new configuration in ENA driver for which this parameter ensures efficient resources management on the device side. ==================== Link: https://lore.kernel.org/r/20230323163610.1281468-1-shayagr@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-27 19:50:00 -07:00
Shay Agroskin	060cdac218	net: ena: Advertise TX push support LLQ is auto enabled by the device and disabling it isn't supported on new ENA generations while on old ones causes sub-optimal performance. This patch adds advertisement of push-mode when LLQ mode is used, but rejects an attempt to modify it. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-27 19:49:59 -07:00
Shay Agroskin	b0c59e5396	net: ena: Add support to changing tx_push_buf_len The ENA driver allows for two distinct values for the number of bytes of the packet's payload that can be written directly to the device. For a value of 224 the driver turns on Large LLQ Header mode in which the first 224 of the packet's payload are written to the LLQ. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-27 19:49:59 -07:00

... 3 4 5 6 7 ...

1171485 Commits