linux

iv/linux

Author	SHA1	Message	Date
Srinivas Pandruvada	03c83982a0	cpufreq: intel_pstate: ITMT support for overclocked system On systems with overclocking enabled, CPPC Highest Performance can be hard coded to 0xff. In this case even if we have cores with different highest performance, ITMT can't be enabled as the current implementation depends on CPPC Highest Performance. On such systems we can use MSR_HWP_CAPABILITIES maximum performance field when CPPC.Highest Performance is 0xff. Due to legacy reasons, we can't solely depend on MSR_HWP_CAPABILITIES as in some older systems CPPC Highest Performance is the only way to identify different performing cores. Reported-by: Michael Larabel <Michael@MichaelLarabel.com> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Tested-by: Michael Larabel <Michael@MichaelLarabel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2021-11-23 14:11:18 +01:00
Hector Martin	113972d2e1	usb: typec: tipd: Fix initialization sequence for cd321x The power state switch needs to happen first, as that kickstarts the firmware into normal mode. Fixes: `c9c14be664` ("usb: typec: tipd: Switch CD321X power state to S0") Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Sven Peter <sven@svenpeter.dev> Signed-off-by: Hector Martin <marcan@marcan.st> Link: https://lore.kernel.org/r/20211120030717.84287-3-marcan@marcan.st Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-11-23 14:08:08 +01:00
Hector Martin	7b9c90e3e6	usb: typec: tipd: Fix typo in cd321x_switch_power_state SPSS should've been SSPS. Fixes: `c9c14be664` ("usb: typec: tipd: Switch CD321X power state to S0") Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Sven Peter <sven@svenpeter.dev> Signed-off-by: Hector Martin <marcan@marcan.st> Link: https://lore.kernel.org/r/20211120030717.84287-2-marcan@marcan.st Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-11-23 14:08:08 +01:00
Mathias Nyman	6cca13de26	usb: hub: Fix locking issues with address0_mutex Fix the circular lock dependency and unbalanced unlock of addess0_mutex introduced when fixing an address0_mutex enumeration retry race in commit ae6dc22d2d1 ("usb: hub: Fix usb enumeration issue due to address0 race") Make sure locking order between port_dev->status_lock and address0_mutex is correct, and that address0_mutex is not unlocked in hub_port_connect "done:" codepath which may be reached without locking address0_mutex Fixes: `6ae6dc22d2` ("usb: hub: Fix usb enumeration issue due to address0 race") Cc: <stable@vger.kernel.org> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Hans de Goede <hdegoede@redhat.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20211123101656.1113518-1-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-11-23 14:05:28 +01:00
Rafael J. Wysocki	ed38eb49d1	cpufreq: intel_pstate: Fix active mode offline/online EPP handling After commit `4adcf2e582` ("cpufreq: intel_pstate: Add ->offline and ->online callbacks") the EPP value set by the "performance" scaling algorithm in the active mode is not restored after an offline/online cycle which replaces it with the saved EPP value coming from user space. Address this issue by forcing intel_pstate_hwp_set() to set a new EPP value when it runs first time after online. Fixes: `4adcf2e582` ("cpufreq: intel_pstate: Add ->offline and ->online callbacks") Link: https://lore.kernel.org/linux-pm/adc7132c8655bd4d1c8b6129578e931a14fe1db2.camel@linux.intel.com/ Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: 5.9+ <stable@vger.kernel.org> # 5.9+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2021-11-23 14:02:11 +01:00
Adamos Ttofari	cd23f02f16	cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs Commit `fbdc21e9b0` ("cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode") enabled the use of Intel P-State driver for Ice Lake servers. But it doesn't cover the case when OS can't control P-States. Therefore, for Ice Lake server, if MSR_MISC_PWR_MGMT bits 8 or 18 are enabled, then the Intel P-State driver should exit as OS can't control P-States. Fixes: `fbdc21e9b0` ("cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode") Signed-off-by: Adamos Ttofari <attofari@amazon.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2021-11-23 14:02:10 +01:00
Marek Behún	7b1b62bc1e	net: marvell: mvpp2: increase MTU limit when XDP enabled Currently mvpp2_xdp_setup won't allow attaching XDP program if mtu > ETH_DATA_LEN (1500). The mvpp2_change_mtu on the other hand checks whether MVPP2_RX_PKT_SIZE(mtu) > MVPP2_BM_LONG_PKT_SIZE. These two checks are semantically different. Moreover this limit can be increased to MVPP2_MAX_RX_BUF_SIZE, since in mvpp2_rx we have xdp.data = data + MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM; xdp.frame_sz = PAGE_SIZE; Change the checks to check whether mtu > MVPP2_MAX_RX_BUF_SIZE Fixes: `07dd0a7aae` ("mvpp2: add basic XDP support") Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 12:37:09 +00:00
Alex Elder	e4e9bfb7c9	net: ipa: kill ipa_cmd_pipeline_clear() Calling ipa_cmd_pipeline_clear() after stopping the channel underlying the AP<-modem RX endpoint can lead to a deadlock. This occurs in the ->runtime_suspend device power operation for the IPA driver. While this callback is in progress, any other requests for power will block until the callback returns. Stopping the AP<-modem RX channel does not prevent the modem from sending another packet to this endpoint. If a packet arrives for an RX channel when the channel is stopped, an SUSPEND IPA interrupt condition will be pending. Handling an IPA interrupt requires power, so ipa_isr_thread() calls pm_runtime_get_sync() first thing. The problem occurs because a "pipeline clear" command will not complete while such a SUSPEND interrupt condition exists. So the SUSPEND IPA interrupt handler won't proceed until it gets power; that won't happen until the ->runtime_suspend callback (and its "pipeline clear" command) completes; and that can't happen while the SUSPEND interrupt condition exists. It turns out that in this case there is no need to use the "pipeline clear" command. There are scenarios in which clearing the pipeline is required while suspending, but those are not (yet) supported upstream. So a simple fix, avoiding the potential deadlock, is to stop calling ipa_cmd_pipeline_clear() in ipa_endpoint_suspend(). This removes the only user of ipa_cmd_pipeline_clear(), so get rid of that function. It can be restored again whenever it's needed. This is basically a manual revert along with an explanation for commit `6cb63ea6a3` ("net: ipa: introduce ipa_cmd_tag_process()"). Fixes: `6cb63ea6a3` ("net: ipa: introduce ipa_cmd_tag_process()") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 12:26:40 +00:00
Martyn Welch	a049a30fc2	net: usb: Correct PHY handling of smsc95xx The smsc95xx driver is dropping phy speed settings and causing a stack trace at device unbind: [ 536.379147] smsc95xx 2-1:1.0 eth1: unregister 'smsc95xx' usb-ci_hdrc.2-1, smsc95xx USB 2.0 Ethernet [ 536.425029] ------------[ cut here ]------------ [ 536.429650] WARNING: CPU: 0 PID: 439 at fs/kernfs/dir.c:1535 kernfs_remove_by_name_ns+0xb8/0xc0 [ 536.438416] kernfs: can not remove 'attached_dev', no directory [ 536.444363] Modules linked in: xts dm_crypt dm_mod atmel_mxt_ts smsc95xx usbnet [ 536.451748] CPU: 0 PID: 439 Comm: sh Tainted: G W 5.15.0 #1 [ 536.458636] Hardware name: Freescale i.MX53 (Device Tree Support) [ 536.464735] Backtrace: [ 536.467190] [<80b1c904>] (dump_backtrace) from [<80b1cb48>] (show_stack+0x20/0x24) [ 536.474787] r7:000005ff r6:8035b294 r5:600f0013 r4:80d8af78 [ 536.480449] [<80b1cb28>] (show_stack) from [<80b1f764>] (dump_stack_lvl+0x48/0x54) [ 536.488035] [<80b1f71c>] (dump_stack_lvl) from [<80b1f788>] (dump_stack+0x18/0x1c) [ 536.495620] r5:00000009 r4:80d9b820 [ 536.499198] [<80b1f770>] (dump_stack) from [<80124fac>] (__warn+0xfc/0x114) [ 536.506187] [<80124eb0>] (__warn) from [<80b1d21c>] (warn_slowpath_fmt+0xa8/0xdc) [ 536.513688] r7:000005ff r6:80d9b820 r5:80d9b8e0 r4:83744000 [ 536.519349] [<80b1d178>] (warn_slowpath_fmt) from [<8035b294>] (kernfs_remove_by_name_ns+0xb8/0xc0) [ 536.528416] r9:00000001 r8:00000000 r7:824926dc r6:00000000 r5:80df6c2c r4:00000000 [ 536.536162] [<8035b1dc>] (kernfs_remove_by_name_ns) from [<80b1f56c>] (sysfs_remove_link+0x4c/0x50) [ 536.545225] r6:7f00f02c r5:80df6c2c r4:83306400 [ 536.549845] [<80b1f520>] (sysfs_remove_link) from [<806f9c8c>] (phy_detach+0xfc/0x11c) [ 536.557780] r5:82492000 r4:83306400 [ 536.561359] [<806f9b90>] (phy_detach) from [<806f9cf8>] (phy_disconnect+0x4c/0x58) [ 536.568943] r7:824926dc r6:7f00f02c r5:82492580 r4:83306400 [ 536.574604] [<806f9cac>] (phy_disconnect) from [<7f00a310>] (smsc95xx_disconnect_phy+0x30/0x38 [smsc95xx]) [ 536.584290] r5:82492580 r4:82492580 [ 536.587868] [<7f00a2e0>] (smsc95xx_disconnect_phy [smsc95xx]) from [<7f001570>] (usbnet_stop+0x70/0x1a0 [usbnet]) [ 536.598161] r5:82492580 r4:82492000 [ 536.601740] [<7f001500>] (usbnet_stop [usbnet]) from [<808baa70>] (__dev_close_many+0xb4/0x12c) [ 536.610466] r8:83744000 r7:00000000 r6:83744000 r5:83745b74 r4:82492000 [ 536.617170] [<808ba9bc>] (__dev_close_many) from [<808bab78>] (dev_close_many+0x90/0x120) [ 536.625365] r7:00000001 r6:83745b74 r5:83745b8c r4:82492000 [ 536.631026] [<808baae8>] (dev_close_many) from [<808bf408>] (unregister_netdevice_many+0x15c/0x704) [ 536.640094] r9:00000001 r8:81130b98 r7:83745b74 r6:83745bc4 r5:83745b8c r4:82492000 [ 536.647840] [<808bf2ac>] (unregister_netdevice_many) from [<808bfa50>] (unregister_netdevice_queue+0xa0/0xe8) [ 536.657775] r10:8112bcc0 r9:83306c00 r8:83306c80 r7:8291e420 r6:83744000 r5:00000000 [ 536.665608] r4:82492000 [ 536.668143] [<808bf9b0>] (unregister_netdevice_queue) from [<808bfac0>] (unregister_netdev+0x28/0x30) [ 536.677381] r6:7f01003c r5:82492000 r4:82492000 [ 536.682000] [<808bfa98>] (unregister_netdev) from [<7f000b40>] (usbnet_disconnect+0x64/0xdc [usbnet]) [ 536.691241] r5:82492000 r4:82492580 [ 536.694819] [<7f000adc>] (usbnet_disconnect [usbnet]) from [<8076b958>] (usb_unbind_interface+0x80/0x248) [ 536.704406] r5:7f01003c r4:83306c80 [ 536.707984] [<8076b8d8>] (usb_unbind_interface) from [<8061765c>] (device_release_driver_internal+0x1c4/0x1cc) [ 536.718005] r10:8112bcc0 r9:80dff1dc r8:83306c80 r7:83744000 r6:7f01003c r5:00000000 [ 536.725838] r4:8291e420 [ 536.728373] [<80617498>] (device_release_driver_internal) from [<80617684>] (device_release_driver+0x20/0x24) [ 536.738302] r7:83744000 r6:810d4f4c r5:8291e420 r4:8176ae30 [ 536.743963] [<80617664>] (device_release_driver) from [<806156cc>] (bus_remove_device+0xf0/0x148) [ 536.752858] [<806155dc>] (bus_remove_device) from [<80610018>] (device_del+0x198/0x41c) [ 536.760880] r7:83744000 r6:8116e2e4 r5:8291e464 r4:8291e420 [ 536.766542] [<8060fe80>] (device_del) from [<80768fe8>] (usb_disable_device+0xcc/0x1e0) [ 536.774576] r10:8112bcc0 r9:80dff1dc r8:00000001 r7:8112bc48 r6:8291e400 r5:00000001 [ 536.782410] r4:83306c00 [ 536.784945] [<80768f1c>] (usb_disable_device) from [<80769c30>] (usb_set_configuration+0x514/0x8dc) [ 536.794011] r10:00000000 r9:00000000 r8:832c3600 r7:00000004 r6:810d5688 r5:00000000 [ 536.801844] r4:83306c00 [ 536.804379] [<8076971c>] (usb_set_configuration) from [<80775fac>] (usb_generic_driver_disconnect+0x34/0x38) [ 536.814236] r10:832c3610 r9:83745ef8 r8:832c3600 r7:00000004 r6:810d5688 r5:83306c00 [ 536.822069] r4:83306c00 [ 536.824605] [<80775f78>] (usb_generic_driver_disconnect) from [<8076b850>] (usb_unbind_device+0x30/0x70) [ 536.834100] r5:83306c00 r4:810d5688 [ 536.837678] [<8076b820>] (usb_unbind_device) from [<8061765c>] (device_release_driver_internal+0x1c4/0x1cc) [ 536.847432] r5:822fb480 r4:83306c80 [ 536.851009] [<80617498>] (device_release_driver_internal) from [<806176a8>] (device_driver_detach+0x20/0x24) [ 536.860853] r7:00000004 r6:810d4f4c r5:810d5688 r4:83306c80 [ 536.866515] [<80617688>] (device_driver_detach) from [<80614d98>] (unbind_store+0x70/0xe4) [ 536.874793] [<80614d28>] (unbind_store) from [<80614118>] (drv_attr_store+0x30/0x3c) [ 536.882554] r7:00000000 r6:00000000 r5:83739200 r4:80614d28 [ 536.888217] [<806140e8>] (drv_attr_store) from [<8035cb68>] (sysfs_kf_write+0x48/0x54) [ 536.896154] r5:83739200 r4:806140e8 [ 536.899732] [<8035cb20>] (sysfs_kf_write) from [<8035be84>] (kernfs_fop_write_iter+0x11c/0x1d4) [ 536.908446] r5:83739200 r4:00000004 [ 536.912024] [<8035bd68>] (kernfs_fop_write_iter) from [<802b87fc>] (vfs_write+0x258/0x3e4) [ 536.920317] r10:00000000 r9:83745f58 r8:83744000 r7:00000000 r6:00000004 r5:00000000 [ 536.928151] r4:82adacc0 [ 536.930687] [<802b85a4>] (vfs_write) from [<802b8b0c>] (ksys_write+0x74/0xf4) [ 536.937842] r10:00000004 r9:007767a0 r8:83744000 r7:00000000 r6:00000000 r5:82adacc0 [ 536.945676] r4:82adacc0 [ 536.948213] [<802b8a98>] (ksys_write) from [<802b8ba4>] (sys_write+0x18/0x1c) [ 536.955367] r10:00000004 r9:83744000 r8:80100244 r7:00000004 r6:76f47b58 r5:76fc0350 [ 536.963200] r4:00000004 [ 536.965735] [<802b8b8c>] (sys_write) from [<80100060>] (ret_fast_syscall+0x0/0x48) [ 536.973320] Exception stack(0x83745fa8 to 0x83745ff0) [ 536.978383] 5fa0: 00000004 76fc0350 00000001 007767a0 00000004 00000000 [ 536.986569] 5fc0: 00000004 76fc0350 76f47b58 00000004 76f47c7c 76f48114 00000000 7e87991c [ 536.994753] 5fe0: 00000498 7e879908 76e6dce8 76eca2e8 [ 536.999922] ---[ end trace 9b835d809816b435 ]--- The driver should not be connecting and disconnecting the PHY when the device is opened and closed, it should be stopping and starting the PHY. The phy should be connected as part of binding and disconnected during unbinding. As this results in the PHY not being reset during open, link speed, etc. settings set prior to the link coming up are now not being lost. It is necessary for phy_stop() to only be called when the phydev still exists (resolving the above stack trace). When unbinding, ".unbind" will be called prior to ".stop", with phy_disconnect() already having called phy_stop() before the phydev becomes inaccessible. Signed-off-by: Martyn Welch <martyn.welch@collabora.com> Cc: Steve Glendinning <steve.glendinning@shawell.net> Cc: UNGLinuxDriver@microchip.com Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: stable@kernel.org # v5.15 Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 12:24:41 +00:00
Zheyu Ma	b82d71c0f8	net: chelsio: cxgb4vf: Fix an error code in cxgb4vf_pci_probe() During the process of driver probing, probe function should return < 0 for failure, otherwise kernel will treat value == 0 as success. Therefore, we should set err to -EINVAL when adapter->registered_device_map is NULL. Otherwise kernel will assume that driver has been successfully probed and will cause unexpected errors. Signed-off-by: Zheyu Ma <zheyuma97@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 12:15:53 +00:00
Heiner Kallweit	c75a9ad436	r8169: fix incorrect mac address assignment The original changes brakes MAC address assignment on older chip versions (see bug report [0]), and it brakes random MAC assignment. is_valid_ether_addr() requires that its argument is word-aligned. Add the missing alignment to array mac_addr. [0] https://bugzilla.kernel.org/show_bug.cgi?id=215087 Fixes: `1c5d09d587` ("ethernet: r8169: use eth_hw_addr_set()") Reported-by: Richard Herbert <rherbert@sympatico.ca> Tested-by: Richard Herbert <rherbert@sympatico.ca> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 12:12:37 +00:00
David S. Miller	52911bb62e	Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2021-11-22 Maciej Fijalkowski says: Here are the two fixes for issues around ethtool's set_channels() callback for ice driver. Both are related to XDP resources. First one corrects the size of vsi->txq_map that is used to track the usage of Tx resources and the second one prevents the wrong refcounting of bpf_prog. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 12:10:07 +00:00
David S. Miller	60ebd6737c	Merge branch 'ipa-fixes' Alex Elder says: ==================== net: ipa: prevent shutdown during setup The setup phase of the IPA driver occurs in one of two ways. Normally, it is done directly by the main driver probe function. But some systems (those having a "modem-init" DTS property) don't start setup until an SMP2P interrupt (sent by the modem) arrives. Because it isn't performed by the probe function, setup on "modem-init" systems could be underway at the time a driver remove (or shutdown) request arrives (or vice-versa). This situation can lead to hardware state not being cleaned up properly. This series addresses this problem by having the driver remove function disable the setup interrupt. A consequence of this is that setup will complete if it is underway when the remove function is called. So now, when removing the driver, setup: - will have already completed; - is underway, and will complete before proceeding; or - will not have begun (and will not occur). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 12:06:40 +00:00
Alex Elder	8afc7e471a	net: ipa: separate disabling setup from modem stop The IPA setup_complete flag is set at the end of ipa_setup(), when the setup phase of initialization has completed successfully. This occurs as part of driver probe processing, or (if "modem-init" is specified in the DTS file) it is triggered by the "ipa-setup-ready" SMP2P interrupt generated by the modem. In the latter case, it's possible for driver shutdown (or remove) to begin while setup processing is underway, and this can't be allowed. The problem is that the setup_complete flag is not adequate to signal that setup is underway. If setup_complete is set, it will never be un-set, so that case is not a problem. But if setup_complete is false, there's a chance setup is underway. Because setup is triggered by an interrupt on a "modem-init" system, there is a simple way to ensure the value of setup_complete is safe to read. The threaded handler--if it is executing--will complete as part of a request to disable the "ipa-modem-ready" interrupt. This means that ipa_setup() (which is called from the handler) will run to completion if it was underway, or will never be called otherwise. The request to disable the "ipa-setup-ready" interrupt is currently made within ipa_modem_stop(). Instead, disable the interrupt outside that function in the two places it's called. In the case of ipa_remove(), this ensures the setup_complete flag is safe to read before we read it. Rename ipa_smp2p_disable() to be ipa_smp2p_irq_disable_setup(), to be more specific about its effect. Fixes: `530f9216a9` ("soc: qcom: ipa: AP/modem communications") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 12:06:40 +00:00
Alex Elder	33a153100b	net: ipa: directly disable ipa-setup-ready interrupt We currently maintain a "disabled" Boolean flag to determine whether the "ipa-setup-ready" SMP2P IRQ handler does anything. That flag must be accessed under protection of a mutex. Instead, disable the SMP2P interrupt when requested, which prevents the interrupt handler from ever being called. More importantly, it synchronizes a thread disabling the interrupt with the completion of the interrupt handler in case they run concurrently. Use the IPA setup_complete flag rather than the disabled flag in the handler to determine whether to ignore any interrupts arriving after the first. Rename the "disabled" flag to be "setup_disabled", to be specific about its purpose. Fixes: `530f9216a9` ("soc: qcom: ipa: AP/modem communications") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 12:06:40 +00:00
Miquel Raynal	bed68f4f4d	docs: i2c: smbus-protocol: mention the repeated start condition Sr is a repeated start, it is used in both I2C and SMBus protocols. Provide its description and replace start ("S") conditions with repeated start ("Sr") conditions when relevant. This allows the documentation to match the SMBus specification available at [1]. [1] http://www.smbus.org/specs/SMBus_3_1_20180319.pdf Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Signed-off-by: Wolfram Sang <wsa@kernel.org>	2021-11-23 12:59:41 +01:00
David S. Miller	bd08ee2315	Merge branch 'mlxsw-fixes' Ido Schimmel says: ==================== mlxsw: Two small fixes Patch #1 fixes a recent regression that prevents the driver from loading with old firmware versions. Patch #2 protects the driver from a NULL pointer dereference when working on top of a buggy firmware. This was never observed in an actual system, only on top of an emulator during development. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:46:18 +00:00
Amit Cohen	63b08b1f68	mlxsw: spectrum: Protect driver from buggy firmware When processing port up/down events generated by the device's firmware, the driver protects itself from events reported for non-existent local ports, but not the CPU port (local port 0), which exists, but lacks a netdev. This can result in a NULL pointer dereference when calling netif_carrier_{on,off}(). Fix this by bailing early when processing an event reported for the CPU port. Problem was only observed when running on top of a buggy emulator. Fixes: `28b1987ef5` ("mlxsw: spectrum: Register CPU port with devlink") Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:46:18 +00:00
Danielle Ratson	ce4995bc6c	mlxsw: spectrum: Allow driver to load with old firmware versions The driver fails to load with old firmware versions that cannot report the maximum number of RIF MAC profiles [1]. Fix this by defaulting to a maximum of a single profile in such situations, as multiple profiles are not supported by old firmware versions. [1] mlxsw_spectrum 0000:03:00.0: cannot register bus device mlxsw_spectrum: probe of 0000:03:00.0 failed with error -5 Fixes: `1c375ffb2e` ("mlxsw: spectrum_router: Expose RIF MAC profiles to devlink resource") Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reported-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:46:18 +00:00
David S. Miller	5789d04b77	Merge branch 'smc-fixes' Tony Lu says: ==================== smc: Fixes for closing process and minor cleanup Patch 1 is a minor cleanup for local struct sock variables. Patch 2 ensures the active closing side enters TIME_WAIT. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:42:24 +00:00
Tony Lu	606a63c978	net/smc: Ensure the active closing peer first closes clcsock The side that actively closed socket, it's clcsock doesn't enter TIME_WAIT state, but the passive side does it. It should show the same behavior as TCP sockets. Consider this, when client actively closes the socket, the clcsock in server enters TIME_WAIT state, which means the address is occupied and won't be reused before TIME_WAIT dismissing. If we restarted server, the service would be unavailable for a long time. To solve this issue, shutdown the clcsock in [A], perform the TCP active close progress first, before the passive closed side closing it. So that the actively closed side enters TIME_WAIT, not the passive one. Client \| Server close() // client actively close \| smc_release() \| smc_close_active() // PEERCLOSEWAIT1 \| smc_close_final() // abort or closed = 1\| smc_cdc_get_slot_and_msg_send() \| [A] \| \|smc_cdc_msg_recv_action() // ACTIVE \| queue_work(smc_close_wq, &conn->close_work) \| smc_close_passive_work() // PROCESSABORT or APPCLOSEWAIT1 \| smc_close_passive_abort_received() // only in abort \| \|close() // server recv zero, close \| smc_release() // PROCESSABORT or APPCLOSEWAIT1 \| smc_close_active() \| smc_close_abort() or smc_close_final() // CLOSED \| smc_cdc_get_slot_and_msg_send() // abort or closed = 1 smc_cdc_msg_recv_action() \| smc_clcsock_release() queue_work(smc_close_wq, &conn->close_work) \| sock_release(tcp) // actively close clc, enter TIME_WAIT smc_close_passive_work() // PEERCLOSEWAIT1 \| smc_conn_free() smc_close_passive_abort_received() // CLOSED\| smc_conn_free() \| smc_clcsock_release() \| sock_release(tcp) // passive close clc \| Link: https://www.spinics.net/lists/netdev/msg780407.html Fixes: `b38d732477` ("smc: socket closing and linkgroup cleanup") Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:42:24 +00:00
Tony Lu	45c3ff7a9a	net/smc: Clean up local struct sock variables There remains some variables to replace with local struct sock. So clean them up all. Fixes: `3163c5071f` ("net/smc: use local struct sock variables consistently") Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:42:24 +00:00
Nikolay Aleksandrov	1c743127cc	net: nexthop: fix null pointer dereference when IPv6 is not enabled When we try to add an IPv6 nexthop and IPv6 is not enabled (!CONFIG_IPV6) we'll hit a NULL pointer dereference[1] in the error path of nh_create_ipv6() due to calling ipv6_stub->fib6_nh_release. The bug has been present since the beginning of IPv6 nexthop gateway support. Commit `1aefd3de7b` ("ipv6: Add fib6_nh_init and release to stubs") tells us that only fib6_nh_init has a dummy stub because fib6_nh_release should not be called if fib6_nh_init returns an error, but the commit below added a call to ipv6_stub->fib6_nh_release in its error path. To fix it return the dummy stub's -EAFNOSUPPORT error directly without calling ipv6_stub->fib6_nh_release in nh_create_ipv6()'s error path. [1] Output is a bit truncated, but it clearly shows the error. BUG: kernel NULL pointer dereference, address: 000000000000000000 #PF: supervisor instruction fetch in kernel modede #PF: error_code(0x0010) - not-present pagege PGD 0 P4D 0 Oops: 0010 [#1] PREEMPT SMP NOPTI CPU: 4 PID: 638 Comm: ip Kdump: loaded Not tainted 5.16.0-rc1+ #446 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014 RIP: 0010:0x0 Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. RSP: 0018:ffff888109f5b8f0 EFLAGS: 00010286^Ac RAX: 0000000000000000 RBX: ffff888109f5ba28 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8881008a2860 RBP: ffff888109f5b9d8 R08: 0000000000000000 R09: 0000000000000000 R10: ffff888109f5b978 R11: ffff888109f5b948 R12: 00000000ffffff9f R13: ffff8881008a2a80 R14: ffff8881008a2860 R15: ffff8881008a2840 FS: 00007f98de70f100(0000) GS:ffff88822bf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 0000000100efc000 CR4: 00000000000006e0 Call Trace: <TASK> nh_create_ipv6+0xed/0x10c rtm_new_nexthop+0x6d7/0x13f3 ? check_preemption_disabled+0x3d/0xf2 ? lock_is_held_type+0xbe/0xfd rtnetlink_rcv_msg+0x23f/0x26a ? check_preemption_disabled+0x3d/0xf2 ? rtnl_calcit.isra.0+0x147/0x147 netlink_rcv_skb+0x61/0xb2 netlink_unicast+0x100/0x187 netlink_sendmsg+0x37f/0x3a0 ? netlink_unicast+0x187/0x187 sock_sendmsg_nosec+0x67/0x9b ____sys_sendmsg+0x19d/0x1f9 ? copy_msghdr_from_user+0x4c/0x5e ? rcu_read_lock_any_held+0x2a/0x78 ___sys_sendmsg+0x6c/0x8c ? asm_sysvec_apic_timer_interrupt+0x12/0x20 ? lockdep_hardirqs_on+0xd9/0x102 ? sockfd_lookup_light+0x69/0x99 __sys_sendmsg+0x50/0x6e do_syscall_64+0xcb/0xf2 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f98dea28914 Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 e9 5d 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53 RSP: 002b:00007fff859f5e68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e2e RAX: ffffffffffffffda RBX: 00000000619cb810 RCX: 00007f98dea28914 RDX: 0000000000000000 RSI: 00007fff859f5ed0 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000008 R10: fffffffffffffce6 R11: 0000000000000246 R12: 0000000000000001 R13: 000055c0097ae520 R14: 000055c0097957fd R15: 00007fff859f63a0 </TASK> Modules linked in: bridge stp llc bonding virtio_net Cc: stable@vger.kernel.org Fixes: `53010f991a` ("nexthop: Add support for IPv6 gateways") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:39:53 +00:00
Huang Pei	e5b40668e9	slip: fix macro redefine warning MIPS/IA64 define END as assembly function ending, which conflict with END definition in slip.h, just undef it at first Reported-by: lkp@intel.com Signed-off-by: Huang Pei <huangpei@loongson.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:38:13 +00:00
Huang Pei	16517829f2	hamradio: fix macro redefine warning MIPS/IA64 define END as assembly function ending, which conflict with END definition in mkiss.c, just undef it at first Reported-by: lkp@intel.com Signed-off-by: Huang Pei <huangpei@loongson.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-23 11:38:13 +00:00
Jon Hunter	5f719948b5	mmc: spi: Add device-tree SPI IDs Commit `5fa6863ba6` ("spi: Check we have a spi_device_id for each DT compatible") added a test to check that every SPI driver has a spi_device_id for each DT compatiable string defined by the driver and warns if the spi_device_id is missing. The spi_device_id is missing for the MMC SPI driver and the following warning is now seen. WARNING KERN SPI driver mmc_spi has no spi_device_id for mmc-spi-slot Fix this by adding the necessary spi_device_id. Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Link: https://lore.kernel.org/r/20211115113813.238044-1-jonathanh@nvidia.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>	2021-11-23 12:32:28 +01:00
Vincent Whitchurch	84e1d0bf1d	i2c: virtio: disable timeout handling If a timeout is hit, it can result is incorrect data on the I2C bus and/or memory corruptions in the guest since the device can still be operating on the buffers it was given while the guest has freed them. Here is, for example, the start of a slub_debug splat which was triggered on the next transfer after one transfer was forced to timeout by setting a breakpoint in the backend (rust-vmm/vhost-device): BUG kmalloc-1k (Not tainted): Poison overwritten First byte 0x1 instead of 0x6b Allocated in virtio_i2c_xfer+0x65/0x35c age=350 cpu=0 pid=29 __kmalloc+0xc2/0x1c9 virtio_i2c_xfer+0x65/0x35c __i2c_transfer+0x429/0x57d i2c_transfer+0x115/0x134 i2cdev_ioctl_rdwr+0x16a/0x1de i2cdev_ioctl+0x247/0x2ed vfs_ioctl+0x21/0x30 sys_ioctl+0xb18/0xb41 Freed in virtio_i2c_xfer+0x32e/0x35c age=244 cpu=0 pid=29 kfree+0x1bd/0x1cc virtio_i2c_xfer+0x32e/0x35c __i2c_transfer+0x429/0x57d i2c_transfer+0x115/0x134 i2cdev_ioctl_rdwr+0x16a/0x1de i2cdev_ioctl+0x247/0x2ed vfs_ioctl+0x21/0x30 sys_ioctl+0xb18/0xb41 There is no simple fix for this (the driver would have to always create bounce buffers and hold on to them until the device eventually returns the buffers), so just disable the timeout support for now. Fixes: `3cfc883804` ("i2c: virtio: add a virtio i2c frontend driver") Acked-by: Jie Deng <jie.deng@intel.com> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Wolfram Sang <wsa@kernel.org>	2021-11-23 10:55:48 +01:00
Johan Hovold	aa5721a9e0	USB: serial: pl2303: fix GC type detection At least some PL2303GC have a bcdDevice of 0x105 instead of 0x100 as the datasheet claims. Add it to the list of known release numbers for the HXN (G) type. Note the chip type could only be determined indirectly based on its package being of QFP type, which appears to only be available for PL2303GC. Fixes: `894758d057` ("USB: serial: pl2303: tighten type HXN (G) detection") Cc: stable@vger.kernel.org # 5.13 Reported-by: Anton Lundin <glance@acc.umu.se> Link: https://lore.kernel.org/r/20211123071613.GZ108031@montezuma.acc.umu.se Link: https://lore.kernel.org/r/20211123091017.30708-1-johan@kernel.org Signed-off-by: Johan Hovold <johan@kernel.org>	2021-11-23 10:49:18 +01:00
Jarkko Nikula	03a976c9af	i2c: i801: Fix interrupt storm from SMB_ALERT signal Currently interrupt storm will occur from i2c-i801 after first transaction if SMB_ALERT signal is enabled and ever asserted. It is enough if the signal is asserted once even before the driver is loaded and does not recover because that interrupt is not acknowledged. This fix aims to fix it by two ways: - Add acknowledging for the SMB_ALERT interrupt status - Disable the SMB_ALERT interrupt on platforms where possible since the driver currently does not make use for it Acknowledging resets the SMB_ALERT interrupt status on all platforms and also should help to avoid interrupt storm on older platforms where the SMB_ALERT interrupt disabling is not available. For simplicity this fix reuses the host notify feature for disabling and restoring original register value. Link: https://bugzilla.kernel.org/show_bug.cgi?id=177311 Reported-by: ck+kernelbugzilla@bl4ckb0x.de Reported-by: stephane.poignant@protonmail.com Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Reviewed-by: Jean Delvare <jdelvare@suse.de> Tested-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Wolfram Sang <wsa@kernel.org>	2021-11-23 10:43:50 +01:00
Jean Delvare	9b5bf58781	i2c: i801: Restore INTREN on unload If driver interrupts are enabled, SMBHSTCNT_INTREN will be 1 after the first transaction, and will stay to that value forever. This means that interrupts will be generated for both host-initiated transactions and also SMBus Alert events even after the driver is unloaded. To be on the safe side, we should restore the initial state of this bit at suspend and reboot time, as we do for several other configuration bits already and for the same reason: the BIOS should be handed the device in the same configuration state in which we received it. Otherwise interrupts may be generated which nobody will process. Signed-off-by: Jean Delvare <jdelvare@suse.de> Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Signed-off-by: Wolfram Sang <wsa@kernel.org>	2021-11-23 10:43:50 +01:00
Abel Vesa	aa6fed90fe	dt-bindings: i2c: imx-lpi2c: Fix i.MX 8QM compatible matching The i.MX 8QM DTS files use two compatibles, so update the binding to fix dtbs_check warnings like: arch/arm64/boot/dts/freescale/imx8qm-mek.dt.yaml: i2c@5a800000: compatible: ['fsl,imx8qm-lpi2c', 'fsl,imx7ulp-lpi2c'] is too long Signed-off-by: Abel Vesa <abel.vesa@nxp.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Wolfram Sang <wsa@kernel.org>	2021-11-23 09:54:00 +01:00
Marco Elver	73743c3b09	perf: Ignore sigtrap for tracepoints destined for other tasks syzbot reported that the warning in perf_sigtrap() fires, saying that the event's task does not match current: \| WARNING: CPU: 0 PID: 9090 at kernel/events/core.c:6446 perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513 \| Modules linked in: \| CPU: 0 PID: 9090 Comm: syz-executor.1 Not tainted 5.15.0-syzkaller #0 \| Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 \| RIP: 0010:perf_sigtrap kernel/events/core.c:6446 [inline] \| RIP: 0010:perf_pending_event_disable kernel/events/core.c:6470 [inline] \| RIP: 0010:perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513 \| ... \| Call Trace: \| <IRQ> \| irq_work_single+0x106/0x220 kernel/irq_work.c:211 \| irq_work_run_list+0x6a/0x90 kernel/irq_work.c:242 \| irq_work_run+0x4f/0xd0 kernel/irq_work.c:251 \| __sysvec_irq_work+0x95/0x3d0 arch/x86/kernel/irq_work.c:22 \| sysvec_irq_work+0x8e/0xc0 arch/x86/kernel/irq_work.c:17 \| </IRQ> \| <TASK> \| asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:664 \| RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline] \| RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:194 \| ... \| coredump_task_exit kernel/exit.c:371 [inline] \| do_exit+0x1865/0x25c0 kernel/exit.c:771 \| do_group_exit+0xe7/0x290 kernel/exit.c:929 \| get_signal+0x3b0/0x1ce0 kernel/signal.c:2820 \| arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868 \| handle_signal_work kernel/entry/common.c:148 [inline] \| exit_to_user_mode_loop kernel/entry/common.c:172 [inline] \| exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207 \| __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] \| syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 \| do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 \| entry_SYSCALL_64_after_hwframe+0x44/0xae On x86 this shouldn't happen, which has arch_irq_work_raise(). The test program sets up a perf event with sigtrap set to fire on the 'sched_wakeup' tracepoint, which fired in ttwu_do_wakeup(). This happened because the 'sched_wakeup' tracepoint also takes a task argument passed on to perf_tp_event(), which is used to deliver the event to that other task. Since we cannot deliver synchronous signals to other tasks, skip an event if perf_tp_event() is targeted at another task and perf_event_attr::sigtrap is set, which will avoid ever entering perf_sigtrap() for such events. Fixes: `97ba62b278` ("perf: Add support for SIGTRAP on perf events") Reported-by: syzbot+663359e32ce6f1a305ad@syzkaller.appspotmail.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YYpoCOBmC/kJWfmI@elver.google.com	2021-11-23 09:45:37 +01:00
Muchun Song	14c2404884	locking/rwsem: Optimize down_read_trylock() under highly contended case We found that a process with 10 thousnads threads has been encountered a regression problem from Linux-v4.14 to Linux-v5.4. It is a kind of workload which will concurrently allocate lots of memory in different threads sometimes. In this case, we will see the down_read_trylock() with a high hotspot. Therefore, we suppose that rwsem has a regression at least since Linux-v5.4. In order to easily debug this problem, we write a simply benchmark to create the similar situation lile the following. ```c++ #include <sys/mman.h> #include <sys/time.h> #include <sys/resource.h> #include <sched.h> #include <cstdio> #include <cassert> #include <thread> #include <vector> #include <chrono> volatile int mutex; void trigger(int cpu, char* ptr, std::size_t sz) { cpu_set_t set; CPU_ZERO(&set); CPU_SET(cpu, &set); assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0); while (mutex); for (std::size_t i = 0; i < sz; i += 4096) { ptr = '\0'; ptr += 4096; } } int main(int argc, char argv[]) { std::size_t sz = 100; if (argc > 1) sz = atoi(argv[1]); auto nproc = std:🧵:hardware_concurrency(); std::vector<std::thread> thr; sz <<= 30; auto* ptr = mmap(nullptr, sz, PROT_READ \| PROT_WRITE, MAP_ANON \| MAP_PRIVATE, -1, 0); assert(ptr != MAP_FAILED); char* cptr = static_cast<char*>(ptr); auto run = sz / nproc; run = (run >> 12) << 12; mutex = 1; for (auto i = 0U; i < nproc; ++i) { thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); })); cptr += run; } rusage usage_start; getrusage(RUSAGE_SELF, &usage_start); auto start = std::chrono::system_clock::now(); mutex = 0; for (auto& t : thr) t.join(); rusage usage_end; getrusage(RUSAGE_SELF, &usage_end); auto end = std::chrono::system_clock::now(); timeval utime; timeval stime; timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime); timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime); printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec); printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec); printf("real: %lu\n", std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()); return 0; } ``` The functionality of above program is simply which creates `nproc` threads and each of them are trying to touch memory (trigger page fault) on different CPU. Then we will see the similar profile by `perf top`. 25.55% [kernel] [k] down_read_trylock 14.78% [kernel] [k] handle_mm_fault 13.45% [kernel] [k] up_read 8.61% [kernel] [k] clear_page_erms 3.89% [kernel] [k] __do_page_fault The highest hot instruction, which accounts for about 92%, in down_read_trylock() is cmpxchg like the following. 91.89 │ lock cmpxchg %rdx,(%rdi) Sice the problem is found by migrating from Linux-v4.14 to Linux-v5.4, so we easily found that the commit `ddb20d1d3a` ("locking/rwsem: Optimize down_read_trylock()") caused the regression. The reason is that the commit assumes the rwsem is not contended at all. But it is not always true for mmap lock which could be contended with thousands threads. So most threads almost need to run at least 2 times of "cmpxchg" to acquire the lock. The overhead of atomic operation is higher than non-atomic instructions, which caused the regression. By using the above benchmark, the real executing time on a x86-64 system before and after the patch were: Before Patch After Patch # of Threads real real reduced by ------------ ------ ------ ---------- 1 65,373 65,206 ~0.0% 4 15,467 15,378 ~0.5% 40 6,214 5,528 ~11.0% For the uncontended case, the new down_read_trylock() is the same as before. For the contended cases, the new down_read_trylock() is faster than before. The more contended, the more fast. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com	2021-11-23 09:45:36 +01:00
Waiman Long	d257cc8cb8	locking/rwsem: Make handoff bit handling more consistent There are some inconsistency in the way that the handoff bit is being handled in readers and writers that lead to a race condition. Firstly, when a queue head writer set the handoff bit, it will clear it when the writer is being killed or interrupted on its way out without acquiring the lock. That is not the case for a queue head reader. The handoff bit will simply be inherited by the next waiter. Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both the waiter and handoff bits are cleared if the wait queue becomes empty. For rwsem_down_write_slowpath(), however, the handoff bit is not checked and cleared if the wait queue is empty. This can potentially make the handoff bit set with empty wait queue. Worse, the situation in rwsem_down_write_slowpath() relies on wstate, a variable set outside of the critical section containing the ->count manipulation, this leads to race condition where RWSEM_FLAG_HANDOFF can be double subtracted, corrupting ->count. To make the handoff bit handling more consistent and robust, extract out handoff bit clearing code into the new rwsem_del_waiter() helper function. Also, completely eradicate wstate; always evaluate everything inside the same critical section. The common function will only use atomic_long_andnot() to clear bits when the wait queue is empty to avoid possible race condition. If the first waiter with handoff bit set is killed or interrupted to exit the slowpath without acquiring the lock, the next waiter will inherit the handoff bit. While at it, simplify the trylock for loop in rwsem_down_write_slowpath() to make it easier to read. Fixes: `4f23dbc1e6` ("locking/rwsem: Implement lock handoff to prevent lock starvation") Reported-by: Zhenhua Ma <mazhenhua@xiaomi.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211116012912.723980-1-longman@redhat.com	2021-11-23 09:45:35 +01:00
Huang Jianan	57bbeacdbe	erofs: fix deadlock when shrink erofs slab We observed the following deadlock in the stress test under low memory scenario: Thread A Thread B - erofs_shrink_scan - erofs_try_to_release_workgroup - erofs_workgroup_try_to_freeze -- A - z_erofs_do_read_page - z_erofs_collection_begin - z_erofs_register_collection - erofs_insert_workgroup - xa_lock(&sbi->managed_pslots) -- B - erofs_workgroup_get - erofs_wait_on_workgroup_freezed -- A - xa_erase - xa_lock(&sbi->managed_pslots) -- B To fix this, it needs to hold xa_lock before freezing the workgroup since xarray will be touched then. So let's hold the lock before accessing each workgroup, just like what we did with the radix tree before. [ Gao Xiang: Jianhua Hao also reports this issue at https://lore.kernel.org/r/b10b85df30694bac8aadfe43537c897a@xiaomi.com ] Link: https://lore.kernel.org/r/20211118135844.3559-1-huangjianan@oppo.com Fixes: `64094a0441` ("erofs: convert workstn to XArray") Reviewed-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Reported-by: Jianhua Hao <haojianhua1@xiaomi.com> Signed-off-by: Gao Xiang <xiang@kernel.org>	2021-11-23 14:58:16 +08:00
Shin'ichiro Kawasaki	2d62253eb1	scsi: scsi_debug: Zero clear zones at reset write pointer When a reset is requested the position of the write pointer is updated but the data in the corresponding zone is not cleared. Instead scsi_debug returns any data written before the write pointer was reset. This is an error and prevents using scsi_debug for stale page cache testing of the BLKRESETZONE ioctl. Zero written data in the zone when resetting the write pointer. Link: https://lore.kernel.org/r/20211122061223.298890-1-shinichiro.kawasaki@wdc.com Fixes: `f0d1cf9378` ("scsi: scsi_debug: Add ZBC zone commands") Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2021-11-22 22:27:51 -05:00
Mike Christie	eb97545d62	scsi: core: sysfs: Fix setting device state to SDEV_RUNNING This fixes an issue added in commit `4edd8cd4e8` ("scsi: core: sysfs: Fix hang when device state is set via sysfs") where if userspace is requesting to set the device state to SDEV_RUNNING when the state is already SDEV_RUNNING, we return -EINVAL instead of count. The commmit above set ret to count for this case, when it should have set it to 0. Link: https://lore.kernel.org/r/20211120164917.4924-1-michael.christie@oracle.com Fixes: `4edd8cd4e8` ("scsi: core: sysfs: Fix hang when device state is set via sysfs") Reviewed-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2021-11-22 22:18:54 -05:00
George Kennedy	e0a2c28da1	scsi: scsi_debug: Sanity check block descriptor length in resp_mode_select() In resp_mode_select() sanity check the block descriptor len to avoid UAF. BUG: KASAN: use-after-free in resp_mode_select+0xa4c/0xb40 drivers/scsi/scsi_debug.c:2509 Read of size 1 at addr ffff888026670f50 by task scsicmd/15032 CPU: 1 PID: 15032 Comm: scsicmd Not tainted 5.15.0-01d0625 #15 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Call Trace: <TASK> dump_stack_lvl+0x89/0xb5 lib/dump_stack.c:107 print_address_description.constprop.9+0x28/0x160 mm/kasan/report.c:257 kasan_report.cold.14+0x7d/0x117 mm/kasan/report.c:443 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report_generic.c:306 resp_mode_select+0xa4c/0xb40 drivers/scsi/scsi_debug.c:2509 schedule_resp+0x4af/0x1a10 drivers/scsi/scsi_debug.c:5483 scsi_debug_queuecommand+0x8c9/0x1e70 drivers/scsi/scsi_debug.c:7537 scsi_queue_rq+0x16b4/0x2d10 drivers/scsi/scsi_lib.c:1521 blk_mq_dispatch_rq_list+0xb9b/0x2700 block/blk-mq.c:1640 __blk_mq_sched_dispatch_requests+0x28f/0x590 block/blk-mq-sched.c:325 blk_mq_sched_dispatch_requests+0x105/0x190 block/blk-mq-sched.c:358 __blk_mq_run_hw_queue+0xe5/0x150 block/blk-mq.c:1762 __blk_mq_delay_run_hw_queue+0x4f8/0x5c0 block/blk-mq.c:1839 blk_mq_run_hw_queue+0x18d/0x350 block/blk-mq.c:1891 blk_mq_sched_insert_request+0x3db/0x4e0 block/blk-mq-sched.c:474 blk_execute_rq_nowait+0x16b/0x1c0 block/blk-exec.c:63 sg_common_write.isra.18+0xeb3/0x2000 drivers/scsi/sg.c:837 sg_new_write.isra.19+0x570/0x8c0 drivers/scsi/sg.c:775 sg_ioctl_common+0x14d6/0x2710 drivers/scsi/sg.c:941 sg_ioctl+0xa2/0x180 drivers/scsi/sg.c:1166 __x64_sys_ioctl+0x19d/0x220 fs/ioctl.c:52 do_syscall_64+0x3a/0x80 arch/x86/entry/common.c:50 entry_SYSCALL_64_after_hwframe+0x44/0xae arch/x86/entry/entry_64.S:113 Link: https://lore.kernel.org/r/1637262208-28850-1-git-send-email-george.kennedy@oracle.com Reported-by: syzkaller <syzkaller@googlegroups.com> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: George Kennedy <george.kennedy@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2021-11-22 22:16:09 -05:00
Pavel Begunkov	674ee8e1b4	io_uring: correct link-list traversal locking As io_remove_next_linked() is now under ->timeout_lock (see io_link_timeout_fn), we should update locking around io_for_each_link() and io_match_task() to use the new lock. Cc: stable@kernel.org # 5.15+ Fixes: `89850fce16` ("io_uring: run timeouts from task_work") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b54541cedf7de59cb5ae36109e58529ca16e66aa.1637631883.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-22 19:31:54 -07:00
Ming Lei	efcf593223	block: avoid to touch unloaded module instance when opening bdev disk->fops->owner is grabbed in blkdev_get_no_open() after the disk kobject refcount is increased. This way can't make sure that disk->fops->owner is still alive since del_gendisk() still can move on if the kobject refcount of disk is grabbed by open() and disk->fops->open() isn't called yet. Fixes the issue by moving try_module_get() into blkdev_get_by_dev() with ->open_mutex() held, then we can drain the in-progress open() in del_gendisk(). Meantime new open() won't succeed because disk becomes not alive. This way is reasonable because blkdev_get_no_open() needn't to touch disk->fops or defined callbacks. Cc: Christoph Hellwig <hch@lst.de> Cc: czhong@redhat.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211111020343.316126-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-11-22 18:35:37 -07:00
Linus Torvalds	c7756f3a32	media fixes for v5.16-rc3 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAmGbRRYACgkQCF8+vY7k 4RWYVw/8DwN6kyjGkBkAMGoHCXUYeoATYn3l5NVQaAyxTxF8s5d5YPFOXhJn77s9 S8CVvqy3JvZWeI/8L6mmOZl0rBNiDkURvnXyvcMv2lHb+VJpcb3hulgtMrRPogMw EFH2DjP9FE+ipOcqaAYrBIgyNRgvhqV+GKcqAXH9X4o+6lOZUKXVZcUbz/T6rWUQ MbZdc1aq1GA4SBj/scI0QMZskJ6dFtZTSeLUTgHo/QcybkxPh64Sqw2k3pc+xzRr SQB60DsDbzCuBX0aQjJWxNAUmu/b/DZdyd6yuiFvZCtsCiWH8rlkkr0kLmJB19Wr Xv7o5+heTYh5dKrDzA4VpbKeM5pjUn5hd3iK5C+YLNS7WwUNeBlrgT4JD1TUhj/w zwMRiKTBqMbi2lkOzcgxGMKn3SCkSa7WiKw25EdxQbn22laKP+6/a6ELg2nniARm N8OmTkpOYDK6va2j8U1ouJBwHOVoF+rUOzyD8CsRwC1zMjCdefN7dhvOd8Wh87Qo vCahUAHTCb5qANbRWqp2khTyg+LKK6/jGKa6IxVLAHhc9ZvyOCJW4iHEQ52cIFQI 5FxR+CYJUN8vMxHsm90Y/8xis17HaOrkfQcWX7o/GVQh6jPrihez2TC0zoq3kj6N ID6Qaqs3tLCcsjX5XDBGiM2sDITrWNIZYMoIMg0vf29g5rED/aY= =f3wv -----END PGP SIGNATURE----- Merge tag 'media/v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab: - fix VIDIOC_DQEVENT ioctl handling for 32-bit userspace with a 64-bit kernel - regression fix for videobuf2 core - fix for CEC core when handling non-block transmit - hi846: fix a clang warning * tag 'media/v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: media: hi846: remove the of_match_ptr macro media: hi846: include property.h instead of of_graph.h media: cec: copy sequence field for the reply media: videobuf2-dma-sg: Fix buf->vb NULL pointer dereference media: v4l2-core: fix VIDIOC_DQEVENT handling on non-x86	2021-11-22 14:58:57 -08:00
NeilBrown	064a91771f	SUNRPC: use different lock keys for INET6 and LOCAL xprtsock.c reclassifies sock locks based on the protocol. However there are 3 protocols and only 2 classification keys. The same key is used for both INET6 and LOCAL. This causes lockdep complaints. The complaints started since Commit `ea9afca88b` ("SUNRPC: Replace use of socket sk_callback_lock with sock_lock") which resulted in the sock locks beings used more. So add another key, and renumber them slightly. Fixes: `ea9afca88b` ("SUNRPC: Replace use of socket sk_callback_lock with sock_lock") Fixes: `176e21ee2e` ("SUNRPC: Support for RPC over AF_LOCAL transports") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2021-11-22 17:44:49 -05:00
Nadav Amit	13e4ad2ce8	hugetlbfs: flush before unlock on move_hugetlb_page_tables() We must flush the TLB before releasing i_mmap_rwsem to avoid the potential reuse of an unshared PMDs page. This is not true in the case of move_hugetlb_page_tables(). The last reference on the page table can therefore be dropped before the TLB flush took place. Prevent it by reordering the operations and flushing the TLB before releasing i_mmap_rwsem. Fixes: `550a7d60bd` ("mm, hugepages: add mremap() support for hugepage backed vma") Signed-off-by: Nadav Amit <namit@vmware.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-22 11:36:46 -08:00
Nadav Amit	a4a118f2ee	hugetlbfs: flush TLBs correctly after huge_pmd_unshare When __unmap_hugepage_range() calls to huge_pmd_unshare() succeed, a TLB flush is missing. This TLB flush must be performed before releasing the i_mmap_rwsem, in order to prevent an unshared PMDs page from being released and reused before the TLB flush took place. Arguably, a comprehensive solution would use mmu_gather interface to batch the TLB flushes and the PMDs page release, however it is not an easy solution: (1) try_to_unmap_one() and try_to_migrate_one() also call huge_pmd_unshare() and they cannot use the mmu_gather interface; and (2) deferring the release of the page reference for the PMDs page until after i_mmap_rwsem is dropeed can confuse huge_pmd_unshare() into thinking PMDs are shared when they are not. Fix __unmap_hugepage_range() by adding the missing TLB flush, and forcing a flush when unshare is successful. Fixes: `24669e5847` ("hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages)" # 3.6 Signed-off-by: Nadav Amit <namit@vmware.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-22 11:36:46 -08:00
Marta Plantykow	f65ee535df	ice: avoid bpf_prog refcount underflow Ice driver has the routines for managing XDP resources that are shared between ndo_bpf op and VSI rebuild flow. The latter takes place for example when user changes queue count on an interface via ethtool's set_channels(). There is an issue around the bpf_prog refcounting when VSI is being rebuilt - since ice_prepare_xdp_rings() is called with vsi->xdp_prog as an argument that is used later on by ice_vsi_assign_bpf_prog(), same bpf_prog pointers are swapped with each other. Then it is also interpreted as an 'old_prog' which in turn causes us to call bpf_prog_put on it that will decrement its refcount. Below splat can be interpreted in a way that due to zero refcount of a bpf_prog it is wiped out from the system while kernel still tries to refer to it: [ 481.069429] BUG: unable to handle page fault for address: ffffc9000640f038 [ 481.077390] #PF: supervisor read access in kernel mode [ 481.083335] #PF: error_code(0x0000) - not-present page [ 481.089276] PGD 100000067 P4D 100000067 PUD 1001cb067 PMD 106d2b067 PTE 0 [ 481.097141] Oops: 0000 [#1] PREEMPT SMP PTI [ 481.101980] CPU: 12 PID: 3339 Comm: sudo Tainted: G OE 5.15.0-rc5+ #1 [ 481.110840] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016 [ 481.122021] RIP: 0010:dev_xdp_prog_id+0x25/0x40 [ 481.127265] Code: 80 00 00 00 00 0f 1f 44 00 00 89 f6 48 c1 e6 04 48 01 fe 48 8b 86 98 08 00 00 48 85 c0 74 13 48 8b 50 18 31 c0 48 85 d2 74 07 <48> 8b 42 38 8b 40 20 c3 48 8b 96 90 08 00 00 eb e8 66 2e 0f 1f 84 [ 481.148991] RSP: 0018:ffffc90007b63868 EFLAGS: 00010286 [ 481.155034] RAX: 0000000000000000 RBX: ffff889080824000 RCX: 0000000000000000 [ 481.163278] RDX: ffffc9000640f000 RSI: ffff889080824010 RDI: ffff889080824000 [ 481.171527] RBP: ffff888107af7d00 R08: 0000000000000000 R09: ffff88810db5f6e0 [ 481.179776] R10: 0000000000000000 R11: ffff8890885b9988 R12: ffff88810db5f4bc [ 481.188026] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 481.196276] FS: 00007f5466d5bec0(0000) GS:ffff88903fb00000(0000) knlGS:0000000000000000 [ 481.205633] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 481.212279] CR2: ffffc9000640f038 CR3: 000000014429c006 CR4: 00000000003706e0 [ 481.220530] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 481.228771] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 481.237029] Call Trace: [ 481.239856] rtnl_fill_ifinfo+0x768/0x12e0 [ 481.244602] rtnl_dump_ifinfo+0x525/0x650 [ 481.249246] ? __alloc_skb+0xa5/0x280 [ 481.253484] netlink_dump+0x168/0x3c0 [ 481.257725] netlink_recvmsg+0x21e/0x3e0 [ 481.262263] ____sys_recvmsg+0x87/0x170 [ 481.266707] ? __might_fault+0x20/0x30 [ 481.271046] ? _copy_from_user+0x66/0xa0 [ 481.275591] ? iovec_from_user+0xf6/0x1c0 [ 481.280226] ___sys_recvmsg+0x82/0x100 [ 481.284566] ? sock_sendmsg+0x5e/0x60 [ 481.288791] ? __sys_sendto+0xee/0x150 [ 481.293129] __sys_recvmsg+0x56/0xa0 [ 481.297267] do_syscall_64+0x3b/0xc0 [ 481.301395] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 481.307238] RIP: 0033:0x7f5466f39617 [ 481.311373] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bd 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2f 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10 [ 481.342944] RSP: 002b:00007ffedc7f4308 EFLAGS: 00000246 ORIG_RAX: 000000000000002f [ 481.361783] RAX: ffffffffffffffda RBX: 00007ffedc7f5460 RCX: 00007f5466f39617 [ 481.380278] RDX: 0000000000000000 RSI: 00007ffedc7f5360 RDI: 0000000000000003 [ 481.398500] RBP: 00007ffedc7f53f0 R08: 0000000000000000 R09: 000055d556f04d50 [ 481.416463] R10: 0000000000000077 R11: 0000000000000246 R12: 00007ffedc7f5360 [ 481.434131] R13: 00007ffedc7f5350 R14: 00007ffedc7f5344 R15: 0000000000000e98 [ 481.451520] Modules linked in: ice(OE) af_packet binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp mxm_wmi mei_me coretemp mei ipmi_si ipmi_msghandler wmi acpi_pad acpi_power_meter ip_tables x_tables autofs4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ahci crypto_simd cryptd libahci lpc_ich [last unloaded: ice] [ 481.528558] CR2: ffffc9000640f038 [ 481.542041] ---[ end trace d1f24c9ecf5b61c1 ]--- Fix this by only calling ice_vsi_assign_bpf_prog() inside ice_prepare_xdp_rings() when current vsi->xdp_prog pointer is NULL. This way set_channels() flow will not attempt to swap the vsi->xdp_prog pointers with itself. Also, sprinkle around some comments that provide a reasoning about correlation between driver and kernel in terms of bpf_prog refcount. Fixes: `efc2214b60` ("ice: Add support for XDP") Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Marta Plantykow <marta.a.plantykow@intel.com> Co-developed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2021-11-22 08:35:36 -08:00
Maciej Fijalkowski	792b208658	ice: fix vsi->txq_map sizing The approach of having XDP queue per CPU regardless of user's setting exposed a hidden bug that could occur in case when Rx queue count differ from Tx queue count. Currently vsi->txq_map's size is equal to the doubled vsi->alloc_txq, which is not correct due to the fact that XDP rings were previously based on the Rx queue count. Below splat can be seen when ethtool -L is used and XDP rings are configured: [ 682.875339] BUG: kernel NULL pointer dereference, address: 000000000000000f [ 682.883403] #PF: supervisor read access in kernel mode [ 682.889345] #PF: error_code(0x0000) - not-present page [ 682.895289] PGD 0 P4D 0 [ 682.898218] Oops: 0000 [#1] PREEMPT SMP PTI [ 682.903055] CPU: 42 PID: 2878 Comm: ethtool Tainted: G OE 5.15.0-rc5+ #1 [ 682.912214] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016 [ 682.923380] RIP: 0010:devres_remove+0x44/0x130 [ 682.928527] Code: 49 89 f4 55 48 89 fd 4c 89 ff 53 48 83 ec 10 e8 92 b9 49 00 48 8b 9d a8 02 00 00 48 8d 8d a0 02 00 00 49 89 c2 48 39 cb 74 0f <4c> 3b 63 10 74 25 48 8b 5b 08 48 39 cb 75 f1 4c 89 ff 4c 89 d6 e8 [ 682.950237] RSP: 0018:ffffc90006a679f0 EFLAGS: 00010002 [ 682.956285] RAX: 0000000000000286 RBX: ffffffffffffffff RCX: ffff88908343a370 [ 682.964538] RDX: 0000000000000001 RSI: ffffffff81690d60 RDI: 0000000000000000 [ 682.972789] RBP: ffff88908343a0d0 R08: 0000000000000000 R09: 0000000000000000 [ 682.981040] R10: 0000000000000286 R11: 3fffffffffffffff R12: ffffffff81690d60 [ 682.989282] R13: ffffffff81690a00 R14: ffff8890819807a8 R15: ffff88908343a36c [ 682.997535] FS: 00007f08c7bfa740(0000) GS:ffff88a03fd00000(0000) knlGS:0000000000000000 [ 683.006910] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 683.013557] CR2: 000000000000000f CR3: 0000001080a66003 CR4: 00000000003706e0 [ 683.021819] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 683.030075] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 683.038336] Call Trace: [ 683.041167] devm_kfree+0x33/0x50 [ 683.045004] ice_vsi_free_arrays+0x5e/0xc0 [ice] [ 683.050380] ice_vsi_rebuild+0x4c8/0x750 [ice] [ 683.055543] ice_vsi_recfg_qs+0x9a/0x110 [ice] [ 683.060697] ice_set_channels+0x14f/0x290 [ice] [ 683.065962] ethnl_set_channels+0x333/0x3f0 [ 683.070807] genl_family_rcv_msg_doit+0xea/0x150 [ 683.076152] genl_rcv_msg+0xde/0x1d0 [ 683.080289] ? channels_prepare_data+0x60/0x60 [ 683.085432] ? genl_get_cmd+0xd0/0xd0 [ 683.089667] netlink_rcv_skb+0x50/0xf0 [ 683.094006] genl_rcv+0x24/0x40 [ 683.097638] netlink_unicast+0x239/0x340 [ 683.102177] netlink_sendmsg+0x22e/0x470 [ 683.106717] sock_sendmsg+0x5e/0x60 [ 683.110756] __sys_sendto+0xee/0x150 [ 683.114894] ? handle_mm_fault+0xd0/0x2a0 [ 683.119535] ? do_user_addr_fault+0x1f3/0x690 [ 683.134173] __x64_sys_sendto+0x25/0x30 [ 683.148231] do_syscall_64+0x3b/0xc0 [ 683.161992] entry_SYSCALL_64_after_hwframe+0x44/0xae Fix this by taking into account the value that num_possible_cpus() yields in addition to vsi->alloc_txq instead of doubling the latter. Fixes: `efc2214b60` ("ice: Add support for XDP") Fixes: `22bf877e52` ("ice: introduce XDP_TX fallback path") Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2021-11-22 08:33:54 -08:00
David S. Miller	03a000bfd7	Merge branch 'nh-group-refcnt' Nikolay Aleksandrov says: ==================== net: nexthop: fix refcount issues when replacing groups This set fixes a refcount bug when replacing nexthop groups and modifying routes. It is complex because the objects look valid when debugging memory dumps, but we end up having refcount dependency between unlinked objects which can never be released, so in turn they cannot free their resources and refcounts. The problem happens because we can have stale IPv6 per-cpu dsts in nexthops which were removed from a group. Even though the IPv6 gen is bumped, the dsts won't be released until traffic passes through them or the nexthop is freed, that can take arbitrarily long time, and even worse we can create a scenario[1] where it can never be released. The fix is to release the IPv6 per-cpu dsts of replaced nexthops after an RCU grace period so no new ones can be created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which is used by the nexthop code only when necessary. We can further optimize group replacement, but that is more suited for net-next as these patches would have to be backported to stable releases. v2: patch 02: update commit msg patch 03: check for mausezahn before testing and make a few comments more verbose [1] This info is also present in patch 02's commit message. Initial state: $ ip nexthop list id 200 via 2002:db8::2 dev bridge.10 scope link onlink id 201 via 2002:db8::3 dev bridge scope link onlink id 203 group 201/200 $ ip -6 route 2001:db8::10 nhid 203 metric 1024 pref medium nexthop via 2002:db8::3 dev bridge weight 1 onlink nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink Create rt6_info through one of the multipath legs, e.g.: $ taskset -a -c 1 ./pkt_inj 24 bridge.10 2001:db8::10 (pkt_inj is just a custom packet generator, nothing special) Then remove that leg from the group by replace (let's assume it is id 200 in this case): $ ip nexthop replace id 203 group 201 Now remove the IPv6 route: $ ip -6 route del 2001:db8::10/128 The route won't be really deleted due to the stale rt6_info holding 1 refcnt in nexthop id 200. At this point we have the following reference count dependency: (deleted) IPv6 route holds 1 reference over nhid 203 nh 203 holds 1 ref over id 201 nh 200 holds 1 ref over the net device and the route due to the stale rt6_info Now to create circular dependency between nh 200 and the IPv6 route, and also to get a reference over nh 200, restore nhid 200 in the group: $ ip nexthop replace id 203 group 201/200 And now we have a permanent circular dependncy because nhid 203 holds a reference over nh 200 and 201, but the route holds a ref over nh 203 and is deleted. To trigger the bug just delete the group (nhid 203): $ ip nexthop del id 203 It won't really be deleted due to the IPv6 route dependency, and now we have 2 unlinked and deleted objects that reference each other: the group and the IPv6 route. Since the group drops the reference it holds over its entries at free time (i.e. its own refcount needs to drop to 0) that will never happen and we get a permanent ref on them, since one of the entries holds a reference over the IPv6 route it will also never be released. At this point the dependencies are: (deleted, only unlinked) IPv6 route holds reference over group nh 203 (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200 nh 200 holds 1 ref over the net device and the route due to the stale rt6_info This is the last point where it can be fixed by running traffic through nh 200, and specifically through the same CPU so the rt6_info (dst) will get released due to the IPv6 genid, that in turn will free the IPv6 route, which in turn will free the ref count over the group nh 203. If nh 200 is deleted at this point, it will never be released due to the ref from the unlinked group 203, it will only be unlinked: $ ip nexthop del id 200 $ ip nexthop $ Now we can never release that stale rt6_info, we have IPv6 route with ref over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with rt6_info (dst) with ref over the net device and the IPv6 route. All of these objects are only unlinked, and cannot be released, thus they can't release their ref counts. Message from syslogd@dev at Nov 19 14:04:10 ... kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3 Message from syslogd@dev at Nov 19 14:04:20 ... kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 15:44:49 +00:00
Nikolay Aleksandrov	02ebe49ab0	selftests: net: fib_nexthops: add test for group refcount imbalance bug The new selftest runs a sequence which causes circular refcount dependency between deleted objects which cannot be released and results in a netdevice refcount imbalance. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 15:44:49 +00:00
Nikolay Aleksandrov	1005f19b93	net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group When replacing a nexthop group, we must release the IPv6 per-cpu dsts of the removed nexthop entries after an RCU grace period because they contain references to the nexthop's net device and to the fib6 info. With specific series of events[1] we can reach net device refcount imbalance which is unrecoverable. IPv4 is not affected because dsts don't take a refcount on the route. [1] $ ip nexthop list id 200 via 2002:db8::2 dev bridge.10 scope link onlink id 201 via 2002:db8::3 dev bridge scope link onlink id 203 group 201/200 $ ip -6 route 2001:db8::10 nhid 203 metric 1024 pref medium nexthop via 2002:db8::3 dev bridge weight 1 onlink nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink Create rt6_info through one of the multipath legs, e.g.: $ taskset -a -c 1 ./pkt_inj 24 bridge.10 2001:db8::10 (pkt_inj is just a custom packet generator, nothing special) Then remove that leg from the group by replace (let's assume it is id 200 in this case): $ ip nexthop replace id 203 group 201 Now remove the IPv6 route: $ ip -6 route del 2001:db8::10/128 The route won't be really deleted due to the stale rt6_info holding 1 refcnt in nexthop id 200. At this point we have the following reference count dependency: (deleted) IPv6 route holds 1 reference over nhid 203 nh 203 holds 1 ref over id 201 nh 200 holds 1 ref over the net device and the route due to the stale rt6_info Now to create circular dependency between nh 200 and the IPv6 route, and also to get a reference over nh 200, restore nhid 200 in the group: $ ip nexthop replace id 203 group 201/200 And now we have a permanent circular dependncy because nhid 203 holds a reference over nh 200 and 201, but the route holds a ref over nh 203 and is deleted. To trigger the bug just delete the group (nhid 203): $ ip nexthop del id 203 It won't really be deleted due to the IPv6 route dependency, and now we have 2 unlinked and deleted objects that reference each other: the group and the IPv6 route. Since the group drops the reference it holds over its entries at free time (i.e. its own refcount needs to drop to 0) that will never happen and we get a permanent ref on them, since one of the entries holds a reference over the IPv6 route it will also never be released. At this point the dependencies are: (deleted, only unlinked) IPv6 route holds reference over group nh 203 (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200 nh 200 holds 1 ref over the net device and the route due to the stale rt6_info This is the last point where it can be fixed by running traffic through nh 200, and specifically through the same CPU so the rt6_info (dst) will get released due to the IPv6 genid, that in turn will free the IPv6 route, which in turn will free the ref count over the group nh 203. If nh 200 is deleted at this point, it will never be released due to the ref from the unlinked group 203, it will only be unlinked: $ ip nexthop del id 200 $ ip nexthop $ Now we can never release that stale rt6_info, we have IPv6 route with ref over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with rt6_info (dst) with ref over the net device and the IPv6 route. All of these objects are only unlinked, and cannot be released, thus they can't release their ref counts. Message from syslogd@dev at Nov 19 14:04:10 ... kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3 Message from syslogd@dev at Nov 19 14:04:20 ... kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3 Fixes: `7bf4796dd0` ("nexthops: add support for replace") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 15:44:49 +00:00
Nikolay Aleksandrov	8837cbbf85	net: ipv6: add fib6_nh_release_dsts stub We need a way to release a fib6_nh's per-cpu dsts when replacing nexthops otherwise we can end up with stale per-cpu dsts which hold net device references, so add a new IPv6 stub called fib6_nh_release_dsts. It must be used after an RCU grace period, so no new dsts can be created through a group's nexthop entry. Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed so it doesn't need a dummy stub when IPv6 is not enabled. Fixes: `7bf4796dd0` ("nexthops: add support for replace") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-22 15:44:49 +00:00

... 3 4 5 6 7 ...

1058873 Commits