352 Commits

Author SHA1 Message Date
John Garry
11ff0c98fc scsi: hisi_sas: Drain bcast events in hisi_sas_rescan_topology()
In resetting the controller, SATA devices may be lost.

The issue is that when we insert the bcast events to rescan the topology in
hisi_sas_rescan_topology(), when we subsequently nexus reset the SATA
devices in hisi_sas_async_I_T_nexus_reset(), there is a small timing window
in which the remote phy is down and we process the bcast event (meaning
that libsas judges that the disk is lost).

Ensure that all bcast events are processed prior to the nexus reset to
close this window.

Link: https://lore.kernel.org/r/1662378529-101489-4-git-send-email-john.garry@huawei.com
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-09-06 22:28:11 -04:00
John Garry
bc5551157a scsi: hisi_sas: Clear HISI_SAS_HW_FAULT_BIT earlier
Once the controller HW has been reset then we can unset flag
HISI_SAS_HW_FAULT_BIT. In clearing this flag earlier we can now
successfully execute commands in hisi_sas_controller_reset_done(), like
bcast processing.

Link: https://lore.kernel.org/r/1662378529-101489-3-git-send-email-john.garry@huawei.com
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-09-06 22:28:10 -04:00
Xiang Chen
f0902095a7 scsi: hisi_sas: Relocate DMA unmap of SMP task
Currently SMP tasks are DMA unmapped only when cq of SMP I/O is returned
normally. If the cq of SMP I/O is returned with exception actually SMP TAS
is never unmapped. Relocate DMA unmap of SMP task to fix the issue.

Link: https://lore.kernel.org/r/1657823002-139010-4-git-send-email-john.garry@huawei.com
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-07-18 23:04:12 -04:00
Xiang Chen
bc22f9c06c scsi: hisi_sas: Remove unnecessary variable to hold DMA map elements
Use slot->n_elem to store the return value of dma_map_sg() for SSP and SMP
IOs, and remove unnecessary variable n_elem_req.

Link: https://lore.kernel.org/r/1657823002-139010-3-git-send-email-john.garry@huawei.com
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-07-18 23:04:11 -04:00
John Garry
6c6ac8b777 scsi: hisi_sas: Fix memory ordering in hisi_sas_task_deliver()
The memories for the slot should be observed to be written prior to
observing the slot as ready.

Prior to commit 26fc0ea74fcb ("scsi: libsas: Drop SAS_TASK_AT_INITIATOR"),
we had a spin_lock() + spin_unlock() immediately before marking the slot as
ready. The spin_unlock() - with release semantics - caused the slot memory
to be observed to be written.

Now that the spin_lock() + spin_unlock() is gone, use a smp_wmb().

Link: https://lore.kernel.org/r/1652774661-12935-1-git-send-email-john.garry@huawei.com
Fixes: 26fc0ea74fcb ("scsi: libsas: Drop SAS_TASK_AT_INITIATOR")
Reported-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-05-19 20:16:26 -04:00
John Garry
e9dedc13bb scsi: hisi_sas: Fix rescan after deleting a disk
Removing an ATA device via sysfs means that the device may not be found
through re-scanning:

root@ubuntu:/home/john# lsscsi
[0:0:0:0] disk SanDisk LT0200MO P404 /dev/sda
[0:0:1:0] disk ATA HGST HUS724040AL A8B0 /dev/sdb
[0:0:8:0] enclosu 12G SAS Expander RevB -
root@ubuntu:/home/john# echo 1 > /sys/block/sdb/device/delete
root@ubuntu:/home/john# echo "- - -" > /sys/class/scsi_host/host0/scan
root@ubuntu:/home/john# lsscsi
[0:0:0:0] disk SanDisk LT0200MO P404 /dev/sda
[0:0:8:0] enclosu 12G SAS Expander RevB -
root@ubuntu:/home/john#

The problem is that the rescan of the device may conflict with the device
in being re-initialized, as follows:

 - In the rescan we call hisi_sas_slave_alloc() in store_scan() ->
   sas_user_scan() -> [__]scsi_scan_target() -> scsi_probe_and_add_lunc()
   -> scsi_alloc_sdev() -> hisi_sas_slave_alloc() -> hisi_sas_init_device()
   In hisi_sas_init_device() we issue an IT nexus reset for ATA devices

 - That IT nexus causes the remote PHY to go down and this triggers a bcast
   event

 - In parallel libsas processes the bcast event, finds that the phy is down
   and marks the device as gone

The hard reset issued in hisi_sas_init_device() is unncessary - as
described in the code comment - so remove it. Also set dev status as
HISI_SAS_DEV_NORMAL as the hisi_sas_init_device() call.

Link: https://lore.kernel.org/r/1652354134-171343-4-git-send-email-john.garry@huawei.com
Fixes: 36c6b7613ef1 ("scsi: hisi_sas: Initialise devices in .slave_alloc callback")
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Reviewed-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-05-19 20:16:26 -04:00
John Garry
71453bd9d1 scsi: hisi_sas: Use sas_ata_wait_after_reset() in IT nexus reset
We have seen errors like this when a SATA device is probed:

[524.566298] hisi_sas_v3_hw 0000L74:02.0: erroneous completion iptt=4096 ...
[524.582827] sas: TMF task open reject failed 500e004aaaaaaaa00

Since commit 21c7e972475e ("scsi: hisi_sas: Disable SATA disk phy for
severe I_T nexus reset failure"), we issue an ATA softreset to disks after
a phy reset to ensure that they are in sound working order. If the
softreset is issued before the remote phy has come back up then the
softreset will fail (errors as above). Remedy this by waiting for the phy
to come back up after the reset.

Link: https://lore.kernel.org/r/1652354134-171343-3-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Reviewed-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-05-19 20:16:26 -04:00
Dan Carpenter
066f4c3194 scsi: hisi_sas: Remove stray fallthrough annotation
This case statement doesn't fall through any more so remove the fallthrough
annotation.

Link: https://lore.kernel.org/r/20220317075214.GC25237@kili
Acked-by: John Garry <john.garry@huawei.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-29 23:53:03 -04:00
John Garry
095478a6e5 scsi: hisi_sas: Use libsas internal abort support
Use the common libsas internal abort functionality.

In addition, this driver has special handling for internal abort timeouts -
specifically whether to reset the controller in that instance, so extend
the API for that.

Timeout is now increased to 20 * Hz from 6 * Hz.

We also retry for failure now, but this should not make a difference.

Link: https://lore.kernel.org/r/1647001432-239276-5-git-send-email-john.garry@huawei.com
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-03-14 23:33:24 -04:00
Xiang Chen
512623de52 scsi: hisi_sas: Change hisi_sas_control_phy() phyup timeout
The time of phyup not only depends on the controller but also the type of
disk connected. As an example, from experience, for some SATA disks the
amount of time from reset/power-on to receive the D2H FIS for phyup can
take upto and more than 10s sometimes. According to the specification of
some SATA disks such as ST14000NM0018, the max time from power-on to ready
is 30s.

Based on this the current timeout of phyup at 2s which is not enough. So
set the value as HISI_SAS_WAIT_PHYUP_TIMEOUT (30s) in
hisi_sas_control_phy().

For v3 hw there is a pre-existing workaround for a HW bug, being that we
issue a link reset when the OOB occurs but the phyup does not. The current
phyup timeout is HISI_SAS_WAIT_PHYUP_TIMEOUT. So if this does occur from
when issuing a phy enable or similar via hisi_sas_control_phy(), the
subsequent HW workaround linkreset processing calls hisi_sas_control_phy(),
but this will pend the original phy reset timing out, so it is safe.

Link: https://lore.kernel.org/r/1645703489-87194-3-git-send-email-john.garry@huawei.com
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-27 21:46:40 -05:00
John Garry
3f2e252ef7 scsi: libsas: Add sas_execute_ata_cmd()
Add a function to execute an ATA command using the TMF code, and use in the
hisi_sas driver. That driver needs to be able to issue the command on a
specific phy, so add an interface for that.

With that, hisi_sas_exec_internal_tmf_task() may be deleted.

Link: https://lore.kernel.org/r/1645534259-27068-19-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-22 21:11:02 -05:00
John Garry
4fea759edf scsi: libsas: Add sas_abort_task()
Add a generic implementation of abort task TMF handler, and use in LLDDs.

With that, some LLDDs custom TMF functions can now be deleted.

Link: https://lore.kernel.org/r/1645112566-115804-18-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19 15:59:36 -05:00
John Garry
72f8810e1f scsi: libsas: Add sas_query_task()
Add a generic implementation of query task TMF handler, and use in LLDDs.

Link: https://lore.kernel.org/r/1645112566-115804-17-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19 15:59:36 -05:00
John Garry
29d7769055 scsi: libsas: Add sas_lu_reset()
Add a generic implementation of LU reset TMF handler, and use in LLDDs.

Link: https://lore.kernel.org/r/1645112566-115804-16-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19 15:59:36 -05:00
John Garry
e858545295 scsi: libsas: Add sas_clear_task_set()
Add a generic implementation of clear task set TMF handler, and use in
LLDDs.

Link: https://lore.kernel.org/r/1645112566-115804-15-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19 15:59:36 -05:00
John Garry
69b80a0ed0 scsi: libsas: Add sas_abort_task_set()
Add a generic implementation of abort task set TMF handler, and use in
LLDDs.

Link: https://lore.kernel.org/r/1645112566-115804-14-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19 15:59:36 -05:00
John Garry
693e66a0a6 scsi: libsas: Add TMF handler aborted callback
The hisi_sas and pm8001 TMF handlers have some special processing for when
the TMF is aborted, so add a callback and fill it in for those drivers.

Link: https://lore.kernel.org/r/1645112566-115804-13-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19 15:59:35 -05:00
John Garry
96e54376a8 scsi: libsas: Add sas_task.tmf
Add a pointer to a sas_tmf_task to the sas_task struct, as this will be
used when the common LLDD TMF code is factored out.

Also set it for the LLDDs to store per-sas_task TMF info.

Link: https://lore.kernel.org/r/1645112566-115804-9-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19 15:59:35 -05:00
John Garry
bbfe82cdba scsi: libsas: Add struct sas_tmf_task
Some of the LLDDs which use libsas have their own definition of a struct
to hold TMF info, so add a common struct for libsas.

Also add an interim force phy id field for hisi_sas driver, which will be
removed once the STP "TMF" code is factored out.

Even though some LLDDs (pm8001) use a u32 for the tag, u16 will be adequate,
as that named driver only uses tags in range [0, 1024).

Link: https://lore.kernel.org/r/1645112566-115804-8-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19 15:59:35 -05:00
John Garry
da19eaba6e scsi: hisi_sas: Delete unused I_T_NEXUS_RESET_PHYUP_TIMEOUT
There is no user, so delete it.

Link: https://lore.kernel.org/r/1645112566-115804-6-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19 15:59:34 -05:00
John Garry
25882c82f8 scsi: libsas: Delete lldd_clear_aca callback
This callback is never called, so remove support.

Link: https://lore.kernel.org/r/1645112566-115804-4-git-send-email-john.garry@huawei.com
Tested-by: Yihang Li <liyihang6@hisilicon.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-19 15:59:34 -05:00
Martin K. Petersen
ac2beb4e3b Merge branch '5.17/scsi-fixes' into 5.18/scsi-staging
Pull 5.17 fixes branch into 5.18 tree to resolve a few pm8001 driver
merge conflicts.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-14 21:51:29 -05:00
John Garry
26fc0ea74f scsi: libsas: Drop SAS_TASK_AT_INITIATOR
This flag is now only ever set, so delete it.

This also avoids a use-after-free in the pm8001 queue path, as reported in
the following:

https://lore.kernel.org/linux-scsi/c3cb7228-254e-9584-182b-007ac5e6fe0a@huawei.com/T/#m28c94c6d3ff582ec4a9fa54819180740e8bd4cfb

https://lore.kernel.org/linux-scsi/0cc0c435-b4f2-9c76-258d-865ba50a29dd@huawei.com/

[mkp: checkpatch + two SAS_TASK_AT_INITIATOR references]

Link: https://lore.kernel.org/r/1644489804-85730-3-git-send-email-john.garry@huawei.com
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-02-11 17:02:50 -05:00
John Garry
c763ec4c10 scsi: hisi_sas: Fix setting of hisi_sas_slot.is_internal
The hisi_sas_slot.is_internal member is not set properly for ATA commands
which the driver sends directly. A TMF struct pointer is normally used as a
test to set this, but it is NULL for those commands. It's not ideal, but
pass an empty TMF struct to set that member properly.

Link: https://lore.kernel.org/r/1643627607-138785-1-git-send-email-john.garry@huawei.com
Fixes: dc313f6b125b ("scsi: hisi_sas: Factor out task prep and delivery code")
Reported-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-01-31 16:50:59 -05:00
Christophe JAILLET
8001fa240f scsi: hisi_sas: Remove useless DMA-32 fallback configuration
As stated in [1], dma_set_mask() with a 64-bit mask never fails if
dev->dma_mask is non-NULL. So, if it fails, the 32-bit case will also fail
for the same reason.

Simplify code and remove some dead code accordingly.

[1]: https://lore.kernel.org/linux-kernel/YL3vSPK5DXTNvgdx@infradead.org/#t

Link: https://lore.kernel.org/r/1bf2d3660178b0e6f172e5208bc0bd68d31d9268.1642237482.git.christophe.jaillet@wanadoo.fr
Acked-by: John Garry <john.garry@huawei.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-01-24 23:30:29 -05:00
Xiang Chen
5d9224fb07 scsi: hisi_sas: Remove unused variable and check in hisi_sas_send_ata_reset_each_phy()
In commit 29e2bac87421 ("scsi: hisi_sas: Fix some issues related to
asd_sas_port->phy_list"), we use asd_sas_port->phy_mask instead of
accessing asd_sas_port->phy_list, and it is enough to use
asd_sas_port->phy_mask to check the state of phy, so remove the unused
check and variable.

Link: https://lore.kernel.org/r/1641300126-53574-1-git-send-email-chenxiang66@hisilicon.com
Fixes: 29e2bac87421 ("scsi: hisi_sas: Fix some issues related to asd_sas_port->phy_list")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Colin King <colin.i.king@gmail.com>
Acked-by: John Garry <john.garry@huawei.com>
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2022-01-07 09:09:39 -05:00
Xiang Chen
ae9b69e85e scsi: hisi_sas: Keep controller active between ISR of phyup and the event being processed
It is possible that controller may become suspended between processing a
phyup interrupt and the event being processed by libsas. As such, we can't
ensure the controller is active when processing the phyup event - this may
cause the phyup event to be lost or other issues.  To avoid any possible
issues, add pm_runtime_get_noresume() in phyup interrupt handler and
pm_runtime_put_sync() in the work handler exit to ensure that we stay
always active. Since we only want to call pm_runtime_get_noresume() for v3
hw, signal this will a new event, HISI_PHYE_PHY_UP_PM.

Link: https://lore.kernel.org/r/1639999298-244569-14-git-send-email-chenxiang66@hisilicon.com
Acked-by: John Garry <john.garry@huawei.com>
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-22 23:38:31 -05:00
Xiang Chen
29e2bac874 scsi: hisi_sas: Fix some issues related to asd_sas_port->phy_list
Most places that use asd_sas_port->phy_list are protected by spinlock
asd_sas_port->phy_list_lock, however there are still some places which miss
grabbing the lock. Add it in function hisi_sas_refresh_port_id() when
accessing asd_sas_port->phy_list. This carries a risk that list mutates
while at the same time dropping the lock in function
hisi_sas_send_ata_reset_each_phy(). Read asd_sas_port->phy_mask instead of
accessing asd_sas_port->phy_list to avoid this risk.

Link: https://lore.kernel.org/r/1639999298-244569-6-git-send-email-chenxiang66@hisilicon.com
Acked-by: John Garry <john.garry@huawei.com>
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-22 23:38:29 -05:00
John Garry
6cc7390877 scsi: Revert "scsi: hisi_sas: Filter out new PHY up events during suspend"
This reverts commit b14a37e011d829404c29a5ae17849d7efb034893.

In that commit, we had to filter out phy-up events during suspend, as it
work cause a deadlock between processing the phyup event and the resume HA
function try to drain the HA event workqueue to complete the resume
process.

Now that we no longer try to drain the HA event queue during the HA resume
processor, the deadlock would not occur, so remove the special handling for
it.

Link: https://lore.kernel.org/r/1639999298-244569-3-git-send-email-chenxiang66@hisilicon.com
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-22 23:38:29 -05:00
Qi Liu
37310bad7f scsi: hisi_sas: Fix phyup timeout on FPGA
The OOB interrupt and phyup interrupt handlers may run out-of-order in high
CPU usage scenarios. Since the hisi_sas_phy.timer is added in
hisi_sas_phy_oob_ready() and disarmed in phy_up_v3_hw(), this out-of-order
execution will cause hisi_sas_phy.timer timeout to trigger.

To solve, protect hisi_sas_phy.timer and .attached with a lock, and ensure
that the timer won't be added after phyup handler completes.

Link: https://lore.kernel.org/r/1639579061-179473-8-git-send-email-john.garry@huawei.com
Signed-off-by: Qi Liu <liuqi115@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-16 22:59:58 -05:00
Qi Liu
16775db613 scsi: hisi_sas: Prevent parallel FLR and controller reset
If we issue a controller reset command during executing a FLR a hung task
may be found:

 Call trace:
  __switch_to+0x158/0x1cc
  __schedule+0x2e8/0x85c
  schedule+0x7c/0x110
  schedule_timeout+0x190/0x1cc
  __down+0x7c/0xd4
  down+0x5c/0x7c
  hisi_sas_task_exec+0x510/0x680 [hisi_sas_main]
  hisi_sas_queue_command+0x24/0x30 [hisi_sas_main]
  smp_execute_task_sg+0xf4/0x23c [libsas]
  sas_smp_phy_control+0x110/0x1e0 [libsas]
  transport_sas_phy_reset+0xc8/0x190 [libsas]
  phy_reset_work+0x2c/0x40 [libsas]
  process_one_work+0x1dc/0x48c
  worker_thread+0x15c/0x464
  kthread+0x160/0x170
  ret_from_fork+0x10/0x18

This is a race condition which occurs when the FLR completes first.

Here the host HISI_SAS_RESETTING_BIT flag out gets of sync as
HISI_SAS_RESETTING_BIT is not always cleared with the hisi_hba.sem held, so
now only set/unset HISI_SAS_RESETTING_BIT under hisi_hba.sem .

Link: https://lore.kernel.org/r/1639579061-179473-7-git-send-email-john.garry@huawei.com
Signed-off-by: Qi Liu <liuqi115@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-16 22:59:58 -05:00
Qi Liu
20c634932a scsi: hisi_sas: Prevent parallel controller reset and control phy command
A user may issue a control phy command from sysfs at any time, even if the
controller is resetting.

If a phy is disabled by hardreset/linkreset command before calling
get_phys_state() in the reset path, the saved phy state may be incorrect.

To avoid incorrectly recording the phy state, use hisi_hba.sem to ensure
that the controller reset may not run at the same time as when the phy
control function is running.

Link: https://lore.kernel.org/r/1639579061-179473-6-git-send-email-john.garry@huawei.com
Signed-off-by: Qi Liu <liuqi115@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-16 22:59:57 -05:00
John Garry
dc313f6b12 scsi: hisi_sas: Factor out task prep and delivery code
The task prep code is the same between the normal path (in
hisi_sas_task_prep()) and the internal abort path, so factor is out into a
common function.

Link: https://lore.kernel.org/r/1639579061-179473-5-git-send-email-john.garry@huawei.com
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-16 22:59:57 -05:00
John Garry
08c61b5d90 scsi: hisi_sas: Pass abort structure for internal abort
To help factor out code in future, it's useful to know if we're executing
an internal abort, so pass a pointer to the structure. The idea is that a
NULL pointer means not an internal abort.

Link: https://lore.kernel.org/r/1639579061-179473-4-git-send-email-john.garry@huawei.com
Reviewed-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-16 22:59:57 -05:00
John Garry
934385a4fd scsi: hisi_sas: Make internal abort have no task proto
For an internal abort, the task does not have a protocol, so set to none.

This will make it easier to differentiate internal abort tasks in future.

Link: https://lore.kernel.org/r/1639579061-179473-3-git-send-email-john.garry@huawei.com
Reviewed-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-16 22:59:57 -05:00
John Garry
0e4620856b scsi: hisi_sas: Start delivery hisi_sas_task_exec() directly
Currently we start delivery of commands to the DQ after returning from
hisi_sas_task_exec() with success.

Let's just start delivery directly in that function without having to check
if some local variable is set.

Link: https://lore.kernel.org/r/1639579061-179473-2-git-send-email-john.garry@huawei.com
Reviewed-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-16 22:59:57 -05:00
Christophe JAILLET
4d6942e266 scsi: hisi_sas: Use non-atomic bitmap functions when possible
All uses of the 'hisi_hba->slot_index_tags' bitmap are protected with the
'hisi_hba->lock' spinlock.

Prefer the non-atomic '__[set|clear]_bit()' functions to save a few cycles.

Link: https://lore.kernel.org/r/8ee33e463523db080e6a2c06f332e47abb69359b.1637961191.git.christophe.jaillet@wanadoo.fr
Acked-by: John Garry <john.garry@huawei.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-06 22:12:34 -05:00
Christophe JAILLET
d43efddf62 scsi: hisi_sas: Remove some useless code in hisi_sas_alloc()
The 'hisi_hba->slot_index_tags' bitmap is allocated with bitmap_zalloc() so
it is already cleared. There is no need to clear it another time, one bit
at a time.

Remove the corresponding useless code.

Link: https://lore.kernel.org/r/41c86e7e3e05a13bd586d8ee1b81296140b7a6eb.1637961191.git.christophe.jaillet@wanadoo.fr
Acked-by: John Garry <john.garry@huawei.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-06 22:12:33 -05:00
Christophe JAILLET
54585ec62f scsi: hisi_sas: Use devm_bitmap_zalloc() when applicable
'hisi_hba->slot_index_tags' is a bitmap. Use 'devm_bitmap_zalloc()' to
simplify code, improve the semantic, and avoid some open-coded arithmetic
in allocator arguments.

Link: https://lore.kernel.org/r/4afa3f71e66c941c660627c7f5b0223b51968ebb.1637961191.git.christophe.jaillet@wanadoo.fr
Acked-by: John Garry <john.garry@huawei.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-06 22:12:33 -05:00
Luo Jiaxing
21c7e97247 scsi: hisi_sas: Disable SATA disk phy for severe I_T nexus reset failure
If the softreset fails in the I_T reset, libsas will then continue to issue
a controller reset to try to recover.

However a faulty disk may cause the softreset to fail, and resetting the
controller will not help this scenario. Indeed, we will just continue the
cycle of error handle handling to try to recover.

So if the softreset fails upon certain conditions, just disable the phy
associated with the disk. The user needs to handle this problem.

Link: https://lore.kernel.org/r/1634041588-74824-5-git-send-email-john.garry@huawei.com
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-10-12 22:46:07 -04:00
Xiang Chen
046ab7d0f5 scsi: hisi_sas: Wait for phyup in hisi_sas_control_phy()
When issuing a hardreset/linkreset/phy_set_linkrate from sysfs, the phy
will be disabled and re-enabled for the directly attached scenario.

It takes some time for the phy to come back up after re-enabling the phy.
If the controller becomes suspended while waiting for the phy to come back,
the phy up may be lost (along with the disk).

To solve this problem, wait for the phy up to occur with a timeout. Indeed
this is already done in hisi_sas_debug_I_T_nexus_reset() for local phys, so
just relocate the functionality to hisi_sas_control_phy().

Since the HA workqueue is drained when suspending the controller, and the
phy control function is called from the same workqueue, we can guarantee
that the controller will not be suspended during this period.

Link: https://lore.kernel.org/r/1634041588-74824-3-git-send-email-john.garry@huawei.com
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-10-12 22:46:06 -04:00
Xiang Chen
36c6b7613e scsi: hisi_sas: Initialise devices in .slave_alloc callback
Perform driver-specific SCSI device initialization in the designated SCSI
midlayer callback instead of relying on the libsas "device found" callback.

The SCSI midlayer .slave_alloc interface is called prior to sending any I/O
to the device.

Link: https://lore.kernel.org/r/1634041588-74824-2-git-send-email-john.garry@huawei.com
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-10-12 22:46:06 -04:00
Xiang Chen
080b4f976b scsi: hisi_sas: Replace del_timer() calls with del_timer_sync()
Some usage of del_timer() in the driver is potentially unsafe.

When running the sas_task->slow_task timer in
hisi_sas_exec_internal_tmf_task(), execution may be blocked in function
hisi_sas_task_exec(); so it is possible that the timer is running when the
callback to disable the timer is running. This could be dangerous, as we
immediately release resources which the timer callback uses after disabling
the timer. The same situation may be found at other sites, such as
_hisi_sas_internal_task_abort().

Change calls to del_timer() to del_timer_sync() as necessary, to ensure any
timer has finished when disabling.

Also remove calls to timer_pending() prior to del_timer() as it is not
necessary.

Link: https://lore.kernel.org/r/1629799260-120116-5-git-send-email-john.garry@huawei.com
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-09-13 23:56:02 -04:00
Luo Jiaxing
b5a9fa20e3 scsi: hisi_sas: Rename HISI_SAS_{RESET -> RESETTING}_BIT
HISI_SAS_RESET_BIT means that the controller is being reset, and so the
name is a bit vague. Rename it to HISI_SAS_RESETTING_BIT.

Link: https://lore.kernel.org/r/1629799260-120116-4-git-send-email-john.garry@huawei.com
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-09-13 23:56:02 -04:00
Bart Van Assche
1effbface9 scsi: hisi_sas: Use scsi_cmd_to_rq() instead of scsi_cmnd.request
Prepare for removal of the request pointer by using scsi_cmd_to_rq()
instead. This patch does not change any functionality.

Link: https://lore.kernel.org/r/20210809230355.8186-22-bvanassche@acm.org
Acked-by: John Garry <john.garry@huawei.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-11 22:25:39 -04:00
Luo Jiaxing
e8a4d0daae scsi: hisi_sas: Speed up error handling when internal abort timeout occurs
If an internal task abort timeout occurs, the controller has developed a
fault, and needs to be reset to be recovered.

When this occurs during error handling, the current policy is to allow
error handling to continue, and the inevitable nexus ha reset will handle
the required reset.

However various steps of error handling need to taken before this happens.
These also involve some level of HW interaction, which will also fail with
various timeouts.

Speed up this process by recording a HW fault bit for an internal abort
timeout - when this is set, just automatically error any HW interaction,
and essentially go straight to clear nexus ha (to reset the controller).

Link: https://lore.kernel.org/r/1623058179-80434-6-git-send-email-john.garry@huawei.com
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-06-09 23:21:52 -04:00
Luo Jiaxing
63ece9eb35 scsi: hisi_sas: Reset controller for internal abort timeout
If an internal task abort timeout occurs, the controller has developed a
fault, and needs to be reset to be recovered. However if a timeout occurs
during SCSI error handling, issuing a controller reset immediately may
conflict with the error handling.

To handle internal abort in these two scenarios, only queue the reset when
not in an error handling function. In the case of a timeout during error
handling, do nothing and rely on the inevitable ha nexus reset to reset the
controller.

Link: https://lore.kernel.org/r/1623058179-80434-5-git-send-email-john.garry@huawei.com
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-06-09 23:21:51 -04:00
Luo Jiaxing
2f12a49951 scsi: hisi_sas: Include HZ in timer macros
Include HZ in timer macros to make the code more concise.

Link: https://lore.kernel.org/r/1623058179-80434-4-git-send-email-john.garry@huawei.com
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-06-09 23:21:51 -04:00
Luo Jiaxing
0f75733991 scsi: hisi_sas: Run I_T nexus resets in parallel for clear nexus reset
For a clear nexus reset operation, the I_T nexus resets are executed
serially for each device. For devices attached through an expander, this
may take 2s per device; so, in total, could take a long time.

Reduce the total time by running the I_T nexus resets in parallel through
async operations.

Link: https://lore.kernel.org/r/1623058179-80434-3-git-send-email-john.garry@huawei.com
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-06-09 23:21:51 -04:00
Luo Jiaxing
366da0da1f scsi: hisi_sas: Put a limit of link reset retries
If an OOB event is received but the phy still fails to come up, a link
reset will be issued repeatedly at an interval of 20s until the phy comes
up.

Set a limit for link reset issue retries to avoid printing the timeout
message endlessly.

Link: https://lore.kernel.org/r/1623058179-80434-2-git-send-email-john.garry@huawei.com
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-06-09 23:21:51 -04:00