IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit fdb827e4a3f84cb92e286a821114ac0ad79c8281 upstream.
Fix sparse warning:
drivers/scsi/lpfc/lpfc_nportdisc.c:344:1: warning:
symbol 'lpfc_defer_acc_rsp' was not declared. Should it be static?
Link: https://lore.kernel.org/r/20200107014956.41748-1-yuehaibing@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f7cb0d0945ebc9879aff72cf7b3342fd1040ffaa upstream.
Fix sparse warnings:
drivers/scsi/lpfc/lpfc_nportdisc.c:290:1: warning: symbol 'lpfc_defer_pt2pt_acc' was not declared. Should it be static?
Link: https://lore.kernel.org/r/1570183477-137273-1-git-send-email-zhengbin13@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reviewed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e5785d3ec32f5f44dd88cd7b398e496742630469 upstream.
Commit 9816ef6ecbc1 ("scsi: lpfc: Use after free in lpfc_rq_buf_free()")
was made to correct a use after free condition in lpfc_rq_buf_free().
Unfortunately, a subsequent patch cut on a tree without the fix
inadvertently reverted the fix.
Put the fix back: Move the freeing of the rqb_entry to after the print
function that references it.
Link: https://lore.kernel.org/r/20201020202719.54726-4-james.smart@broadcom.com
Fixes: 411de511c694 ("scsi: lpfc: Fix RQ empty firmware trap")
Cc: <stable@vger.kernel.org> # v4.17+
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 62e3a931db60daf94fdb3159d685a5bc6ad4d0cf upstream.
The following calltrace was seen:
BUG: sleeping function called from invalid context at mm/slab.h:494
...
Call Trace:
dump_stack+0x9a/0xf0
___might_sleep.cold.63+0x13d/0x178
slab_pre_alloc_hook+0x6a/0x90
kmem_cache_alloc_trace+0x3a/0x2d0
lpfc_sli4_nvmet_alloc+0x4c/0x280 [lpfc]
lpfc_post_rq_buffer+0x2e7/0xa60 [lpfc]
lpfc_sli4_hba_setup+0x6b4c/0xa4b0 [lpfc]
lpfc_pci_probe_one_s4.isra.15+0x14f8/0x2280 [lpfc]
lpfc_pci_probe_one+0x260/0x2880 [lpfc]
local_pci_probe+0xd4/0x180
work_for_cpu_fn+0x51/0xa0
process_one_work+0x8f0/0x17b0
worker_thread+0x536/0xb50
kthread+0x30c/0x3d0
ret_from_fork+0x3a/0x50
A prior patch introduced a spin_lock_irqsave(hbalock) in the
lpfc_post_rq_buffer() routine. Call trace is seen as the hbalock is held
with interrupts disabled during a GFP_KERNEL allocation in
lpfc_sli4_nvmet_alloc().
Fix by reordering locking so that hbalock not held when calling
sli4_nvmet_alloc() (aka rqb_buf_list()).
Link: https://lore.kernel.org/r/20201020202719.54726-2-james.smart@broadcom.com
Fixes: 411de511c694 ("scsi: lpfc: Fix RQ empty firmware trap")
Cc: <stable@vger.kernel.org> # v4.17+
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7f04839ec4483563f38062b4dd90253e45447198 upstream.
Initial FLOGIs are failing with the following message:
lpfc 0000:13:00.1: 1:(0):0820 FLOGI Failed (x300). BBCredit Not Supported
In a prior patch, the READ_SPARAM command was re-ordered to post after
CONFIG_LINK as the driver is expected to update the driver's copy of the
service parameters for the FLOGI payload. If the bb-credit recovery feature
is enabled, this is fine. But on adapters were bb-credit recovery isn't
enabled, it would cause the FLOGI to fail.
Fix by restoring the original command order (READ_SPARAM before
CONFIG_LINK), and after issuing CONFIG_LINK, detect bb-credit recovery
support and reissuing READ_SPARAM to obtain the updated service parameters
(effectively adding in the fix command order).
[mkp: corrected SHA]
Link: https://lore.kernel.org/r/20200911200147.110826-1-james.smart@broadcom.com
Fixes: 835214f5d5f5 ("scsi: lpfc: Fix broken Credit Recovery after driver load")
CC: <stable@vger.kernel.org> # v5.7+
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4cb9e1ddaa145be9ed67b6a7de98ca705a43f998 ]
Coverity reported a memory corruption error for the fdmi attributes
routines:
CID 15768 [Memory Corruption] Out-of-bounds access on FDMI
Sloppy coding of the fmdi structures. In both the lpfc_fdmi_attr_def and
lpfc_fdmi_reg_port_list structures, a field was placed at the start of
payload that may have variable content. The field was given an arbitrary
type (uint32_t). The code then uses the field name to derive an address,
which it used in things such as memset and memcpy. The memset sizes or
memcpy lengths were larger than the arbitrary type, thus coverity reported
an error.
Fix by replacing the arbitrary fields with the real field structures
describing the payload.
Link: https://lore.kernel.org/r/20200128002312.16346-8-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 821bc882accaaaf1bbecf5c0ecef659443e3e8cb ]
When performing reset testing, the eq's list for related hwqs was getting
corrupted. In cases where there is not a 1:1 eq to hwq, the eq is
shared. The eq maintains a list of hwqs utilizing it in case of cpu
offlining and polling. During the reset, the hwqs are being torn down so
they can be recreated. The recreation was getting confused by seeing a
non-null eq assignment on the eq and the eq list became corrupt.
Correct by clearing the hdwq eq assignment when the hwq is cleaned up.
Link: https://lore.kernel.org/r/20200128002312.16346-6-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 39c4f1a965a9244c3ba60695e8ff8da065ec6ac4 ]
The driver is occasionally seeing the following SLI Port error, requiring
reset and reinit:
Port Status Event: ... error 1=0x52004a01, error 2=0x218
The failure means an RQ timeout. That is, the adapter had received
asynchronous receive frames, ran out of buffer slots to place the frames,
and the driver did not replenish the buffer slots before a timeout
occurred. The driver should not be so slow in replenishing buffers that a
timeout can occur.
When the driver received all the frames of a sequence, it allocates an IOCB
to put the frames in. In a situation where there was no IOCB available for
the frame of a sequence, the RQ buffer corresponding to the first frame of
the sequence was not returned to the FW. Eventually, with enough traffic
encountering the situation, the timeout occurred.
Fix by releasing the buffer back to firmware whenever there is no IOCB for
the first frame.
[mkp: typo]
Link: https://lore.kernel.org/r/20200128002312.16346-2-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit be0709e449ac9d9753a5c17e5b770d6e5e930e4a ]
NVMe device re-discovery does not complete. Dev_loss_tmo messages seen on
initiator after recovery from a link disturbance.
The failing case is the following:
When the driver (as a NVME target) receives a PLOGI, the driver initiates
an "unreg rpi" mailbox command. While the mailbox command is in progress,
the driver requests that an ACC be sent to the initiator. The target's ACC
is received by the initiator and the initiator then transmits a PLOGI. The
driver receives the PLOGI prior to receiving the completion for the PLOGI
response WQE that sent the ACC. (Different delivery sources from the hw so
the race is very possible). Given the PLOGI is prior to the ACC completion
(signifying PLOGI exchange complete), the driver LS_RJT's the PRLI. The
"unreg rpi" mailbox then completes. Since PRLI has been received, the
driver transmits a PLOGI to restart discovery, which the initiator then
ACC's. If the driver processes the (re)PLOGI ACC prior to the completing
the handling for the earlier ACC it sent the intiators original PLOGI,
there is no state change for completion of the (re)PLOGI. The ndlp remains
in "PLOGI Sent" and the initiator continues sending PRLI's which are
rejected by the target until timeout or retry is reached.
Fix by: When in target mode, defer sending an ACC for the received PLOGI
until unreg RPI completes.
Link: https://lore.kernel.org/r/20191218235808.31922-2-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6c1e803eac846f886cd35131e6516fc51a8414b9 ]
When reading sysfs nvme_info file while a remote port leaves and comes
back, a NULL pointer is encountered. The issue is due to ndlp list
corruption as the the nvme_info_show does not use the same lock as the rest
of the code.
Correct by removing the rcu_xxx_lock calls and replace by the host_lock and
phba->hbaLock spinlocks that are used by the rest of the driver. Given
we're called from sysfs, we are safe to use _irq rather than _irqsave.
Link: https://lore.kernel.org/r/20191105005708.7399-4-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 359e10f087dbb7b9c9f3035a8cc4391af45bd651 ]
After exchanging PLOGI on an SLI-3 adapter, the PRLI exchange failed. Link
trace showed the port was assigned a non-zero n_port_id, but didn't use the
address on the PRLI. The assigned address is set on the port by the
CONFIG_LINK mailbox command. The driver responded to the PRLI before the
mailbox command completed. Thus the PRLI response used the old n_port_id.
Defer the PRLI response until CONFIG_LINK completes.
Link: https://lore.kernel.org/r/20190922035906.10977-2-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7b08e89f98cee9907895fabb64cf437bc505ce9a ]
The driver is unable to successfully login with remote device. During pt2pt
login, the driver completes its FLOGI request with the remote device having
WWN precedence. The remote device issues its own (delayed) FLOGI after
accepting the driver's and, upon transmitting the FLOGI, immediately
recognizes it has already processed the driver's FLOGI thus it transitions
to sending a PLOGI before waiting for an ACC to its FLOGI.
In the driver, the FLOGI is received and an ACC sent, followed by the PLOGI
being received and an ACC sent. The issue is that the PLOGI reception
occurs before the response from the adapter from the FLOGI ACC is
received. Processing of the PLOGI sets state flags to perform the REG_RPI
mailbox command and proceed with the rest of discovery on the port. The
same completion routine used by both FLOGI and PLOGI is generic in
nature. One of the things it does is clear flags, and those flags happen to
drive the rest of discovery. So what happened was the PLOGI processing set
the flags, the FLOGI ACC completion cleared them, thus when the PLOGI ACC
completes it doesn't see the flags and stops.
Fix by modifying the generic completion routine to not clear the rest of
discovery flag (NLP_ACC_REGLOGIN) unless the completion is also associated
with performing a mailbox command as part of its handling. For things such
as FLOGI ACC, there isn't a subsequent action to perform with the adapter,
thus there is no mailbox cmd ptr. PLOGI ACC though will perform REG_RPI
upon completion, thus there is a mailbox cmd ptr.
Link: https://lore.kernel.org/r/20200828175332.130300-3-james.smart@broadcom.com
Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 03dbfe0668e6692917ac278883e0586cd7f7d753 ]
When vports are deleted, it is observed that there is memory/kthread
leakage as the vport isn't fully being released.
There is a shost reference taken in scsi_add_host_dma that is not released
during scsi_remove_host. It was noticed that other drivers resolve this by
doing a scsi_host_put after calling scsi_remove_host.
The vport_delete routine is taking two references one that corresponds to
an access to the scsi_host in the vport_delete routine and another that is
released after the adapter mailbox command completes that destroys the VPI
that corresponds to the vport.
Remove one of the references taken such that the second reference that is
put will complete the missing scsi_add_host_dma reference and the shost
will be terminated.
Link: https://lore.kernel.org/r/20200630215001.70793-8-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit af6de8c60fe9433afa73cea6fcccdccd98ad3e5e ]
We cannot wait on a completion object in the lpfc_nvme_targetport structure
in the _destroy_targetport() code path because the NVMe/fc transport will
free that structure immediately after the .targetport_delete() callback.
This results in a use-after-free, and a crash if slub_debug=FZPU is
enabled.
An earlier fix put put the completion on the stack, but commit 2a0fb340fcc8
("scsi: lpfc: Correct localport timeout duration error") subsequently
changed the code to reference the completion through a pointer in the
object rather than the local stack variable. Fix this by using the stack
variable directly.
Link: https://lore.kernel.org/r/20200729231011.13240-1-emilne@redhat.com
Fixes: 2a0fb340fcc8 ("scsi: lpfc: Correct localport timeout duration error")
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 46da547e21d6cefceec3fb3dba5ebbca056627fc ]
Commit cdb42becdd40 ("scsi: lpfc: Replace io_channels for nvme and fcp with
general hdw_queues per cpu") has introduced static checker warnings for
potential null dereferences in 'lpfc_sli4_hba_unset()' and commit 1ffdd2c0440d
("scsi: lpfc: resolve static checker warning in lpfc_sli4_hba_unset") has
tried to fix it. However, yet another potential null dereference is
remaining. This commit fixes it.
This bug was discovered and resolved using Coverity Static Analysis
Security Testing (SAST) by Synopsys, Inc.
Link: https://lore.kernel.org/r/20200623084122.30633-1-sjpark@amazon.com
Fixes: 1ffdd2c0440d ("scsi: lpfc: resolve static checker warning inlpfc_sli4_hba_unset")
Fixes: cdb42becdd40 ("scsi: lpfc: Replace io_channels for nvme and fcp with general hdw_queues per cpu")
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7217e6e694da3aae6d17db8a7f7460c8d4817ebf ]
In order to create or activate a new node, lpfc_els_unsol_buffer() invokes
lpfc_nlp_init() or lpfc_enable_node() or lpfc_nlp_get(), all of them will
return a reference of the specified lpfc_nodelist object to "ndlp" with
increased refcnt.
When lpfc_els_unsol_buffer() returns, local variable "ndlp" becomes
invalid, so the refcount should be decreased to keep refcount balanced.
The reference counting issue happens in one exception handling path of
lpfc_els_unsol_buffer(). When "ndlp" in DEV_LOSS, the function forgets to
decrease the refcnt increased by lpfc_nlp_init() or lpfc_enable_node() or
lpfc_nlp_get(), causing a refcnt leak.
Fix this issue by calling lpfc_nlp_put() when "ndlp" in DEV_LOSS.
Link: https://lore.kernel.org/r/1590416184-52592-1-git-send-email-xiyuyang19@fudan.edu.cn
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit f809da6db68a8be49e317f0ccfbced1af9258839 upstream.
Implementation of a previous patch added a condition to an if check that
always end up with the if test being true. Execution of the else clause was
inadvertently negated. The additional condition check was incorrect and
unnecessary after the other modifications had been done in that patch.
Remove the check from the if series.
Link: https://lore.kernel.org/r/20200501214310.91713-5-jsmart2021@gmail.com
Fixes: b95b21193c85 ("scsi: lpfc: Fix loss of remote port after devloss due to lack of RPIs")
Cc: <stable@vger.kernel.org> # v5.4+
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 807e7353d8a7105ce884d22b0dbc034993c6679c ]
Kernel is crashing with the following stacktrace:
BUG: unable to handle kernel NULL pointer dereference at
00000000000005bc
IP: lpfc_nvme_register_port+0x1a8/0x3a0 [lpfc]
...
Call Trace:
lpfc_nlp_state_cleanup+0x2b2/0x500 [lpfc]
lpfc_nlp_set_state+0xd7/0x1a0 [lpfc]
lpfc_cmpl_prli_prli_issue+0x1f7/0x450 [lpfc]
lpfc_disc_state_machine+0x7a/0x1e0 [lpfc]
lpfc_cmpl_els_prli+0x16f/0x1e0 [lpfc]
lpfc_sli_sp_handle_rspiocb+0x5b2/0x690 [lpfc]
lpfc_sli_handle_slow_ring_event_s4+0x182/0x230 [lpfc]
lpfc_do_work+0x87f/0x1570 [lpfc]
kthread+0x10d/0x130
ret_from_fork+0x35/0x40
During target side fault injections, it is possible to hit the
NLP_WAIT_FOR_UNREG case in lpfc_nvme_remoteport_delete. A prior commit
fixed a rebind and delete race condition, but called lpfc_nlp_put
unconditionally. This triggered a deletion and the crash.
Fix by movng nlp_put to inside the NLP_WAIT_FOR_UNREG case, where the nlp
will be being unregistered/removed. Leave the reference if the flag isn't
set.
Link: https://lore.kernel.org/r/20200322181304.37655-8-jsmart2021@gmail.com
Fixes: b15bd3e6212e ("scsi: lpfc: Fix nvme remoteport registration race conditions")
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4cd70891308dfb875ef31060c4a4aa8872630a2e ]
Injecting EEH on a 32GB card is causing kernel oops
The pci error handler is doing an IO flush and the offline code is also
doing an IO flush. When the 1st flush is complete the hdwq is destroyed
(freed), yet the second flush accesses the hdwq and crashes.
Added a check in lpfc_sli4_fush_io_rings to check both the HBA_IOQ_FLUSH
flag and the hdwq pointer to see if it is already set and not already
freed.
Link: https://lore.kernel.org/r/20200322181304.37655-6-jsmart2021@gmail.com
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 38503943c89f0bafd9e3742f63f872301d44cbea ]
The following kasan bug was called out:
BUG: KASAN: slab-out-of-bounds in lpfc_unreg_login+0x7c/0xc0 [lpfc]
Read of size 2 at addr ffff889fc7c50a22 by task lpfc_worker_3/6676
...
Call Trace:
dump_stack+0x96/0xe0
? lpfc_unreg_login+0x7c/0xc0 [lpfc]
print_address_description.constprop.6+0x1b/0x220
? lpfc_unreg_login+0x7c/0xc0 [lpfc]
? lpfc_unreg_login+0x7c/0xc0 [lpfc]
__kasan_report.cold.9+0x37/0x7c
? lpfc_unreg_login+0x7c/0xc0 [lpfc]
kasan_report+0xe/0x20
lpfc_unreg_login+0x7c/0xc0 [lpfc]
lpfc_sli_def_mbox_cmpl+0x334/0x430 [lpfc]
...
When processing the completion of a "Reg Rpi" login mailbox command in
lpfc_sli_def_mbox_cmpl, a call may be made to lpfc_unreg_login. The vpi is
extracted from the completing mailbox context and passed as an input for
the next. However, the vpi stored in the mailbox command context is an
absolute vpi, which for SLI4 represents both base + offset. When used with
a non-zero base component, (function id > 0) this results in an
out-of-range access beyond the allocated phba->vpi_ids array.
Fix by subtracting the function's base value to get an accurate vpi number.
Link: https://lore.kernel.org/r/20200322181304.37655-2-jsmart2021@gmail.com
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d480e57809a043333a3b9e755c0bdd43e10a9f12 ]
Compilation can fail due to having an inline function reference where the
function body is not present.
Fix by removing the inline tag.
Fixes: 93a4d6f40198 ("scsi: lpfc: Add registration for CPU Offline/Online events")
Link: https://lore.kernel.org/r/20191111230401.12958-4-jsmart2021@gmail.com
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 835214f5d5f516a38069bc077c879c7da00d6108 ]
When driver is set to enable bb credit recovery, the switch displayed the
setting as inactive. If the link bounces, it switches to Active.
During link up processing, the driver currently does a MBX_READ_SPARAM
followed by a MBX_CONFIG_LINK. These mbox commands are queued to be
executed, one at a time and the completion is processed by the worker
thread. Since the MBX_READ_SPARAM is done BEFORE the MBX_CONFIG_LINK, the
BB_SC_N bit is never set the the returned values. BB Credit recovery status
only gets set after the driver requests the feature in CONFIG_LINK, which
is done after the link up. Thus the ordering of READ_SPARAM needs to follow
the CONFIG_LINK.
Fix by reordering so that READ_SPARAM is done after CONFIG_LINK. Added a
HBA_DEFER_FLOGI flag so that any FLOGI handling waits until after the
READ_SPARAM is done so that the proper BB credit value is set in the FLOGI
payload.
Fixes: 6bfb16208298 ("scsi: lpfc: Fix configuration of BB credit recovery in service parameters")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20200128002312.16346-4-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6bfb1620829825c01e1dcdd63b6a7700352babd9 ]
The driver today is reading service parameters from the firmware and then
overwriting the firmware-provided values with values of its own. There are
some switch features that require preliminary FLOGI's that are
switch-specific and done prior to the actual fabric FLOGI for traffic. The
fw will perform those FLOGIs and will revise the service parameters for the
features configured. As the driver later overwrites those values with its
own values, it misconfigures things like BBSCN use by doing so.
Correct by eliminating the driver-overwrite of firmware values. The driver
correctly re-reads the service parameters after each link up to obtain the
latest values from firmware.
Link: https://lore.kernel.org/r/20191105005708.7399-3-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e3ba04c9bad1d1c7f15df43da25e878045150777 ]
There are reports of multiple ports on the same system displaying different
hostnames in fabric FDMI displays.
Currently, the driver registers the hostname at initialization and obtains
the hostname via init_utsname()->nodename queried at the time the FC link
comes up. Unfortunately, if the machine hostname is updated after
initialization, such as via DHCP or admin command, the value registered
initially will be incorrect.
Fix by having the driver save the hostname that was registered with FDMI.
The driver then runs a heartbeat action that will check the hostname. If
the name changes, reregister the FMDI data.
The hostname is used in RSNN_NN, FDMI RPA and FDMI RHBA.
Link: https://lore.kernel.org/r/20191218235808.31922-5-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 93a4d6f40198dffcca35d9a928c409f9290f1fe0 ]
The recent affinitization didn't address cpu offlining/onlining. If an
interrupt vector is shared and the low order cpu owning the vector is
offlined, as interrupts are managed, the vector is taken offline. This
causes the other CPUs sharing the vector will hang as they can't get io
completions.
Correct by registering callbacks with the system for Offline/Online
events. When a cpu is taken offline, its eq, which is tied to an interrupt
vector is found. If the cpu is the "owner" of the vector and if the
eq/vector is shared by other CPUs, the eq is placed into a polled mode.
Additionally, code paths that perform io submission on the "sharing CPUs"
will check the eq state and poll for completion after submission of new io
to a wq that uses the eq.
Similarly, when a cpu comes back online and owns an offlined vector, the eq
is taken out of polled mode and rearmed to start driving interrupts for eq.
Link: https://lore.kernel.org/r/20191105005708.7399-9-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 0ab384a49c548baf132ccef249f78d9c6c506380 upstream.
If a call to lpfc_get_cmd_rsp_buf_per_hdwq returns NULL (memory allocation
failure), a previously allocated lpfc_io_buf resource is leaked.
Fix by releasing the lpfc_io_buf resource in the failure path.
Fixes: d79c9e9d4b3d ("scsi: lpfc: Support dynamic unbounded SGL lists on G7 hardware.")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20200128002312.16346-3-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8c5c660529209a0e324c1c1a35ce3f83d67a2aa5 upstream.
The original patch was to resolve the lldd being able to be unloaded
while being used to talk to the boot device of the system. However, the
end result of the original patch is that any driver unload while a nvme
controller is live via the lldd is now being prohibited. Given the module
reference, the module teardown routine can't be called, thus there's no
way, other than manual actions to terminate the controllers.
Fixes: 863fbae929c7 ("nvme_fc: add module to ops template to allow module references")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit df9166bfa7750bade5737ffc91fbd432e0354442 ]
This patch reworks the fdmi symbolic node name data for the following two
issues:
- Correcting extraneous periods following the DV and HN fdmi data fields.
- Avoiding buffer overflow issues when formatting the data.
The fix to the fist issue is to just remove the characters.
The fix to the second issue has all data being staged in temporary storage
before being moved to the real buffer.
Link: https://lore.kernel.org/r/20191218235808.31922-3-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4583a4f66b323c6e4d774be2649e83a4e7c7b78c ]
Looking at the recent conversion from smp_processor_id() to
raw_smp_processor_id(), realized that the allocation should be based on the
cpu the hdwq is bound to, not the executing cpu.
Revise to pull cpu number from the hdwq
Fixes: 765ab6cdac3b ("scsi: lpfc: Fix a kernel warning triggered by lpfc_get_sgl_per_hdwq()")
Link: https://lore.kernel.org/r/20191116003847.6141-1-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 765ab6cdac3b681952da0e22184bf6cf1ae41cf8 upstream.
Fix the following kernel bug report:
BUG: using smp_processor_id() in preemptible [00000000] code: systemd-udevd/954
Fixes: d79c9e9d4b3d ("scsi: lpfc: Support dynamic unbounded SGL lists on G7 hardware.")
Link: https://lore.kernel.org/r/20191107052158.25788-2-bvanassche@acm.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a4c21acca2be6729ecbe72eda9b08092725b0a77 upstream.
Many of the sgl-per-hdwq paths are locking with spin_lock_irq() and
spin_unlock_irq() and may unwittingly raising irq when it shouldn't. Hard
deadlocks were seen around lpfc_scsi_prep_cmnd().
Fix by converting the locks to irqsave/irqrestore.
Fixes: d79c9e9d4b3d ("scsi: lpfc: Support dynamic unbounded SGL lists on G7 hardware.")
Link: https://lore.kernel.org/r/20190922035906.10977-16-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 35a635af54ce79881eb35ba20b64dcb1e81b0389 upstream.
In lpfc_release_io_buf, an lpfc_io_buf is returned to the 'available' pool
before any associated sgl or cmd and rsp buffers are returned via their
respective 'put' routines. If xri rebalancing occurs and an lpfc_io_buf
structure is reused quickly, there may be a race condition between release
of old and association of new resources.
Re-ordered lpfc_release_io_buf to release sgl and cmd/rsp
buffer lists before releasing the lpfc_io_buf structure for re-use.
Fixes: d79c9e9d4b3d ("scsi: lpfc: Support dynamic unbounded SGL lists on G7 hardware.")
Link: https://lore.kernel.org/r/20190922035906.10977-17-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 97acd0019d5dadd9c0e111c2083c889bfe548f25 upstream.
A prior use-after-free mailbox fix solved it's problem by null'ing a ndlp
pointer. However, further testing has shown that this change causes a
later state change to occasionally be skipped, which results in a reference
count never being decremented thus the rpi is never released, which causes
a vport delete to never succeed.
Revise the fix in the prior patch to no longer null the ndlp. Instead the
RELEASE_RPI flag is set which will drive the release of the rpi.
Given the new code was added at a deep indentation level, refactor the code
block using a new routine that avoids the indentation issues.
Fixes: 9b1640686470 ("scsi: lpfc: Fix use-after-free mailbox cmd completion")
Link: https://lore.kernel.org/r/20190922035906.10977-6-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9a1b0b9a6dab452fb0e39fe96880c4faf3878369 ]
When phba->mbox_ext_buf_ctx.seqNum != phba->mbox_ext_buf_ctx.numBuf,
dd_data should be freed before return SLI_CONFIG_HANDLED.
When lpfc_sli_issue_mbox func return fails, pmboxq should be also freed in
job_error tag.
Link: https://lore.kernel.org/r/EDBAAA0BBBA2AC4E9C8B6B81DEEE1D6915E7A966@DGGEML525-MBS.china.huawei.com
Signed-off-by: Bo Wu <wubo40@huawei.com>
Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 863fbae929c7a5b64e96b8a3ffb34a29eefb9f8f ]
In nvme-fc: it's possible to have connected active controllers
and as no references are taken on the LLDD, the LLDD can be
unloaded. The controller would enter a reconnect state and as
long as the LLDD resumed within the reconnect timeout, the
controller would resume. But if a namespace on the controller
is the root device, allowing the driver to unload can be problematic.
To reload the driver, it may require new io to the boot device,
and as it's no longer connected we get into a catch-22 that
eventually fails, and the system locks up.
Fix this issue by taking a module reference for every connected
controller (which is what the core layer did to the transport
module). Reference is cleared when the controller is removed.
Acked-by: Himanshu Madhani <hmadhani@marvell.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6c6d59e0fe5b86cf273d6d744a6a9768c4ecc756 ]
Coverity reported the following:
*** CID 101747: Null pointer dereferences (FORWARD_NULL)
/drivers/scsi/lpfc/lpfc_els.c: 4439 in lpfc_cmpl_els_rsp()
4433 kfree(mp);
4434 }
4435 mempool_free(mbox, phba->mbox_mem_pool);
4436 }
4437 out:
4438 if (ndlp && NLP_CHK_NODE_ACT(ndlp)) {
vvv CID 101747: Null pointer dereferences (FORWARD_NULL)
vvv Dereferencing null pointer "shost".
4439 spin_lock_irq(shost->host_lock);
4440 ndlp->nlp_flag &= ~(NLP_ACC_REGLOGIN | NLP_RM_DFLT_RPI);
4441 spin_unlock_irq(shost->host_lock);
4442
4443 /* If the node is not being used by another discovery thread,
4444 * and we are sending a reject, we are done with it.
Fix by adding a check for non-null shost in line 4438.
The scenario when shost is set to null is when ndlp is null.
As such, the ndlp check present was sufficient. But better safe
than sorry so add the shost check.
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 101747 ("Null pointer dereferences")
Fixes: 2e0fef85e098 ("[SCSI] lpfc: NPIV: split ports")
CC: James Bottomley <James.Bottomley@SteelEye.com>
CC: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
CC: linux-next@vger.kernel.org
Link: https://lore.kernel.org/r/20191111230401.12958-3-jsmart2021@gmail.com
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7cfd5639d99bec0d27af089d0c8c114330e43a72 ]
If the driver receives a login that is later then LOGO'd by the remote port
(aka ndlp), the driver, upon the completion of the LOGO ACC transmission,
will logout the node and unregister the rpi that is being used for the
node. As part of the unreg, the node's rpi value is replaced by the
LPFC_RPI_ALLOC_ERROR value. If the port is subsequently offlined, the
offline walks the nodes and ensures they are logged out, which possibly
entails unreg'ing their rpi values. This path does not validate the node's
rpi value, thus doesn't detect that it has been unreg'd already. The
replaced rpi value is then used when accessing the rpi bitmask array which
tracks active rpi values. As the LPFC_RPI_ALLOC_ERROR value is not a valid
index for the bitmask, it may fault the system.
Revise the rpi release code to detect when the rpi value is the replaced
RPI_ALLOC_ERROR value and ignore further release steps.
Link: https://lore.kernel.org/r/20191105005708.7399-2-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2332e6e475b016e2026763f51333f84e2e6c57a3 ]
During heavy RCN activity and log_verbose = 0 we see these messages:
2754 PRLI failure DID:521245 Status:x9/xb2c00, data: x0
0231 RSCN timeout Data: x0 x3
0230 Unexpected timeout, hba link state x5
This is due to delayed RSCN activity.
Correct by avoiding the timeout thus the messages by restarting the
discovery timeout whenever an rscn is received.
Filter PRLI responses such that severity depends on whether expected for
the configuration or not. For example, PRLI errors on a fabric will be
informational (they are expected), but Point-to-Point errors are not
necessarily expected so they are raised to an error level.
Link: https://lore.kernel.org/r/20191105005708.7399-5-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit feff8b3d84d3d9570f893b4d83e5eab6693d6a52 ]
When operating in private loop mode, PLOGI exchanges are racing and the
driver tries to abort it's PLOGI. But the PLOGI abort ends up terminating
the login with the other end causing the other end to abort its PLOGI as
well. Discovery never fully completes.
Fix by disabling the PLOGI abort when private loop and letting the state
machine play out.
Link: https://lore.kernel.org/r/20191018211832.7917-5-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 91a52b617cdb8bf6d298892101c061d438b84a19 ]
In lpfc_abort_handler, the lock acquire order is hbalock (irqsave),
buf_lock (irq) and ring_lock (irq). The issue is that in two places the
locks are released out of order - the buf_lock and the hbalock - resulting
in the cpu preemption/lock flags getting restored out of order and
deadlocking the cpu.
Fix the unlock order by fully releasing the hbalocks as well.
CC: Zhangguanghui <zhang.guanghui@h3c.com>
Link: https://lore.kernel.org/r/20191018211832.7917-7-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 15498dc1a55b7aaea4b51ff03e3ff0f662e73f44 ]
After study, it was determined there was a double free of a CT iocb during
execution of lpfc_offline_prep and lpfc_offline. The prep routine issued
an abort for some CT iocbs, but the aborts did not complete fast enough for
a subsequent routine that waits for completion. Thus the driver proceeded
to lpfc_offline, which releases any pending iocbs. Unfortunately, the
completions for the aborts were then received which re-released the ct
iocbs.
Turns out the issue for why the aborts didn't complete fast enough was not
their time on the wire/in the adapter. It was the lpfc_work_done routine,
which requires the adapter state to be UP before it calls
lpfc_sli_handle_slow_ring_event() to process the completions. The issue is
the prep routine takes the link down as part of it's processing.
To fix, the following was performed:
- Prevent the offline routine from releasing iocbs that have had aborts
issued on them. Defer to the abort completions. Also means the driver
fully waits for the completions. Given this change, the recognition of
"driver-generated" status which then releases the iocb is no longer
valid. As such, the change made in the commit 296012285c90 is reverted.
As recognition of "driver-generated" status is no longer valid, this
patch reverts the changes made in
commit 296012285c90 ("scsi: lpfc: Fix leak of ELS completions on adapter reset")
- Modify lpfc_work_done to allow slow path completions so that the abort
completions aren't ignored.
- Updated the fdmi path to recognize a CT request that fails due to the
port being unusable. This stops FDMI retries. FDMI will be restarted on
next link up.
Link: https://lore.kernel.org/r/20190922035906.10977-14-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 07b8582430370097238b589f4e24da7613ca6dd3 ]
Symptoms were seen of the driver not having valid data for mailbox
commands. After debugging, the following sequence was found:
The driver maintains a port-wide pointer of the mailbox command that is
currently in execution. Once finished, the port-wide pointer is cleared
(done in lpfc_sli4_mq_release()). The next mailbox command issued will set
the next pointer and so on.
The mailbox response data is only copied if there is a valid port-wide
pointer.
In the failing case, it was seen that a new mailbox command was being
attempted in parallel with the completion. The parallel path was seeing
the mailbox no long in use (flag check under lock) and thus set the port
pointer. The completion path had cleared the active flag under lock, but
had not touched the port pointer. The port pointer is cleared after the
lock is released. In this case, the completion path cleared the just-set
value by the parallel path.
Fix by making the calls that clear mbox state/port pointer while under
lock. Also slightly cleaned up the error path.
Link: https://lore.kernel.org/r/20190922035906.10977-8-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3f97aed6117c7677eb16756c4ec8b86000fd5822 ]
An issue was seen discovering all SCSI Luns when a target device undergoes
link bounce.
The driver currently does not qualify the FC4 support on the target.
Therefore it will send a SCSI PRLI and an NVMe PRLI. The expectation is
that the target will reject the PRLI if it is not supported. If a PRLI
times out, the driver will retry. The driver will not proceed with the
device until both SCSI and NVMe PRLIs are resolved. In the failure case,
the device is FCP only and does not respond to the NVMe PRLI, thus
initiating the wait/retry loop in the driver. During that time, a RSCN is
received (device bounced) causing the driver to issue a GID_FT. The GID_FT
response comes back before the PRLI mess is resolved and it prematurely
cancels the PRLI retry logic and leaves the device in a STE_PRLI_ISSUE
state. Discovery with the target never completes or resets.
Fix by resetting the node state back to STE_NPR_NODE when GID_FT completes,
thereby restarting the discovery process for the node.
Link: https://lore.kernel.org/r/20190922035906.10977-10-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d38b4a527fe898f859f74a3a43d4308f48ac7855 ]
While reviewing the CT behavior, issues with spinlock_irq were seen. The
driver should be using spinlock_irqsave/irqrestore in the els flush
routine.
Changed to spinlock_irqsave/irqrestore.
Link: https://lore.kernel.org/r/20190922035906.10977-15-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 324e1c402069e8d277d2a2b18ce40bde1265b96a upstream.
In cases where I/O may be aborted, such as driver unload or link bounces,
the system will crash based on a bad ndlp pointer.
Example:
RIP: 0010:lpfc_sli4_abts_err_handler+0x15/0x140 [lpfc]
...
lpfc_sli4_io_xri_aborted+0x20d/0x270 [lpfc]
lpfc_sli4_sp_handle_abort_xri_wcqe.isra.54+0x84/0x170 [lpfc]
lpfc_sli4_fp_handle_cqe+0xc2/0x480 [lpfc]
__lpfc_sli4_process_cq+0xc6/0x230 [lpfc]
__lpfc_sli4_hba_process_cq+0x29/0xc0 [lpfc]
process_one_work+0x14c/0x390
Crash was caused by a bad ndlp address passed to I/O indicated by the XRI
aborted CQE. The address was not NULL so the routine deferenced the ndlp
ptr. The bad ndlp also caused the lpfc_sli4_io_xri_aborted to call an
erroneous io handler. Root cause for the bad ndlp was an lpfc_ncmd that
was aborted, put on the abort_io list, completed, taken off the abort_io
list, sent to lpfc_release_nvme_buf where it was put back on the abort_io
list because the lpfc_ncmd->flags setting LPFC_SBUF_XBUSY was not cleared
on the final completion.
Rework the exchange busy handling to ensure the flags are properly set for
both scsi and nvme.
Fixes: c490850a0947 ("scsi: lpfc: Adapt partitioned XRI lists to efficient sharing")
Cc: <stable@vger.kernel.org> # v5.1+
Link: https://lore.kernel.org/r/20191018211832.7917-6-jsmart2021@gmail.com
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Nine changes, eight in drivers [ufs, target, lpfc x 2, qla2xxx x 4]
and one core change in sd that fixes an I/O failure on DIF type 3
devices.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXbzO+iYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishYOpAP9/BCSY
2TAFlli2rVQe+ZNjhHcE4Gj92HNPO7ZgvDQvWgD9F184tjG+1pntYGFutoso7Ak6
QimtBw4AuYg9eDKJDKU=
=bQRX
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Nine changes, eight in drivers [ufs, target, lpfc x 2, qla2xxx x 4]
and one core change in sd that fixes an I/O failure on DIF type 3
devices"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: qla2xxx: stop timer in shutdown path
scsi: sd: define variable dif as unsigned int instead of bool
scsi: target: cxgbit: Fix cxgbit_fw4_ack()
scsi: qla2xxx: Fix partial flash write of MBI
scsi: qla2xxx: Initialized mailbox to prevent driver load failure
scsi: lpfc: Honor module parameter lpfc_use_adisc
scsi: ufs-bsg: Wake the device before sending raw upiu commands
scsi: lpfc: Check queue pointer before use
scsi: qla2xxx: fixup incorrect usage of host_byte
Nine changes, eight to drivers (qla2xxx, hpsa, lpfc, alua, ch,
53c710[x2], target) and one core change that tries to close a race
between sysfs delete and module removal.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXbN1gSYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishWUzAP4tB9Z+
X5zfnMLmeAtSCnVwIgFX3/GVSFfzEmi+3VxfBQEA3nfs5AAJCPsaTk9z+jLtAKPk
6uYoHwsyTHal19Ojt9g=
=IOPn
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Nine changes, eight to drivers (qla2xxx, hpsa, lpfc, alua, ch,
53c710[x2], target) and one core change that tries to close a race
between sysfs delete and module removal"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: lpfc: remove left-over BUILD_NVME defines
scsi: core: try to get module before removing device
scsi: hpsa: add missing hunks in reset-patch
scsi: target: core: Do not overwrite CDB byte 1
scsi: ch: Make it possible to open a ch device multiple times again
scsi: fix kconfig dependency warning related to 53C700_LE_ON_BE
scsi: sni_53c710: fix compilation error
scsi: scsi_dh_alua: handle RTPG sense code correctly during state transitions
scsi: qla2xxx: fix a potential NULL pointer dereference
The initial lpfc_desc_set_adisc implementation in commit
dea3101e0a5c ("lpfc: add Emulex FC driver version 8.0.28") enabled ADISC if
cfg_use_adisc && RSCN_MODE && FCP_2_DEVICE
In commit 92d7f7b0cde3 ("[SCSI] lpfc: NPIV: add NPIV support on top of
SLI-3") this changed to
(cfg_use_adisc && RSC_MODE) || FCP_2_DEVICE
and later in commit ffc954936b13 ("[SCSI] lpfc 8.3.13: FC Discovery Fixes
and enhancements.") to
(cfg_use_adisc && RSC_MODE) || (FCP_2_DEVICE && FCP_TARGET)
A customer reports that after a devloss, an ADISC failure is logged. It
turns out the ADISC flag is set even the user explicitly set lpfc_use_adisc
= 0.
[Sat Dec 22 22:55:58 2018] lpfc 0000:82:00.0: 2:(0):0203 Devloss timeout on WWPN 50:01:43:80:12:8e:40:20 NPort x05df00 Data: x82000000 x8 xa
[Sat Dec 22 23:08:20 2018] lpfc 0000:82:00.0: 2:(0):2755 ADISC failure DID:05DF00 Status:x9/x70000
[mkp: fixed Hannes' email]
Fixes: 92d7f7b0cde3 ("[SCSI] lpfc: NPIV: add NPIV support on top of SLI-3")
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: James Smart <james.smart@broadcom.com>
Link: https://lore.kernel.org/r/20191022072112.132268-1-dwagner@suse.de
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The queue pointer might not be valid. The rest of the code checks the
pointer before accessing it. lpfc_sli4_process_missed_mbox_completions is
the only place where the check is missing.
Fixes: 657add4e5e15 ("scsi: lpfc: Fix poor use of hardware queues if fewer irq vectors")
Cc: James Smart <jsmart2021@gmail.com>
Link: https://lore.kernel.org/r/20191018162111.8798-1-dwagner@suse.de
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>