When a LS_RJT is received with reason explanation authentication required,
current driver logic is to retry the PLOGI up to 48 times. In the worse
case scenario, 48 retries can take longer than dev_loss_tmo and if there is
an RSCN received indicating an authentication requirement change, the
driver may miss processing it. Fix by adding logic to specifically handle
reason explanation authentication required and set the max retry count to 8
times.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241212233309.71356-6-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Remove the NLP_TARGET_REMOVE flag as its usage is obsolete. The current
framework is to rely on the lpfc_dev_loss_tmo_callbk from upper layer to
notify final ndlp kref release. There's no need to specifically set
NLP_EVT_DEVICE_RM when a LOGO completes. The dev_loss_tmo_callbk is
responsible for the final kref put.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241212233309.71356-4-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
An RPI is tightly bound to an NDLP structure and is freed only upon
release of an NDLP object. As such, there should be no logic that frees
an RPI outside of the lpfc_nlp_release() routine. In order to reinforce
the original design usage of RPIs, remove the NLP_RELEASE_RPI flag and
related logic.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241031223219.152342-9-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During firmware errata events, the lpfc_els_flush_cmd() routine is
responsible for the clean up of outstanding ELS and CT command
submissions. Thus, move the LPFC_SLI_ACTIVE flag check into the txcmplq
list walk and mark a piocb object for canceling if determined the HBA is
not active. Clean up should be regardless of application or driver
layer origin.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20241031223219.152342-5-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Revise certain log messages marked as KERN_ERR LOG_TRACE_EVENT to
KERN_WARNING and use the lpfc_vlog_msg() macro to still log the event. The
benefit is that events of interest are still logged and the entire trace
buffer is not dumped with extraneous logging information when using default
lpfc_log_verbose driver parameter settings.
Also, delete the keyword "fail" from such log messages as they aren't
really causes for concern. The log messages are more for warnings to a SAN
admin about SAN activity.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240912232447.45607-7-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During HBA stress testing, a spam of received PLOGIs exposes a resource
recovery bug causing leakage of lpfc_sqlq entries from the global
phba->sli4_hba.lpfc_els_sgl_list.
The issue is in lpfc_els_flush_cmd(), where the driver attempts to recover
outstanding ELS sgls when walking the txcmplq. Only CMD_ELS_REQUEST64_CRs
and CMD_GEN_REQUEST64_CRs are added to the abort and cancel lists. A check
for CMD_XMIT_ELS_RSP64_WQE is missing in order to recover LS_ACC usages of
the phba->sli4_hba.lpfc_els_sgl_list too.
Fix by adding CMD_XMIT_ELS_RSP64_WQE as part of the txcmplq walk when
adding WQEs to the abort and cancel list in lpfc_els_flush_cmd(). Also,
update naming convention from CRs to WQEs.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240912232447.45607-2-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A kref imbalance occurs when handling an unsolicited PRLO in direct
attached topology.
Rework PRLO rcv handling when in MAPPED state. Save the state that we were
handling a PRLO by setting nlp_last_elscmd to ELS_CMD_PRLO. Then in the
lpfc_cmpl_els_logo_acc() completion routine, manually restart discovery.
By issuing the PLOGI, which nlp_gets, before nlp_put at the end of the
lpfc_cmpl_els_logo_acc() routine, we are saving us from a final nlp_put.
And, we are still allowing the unreg_rpi to happen.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240726231512.92867-7-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In direct attached topology, certain target vendors that are quick to issue
FLOGI followed by a cable pull for more than dev_loss_tmo may result in a
kref imbalance for the remote port ndlp object.
Add an nlp_get when the defer_flogi_acc flag is set. This is expected to
balance the nlp_put in the defer_flogi_acc clause in the
lpfc_issue_els_flogi() routine. Because we need to retain the ndlp ptr,
reorganize all of the defer_flogi_acc information into one
lpfc_defer_flogi_acc struct.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240726231512.92867-6-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The vport->vmid_flag is unintentionally cleared twice after an issue_lip
via the lpfc_reinit_vmid routine().
The first call to lpfc_reinit_vmid() is in lpfc_cmpl_els_flogi(). Then
lpfc_cmpl_els_flogi_fabric() calls lpfc_register_new_vport(), which calls
lpfc_cmpl_reg_new_vport() when the mbox command completes and calls
lpfc_reinit_vmid() a second time.
Fix by moving the vmid_flag clear outside of the lpfc_reinit_vmid() routine
so that vmid_flag is only cleared once upon FLOGI completion.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240726231512.92867-5-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During diagnostics, it has been determined that the 0115 log message for
receipt of unknown ELS cmds does not benefit from trace buffer dumps. The
trace buffer dump floods the console with unnecessary information, and the
singular LOG_ELS flag has proven more beneficial in debugging efforts when
dealing with unknown ELS cmds.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240726231512.92867-2-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The MBX_TIMEOUT return code is not handled in lpfc_get_sfp_info and the
routine unconditionally frees submitted mailbox commands regardless of
return status. The issue is that for MBX_TIMEOUT cases, when firmware
returns SFP information at a later time, that same mailbox memory region
references previously freed memory in its cmpl routine.
Fix by adding checks for the MBX_TIMEOUT return code. During mailbox
resource cleanup, check the mbox flag to make sure that the wait did not
timeout. If the MBOX_WAKE flag is not set, then do not free the resources
because it will be freed when firmware completes the mailbox at a later
time in its cmpl routine.
Also, increase the timeout from 30 to 60 seconds to accommodate boot
scripts requiring longer timeouts.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240628172011.25921-6-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During SLI port errata events, there should be no expectation that
submitted outstanding WQEs will return back CQEs. In these situations, the
driver should not rely on receiving CQEs from the SLI port to signal WQE
resource clean up.
Put an sli_flag LPFC_SLI_ACTIVE check in lpfc_els_flush_cmd() when walking
the txcmplq. The sli_flag check helps determine whether to issue an abort
or driver based cancel on outstanding WQEs. If !LPFC_SLI_ACTIVE, then
there's no point to issue anything to the SLI port. Instead, let the
driver based cancel logic clean up the submitted WQE resources.
Also, enhance some abort log messages that help with future debugging.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240628172011.25921-2-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In LPFC_MBOXQ_t, the void *context3 ptr is used for various paths. It is
treated as a generic pointer, and is type casted during its usage.
The issue with this is that it can sometimes get confusing when reading
code as to what the context3 ptr is being used for and mistakenly be reused
in a different context.
Rename context3 to ctx_u, and declare it as a union of defined ptr types.
From now on, the ctx_u ptr may be used only if users define the use case
type.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240305200503.57317-11-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In LPFC_MBOXQ_t, the ctx_buf ptr shouldn't be defined as a generic void
*ptr. It is named ctx_buf and it should only be used as an lpfc_dmabuf
*ptr. Due to the void* declaration, there have been abuses of ctx_buf for
things not related to lpfc_dmabuf.
So, set the ptr type for *ctx_buf as lpfc_dmabuf. Remove all type casts on
ctx_buf because it is no longer a void *ptr. Convert the abuse of ctx_buf
for something not related to lpfc_dmabuf to use the void *context3 ptr.
A particular abuse of the ctx_buf warranted a new void *ext_buf ptr.
However, the usage of this new void *ext_buf is not generic. It is
intended to only hold virtual addresses for extended mailbox commands.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240305200503.57317-10-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In LPFC_MBOXQ_t data structure, the ctx_ndlp ptr shouldn't be defined as a
generic void *ptr. It is named ctx_ndlp and it should only be used as an
lpfc_nodelist *ptr. Due to the void* declaration, there have been abuses
of ctx_ndlp for things not related to ndlp.
So, set the ptr type for *ctx_ndlp as lpfc_nodelist. Remove all type casts
on ctx_ndlp because it is no longer a void *ptr. Convert the abuse of
ctx_ndlp for things not related to ndlps to use the void *context3 ptr.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240305200503.57317-9-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In attempt to reduce the amount of unnecessary shost_lock acquisitions in
the lpfc driver, change load_flag into an unsigned long bitmask and use
clear_bit/test_bit bitwise atomic APIs instead of reliance on shost_lock
for synchronization.
Also, correct the test for FC_UNLOADING in lpfc_ct_handle_mibreq, which
incorrectly tests vport->fc_flag rather than vport->load_flag.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240131185112.149731-16-justintee8345@gmail.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
If priority tagging is set in the service parameters of a FLOGI cmpl, then
we update the vmid_flag. In the current logic, if a follow up FLOGI cmpl
updates its service parameters such that priority tagging is no longer set,
then the vmid_flag ends up keeping stale data.
Fix by ensuring we clear the vmid_flag member during lpfc_reinit_vmid, and
check the priority tagging service parameter after reinitialization of the
vmid data structures.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20231207224039.35466-4-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A mailbox timeout error usually indicates something has gone wrong, and a
follow up reset of the HBA is a typical recovery mechanism. Introduce a
MBX_TMO_ERR flag to detect such cases and have lpfc_els_flush_cmd abort ELS
commands if the MBX_TMO_ERR flag condition was set. This ensures all of
the registered SGL resources meant for ELS traffic are not leaked after an
HBA reset.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230712180522.112722-9-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Previously, Establish Image Pair was set in all PRLI_ACC responses
regardless if the received PRLI was from an initiator or target function.
Specific target vendors that can operate in both initiator and target mode,
may view the PRLI_ACC with Establish Image Pair set as an invalid service
parameter when operating in initiator only mode. This causes discovery
issues later when the target switches on its target mode function.
Revise logic that determines an rport's role as an initiator or target and
set the Establish Image Pair service parameter bit only if the Target
Function bit is set.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230712180522.112722-7-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In lpfc_cmpl_els_flogi(), the return out: label decrements the ndlp kref
signaling that FLOGI processing on the ndlp is complete. In loop topology
path, there is an unnecessary ndlp put because it also branches to the out:
label. This also signals ndlp usage completion too soon. As such, remove
the extra lpfc_nlp_put() when in loop topology.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230712180522.112722-4-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When NPIV ports are zoned to devices that support both initiator and target
mode, a remote device's initiated PRLI results in unintended final kref
clean up of the device's ndlp structure. This disrupts NPIV ports'
discovery for target devices that support both initiator and target mode.
Modify the NPIV lpfc_drop_node clause such that we allow the ndlp to live
so long as it was in NLP_STE_PLOGI_ISSUE, NLP_STE_REG_LOGIN_ISSUE, or
NLP_STE_PRLI_ISSUE nlp_state. This allows lpfc's issued PRLI completion
routine to determine if the final kref clean up should execute rather than
a remote device's issued PRLI.
Fixes: db651ec225 ("scsi: lpfc: Correct used_rpi count when devloss tmo fires with no recovery")
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230523183206.7728-5-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Smatch detected a double free path because lpfc_nlp_not_used() releases an
ndlp object before reaching lpfc_nlp_put() at the end of
lpfc_cmpl_els_logo_acc().
Remove the outdated lpfc_nlp_not_used() routine. In
lpfc_mbx_cmpl_ns_reg_login(), replace the call with lpfc_nlp_put(). In
lpfc_cmpl_els_logo_acc(), replace the call with lpfc_unreg_rpi() and keep
the lpfc_nlp_put() at the end of the routine. If ndlp's rpi was
registered, then lpfc_unreg_rpi()'s completion routine performs the final
ndlp clean up after lpfc_nlp_put() is called from lpfc_cmpl_els_logo_acc().
Otherwise if ndlp has no rpi registered, the lpfc_nlp_put() at the end of
lpfc_cmpl_els_logo_acc() is the final ndlp clean up.
Fixes: 4430f7fd09 ("scsi: lpfc: Rework locations of ndlp reference taking")
Cc: <stable@vger.kernel.org> # v5.11+
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/all/Y3OefhyyJNKH%2Fiaf@kili/
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230417191558.83100-3-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A fabric controller can sometimes send an RDP request right before a link
down event. Because of this outstanding RDP request, the driver does not
remove the last reference count on its ndlp causing a potential leak of RPI
resources when devloss tmo fires.
In lpfc_cmpl_els_rsp(), modify the NPIV clause to always allow the
lpfc_drop_node() routine to execute when not registered with SCSI
transport.
This relaxes the contraint that an NPIV ndlp must be in a specific state in
order to call lpfc_drop node. Logic is revised such that the
lpfc_drop_node() routine is always called to ensure the last ndlp decrement
occurs.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230301231626.9621-7-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When mapped to a target with multiple virtual ports, a link bounce
sometimes results in unsuccessful rediscovery of all of the target's
virtual ports. This is because a succession of repeat RSCNs for the
virtual target ports leaves ndlps in the REG_LOGIN state with the
NLP_REG_LOGIN_SEND flag set. With NLP_REG_LOGIN_SEND set, during the next
PLOGI, the driver will UNREG_RPI. When UNREG_RPI is processed, the driver
can be in the middle of PRLI_ISSUE or MAPPED state resulting in an illegal
state transition by the discovery engine and stalling.
Fix by calling the discovery state machine with DEVICE_RECOVERY event
during RSCN processing. This will set the NLP_IGNR_REG_CMPL bit and
prevent the old REG_LOGIN state from advancing. Then for the new PLOGI
issue, add the check for the NLP_IGNR_REG_CMPL bit to delay issuing the new
PLOGI until the queued REG_LOGIN and UNREG_LOGIN have been processed.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230301231626.9621-6-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A target vendor array reboot in P2P topology can sometimes result in
unsuccessful rediscovery.
Rework the lpfc_cmpl_els_logo() routine such that when the LOGO completes
as a failure because of driver abort, the LOGO state is still recorded with
the discovery state machine.
This is a small rework to set LOGO completion without forcing a device
removal state change.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230301231626.9621-5-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The LLDD and the stack currently process FPINs received from the fabric,
but the stack is not aware of any action taken by the driver to alleviate
congestion. The current interface between the driver and the SCSI stack is
limited to passing the notification mainly for statistics and heuristics.
The reaction to an FPIN could be handled either by the driver or by the
stack (marginal path and failover). Amend the interface to indicate if
action on an FPIN has already been reacted to by the LLDDs or not. Add an
additional flag to fc_host_fpin_rcv() to indicate if the FPIN has been
acknowledged/reacted to by the driver.
Also added a new event code FCH_EVT_LINK_FPIN_ACK to notify to the user
that the event has been acknowledged/reacted by the LLDD driver
Link: https://lore.kernel.org/r/20230209034326.882514-1-muneendra.kumar@broadcom.com
Co-developed-by: Anil Gurumurthy <agurumurthy@marvell.com>
Signed-off-by: Anil Gurumurthy <agurumurthy@marvell.com>
Co-developed-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Muneendra <muneendra.kumar@broadcom.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After enabling VMID, an issue LIP test was erasing fabric switch VMID
information.
Introduce a lpfc_reinit_vmid() routine, which reinitializes all VMID data
structures upon FLOGI completion in fabric topology.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In a large SAN testing configuration, frequent target port toggle tests are
occasionally resulting in missing lun path rediscoveries. An outstanding
PRLI can be inflight when a target RSCN dissappearance occurs, causing the
driver to retry PRLIs using invalid rpi contexts.
Fix by verifying that an ndlp's state was not restarted from PRLI_ISSUE
due to an intermediate RSCN. If not in a valid state, early exit PRLI
completion handling.
The last follow up RSCN indicating target reappearance retriggers
PLOGI/PRLI with a valid rpi context and is expected to succeed in LUN path
rediscovery.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When a FLOGI completes with a sequence timeout error, a freed kref ptr
dereference crash can occur due to a timing race involving ndlp referencing
in lpfc_dev_loss_tmo_callbk.
Fix by ensuring the driver accounts for an outstanding FLOGI when dev_loss
is active. Also, don't remove the HBA_FLOGI_OUTSTANDING flag when the
FLOGI is retried to allow the driver to handle the reference counts
correctly in lpfc_dev_loss_tmo_handler.
Reported-by: Dietmar Hahn <dietmar.hahn@fujitsu.com>
Tested-by: Dietmar Hahn <dietmar.hahn@fujitsu.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20221116011921.105995-5-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>