In LPFC_MBOXQ_t, the ctx_buf ptr shouldn't be defined as a generic void
*ptr. It is named ctx_buf and it should only be used as an lpfc_dmabuf
*ptr. Due to the void* declaration, there have been abuses of ctx_buf for
things not related to lpfc_dmabuf.
So, set the ptr type for *ctx_buf as lpfc_dmabuf. Remove all type casts on
ctx_buf because it is no longer a void *ptr. Convert the abuse of ctx_buf
for something not related to lpfc_dmabuf to use the void *context3 ptr.
A particular abuse of the ctx_buf warranted a new void *ext_buf ptr.
However, the usage of this new void *ext_buf is not generic. It is
intended to only hold virtual addresses for extended mailbox commands.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240305200503.57317-10-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In LPFC_MBOXQ_t data structure, the ctx_ndlp ptr shouldn't be defined as a
generic void *ptr. It is named ctx_ndlp and it should only be used as an
lpfc_nodelist *ptr. Due to the void* declaration, there have been abuses
of ctx_ndlp for things not related to ndlp.
So, set the ptr type for *ctx_ndlp as lpfc_nodelist. Remove all type casts
on ctx_ndlp because it is no longer a void *ptr. Convert the abuse of
ctx_ndlp for things not related to ndlps to use the void *context3 ptr.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240305200503.57317-9-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In attempt to reduce the amount of unnecessary shost_lock acquisitions in
the lpfc driver, change load_flag into an unsigned long bitmask and use
clear_bit/test_bit bitwise atomic APIs instead of reliance on shost_lock
for synchronization.
Also, correct the test for FC_UNLOADING in lpfc_ct_handle_mibreq, which
incorrectly tests vport->fc_flag rather than vport->load_flag.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240131185112.149731-16-justintee8345@gmail.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In attempt to reduce the amount of unnecessary shost_lock acquisitions in
the lpfc driver, replace shost_lock with an explicit fc_nodes_list_lock
spinlock when accessing vport->fc_nodes lists. Although vport memory
region is owned by shost->hostdata, it is driver private memory and an
explicit fc_nodes list lock for fc_nodes list mutations is more appropriate
than locking the entire shost.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240131185112.149731-14-justintee8345@gmail.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Upon first RSCN receipt of a target server's remote port that is initially
acting as an initiator function, the driver marks the ndlp->nlp_type as an
initiator role. Then later, when processing an RSCN for a target function
role switch, that ndlp remote port is permanently stuck as an initiator
role and can never transition to be discovered as an updated target role
function.
Remove the NLP_RCV_PLOGI early return if statement clause so that the
NLP_NPR_2B_DISC flag gets set. This allows for role change detections.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20240131185112.149731-7-justintee8345@gmail.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The preexisting LOG_NODE message flag frequently spams a subset of the same
log messages during normal FC driver operations. When analyzing driver
logs, this sometimes leads to difficulty in troubleshooting.
Because LOG_IP log message flag is unused, convert it to a new
LOG_NODE_VERBOSE flag. The LOG_NODE_VERBOSE shall specifically be used for
diagnosing issues that require precise ndlp tracking detail.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20231009161812.97232-6-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During rmmod, when dev_loss_tmo callback is called, an ndlp kref count is
decremented twice. Once for SCSI transport registration and second to
remove the initial node allocation kref. If there is also an NVMe
transport registration, another reference count decrement is expected in
lpfc_nvme_unregister_port().
Race conditions between the NVMe transport remoteport_delete and
dev_loss_tmo callbacks sometimes results in premature ndlp object release
resulting in use-after-free issues.
Fix by not dropping the ndlp object in dev_loss_tmo callback with an
outstanding NVMe transport registration. Inversely, mark the final
NLP_DROPPED flag in lpfc_nvme_unregister_port when rmmod flag is set.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230908211923.37603-1-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When a dev_loss_tmo event occurs, an ndlp lock is taken before checking
nlp_flag for NLP_DROPPED. There is an attempt to restore the ndlp lock
when exiting the if statement, but the nlp_put kref could be the final
decrement causing a use-after-free memory access on a released ndlp object.
Instead of trying to reacquire the ndlp lock after checking nlp_flag, just
return after calling nlp_put.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230908211852.37576-1-justintee8345@gmail.com
Reviewed-by: "Ewan D. Milne" <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Only nodes whose state is at least past a PLOGI issue and strictly less
than a PRLI issue should be put into device recovery mode upon RSCN
receipt. Previously, the allowance of LOGO and PRLI completion states did
not make sense because those nodes should be allowed to flow through and
marked as NPort dissappeared as is normally done. A follow up RSCN GID_FT
would recover those nodes in such cases.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230804195546.157839-1-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The ndlp kref count implementation in lpfc_dev_loss_tmo_callbk() removes
the initial node reference when a vport is unloading. When lpfc_cleanup()
sends a DEVICE_RM event and is in NPR state, the driver calls
lpfc_drop_node(). Subsequently, lpfc_drop_node() also removes an ndlp kref
thinking it is the initial reference. This unintentionally introduces an
extra kref decrement on the ndlp object.
Fix by using the NLP_DROPPED node flag in lpfc_dev_loss_tmo_callbk() and
lpfc_drop_node() to coordinate the removal of the initial node reference.
In lpfc_dev_loss_tmo_callbk(), remove the SCSI transport reference provided
the node is registered in the dev_loss context because the driver cannot
call the SCSI transport in dev_loss context or afterwards. And, have
lpfc_drop_node() not remove a reference if another thread is acting or has
already acted on it.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230712180522.112722-6-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Conditionalize when to put an ndlp into recovery mode when processing
RSCNs. As long as an ndlp state is beyond a PLOGI issue and has been
mapped to a transport layer before, the ndlp qualifies to be put into
recovery mode. Otherwise, treat the ndlp rport normally through the
discovery engine.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230712180522.112722-5-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The variable phba->fcf.fcf_flag is often protected by the lock
phba->hbalock() when is accessed. Here is an example in
lpfc_unregister_fcf_rescan():
spin_lock_irq(&phba->hbalock);
phba->fcf.fcf_flag |= FCF_INIT_DISC;
spin_unlock_irq(&phba->hbalock);
However, in the same function, phba->fcf.fcf_flag is assigned with 0
without holding the lock, and thus can cause a data race:
phba->fcf.fcf_flag = 0;
To fix this possible data race, a lock and unlock pair is added when
accessing the variable phba->fcf.fcf_flag.
Reported-by: BassCheck <bass@buaa.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Link: https://lore.kernel.org/r/20230630024748.1035993-1-islituo@gmail.com
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pre-existing device loss recovery logic via the NLP_IN_RECOV_POST_DEV_LOSS
flag only handled Fabric Port Login, Fabric Controller, Management, and
Name Server addresses.
Fabric domain controllers fall under the same category for usage of the
NLP_IN_RECOV_POST_DEV_LOSS flag. Add a default case statement to mark an
ndlp for device loss recovery.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230523183206.7728-4-justintee8345@gmail.com
Acked-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In dev_loss_tmo callback routine, we early return if the ndlp is in a state
of rediscovery. This occurs when a target proactively PLOGIs or PRLIs
after an RSCN before the dev_loss_tmo callback routine is scheduled to run.
Move clear of the NLP_IN_DEV_LOSS flag before the ndlp state check in such
cases.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230523183206.7728-3-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Due to a target port D_ID swap, it is possible for the
lpfc_register_remote_port() routine to touch post mortem fc_rport memory
when trying to access fc_rport->dd_data.
The D_ID swap causes a simultaneous call to lpfc_unregister_remote_port(),
where fc_remote_port_delete() reclaims fc_rport memory.
Remove the fc_rport->dd_data->pnode NULL assignment because the following
line reassigns ndlp->rport with an fc_rport object from
fc_remote_port_add() anyways. The pnode nullification is superfluous.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230523183206.7728-2-justintee8345@gmail.com
Acked-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Smatch detected a double free path because lpfc_nlp_not_used() releases an
ndlp object before reaching lpfc_nlp_put() at the end of
lpfc_cmpl_els_logo_acc().
Remove the outdated lpfc_nlp_not_used() routine. In
lpfc_mbx_cmpl_ns_reg_login(), replace the call with lpfc_nlp_put(). In
lpfc_cmpl_els_logo_acc(), replace the call with lpfc_unreg_rpi() and keep
the lpfc_nlp_put() at the end of the routine. If ndlp's rpi was
registered, then lpfc_unreg_rpi()'s completion routine performs the final
ndlp clean up after lpfc_nlp_put() is called from lpfc_cmpl_els_logo_acc().
Otherwise if ndlp has no rpi registered, the lpfc_nlp_put() at the end of
lpfc_cmpl_els_logo_acc() is the final ndlp clean up.
Fixes: 4430f7fd09 ("scsi: lpfc: Rework locations of ndlp reference taking")
Cc: <stable@vger.kernel.org> # v5.11+
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/all/Y3OefhyyJNKH%2Fiaf@kili/
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230417191558.83100-3-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When mapped to a target with multiple virtual ports, a link bounce
sometimes results in unsuccessful rediscovery of all of the target's
virtual ports. This is because a succession of repeat RSCNs for the
virtual target ports leaves ndlps in the REG_LOGIN state with the
NLP_REG_LOGIN_SEND flag set. With NLP_REG_LOGIN_SEND set, during the next
PLOGI, the driver will UNREG_RPI. When UNREG_RPI is processed, the driver
can be in the middle of PRLI_ISSUE or MAPPED state resulting in an illegal
state transition by the discovery engine and stalling.
Fix by calling the discovery state machine with DEVICE_RECOVERY event
during RSCN processing. This will set the NLP_IGNR_REG_CMPL bit and
prevent the old REG_LOGIN state from advancing. Then for the new PLOGI
issue, add the check for the NLP_IGNR_REG_CMPL bit to delay issuing the new
PLOGI until the queued REG_LOGIN and UNREG_LOGIN have been processed.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20230301231626.9621-6-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
With faulty cables in PT2PT topology, an unintentional ndlp double kref
decrement can occur.
If a FLOGI request is outstanding before the link goes down, the missing
FLOGI_ACC causes an F_Port ndlp to remain in the UNUSED state. During link
down, lpfc_cleanup_rpis() is called and decrements an ndlp kref.
Additionally, when the driver later decides to abort the FLOGI, the FLOGI
completion handler decrements the ndlp kref a second time.
Remove duplicate clean up logic in lpfc_cleanup_rpis() because the updated
FLOGI completion handler already handles the ndlp kref decrement.
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When a FLOGI completes with a sequence timeout error, a freed kref ptr
dereference crash can occur due to a timing race involving ndlp referencing
in lpfc_dev_loss_tmo_callbk.
Fix by ensuring the driver accounts for an outstanding FLOGI when dev_loss
is active. Also, don't remove the HBA_FLOGI_OUTSTANDING flag when the
FLOGI is retried to allow the driver to handle the reference counts
correctly in lpfc_dev_loss_tmo_handler.
Reported-by: Dietmar Hahn <dietmar.hahn@fujitsu.com>
Tested-by: Dietmar Hahn <dietmar.hahn@fujitsu.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Link: https://lore.kernel.org/r/20221116011921.105995-5-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This patch fixes below Smatch reported issues:
1. lpfc_hbadisc.c:3020 lpfc_mbx_cmpl_fcf_rr_read_fcf_rec()
error: uninitialized symbol 'vlan_id'.
2. lpfc_hbadisc.c:3121 lpfc_mbx_cmpl_read_fcf_rec()
error: uninitialized symbol 'vlan_id'.
3. lpfc_init.c:335 lpfc_dump_wakeup_param_cmpl()
warn: always true condition '(prg->dist < 4) => (0-3 < 4)'
4. lpfc_init.c:2419 lpfc_parse_vpd()
warn: inconsistent indenting.
5. lpfc_init.c:13248 lpfc_sli4_enable_msi()
warn: 'phba->pcidev->irq' 2147483648 can't fit into 65535
'eqhdl->irq'
6. lpfc_debugfs.c:5300 lpfc_idiag_extacc_avail_get()
error: uninitialized symbol 'ext_cnt'
7. lpfc_debugfs.c:5300 lpfc_idiag_extacc_avail_get()
error: uninitialized symbol 'ext_size'
8. lpfc_vmid.c:248 lpfc_vmid_get_appid()
warn: sleeping in atomic context.
9. lpfc_init.c:8342 lpfc_sli4_driver_resource_setup()
warn: missing error code 'rc'.
10. lpfc_init.c:13573 lpfc_sli4_hba_unset()
warn: variable dereferenced before check 'phba->pport' (see
line 13546)
11. lpfc_auth.c:1923 lpfc_auth_handle_dhchap_reply()
error: double free of 'hash_value'
Fixes:
1. Initialize vlan_id to LPFC_FCOE_NULL_VID.
2. Initialize vlan_id to LPFC_FCOE_NULL_VID.
3. prg->dist is a 2 bit field. Its value can only be between 0-3.
Remove redundent check 'if (prg->dist < 4)'.
4. Fix inconsistent indenting. Moved logic into helper function
lpfc_fill_vpd().
5. Define 'eqhdl->irq' as int value as pci_irq_vector() returns int.
Also, check for return value of pci_irq_vector() and log message in
case of failure.
6. Initialize 'ext_cnt' to 0.
7. Initialize 'ext_size' to 0.
8. Use alloc_percpu_gfp() with GFP_ATOMIC flag.
9. 'rc' was not updated when dma_pool_create() fails. Update 'rc =
-ENOMEM' when dma_pool_create() fails before calling goto statement.
10. Add check for 'phba->pport' in lpfc_cpuhp_remove().
11. Initialize 'hash_value' to NULL, same like 'aug_chal' variable.
Link: https://lore.kernel.org/r/20220911221505.117655-13-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When a FLOGI is received before we have issued our FLOGI, the ACC response
to the received FLOGI is issued with SID 2 instead of the expected fabric
controller SID. Certain target vendors ignore the malformed ACC with SID 2
and wait for a properly filled ACC with a fabric controller SID.
The lpfc_sli_prep_wqe() routine depends on the FC_PT2PT flag to fill in the
fabric controller SID when in PT2PT mode, but due to a previous commit the
flag was getting cleared. Fix by adding a check for the defer_flogi_acc
flag to know whether or not to clear the FC_PT2PT flag on link up.
Link: https://lore.kernel.org/r/20220911221505.117655-3-jsmart2021@gmail.com
Fixes: 439b93293f ("scsi: lpfc: Fix unsolicited FLOGI receive handling during PT2PT discovery")
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During a stress offline/online test in PT2PT topology, target rediscovery
can fail with a specific target vendor array.
When the HBA transitions to online mode it is possible to receive an
unsolicited FLOGI before processing the Link Up event. The received FLOGI
will set the defer_flogi_acc_flag, which instructs the driver to wait until
it transmits its own FLOGI before ACKing the received FLOGI. In this
failure scenario, the link up processing clears the set
defer_flogi_acc_flag before we have sent out the FLOGI. As the target has
the higher WWPN and is responsible for sending the PLOGI, the target is
stuck waiting for its FLOGI_ACC that the driver will never send.
Remove the clear of defer_flogi_acc_flag from Link Up event processing. In
this stress test case, the defer_flogi_acc_flag is cleared during the Link
Down event processing anyways.
Link: https://lore.kernel.org/r/20220819011736.14141-2-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The RSCN_MEMENTO logic was to workaround a target that does not register
both FCP and NVMe FC4 types at the same time. This caused the
configuration to not produce a second RSCN for the NVMe FC4 type
registration in a timely manner. The intention of the RSCN_MEMENTO flag
was to always signal to try NVMe PRLI.
However, there are other FCP-only target arrays in correctly behaved
configurations that reject the NVMe PRLI followed by a LOGO leading to
never rediscovering the target after an issue_lip (as LOGO causes a repeat
of PLOGI/PRLIs).
Revert the RSCN_MEMENTO patch as it is causing correctly behaved configs to
fail while it exists only to succeed on a misbehaved config.
Link: https://lore.kernel.org/r/20220701211425.2708-9-jsmart2021@gmail.com
Fixes: 1045592fc9 ("scsi: lpfc: Introduce FC_RSCN_MEMENTO flag for tracking post RSCN completion")
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After a link up, it's possible for the switch to change FDMI support (e.g.
FDMI1 vs FDMI2 vs SmartSAN). If the switch reverts to FDMI1, then the
revert is currently not detected.
Additionally, when NPIV is configured, it's possible the physical port's
RHBA is unprocessed by the switch before reciept of an NPIV port issued
RPRT. This causes some switches vendors to reject the NPIV's RPRT.
Fix by reinitializing base FDMI mode on link up, and defer FDMI vport RPRT
submission until after confirming physical port's RHBA is completed.
Link: https://lore.kernel.org/r/20220506035519.50908-10-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During large NPIV port testing, it was sometimes seen that not all vports
would log back in to the target device.
There are instances when the fabric is slow to respond to a spam of GID_PT
requests and as a result the SLI PORT may abort the GID_PT request because
the fabric takes so long. lpfc_cmpl_ct_cmd_gid_pt() would enter the
lpfc_err_lost_link() logic and attempt to lpfc_els_flush_rscn(), which is
fine, but forgets to decrement the gidft_inp counter. This results in a
vport->gidft_inp never reaching 0 and never restarting discovery again.
Decrement vport->gidft_inp if lpfc_err_lost_link() is true for both
lpfc_cmpl_ct_cmd_gid_pt() and lpfc_cmpl_ct_cmd_gid_ft().
Increase logging info during RSCN timeout and lpfc_err_lost_link() events.
Link: https://lore.kernel.org/r/20220506035519.50908-8-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
After running a short external loopback test, when the external loopback is
removed and a normal cable inserted that is directly connected to a target
device, the system oops in the llpfc_set_rrq_active() routine.
When the loopback was inserted an FLOGI was transmit. As we're looped back,
we receive the FLOGI request. The FLOGI is ABTS'd as we recognize the same
wppn thus understand it's a loopback. However, as the ABTS sends address
information the port is not set to (fffffe), the ABTS is dropped on the
wire. A short 1 frame loopback test is run and completes before the ABTS
times out. The looback is unplugged and the new cable plugged in, and the
an FLOGI to the new device occurs and completes. Due to a mixup in ref
counting the completion of the new FLOGI releases the fabric ndlp. Then the
original ABTS completes and references the released ndlp generating the
oops.
Correct by no-op'ing the ABTS when in loopback mode (it will be dropped
anyway). Added a flag to track the mode to recognize when it should be
no-op'd.
Link: https://lore.kernel.org/r/20220506035519.50908-5-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During testing with repeated asynchronous resets of the target, an issue
was found when the driver issues a LOGO to disconnect its login and recover
all exchanges. The LOGO command takes a node reference but neglects to
remove it, keeping the node reference count artifically high.
Add a call to lpfc_nlp_put() to lpfc_nlp_logo_unreg() and move the mempool
free call to the routine exit along with the needed put. This is always
safe as this will not be the last reference removed as lpfc_unreg_rpi()
ensures there is an additional reference on the ndlp.
Link: https://lore.kernel.org/r/20220506035519.50908-4-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Do not rely on vendor version field of the CSPs to determine if we are in a
FA-PWWN environment. Instead, use the following procedure:
First, during HBA initialization, driver does a READ_CONFIG to determine if
FA-PWWN is configured on the HBA. A LPFC_FAWWPN_CONFIG hba_flag is set
accordingly.
Next, when the link comes up before the driver gets a link up event, the
firmware logs into the fabric with FA-PWWN. If the fabric port does not
support FA-PWWN, the driver will get a Misconfigured FA-WWN async event
before the link up. A LPFC_FAWWPN_FABRIC hba_flag will be set accordingly.
Finally, if the fabric supports FA-PWWN, the firmware will replace its CSPs
WWN with the Fabric Assigned ones. Then after link up, the driver will
retrieve the Fabric Assigned WWN when it does a READ_SPARAM mbox command.
Link: https://lore.kernel.org/r/20220412222008.126521-23-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The intention of this patch is to refactor mailbox memory allocation and
cleanup steps in one routine respectively to prevent memory leaks or memory
errors related to mailbox commands. There are trivial localized fixes as
well.
Provide lpfc_mbox_rsrc_prep() - this routine allocates the dmabuf and the
mbuf associated with it. It also catches allocation errors and returns
status.
Provide lpfc_mbox_rsrc_cleanup() - this routine verifies a dmabuf exists
and if so releases the associated mbuf and the dmabuf memory. It then sets
the ctx_buf to NULL and releases the mailbox memory to the mailbox pool.
Link: https://lore.kernel.org/r/20220412222008.126521-22-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The lpfc_iocbq data structure has void * pointers that are overloaded to be
as many as 8 different data types and the driver translates the void * by
casting. This patch removes the void * pointers by declaring the specific
types needed by the driver. It also expands the context_un to include more
seldom used pointer types to save structure bytes. It also groups the u8
types together to pack the 8 bytes needed. This work allows the lpfc_iocbq
data structure to be more strongly typed and keeps it from being allocated
from the 512 byte slab.
[mkp: rolled in zeroday fix]
Link: https://lore.kernel.org/r/20220412222008.126521-21-jsmart2021@gmail.com
Reported-by: kernel test robot <lkp@intel.com>
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
During an NVMe target reboot, the target may initialize itself as FCP only
during the first RSCN and shortly after trigger a second RSCN claiming NVMe
support. The timing of these RSCNs occur before FCP-PRLI for the first
RSCN completes leading discovery issues over NVMe.
Change RSCN and NVME-PRLI send logic based on a new FC_RSCN_MEMENTO flag
that signals when lpfc_end_rscn() is completed and serves as a memento that
discovery was started from RSCN.
Link: https://lore.kernel.org/r/20220412222008.126521-20-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
When injecting EEH errors the port is getting hung up waiting on the node
list to empty, message number 0233. The driver is stuck at this point and
also can't unload. The driver makes transport remoteport delete calls which
try to abort I/O's, but the EEH daemon has already called the driver to
detach and the detachment has set the global FC_UNLOADING flag. There are
several code paths that will avoid I/O cleanup if the FC_UNLOADING flag is
set, resulting in transports waiting for I/O while the driver is waiting on
transports to clean up.
Additionally, during study of the list, a locking issue was found in
lpfc_sli_abort_iocb_ring that could corrupt the list.
A special case was added to the lpfc_cleanup() routine to call
lpfc_sli_flush_rings() if the driver is FC_UNLOADING and if the pci-slot
is offline (e.g. EEH).
The SLI4 part of lpfc_sli_abort_iocb_ring() is changed to use the
ring_lock. Also added code to cancel the I/Os if the pci-slot is offline
and added checks and returns for the FC_UNLOADING and HBA_IOQ_FLUSH flags
to prevent trying to send an I/O that we cannot handle.
Link: https://lore.kernel.org/r/20220317032737.45308-3-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Following EEH errors, the driver can crash or hang when deleting the
localport or when attempting to unload.
The EEH handlers in the driver did not notify the NVMe-FC transport before
tearing the driver down. This was delayed until the resume steps. This
worked for SCSI because lpfc_block_scsi() would notify the
scsi_fc_transport that the target was not available but it would not clean
up all the references to the ndlp.
The SLI3 prep for dev reset handler did the lpfc_offline_prep() and
lpfc_offline() calls to get the port stopped before restarting. The SLI4
version of the prep for dev reset just destroyed the queues and did not
stop NVMe from continuing. Also because the port was not really stopped
the localport destroy would hang because the transport was still waiting
for I/O. Additionally, a devloss tmo can fire and post events to a stopped
worker thread creating another hang condition.
lpfc_sli4_prep_dev_for_reset() is modified to call lpfc_offline_prep() and
lpfc_offline() rather than just lpfc_scsi_dev_block() to ensure both SCSI
and NVMe transports are notified to block I/O to the driver.
Logic is added to devloss handler and worker thread to clean up ndlp
references and quiesce appropriately.
Link: https://lore.kernel.org/r/20220317032737.45308-2-jsmart2021@gmail.com
Co-developed-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>