Currently, the erdma driver supports both the iWARP and RoCEv2 protocols.
The erdma driver reads the ERDMA_REGS_DEV_PROTO_REG register to identify
the protocol used by the erdma device. Since each protocol requires
different ib_device_ops, we introduce the erdma_device_ops_iwarp and
erdma_device_ops_rocev2 for iWARP and RoCEv2 protocols, respectively.
Signed-off-by: Boshi Yu <boshiyu@linux.alibaba.com>
Link: https://patch.msgid.link/20241211020930.68833-2-boshiyu@linux.alibaba.com
Reviewed-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The "gl->tot_len" variable is controlled by the user. It comes from
process_responses(). On 32bit systems, the "gl->tot_len + sizeof(struct
cpl_pass_accept_req) + sizeof(struct rss_header)" addition could have an
integer wrapping bug. Use size_add() to prevent this.
Fixes: 1cab775c3e ("RDMA/cxgb4: Fix LE hash collision bug for passive open connection")
Link: https://patch.msgid.link/r/86b404e1-4a75-4a35-a34e-e3054fa554c7@stanley.mountain
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The current ODP counters represent the total number of pages
handled, but it is not enough to understand the effectiveness
of these operations.
Extend the ODP counters to include the number of times page fault
and invalidation events were handled.
Example for a single page fault handling 512 pages:
- page_fault: incremented by 512 (total pages)
- page_fault_handled: incremented by 1 (operation count)
The same example is applicable for page invalidation too.
Previous output:
$ rdma stat mr
dev rocep8s0f0 mrn 8 page_faults 27 page_invalidations 0 page_prefetch 29
New output:
$ rdma stat mr
dev rocep8s0f0 mrn 21 page_faults 512 page_faults_handled 1
page_invalidations 0 page_invalidations_handled 0 page_prefetch 51200
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/b18f29ed1392996ade66e9e6c45f018925253f6a.1733234165.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Smatch generates the following false error report:
drivers/infiniband/hw/mlx4/main.c:393 mlx4_ib_del_gid() error: uninitialized symbol 'gids'.
Traditionally, we are not changing kernel code and asking people to fix
the tools. However in this case, the fix can be done by simply rearranging
the code to be more clear.
Fixes: e26be1bfef ("IB/mlx4: Implement ib_device callbacks")
Link: https://patch.msgid.link/6a3a1577463da16962463fcf62883a87506e9b62.1733233426.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Gen P7 supports up to 13 SGEs for now. WQE software structure
can hold only 6 now. Since the max send sge is reported as
13, the stack can give requests up to 13 SGEs. This is causing
traffic failures and system crashes.
Use the define for max SGE supported for variable size. This
will work for both static and variable WQEs.
Fixes: 227f51743b ("RDMA/bnxt_re: Fix the max WQE size for static WQE support")
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/20241204075416.478431-2-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Pull driver core updates from Greg KH:
"Here is a small set of driver core changes for 6.13-rc1.
Nothing major for this merge cycle, except for the two simple merge
conflicts are here just to make life interesting.
Included in here are:
- sysfs core changes and preparations for more sysfs api cleanups
that can come through all driver trees after -rc1 is out
- fw_devlink fixes based on many reports and debugging sessions
- list_for_each_reverse() removal, no one was using it!
- last-minute seq_printf() format string bug found and fixed in many
drivers all at once.
- minor bugfixes and changes full details in the shortlog"
* tag 'driver-core-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (35 commits)
Fix a potential abuse of seq_printf() format string in drivers
cpu: Remove spurious NULL in attribute_group definition
s390/con3215: Remove spurious NULL in attribute_group definition
perf: arm-ni: Remove spurious NULL in attribute_group definition
driver core: Constify bin_attribute definitions
sysfs: attribute_group: allow registration of const bin_attribute
firmware_loader: Fix possible resource leak in fw_log_firmware_info()
drivers: core: fw_devlink: Fix excess parameter description in docstring
driver core: class: Correct WARN() message in APIs class_(for_each|find)_device()
cacheinfo: Use of_property_present() for non-boolean properties
cdx: Fix cdx_mmap_resource() after constifying attr in ->mmap()
drivers: core: fw_devlink: Make the error message a bit more useful
phy: tegra: xusb: Set fwnode for xusb port devices
drm: display: Set fwnode for aux bus devices
driver core: fw_devlink: Stop trying to optimize cycle detection logic
driver core: Constify attribute arguments of binary attributes
sysfs: bin_attribute: add const read/write callback variants
sysfs: implement all BIN_ATTR_* macros in terms of __BIN_ATTR()
sysfs: treewide: constify attribute callback of bin_attribute::llseek()
sysfs: treewide: constify attribute callback of bin_attribute::mmap()
...
Pull rdma updates from Jason Gunthorpe:
"Seveal fixes scattered across the drivers and a few new features:
- Minor updates and bug fixes to hfi1, efa, iopob, bnxt, hns
- Force disassociate the userspace FD when hns does an async reset
- bnxt new features for optimized modify QP to skip certain stayes,
CQ coalescing, better debug dumping
- mlx5 new data placement ordering feature
- Faster destruction of mlx5 devx HW objects
- Improvements to RDMA CM mad handling"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (51 commits)
RDMA/bnxt_re: Correct the sequence of device suspend
RDMA/bnxt_re: Use the default mode of congestion control
RDMA/bnxt_re: Support different traffic class
IB/cm: Rework sending DREQ when destroying a cm_id
IB/cm: Do not hold reference on cm_id unless needed
IB/cm: Explicitly mark if a response MAD is a retransmission
RDMA/mlx5: Move events notifier registration to be after device registration
RDMA/bnxt_re: Cache MSIx info to a local structure
RDMA/bnxt_re: Refurbish CQ to NQ hash calculation
RDMA/bnxt_re: Refactor NQ allocation
RDMA/bnxt_re: Fail probe early when not enough MSI-x vectors are reserved
RDMA/hns: Fix different dgids mapping to the same dip_idx
RDMA/bnxt_re: Add set_func_resources support for P5/P7 adapters
RDMA/bnxt_re: Enhance RoCE SRIOV resource configuration design
bnxt_en: Add support for RoCE sriov configuration
RDMA/hns: Fix NULL pointer derefernce in hns_roce_map_mr_sg()
RDMA/hns: Fix out-of-order issue of requester when setting FENCE
RDMA/nldev: Add IB device and net device rename events
RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation
RDMA/core: Move ib_uverbs_file struct to uverbs_types.h
...
When in fatal error condition, mark device as detached first
and then complete all pending HWRM commands as firmware is not
going to process them and eventually time out. Move the device
to error only if suspend is called when device is in Fatal state.
Also, remove some outdated comments. Remove the stop_irq call
which is no longer required.
Fixes: cc5b9b48d4 ("RDMA/bnxt_re: Recover the device when FW error is detected")
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1731660464-27838-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
There are few use cases where CQ create and destroy
is seen before re-creating the CQ, this kind of use
case is disturbing the RR distribution and all the
active CQ getting mapped to only 2 NQ alternatively.
Fixing the CQ to NQ hash calculation by implementing
a quick load sorting mechanism under a mutex.
Using this, if the CQ was allocated and destroyed
before using it, the nq selecting algorithm still
obtains the least loaded CQ. Thus balancing the load
on NQs.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/1731577748-1804-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
DIP algorithm requires a one-to-one mapping between dgid and dip_idx.
Currently a queue 'spare_idx' is used to store QPN of QPs that use
DIP algorithm. For a new dgid, use a QPN from spare_idx as dip_idx.
This method lacks a mechanism for deduplicating QPN, which may result
in different dgids sharing the same dip_idx and break the one-to-one
mapping requirement.
This patch replaces spare_idx with xarray and introduces a refcnt of
a dip_idx to indicate the number of QPs that using this dip_idx.
The state machine for dip_idx management is implemented as:
* The entry at an index in xarray is empty -- This indicates that the
corresponding dip_idx hasn't been created.
* The entry at an index in xarray is not empty but with 0 refcnt --
This indicates that the corresponding dip_idx has been created but
not used as dip_idx yet.
* The entry at an index in xarray is not empty and with non-0 refcnt --
This indicates that the corresponding dip_idx is being used by refcnt
number of DIP QPs.
Fixes: eb653eda1e ("RDMA/hns: Bugfix for incorrect association between dip_idx and dgid")
Fixes: f91696f2f0 ("RDMA/hns: Support congestion control type selection according to the FW")
Signed-off-by: Feng Fang <fangfeng4@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20241112055553.3681129-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The FENCE indicator in hns WQE doesn't ensure that response data from
a previous Read/Atomic operation has been written to the requester's
memory before the subsequent Send/Write operation is processed. This
may result in the subsequent Send/Write operation accessing the original
data in memory instead of the expected response data.
Unlike FENCE, the SO (Strong Order) indicator blocks the subsequent
operation until the previous response data is written to memory and a
bresp is returned. Set the SO indicator instead of FENCE to maintain
strict order.
Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20241108075743.2652258-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Implement the device API for ufile_hw_cleanup operation, which
iterates over the ufile uobjects lists, and attempts to destroy
DevX QPs, by issuing up to 8 commands in parallel.
This function is responsible only for cleaning the FW resources of the
QP, and doesn't necessarily cleanup all of its resources.
Hence the normal serialized cleanup flow is still executed after it
in __uverbs_cleanup_ufile() to cleanup the remaining resources and
handle the cleanup of SW objects.
In order to avoid double cleanup for the FW resources, new DevX flag
was added DEVX_OBJ_FLAGS_HW_FREED, which marks the object's FW resources
as already freed.
Since QP destruction is the most time-consuming operation in FW,
parallelizing it reduces the cleanup time of applications that use
DevX QPs.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://patch.msgid.link/2f82675d0412542cba1c47a6b86f589521ae41e1.1730373303.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Fix a race condition when creating a lag bond in active backup
mode where after the bond creation the backup slave was
attached to the IB device, instead of the active slave.
This caused stale entries in the GID table, as the gid updating
mechanism relies on ib_device_get_netdev(), which would return
the backup slave.
Send an MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE
event when activating the lag, additionally to when modifying
the lag. This ensures that eventually the active netdevice is
stored in the bond IB device.
When handling this event remove the GIDs of the previously
attached netdevice in this port and rescan the GIDs of the
newly attached netdevice.
This ensures that eventually the active slave netdevice is
correctly stored in the IB device port. While there might be
a brief moment where the backup slave GIDs appear in the GID
table, it will eventually stabilize with the correct GIDs
(of the bond and the active slave).
Fixes: 8d159eb211 ("RDMA/mlx5: Use IB set_netdev and get_netdev functions")
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/91fc2cb24f63add266a528c1c702668a80416d9f.1730381292.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Support QP with out-of-order (OOO) capabilities enabled.
This allows WRs on the receiver side of the QP to be consumed OOO,
permitting the sender side to transmit messages without guaranteeing
arrival order on the receiver side.
When enabled, the completion ordering of WRs remains in-order,
regardless of the Receive WRs consumption order.
RDMA Read and RDMA Atomic operations on the responder side continue to
be executed in-order, while the ordering of data placement for RDMA
Write and Send operations is not guaranteed.
Atomic operations larger than 8 bytes are currently not supported.
Therefore, when this feature is enabled, the created QP restricts its
atomic support to 8 bytes at most.
In addition, when querying the device, a new flag is returned in
response to indicate that the Kernel supports OOO QP.
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/06ac609a5f358c8fb0a090d22c61a2f9329d82e6.1725362773.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>