Add support for DMABUF MR registrations with Data-direct device.
Upon userspace calling to register a DMABUF MR with the data direct bit
set, the below algorithm will be followed.
1) Obtain a pinned DMABUF umem from the IB core using the user input
parameters (FD, offset, length) and the DMA PF device. The DMA PF
device is needed to allow the IOMMU to enable the DMA PF to access the
user buffer over PCI.
2) Create a KSM MKEY by setting its entries according to the user buffer
VA to IOVA mapping, with the MKEY being the data direct device-crossed
MKEY. This KSM MKEY is umrable and will be used as part of the MR cache.
The PD for creating it is the internal device 'data direct' kernel one.
3) Create a crossing MKEY that points to the KSM MKEY using the crossing
access mode.
4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs
managed on the mlx5_ib device.
5) Return the crossing MKEY to the user, created with its supplied PD.
Upon DMA PF unbind flow, the driver will revoke the KSM entries.
The final deregistration will occur under the hood once the application
deregisters its MKEY.
Notes:
- This version supports only the PINNED UMEM mode, so there is no
dependency on ODP.
- The IOVA supplied by the application must be system page aligned due to
HW translations of KSM.
- The crossing MKEY will not be umrable or part of the MR cache, as we
cannot change its crossed (i.e. KSM) MKEY over UMR.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
UMR QP is not used in some cases, so move QP and its CQ creations from
driver load flow to the time first reg_mr occurs, that is when MR
interfaces are first called.
The initialization of dev->umrc.pd and dev->umrc.lock is still done in
driver load because pd is needed for mlx5_mkey_cache_init and the lock
is reused to protect against the concurrent creation.
When testing 4G bytes memory registration latency with rtool [1] and 8
threads in parallel, there is minor performance degradation (<5% for
the max latency) is seen for the first reg_mr with this change.
Link: https://github.com/paravmellanox/rtool [1]
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Link: https://lore.kernel.org/r/55d3c4f8a542fd974d8a4c5816eccfb318a59b38.1717409369.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently, mkeys are managed via xarray. This implementation leads to
a degradation in cases many MRs are unregistered in parallel, due to xarray
internal implementation, for example: deregistration 1M MRs via 64 threads
is taking ~15% more time[1].
Hence, implement mkeys management via LIFO queue, which solved the
degradation.
[1]
2.8us in kernel v5.19 compare to 3.2us in kernel v6.4
Signed-off-by: Shay Drory <shayd@nvidia.com>
Link: https://lore.kernel.org/r/fde3d4cfab0f32f0ccb231cd113298256e1502c5.1695283384.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
According to PCIe spec, Enable Relaxed Ordering value in the VF's PCI
config space is wired to 0 and PF relaxed ordering (RO) setting should
be applied to the VF. In QEMU (and maybe others), when assigning VFs,
the RO bit in PCI config space is not emulated properly and is always
set to 0.
Therefore, pcie_relaxed_ordering_enabled() always returns 0 for VFs and
VMs and thus MKeys can't be created with RO read even if the PF supports
it.
pcie_relaxed_ordering_enabled() check was added to avoid a syndrome when
creating a MKey with relaxed ordering (RO) enabled when the driver's
relaxed_ordering_read_pci_enabled HCA capability is out of sync with FW.
With the new relaxed_ordering_read capability this can't happen, as it's
set regardless of RO value in PCI config space and thus can't change
during runtime.
Hence, to allow RO read in VFs and VMs, use the new HCA capability
relaxed_ordering_read without checking pcie_relaxed_ordering_enabled().
The old capability checks are kept for backward compatibility with older
FWs.
Allowing RO in VFs and VMs is valuable since it can greatly improve
performance on some setups. For example, testing throughput of a VF on
an AMD EPYC 7763 and ConnectX-6 Dx setup showed roughly 60% performance
improvement.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Link: https://lore.kernel.org/r/e7048640d66c341a8fa0465e099926e7989184bc.1681131553.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
relaxed_ordering_read HCA capability is set if both the device supports
relaxed ordering (RO) read and RO is set in PCI config space.
RO in PCI config space can change during runtime. This will change the
value of relaxed_ordering_read HCA capability in FW, but the driver will
not see it since it queries the capabilities only once.
This can lead to the following scenario:
1. RO in PCI config space is enabled.
2. User creates MKey without RO.
3. RO in PCI config space is disabled.
As a result, relaxed_ordering_read HCA capability is turned off in FW
but remains on in driver copy of the capabilities.
4. User requests to reconfig the MKey with RO via UMR.
5. Driver will try to reconfig the MKey with RO read although it
shouldn't (as relaxed_ordering_read HCA capability is really off).
To fix this, check pcie_relaxed_ordering_enabled() before setting RO
read in UMR.
Fixes: 896ec97353 ("RDMA/mlx5: Set mkey relaxed ordering by UMR with ConnectX-7")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Link: https://lore.kernel.org/r/8d39eb8317e7bed1a354311a20ae707788fd94ed.1681131553.git.leon@kernel.org
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Per the device spec, MLX5_UMR_MTT_ALIGNMENT is good not only for UMR MTT
entries, but for all other entries as well, like KLMs and KSMs.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The cited commit removed from the cleanup flow of umr the checks
if the resources were created. This could lead to null-ptr-deref
in case that we had failure in mlx5_ib_stage_ib_reg_init stage.
Fix it by adding new state to the umr that can say if the resources
were created or not and check it in the umr cleanup flow before
destroying the resources.
Fixes: 04876c12c1 ("RDMA/mlx5: Move init and cleanup of UMR to umr.c")
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/4cfa61386cf202e9ce330e8d228ce3b25a36326e.1661763459.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Introduce mlx5_umr_post_send_wait() that uses a UMR adjusted flow for
posting WQEs. The next patches will gradually move UMR operations to use
this flow. Once done, will get rid of mlx5_ib_post_send_wait().
mlx5_umr_post_send_wait gets already written WQE segments and will only
memcpy it to the SQ. This way, we avoid packing all the data in a WR just
to unpack it into the WQE.
Link: https://lore.kernel.org/r/f027dd592fde62402b2d49efded8d1d22229d22b.1649747695.git.leonro@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>