-Wflex-array-member-not-at-end is coming in GCC-14, and we are getting
ready to enable it globally.
There are currently a couple of objects (`alloc_head` and `bundle`) in
`struct bundle_priv` that contain a couple of flexible structures:
struct bundle_priv {
/* Must be first */
struct bundle_alloc_head alloc_head;
...
/*
* Must be last. bundle ends in a flex array which overlaps
* internal_buffer.
*/
struct uverbs_attr_bundle bundle;
u64 internal_buffer[32];
};
So, in order to avoid ending up with a couple of flexible-array members
in the middle of a struct, we use the `struct_group_tagged()` helper to
separate the flexible array from the rest of the members in the flexible
structures:
struct uverbs_attr_bundle {
struct_group_tagged(uverbs_attr_bundle_hdr, hdr,
... the rest of the members
);
struct uverbs_attr attrs[];
};
With the change described above, we now declare objects of the type of
the tagged struct without embedding flexible arrays in the middle of
another struct:
struct bundle_priv {
/* Must be first */
struct bundle_alloc_head_hdr alloc_head;
...
struct uverbs_attr_bundle_hdr bundle;
u64 internal_buffer[32];
};
We also use `container_of()` whenever we need to retrieve a pointer
to the flexible structures.
Notice that the `bundle_size` computed in `uapi_compute_bundle_size()`
remains the same.
So, with these changes, fix the following warnings:
drivers/infiniband/core/uverbs_ioctl.c:45:34: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
45 | struct bundle_alloc_head alloc_head;
| ^~~~~~~~~~
drivers/infiniband/core/uverbs_ioctl.c:67:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
67 | struct uverbs_attr_bundle bundle;
| ^~~~~~
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/ZeIgeZ5Sb0IZTOyt@neat
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When a struct containing a flexible array is included in another struct,
and there is a member after the struct-with-flex-array, there is a
possibility of memory overlap. These cases must be audited [1]. See:
struct inner {
...
int flex[];
};
struct outer {
...
struct inner header;
int overlap;
...
};
This is the scenario for all the "struct *_filter" structures that are
included in the following "struct ib_flow_spec_*" structures:
struct ib_flow_spec_eth
struct ib_flow_spec_ib
struct ib_flow_spec_ipv4
struct ib_flow_spec_ipv6
struct ib_flow_spec_tcp_udp
struct ib_flow_spec_tunnel
struct ib_flow_spec_esp
struct ib_flow_spec_gre
struct ib_flow_spec_mpls
The pattern is like the one shown below:
struct *_filter {
...
u8 real_sz[];
};
struct ib_flow_spec_* {
...
struct *_filter val;
struct *_filter mask;
};
In this case, the trailing flexible array "real_sz" is never allocated
and is only used to calculate the size of the structures. Here the use
of the "offsetof" helper can be changed by the "sizeof" operator because
the goal is to get the size of these structures. Therefore, the trailing
flexible arrays can also be removed.
However, due to the trailing padding that can be induced in structs it
is possible that the:
offsetof(struct *_filter, real_sz) != sizeof(struct *_filter)
This situation happens with the "struct ib_flow_ipv6_filter" and to
avoid it the "__packed" macro is used in this structure. But now, the
"sizeof(struct ib_flow_ipv6_filter)" has changed. This is not a problem
since this size is not used in the code.
The situation now is that "sizeof(struct ib_flow_spec_ipv6)" has also
changed (this struct contains the struct ib_flow_ipv6_filter). This is
also not a problem since it is only used to set the size of the "union
ib_flow_spec", which can store all the "ib_flow_spec_*" structures.
Link: https://lore.kernel.org/r/20240217142913.4285-1-erick.archer@gmx.com
Signed-off-by: Erick Archer <erick.archer@gmx.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
64k pages introduce the situation in this diagram when the HCA 4k page
size is being used:
+-------------------------------------------+ <--- 64k aligned VA
| |
| HCA 4k page |
| |
+-------------------------------------------+
| o |
| |
| o |
| |
| o |
+-------------------------------------------+
| |
| HCA 4k page |
| |
+-------------------------------------------+ <--- Live HCA page
|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO| <--- offset
| | <--- VA
| MR data |
+-------------------------------------------+
| |
| HCA 4k page |
| |
+-------------------------------------------+
| o |
| |
| o |
| |
| o |
+-------------------------------------------+
| |
| HCA 4k page |
| |
+-------------------------------------------+
The VA addresses are coming from rdma-core in this diagram can be
arbitrary, but for 64k pages, the VA may be offset by some number of HCA
4k pages and followed by some number of HCA 4k pages.
The current iterator doesn't account for either the preceding 4k pages or
the following 4k pages.
Fix the issue by extending the ib_block_iter to contain the number of DMA
pages like comment [1] says and by using __sg_advance to start the
iterator at the first live HCA page.
The changes are contained in a parallel set of iterator start and next
functions that are umem aware and specific to umem since there is one user
of the rdma_for_each_block() without umem.
These two fixes prevents the extra pages before and after the user MR
data.
Fix the preceding pages by using the __sq_advance field to start at the
first 4k page containing MR data.
Fix the following pages by saving the number of pgsz blocks in the
iterator state and downcounting on each next.
This fix allows for the elimination of the small page crutch noted in the
Fixes.
Fixes: 10c75ccb54 ("RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()")
Link: https://lore.kernel.org/r/20231129202143.1434-2-shiraz.saleem@intel.com
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add support to dump SRQ resource in raw format. It enable drivers to
return the entire device specific SRQ context without setting each
field separately.
Example:
$ rdma res show srq -r
dev hns3 149000...
$ rdma res show srq -j -r
[{"ifindex":0,"ifname":"hns3","data":[149,0,0,...]}]
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Link: https://lore.kernel.org/r/20230918131110.3987498-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
After a change to the bnxt_re driver, it fails to link when
CONFIG_INFINIBAND_USER_ACCESS is disabled:
aarch64-linux-ld: drivers/infiniband/hw/bnxt_re/ib_verbs.o: in function `bnxt_re_handler_BNXT_RE_METHOD_ALLOC_PAGE':
ib_verbs.c:(.text+0xd64): undefined reference to `ib_uverbs_get_ucontext_file'
aarch64-linux-ld: drivers/infiniband/hw/bnxt_re/ib_verbs.o:(.rodata+0x168): undefined reference to `uverbs_idr_class'
aarch64-linux-ld: drivers/infiniband/hw/bnxt_re/ib_verbs.o:(.rodata+0x1a8): undefined reference to `uverbs_destroy_def_handler'
The problem is that the 'bnxt_re_uapi_defs' structure is built
unconditionally and references a couple of functions that are never
really called in this configuration but instead require other functions
that are left out.
Adding an #ifdef around the new code, or a Kconfig dependency would
address this problem, but adding the compile-time check inside of the
UAPI_DEF_CHAIN_OBJ_TREE_NAMED() macro seems best because that also
addresses the problem in other drivers that may run into the same
dependency.
Fixes: 360da60d6c ("RDMA/bnxt_re: Enable low latency push")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make it clearer what is going on by adding a function to go back from the
"virtual" dma_addr to a kva and another to a struct page. This is used in the
ib_uses_virt_dma() style drivers (siw, rxe, hfi, qib).
Call them instead of a naked casting and virt_to_page() when working with dma_addr
values encoded by the various ib_map functions.
This also fixes the virt_to_page() casting problem Linus Walleij has been
chasing.
Cc: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v2-05ea785520ed+10-ib_virt_page_jgg@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This commit extends the RDMA kernel verbs ABI to support the flush
operation defined in IBA A19.4.1. These changes are
backward compatible with the existing RDMA kernel verbs ABI.
It makes device/HCA support new FLUSH attributes/capabilities, and it
also makes memory region support new FLUSH access flags.
Users can use ibv_reg_mr(3) to register flush access flags. Only the
access flags also supported by device's capabilities can be registered
successfully.
Once registered successfully, it means the MR is flushable. Similarly,
A flushable MR should also have one or both of GLOBAL_VISIBILITY and
PERSISTENT attributes/capabilities like device/HCA.
Link: https://lore.kernel.org/r/20221206130201.30986-3-lizhijian@fujitsu.com
Reviewed-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This will cause an informative backtrace to print if the user of
ib_device_set_netdev() isn't careful about tearing down the ibdevice
before its the netdevice parent is destroyed. Such as like this:
unregister_netdevice: waiting for vlan0 to become free. Usage count = 2
leaked reference.
ib_device_set_netdev+0x266/0x730
siw_newlink+0x4e0/0xfd0
nldev_newlink+0x35c/0x5c0
rdma_nl_rcv_msg+0x36d/0x690
rdma_nl_rcv+0x2ee/0x430
netlink_unicast+0x543/0x7f0
netlink_sendmsg+0x918/0xe20
sock_sendmsg+0xcf/0x120
____sys_sendmsg+0x70d/0x8b0
___sys_sendmsg+0x11d/0x1b0
__sys_sendmsg+0xfa/0x1d0
do_syscall_64+0x35/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
This will help debug the issues syzkaller is seeing.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-a7c81b3842ce+e5-netdev_tracker_jgg@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This uses the same passing protocol as UVERBS_ATTR_FD (eg len = 0 data_s64
= fd), except that the FD is not required to be a uverbs object and the
core code does not covert the FD to an object handle automatically.
Access to the int fd is provided by uverbs_get_raw_fd().
Link: https://lore.kernel.org/r/2-v1-bd147097458e+ede-umem_dmabuf_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In inter-subnet cases, when inbound/outbound PRs are available,
outbound_PR.dlid is used as the requestor's datapath DLID and
inbound_PR.dlid is used as the responder's DLID. The inbound_PR.dlid
is passed to responder side with the "ConnectReq.Primary_Local_Port_LID"
field. With this solution the PERMISSIVE_LID is no longer used in
Primary Local LID field.
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/b3f6cac685bce9dde37c610be82e2c19d9e51d9e.1662631201.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Support receiving inbound and outbound IB path records (along with GMP
PathRecord) from user-space service through the RDMA netlink channel.
The LIDs in these 3 PRs can be used in this way:
1. GMP PR: used as the standard local/remote LIDs;
2. DLID of outbound PR: Used as the "dlid" field for outbound traffic;
3. DLID of inbound PR: Used as the "dlid" field for outbound traffic in
responder side.
This is aimed to support adaptive routing. With current IB routing
solution when a packet goes out it's assigned with a fixed DLID per
target, meaning a fixed router will be used.
The LIDs in inbound/outbound path records can be used to identify group
of routers that allow communication with another subnet's entity. With
them packets from an inter-subnet connection may travel through any
router in the set to reach the target.
As confirmed with Jason, when sending a netlink request, kernel uses
LS_RESOLVE_PATH_USE_ALL so that the service knows kernel supports
multiple PRs.
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/2fa2b6c93c4c16c8915bac3cfc4f27be1d60519d.1662631201.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Pull dma-mapping updates from Christoph Hellwig:
- convert arm32 to the common dma-direct code (Arnd Bergmann, Robin
Murphy, Christoph Hellwig)
- restructure the PCIe peer to peer mapping support (Logan Gunthorpe)
- allow the IOMMU code to communicate an optional DMA mapping length
and use that in scsi and libata (John Garry)
- split the global swiotlb lock (Tianyu Lan)
- various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang,
Lukas Bulwahn, Robin Murphy)
* tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping: (45 commits)
swiotlb: fix passing local variable to debugfs_create_ulong()
dma-mapping: reformat comment to suppress htmldoc warning
PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()
RDMA/rw: drop pci_p2pdma_[un]map_sg()
RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
nvme-pci: convert to using dma_map_sgtable()
nvme-pci: check DMA ops when indicating support for PCI P2PDMA
iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
iommu: Explicitly skip bus address marked segments in __iommu_map_sg()
dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
dma-direct: support PCI P2PDMA pages in dma-direct map_sg
dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
PCI/P2PDMA: Attempt to set map_type if it has not been set
lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
swiotlb: clean up some coding style and minor issues
dma-mapping: update comment after dmabounce removal
scsi: sd: Add a comment about limiting max_sectors to shost optimal limit
ata: libata-scsi: cap ata_device->max_sectors according to shost->max_sectors
scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit
...
Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
if a given ib_device can be used in P2PDMA transfers. This ensures
the ib_device is not using virt_dma and also that the underlying
dma_device supports P2PDMA.
Use the new helper in nvme-rdma to replace the existing check for
ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
switching away from pci_p2pdma_[un]map_sg().
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Add a netevent callback for cma, mainly to catch NETEVENT_NEIGH_UPDATE.
Previously, when a system with failover MAC mechanism change its MAC address
during a CM connection attempt, the RDMA-CM would take a lot of time till
it disconnects and timesout due to the incorrect MAC address.
Now when we get a NETEVENT_NEIGH_UPDATE we check if it is due to a failover
MAC change and if so, we instantly destroy the CM and notify the user in order
to spare the unnecessary waiting for the timeout.
Link: https://lore.kernel.org/r/bb255c9e301cd50b905663b8e73f7f5133d0e4c5.1654601342.git.leonro@nvidia.com
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Leon Romanovsky says:
====================
Mellanox shared branch that includes:
* Removal of FPGA TLS code https://lore.kernel.org/all/cover.1649073691.git.leonro@nvidia.com
Mellanox INNOVA TLS cards are EOL in May, 2018 [1]. As such, the code
is unmaintained, untested and not in-use by any upstream/distro oriented
customers. In order to reduce code complexity, drop the kernel code,
clean build config options and delete useless kTLS vs. TLS separation.
[1] https://network.nvidia.com/related-docs/eol/LCR-000286.pdf
* Removal of FPGA IPsec code https://lore.kernel.org/all/cover.1649232994.git.leonro@nvidia.com
Together with FPGA TLS, the IPsec went to EOL state in the November of
2019 [1]. Exactly like FPGA TLS, no active customers exist for this
upstream code and all the complexity around that area can be deleted.
[2] https://network.nvidia.com/related-docs/eol/LCR-000535.pdf
* Fix to undefined behavior from Borislav https://lore.kernel.org/all/20220405151517.29753-11-bp@alien8.de
====================
* 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Remove not-implemented IPsec capabilities
net/mlx5: Remove ipsec_ops function table
net/mlx5: Reduce kconfig complexity while building crypto support
net/mlx5: Move IPsec file to relevant directory
net/mlx5: Remove not-needed IPsec config
net/mlx5: Align flow steering allocation namespace to common style
net/mlx5: Unify device IPsec capabilities check
net/mlx5: Remove useless IPsec device checks
net/mlx5: Remove ipsec vs. ipsec offload file separation
RDMA/core: Delete IPsec flow action logic from the core
RDMA/mlx5: Drop crypto flow steering API
RDMA/mlx5: Delete never supported IPsec flow action
net/mlx5: Remove FPGA ipsec specific statistics
net/mlx5: Remove XFRM no_trailer flag
net/mlx5: Remove not-used IDA field from IPsec struct
net/mlx5: Delete metadata handling logic
net/mlx5_fpga: Drop INNOVA IPsec support
IB/mlx5: Fix undefined behavior due to shift overflowing the constant
net/mlx5: Cleanup kTLS function names and their exposure
net/mlx5: Remove tls vs. ktls separation as it is the same
net/mlx5: Remove indirection in TLS build
net/mlx5: Reliably return TLS device capabilities
net/mlx5_fpga: Drop INNOVA TLS support
Link: https://lore.kernel.org/r/20220409055303.1223644-1-leon@kernel.org
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Split out flags from ib_device::device_cap_flags that are only used
internally to the kernel into kernel_cap_flags that is not part of the
uapi. This limits the device_cap_flags to being the same bitmap that will
be copied to userspace.
This cleanly splits out the uverbs flags from the kernel flags to avoid
confusion in the flags bitmap.
Add some short comments describing which each of the kernel flags is
connected to. Remove unused kernel flags.
Link: https://lore.kernel.org/r/0-v2-22c19e565eef+139a-kern_caps_jgg@nvidia.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There are no remaining callers of set_fs(), so CONFIG_SET_FS
can be removed globally, along with the thread_info field and
any references to it.
This turns access_ok() into a cheaper check against TASK_SIZE_MAX.
As CONFIG_SET_FS is now gone, drop all remaining references to
set_fs()/get_fs(), mm_segment_t, user_addr_max() and uaccess_kernel().
Acked-by: Sam Ravnborg <sam@ravnborg.org> # for sparc32 changes
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Sergey Matyukevich <sergey.matyukevich@synopsys.com> # for arc changes
Acked-by: Stafford Horne <shorne@gmail.com> # [openrisc, asm-generic]
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Introduce ib_umem_dmabuf_get_pinned() which allows the driver to get a
dmabuf umem which is pinned and does not require move_notify callback
implementation.
The returned umem is pinned and DMA mapped like standard cpu umems, and is
released through ib_umem_release() (incl. unpinning and unmapping).
Link: https://lore.kernel.org/r/20211012120903.96933-3-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ib_dma_map_sgtable_attrs() should be mapping the sgls and setting nents
but the ib_uses_virt_dma() path falls back to ib_dma_virt_map_sg() which
will not set the nents in the sgtable.
Check the return value (per the map_sg calling convention) and set
sgt->nents appropriately on success.
Fixes: 79fbd3e124 ("RDMA: Use the sg_table directly and remove the opencoded version from umem")
Link: https://lore.kernel.org/r/20211013165942.89806-1-logang@deltatee.com
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
An optional counter is a driver-specific counter that may be dynamically
enabled/disabled. This enhancement allows drivers to expose counters
which are, for example, mutually exclusive and cannot be enabled at the
same time, counters that might degrades performance, optional debug
counters, etc.
Optional counters are marked with IB_STAT_FLAG_OPTIONAL flag. They are not
exported in sysfs, and must be at the end of all stats, otherwise the
attr->show() in sysfs would get wrong indexes for hwcounters that are
behind optional counters.
Link: https://lore.kernel.org/r/20211008122439.166063-7-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>