Commit Graph

14239 Commits

Author SHA1 Message Date
Bob Pearson
cd8aaddf0d RDMA/rxe: Remove save/rollback_state in rxe_requester
Now that req.task and comp.task are merged it is no longer necessary to
call save_state() before calling rxe_xmit_pkt() and rollback_state() if
rxe_xmit_pkt() fails. This was done originally to prevent races between
rxe_completer() and rxe_requester() which now cannot happen.

Link: https://lore.kernel.org/r/20240329145513.35381-8-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
67f57892f9 RDMA/rxe: Merge request and complete tasks
Currently the rxe driver has three work queue tasks per qp.  These are the
req.task, comp.task and resp.task which call rxe_requester(),
rxe_completer() and rxe_responder() respectively directly or on work
queues. Each of these subroutines checks to see if there is work to be
performed on the send queue or on the response packet queue or the request
packet queue and will run until there is no work remaining or yield the
cpu and reschedule itself until there is no work remaining.

This commit combines the req.task and comp.task into a single send.task
and renames the resp.task to the recv.task. The combined send.task calls
rxe_requester() and rxe_completer() serially and continues until all work
on both the send queue and the response packet queue are done.

In various benchmarks the performance is either improved or left the
same. At high scale there is a significant reduction in the load on the
cpu.

This is the first step in combining these two tasks. Once they are
serialized cross rescheduling of req.task and comp.task can be more
efficiently handled by just letting the send.task continue to run. This
will be done in the next several patches.

Link: https://lore.kernel.org/r/20240329145513.35381-7-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
ff30e45376 RDMA/rxe: Remove redundant scheduling of rxe_completer
In rxe_post_send_kernel() if the qp is in the error state after posting
the work requests the rxe_completer() task is scheduled.

But, the only way to move the qp into the error state is to call
rxe_qp_error() which also schedules the rxe_completer() task to drain the
queues. Calling it a second time has no effect. This commit removes the
redundant call.

Link: https://lore.kernel.org/r/20240329145513.35381-6-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:33 -03:00
Bob Pearson
b703374837 RDMA/rxe: Allow good work requests to be executed
A previous commit incorrectly added an 'if(!err)' before scheduling the
requester task in rxe_post_send_kernel(). But if there were send wrs
successfully added to the send queue before a bad wr they might never get
executed.

This commit fixes this by scheduling the requester task if any wqes were
successfully posted in rxe_post_send_kernel() in rxe_verbs.c.

Link: https://lore.kernel.org/r/20240329145513.35381-5-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 5bf944f241 ("RDMA/rxe: Add error messages")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:32 -03:00
Bob Pearson
2b23b60973 RDMA/rxe: Fix seg fault in rxe_comp_queue_pkt
In rxe_comp_queue_pkt() an incoming response packet skb is enqueued to the
resp_pkts queue and then a decision is made whether to run the completer
task inline or schedule it. Finally the skb is dereferenced to bump a 'hw'
performance counter. This is wrong because if the completer task is
already running in a separate thread it may have already processed the skb
and freed it which can cause a seg fault.  This has been observed
infrequently in testing at high scale.

This patch fixes this by changing the order of enqueuing the packet until
after the counter is accessed.

Link: https://lore.kernel.org/r/20240329145513.35381-4-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 0b1e5b99a4 ("IB/rxe: Add port protocol stats")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-04-22 16:54:32 -03:00
Michael Guralnik
ca0b44e20a IB/core: Implement a limit on UMAD receive List
The existing behavior of ib_umad, which maintains received MAD
packets in an unbounded list, poses a risk of uncontrolled growth.
As user-space applications extract packets from this list, the rate
of extraction may not match the rate of incoming packets, leading
to potential list overflow.

To address this, we introduce a limit to the size of the list. After
considering typical scenarios, such as OpenSM processing, which can
handle approximately 100k packets per second, and the 1-second retry
timeout for most packets, we set the list size limit to 200k. Packets
received beyond this limit are dropped, assuming they are likely timed
out by the time they are handled by user-space.

Notably, packets queued on the receive list due to reasons like
timed-out sends are preserved even when the list is full.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/7197cb58a7d9e78399008f25036205ceab07fbd5.1713268818.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-21 14:00:35 +03:00
Chengchang Tang
349e859952 RDMA/hns: Modify the print level of CQE error
Too much print may lead to a panic in kernel. Change ibdev_err() to
ibdev_err_ratelimited(), and change the printing level of cqe dump
to debug level.

Fixes: 7c044adca2 ("RDMA/hns: Simplify the cqe code of poll cq")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-11-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
4125269bb9 RDMA/hns: Use complete parentheses in macros
Use complete parentheses to ensure that macro expansion does
not produce unexpected results.

Fixes: a25d13cbe8 ("RDMA/hns: Add the interfaces to support multi hop addressing for the contexts in hip08")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-10-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
wenglianfa
9a84848dce RDMA/hns: Add mutex_destroy()
Add mutex_destroy().

Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-9-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
ee04549328 RDMA/hns: Fix GMV table pagesize
GMV's BA table only supports 4K pages. Currently, PAGESIZE is used to
calculate gmv_bt_num, which will cause an abnormal number of gmv_bt_num
in a 64K OS.

Fixes: d6d91e4621 ("RDMA/hns: Add support for configuring GMV table")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-8-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
wenglianfa
dc3bda6e56 RDMA/hns: Fix mismatch exception rollback
When dma_alloc_coherent() fails in hns_roce_alloc_hem(), just call
kfree() to release hem instead of hns_roce_free_hem().

Fixes: c00743cbf2 ("RDMA/hns: Simplify 'struct hns_roce_hem' allocation")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
a942ec2745 RDMA/hns: Fix UAF for cq async event
The refcount of CQ is not protected by locks. When CQ asynchronous
events and CQ destruction are concurrent, CQ may have been released,
which will cause UAF.

Use the xa_lock() to protect the CQ refcount.

Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
b46494b6f9 RDMA/hns: Fix deadlock on SRQ async events.
xa_lock for SRQ table may be required in AEQ. Use xa_store_irq()/
xa_erase_irq() to avoid deadlock.

Fixes: 81fce6291d ("RDMA/hns: Add SRQ asynchronous event support")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
2ce384307f RDMA/hns: Add max_ah and cq moderation capacities in query_device()
Add max_ah and cq moderation capacities to hns_roce_query_device().

Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Chengchang Tang
f4caa864af RDMA/hns: Remove unused parameters and variables
Remove unused parameters and variables.

Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Yangyang Li
bfb6be4014 RDMA/hns: Use macro instead of magic number
Use macro instead of magic number.

Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240412091616.370789-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 15:06:47 +03:00
Zhengchao Shao
203b70fda6 RDMA/hns: Fix return value in hns_roce_map_mr_sg
As described in the ib_map_mr_sg function comment, it returns the number
of sg elements that were mapped to the memory region. However,
hns_roce_map_mr_sg returns the number of pages required for mapping the
DMA area. Fix it.

Fixes: 9b2cf76c9f ("RDMA/hns: Optimize PBL buffer allocation process")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20240411033851.2884771-1-shaozhengchao@huawei.com
Reviewed-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:59:08 +03:00
Konstantin Taranov
8859f009ac RDMA/mana_ib: Configure mac address in RNIC
Set local mac address in RNIC, which is required by the HW.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-7-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
faafb8b126 RDMA/mana_ib: Adding and deleting GIDs
Implement add_gid and del_gid for RNIC.
IPv4 and IPv6 addresses are supported.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-6-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
8b184e4f1c RDMA/mana_ib: Enable RoCE on port 1
Set netdev and RoCEv2 flag to enable GID population on port 1.
Use GIDs of the master netdev. As mc->ports[] stores slave devices,
use a helper to get the master netdev.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-5-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
4bda1d5332 RDMA/mana_ib: Implement port parameters
Implement port parameters for RNIC:
1) extend query_port() method
2) implement get_link_layer()
3) implement query_pkey()

Only port 1 can store GIDs.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-4-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
1a79c2b9d4 RDMA/mana_ib: Create and destroy rnic adapter
Add functions for RNIC creation and destruction.
If creation fails, the ib_probe fails as well.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-3-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
98b889c439 RDMA/mana_ib: Add EQ creation for rnic adapter
Create an error EQ for the RNIC adapter.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712738551-22075-2-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 14:28:26 +03:00
Konstantin Taranov
23f59f4e83 RDMA/mana_ib: Use num_comp_vectors of ib_device
Use num_comp_vectors of struct ib_device instead of max_num_queues
from gdma_context.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712911656-17352-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-16 13:29:47 +03:00
Jakub Kicinski
0e36c21d76 Merge branch mana-ib-flex of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
Erick Archer says:

====================
mana: Add flex array to struct mana_cfg_rx_steer_req_v2 (part)

The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of
trailing elements. Specifically, it uses a "mana_handle_t" array. So,
use the preferred way in the kernel declaring a flexible array [1].

At the same time, prepare for the coming implementation by GCC and Clang
of the __counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time via
CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
strcpy/memcpy-family functions).

Also, avoid the open-coded arithmetic in the memory allocator functions
[2] using the "struct_size" macro.

Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.

Now, it is also possible to use the "flex_array_size" helper to compute
the size of these trailing elements in the "memcpy" function.

Specifically, the first commit adds the flex member and the patches 2 and
3 refactor the consumers of the "struct mana_cfg_rx_steer_req_v2".

This code was detected with the help of Coccinelle, and audited and
modified manually. The Coccinelle script used to detect this code pattern
is the following:

virtual report

@rule1@
type t1;
type t2;
identifier i0;
identifier i1;
identifier i2;
identifier ALLOC =~ "kmalloc|kzalloc|kmalloc_node|kzalloc_node|vmalloc|vzalloc|kvmalloc|kvzalloc";
position p1;
@@

i0 = sizeof(t1) + sizeof(t2) * i1;
...
i2 = ALLOC@p1(..., i0, ...);

@script:python depends on report@
p1 << rule1.p1;
@@

msg = "WARNING: verify allocation on line %s" % (p1[0].line)
coccilib.report.print_report(p1[0],msg)

Link: https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays [1]
Link: https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [2]

v1: https://lore.kernel.org/linux-hardening/AS8PR02MB7237974EF1B9BAFA618166C38B382@AS8PR02MB7237.eurprd02.prod.outlook.com/
v2: https://lore.kernel.org/linux-hardening/AS8PR02MB723729C5A63F24C312FC9CD18B3F2@AS8PR02MB7237.eurprd02.prod.outlook.com/
====================

Link: https://lore.kernel.org/r/AS8PR02MB72374BD1B23728F2E3C3B1A18B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11 07:38:28 -07:00
Zhu Yanjun
dfcdb38b21 RDMA/rxe: Return the correct errno
In the function __rxe_add_to_pool, the function xa_alloc_cyclic is
called. The return value of the function xa_alloc_cyclic is as below:
"
 Return: 0 if the allocation succeeded without wrapping.  1 if the
 allocation succeeded after wrapping, -ENOMEM if memory could not be
 allocated or -EBUSY if there are no free entries in @limit.
"
But now the function __rxe_add_to_pool only returns -EINVAL. All the
returned error value should be returned to the caller.

Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20240408142142.792413-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-11 14:47:26 +03:00
Konstantin Taranov
c8fc935f4b RDMA/mana_ib: remove useless return values from dbg prints
Remove printing ret value on success as it was always 0.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1712672465-29960-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-11 14:30:24 +03:00
Leon Romanovsky
e537deecda RDMA/mana_ib: Add flex array to struct mana_cfg_rx_steer_req_v2
The "struct mana_cfg_rx_steer_req_v2" uses a dynamically sized set of
trailing elements. Specifically, it uses a "mana_handle_t" array. So,
use the preferred way in the kernel declaring a flexible array [1].

At the same time, prepare for the coming implementation by GCC and Clang
of the __counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time via
CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for
strcpy/memcpy-family functions).

Also, avoid the open-coded arithmetic in the memory allocator functions
[2] using the "struct_size" macro.

Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.

Now, it is also possible to use the "flex_array_size" helper to compute
the size of these trailing elements in the "memcpy" function.

Specifically, the first commit adds the flex member and the patches 2
and 3 refactor the consumers of the "struct mana_cfg_rx_steer_req_v2".

This code was detected with the help of Coccinelle, and audited and
modified manually. The Coccinelle script used to detect this code
pattern is the following:

virtual report

@rule1@
type t1;
type t2;
identifier i0;
identifier i1;
identifier i2;
identifier ALLOC =~
"kmalloc|kzalloc|kmalloc_node|kzalloc_node|vmalloc|vzalloc|kvmalloc|kvzalloc";
position p1;
@@

i0 = sizeof(t1) + sizeof(t2) * i1;
...
i2 = ALLOC@p1(..., i0, ...);

@script:python depends on report@
p1 << rule1.p1;
@@

msg = "WARNING: verify allocation on line %s" % (p1[0].line)
coccilib.report.print_report(p1[0],msg)

[1] https://www.kernel.org/doc/html/next/process/deprecated.html#zero-length-and-one-element-arrays
[2] https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Link: https://lore.kernel.org/all/AS8PR02MB72374BD1B23728F2E3C3B1A18B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

* mana-ib-flex:
  net: mana: Avoid open coded arithmetic
  RDMA/mana_ib: Prefer struct_size over open coded arithmetic
  net: mana: Add flex array to struct mana_cfg_rx_steer_req_v2
2024-04-11 13:48:53 +03:00
Erick Archer
29b8e13a8b RDMA/mana_ib: Prefer struct_size over open coded arithmetic
This is an effort to get rid of all multiplications from allocation
functions in order to prevent integer overflows [1][2].

As the "req" variable is a pointer to "struct mana_cfg_rx_steer_req_v2"
and this structure ends in a flexible array:

struct mana_cfg_rx_steer_req_v2 {
	[...]
        mana_handle_t indir_tab[] __counted_by(num_indir_entries);
};

the preferred way in the kernel is to use the struct_size() helper to
do the arithmetic instead of the calculation "size + size * count" in
the kzalloc() function.

Moreover, use the "offsetof" helper to get the indirect table offset
instead of the "sizeof" operator and avoid the open-coded arithmetic in
pointers using the new flex member. This new structure member also allow
us to remove the "req_indir_tab" variable since it is no longer needed.

This way, the code is more readable and safer.

This code was detected with the help of Coccinelle, and audited and
modified manually.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
Link: https://github.com/KSPP/linux/issues/160 [2]
Signed-off-by: Erick Archer <erick.archer@outlook.com>
Link: https://lore.kernel.org/r/AS8PR02MB72375EB06EE1A84A67BE722E8B022@AS8PR02MB7237.eurprd02.prod.outlook.com
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-11 13:46:47 +03:00
Junxian Huang
ee20cc17e9 RDMA/hns: Support DSCP
Add support for DSCP configuration. For DSCP, get dscp-prio mapping
via hns3 nic driver api .get_dscp_prio() and fill the SL (in WQE for
UD or in QPC for RC) with the priority value. The prio-tc mapping is
configured to HW by hns3 nic driver. HW will select a corresponding
TC according to SL and the prio-tc mapping.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240315093551.1650088-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-09 10:27:41 +03:00
Guillaume Nault
ec20b28300 ipv4: Set scope explicitly in ip_route_output().
Add a "scope" parameter to ip_route_output() so that callers don't have
to override the tos parameter with the RTO_ONLINK flag if they want a
local scope.

This will allow converting flowi4_tos to dscp_t in the future, thus
allowing static analysers to flag invalid interactions between
"tos" (the DSCP bits) and ECN.

Only three users ask for local scope (bonding, arp and atm). The others
continue to use RT_SCOPE_UNIVERSE. While there, add a comment to warn
users about the limitations of ip_route_output().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com> # infiniband
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-08 13:20:51 +01:00
Or Har-Toov
2ca7e93bc9 RDMA/mlx5: Adding remote atomic access flag to updatable flags
Currently IB_ACCESS_REMOTE_ATOMIC is blocked from being updated via UMR
although in some cases it should be possible. These cases are checked in
mlx5r_umr_can_reconfig function.

Fixes: ef3642c4f5 ("RDMA/mlx5: Fix error unwinds for rereg_mr")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Link: https://lore.kernel.org/r/24dac73e2fa48cb806f33a932d97f3e402a5ea2c.1712140377.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:34:54 +03:00
Or Har-Toov
8c1185fef6 RDMA/mlx5: Change check for cacheable mkeys
umem can be NULL for user application mkeys in some cases. Therefore
umem can't be used for checking if the mkey is cacheable and it is
changed for checking a flag that indicates it. Also make sure that
all mkeys which are not returned to the cache will be destroyed.

Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Link: https://lore.kernel.org/r/2690bc5c6896bcb937f89af16a1ff0343a7ab3d0.1712140377.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:34:51 +03:00
Or Har-Toov
0611a8e8b4 RDMA/mlx5: Uncacheable mkey has neither rb_key or cache_ent
As some mkeys can't be modified with UMR due to some UMR limitations,
like the size of translation that can be updated, not all user mkeys can
be cached.

Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Link: https://lore.kernel.org/r/f2742dd934ed73b2d32c66afb8e91b823063880c.1712140377.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:34:47 +03:00
Michael Guralnik
be121ffb38 RDMA/mlx5: Fix port number for counter query in multi-port configuration
Set the correct port when querying PPCNT in multi-port configuration.
Distinguish between cases where switchdev mode was enabled to multi-port
configuration and don't overwrite the queried port to 1 in multi-port
case.

Fixes: 74b30b3ad5 ("RDMA/mlx5: Set local port to one when accessing counters")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/9bfcc8ade958b760a51408c3ad654a01b11f7d76.1712134988.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-08 13:33:10 +03:00
Konstantin Taranov
f10242b3da RDMA/mana_ib: Use struct mana_ib_queue for RAW QPs
Use struct mana_ib_queue and its helpers for RAW QPs

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1711483688-24358-5-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-02 11:30:23 +03:00
Konstantin Taranov
688bac28e3 RDMA/mana_ib: Use struct mana_ib_queue for WQs
Use struct mana_ib_queue and its helpers for WQs

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1711483688-24358-4-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-02 11:30:23 +03:00
Konstantin Taranov
60a7ac0b8b RDMA/mana_ib: Use struct mana_ib_queue for CQs
Use struct mana_ib_queue and its helpers for CQs

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1711483688-24358-3-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-02 11:30:23 +03:00
Konstantin Taranov
46f5be7cd4 RDMA/mana_ib: Introduce helpers to create and destroy mana queues
Intoduce helpers to work with mana ib queues (struct mana_ib_queue).
A queue always consists of umem, gdma_region, and id.
A queue can become a WQ or a CQ.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1711483688-24358-2-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-02 11:30:22 +03:00
Mark Zhang
b68e1acb58 RDMA/cm: Print the old state when cm_destroy_id gets timeout
The old state is helpful for debugging, as the current state is always
IB_CM_IDLE when timeout happens.

Fixes: 96d9cbe2f2 ("RDMA/cm: add timeout to cm_destroy_id wait")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/20240322112049.2022994-1-markzhang@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 15:16:36 +03:00
Wenchao Hao
ca537a3477 RDMA/restrack: Fix potential invalid address access
struct rdma_restrack_entry's kern_name was set to KBUILD_MODNAME
in ib_create_cq(), while if the module exited but forgot del this
rdma_restrack_entry, it would cause a invalid address access in
rdma_restrack_clean() when print the owner of this rdma_restrack_entry.

These code is used to help find one forgotten PD release in one of the
ULPs. But it is not needed anymore, so delete them.

Signed-off-by: Wenchao Hao <haowenchao2@huawei.com>
Link: https://lore.kernel.org/r/20240318092320.1215235-1-haowenchao2@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 14:49:27 +03:00
Boshi Yu
df0e16bab5 RDMA/erdma: Remove unnecessary __GFP_ZERO flag
The dma_alloc_coherent() interface automatically zero the memory returned.
Thus, we do not need to specify the __GFP_ZERO flag explicitly when we call
dma_alloc_coherent().

Reviewed-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Boshi Yu <boshiyu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240311113821.22482-4-boshiyu@alibaba-inc.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 14:46:01 +03:00
Boshi Yu
fdb09ed15f RDMA/erdma: Unify the names related to doorbell records
There exist two different names for the doorbell records: db_info and
db_record. We use dbrec for cpu address of the doorbell record and
dbrec_dma for dma address of the doorbell recordi uniformly.

Reviewed-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Boshi Yu <boshiyu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240311113821.22482-3-boshiyu@alibaba-inc.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 14:46:01 +03:00
Boshi Yu
f0697bf078 RDMA/erdma: Allocate doorbell records from dma pool
Currently, the 8 byte doorbell record is allocated along with the queue
buffer, which may result in waste of dma space when the queue buffer is
page aligned. To address this issue, we introduce a dma pool named
db_pool and allocate doorbell record from it.

Reviewed-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Boshi Yu <boshiyu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240311113821.22482-2-boshiyu@alibaba-inc.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 14:46:01 +03:00
Yanjun.Zhu
481047d7e8 RDMA/rxe: Fix the problem "mutex_destroy missing"
When a mutex lock is not used any more, the function mutex_destroy
should be called to mark the mutex lock uninitialized.

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Signed-off-by: Yanjun.Zhu <yanjun.zhu@linux.dev>
Link: https://lore.kernel.org/r/20240314065140.27468-1-yanjun.zhu@linux.dev
Reviewed-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-04-01 14:36:41 +03:00
Linus Torvalds
6207b37eb5 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
 "Very small update this cycle:

   - Minor code improvements in fi, rxe, ipoib, mana, cxgb4, mlx5,
     irdma, rxe, rtrs, mana

   - Simplify the hns hem mechanism

   - Fix EFA's MSI-X allocation in resource constrained configurations

   - Fix a KASN splat in srpt

   - Narrow hns's congestion control selection to QPs granularity and
     allow userspace to select it

   - Solve a parallel module loading race between the CM module and a
     driver module

   - Flexible array cleanup

   - Dump hns's SCC Conext to 'rdma res' for debugging

   - Make mana build page lists for HW objects that require a 0 offset
     correctly

   - Stuck CM ID debugging"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (29 commits)
  RDMA/cm: add timeout to cm_destroy_id wait
  RDMA/mana_ib: Use virtual address in dma regions for MRs
  RDMA/mana_ib: Fix bug in creation of dma regions
  RDMA/hns: Append SCC context to the raw dump of QPC
  RDMA/uverbs: Avoid -Wflex-array-member-not-at-end warnings
  RDMA/hns: Support userspace configuring congestion control algorithm with QP granularity
  RDMA/rtrs-clt: Check strnlen return len in sysfs mpath_policy_store()
  RDMA/uverbs: Remove flexible arrays from struct *_filter
  RDMA/device: Fix a race between mad_client and cm_client init
  RDMA/hns: Fix mis-modifying default congestion control algorithm
  RDMA/rxe: Remove unused 'iova' parameter from rxe_mr_init_user
  RDMA/srpt: Do not register event handler until srpt device is fully setup
  RDMA/irdma: Remove duplicate assignment
  RDMA/efa: Limit EQs to available MSI-X vectors
  RDMA/mlx5: Delete unused mlx5_ib_copy_pas prototype
  RDMA/cxgb4: Delete unused c4iw_ep_redirect prototype
  RDMA/mana_ib: Introduce mana_ib_install_cq_cb helper function
  RDMA/mana_ib: Introduce mana_ib_get_netdev helper function
  RDMA/mana_ib: Introduce mdev_to_gc helper function
  RDMA/hns: Simplify 'struct hns_roce_hem' allocation
  ...
2024-03-18 15:34:03 -07:00
Manjunath Patil
96d9cbe2f2 RDMA/cm: add timeout to cm_destroy_id wait
Add timeout to cm_destroy_id, so that userspace can trigger any data
collection that would help in analyzing the cause of delay in destroying
the cm_id.

New noinline function helps dtrace/ebpf programs to hook on to it.
Existing functionality isn't changed except triggering a probe-able new
function at every timeout interval.

We have seen cases where CM messages stuck with MAD layer (either due to
software bug or faulty HCA), leading to cm_id getting stuck in the
following call stack. This patch helps in resolving such issues faster.

kernel: ... INFO: task XXXX:56778 blocked for more than 120 seconds.
...
	Call Trace:
	__schedule+0x2bc/0x895
	schedule+0x36/0x7c
	schedule_timeout+0x1f6/0x31f
 	? __slab_free+0x19c/0x2ba
	wait_for_completion+0x12b/0x18a
	? wake_up_q+0x80/0x73
	cm_destroy_id+0x345/0x610 [ib_cm]
	ib_destroy_cm_id+0x10/0x20 [ib_cm]
	rdma_destroy_id+0xa8/0x300 [rdma_cm]
	ucma_destroy_id+0x13e/0x190 [rdma_ucm]
	ucma_write+0xe0/0x160 [rdma_ucm]
	__vfs_write+0x3a/0x16d
	vfs_write+0xb2/0x1a1
	? syscall_trace_enter+0x1ce/0x2b8
	SyS_write+0x5c/0xd3
	do_syscall_64+0x79/0x1b9
	entry_SYSCALL_64_after_hwframe+0x16d/0x0

Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Link: https://lore.kernel.org/r/20240309063323.458102-1-manjunath.b.patil@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-03-10 13:17:54 +02:00
Konstantin Taranov
2d5c008157 RDMA/mana_ib: Use virtual address in dma regions for MRs
Introduce mana_ib_create_dma_region() to create dma regions with iova
for MRs. It allows creating MRs with any page offset. Previously,
only page-aligned addresses worked.

For dma regions that must have a zero dma offset (e.g., for queues),
mana_ib_create_zero_offset_dma_region() is added.
To get the zero offset, ib_umem_find_best_pgoff() is used with zero
pgoff_bitmask.

Fixes: 0266a17763 ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1709560361-26393-3-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-03-07 11:32:33 +02:00
Konstantin Taranov
e02497fb65 RDMA/mana_ib: Fix bug in creation of dma regions
Use ib_umem_dma_offset() helper to calculate correct dma offset.

Fixes: 0266a17763 ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1709560361-26393-2-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-03-07 11:32:33 +02:00
wenglianfa
124a9fbe43 RDMA/hns: Append SCC context to the raw dump of QPC
SCCC (SCC Context) is a context with QP granularity that contains
information about congestion control. Dump SCCC and QPC together
to improve troubleshooting.

When dumping raw QPC with rdmatool, there will be a total of 576 bytes
data output, where the first 512 bytes is QPC and the last 64 bytes is
SCCC. When congestion control is disabled, the 64 byte SCCC will be all 0.

Example:
$rdma res show qp -jpr
[ {
        "ifindex": 0,
        "ifname": "hns_0",
	"data": [ 67,0,0,0... 512bytes
		  4,0,2... 64bytes]
  },...
} ]

Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240305055257.823513-1-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-03-07 11:26:10 +02:00