commit 5fc01fb846 upstream.
If cma_acquire_dev_by_src_ip() returns error in addr_handler(), the
device state changes back to RDMA_CM_ADDR_BOUND but the resolved source
IP address is still left. After that, if rdma_destroy_id() is called
after rdma_listen(), the device is freed without removed from
listen_any_list in cma_cancel_operation(). Revert to the previous IP
address if acquiring device fails.
Reported-by: syzbot+f3ce716af730c8f96637@syzkaller.appspotmail.com
Signed-off-by: Myungho Jung <mhjungk@gmail.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bc5add0976 upstream.
When disabling and removing a receive context, it is possible for an
asynchronous event (i.e IRQ) to occur. Because of this, there is a race
between cleaning up the context, and the context being used by the
asynchronous event.
cpu 0 (context cleanup)
rc->ref_count-- (ref_count == 0)
hfi1_rcd_free()
cpu 1 (IRQ (with rcd index))
rcd_get_by_index()
lock
ref_count+++ <-- reference count race (WARNING)
return rcd
unlock
cpu 0
hfi1_free_ctxtdata() <-- incorrect free location
lock
remove rcd from array
unlock
free rcd
This race will cause the following WARNING trace:
WARNING: CPU: 0 PID: 175027 at include/linux/kref.h:52 hfi1_rcd_get_by_index+0x84/0xa0 [hfi1]
CPU: 0 PID: 175027 Comm: IMB-MPI1 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.el7.x86_64 #1
Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.11.01.0076.C4.111920150602 11/19/2015
Call Trace:
dump_stack+0x19/0x1b
__warn+0xd8/0x100
warn_slowpath_null+0x1d/0x20
hfi1_rcd_get_by_index+0x84/0xa0 [hfi1]
is_rcv_urgent_int+0x24/0x90 [hfi1]
general_interrupt+0x1b6/0x210 [hfi1]
__handle_irq_event_percpu+0x44/0x1c0
handle_irq_event_percpu+0x32/0x80
handle_irq_event+0x3c/0x60
handle_edge_irq+0x7f/0x150
handle_irq+0xe4/0x1a0
do_IRQ+0x4d/0xf0
common_interrupt+0x162/0x162
The race can also lead to a use after free which could be similar to:
general protection fault: 0000 1 SMP
CPU: 71 PID: 177147 Comm: IMB-MPI1 Kdump: loaded Tainted: G W OE ------------ 3.10.0-957.el7.x86_64 #1
Hardware name: Intel Corporation S2600KP/S2600KP, BIOS SE5C610.86B.11.01.0076.C4.111920150602 11/19/2015
task: ffff9962a8098000 ti: ffff99717a508000 task.ti: ffff99717a508000 __kmalloc+0x94/0x230
Call Trace:
? hfi1_user_sdma_process_request+0x9c8/0x1250 [hfi1]
hfi1_user_sdma_process_request+0x9c8/0x1250 [hfi1]
hfi1_aio_write+0xba/0x110 [hfi1]
do_sync_readv_writev+0x7b/0xd0
do_readv_writev+0xce/0x260
? handle_mm_fault+0x39d/0x9b0
? pick_next_task_fair+0x5f/0x1b0
? sched_clock_cpu+0x85/0xc0
? __schedule+0x13a/0x890
vfs_writev+0x35/0x60
SyS_writev+0x7f/0x110
system_call_fastpath+0x22/0x27
Use the appropriate kref API to verify access.
Reorder context cleanup to ensure context removal before cleanup occurs
correctly.
Cc: stable@vger.kernel.org # v4.14.0+
Fixes: f683c80ca6 ("IB/hfi1: Resolve kernel panics by reference counting receive contexts")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6ab4aba00f ]
The following BUG was reported by kasan:
BUG: KASAN: use-after-free in ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
Read of size 80 at addr ffff88034c30bcd0 by task kworker/u16:1/24020
Workqueue: ipoib_wq ipoib_cm_tx_start [ib_ipoib]
Call Trace:
dump_stack+0x9a/0xeb
print_address_description+0xe3/0x2e0
kasan_report+0x18a/0x2e0
? ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
memcpy+0x1f/0x50
ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib]
? kvm_clock_read+0x1f/0x30
? ipoib_cm_skb_reap+0x610/0x610 [ib_ipoib]
? __lock_is_held+0xc2/0x170
? process_one_work+0x880/0x1960
? process_one_work+0x912/0x1960
process_one_work+0x912/0x1960
? wq_pool_ids_show+0x310/0x310
? lock_acquire+0x145/0x440
worker_thread+0x87/0xbb0
? process_one_work+0x1960/0x1960
kthread+0x314/0x3d0
? kthread_create_worker_on_cpu+0xc0/0xc0
ret_from_fork+0x3a/0x50
Allocated by task 0:
kasan_kmalloc+0xa0/0xd0
kmem_cache_alloc_trace+0x168/0x3e0
path_rec_create+0xa2/0x1f0 [ib_ipoib]
ipoib_start_xmit+0xa98/0x19e0 [ib_ipoib]
dev_hard_start_xmit+0x159/0x8d0
sch_direct_xmit+0x226/0xb40
__dev_queue_xmit+0x1d63/0x2950
neigh_update+0x889/0x1770
arp_process+0xc47/0x21f0
arp_rcv+0x462/0x760
__netif_receive_skb_core+0x1546/0x2da0
netif_receive_skb_internal+0xf2/0x590
napi_gro_receive+0x28e/0x390
ipoib_ib_handle_rx_wc_rss+0x873/0x1b60 [ib_ipoib]
ipoib_rx_poll_rss+0x17d/0x320 [ib_ipoib]
net_rx_action+0x427/0xe30
__do_softirq+0x28e/0xc42
Freed by task 26680:
__kasan_slab_free+0x11d/0x160
kfree+0xf5/0x360
ipoib_flush_paths+0x532/0x9d0 [ib_ipoib]
ipoib_set_mode_rss+0x1ad/0x560 [ib_ipoib]
set_mode+0xc8/0x150 [ib_ipoib]
kernfs_fop_write+0x279/0x440
__vfs_write+0xd8/0x5c0
vfs_write+0x15e/0x470
ksys_write+0xb8/0x180
do_syscall_64+0x9b/0x420
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The buggy address belongs to the object at ffff88034c30bcc8
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 8 bytes inside of
512-byte region [ffff88034c30bcc8, ffff88034c30bec8)
The buggy address belongs to the page:
The following race between change mode and xmit flow is the reason for
this use-after-free:
Change mode Send packet 1 to GID XX Send packet 2 to GID XX
| | |
start | |
| | |
| | |
| Create new path for GID XX |
| and update neigh path |
| | |
| | |
| | |
flush_paths | |
| |
queue_work(cm.start_task) |
| Path for GID XX not found
| create new path
|
|
start_task runs with old
released path
There is no locking to protect the lifetime of the path through the
ipoib_cm_tx struct, so delete it entirely and always use the newly looked
up path under the priv->lock.
Fixes: 546481c281 ("IB/ipoib: Fix memory corruption in ipoib cm mode connect flow")
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 904bba211a ]
The work completion length for a receiving a UD send with immediate is
short by 4 bytes causing application using this opcode to fail.
The UD receive logic incorrectly subtracts 4 bytes for immediate
value. These bytes are already included in header length and are used to
calculate header/payload split, so the result is these 4 bytes are
subtracted twice, once when the header length subtracted from the overall
length and once again in the UD opcode specific path.
Remove the extra subtraction when handling the opcode.
Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Brian Welty <brian.welty@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 48396e80fb upstream.
Since .scsi_done() must only be called after scsi_queue_rq() has
finished, make sure that the SRP initiator driver does not call
.scsi_done() while scsi_queue_rq() is in progress. Although
invoking sg_reset -d while I/O is in progress works fine with kernel
v4.20 and before, that is not the case with kernel v5.0-rc1. This
patch avoids that the following crash is triggered with kernel
v5.0-rc1:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000138
CPU: 0 PID: 360 Comm: kworker/0:1H Tainted: G B 5.0.0-rc1-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:blk_mq_dispatch_rq_list+0x116/0xb10
Call Trace:
blk_mq_sched_dispatch_requests+0x2f7/0x300
__blk_mq_run_hw_queue+0xd6/0x180
blk_mq_run_work_fn+0x27/0x30
process_one_work+0x4f1/0xa20
worker_thread+0x67/0x5b0
kthread+0x1cf/0x1f0
ret_from_fork+0x24/0x30
Cc: <stable@vger.kernel.org>
Fixes: 94a9174c63 ("IB/srp: reduce lock coverage of command completion")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 09ce351dff upstream.
Fix potential memory corruption and panic in loopback for IB_WR_SEND
variants.
The code blindly assumes the posted length will fit in the fetched rwqe,
which is not a valid assumption.
Fix by adding a limit test, and triggering the appropriate send completion
and putting the QP in an error state. This mimics the handling for
non-loopback QPs.
Fixes: 1570346153 ("IB/{hfi1, qib, rdmavt}: Move ruc_loopback to rdmavt")
Cc: <stable@vger.kernel.org> #v4.20+
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
[ Upstream commit ca95f802ef ]
Currently, When a reserved operation is completed, its entry in the send
queue will not be unreserved, which leads to the miscalculation of
qp->s_avail and thus the triggering of a WARN_ON call trace. This patch
fixes the problem by unreserving the reserved operation when it is
completed.
Fixes: 856cc4c237 ("IB/hfi1: Add the capability for reserved operations")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7709b0dc26 upstream.
Applications that use the stack for execution purposes cause userspace PSM
jobs to fail during mmap().
Both Fortran (non-standard format parsing) and C (callback functions
located in the stack) applications can be written such that stack
execution is required. The linker notes this via the gnu_stack ELF flag.
This causes READ_IMPLIES_EXEC to be set which forces all PROT_READ mmaps
to have PROT_EXEC for the process.
Checking for VM_EXEC bit and failing the request with EPERM is overly
conservative and will break any PSM application using executable stacks.
Cc: <stable@vger.kernel.org> #v4.14+
Fixes: 1222026764 ("IB/hfi: Protect against writable mmap")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8036e90f92 ]
Acquiring the rtnl lock while holding usdev_lock could result in a
deadlock.
For example:
usnic_ib_query_port()
| mutex_lock(&us_ibdev->usdev_lock)
| ib_get_eth_speed()
| rtnl_lock()
rtnl_lock()
| usnic_ib_netdevice_event()
| mutex_lock(&us_ibdev->usdev_lock)
This commit moves the usdev_lock acquisition after the rtnl lock has been
released.
This is safe to do because usdev_lock is not protecting anything being
accessed in ib_get_eth_speed(). Hence, the correct order of holding locks
(rtnl -> usdev_lock) is not violated.
Signed-off-by: Parvi Kaustubhi <pkaustub@cisco.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b024dd0eba ]
FRWR memory registration is done with a series of calls and WRs.
1. ULP invokes ib_dma_map_sg()
2. ULP invokes ib_map_mr_sg()
3. ULP posts an IB_WR_REG_MR on the Send queue
Step 2 generates an iova. It is permissible for ULPs to change this
iova (with certain restrictions) between steps 2 and 3.
rxe_map_mr_sg captures the MR's iova but later when rxe processes the
REG_MR WR, it ignores the MR's iova field. If a ULP alters the MR's iova
after step 2 but before step 3, rxe never captures that change.
When the remote sends an RDMA Read targeting that MR, rxe looks up the
R_key, but the altered iova does not match the iova stored in the MR,
causing the RDMA Read request to fail.
Reported-by: Anna Schumaker <schumaker.anna@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit a9666c1cae upstream.
Unsafe global rkey is considered dangerous because it exposes memory
registered for all memory in the system. Only users with a QP on the same
PD can use the rkey, and generally those QPs will already know the
value. However, out of caution, do not expose the value to unprivleged
users on the local system. Require CAP_NET_ADMIN instead.
Cc: <stable@vger.kernel.org> # 4.16
Fixes: 29cf1351d4 ("RDMA/nldev: provide detailed PD information")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 37fbd834b4 ]
When support for bonding of RoCE devices was added, there was
necessarily a link between the RoCE device and the paired netdevice that
was part of the bond. If you remove the mlx4_en module, that paired
association is broken (the RoCE device is still present but the paired
netdevice has been released). We need to account for this in
is_upper_ndev_bond_master_filter() and filter out those links with a
broken pairing or else we later oops in netdev_next_upper_dev_rcu().
Fixes: 408f1242d9 ("IB/core: Delete lower netdevice default GID entries in bonding scenario")
Signed-off-by: Mark Zhang <markz@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 28a9a9e83c upstream
Packet queue state is over used to determine SDMA descriptor
availablitity and packet queue request state.
cpu 0 ret = user_sdma_send_pkts(req, pcount);
cpu 0 if (atomic_read(&pq->n_reqs))
cpu 1 IRQ user_sdma_txreq_cb calls pq_update() (state to _INACTIVE)
cpu 0 xchg(&pq->state, SDMA_PKT_Q_ACTIVE);
At this point pq->n_reqs == 0 and pq->state is incorrectly
SDMA_PKT_Q_ACTIVE. The close path will hang waiting for the state
to return to _INACTIVE.
This can also change the state from _DEFERRED to _ACTIVE. However,
this is a mostly benign race.
Remove the racy code path.
Use n_reqs to determine if a packet queue is active or not.
Cc: <stable@vger.kernel.org> # 4.19.x
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 75b7b86bdb ]
Memory windows are implemented with an indirect MKey, when a page fault
event comes for a MW Mkey we need to find the MR at the end of the list of
the indirect MKeys by iterating on all items from the first to the last.
The offset calculated during this process has to be zeroed after the first
iteration or the next iteration will start from a wrong address, resulting
incorrect ODP faulting behavior.
Fixes: db570d7dea ("IB/mlx5: Add ODP support to MW")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ca088320a0 ]
Current hns driver assigned the first two PBL page addresses from previous
registered MR to the hardware when reregister MR changing the memory
locations occurred. This will lead to PBL addressing error as the PBL has
already been released. This patch fixes this wrong assignment by using the
page address from new allocated PBL.
Fixes: a2c80b7b41 ("RDMA/hns: Add rereg mr support for hip08")
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4f32fb921b ]
rdmavt uses a crazy system that looses the type checking when assinging
functions to struct ib_device function pointers. Because of this the
signature to this function was not changed when the below commit revised
things.
Fix the signature so we are not calling a function pointer with a
mismatched signature.
Fixes: 477864c8fc ("IB/core: Let create_ah return extended response to user")
Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d52ef88a9f ]
Currently when MAC address is changed, regardless of the netdev reg_state,
GID entries are removed and added to reflect the new MAC address and new
default GID entries.
When a bonding device is used and the underlying PCI device is removed
several netdevice events are generated. Two events of the interest are
CHANGEADDR and UNREGISTER event on lower(slave) netdevice of the bond
netdevice.
Sometimes CHANGEADDR event is generated when netdev state is
UNREGISTERING (after UNREGISTER event is generated). In this scenario, GID
entries for default GIDs are added and never deleted because GID entries
are deleted only when netdev state is < UNREGISTERED.
This leads to non zero reference count on the netdevice. Due to this, PCI
device unbind operation is getting stuck.
To avoid it, when changing mac address, add GID entries only if netdev is
in REGISTERED state.
Fixes: 03db3a2d81 ("IB/core: Add RoCE GID table management")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 074fca3a18 ]
Currently, for IB_WR_LOCAL_INV WR, when the next fence is None, the
current fence will be SMALL instead of Normal Fence.
Without this patch krping doesn't work on CX-5 devices and throws
following error:
The error messages are from CX5 driver are: (from server side)
[ 710.434014] mlx5_0:dump_cqe:278:(pid 2712): dump error cqe
[ 710.434016] 00000000 00000000 00000000 00000000
[ 710.434016] 00000000 00000000 00000000 00000000
[ 710.434017] 00000000 00000000 00000000 00000000
[ 710.434018] 00000000 93003204 100000b8 000524d2
[ 710.434019] krping: cq completion failed with wr_id 0 status 4 opcode 128 vender_err 32
Fixed the logic to set the correct fence type.
Fixes: 6e8484c5cf ("RDMA/mlx5: set UMR wqe fence according to HCA cap")
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit db7a691a15 upstream.
If the firmware reports a connection width that is not 1x, 4x, 8x or 12x
it causes the driver to fail during initialization.
To prevent this failure every time a new width is introduced to the RDMA
stack, we will set a default 4x width for these widths which ar unknown to
the driver.
This is needed to allow to run old kernels with new firmware.
Cc: <stable@vger.kernel.org> # 4.1
Fixes: 1b5daf11b0 ("IB/mlx5: Avoid using the MAD_IFC command under ISSI > 0 mode")
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a0e0cb8280 upstream.
pq_update() can only be called in two places: from the completion
function when the complete (npkts) sequence of packets has been
submitted and processed, or from setup function if a subset of the
packets were submitted (i.e. the error path).
Currently both paths can call pq_update() if an error occurrs. This
race will cause the n_req value to go negative, hanging file_close(),
or cause a crash by freeing the txlist more than once.
Several variables are used to determine SDMA send state. Most of
these are unnecessary, and have code inspectible races between the
setup function and the completion function, in both the send path and
the error path.
The request 'status' value can be set by the setup or by the
completion function. This is code inspectibly racy. Since the status
is not needed in the completion code or by the caller it has been
removed.
The request 'done' value races between usage by the setup and the
completion function. The completion function does not need this.
When the number of processed packets matches npkts, it is done.
The 'has_error' value races between usage of the setup and the
completion function. This can cause incorrect error handling and leave
the n_req in an incorrect value (i.e. negative).
Simplify the code by removing all of the unneeded state checks and
variables.
Clean up iovs node when it is freed.
Eliminate race conditions in the error path:
If all packets are submitted, the completion handler will set the
completion status correctly (ok or aborted).
If all packets are not submitted, the caller must wait until the
submitted packets have completed, and then set the completion status.
These two change eliminate the race condition in the error path.
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b97db58557 ]
Don't reset the resp opcode for a replayed read response.
The resp opcode could be in the middle of a write or send
sequence, when the duplicate read request was received.
An example sequence is as follows:
- Receive read request for 12KB PSN 20. Transmit read response
first, middle and last with PSNs 20,21,22.
- Receive write first PSN 23.
At this point the resp psn is 24 and resp opcode is write first.
- The sender notices that PSN 20 is dropped and retransmits.
Receive read request for 12KB PSN 20. Transmit read response
first, middle and last with PSNs 20,21,22. The resp opcode is
set to -1, the resp psn remains 24.
- Receive write first PSN 23. This is processed by duplicate_request().
The resp opcode remains -1 and resp psn remains 24.
- Receive write middle PSN 24. check_op_seq() reports a missing
first error since the resp opcode is -1.
When sending an ack for a duplicate send or write request,
use the psn of the previous ack sent. Do not use the psn
of a read response for the ack.
An example sequence is as follows:
- Receive write PSN 30. Transmit ACK for PSN 30.
- Receive read request 4KB PSN 31. Transmit read response with
PSN 31. The resp psn is now 32.
- The sender notices that PSN 30 is dropped and retransmits.
Receive write PSN 30. duplicate_request() sends an ACK with
PSN 31. That is incorrect since PSN 31 was a read request.
Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9b8b2a3230 ]
Some InfiniBand network devices have multiple ports on the same PCI
function. This initializes the `dev_port' sysfs field of those
network interfaces with their port number.
Prior to this the kernel erroneously used the `dev_id' sysfs
field of those network interfaces to convey the port number to userspace.
The use of `dev_id' was considered correct until Linux 3.15,
when another field, `dev_port', was defined for this particular
purpose and `dev_id' was reserved for distinguishing stacked ifaces
(e.g: VLANs) with the same hardware address as their parent device.
Similar fixes to net/mlx4_en and many other drivers, which started
exporting this information through `dev_id' before 3.15, were accepted
into the kernel 4 years ago.
See 76a066f2a2 (`net/mlx4_en: Expose port number through sysfs').
Signed-off-by: Arseny Maslennikov <ar@cs.msu.ru>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ed51efd2ce ]
In the failure path, nq->bar_reg_iomem gets accessed without
initializing. Avoid this by calling the bnxt_qplib_nq_stop_irq only if the
initialization is complete.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 1ac5a40479 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Fixes: 6e04b10356 ("RDMA/bnxt_re: Fix broken RoCE driver due to recent L2 driver changes")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0f6ef65d1c ]
If the provider driver (such as rdma_rxe) doesn't support pma counters,
avoid exposing its directory similar to optional hw_counters directory.
If core fails to read the PMA counter, return an error so that user can
retry later if needed.
Fixes: 35c4cbb178 ("IB/core: Create get_perf_mad function in sysfs.c")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Doug writes:
"Really final for-rc pull request for 4.19
Ok, so last week I thought we had sent our final pull request for
4.19. Well, wouldn't ya know someone went and found a couple Spectre
v1 fixes were needed :-/. So, a couple *very* small specter patches
for this (hopefully) final -rc week."
* tag 'for-gkh' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/ucma: Fix Spectre v1 vulnerability
IB/ucm: Fix Spectre v1 vulnerability
hdr.cmd can be indirectly controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
drivers/infiniband/core/ucma.c:1686 ucma_write() warn: potential
spectre issue 'ucma_cmd_table' [r] (local cap)
Fix this by sanitizing hdr.cmd before using it to index
ucm_cmd_table.
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
hdr.cmd can be indirectly controlled by user-space, hence leading to
a potential exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
drivers/infiniband/core/ucm.c:1127 ib_ucm_write() warn: potential
spectre issue 'ucm_cmd_table' [r] (local cap)
Fix this by sanitizing hdr.cmd before using it to index
ucm_cmd_table.
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Doug writes:
"RDMA fixes:
Final for-rc pull request for 4.19
We only have one bug to submit this time around. It fixes a DMA
unmap issue where we unmapped the DMA address from the IOMMU before
we did from the card, resulting in a DMAR error with IOMMU enabled,
or possible crash without."
* tag 'for-gkh' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/mlx5: Unmap DMA addr from HCA before IOMMU
Jason writes:
"Second RDMA rc pull request
- Fix a long standing race bug when destroying comp_event file descriptors
- srp, hfi1, bnxt_re: Various driver crashes from missing validation
and other cases
- Fixes for regressions in patches merged this window in the gid
cache, devx, ucma and uapi."
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/core: Set right entry state before releasing reference
IB/mlx5: Destroy the DEVX object upon error flow
IB/uverbs: Free uapi on destroy
RDMA/bnxt_re: Fix system crash during RDMA resource initialization
IB/hfi1: Fix destroy_qp hang after a link down
IB/hfi1: Fix context recovery when PBC has an UnsupportedVL
IB/hfi1: Invalid user input can result in crash
IB/hfi1: Fix SL array bounds check
RDMA/uverbs: Fix validity check for modify QP
IB/srp: Avoid that sg_reset -d ${srp_device} triggers an infinite loop
ucma: fix a use-after-free in ucma_resolve_ip()
RDMA/uverbs: Atomically flush and mark closed the comp event queue
cxgb4: fix abort_req_rss6 struct
Currently add_modify_gid() for IB link layer has followong issue
in cache update path.
When GID update event occurs, core releases reference to the GID
table without updating its state and/or entry pointer.
CPU-0 CPU-1
------ -----
ib_cache_update() IPoIB ULP
add_modify_gid() [..]
put_gid_entry()
refcnt = 0, but
state = valid,
entry is valid.
(work item is not yet executed).
ipoib_create_ah()
rdma_create_ah()
rdma_get_gid_attr() <--
Tries to acquire gid_attr
which has refcnt = 0.
This is incorrect.
GID entry state and entry pointer is provides the accurate GID enty
state. Such fields must be updated with rwlock to protect against
readers and, such fields must be in sane state before refcount can drop
to zero. Otherwise above race condition can happen leading to
use-after-free situation.
Following backtrace has been observed when cache update for an IB port
is triggered while IPoIB ULP is creating an AH.
Therefore, when updating GID entry, first mark a valid entry as invalid
through state and set the barrier so that no callers can acquired
the GID entry, followed by release reference to it.
refcount_t: increment on 0; use-after-free.
WARNING: CPU: 4 PID: 29106 at lib/refcount.c:153 refcount_inc_checked+0x30/0x50
Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
RIP: 0010:refcount_inc_checked+0x30/0x50
RSP: 0018:ffff8802ad36f600 EFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000008 RDI: ffffffff86710100
RBP: ffff8802d6e60a30 R08: ffffed005d67bf8b R09: ffffed005d67bf8b
R10: 0000000000000001 R11: ffffed005d67bf8a R12: ffff88027620cee8
R13: ffff8802d6e60988 R14: ffff8802d6e60a78 R15: 0000000000000202
FS: 0000000000000000(0000) GS:ffff8802eb200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3ab35e5c88 CR3: 00000002ce84a000 CR4: 00000000000006e0
IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
Call Trace:
rdma_get_gid_attr+0x220/0x310 [ib_core]
? lock_acquire+0x145/0x3a0
rdma_fill_sgid_attr+0x32c/0x470 [ib_core]
rdma_create_ah+0x89/0x160 [ib_core]
? rdma_fill_sgid_attr+0x470/0x470 [ib_core]
? ipoib_create_ah+0x52/0x260 [ib_ipoib]
ipoib_create_ah+0xf5/0x260 [ib_ipoib]
ipoib_mcast_join_complete+0xbbe/0x2540 [ib_ipoib]
Fixes: b150c3862d ("IB/core: Introduce GID entry reference counts")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Upon DEVX object creation the object must be destroyed upon a follows
error flow.
Fixes: 7efce3691d ("IB/mlx5: Add obj create and destroy functionality")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make sure we free struct uverbs_api once we clean the radix tree. It was
allocated by uverbs_alloc_api().
Fixes: 9ed3e5f447 ("IB/uverbs: Build the specs into a radix tree at runtime")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rvt_destroy_qp() cannot complete until all in process packets have
been released from the underlying hardware. If a link down event
occurs, an application can hang with a kernel stack similar to:
cat /proc/<app PID>/stack
quiesce_qp+0x178/0x250 [hfi1]
rvt_reset_qp+0x23d/0x400 [rdmavt]
rvt_destroy_qp+0x69/0x210 [rdmavt]
ib_destroy_qp+0xba/0x1c0 [ib_core]
nvme_rdma_destroy_queue_ib+0x46/0x80 [nvme_rdma]
nvme_rdma_free_queue+0x3c/0xd0 [nvme_rdma]
nvme_rdma_destroy_io_queues+0x88/0xd0 [nvme_rdma]
nvme_rdma_error_recovery_work+0x52/0xf0 [nvme_rdma]
process_one_work+0x17a/0x440
worker_thread+0x126/0x3c0
kthread+0xcf/0xe0
ret_from_fork+0x58/0x90
0xffffffffffffffff
quiesce_qp() waits until all outstanding packets have been freed.
This wait should be momentary. During a link down event, the cleanup
handling does not ensure that all packets caught by the link down are
flushed properly.
This is caused by the fact that the freeze path and the link down
event is handled the same. This is not correct. The freeze path
waits until the HFI is unfrozen and then restarts PIO. A link down
is not a freeze event. The link down path cannot restart the PIO
until link is restored. If the PIO path is restarted before the link
comes up, the application (QP) using the PIO path will hang (until
link is restored).
Fix by separating the linkdown path from the freeze path and use the
link down path for link down events.
Close a race condition sc_disable() by acquiring both the progress
and release locks.
Close a race condition in sc_stop() by moving the setting of the flag
bits under the alloc lock.
Cc: <stable@vger.kernel.org> # 4.9.x+
Fixes: 7724105686 ("IB/hfi1: add driver files")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>