Instead of (struct rt6_info *)dst casts, we can use:
#define dst_rt6_info(_ptr) \
container_of_const(_ptr, struct rt6_info, dst)
Some places needed missing const qualifiers:
ip6_confirm_neigh(), ipv6_anycast_destination(),
ipv6_unicast_destination(), has_gateway()
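For illustration, the replacement looks like this (a hedged before/after;
the surrounding variable names are examples, not lifted from the patch):

struct rt6_info *rt6;

/* Before: open-coded cast, which also silently drops constness. */
rt6 = (struct rt6_info *)dst;

/* After: container_of_const() based helper. */
rt6 = dst_rt6_info(dst);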
v2: added missing parts (David Ahern)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum ASICs only support a single interrupt, which means that all the
events are handled by one IRQ (interrupt request) handler. Once an
interrupt is received, we schedule a tasklet to handle events from the EQ
and then schedule tasklets to handle completions from the CQs. A tasklet
runs in softIRQ (software IRQ) context, on the same CPU that scheduled it.
That means that today we use only one CPU to handle all the packets (both
network packets and EMADs) from the hardware.
This can be improved using NAPI. The idea is to use a NAPI instance per
CQ, which is mapped 1:1 to a DQ (RDQ or SDQ). The NAPI poll method can run
in a kernel thread, so the driver will be able to handle WQEs on several
CPUs. Convert the existing code to use the NAPI APIs.
Add a NAPI instance as part of 'struct mlxsw_pci_queue' and initialize it
as part of CQ initialization. Set the appropriate poll method and dummy
net device according to the queue number, similar to the tasklet setup.
For CQs which are used for completions of an RDQ, use the Rx poll method
and 'napi_dev_rx', which is set as 'threaded'. This means that the Rx poll
method will run in kernel thread context, so several RDQs will be handled
in parallel. For CQs which are used for completions of an SDQ, use the Tx
poll method and 'napi_dev_tx'; this method will run in softIRQ context, as
recommended in the NAPI documentation, since Tx packet processing is a
short task.
Convert mlxsw_pci_cq_{rx,tx}_tasklet() to poll methods. Handle the
'budget' argument - ignore it in the Tx poll method, as it is recommended
not to limit Tx processing. For Rx processing, handle up to 'budget'
completions. Return 'work_done', which is the number of completions that
were handled.
Handle the following cases:
1. After processing 'budget' completions, the driver still has work to do:
Return work-done = budget. In that case, the NAPI instance will be
polled again (without the need to be rescheduled). Do not re-arm the
queue, as NAPI will handle the reschedule, so we do not have to involve
the hardware to send an additional interrupt for the completions that
should be processed.
2. Event processing has been completed:
Call napi_complete_done() to mark NAPI processing as completed, which
means that the poll method will not be rescheduled. Re-arm the queue,
as all completions were handled.
In case the poll method handled exactly 'budget' completions, return
work-done = budget - 1 to distinguish this from the case where the driver
still has completions to handle. Otherwise, return the number of
completions that were handled (see the sketch below).
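A minimal sketch of the Rx poll flow described above, assuming
hypothetical helpers mlxsw_pci_cq_rx_process(), mlxsw_pci_cq_has_work()
and mlxsw_pci_cq_rx_arm(), and an assumed 'u.cq.napi' field path (these
are not the exact driver symbols):

static int mlxsw_pci_napi_poll_cq_rx(struct napi_struct *napi, int budget)
{
	struct mlxsw_pci_queue *q = container_of(napi, struct mlxsw_pci_queue,
						 u.cq.napi);
	int work_done;

	/* Handle up to 'budget' Rx completions. */
	work_done = mlxsw_pci_cq_rx_process(q, budget);

	/* Case 1: the whole budget was used and completions may still be
	 * pending. Return 'budget' so NAPI polls again; do not re-arm the
	 * queue.
	 */
	if (work_done == budget && mlxsw_pci_cq_has_work(q))
		return budget;

	/* Case 2: event processing has been completed. Cap the return value
	 * at budget - 1 to distinguish "handled exactly 'budget'" from
	 * case 1, then complete NAPI and re-arm the queue.
	 */
	work_done = min(work_done, budget - 1);
	if (napi_complete_done(napi, work_done))
		mlxsw_pci_cq_rx_arm(q);

	return work_done;
}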
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The next patch will set the driver to use NAPI for event processing. Then
the tasklet mechanism will be used only for the EQ. Reorganize
'mlxsw_pci_queue' to hold the EQ and CQ attributes in a union. For now,
add a tasklet for both the EQ and the CQs. This will be changed in the
next patch, as the CQ's 'tasklet_struct' will be replaced with a NAPI
instance.
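For orientation, the reorganized structure looks roughly like this (field
names are illustrative, not the exact driver definitions):

#include <linux/interrupt.h>

struct mlxsw_pci_queue {
	/* ... common queue fields ... */
	union {
		struct {
			struct tasklet_struct tasklet;
		} eq;
		struct {
			struct tasklet_struct tasklet;	/* replaced by a NAPI
							 * instance in the next
							 * patch
							 */
		} cq;
	} u;
};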
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw will use NAPI for event processing in an upcoming patch. As
preparation, add two dummy net devices and initialize them.
A NAPI instance should be attached to a net device. Usually each queue is
used by a single net device in network drivers, so the mapping between net
device and NAPI instance is intuitive. In our case, Rx queues are not per
port; they are per trap group. Tx queues are mapped to net devices, but we
do not have a separate queue for each local port; several ports share the
same queue.
Use init_dummy_netdev() to initialize the dummy net devices for NAPI.
To run the NAPI poll method in a kernel thread, the net device which the
NAPI instance is attached to should be marked as 'threaded'. It is
recommended to handle Tx packets in softIRQ context, as usually this is
a short task - just free the Tx packet which has been transmitted.
Rx packet handling is a more complicated task, so drivers can use a
dedicated kernel thread to process them. This allows processing packets
from different Rx queues in parallel. We would like to handle only Rx
packets in kernel threads, which means that we will use two dummy net
devices (one for Rx and one for Tx). Set only one of them as 'threaded',
as it will be used for Rx processing. Do not fail in case setting
'threaded' fails, as it is better to use regular softIRQ NAPI than to
prevent the driver from loading.
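A minimal sketch of this setup, assuming the two dummy net devices are
passed in from a driver-private structure (names here are illustrative):

#include <linux/netdevice.h>

static void mlxsw_pci_napi_devs_init(struct net_device *napi_dev_tx,
				     struct net_device *napi_dev_rx)
{
	init_dummy_netdev(napi_dev_tx);		/* Tx NAPI stays in softIRQ */
	init_dummy_netdev(napi_dev_rx);		/* Rx NAPI should be threaded */

	/* Best effort: if threading cannot be enabled, keep regular softIRQ
	 * NAPI instead of failing driver load.
	 */
	if (dev_set_threaded(napi_dev_rx, true))
		pr_info("mlxsw_pci: threaded NAPI not enabled, using softIRQ\n");
}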
Note that the net devices are initialized with init_dummy_netdev(), so
they are not registered, which means that they will not be visible to the
user. It will not be possible to change the 'threaded' configuration from
user space, but this is reasonable in our case, as no other configuration
makes sense, considering that the user has no influence on the usage of
each queue.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, for each CQE in a CQ, we ring the CQ doorbell, then handle the
RDQ and ring the RDQ doorbell. Finally we ring the CQ arm doorbell - once
per CQ tasklet.
The idea of ringing the CQ doorbell before the RDQ doorbell is to make
sure that when we post a new WQE (after the RDQ is handled), there is an
available CQE. This was done because of a hardware bug, as part of
commit c9ebea04cb ("mlxsw: pci: Ring CQ's doorbell before RDQ's").
There is no real reason to ring the RDQ and CQ doorbells for each
completion; it is better to handle several completions and reduce the
number of doorbell rings, as access to the hardware is expensive (time
wise) and might take time because of memory barriers.
A previous patch changed the CQ tasklet to handle up to 64 Rx packets.
With this limitation, we can ring the CQ and RDQ doorbells once per CQ
tasklet. The doorbell counters are increased by the number of packets that
we handled, so the device will know for which completion to send an
additional event.
To avoid reordering the CQ and RDQ doorbell rings, let the tasklet also
ring the RDQ doorbell; mlxsw_pci_cqe_rdq_handle() handles the counter but
does not ring the doorbell.
Note that with this change there is no need to copy the CQE, as we ring
the CQ doorbell only after Rx packet processing (which uses the CQE) is
done.
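Schematically, the tail of the CQ tasklet becomes something like this (a
hedged sketch; the counter fields and doorbell helper are named after
their roles and are not necessarily the exact driver symbols):

/* After handling 'work_done' completions in the tasklet, advance both
 * doorbell counters by 'work_done' and ring each doorbell once, CQ's
 * before RDQ's, preserving the ordering required by commit c9ebea04cb.
 */
cq->doorbell_counter += work_done;	/* illustrative field names */
rdq->doorbell_counter += work_done;
mlxsw_pci_queue_doorbell_ring(mlxsw_pci, cq);
mlxsw_pci_queue_doorbell_ring(mlxsw_pci, rdq);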
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can get many completions in one interrupt. Currently, the CQ tasklet
handles up to half the queue size of completions, and then arms the
hardware to generate additional events, which means that if there were
additional completions that we did not handle, we will immediately get an
additional interrupt to handle the rest.
The decision to handle up to half of the queue size is arbitrary and was
made in 2015, when the mlxsw driver was added to the kernel. An additional
fact that should be taken into account is that while WQEs from the RDQ are
handled, the CPU that handles the tasklet is dedicated to this task, which
means that we might hold the CPU for a long time.
Handle WQEs in smaller chunks, then arm the CQ doorbell to notify the
hardware to send additional notifications. Set the chunk size to 64, as
this number is recommended when using NAPI and the driver will use NAPI
in an upcoming patch.
Note that for now we use the arm doorbell to retrigger the CQ tasklet, but
with NAPI it will be more efficient, as software will reschedule the poll
method and we will not involve the hardware for that.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When calling qede_parse_actions(), the return code was only used for a
non-zero check, and then -EINVAL was returned.
qede_parse_actions() can currently fail with:
* -EINVAL
* -EOPNOTSUPP
This patch changes the code to use the actual
return code, not just return -EINVAL.
The blamed commit broke the implicit assumption
that only -EINVAL would ever be returned.
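The shape of the change is roughly the following (a hedged sketch with the
argument list abbreviated, not the exact hunk):

rc = qede_parse_actions(edev, flow_action, extack);
if (rc)
	return rc;	/* was: return -EINVAL; */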
Only compile tested.
Fixes: 319a1d1947 ("flow_offload: check for basic action hw stats type")
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In qede_flow_spec_to_rule(), when calling qede_parse_flow_attr(), the
return code was only used for a non-zero check, and then -EINVAL was
returned.
qede_parse_flow_attr() can currently fail with:
* -EINVAL
* -EOPNOTSUPP
* -EPROTONOSUPPORT
This patch changes the code to use the actual
return code, not just return -EINVAL.
The blamed commit introduced qede_flow_spec_to_rule() and this call to
qede_parse_flow_attr(); it looks like it just duplicated how it was
already used.
Only compile tested.
Fixes: 37c5d3efd7 ("qede: use ethtool_rx_flow_rule() to remove duplicated parser code")
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In qede_add_tc_flower_fltr(), when calling qede_parse_flow_attr(), the
return code was only used for a non-zero check, and then -EINVAL was
returned.
qede_parse_flow_attr() can currently fail with:
* -EINVAL
* -EOPNOTSUPP
* -EPROTONOSUPPORT
This patch changes the code to use the actual
return code, not just return -EINVAL.
The blamed commit introduced these functions.
Only compile tested.
Fixes: 2ce9c93eac ("qede: Ingress tc flower offload (drop action) support.")
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Explicitly set 'rc' (return code) before jumping to the unlock and return
path.
With no code depending on 'rc' remaining at its initial value of -EINVAL,
we can reuse 'rc' for the return code of function calls in subsequent
patches.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UMAC_CMD register is written from different execution
contexts and has insufficient synchronization protections to
prevent possible corruption. Of particular concern are the
accesses from the phy_device delayed work context used by the
adjust_link call and the BH context that may be used by the
ndo_set_rx_mode call.
A spinlock is added to the driver to protect contended register
accesses (i.e. reg_lock) and it is used to synchronize accesses
to UMAC_CMD.
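The resulting pattern is roughly the following (a hedged sketch; which
lock variant - plain, _bh or _irqsave - is used at a given call site
depends on its context, and the bit modification shown is just an
example):

spin_lock_bh(&priv->reg_lock);
reg = bcmgenet_umac_readl(priv, UMAC_CMD);
reg |= CMD_TX_EN | CMD_RX_EN;		/* example modification */
bcmgenet_umac_writel(priv, reg, UMAC_CMD);
spin_unlock_bh(&priv->reg_lock);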
Fixes: 1c1008c793 ("net: bcmgenet: add main driver file")
Cc: stable@vger.kernel.org
Signed-off-by: Doug Berger <opendmb@gmail.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ndo_set_rx_mode function is synchronized with the
netif_addr_lock spinlock and BHs disabled. Since this
function is also invoked directly from the driver, the
same synchronization should be applied.
Fixes: 72f9634762 ("net: bcmgenet: set Rx mode before starting netif")
Cc: stable@vger.kernel.org
Signed-off-by: Doug Berger <opendmb@gmail.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The EXT_RGMII_OOB_CTRL register can be written from different
contexts. It is predominantly written from the adjust_link
handler, which is synchronized by phydev->lock, but it can
also be written from a different context when configuring the
MII in bcmgenet_mii_config().
The chances of contention are quite low, but it is conceivable
that adjust_link could occur during resume when WoL is enabled,
so take phydev->lock in bcmgenet_mii_config() to be sure.
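Illustratively, the guarded write looks like this (a hedged sketch with
the surrounding code abbreviated):

mutex_lock(&phydev->lock);
reg = bcmgenet_ext_readl(priv, EXT_RGMII_OOB_CTRL);
/* ... adjust the OOB control bits for the selected MII mode ... */
bcmgenet_ext_writel(priv, reg, EXT_RGMII_OOB_CTRL);
mutex_unlock(&phydev->lock);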
Fixes: afe3f907d2 ("net: bcmgenet: power on MII block for all MII modes")
Cc: stable@vger.kernel.org
Signed-off-by: Doug Berger <opendmb@gmail.com>
Acked-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the base-time for taprio is in the past, start the schedule at a time
of the form "base_time + N*cycle_time", where N is the smallest possible
integer such that the resulting time is in the future.
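A minimal sketch of that computation, assuming a helper of the following
shape (the function and variable names are illustrative, not the
driver's):

#include <linux/math64.h>

static u64 taprio_future_base_time(u64 base_time, u64 cycle_time, u64 now)
{
	u64 n;

	if (base_time > now)
		return base_time;

	/* Smallest N such that base_time + N * cycle_time is in the future. */
	n = div64_u64(now - base_time, cycle_time) + 1;

	return base_time + n * cycle_time;
}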
Signed-off-by: Tanmay Patil <t-patil@ti.com>
Signed-off-by: Chintan Vankar <c-vankar@ti.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for per-packet Tx hardware timestamp requests on
AF_XDP zero-copy packets via the XDP Tx metadata framework. Please note
that the user needs to enable the Tx HW timestamp capability via
igc_ioctl() with the SIOCSHWTSTAMP cmd before sending an xsk Tx hardware
timestamp request.
As in the Rx timestamp XDP hints kfunc metadata implementation, Timer 0
(the adjustable clock) is used for the xsk Tx hardware timestamp.
i225/i226 have four sets of timestamping registers. The *skb and
*xsk_tx_buffer pointers are used to indicate whether a timestamping
register is already occupied.
Furthermore, a boolean variable named xsk_pending_ts is used to hold the
transmit completion until the Tx hardware timestamp is ready. This is
because, for i225/i226, the timestamp notification event comes some time
after the transmit completion event. The driver will retrigger the
hardware IRQ to clean the packet after retrieving the Tx hardware
timestamp.
In addition, xsk_meta is added to struct igc_tx_timestamp_request as a
hook to the metadata location of the transmitted packet. When the Tx
timestamp interrupt fires, the interrupt handler copies the value of the
Tx hardware timestamp into the metadata location via
xsk_tx_metadata_complete().
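A rough sketch of that completion-side call (the driver-side field and
ops names are assumptions for illustration; xsk_tx_metadata_complete() is
the framework entry point):

/* In the Tx timestamp interrupt handler, once the hardware timestamp for
 * this request has been retrieved:
 */
if (tstamp->xsk_tx_buffer)
	xsk_tx_metadata_complete(&tstamp->xsk_meta,
				 &igc_xsk_tx_metadata_ops,
				 tstamp->xsk_tx_buffer);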
This patch is tested with tools/testing/selftests/bpf/xdp_hw_metadata
on Intel ADL-S platform. Below are the test steps and results.
Test Step 1: Run xdp_hw_metadata app
./xdp_hw_metadata <iface> > /dev/shm/result.log
Test Step 2: Enable Tx hardware timestamp
hwstamp_ctl -i <iface> -t 1 -r 1
Test Step 3: Run ptp4l and phc2sys for time synchronization
Test Step 4: Generate UDP packets with 1ms interval for 10s
trafgen --dev <iface> '{eth(da=<addr>), udp(dp=9091)}' -t 1ms -n 10000
Test Step 5: Rerun Step 1-3 with 10s iperf3 as background traffic
Test Step 6: Rerun Step 1-4 with 10s iperf3 as background traffic
Based on the iperf3 results below, the impact of holding Tx completions
on throughput is not observable.
Result of last UDP packet (no. 10000) in Step 4:
poll: 1 (0) skip=99 fail=0 redir=10000
xsk_ring_cons__peek: 1
0x5640a37972d0: rx_desc[9999]->addr=f2110 addr=f2110 comp_addr=f2110 EoP
rx_hash: 0x2049BE1D with RSS type:0x1
HW RX-time: 1679819246792971268 (sec:1679819246.7930) delta to User RX-time sec:0.0000 (14.990 usec)
XDP RX-time: 1679819246792981987 (sec:1679819246.7930) delta to User RX-time sec:0.0000 (4.271 usec)
No rx_vlan_tci or rx_vlan_proto, err=-95
0x5640a37972d0: ping-pong with csum=ab19 (want 315b) csum_start=34 csum_offset=6
0x5640a37972d0: complete tx idx=9999 addr=f010
HW TX-complete-time: 1679819246793036971 (sec:1679819246.7930) delta to User TX-complete-time sec:0.0001 (77.656 usec)
XDP RX-time: 1679819246792981987 (sec:1679819246.7930) delta to User TX-complete-time sec:0.0001 (132.640 usec)
HW RX-time: 1679819246792971268 (sec:1679819246.7930) delta to HW TX-complete-time sec:0.0001 (65.703 usec)
0x5640a37972d0: complete rx idx=10127 addr=f2110
Result of iperf3 without tx hwts request in step 5:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.74 GBytes 2.36 Gbits/sec 0 sender
[ 5] 0.00-10.05 sec 2.74 GBytes 2.34 Gbits/sec receiver
Result of iperf3 running parallel with trafgen command in step 6:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.74 GBytes 2.36 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 2.74 GBytes 2.34 Gbits/sec receiver
Co-developed-by: Lai Peter Jun Ann <jun.ann.lai@intel.com>
Signed-off-by: Lai Peter Jun Ann <jun.ann.lai@intel.com>
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240424210256.3440903-1-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tony Nguyen says:
====================
net: intel: start The Great Code Dedup + Page Pool for iavf
Alexander Lobakin says:
Here's a two-shot: introduce {,Intel} Ethernet common library (libeth and
libie) and switch iavf to Page Pool. Details are in the commit messages;
here's a summary:
Not a secret there's a ton of code duplication between two and more Intel
ethernet modules. Before introducing new changes, which would need to be
copied over again, start decoupling the already existing duplicate
functionality into a new module, which will be shared between several
Intel Ethernet drivers. The first name that came to my mind was
"libie" -- "Intel Ethernet common library". Also this sounds like
"lovelie" (-> one word, no "lib I E" pls) and can be expanded as
"lib Internet Explorer" :P
The "generic", pure-software part is placed separately, so that it can be
easily reused in any driver by any vendor without linking to the Intel
pre-200G guts. In a few words, it's something any modern driver does the
same way, but nobody moved it level up (yet).
The series is only the beginning. From now on, adding every new feature
or doing any good driver refactoring will remove much more lines than add
for quite some time. There's a basic roadmap with some deduplications
planned already, not speaking of that touching every line now asks:
"can I share this?". The final destination is very ambitious: have only
one unified driver for at least i40e, ice, iavf, and idpf with a struct
ops for each generation. That's never gonna happen, right? But you still
can at least try.
PP conversion for iavf lands within the same series as these two are tied
closely. libie will support Page Pool model only, so that a driver can't
use much of the lib until it's converted. iavf is only the example, the
rest will eventually be converted soon on a per-driver basis. That is
when it gets really interesting. Stay tech.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
MAINTAINERS: add entry for libeth and libie
iavf: switch to Page Pool
iavf: pack iavf_ring more efficiently
libeth: add Rx buffer management
page_pool: add DMA-sync-for-CPU inline helper
page_pool: constify some read-only function arguments
slab: introduce kvmalloc_array_node() and kvcalloc_node()
iavf: drop page splitting and recycling
iavf: kill "legacy-rx" for good
net: intel: introduce {, Intel} Ethernet common library
====================
Link: https://lore.kernel.org/r/20240424203559.3420468-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The fragment lookup should only be performed when
at least one of the fragment flags is set.
This change was deliberately not included in commit
68aba00483 ("net: sparx5: flower: fix fragment flags handling"),
as it is only needed to future-proof the code, since
"mask" is currently only set in conjunction with the
fragment flags.
(The third flag, FLOW_DIS_ENCAPSULATION, is only used with "key".)
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-2-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We try to access count + 1 bytes from userspace with memdup_user(buffer,
count + 1). However, userspace only provides a buffer of count bytes, and
only these count bytes are verified to be okay to access. To ensure the
copied buffer is NUL terminated, use memdup_user_nul() instead.
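For reference, the resulting pattern looks like this (a generic sketch of
the fix, not the exact hunk):

char *cmd_buf;

/* Copies exactly 'count' bytes from userspace and appends a NUL. */
cmd_buf = memdup_user_nul(buffer, count);
if (IS_ERR(cmd_buf))
	return PTR_ERR(cmd_buf);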
Fixes: 3a2eb515d1 ("octeontx2-af: Fix an off by one in rvu_dbg_qsize_write()")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://lore.kernel.org/r/20240424-fix-oob-read-v2-6-f1f1b53a10f4@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, we allocate an nbytes-sized kernel buffer and copy nbytes from
userspace to that buffer. Later, we use sscanf on this buffer but we do
not ensure that the string is terminated inside the buffer, which can lead
to an OOB read when using sscanf. Fix this issue by using
memdup_user_nul() instead of memdup_user().
Fixes: 7afc5dbde0 ("bna: Add debugfs interface.")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://lore.kernel.org/r/20240424-fix-oob-read-v2-2-f1f1b53a10f4@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, we allocate a count-sized kernel buffer and copy count bytes
from userspace to that buffer. Later, we use sscanf on this buffer but we
do not ensure that the string is terminated inside the buffer, which can
lead to an OOB read when using sscanf. Fix this issue by using
memdup_user_nul() instead of memdup_user().
Fixes: 96a9a9341c ("ice: configure FW logging")
Fixes: 73671c3162 ("ice: enable FW logging")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://lore.kernel.org/r/20240424-fix-oob-read-v2-1-f1f1b53a10f4@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The CPTS, by design, captures the messageType (Sync, Delay_Req, etc.)
field from the second nibble of the PTP header, as defined in the
PTPv2 (1588-2008) specification. In the PTPv1 (1588-2002) specification,
the first two bytes of the PTP header are defined as the versionType,
which is always 0x0001. This means that any PTPv1 packets that are
tagged for Tx timestamping by the CPTS will have their messageType set
to 0x0, which corresponds to the Sync message type. This causes issues
when a PTPv1 stack is expecting a Delay_Req (messageType: 0x1)
timestamp that never appears.
Fix this by checking whether the ptp_class of the timestamped Tx packet
is PTP_CLASS_V1 and then matching the PTP sequence ID against the stored
sequence ID in the skb->cb data structure. If the sequence IDs match
and the packet is of type PTPv1, there is a chance that the messageType
has been incorrectly stored by the CPTS, so overwrite the messageType
stored by the CPTS with the messageType from the skb->cb data structure.
This allows the PTPv1 stack to receive Tx timestamps for Delay_Req
packets, which are necessary to lock onto a PTP Leader.
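A hedged sketch of that check (the skb->cb field names are illustrative;
PTP_CLASS_V1 and PTP_CLASS_VMASK come from <linux/ptp_classify.h>):

/* On a Tx timestamp event: for PTPv1 the CPTS-captured messageType is
 * unreliable, so if the sequence IDs match, use the messageType saved in
 * skb->cb at transmit time instead.
 */
if ((skb_cb->ptp_class & PTP_CLASS_VMASK) == PTP_CLASS_V1 &&
    event_seqid == skb_cb->seqid)
	event_mtype = skb_cb->mtype;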
Signed-off-by: Jason Reeder <jreeder@ti.com>
Signed-off-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
Tested-by: Ed Trexel <ed.trexel@hp.com>
Fixes: f6bd59526c ("net: ethernet: ti: introduce am654 common platform time sync driver")
Link: https://lore.kernel.org/r/20240424071626.32558-1-r-gunasekaran@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
9f74a3dfcf ("ice: Fix VF Reset paths when interface in a failed over
aggregate"), the ice driver has acquired the LAG mutex in ice_reset_vf().
The commit placed this lock acquisition just prior to the acquisition of
the VF configuration lock.
If ice_reset_vf() acquires the configuration lock via the ICE_VF_RESET_LOCK
flag, this could deadlock with ice_vc_cfg_qs_msg() because it always
acquires the locks in the order of the VF configuration lock and then the
LAG mutex.
Lockdep reports this violation almost immediately on creating and then
removing 2 VFs:
======================================================
WARNING: possible circular locking dependency detected
6.8.0-rc6 #54 Tainted: G W O
------------------------------------------------------
kworker/60:3/6771 is trying to acquire lock:
ff40d43e099380a0 (&vf->cfg_lock){+.+.}-{3:3}, at: ice_reset_vf+0x22f/0x4d0 [ice]
but task is already holding lock:
ff40d43ea1961210 (&pf->lag_mutex){+.+.}-{3:3}, at: ice_reset_vf+0xb7/0x4d0 [ice]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&pf->lag_mutex){+.+.}-{3:3}:
__lock_acquire+0x4f8/0xb40
lock_acquire+0xd4/0x2d0
__mutex_lock+0x9b/0xbf0
ice_vc_cfg_qs_msg+0x45/0x690 [ice]
ice_vc_process_vf_msg+0x4f5/0x870 [ice]
__ice_clean_ctrlq+0x2b5/0x600 [ice]
ice_service_task+0x2c9/0x480 [ice]
process_one_work+0x1e9/0x4d0
worker_thread+0x1e1/0x3d0
kthread+0x104/0x140
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x1b/0x30
-> #0 (&vf->cfg_lock){+.+.}-{3:3}:
check_prev_add+0xe2/0xc50
validate_chain+0x558/0x800
__lock_acquire+0x4f8/0xb40
lock_acquire+0xd4/0x2d0
__mutex_lock+0x9b/0xbf0
ice_reset_vf+0x22f/0x4d0 [ice]
ice_process_vflr_event+0x98/0xd0 [ice]
ice_service_task+0x1cc/0x480 [ice]
process_one_work+0x1e9/0x4d0
worker_thread+0x1e1/0x3d0
kthread+0x104/0x140
ret_from_fork+0x31/0x50
ret_from_fork_asm+0x1b/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&pf->lag_mutex);
lock(&vf->cfg_lock);
lock(&pf->lag_mutex);
lock(&vf->cfg_lock);
*** DEADLOCK ***
4 locks held by kworker/60:3/6771:
#0: ff40d43e05428b38 ((wq_completion)ice){+.+.}-{0:0}, at: process_one_work+0x176/0x4d0
#1: ff50d06e05197e58 ((work_completion)(&pf->serv_task)){+.+.}-{0:0}, at: process_one_work+0x176/0x4d0
#2: ff40d43ea1960e50 (&pf->vfs.table_lock){+.+.}-{3:3}, at: ice_process_vflr_event+0x48/0xd0 [ice]
#3: ff40d43ea1961210 (&pf->lag_mutex){+.+.}-{3:3}, at: ice_reset_vf+0xb7/0x4d0 [ice]
stack backtrace:
CPU: 60 PID: 6771 Comm: kworker/60:3 Tainted: G W O 6.8.0-rc6 #54
Hardware name:
Workqueue: ice ice_service_task [ice]
Call Trace:
<TASK>
dump_stack_lvl+0x4a/0x80
check_noncircular+0x12d/0x150
check_prev_add+0xe2/0xc50
? save_trace+0x59/0x230
? add_chain_cache+0x109/0x450
validate_chain+0x558/0x800
__lock_acquire+0x4f8/0xb40
? lockdep_hardirqs_on+0x7d/0x100
lock_acquire+0xd4/0x2d0
? ice_reset_vf+0x22f/0x4d0 [ice]
? lock_is_held_type+0xc7/0x120
__mutex_lock+0x9b/0xbf0
? ice_reset_vf+0x22f/0x4d0 [ice]
? ice_reset_vf+0x22f/0x4d0 [ice]
? rcu_is_watching+0x11/0x50
? ice_reset_vf+0x22f/0x4d0 [ice]
ice_reset_vf+0x22f/0x4d0 [ice]
? process_one_work+0x176/0x4d0
ice_process_vflr_event+0x98/0xd0 [ice]
ice_service_task+0x1cc/0x480 [ice]
process_one_work+0x1e9/0x4d0
worker_thread+0x1e1/0x3d0
? __pfx_worker_thread+0x10/0x10
kthread+0x104/0x140
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
To avoid the deadlock, we must acquire the LAG mutex only after acquiring
the VF configuration lock. Fix ice_reset_vf() to acquire the LAG mutex
only after we either acquire or check that the VF configuration lock is
held.
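A simplified sketch of the corrected ordering (the real function also
handles the trylock/retry details around the VF configuration lock):

/* VF configuration lock first ... */
if (flags & ICE_VF_RESET_LOCK)
	mutex_lock(&vf->cfg_lock);
else
	lockdep_assert_held(&vf->cfg_lock);

/* ... and only then the LAG mutex, matching ice_vc_cfg_qs_msg(). */
mutex_lock(&pf->lag_mutex);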
Fixes: 9f74a3dfcf ("ice: Fix VF Reset paths when interface in a failed over aggregate")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Dave Ertman <david.m.ertman@intel.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240423182723.740401-5-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If the MFS is set below the default (0x2600), a warning message is
reported like the following:
MFS for port 1 has been set below the default: 600
This message is a bit confusing as the number shown here (600) is in
fact a hexadecimal number: 0x600 = 1536.
Without an explicit "0x" prefix, this message reads as if the MFS were
set to 600 bytes.
MFS, like MTU, is usually expressed in decimal.
This commit reports both the current and default MFS values in decimal
so it is less confusing for end users.
A typical warning message now looks like the following:
MFS for port 1 (1536) has been set below the default (9728)
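Illustratively, the report becomes something of this shape (a hedged
sketch; the exact print helper and variable names in the driver may
differ, and 'current_mfs'/'default_mfs' are placeholder names):

dev_warn(&pf->pdev->dev,
	 "MFS for port %u (%u) has been set below the default (%u)\n",
	 pf->hw.port, current_mfs, default_mfs);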
Signed-off-by: Erwan Velu <e.velu@criteo.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Tony Brelinski <tony.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Fixes: 3a2c6ced90 ("i40e: Add a check to see if MFS is set")
Link: https://lore.kernel.org/r/20240423182723.740401-3-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Replace the hardcoded values used in the calculations for
vnic descriptors and rings with defines. Minor code cleanup.
Signed-off-by: Satish Kharat <satishkh@cisco.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>