Fix an incorrect ETH56G FC-FEC Rx offset value by changing it from
-255.96 ns to -469.26 ns.
These values are derived from the HW spec and reflect internal delays.
The hex value is a fixed-point representation in Q23.9 format.
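For illustration only (the constant below is computed here under the
standard Q-format scaling rule, not taken from the driver source), a
Q23.9 value is the nanosecond quantity scaled by 2^9 = 512:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      double offset_ns = -469.26;
      /* Q23.9: 23 integer bits, 9 fractional bits -> scale by 2^9 */
      int32_t q23_9 = (int32_t)(offset_ns * 512.0);

      /* -469.26 * 512 = -240261.12, truncated to -240261 = 0xfffc557b */
      printf("%.2f ns -> 0x%08x\n", offset_ns, (uint32_t)q23_9);
      return 0;
  }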
Fixes: 7cab44f1c3 ("ice: Introduce ETH56G PHY model for E825C products")
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Quad registers are read/written incorrectly. E825 devices always use
the quad 0 address and differentiate between the PHYs by changing the
SBQ destination device (phy_0 or phy_0_peer).
Add helpers for reading/writing PTP registers shared per quad, and use
the correct quad address and SBQ destination device based on the port.
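A hedged sketch of what the write-side helper could look like; the
helper name, the ice_write_sbq_reg() primitive, and the destination
selection details are illustrative, not the driver's exact code:

  /* E825 always uses the quad 0 register address; ports behind the
   * second PHY are reached by switching the sideband queue destination
   * to the peer device instead of using a quad 1 address.
   */
  static int ice_write_quad_ptp_reg_e825(struct ice_hw *hw, u8 port,
                                         u32 offset, u32 val)
  {
      /* illustrative destination selection based on the port */
      enum ice_sbq_dev_id dest = port < hw->ptp.ports_per_phy ?
                                 ice_sbq_dev_phy_0 :
                                 ice_sbq_dev_phy_0_peer;

      return ice_write_sbq_reg(hw, dest, offset, val); /* hypothetical */
  }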
Fixes: 7cab44f1c3 ("ice: Introduce ETH56G PHY model for E825C products")
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
kthread_create() creates a kthread without running it yet. kthread_run()
creates a kthread and runs it.
On the other hand, kthread_create_worker() creates a kthread worker and
runs it.
This difference in behaviours is confusing. Also there is no way to
create a kthread worker and affine it using kthread_bind_mask() or
kthread_affine_preferred() before starting it.
Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]()
that behaves just like kthread_run(). kthread_create_worker[_on_cpu]()
will now only create a kthread worker without starting it.
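A hedged usage sketch of the resulting API (the worker name, flags, and
CPU number are illustrative):

  struct kthread_worker *kw;

  /* Create without starting, set affinity, then start manually */
  kw = kthread_create_worker(0, "my_worker");
  if (!IS_ERR(kw)) {
      kthread_affine_preferred(kw->task, cpumask_of(2));
      wake_up_process(kw->task);
  }

  /* Or create and start in one shot, like kthread_run() */
  kw = kthread_run_worker(0, "my_worker");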
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
The ptp4l application reports too high an offset when run on an E823
device with a 100 Gb/s link. The offset values never drop below 100 ns,
as they do in a working case using a 100 Gb/s cable.
This is due to incorrect frequency settings in the PHY clocks for the
100 Gb/s speed. Changes are introduced to align with the internal
hardware documentation and to correctly initialize the PHY clock
frequencies with the values from our HW spec.
To reproduce the issue, run ptp4l as a Time Receiver on an E823 device
and observe the offset, which will never approach the values seen in
the working PTP case.
Reproduction output:
ptp4l -i enp137s0f3 -m -2 -s -f /etc/ptp4l_8275.conf
ptp4l[5278.775]: master offset 12470 s2 freq +41288 path delay -3002
ptp4l[5278.837]: master offset 10525 s2 freq +39202 path delay -3002
ptp4l[5278.900]: master offset -24840 s2 freq -20130 path delay -3002
ptp4l[5278.963]: master offset 10597 s2 freq +37908 path delay -3002
ptp4l[5279.025]: master offset 8883 s2 freq +36031 path delay -3002
ptp4l[5279.088]: master offset 7267 s2 freq +34151 path delay -3002
ptp4l[5279.150]: master offset 5771 s2 freq +32316 path delay -3002
ptp4l[5279.213]: master offset 4388 s2 freq +30526 path delay -3002
ptp4l[5279.275]: master offset -30434 s2 freq -28485 path delay -3002
ptp4l[5279.338]: master offset -28041 s2 freq -27412 path delay -3002
ptp4l[5279.400]: master offset 7870 s2 freq +31118 path delay -3002
Fixes: 3a7496234d ("ice: implement basic E822 PTP support")
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Przemyslaw Korba <przemyslaw.korba@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Mask the max phase adjust value returned by the admin command for both
input and output pins. Only 31 bits are relevant; the last released
datasheet wrongly states that 32 bits are valid - see [1], 3.2.6.4.1
"Get CCU Capabilities Command" for reference. A fix of the datasheet
itself is in progress.
Fix the min/max assignment logic; previously the value was wrongly
treated as negative due to the most significant bit being set.
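A hedged sketch of the fix (the mask define name is hypothetical; the
dpll pin property fields follow the existing DPLL subsystem layout):

  #define ICE_AQ_PHASE_ADJ_MASK  GENMASK(30, 0)  /* only 31 bits valid */

      /* mask before use so the stray MSB can't flip the sign */
      phase_adj = max_phase_adj & ICE_AQ_PHASE_ADJ_MASK;
      pin->prop.phase_range.min = -(s32)phase_adj;
      pin->prop.phase_range.max = (s32)phase_adj;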
Example of previous broken behavior:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/dpll.yaml \
--do pin-get --json '{"id":1}'| grep phase-adjust
'phase-adjust': 0,
'phase-adjust-max': 16723,
'phase-adjust-min': -16723,
Correct behavior with the fix:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/dpll.yaml \
--do pin-get --json '{"id":1}'| grep phase-adjust
'phase-adjust': 0,
'phase-adjust-max': 2147466925,
'phase-adjust-min': -2147466925,
[1] https://cdrdv2.intel.com/v1/dl/getContent/613875?explicitVersion=true
Fixes: 90e1c90750 ("ice: dpll: implement phase related callbacks")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Currently, xsk_buff_add_frag() only adds the frag to the pool's linked
list and does nothing with the &xdp_buff. The drivers do that manually,
and the logic is the same.
Make it really add an skb frag, just like xdp_buff_add_frag() does,
freeing frags on error if needed. This allows removing repeated code
from i40e and ice and avoids adding the same code again and again.
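A hedged sketch of the driver-side pattern after this change (based on
the ice Rx path; surrounding loop and variable names illustrative):

      /* the helper now attaches the frag to the head xdp_buff and
       * cleans up on failure, so the driver only checks the result
       */
      if (!xsk_buff_add_frag(first, xdp)) {
          xsk_buff_free(first);
          break;
      }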
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Drop "devlink_" prefix from files that sit in devlink/.
I'm going to add more files there, and repeating "devlink" does not feel
good. This is also the scheme used in most other places; most notably,
the devlink core files are named like that.
devlink.[ch] stays as is.
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The ice_copy_rxq_ctx_to_hw() and ice_write_rxq_ctx() functions perform some
defensive checks which are typically frowned upon by kernel style
guidelines.
In particular, NULL checks on buffers which point to the stack are
discouraged, especially when the functions are static and only called once.
Checks of this sort only serve to hide potential programming errors, as we
will not produce the normal crash dump on a NULL access.
In addition, ice_copy_rxq_ctx_to_hw() cannot fail in any other way, so it
could be made void.
Future support for VF Live Migration will need to introduce an inverse
function for reading Rx queue context from HW registers to unpack it, as
well as functions to pack and unpack Tx queue context from HW.
Rather than copying these style issues into the new functions, let's first
clean up the existing code.
For the ice_copy_rxq_ctx_to_hw() function:
* Move the Rx queue index check out of this function.
* Convert the function to a void return.
* Use a simple int variable instead of a u8 for the for loop index, and
initialize it inside the for loop.
* Update the function description to better align with kernel doc style.
For the ice_write_rxq_ctx() function:
* Move the Rx queue index check into this function.
* Update the function description with a Returns: to align with kernel doc
style.
These changes align the existing write functions to current kernel
style, and will align with the style of the new functions added when we
implement live migration in a future series.
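A hedged sketch of how the copy helper could look after these changes
(the register macro and dword count come from the existing driver
defines; details are illustrative):

  static void ice_copy_rxq_ctx_to_hw(struct ice_hw *hw, const u8 *rxq_ctx,
                                     u32 rxq_index)
  {
      /* copy each dword separately to HW */
      for (int i = 0; i < ICE_RXQ_CTX_SIZE_DWORDS; i++) {
          u32 ctx = ((const u32 *)rxq_ctx)[i];

          wr32(hw, QRX_CONTEXT(i, rxq_index), ctx);
      }
  }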
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-10-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ice_write_rxq_ctx() function is responsible for programming the Rx
Queue context into hardware. It receives the configuration in unpacked form
via the ice_rlan_ctx structure.
This function unconditionally modifies the context to set the prefetch
enable bit. This was done by commit c31a5c25bb ("ice: Always set prefena
when configuring an Rx queue"). Setting this bit makes sense, since
prefetching descriptors is almost always the preferred behavior.
However, the ice_write_rxq_ctx() function is not the place that actually
defines the queue context. We initialize the Rx Queue context in
ice_setup_rx_ctx(). It is surprising to have the Rx queue context changed
by a function whose responsibility is to program the given context to
hardware.
Following the principle of least surprise, move the setting of the prefetch
enable bit out of ice_write_rxq_ctx() and into ice_setup_rx_ctx().
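A hedged sketch of the result (a fragment from ice_setup_rx_ctx(), with
the surrounding context setup elided):

      /* the Rx queue context is defined here, so the policy decision
       * to always prefetch descriptors now lives here too
       */
      rlan_ctx.prefena = 1;

      err = ice_write_rxq_ctx(hw, &rlan_ctx, ring->reg_idx);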
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-9-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ice_rlan_ctx and ice_tlan_ctx structures have some fields which are
intentionally sized larger than necessary relative to the packed sizes the
data must fit into. This was done because the original ice_set_ctx()
function and its helpers did not correctly handle packing when the packed
bits straddled a byte. This is no longer the case with the use of the
<linux/packing.h> implementation.
Save some bytes in these structures by sizing the variables to the number
of bytes the actual bitpacked fields fit into.
There are a couple of gaps left in the structure, a result of the fields
being kept in the order they appear in the packed bit layout, where
alignment forces some padding. We could fix this, saving ~8 bytes from
each structure. However, these structures are not used heavily, and the
resulting savings are minimal:
$ bloat-o-meter ice-before-reorder.ko ice-after-reorder.ko
add/remove: 0/0 grow/shrink: 1/1 up/down: 26/-70 (-44)
Function old new delta
ice_vsi_cfg_txq 1873 1899 +26
ice_setup_rx_ctx.constprop 1529 1459 -70
Total: Before=1459555, After=1459511, chg -0.00%
Thus, the fields are left in the same order as the packed bit layout,
despite the gaps this causes.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-8-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ice driver needs to write the Tx and Rx queue context when programming
Tx and Rx queues. This is currently done using some bespoke custom logic
via the ice_set_ctx() and its helper functions, along with bit position
definitions in the ice_tlan_ctx_info and ice_rlan_ctx_info structures.
This logic does work, but is problematic for several reasons:
1) ice_set_ctx requires a helper function for each byte size being packed,
as it uses a separate function to pack u8, u16, u32, and u64 fields.
This requires 4 functions which contain near-duplicate logic with the
types changed out.
2) The logic in ice_pack_ctx_word(), ice_pack_ctx_dword(), and
ice_pack_ctx_qword() does not handle values which straddle alignment
boundaries very well. This requires that several fields in the
ice_tlan_ctx_info and ice_rlan_ctx_info be a size larger than their bit
size should require.
3) Future support for live migration will require adding unpacking
functions to take the packed hardware context and unpack it into the
ice_rlan_ctx and ice_tlan_ctx structures. Implementing this would
require implementing ice_get_ctx, and its associated helper functions,
which essentially doubles the amount of code required.
The Linux kernel has had a packing library that can handle this logic since
commit 554aae3500 ("lib: Add support for generic packing operations").
The library was recently extended with support for packing or unpacking an
array of fields, with a similar structure as the ice_ctx_ele structure.
Replace the ice-specific ice_set_ctx() logic with the recently added
pack_fields and packed_field_s infrastructure from <linux/packing.h>.
For API simplicity, the Tx and Rx queue contexts are programmed using
separate ice_pack_txq_ctx() and ice_pack_rxq_ctx() functions. This avoids
needing to export the packed_field_s arrays. The functions take pointers
to the appropriate ice_txq_ctx_buf_t and ice_rxq_ctx_buf_t types, ensuring
that only buffers of the appropriate size are passed.
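A hedged sketch of the new Rx packing path, assuming the pack_fields()
API from <linux/packing.h>; the field list is an illustrative subset
with approximate bit positions, not the full table:

  static const struct packed_field_u8 ice_rlan_ctx_fields[] = {
      /* PACKED_FIELD(start bit, end bit, struct, member) */
      PACKED_FIELD(12, 0, struct ice_rlan_ctx, head),
      PACKED_FIELD(20, 13, struct ice_rlan_ctx, cpuid),
  };

  static void ice_pack_rxq_ctx(const struct ice_rlan_ctx *ctx,
                               ice_rxq_ctx_buf_t *buf)
  {
      pack_fields(buf, sizeof(*buf), ctx, ice_rlan_ctx_fields,
                  QUIRK_LITTLE_ENDIAN | QUIRK_LSW32_IS_FIRST);
  }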
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-7-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ice Tx and Rx queue context are currently stored as arrays of bytes
with defined size (ICE_RXQ_CTX_SZ and ICE_TXQ_CTX_SZ). The packed queue
context is often passed to other functions as a simple u8 * pointer, which
does not allow tracking the size. This makes the queue context API easy to
misuse, as you can pass an arbitrary u8 array or pointer.
Introduce wrapper typedefs which use a __packed structure that has the
proper fixed size for the Tx and Rx context buffers. This enables the
compiler to track the size of the value and ensures that passing the wrong
buffer size will be detected by the compiler.
The existing APIs do not benefit much from this change, however the
wrapping structures will be used to simplify the arguments of new packing
functions based on the recently introduced pack_fields API.
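A hedged sketch of the wrapper typedefs, built on the existing
ICE_RXQ_CTX_SZ / ICE_TXQ_CTX_SZ size defines:

  /* fixed-size packed buffers the compiler can type-check */
  typedef struct __packed { u8 buf[ICE_RXQ_CTX_SZ]; } ice_rxq_ctx_buf_t;
  typedef struct __packed { u8 buf[ICE_TXQ_CTX_SZ]; } ice_txq_ctx_buf_t;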
Co-developed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-6-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Paolo Abeni:
"Including fixes from can and netfilter.
Current release - regressions:
- rtnetlink: fix double call of rtnl_link_get_net_ifla()
- tcp: populate XPS related fields of timewait sockets
- ethtool: fix access to uninitialized fields in set RXNFC command
- selinux: use sk_to_full_sk() in selinux_ip_output()
Current release - new code bugs:
- net: make napi_hash_lock irq safe
- eth:
- bnxt_en: support header page pool in queue API
- ice: fix NULL pointer dereference in switchdev
Previous releases - regressions:
- core: fix icmp host relookup triggering ip_rt_bug
- ipv6:
- avoid possible NULL deref in modify_prefix_route()
- release expired exception dst cached in socket
- smc: fix LGR and link use-after-free issue
- hsr: avoid potential out-of-bound access in fill_frame_info()
- can: hi311x: fix potential use-after-free
- eth: ice: fix VLAN pruning in switchdev mode
Previous releases - always broken:
- netfilter:
- ipset: hold module reference while requesting a module
- nft_inner: incorrect percpu area handling under softirq
- can: j1939: fix skb reference counting
- eth:
- mlxsw: use correct key block on Spectrum-4
- mlx5: fix memory leak in mlx5hws_definer_calc_layout"
* tag 'net-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
net :mana :Request a V2 response version for MANA_QUERY_GF_STAT
net: avoid potential UAF in default_operstate()
vsock/test: verify socket options after setting them
vsock/test: fix parameter types in SO_VM_SOCKETS_* calls
vsock/test: fix failures due to wrong SO_RCVLOWAT parameter
net/mlx5e: Remove workaround to avoid syndrome for internal port
net/mlx5e: SD, Use correct mdev to build channel param
net/mlx5: E-Switch, Fix switching to switchdev mode in MPV
net/mlx5: E-Switch, Fix switching to switchdev mode with IB device disabled
net/mlx5: HWS: Properly set bwc queue locks lock classes
net/mlx5: HWS: Fix memory leak in mlx5hws_definer_calc_layout
bnxt_en: handle tpa_info in queue API implementation
bnxt_en: refactor bnxt_alloc_rx_rings() to call bnxt_alloc_rx_agg_bmap()
bnxt_en: refactor tpa_info alloc/free into helpers
geneve: do not assume mac header is set in geneve_xmit_skb()
mlxsw: spectrum_acl_flex_keys: Use correct key block on Spectrum-4
ethtool: Fix wrong mod state in case of verbose and no_mask bitset
ipmr: tune the ipmr_can_free_table() checks.
netfilter: nft_set_hash: skip duplicated elements pending gc run
netfilter: ipset: Hold module reference while requesting a module
...
In switchdev mode the uplink VSI should receive all unmatched packets,
including VLANs. Therefore, VLAN pruning should be disabled if the uplink is
in switchdev mode. It is already being done in ice_eswitch_setup_env(),
however the addition of ice_up() in commit 44ba608db5 ("ice: do
switchdev slow-path Rx using PF VSI") caused VLAN pruning to be
re-enabled after disabling it.
Add a check to ice_set_vlan_filtering_features() to ensure VLAN
filtering will not be enabled if the uplink is in switchdev mode. Note that
ice_is_eswitch_mode_switchdev() is being used instead of
ice_is_switchdev_running(), as the latter would only return true after
the whole switchdev setup completes.
Fixes: 44ba608db5 ("ice: do switchdev slow-path Rx using PF VSI")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Tested-by: Priya Singh <priyax.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
To check if the PHY Clock Recovery mechanism is available for a device,
there is a need to verify that the given PHY is available within the
netlist, but the netlist node type used for the search is wrong, and the
search context must also be specified.
Modify the search function to allow specifying the context in the
search.
Use the PHY node type instead of the CLOCK CONTROLLER type, and use the
proper search context, which for a PHY search is PORT, as defined in the
E810 Datasheet [1] ('3.3.8.2.4 Node Part Number and Node Options (0x0003)'
and 'Table 3-105. Program Topology Device NVM Admin Command').
[1] https://cdrdv2.intel.com/v1/dl/getContent/613875?explicitVersion=true
Fixes: 91e43ca009 ("ice: fix linking when CONFIG_PTP_1588_CLOCK=n")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2024-11-05 (ice, ixgbe, igc, igb, igbvf, e1000)
For ice:
Mateusz refactors and adds additional SerDes configuration values to be
output.
Przemek refactors processing of DDP and adds support for a flag field in
the DDP's signature segment header.
Joe Damato adds support for persistent NAPI config.
Brett adjusts setting of Tx promiscuous based on unicast/multicast
setting.
Jake moves setting of pf->supported_rxdids to occur directly after DDP
load and changes a small struct to use stack memory.
Frederic Weisbecker adds WQ_UNBOUND flag to the workqueue.
For ixgbe:
Diomidis Spinellis removes a circular dependency.
For igc:
Vitaly removes an unneeded autoneg parameter.
For igb:
Johnny Park fixes a couple of typos.
For igbvf:
Wander Lairson Costa removes an unused spinlock.
For e1000:
Joe Damato adds RTNL lock to some calls where it is expected to be held.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
e1000: Hold RTNL when e1000_down can be called
igbvf: remove unused spinlock
igb: Fix 2 typos in comments in igb_main.c
igc: remove autoneg parameter from igc_mac_info
ixgbe: Break include dependency cycle
ice: Unbind the workqueue
ice: use stack variable for virtchnl_supported_rxdids
ice: initialize pf->supported_rxdids immediately after loading DDP
ice: only allow Tx promiscuous for multicast
ice: Add support for persistent NAPI config
ice: support optional flags in signature segment header
ice: refactor "last" segment of DDP pkg
ice: extend dump serdes equalizer values feature
ice: rework of dump serdes equalizer values feature
====================
Link: https://patch.msgid.link/20241113185431.1289708-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently when FDB entries are added to or deleted from a VXLAN netdevice,
the VXLAN driver emits one notification, including the VXLAN-specific
attributes. The core however always sends a notification as well, a generic
one. Thus two notifications are unnecessarily sent for these operations. A
similar situation comes up with the bridge driver, which also emits
notifications on its own:
# ip link add name vx type vxlan id 1000 dstport 4789
# bridge monitor fdb &
[1] 1981693
# bridge fdb add de:ad:be:ef:13:37 dev vx self dst 192.0.2.1
de:ad:be:ef:13:37 dev vx dst 192.0.2.1 self permanent
de:ad:be:ef:13:37 dev vx self permanent
In order to prevent this duplication, add a parameter to ndo_fdb_add,
bool *notified. The flag is primed to false, and if the callee sends a
notification on its own, it sets it to true, thus informing the core that
it should not generate another notification.
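A hedged sketch of a callee using the new parameter; the example_*
helpers are hypothetical placeholders for a driver's own FDB logic:

  static int example_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
                             struct net_device *dev,
                             const unsigned char *addr, u16 vid,
                             u16 flags, bool *notified,
                             struct netlink_ext_ack *extack)
  {
      int err;

      err = example_fdb_program(dev, addr, vid);  /* hypothetical */
      if (err)
          return err;

      example_fdb_notify(dev, addr, vid); /* driver's own notification */
      *notified = true;   /* tell the core not to notify again */
      return 0;
  }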
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/cbf6ae8195e85cbf922f8058ce4eba770f3b71ed.1731589511.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ice workqueue doesn't seem to rely on any CPU locality and should
therefore be able to run on any CPU. In practice this is already
happening through the unbound ice_service_timer, which may fire anywhere
and queue work accordingly on any CPU.
Make this official so that the ice workqueue is only ever queued to
housekeeping CPUs on nohz_full.
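A hedged sketch of the change in ice_module_init():

      /* WQ_UNBOUND lets work run on any housekeeping CPU rather than
       * pinning it to the CPU that queued it
       */
      ice_wq = alloc_workqueue("%s", WQ_UNBOUND, 0, KBUILD_MODNAME);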
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The ice_vc_query_rxdid() function allocates memory to store the
virtchnl_supported_rxdids structure used to communicate the bitmap of
supported RXDIDs to a VF.
This structure is only 8 bytes in size. The function must hold the
allocated length on the stack as well as the pointer to the structure which
itself is 8 bytes. Allocating this storage on the heap adds unnecessary
overhead including a potential error path that must be handled in case
kzalloc fails. Because this structure is so small, we're not saving stack
space. Additionally, because we must ensure that we free the allocated
memory, the return value from ice_vc_send_msg_to_vf() must also be saved in
the stack ret variable. Depending on compiler optimization, this means
allocating the 8-byte structure requires up to 16 bytes of stack
memory!
Simplify this function to keep the rxdid variable on the stack, saving
memory and removing a potential failure exit path from this function.
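A hedged sketch of the simplified function (the reply struct and send
helper follow the existing virtchnl code; details illustrative):

  static int ice_vc_query_rxdid(struct ice_vf *vf)
  {
      struct virtchnl_supported_rxdids rxdid = {};

      rxdid.supported_rxdids = vf->pf->supported_rxdids;

      return ice_vc_send_msg_to_vf(vf, VIRTCHNL_OP_GET_SUPPORTED_RXDIDS,
                                   VIRTCHNL_STATUS_SUCCESS, (u8 *)&rxdid,
                                   sizeof(rxdid));
  }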
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The pf->supported_rxdids field is used to populate the list of valid RXDIDs
that a VF may use when negotiating VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC.
The set of supported RXDIDs is dependent on the DDP, and can be read from
the GLXFLXP_RXDID_FLAGS register. The PF needs to send this list to the
VF upon receiving the VIRTCHNL_OP_GET_SUPPORTED_RXDIDs. It also needs to
use this list to validate the requested descriptor ID from the VF when
programming the Rx queues.
A future update to support VF live migration will also want to validate
that the target VF can support the same descriptor ID when migrating.
Currently, pf->supported_rxdids is initialized inside the
ice_vc_query_rxdid() function. This means that it is only ever initialized
if at least one VF actually tries to negotiate
VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC. It is also unnecessarily re-initialized
every time the VF loads and requests the descriptor list. This worked
before because the PF only checks pf->supported_rxdids when programming
the Rx queue if the VF actually negotiates the
VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC feature.
This will be problematic for VF live migration. We need the list of
supported Rx descriptor IDs when migrating. It is possible that no VF on
the target PF has ever actually issued a VIRTCHNL_OP_GET_SUPPORTED_RXDIDs.
Refactor the driver to initialize pf->supported_rxdids during driver
initialization after the DDP is loaded. This is simpler, avoids unnecessary
duplicate work, and avoids issues with the live migration process.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Currently when any VF is trusted and true promiscuous mode is enabled on
the PF, the VF will receive all unicast traffic directed to the device's
internal switch. This includes traffic external to the NIC and also from
other VSIs (i.e. VFs). This does not match the expected behavior, as
unicast traffic should only be visible from external sources in this
case. Disable the Tx promiscuous mode bits for unicast promiscuous mode.
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
An optional flag field has been added to the signature segment header.
The field contains two flags, a "valid" bit, and a "last segment" bit
that indicates whether the segment is the last segment that will be
sent to firmware.
If the flag field's valid bit is NOT set, then as was done before,
assume that this is the last segment being downloaded.
However, if the flag field's valid bit IS set, then use the last segment
flag to determine if this segment is the last segment to download.
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Co-developed-by: Dan Nowlin <dan.nowlin@intel.com>
Signed-off-by: Dan Nowlin <dan.nowlin@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add ice_ddp_send_hunk(), which buffers "send FW hunk" calls to the AQ in
order to mark the "last" one in a more elegant way. The next commit will
add an even more complicated "send FW" flow, so it's better to untangle
things a bit first.
Note that metadata buffers were not skipped for NOT-@indicate_last
segments; this is fixed now.
Minor:
+ use ice_is_buffer_metadata() instead of open coding it in
ice_dwnld_cfg_bufs();
+ ice_dwnld_cfg_bufs_no_lock() and its dependencies were moved up a bit
to get a better git diff, as this function was rewritten (in terms of
git-blame)
CC: Paul Greenwalt <paul.greenwalt@intel.com>
CC: Dan Nowlin <dan.nowlin@intel.com>
CC: Ahmed Zaki <ahmed.zaki@intel.com>
CC: Simon Horman <horms@kernel.org>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Extend the work done in commit 70838938e8 ("ice: Implement driver
functionality to dump serdes equalizer values") by adding a new set of
Rx registers that can be read using the command:
$ ethtool -d interface_name
Rx equalization parameters are E810 PHY registers used by the end user
to gather information about configuration and status, in order to debug
link and connection issues in the field.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Refactor the function ice_get_tx_rx_equa() to iterate over a new table
of params instead of making multiple calls to
ice_aq_get_phy_equalization(). A subsequent commit will extend that
function by adding more serdes equalizer values to dump.
Shorten the fields of struct ice_serdes_equalization_to_ethtool for
readability purposes.
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Fix Flow Director not allowing traffic to be re-mapped to queue 0 when
the action is configured to drop (and vice versa).
The current implementation of the ethtool callback in the ice driver
forbids changing the Flow Director action from 0 to -1 and from -1 to 0,
returning an error, e.g.:
# ethtool -U eth2 flow-type tcp4 src-ip 1.1.1.1 loc 1 action 0
# ethtool -U eth2 flow-type tcp4 src-ip 1.1.1.1 loc 1 action -1
rmgr: Cannot insert RX class rule: Invalid argument
We set the value of `u16 q_index = 0` at the beginning of the function
ice_set_fdir_input_set(). In case of the "drop traffic" action (which is
equal to -1 in ethtool) we store the value 0. Later, when we want to
change the rule to redirect traffic to the queue with index 0, an error
is returned because a duplicate is found.
Fix this behaviour by changing the type of the `q_index` field from u16
to s16 in `struct ice_fdir_fltr`. This allows storing -1 in the field in
case of the "drop traffic" action. Additionally, change the variable type
in the function ice_set_fdir_input_set() and assign at the beginning the
new `#define ICE_FDIR_NO_QUEUE_IDX`, which is -1. Later, if the action
points to a specific queue index, the variable value is overwritten in
the function.
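A hedged sketch of the core of the change (the `ring` variable and its
derivation from the ethtool flow spec are illustrative):

  #define ICE_FDIR_NO_QUEUE_IDX  -1  /* "drop traffic" action */

      /* in struct ice_fdir_fltr: s16 q_index (was u16) */

      s16 q_index = ICE_FDIR_NO_QUEUE_IDX;

      if (fsp->ring_cookie != RX_CLS_FLOW_DISC)
          q_index = ring; /* redirect to a specific queue */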
Fixes: cac2a27cd9 ("ice: Support IPv4 Flow Director filters")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Unloading the ice driver while switchdev port representors are added to
a bridge can lead to a kernel panic. Reproducer:
modprobe ice
devlink dev eswitch set $PF1_PCI mode switchdev
ip link add $BR type bridge
ip link set $BR up
echo 2 > /sys/class/net/$PF1/device/sriov_numvfs
sleep 2
ip link set $PF1 master $BR
ip link set $VF1_PR master $BR
ip link set $VF2_PR master $BR
ip link set $PF1 up
ip link set $VF1_PR up
ip link set $VF2_PR up
ip link set $VF1 up
rmmod irdma ice
When unloading the driver, ice_eswitch_detach() is eventually called as
part of VF freeing. First, it removes a port representor from the
xarray, then unregister_netdev() is called (via repr->ops.rem()), and
finally the representor is deallocated. The problem comes from the
bridge doing its
own deinit at the same time. unregister_netdev() triggers a notifier
chain, resulting in ice_eswitch_br_port_deinit() being called. It should
set repr->br_port = NULL, but this does not happen since repr has
already been removed from xarray and is not found. Regardless, it
finishes up deallocating br_port. At this point, repr is still not freed
and an fdb event can happen, in which ice_eswitch_br_fdb_event_work()
takes repr->br_port and tries to use it, which causes a panic (use after
free).
Note that this only happens with 2 or more port representors added to
the bridge, since with only one representor port, the bridge deinit is
slightly different (ice_eswitch_br_port_deinit() is called via
ice_eswitch_br_ports_flush(), not ice_eswitch_br_port_unlink()).
Trace:
Oops: general protection fault, probably for non-canonical address 0xf129010fd1a93284: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: maybe wild-memory-access in range [0x8948287e8d499420-0x8948287e8d499427]
(...)
Workqueue: ice_bridge_wq ice_eswitch_br_fdb_event_work [ice]
RIP: 0010:__rht_bucket_nested+0xb4/0x180
(...)
Call Trace:
(...)
ice_eswitch_br_fdb_find+0x3fa/0x550 [ice]
? __pfx_ice_eswitch_br_fdb_find+0x10/0x10 [ice]
ice_eswitch_br_fdb_event_work+0x2de/0x1e60 [ice]
? __schedule+0xf60/0x5210
? mutex_lock+0x91/0xe0
? __pfx_ice_eswitch_br_fdb_event_work+0x10/0x10 [ice]
? ice_eswitch_br_update_work+0x1f4/0x310 [ice]
(...)
A workaround is available: brctl setageing $BR 0, which stops the bridge
from adding fdb entries altogether.
Change the order of operations in ice_eswitch_detach(): move the call to
unregister_netdev() before removing repr from the xarray. This way
repr->br_port will be correctly set to NULL in
ice_eswitch_br_port_deinit(), preventing a panic.
Fixes: fff292b47a ("ice: add VF representors one by one")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
net_dim() is currently passed a struct dim_sample argument by value.
struct dim_sample is 24 bytes. Since this is greater than 16 bytes,
x86-64 passes it on the stack. All callers have already initialized
dim_sample on the stack, so passing it by value requires pushing a
duplicated copy to the stack. Either writing to the stack and immediately
reading it, or perhaps dereferencing addresses relative to the stack
pointer in a chain of push instructions, seems to perform quite poorly.
In a heavy TCP workload, mlx5e_handle_rx_dim() consumes 3% of CPU time,
94% of which is attributed to the first push instruction to copy
dim_sample on the stack for the call to net_dim():
// Call ktime_get()
0.26 |4ead2: call 4ead7 <mlx5e_handle_rx_dim+0x47>
// Pass the address of struct dim in %rdi
|4ead7: lea 0x3d0(%rbx),%rdi
// Set dim_sample.pkt_ctr
|4eade: mov %r13d,0x8(%rsp)
// Set dim_sample.byte_ctr
|4eae3: mov %r12d,0xc(%rsp)
// Set dim_sample.event_ctr
0.15 |4eae8: mov %bp,0x10(%rsp)
// Duplicate dim_sample on the stack
94.16 |4eaed: push 0x10(%rsp)
2.79 |4eaf1: push 0x10(%rsp)
0.07 |4eaf5: push %rax
// Call net_dim()
0.21 |4eaf6: call 4eafb <mlx5e_handle_rx_dim+0x6b>
To allow the caller to reuse the struct dim_sample already on the stack,
pass the struct dim_sample by reference to net_dim().
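A hedged sketch of the new signature and a caller (the driver-side
variable names and counters are illustrative):

  void net_dim(struct dim *dim, const struct dim_sample *end_sample);

      /* caller: build the sample once on the stack, pass its address */
      struct dim_sample dim_sample = {};

      dim_update_sample(c->event_ctr, c->packets, c->bytes, &dim_sample);
      net_dim(&c->dim, &dim_sample);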
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Arthur Kiyanovski <akiyano@amazon.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Link: https://patch.msgid.link/20241031002326.3426181-2-csander@purestorage.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The E810 Lan On Motherboard (LOM) design is vendor specific. Intel
provides the reference design, but the final product design is up to the
vendor. For some cases, like Linux DPLL support, the static values
defined in the driver do not reflect the actual LOM design.
The current implementation of dpll pins causes a crash on probe of the
ice driver for such DPLL-enabled E810 LOM designs:
WARNING: (...) at drivers/dpll/dpll_core.c:495 dpll_pin_get+0x2c4/0x330
...
Call Trace:
<TASK>
? __warn+0x83/0x130
? dpll_pin_get+0x2c4/0x330
? report_bug+0x1b7/0x1d0
? handle_bug+0x42/0x70
? exc_invalid_op+0x18/0x70
? asm_exc_invalid_op+0x1a/0x20
? dpll_pin_get+0x117/0x330
? dpll_pin_get+0x2c4/0x330
? dpll_pin_get+0x117/0x330
ice_dpll_get_pins.isra.0+0x52/0xe0 [ice]
...
The number of dpll pins enabled by the LOM vendor is greater than what
is expected and defined in the driver for Intel-designed NICs, which
causes the crash.
Prevent the crash and allow generic pin initialization within the Linux
DPLL subsystem for DPLL-enabled E810 LOM designs.
A newly designed solution for the described issue will be based on
"per HW design" pin initialization. It requires pin information
dynamically acquired from the firmware, is already in progress, and is
planned for next-tree only.
Fixes: d7999f5ea6 ("ice: implement dpll interface to control cgu")
Reviewed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Since commit fff292b47a ("ice: add VF representors one by one")
ice_eswitch_configure() is not used anymore.
Commit 1b8f15b64a ("ice: refactor filter functions") removed
ice_vsi_cfg_mac_fltr() but leave declaration.
Commit a24b4c6e9a ("ice: xsk: Do not convert to buff to frame for
XDP_TX") leave ice_xmit_xdp_buff() declaration.
Commit 7cab44f1c3 ("ice: Introduce ETH56G PHY model for E825C
products") declared ice_phy_cfg_{rx,tx}_offset_eth56g(),
commit a1ffafb0b4 ("ice: Support configuring the device to Double
VLAN Mode") declared ice_pkg_buf_get_free_space(), and
commit 8a3a565ff2 ("ice: add admin commands to access cgu
configuration") declared ice_is_pca9575_present(), but all these never
be implemented.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The max_frame and rx_buf_len fields of the VSI set the maximum frame size
for packets on the wire, and configure the size of the Rx buffer. In the
hardware, these are per-queue configurations. Most VSI types use a simple
method to determine the size of the buffers for all queues.
However, VFs may potentially configure different values for each queue.
While the Linux iAVF driver does not do this, it is allowed by the virtchnl
interface.
The current virtchnl code simply sets the per-VSI fields in between calls to
ice_vsi_cfg_single_rxq(). This technically works, as these fields are only
ever used when programming the Rx ring, and otherwise not checked again.
However, it is confusing to maintain.
The Rx ring also already has an rx_buf_len field in order to access the
buffer length in the hotpath. It also has extra unused bytes in the ring
structure which we can make use of to store the maximum frame size.
Drop the VSI max_frame and rx_buf_len fields. Add max_frame to the Rx ring,
and slightly re-order rx_buf_len to better fit into the gaps in the
structure layout.
Change the ice_vsi_cfg_frame_size function so that it writes to the ring
fields. Call this function once per ring in ice_vsi_cfg_rxqs(). This is
done instead of calling it inside ice_vsi_cfg_rxq(), because
ice_vsi_cfg_rxq() is called in the virtchnl flow where the max_frame and
rx_buf_len have already been configured.
Change the accesses for rx_buf_len and max_frame to all point to the ring
structure. This has the added benefit that ice_vsi_cfg_rxq() no longer has
the surprise side effect of updating ring->rx_buf_len based on the VSI
field.
Update the virtchnl ice_vc_cfg_qs_msg() function to set the ring values
directly, and drop references to the removed VSI fields.
This now makes the VF logic clear, as the ring fields are obviously
per-queue. This reduces the required cognitive load when reasoning about
this logic.
Note that removing the VSI fields does leave a 4 byte gap, but the ice_vsi
structure has many gaps, and its layout is not as critical in the hot path.
The structure may benefit from a more thorough repacking, but no attempt
was made in this change.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The ice_vc_cfg_qs_msg() function is used to configure VF queues in response
to a VIRTCHNL_OP_CONFIG_VSI_QUEUES command.
The virtchnl command contains an array of queue pair data for configuring
Tx and Rx queues. This data includes a queue ID. When configuring the
queues, the driver generally uses this queue ID to determine which Tx and
Rx ring to program. However, a handful of places use the index into the
queue pair data from the VF. While most VF implementations appear to send
this data in order, it is not mandated by virtchnl, and it is not
verified that the queue pair data comes in order.
Fix the driver to consistently use the q_idx field instead of the 'i'
iterator value when accessing the rings. For the Rx case, introduce a local
ring variable to keep lines short.
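A hedged sketch of the pattern after the fix, using the virtchnl queue
pair structures per their existing definitions (surrounding loop
elided):

      /* index rings by the VF-provided queue ID, not the iterator */
      u16 q_idx = qpi->txq.queue_id;
      struct ice_tx_ring *tx_ring = vsi->tx_rings[q_idx];
      struct ice_rx_ring *ring = vsi->rx_rings[q_idx]; /* Rx local var */

      ring->rx_buf_len = qpi->rxq.databuffer_size;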
Fixes: 7ad15440ac ("ice: Refactor VIRTCHNL_OP_CONFIG_VSI_QUEUES handling")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>