A mixed mode decoder is programmed with device physical addresses
that span both ram and pmem partitions of a memdev.
Linux does not support mixed mode decoders. The driver rejects
sysfs writes that try to set a decoder mode to mixed, and it also
rejects allocating a resource that is not wholly contained in either
the pmem or the ram partition of a memdev. In short, the CXL region
driver will never create regions with mixed mode decoders, but the
BIOS could.
If the kernel driver encounters a mixed mode decoder, it fails to
enable the region and emits a dev_dbg() message.
A dev_dbg() is not noisy enough in this case. Change the message
to be a dev_warn() that explicitly says mixed mode is not supported.
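The change amounts to promoting the log level at the point where region
assembly gives up on the decoder. A minimal sketch of the intent (the
message text and exact call site are assumptions, not the verbatim
diff):

    /* was a quiet dev_dbg(); make the rejection visible */
    dev_warn(dev, "decoder%d.%d: mixed mode not supported\n",
             port->id, cxled->cxld.id);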
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/20230218013834.31237-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
The helper, cxl_dpa_resource_start(), snapshots the dpa-address of an
endpoint-decoder after acquiring the cxl_dpa_rwsem. However, it is
sufficient to assert that cxl_dpa_rwsem is held rather than acquire it
in the helper. Acquiring it there triggers multiple lockdep reports:
1/ Tracing callbacks run in an atomic context that cannot acquire sleeping
locks:
BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1525
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1288, name: bash
preempt_count: 2, expected: 0
RCU nest depth: 0, expected: 0
[..]
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
Call Trace:
<TASK>
dump_stack_lvl+0x71/0x90
__might_resched+0x1b2/0x2c0
down_read+0x1a/0x190
cxl_dpa_resource_start+0x15/0x50 [cxl_core]
cxl_trace_hpa+0x122/0x300 [cxl_core]
trace_event_raw_event_cxl_poison+0x1c9/0x2d0 [cxl_core]
2/ The rwsem is already held in the inject poison path:
WARNING: possible recursive locking detected
6.7.0-rc2+ #12 Tainted: G W OE N
--------------------------------------------
bash/1288 is trying to acquire lock:
ffffffffc05f73d0 (cxl_dpa_rwsem){++++}-{3:3}, at: cxl_dpa_resource_start+0x15/0x50 [cxl_core]
but task is already holding lock:
ffffffffc05f73d0 (cxl_dpa_rwsem){++++}-{3:3}, at: cxl_inject_poison+0x7d/0x1e0 [cxl_core]
[..]
Call Trace:
<TASK>
dump_stack_lvl+0x71/0x90
__might_resched+0x1b2/0x2c0
down_read+0x1a/0x190
cxl_dpa_resource_start+0x15/0x50 [cxl_core]
cxl_trace_hpa+0x122/0x300 [cxl_core]
trace_event_raw_event_cxl_poison+0x1c9/0x2d0 [cxl_core]
__traceiter_cxl_poison+0x5c/0x80 [cxl_core]
cxl_inject_poison+0x1bc/0x1e0 [cxl_core]
This appears to have been an issue since the initial implementation and
was uncovered by the new cxl-poison.sh test [1]. That test now passes
with these changes.
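For reference, the shape of the fix in the helper (sketch of the
resulting function; member names per the CXL core):

    resource_size_t cxl_dpa_resource_start(struct cxl_endpoint_decoder *cxled)
    {
            resource_size_t base = -1;

            /* callers must already hold cxl_dpa_rwsem; do not take it here */
            lockdep_assert_held(&cxl_dpa_rwsem);
            if (cxled->dpa_res)
                    base = cxled->dpa_res->start;

            return base;
    }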
Fixes: 28a3ae4ff6 ("cxl/trace: Add an HPA to cxl_poison trace events")
Link: http://lore.kernel.org/r/e4f2716646918135ddbadf4146e92abb659de734.1700615159.git.alison.schofield@intel.com [1]
Cc: <stable@vger.kernel.org>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The new helper "cxl_num_decoders_committed()" added a lockdep assertion
to validate that port->commit_end is protected against modification.
That assertion fires in init_hdm_decoder() where it is initializing
port->commit_end. Given that init_hdm_decoder() both reads and writes
that property, it ostensibly needs the lock.
In practice, CXL decoder commit rules (must commit in order) and the
in-order discovery of device decoders makes the manipulation of
->commit_end in init_hdm_decoder() safe. However, rather than rely on
the subtle rules of CXL hardware, just make the implementation obviously
correct from a software perspective.
The Fixes: tag is only for cleaning up a lockdep splat, there is no
functional issue addressed by this fix.
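For reference, the helper's assertion looks roughly like this (sketch;
the exact body is approximate):

    int cxl_num_decoders_committed(struct cxl_port *port)
    {
            lockdep_assert_held(&cxl_region_rwsem);
            return port->commit_end + 1;
    }

...so init_hdm_decoder() must either run with cxl_region_rwsem held for
write, or manipulate ->commit_end without going through the helper.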
Fixes: 458ba8189c ("cxl: Add cxl_decoders_committed() helper")
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/170025232811.2147250.16376901801315194121.stgit@djiang5-mobl3
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Restricted CXL Host (RCH) Error Handling undoes the topology munging of
CXL 1.1 to enable some AER recovery, and lands some base infrastructure
for handling Root-Complex-Event-Collectors (RCECs) with CXL. Include
this long-running series, finally, for v6.7.
Now that the Component Register mappings are stored, use them to
enable and map the HDM decoder capabilities. The Component Registers
do not need to be probed again for this; remove the probing code.
The HDM capability applies to Endpoints, USPs, and VH Host Bridges. An
Endpoint's component register mappings are located in the cxlds, while
the other ports carry them in the port structure. Duplicate
cxlds->reg_map into port->reg_map for endpoint ports.
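A sketch of the duplication for endpoint ports (exact placement in the
endpoint port setup path is an assumption):

    /* endpoint ports: reuse the memdev's component register mapping */
    struct cxl_dev_state *cxlds = cxlmd->cxlds;

    port->reg_map = cxlds->reg_map;
    port->reg_map.host = &port->dev;    /* devm mappings belong to the port */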
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
[rework to drop cxl_port_get_comp_map()]
Link: https://lore.kernel.org/r/20231018171713.1883517-8-rrichter@amd.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The primary role of @dev is to host the mappings for devm operations.
@dev is too ambiguous as a name. I.e. when does @dev refer to the
'struct device *' instance that the registers belong to, and when does
@dev refer to the 'struct device *' instance hosting the mapping for
devm operations?
Clarify the role of @dev in cxl_register_map by renaming it to @host.
Also, rename local variables to 'host' where map->host is used.
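After the rename the structure reads roughly as follows (members beyond
@host are listed from memory and may differ):

    struct cxl_register_map {
            struct device *host;    /* device hosting the devm mapping lifetime */
            void __iomem *base;
            resource_size_t resource;
            resource_size_t max_size;
            u8 reg_type;
            union {
                    struct cxl_component_reg_map component_map;
                    struct cxl_device_reg_map device_map;
            };
    };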
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20231018171713.1883517-3-rrichter@amd.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The sanitize operation is destructive and the expectation is that the
device is unmapped while in progress. The current implementation does a
lockless check for decoders being active, but then does nothing to
prevent decoders from racing to be committed. Introduce state tracking
to resolve this race.
This incidentally stops unprivileged userspace from triggering MMIO
read cycles by spinning on reads of the 'security/state' attribute,
which at a minimum is a waste since the kernel state machine can cache
the completion result.
Lastly cxl_mem_sanitize() was mistakenly marked EXPORT_SYMBOL() in the
original implementation, but an export was never required.
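A heavily simplified sketch of the state tracking (names are
illustrative, not the exact implementation):

    /* decoder commit path, under the region lock */
    if (cxlds->security.sanitize_active)
            return -EBUSY;          /* refuse to map while sanitize runs */

    /* sanitize submission path, under the same lock */
    if (port->commit_end >= 0)
            return -EBUSY;          /* refuse to sanitize a mapped device */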
Fixes: 0c36b6ad43 ("cxl/mbox: Add sanitization handling machinery")
Cc: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pick up the first half of the RCH error handling series. The back half
needs some fixups for test regressions. There are small conflicts with
the PMU work around the register enumeration and setup helpers.
The corresponding device of a register mapping is used for devm
operations and logging. For operations with struct cxl_register_map
the device needs to be tracked separately. To simplify the involved
function interfaces, add @dev to cxl_register_map.
While at it, also reorder the function arguments of cxl_map_device_regs()
and cxl_map_component_regs() so that the @cxl_register_map object comes
first.
As a result, a number of functions can now be used with a
@cxl_register_map object.
This patch is in preparation for reworking the component register setup
code.
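For example, the component register mapper ends up taking the map
object first, roughly (prototypes approximate):

    int cxl_map_component_regs(const struct cxl_register_map *map,
                               struct cxl_component_regs *regs,
                               unsigned long map_mask);

    int cxl_map_device_regs(const struct cxl_register_map *map,
                            struct cxl_device_regs *regs);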
Signed-off-by: Robert Richter <rrichter@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20230622205523.85375-7-terry.bowman@amd.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The CXL specification mandates that 4-byte registers must be accessed
with 4-byte access cycles. CXL 3.0 8.2.3 "Component Register Layout and
Definition" states that the behavior is undefined if (2) 32-bit
registers are accessed as an 8-byte quantity. It turns out that at least
one hardware implementation is sensitive to this in practice. The @size
variable results in zero with:
size = readq(hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(which));
...and the correct size with:
lo = readl(hdm + CXL_HDM_DECODER0_SIZE_LOW_OFFSET(which));
hi = readl(hdm + CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(which));
size = (hi << 32) + lo;
Fixes: d17d0540a0 ("cxl/core/hdm: Add CXL standard decoder enumeration to the core")
Cc: <stable@vger.kernel.org>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://lore.kernel.org/r/168149844056.792294.8224490474529733736.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
One motivation for mapping range registers to decoder objects is
to use those settings for region autodiscovery.
The need to map a region for devices programmed to use range registers
is especially urgent now that the kernel no longer routes "Soft
Reserved" ranges in the memory map to device-dax by default. The CXL
memory range loses all access mechanisms.
Complete the implementation by marking the DPA reservation and setting
the endpoint-decoder state to signal autodiscovery. Note that the
default settings of ways=1 and granularity=4096 set in cxl_decode_init()
do not need to be updated.
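The completion amounts to reserving the DPA the range register covers
and flagging the decoder for autodiscovery, roughly (sketch, not the
verbatim diff):

    /* claim the DPA backing this emulated decoder */
    rc = devm_cxl_dpa_reserve(cxled, *dpa_base, len, 0);
    if (rc)
            return rc;
    *dpa_base += len;

    /* let region autodiscovery pick this decoder up */
    cxled->state = CXL_DECODER_STATE_AUTO;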
Fixes: 09d09e04d2 ("cxl/dax: Create dax devices for CXL RAM regions")
Tested-by: Dave Jiang <dave.jiang@intel.com>
Tested-by: Gregory Price <gregory.price@memverge.com>
Link: https://lore.kernel.org/r/168012575521.221280.14177293493678527326.stgit@dwillia2-xfh.jf.intel.com
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Recall that range register emulation seeks to treat the 2 potential
range registers as Linux CXL "decoder" objects. The number of range
registers can be 1 or 2, while HDM decoder ranges can include more than
2.
Be careful not to confuse the DVSEC range count with the HDM capability
decoder count. Commit to range register emulation earlier in
devm_cxl_setup_hdm(). Otherwise, a device with more HDM decoders than
range registers can set @cxlhdm->decoder_count to an invalid value.
Avoid introducing a forward declaration by just moving the definition of
should_emulate_decoders() earlier in the file. should_emulate_decoders()
is unchanged.
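The early commit point in devm_cxl_setup_hdm() then looks roughly like
this (sketch; message text and surrounding flow approximate):

    /* decide on range register emulation before trusting decoder_count */
    if (should_emulate_decoders(info)) {
            dev_dbg(dev, "Fallback map %d range register%s\n", info->ranges,
                    info->ranges > 1 ? "s" : "");
            cxlhdm->decoder_count = info->ranges;
    }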
Tested-by: Dave Jiang <dave.jiang@intel.com>
Fixes: d7a2153762 ("cxl/hdm: Add emulation when HDM decoders are not committed")
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/168012574932.221280.15944705098679646436.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pick up the CXL DVSEC range register emulation for v6.3, and resolve
conflicts with the cxl_port_probe() split (from for-6.3/cxl-ram-region)
and event handling (from for-6.3/cxl-events).
Region autodiscovery is an asynchronous state machine advanced by
cxl_port_probe(). After the decoders on an endpoint port are enumerated
they are scanned for actively enabled instances. Each active decoder is
flagged for auto-assembly (CXL_DECODER_F_AUTO) and attached to a region.
If a region does not already exist for the address range setting of the
decoder, one is created. That creation process may race with other
decoders of the same region being discovered since cxl_port_probe() is
asynchronous. A new 'struct cxl_root_decoder' lock, @range_lock, is
introduced to mitigate that race.
Once all decoders have arrived, "p->nr_targets == p->interleave_ways",
they are sorted by their relative decode position. The sort algorithm
involves finding the point in the cxl_port topology where one leg of the
decode leads to deviceA and the other deviceB. At that point in the
topology the target order in the 'struct cxl_switch_decoder' indicates
the relative position of those endpoint decoders in the region.
From that point the region goes through the same setup and validation
steps as user-created regions, but instead of programming the decoders
it validates that the driver would have written the same values to the
decoders as were already present.
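A sketch of the racing-decoder mitigation, serializing the region
lookup-or-create per root decoder (helper names like
match_region_by_range() and construct_region() are assumptions):

    mutex_lock(&cxlrd->range_lock);
    region_dev = device_find_child(&cxlrd->cxlsd.cxld.dev, hpa,
                                   match_region_by_range);
    if (!region_dev)
            cxlr = construct_region(cxlrd, cxled);
    else
            cxlr = to_cxl_region(region_dev);
    mutex_unlock(&cxlrd->range_lock);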
Tested-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/167601999958.1924368.9366954455835735048.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The RAS Capability Structure is a CXL Component register capability
block. Unlike the HDM Decoder Capability, it will be referenced by the
cxl_pci driver in response to PCIe AER events. Due to this it is no
longer the case that cxl_map_component_regs() can assume that it should
map all component registers. Plumb a bitmask of capability ids to map
through cxl_map_component_regs().
For symmetry cxl_probe_device_regs() is updated to populate @id in
'struct cxl_reg_map' even though cxl_map_device_regs() does not have a
need to map a subset of the device registers per caller.
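Callers then name just the capabilities they need, e.g. (call shape
approximate for the interfaces of this era):

    /* cxl_pci: map only the RAS capability for AER handling */
    rc = cxl_map_component_regs(&pdev->dev, &cxlds->regs.component, &map,
                                BIT(CXL_CM_CAP_CAP_ID_RAS));

    /* port enumeration: map only the HDM decoder capability */
    rc = cxl_map_component_regs(&port->dev, &regs, &map,
                                BIT(CXL_CM_CAP_CAP_ID_HDM));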
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/166974412214.1608150.11487843455070795378.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Vishal notes that when attempting to define a second pmem region on a
device the DPA allocation fails with a message of the form:
decoder11.1: failed to reserve skipped space
Recall that the skip setting is used when there is a pmem allocation in
the presence of free ram DPA space. The first pmem allocation skips over
the free ram and subsequent pmem allocations do not require a skip. The
bug is that a skip is still attempted and the DPA reservation code
flags the double skip allocation conflict.
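The fix makes the skip calculation aware of prior pmem allocations,
roughly (sketch of the resulting allocator logic; variable names
assumed):

    if (cxled->mode == CXL_DECODER_PMEM) {
            resource_size_t skip_start, skip_end;

            start = free_pmem_start;
            avail = cxlds->pmem_res.end - start + 1;
            skip_start = free_ram_start;

            /* a previous pmem allocation already handled the ram skip */
            if (cxlds->pmem_res.child &&
                skip_start == cxlds->pmem_res.child->start)
                    skip_end = skip_start - 1;
            else
                    skip_end = start - 1;
            skip = skip_end - skip_start + 1;
    }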
Fixes: cf880423b6 ("cxl/hdm: Add support for allocating DPA to an endpoint decoder")
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/165973754730.1558392.15466392461645857658.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
After adding support for emulating platform-firmware-established DPA
reservations, the cxl-topology.sh [1] unit test started crashing with
the following signature:
general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP
[..]
RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
[..]
Call Trace:
<TASK>
__cxl_dpa_release+0x1b/0xd0 [cxl_core]
cxl_dpa_release+0x1d/0x30 [cxl_core]
release_nodes+0x63/0x90
devres_release_all+0x88/0xc0
...i.e. a use after free of a 'struct cxl_endpoint_decoder' object. This
results from the ordering of init_hdm_decoder() before add_hdm_decoder()
where, at release time, the decoder is unregistered and released before
the DPA reservation.
Fix this by extending the life of the object until all DPA reservations
have been released which also preserves platform decoder settings being
settled by the time the decoder is published in sysfs (KOBJ_ADD time).
Note that the @len == 0 case in __cxl_dpa_reserve() is avoided in
practice as this function is only called for committed decoders and new
non-zero DPA allocations.
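The lifetime extension pairs a device reference with the reservation,
roughly (sketch; exact placement in __cxl_dpa_reserve() /
__cxl_dpa_release() is approximate):

    /* __cxl_dpa_reserve(): pin the decoder while it holds DPA */
    cxled->dpa_res = res;
    cxled->skip = skipped;
    get_device(&cxled->cxld.dev);

    /* __cxl_dpa_release(): drop the pin once the reservation is gone */
    cxled->dpa_res = NULL;
    cxled->skip = 0;
    put_device(&cxled->cxld.dev);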
Link: https://github.com/pmem/ndctl/blob/pending/test/cxl-topology.sh [1]
Fixes: 9c57cde0dc ("cxl/hdm: Enumerate allocated DPA")
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lore.kernel.org/r/165896020625.3546860.12390103413706292760.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
After all the soft validation of the region has completed, convey the
region configuration to hardware while being careful to commit decoders
in specification mandated order. In addition to programming the endpoint
decoder base-address, interleave ways and granularity, the switch
decoder target lists are also established.
While the kernel can enforce spec-mandated commit order, it can not
enforce spec-mandated reset order. For example, the kernel can't stop
someone from removing an endpoint device that is occupying decoderN in a
switch decoder where decoderN+1 is also committed. To reset decoderN,
decoderN+1 must be torn down first. That "tear down the world"
implementation is saved for a follow-on patch.
Callback operations are provided for the 'commit' and 'reset'
operations. While those callbacks may prove useful for CXL accelerators
(Type-2 devices with memory) the primary motivation is to enable a
simple way for cxl_test to intercept those operations.
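The callbacks hang off the decoder object itself, roughly (prototypes
approximate):

    struct cxl_decoder {
            /* existing members elided */
            int (*commit)(struct cxl_decoder *cxld);
            int (*reset)(struct cxl_decoder *cxld);
    };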
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/165784338418.1758207.14659830845389904356.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reduce the complexity and the overhead of walking the topology to
determine endpoint connectivity to root decoder interleave
configurations.
Note that cxl_detach_ep(), after it determines that the last @ep has
departed and decides to delete the port, now needs to walk the dport
array with the device_lock() held to remove entries. Previously
list_splice_init() could be used to atomically delete all dport entries
at once and then perform entry tear-down outside the lock. There is no
list_splice_init() equivalent for the xarray.
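For example, reaping dports now walks the xarray with the port's
device_lock() held (sketch, assuming helper names along these lines):

    static void reap_dports(struct cxl_port *port)
    {
            struct cxl_dport *dport;
            unsigned long index;

            device_lock_assert(&port->dev);

            xa_for_each(&port->dports, index, dport) {
                    devm_release_action(&port->dev, cxl_dport_unlink, dport);
                    devm_release_action(&port->dev, cxl_dport_remove, dport);
                    devm_kfree(&port->dev, dport);
            }
    }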
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/165784331647.1758207.6345820282285119339.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The region provisioning flow will roughly follow a sequence of:
1/ Allocate DPA to a set of decoders
2/ Allocate HPA to a region
3/ Associate decoders with a region and validate that the DPA allocations
and topologies match the parameters of the region.
For now, this change (step 1) arranges for DPA capacity to be allocated
and deleted from non-committed decoders based on the decoder's mode /
partition selection. Capacity is allocated from the lowest DPA in the
partition and any 'pmem' allocation blocks out all remaining ram
capacity in its 'skip' setting. DPA allocations are enforced in decoder
instance order. I.e. decoder N + 1 always starts at a higher DPA than
instance N, and deleting allocations must proceed from the
highest-instance allocated decoder to the lowest.
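The instance-order rule reduces to a check against the port's
allocation high-water mark, roughly (sketch; names per the CXL core):

    /* allocations must target the decoder after the last allocated one */
    if (port->hdm_end + 1 != cxled->cxld.id) {
            dev_dbg(dev, "decoder%d.%d: expected decoder%d.%d\n", port->id,
                    cxled->cxld.id, port->id, port->hdm_end + 1);
            return -EBUSY;
    }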
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/165784329399.1758207.16732038126938632700.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Recall that the Device Physical Address (DPA) space of a CXL Memory
Expander is potentially partitioned into a volatile and persistent
portion. A decoder maps a Host Physical Address (HPA) range to a DPA
range and that translation depends on the value of all previous (lower
instance number) decoders before the current one.
In preparation for allowing dynamic provisioning of regions, decoders
need an ABI to indicate which DPA partition a decoder targets. This ABI
needs to be prepared for the possibility that some other agent committed
and locked a decoder that spans the partition boundary.
Add 'decoderX.Y/mode' to endpoint decoders that indicates which
partition 'ram' / 'pmem' the decoder targets, or 'mixed' if the decoder
currently spans the partition boundary.
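The attribute simply reports which partition the decoder's DPA span
resolves to, roughly:

    static ssize_t mode_show(struct device *dev, struct device_attribute *attr,
                             char *buf)
    {
            struct cxl_endpoint_decoder *cxled = to_cxl_endpoint_decoder(dev);

            switch (cxled->mode) {
            case CXL_DECODER_RAM:
                    return sysfs_emit(buf, "ram\n");
            case CXL_DECODER_PMEM:
                    return sysfs_emit(buf, "pmem\n");
            case CXL_DECODER_MIXED:
            default:
                    return sysfs_emit(buf, "mixed\n");
            }
    }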
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/165603881967.551046.6007594190951596439.stgit@dwillia2-xfh
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for provisioning CXL regions, add accounting for the DPA
space consumed by existing regions / decoders. Recall, a CXL region is a
memory range composed of one or more endpoint devices, each contributing
a mapping of its DPA into HPA space through a decoder.
Record the DPA ranges covered by committed decoders at initial probe of
endpoint ports relative to a per-device resource tree of the DPA type
(pmem or volatile-ram).
The cxl_dpa_rwsem semaphore is introduced to globally synchronize DPA
state across all endpoints and their decoders at once. The vast majority
of DPA operations are reads as region creation is expected to be as rare
as disk partitioning and volume creation. The device_lock() for this
synchronization is specifically avoided for concern of entangling with
sysfs attribute removal.
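Recording a committed decoder's span then reduces to claiming it from
the device's DPA resource tree, roughly (sketch; assumes the
cxlds->dpa_res parent resource):

    /* with cxl_dpa_rwsem held for write */
    res = __request_region(&cxlds->dpa_res, base, len,
                           dev_name(&cxled->cxld.dev), 0);
    if (!res)
            return -EBUSY;
    cxled->dpa_res = res;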
Co-developed-by: Ben Widawsky <bwidawsk@kernel.org>
Signed-off-by: Ben Widawsky <bwidawsk@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/165784327682.1758207.7914919426043855876.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Previously the target routing specifics of switch decoders and platform
CXL window resource tracking of root decoders were factored out of
'struct cxl_decoder'. While switch decoders translate from SPA to
downstream ports, endpoint decoders translate from SPA to DPA.
This patch, 3 of 3, adds a 'struct cxl_endpoint_decoder' that tracks an
endpoint-specific Device Physical Address (DPA) resource. For now this
just defines ->dpa_res, a follow-on patch will handle requesting DPA
resource ranges from a device-DPA resource tree.
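At this stage the new type is intentionally small, roughly (later
patches add more state such as ->skip and ->mode):

    struct cxl_endpoint_decoder {
            struct cxl_decoder cxld;
            struct resource *dpa_res;
    };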
Co-developed-by: Ben Widawsky <bwidawsk@kernel.org>
Signed-off-by: Ben Widawsky <bwidawsk@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/165784327088.1758207.15502834501671201192.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Unless and until accelerator (type-2) drivers start registering for
CXL.mem mapping services from the CXL subsystem core, initialize idle
HDM decoders to the "expander" type. I.e. the only CXL devices using the
CXL core presently are those implementing the CXL 2.0 Type-3 memory
expander device class code that the cxl_pci driver claims.
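In code this is a one-line default at decoder initialization time
(enum name per the CXL core of that era; sketch):

    /* default idle decoders to Type-3 memory expander semantics */
    cxld->target_type = CXL_DECODER_EXPANDER;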
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20220624041950.559155-6-dan.j.williams@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>