Commit Graph

15974 Commits

Author SHA1 Message Date
Christian König
7ccde2e6c0 drm/amdgpu: revert "use CPU for page table update if SDMA is unavailable"
That is clearly not something we should do upstream. The SDMA is
mandatory for the driver to work correctly.

We could do this for emulation and bringup, but in those cases the
engineer should probably enabled CPU based updates manually.

This reverts commit 62eefd10ac.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:55:06 -04:00
Dan Carpenter
27f9dcb9cc drm/amdgpu/mes11: Indent an if statment
Indent the "break" statement one more tab.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:55:05 -04:00
Ramesh Errabolu
01be2b62c0 drm/amdgpu: Surface svm_default_granularity, a RW module parameter
Enables users to update SVM's default granularity, used in
buffer migration and handling of recoverable page faults.
Param value is set in terms of log(numPages(buffer)),
e.g. 9 for a 2 MIB buffer

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:55:05 -04:00
Jesse Zhang
e8397d327e drm/amdgpu: fix queue reset issue by mmio
Initialize the queue type before resetting the queue using mmio.

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:54:54 -04:00
Srinivasan Shanmugam
559a285816 drm/amdgpu: Replace 'amdgpu_job_submit_direct' with 'drm_sched_entity' in cleaner shader
This commit replaces the use of amdgpu_job_submit_direct which submits
the job to the ring directly, with drm_sched_entity in the cleaner
shader job submission process. The change allows the GPU scheduler to
manage the cleaner shader job.

- The job is then submitted to the GPU using the
  drm_sched_entity_push_job function, which allows the GPU scheduler to
  manage the job.

This change improves the reliability of the cleaner shader job
submission process by leveraging the capabilities of the GPU scheduler.

Fixes: d361ad5d2f ("drm/amdgpu: Add sysfs interface for running cleaner shader")
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:42:33 -04:00
Srinivasan Shanmugam
2578487ebe drm/amdgpu/: Add missing kdoc entry in amdgpu_vm_handle_fault function
This commit adds a description for the 'ts' parameter in the
amdgpu_vm_handle_fault function's comment block.

Fixes the below with gcc W=1:
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2781: warning: Function parameter or struct member 'ts' not described in 'amdgpu_vm_handle_fault'

Cc: Xiaogang.Chen <Xiaogang.Chen@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202408251419.vgZHg3GV-lkp@intel.com/
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Xiaogang Chen <Xiaogang.Chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:42:05 -04:00
Lijo Lazar
4481df364d drm/amdgpu: Normalize reg offsets on JPEG v4.0.3
On VFs and SOCs with GC 9.4.4, VCN RRMT is disabled.
Only local register offsets should be used on JPEG v4.0.3 as they cannot
handle remote access to other AIDs. Since only local offsets are used,
the special write to MCM_ADDR register is no longer needed.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:40:47 -04:00
Li Zetao
760e3c8b32 drm/amdgpu: use clamp() in amdgpu_vm_adjust_size()
When it needs to get a value within a certain interval, using clamp()
makes the code easier to understand than min(max()).

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:38:57 -04:00
Li Zetao
6fbbb660b1 drm/amd: use clamp() in amdgpu_pll_get_fb_ref_div()
When it needs to get a value within a certain interval, using clamp()
makes the code easier to understand than min(max()).

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:38:53 -04:00
Peng Liu
2c7795e245 drm/amdgpu: enable gfxoff quirk on HP 705G4
Enabling gfxoff quirk results in perfectly usable
graphical user interface on HP 705G4 DM with R5 2400G.

Without the quirk, X server is completely unusable as
every few seconds there is gpu reset due to ring gfx timeout.

Signed-off-by: Peng Liu <liupeng01@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:38:45 -04:00
Peng Liu
0126c0ae11 drm/amdgpu: add raven1 gfxoff quirk
Fix screen corruption with openkylin.

Link: https://bbs.openkylin.top/t/topic/171497
Signed-off-by: Peng Liu <liupeng01@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:38:33 -04:00
Lang Yu
4453808d9e drm/amdgpu: fix invalid fence handling in amdgpu_vm_tlb_flush
CPU based update doesn't produce a fence, handle such cases properly.

Fixes: d8a3f0a034 ("drm/amdgpu: implement TLB flush fence")
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:36:50 -04:00
Christian König
4da5a95bf1 drm/amdgpu: re-work VM syncing
Rework how VM operations synchronize to submissions. Provide an
amdgpu_sync container to the backends instead of an reservation
object and fill in the amdgpu_sync object in the higher layers
of the code.

No intended functional change, just prepares for upcomming changes.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Friedrich Vock <friedrich.vock@gmx.de>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06 17:36:24 -04:00
Christian König
b2ef808786 drm/sched: add optional errno to drm_sched_start()
The current implementation of drm_sched_start uses a hardcoded
-ECANCELED to dispose of a job when the parent/hw fence is NULL.
This results in drm_sched_job_done being called with -ECANCELED for
each job with a NULL parent in the pending list, making it difficult
to distinguish between recovery methods, whether a queue reset or a
full GPU reset was used.

To improve this, we first try a soft recovery for timeout jobs and
use the error code -ENODATA. If soft recovery fails, we proceed with
a queue reset, where the error code remains -ENODATA for the job.
Finally, for a full GPU reset, we use error codes -ECANCELED or
-ETIME. This patch adds an error code parameter to drm_sched_start,
allowing us to differentiate between queue reset and GPU reset
failures. This enables user mode and test applications to validate
the expected correctness of the requested operation. After a
successful queue reset, the only way to continue normal operation is
to call drm_sched_job_done with the specific error code -ENODATA.

v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain
    and amdgpu_device_unlock_reset_domain to allow user mode to track
    the queue reset status and distinguish between queue reset and
    GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset
    and -ECANCELED or -ETIME for GPU reset, returned to
    amdgpu_cs_wait_ioctl.
v3: To meet the requirements, we introduce a new function
    drm_sched_start_ex with an additional parameter to set
    dma_fence_set_error, allowing us to handle the specific error
    codes appropriately and dispose of bad jobs with the selected
    error code depending on whether it was a queue reset or GPU reset.
v4: Alex suggested using a new name, drm_sched_start_with_recovery_error,
    which more accurately describes the function's purpose.
    Additionally, it was recommended to add documentation details
    about the new method.
v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex)
v6 (chk): rebase on upstream changes, cleanup the commit message,
          drop the new function again and update all callers,
          apply the errno also to scheduler fences with hw fences
v7 (chk): rebased

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826122541.85663-1-christian.koenig@amd.com
2024-09-06 18:05:52 +02:00
Dmitry Baryshkov
ca097d4d94 drm/display: split DSC helpers from DP helpers
Currently the DRM DSC functions are selected by the
DRM_DISPLAY_DP_HELPER Kconfig symbol. This is not optimal, since the DSI
code (both panel and host drivers) end up selecting the seemingly
irrelevant DP helpers. Split the DSC code to be guarded by the separate
DRM_DISPLAY_DSC_HELPER Kconfig symbol.

Reviewed-by: Jessica Zhang <quic_jesszhan@quicinc.com>
Reviewed-by: Marijn Suijten <marijn.suijten@somainline.org>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> #i915
Acked-by: Maxime Ripard <mripard@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240704-panel-sw43408-fix-v6-1-3ea1c94bbb9b@linaro.org
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2024-09-03 00:13:44 +03:00
Alex Deucher
4de34b0478 drm/amdgpu: always allocate cleared VRAM for GEM allocations
This adds allocation latency, but aligns better with user
expectations.  The latency should improve with the drm buddy
clearing patches that Arun has been working on.

In addition this fixes the high CPU spikes seen when doing
wipe on release.

v2: always set AMDGPU_GEM_CREATE_VRAM_CLEARED (Christian)

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3528
Fixes: a68c7eaa7a ("drm/amdgpu: Enable clear page functionality")
Acked-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Cc: Christian König <christian.koenig@amd.com>
(cherry picked from commit 6c0a7c3c69)
Cc: stable@vger.kernel.org # 6.10.x
2024-09-02 13:08:51 -04:00
Jack Xiao
34c36a77f4 drm/amdgpu/mes: add mes mapping legacy queue switch
For mes11 old firmware has issue to map legacy queue,
add a flag to switch mes to map legacy queue.

Fixes: f9d8c5c785 ("drm/amdgpu/gfx: enable mes to map legacy queue support")
Reported-by: Andrew Worsley <amworsley@gmail.com>
Link: https://lists.freedesktop.org/archives/amd-gfx/2024-August/112773.html
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 52491d97aa)
2024-09-02 13:05:39 -04:00
Alex Deucher
ead60e9c4e drm/amdgpu/gfx10: use rlc safe mode for soft recovery
Protect the MMIO access with safe mode.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:40 -04:00
Alex Deucher
3f2d35c325 drm/amdgpu/gfx11: use rlc safe mode for soft recovery
Protect the MMIO access with safe mode.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:38 -04:00
Alex Deucher
21818f39be drm/amdgpu/gfx12: use rlc safe mode for soft recovery
Protect the MMIO access with safe mode.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:34 -04:00
Alex Deucher
f8eee864ba drm/amdgpu/gfx12: use proper rlc safe mode helpers
Rather than open coding it for the queue reset.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:31 -04:00
Alex Deucher
01d05521f7 drm/amdgpu/gfx11: use proper rlc safe mode helpers
Rather than open coding it for the queue reset.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:28 -04:00
Alex Deucher
bcee4c3f89 drm/amdgpu/gfx10: use proper rlc safe mode helpers
Rather than open coding it for the queue reset.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:24 -04:00
Alex Deucher
1a1995b1dc drm/amdgpu/gfx12: per queue reset only on bare metal
It's not supported under SR-IOV at the moment.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:22 -04:00
Alex Deucher
01163079e1 drm/amdgpu/gfx11: per queue reset only on bare metal
It's not supported under SR-IOV at the moment.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:20 -04:00
Alex Deucher
4d5ddfa4b1 drm/amdgpu/gfx10: per queue reset only on bare metal
It's not supported under SR-IOV at the moment.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:16 -04:00
Jiadong Zhu
178ad0e280 drm/amdgpu/mes11: implement mmio queue reset for gfx11
Implement queue reset for graphic and compute queue.

v2: use amdgpu_gfx_rlc funcs to enter/exit safe mode.
v3: use gfx_v11_0_request_gfx_index_mutex()
v4: fix mutex handling

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:13 -04:00
Jiadong Zhu
01b4ae38e5 drm/amdgpu/mes: implement amdgpu_mes_reset_hw_queue_mmio
The reset_queue api could be used from kfd or kgd.

v2: add use_mmio parameter for mes_reset_legacy_queue.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:07 -04:00
Jiadong Zhu
8b2429a13f drm/amdgpu/mes: modify mes api for mmio queue reset
Add me/pipe/queue parameters for queue reset input.

v2: fix build (Alex)

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:03 -04:00
Alex Deucher
8fe4fde381 drm/amdgpu/gfx12: fallback to driver reset compute queue directly
Since the MES FW resets kernel compute queue always failed, this
may caused by the KIQ failed to process unmap KCQ. So, before MES
FW work properly that will fallback to driver executes dequeue and
resets SPI directly. Besides, rework the ring reset function and make
the busy ring type reset in each function respectively.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:41:01 -04:00
Alex Deucher
2480599890 drm/amdgpu/gfx12: add ring reset callbacks
Add ring reset callbacks for gfx and compute.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:58 -04:00
Alex Deucher
d1f2144321 drm/amdgpu/gfx10: rework reset sequence
To match other GFX IPs.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:55 -04:00
Jiadong Zhu
097af47d3c drm/amdgpu/gfx10: wait for reset done before remap
There is a racing condition that cp firmware modifies
MQD in reset sequence after driver updates it for
remapping. We have to wait till CP_HQD_ACTIVE becoming
false then remap the queue.

v2: fix KIQ locking (Alex)
v3: fix KIQ locking harder (Jessie)

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:47 -04:00
Jiadong Zhu
2f3806f781 drm/amdgpu/gfx10: remap queue after reset successfully
Kiq command unmap_queues only does the dequeueing action.
We have to map the queue back with clean mqd.

v2: fix up error handling (Alex)

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:44 -04:00
Alex Deucher
1741281a15 drm/amdgpu/gfx10: add ring reset callbacks
Add ring reset callbacks for gfx and compute.

v2: fix gfx handling
v3: wait for KIQ to complete

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:38 -04:00
Jiadong Zhu
a10c93931b drm/amdgpu/gfx11: wait for reset done before remap
There is a racing condition that cp firmware modifies
MQD in reset sequence after driver updates it for
remapping. We have to wait till CP_HQD_ACTIVE becoming
false then remap the queue.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:34 -04:00
Alex Deucher
7d8e9e65f2 drm/amdgpu/gfx11: rename gfx_v11_0_gfx_init_queue()
Rename to gfx_v11_0_kgq_init_queue() to better align with
the other naming in the file.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:31 -04:00
Prike Liang
072b441478 drm/amdgpu/gfx11: fallback to driver reset compute queue directly (v2)
Since the MES FW resets kernel compute queue always failed, this
may caused by the KIQ failed to process unmap KCQ. So, before MES
FW work properly that will fallback to driver executes dequeue and
resets SPI directly. Besides, rework the ring reset function and make
the busy ring type reset in each function respectively.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:40:21 -04:00
Alex Deucher
b3e9bfd866 drm/amdgpu/gfx11: add ring reset callbacks
Add ring reset callbacks for gfx and compute.

Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:35:12 -04:00
Prike Liang
ad17b124c3 drm/amdgpu/gfx9.4.3: Implement compute pipe reset
Implement the compute pipe reset, and the driver will
fallback to pipe reset when queue reset fails.
The pipe reset only deactivates the queue which is
scheduled in the pipe, and meanwhile the MEC pipe
will be reset to the firmware _start pointer. So,
it seems pipe reset will cost more cycles than the
queue reset; therefore, the driver tries to recover
by doing queue reset first.

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-02 11:32:32 -04:00
Christian Brauner
641bb4394f fs: move FMODE_UNSIGNED_OFFSET to fop_flags
This is another flag that is statically set and doesn't need to use up
an FMODE_* bit. Move it to ->fop_flags and free up another FMODE_* bit.

(1) mem_open() used from proc_mem_operations
(2) adi_open() used from adi_fops
(3) drm_open_helper():
    (3.1) accel_open() used from DRM_ACCEL_FOPS
    (3.2) drm_open() used from
    (3.2.1) amdgpu_driver_kms_fops
    (3.2.2) psb_gem_fops
    (3.2.3) i915_driver_fops
    (3.2.4) nouveau_driver_fops
    (3.2.5) panthor_drm_driver_fops
    (3.2.6) radeon_driver_kms_fops
    (3.2.7) tegra_drm_fops
    (3.2.8) vmwgfx_driver_fops
    (3.2.9) xe_driver_fops
    (3.2.10) DRM_GEM_FOPS
    (3.2.11) DEFINE_DRM_GEM_DMA_FOPS
(4) struct memdev sets fmode flags based on type of device opened. For
    devices using struct mem_fops unsigned offset is used.

Mark all these file operations as FOP_UNSIGNED_OFFSET and add asserts
into the open helper to ensure that the flag is always set.

Link: https://lore.kernel.org/r/20240809-work-fop_unsigned-v1-1-658e054d893e@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30 08:22:36 +02:00
Alex Deucher
6c0a7c3c69 drm/amdgpu: always allocate cleared VRAM for GEM allocations
This adds allocation latency, but aligns better with user
expectations.  The latency should improve with the drm buddy
clearing patches that Arun has been working on.

In addition this fixes the high CPU spikes seen when doing
wipe on release.

v2: always set AMDGPU_GEM_CREATE_VRAM_CLEARED (Christian)

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3528
Fixes: a68c7eaa7a ("drm/amdgpu: Enable clear page functionality")
Acked-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Cc: Christian König <christian.koenig@amd.com>
2024-08-29 14:56:37 -04:00
Jack Xiao
52491d97aa drm/amdgpu/mes: add mes mapping legacy queue switch
For mes11 old firmware has issue to map legacy queue,
add a flag to switch mes to map legacy queue.

Fixes: f9d8c5c785 ("drm/amdgpu/gfx: enable mes to map legacy queue support")
Reported-by: Andrew Worsley <amworsley@gmail.com>
Link: https://lists.freedesktop.org/archives/amd-gfx/2024-August/112773.html
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:41:02 -04:00
Alex Deucher
1125f95cd2 drm/amdgpu/gfx12: return early in preempt_ib()
When MES is enabled KIQ is not available.  Return an error
when someone uses the debugfs preempt test interface in
that case.

Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:57 -04:00
Alex Deucher
1e487c9173 drm/amdgpu/gfx11: return early in preempt_ib()
When MES is enabled KIQ is not available.  Return an error
when someone uses the debugfs preempt test interface in
that case.

Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:48 -04:00
Sunil Khatri
30e8f4c2bd drm/amdgpu: Move the dumping log out of for loop
log message "Dumping IP State Completed" needs to
be logged only once when state dumping is complete.

Hence moving it out of the for loop.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Acked-by: Trigger Huang <Trigger.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:19 -04:00
Victor Zhao
af76ca8e18 drm/amd/amdgpu: move drain_workqueue before shutdown is set
[background] when unloading amdgpu driver right after running a
workload, drain_workqueue is causing "Fence fallback timer
expired on ring sdma0.0". Under sriov, this issue will cause sriov
full access timeout and a reset happening.

move drain_workqueue before shutdown is set to allow ih process and
before enter full access under sriov to avoid full access time cost.

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:07 -04:00
Trigger Huang
c67db6a6a6 drm/amdgpu: Do core dump immediately when job tmo
Do the coredump immediately after a job timeout to get a closer
representation of GPU's error status.

V2: This will skip printing vram_lost as the GPU reset is not
happened yet (Alex)

V3: Unconditionally call the core dump as we care about all the reset
functions(soft-recovery and queue reset and full adapter reset, Alex)

V4: Do the dump after adev->job_hang = true (Sunil)

Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Acked-by:  Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:39:00 -04:00
Trigger Huang
6122f5c72e drm/amdgpu: skip printing vram_lost if needed
The vm lost status can only be obtained after a GPU reset occurs, but
sometimes a dev core dump can be happened before GPU reset. So a new
argument is added to tell the dev core dump implementation whether to
skip printing the vram_lost status in the dump.
And this patch is also trying to decouple the core dump function from
the GPU reset function, by replacing the argument amdgpu_reset_context
with amdgpu_job to specify the context for core dump.

V2: Inform user if VRAM lost check is skipped so users don't assume
VRAM wasn't lost (Alex)

Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:38:53 -04:00
Alex Deucher
7c1a2d8aba drm/amdgpu/gfx9: put queue resets behind a debug option
Pending extended validation.

Reviewed-and-tested-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29 13:38:48 -04:00