Commit Graph

15974 Commits

Author SHA1 Message Date
Srinivasan Shanmugam
19b7f7c721 drm/amdgpu/gfx12: Add Cleaner Shader Support for GFX12.0 GPUs
This commit enables the cleaner shader feature for GFX12.0 and GFX12.0.1
GPUs. The cleaner shader is important for clearing GPU resources such as
Local Data Share (LDS), Vector General Purpose Registers (VGPRs), and
Scalar General Purpose Registers (SGPRs) between workloads.

- This feature ensures that GPU resources are reset between workloads,
  preventing data leaks and ensuring accurate computation.

By enabling the cleaner shader, this update enhances the security and
reliability of GPU operations on GFX12.0 hardware.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14 11:06:50 -05:00
Kent Russell
f2935a3019 drm/amdgpu: Mark debug KFD module params as unsafe
Mark options only meant to be used for debugging as unsafe so that the
kernel is tainted when they are used.

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14 11:06:50 -05:00
Gui Chengming
62952a38d9 drm/amdgpu: fix fw attestation for MP0_14_0_{2/3}
FW attestation was disabled on MP0_14_0_{2/3}.

V2:
Move check into is_fw_attestation_support func. (Frank)
Remove DRM_WARN log info. (Alex)
Fix format. (Christian)

Signed-off-by: Gui Chengming <Jack.Gui@amd.com>
Reviewed-by: Frank.Min <Frank.Min@amd.com>
Reviewed-by: Christian König <Christian.Koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14 11:06:50 -05:00
Christian König
def59436fb drm/amdgpu: always sync the GFX pipe on ctx switch
That is needed to enforce isolation between contexts.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14 11:06:50 -05:00
Christian König
177b76a8d8 drm/amdgpu: mark a bunch of module parameters unsafe
We sometimes have people trying to use debugging options in production
environments.

Mark options only meant to be used for debugging as unsafe so that the
kernel is tainted when they are used.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Simona Vetter <simona.vetter@ffwll.ch>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14 11:06:50 -05:00
Tvrtko Ursulin
e996127ec1 drm/amdgpu: Use DRM scheduler API in amdgpu_xcp_release_sched
Lets use the existing helper instead of peeking into the structure
directly.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Danilo Krummrich <dakr@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <pstanner@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14 11:06:50 -05:00
Kenneth Feng
2affe2bbc9 drm/amdgpu: disable gfxoff with the compute workload on gfx12
Disable gfxoff with the compute workload on gfx12. This is a
workaround for the opencl test failure.

Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14 11:06:50 -05:00
Srinivasan Shanmugam
0b6b2dd383 drm/amdgpu: Fix Circular Locking Dependency in AMDGPU GFX Isolation
This commit addresses a circular locking dependency issue within the GFX
isolation mechanism. The problem was identified by a warning indicating
a potential deadlock due to inconsistent lock acquisition order.

- The `amdgpu_gfx_enforce_isolation_ring_begin_use` and
  `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously
  acquired `enforce_isolation_mutex` and called `amdgpu_gfx_kfd_sch_ctrl`,
  leading to potential deadlocks. ie., If `amdgpu_gfx_kfd_sch_ctrl` is
  called while `enforce_isolation_mutex` is held, and
  `amdgpu_gfx_enforce_isolation_handler` is called while `kfd_sch_mutex` is
  held, it can create a circular dependency.

By ensuring consistent lock usage, this fix resolves the issue:

[  606.297333] ======================================================
[  606.297343] WARNING: possible circular locking dependency detected
[  606.297353] 6.10.0-amd-mlkd-610-311224-lof #19 Tainted: G           OE
[  606.297365] ------------------------------------------------------
[  606.297375] kworker/u96:3/3825 is trying to acquire lock:
[  606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610
[  606.297413]
               but task is already holding lock:
[  606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.297725]
               which lock already depends on the new lock.

[  606.297738]
               the existing dependency chain (in reverse order) is:
[  606.297749]
               -> #2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}:
[  606.297765]        __mutex_lock+0x85/0x930
[  606.297776]        mutex_lock_nested+0x1b/0x30
[  606.297786]        amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.298007]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.298225]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.298412]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.298603]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.298866]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.298880]        process_one_work+0x21e/0x680
[  606.298890]        worker_thread+0x190/0x350
[  606.298899]        kthread+0xe7/0x120
[  606.298908]        ret_from_fork+0x3c/0x60
[  606.298919]        ret_from_fork_asm+0x1a/0x30
[  606.298929]
               -> #1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}:
[  606.298947]        __mutex_lock+0x85/0x930
[  606.298956]        mutex_lock_nested+0x1b/0x30
[  606.298966]        amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu]
[  606.299190]        process_one_work+0x21e/0x680
[  606.299199]        worker_thread+0x190/0x350
[  606.299208]        kthread+0xe7/0x120
[  606.299217]        ret_from_fork+0x3c/0x60
[  606.299227]        ret_from_fork_asm+0x1a/0x30
[  606.299236]
               -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}:
[  606.299257]        __lock_acquire+0x16f9/0x2810
[  606.299267]        lock_acquire+0xd1/0x300
[  606.299276]        __flush_work+0x250/0x610
[  606.299286]        cancel_delayed_work_sync+0x71/0x80
[  606.299296]        amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
[  606.299509]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.299723]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.299909]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.300101]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.300355]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.300369]        process_one_work+0x21e/0x680
[  606.300378]        worker_thread+0x190/0x350
[  606.300387]        kthread+0xe7/0x120
[  606.300396]        ret_from_fork+0x3c/0x60
[  606.300406]        ret_from_fork_asm+0x1a/0x30
[  606.300416]
               other info that might help us debug this:

[  606.300428] Chain exists of:
                 (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex

[  606.300458]  Possible unsafe locking scenario:

[  606.300468]        CPU0                    CPU1
[  606.300476]        ----                    ----
[  606.300484]   lock(&adev->gfx.kfd_sch_mutex);
[  606.300494]                                lock(&adev->enforce_isolation_mutex);
[  606.300508]                                lock(&adev->gfx.kfd_sch_mutex);
[  606.300521]   lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work));
[  606.300536]
                *** DEADLOCK ***

[  606.300546] 5 locks held by kworker/u96:3/3825:
[  606.300555]  #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
[  606.300577]  #1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
[  606.300600]  #2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu]
[  606.300837]  #3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.301062]  #4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610
[  606.301083]
               stack backtrace:
[  606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G           OE      6.10.0-amd-mlkd-610-311224-lof #19
[  606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024
[  606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
[  606.301140] Call Trace:
[  606.301146]  <TASK>
[  606.301154]  dump_stack_lvl+0x9b/0xf0
[  606.301166]  dump_stack+0x10/0x20
[  606.301175]  print_circular_bug+0x26c/0x340
[  606.301187]  check_noncircular+0x157/0x170
[  606.301197]  ? register_lock_class+0x48/0x490
[  606.301213]  __lock_acquire+0x16f9/0x2810
[  606.301230]  lock_acquire+0xd1/0x300
[  606.301239]  ? __flush_work+0x232/0x610
[  606.301250]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.301261]  ? mark_held_locks+0x54/0x90
[  606.301274]  ? __flush_work+0x232/0x610
[  606.301284]  __flush_work+0x250/0x610
[  606.301293]  ? __flush_work+0x232/0x610
[  606.301305]  ? __pfx_wq_barrier_func+0x10/0x10
[  606.301318]  ? mark_held_locks+0x54/0x90
[  606.301331]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.301345]  cancel_delayed_work_sync+0x71/0x80
[  606.301356]  amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
[  606.301661]  amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.302050]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.302069]  amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.302452]  amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.302862]  ? drm_sched_entity_error+0x82/0x190 [gpu_sched]
[  606.302890]  amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.303366]  drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.303388]  process_one_work+0x21e/0x680
[  606.303409]  worker_thread+0x190/0x350
[  606.303424]  ? __pfx_worker_thread+0x10/0x10
[  606.303437]  kthread+0xe7/0x120
[  606.303449]  ? __pfx_kthread+0x10/0x10
[  606.303463]  ret_from_fork+0x3c/0x60
[  606.303476]  ? __pfx_kthread+0x10/0x10
[  606.303489]  ret_from_fork_asm+0x1a/0x30
[  606.303512]  </TASK>

v2: Refactor lock handling to resolve circular dependency (Alex)

- Introduced a `sched_work` flag to defer the call to
  `amdgpu_gfx_kfd_sch_ctrl` until after releasing
  `enforce_isolation_mutex`.
- This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside
  the critical section, preventing the circular dependency and deadlock.
- The `sched_work` flag is set within the mutex-protected section if
  conditions are met, and the actual function call is made afterward.
- This approach ensures consistent lock acquisition order.

Fixes: afefd6f245 ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization")
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14 10:38:33 -05:00
Dave Airlie
c3d590f8ba Merge tag 'amd-drm-next-6.14-2025-01-10' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.14-2025-01-10:

amdgpu:
- Fix max surface handling in DC
- clang fixes
- DCN 3.5 fixes
- DCN 4.0.1 fixes
- DC CRC fixes
- DML updates
- DSC fixes
- PSR fixes
- DC add some divide by 0 checks
- SMU13 updates
- SR-IOV fixes
- RAS fixes
- Cleaner shader support for gfx10.3 dGPUs
- fix drm buddy trim handling
- SDMA engine reset updates
_ Fix RB bitmap setup
- Fix doorbell ttm cleanup
- Add CEC notifier support
- DPIA updates
- MST fixes

amdkfd:
- Shader debugger fixes
- Trap handler cleanup
- Cleanup includes
- Eviction fence wq fix

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250110172731.2960668-1-alexander.deucher@amd.com
2025-01-13 11:13:13 +10:00
Victor Zhao
85b73415fd drm/amdgpu: fill the ucode bo during psp resume for SRIOV
refill the ucode bo during psp resume for SRIOV, otherwise ucode load
will fail after VM hibernation and fb clean.

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09 16:02:57 -05:00
Srinivasan Shanmugam
9814626751 drm/amdgpu/gfx10: Enable cleaner shader for GFX10.3.2/10.3.4/10.3.5 GPUs
Enable the cleaner shader for GFX10.3.2/10.3.4/10.3.5 GPUs to provide
data isolation between GPU workloads. The cleaner shader is responsible
for clearing the Local Data Store (LDS), Vector General Purpose
Registers (VGPRs), and Scalar General Purpose Registers (SGPRs), which
helps prevent data leakage and ensures accurate computation results.

This update extends cleaner shader support to GFX10.3.2/10.3.4/10.3.5
GPUs, previously available for GFX10.3.0. It enhances security by
clearing GPU memory between processes and maintains a consistent GPU
state across KGD and KFD workloads.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09 16:02:57 -05:00
Jonathan Kim
86bde64cb7 drm/amdgpu: fix gpu recovery disable with per queue reset
Per queue reset should be bypassed when gpu recovery is disabled
with module parameter.

Fixes: ee0a469cf9 ("drm/amdkfd: support per-queue reset on gfx9")
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09 16:02:57 -05:00
Jiang Liu
edec9b0690 drm/amdgpu: wrong array index to get ip block for PSP
The adev->ip_blocks array is not indexed by AMD_IP_BLOCK_TYPE_xxx,
instead we should call amdgpu_device_ip_get_ip_block() to get the
corresponding IP block oject.

Fix some checkpatch issues (Alex)

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09 16:02:56 -05:00
Jiang Liu
60a2c0c12b drm/amdgpu: tear down ttm range manager for doorbell in amdgpu_ttm_fini()
Tear down ttm range manager for doorbell in function amdgpu_ttm_fini(),
to avoid memory leakage.

Fixes: 792b84fb90 ("drm/amdgpu: initialize ttm for doorbells")
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09 16:02:56 -05:00
Tim Huang
6b34d0328b drm/amdgpu: fix incorrect number of active RBs for gfx12
The RB bitmap should be global active RB bitmap &
active RB bitmap based on active SA.

Signed-off-by: Tim Huang <tim.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09 16:02:56 -05:00
Tim Huang
4a60c55b3b drm/amdgpu: fix incorrect active RB bitmap in setup RBs
The RB bitmap width per SA may be 0x1 for some ASICs.
Use the actual bitmap of SA instead of 0x3 to determine
the active RB bitmap.

Signed-off-by: Tim Huang <tim.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09 16:02:56 -05:00
Dan Carpenter
6ec6cd9acb drm/amdgpu: Fix shift type in amdgpu_debugfs_sdma_sched_mask_set()
The "mask" and "val" variables are type u64.  The problem is that the
BIT() macros are type unsigned long which is just 32 bits on 32bit
systems.

It's unlikely that people will be using this driver on 32bit kernels
and even if they did we only use the lower AMDGPU_MAX_SDMA_INSTANCES (16)
bits.  So this bug does not affect anything in real life.

Still, for correctness sake, u64 bit masks should use BIT_ULL().

Fixes: d2e3961ae3 ("drm/amdgpu: add amdgpu_sdma_sched_mask debugfs")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/d39a9325-87a4-4543-b6ec-1c61fca3a6fc@stanley.mountain
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09 16:02:56 -05:00
Jesse Zhang
f7e672e6f8 drm/amdgpu: enable gfx12 queue reset flag
Enable the kgq and kcq queue reset flag

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Tim Huang <tim.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09 16:02:47 -05:00
Jesse Zhang
39b0fa29f6 drm/amdgpu/sdma4.4.2: add apu support in sdma queue reset
Remove apu check in sdma queue reset.

Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09 16:01:29 -05:00
Dave Airlie
0739b8ba82 Merge tag 'drm-misc-next-2025-01-06' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
drm-misc-next for 6.14:

UAPI Changes:
- Clarify drm memory stats documentation

Cross-subsystem Changes:

Core Changes:
 - sched: Documentation fixes,

Driver Changes:
 - amdgpu: Track BO memory stats at runtime
 - amdxdna: Various fixes
 - hisilicon: New HIBMC driver
 - bridges:
   - Provide default implementation of atomic_check for HDMI bridges
   - it605: HDCP improvements, MCCS Support

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250106-augmented-kakapo-of-action-0cf000@houat
2025-01-09 15:48:50 +10:00
Dmitry Baryshkov
26d6fd8191 drm/connector: make mode_valid take a const struct drm_display_mode
The mode_valid() callbacks of drm_encoder, drm_crtc and drm_bridge
take a const struct drm_display_mode argument. Change the mode_valid
callback of drm_connector to also take a const argument.

Acked-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Reviewed-by: Raphael Gallais-Pou <rgallaispou@gmail.com>
Reviewed-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Reviewed-by: Lyude Paul <lyude@redhat.com>
Reviewed-by: Maxime Ripard <mripard@kernel.org>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241214-drm-connector-mode-valid-const-v2-5-4f9498a4c822@linaro.org
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2025-01-07 12:45:19 +02:00
Kent Russell
6c9c97387b drm/amdgpu: Remove unnecessary NULL check
container_of cannot return NULL, so it is unnecessary to check for
NULL after gem_to_amdgpu_bo, which is just a container_of call

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-06 14:44:29 -05:00
Arunpravin Paneer Selvam
3318ba94e5 drm/amdgpu: Add a lock when accessing the buddy trim function
When running YouTube videos and Steam games simultaneously,
the tester found a system hang / race condition issue with
the multi-display configuration setting. Adding a lock to
the buddy allocator's trim function would be the solution.

<log snip>
[ 7197.250436] general protection fault, probably for non-canonical address 0xdead000000000108
[ 7197.250447] RIP: 0010:__alloc_range+0x8b/0x340 [amddrm_buddy]
[ 7197.250470] Call Trace:
[ 7197.250472]  <TASK>
[ 7197.250475]  ? show_regs+0x6d/0x80
[ 7197.250481]  ? die_addr+0x37/0xa0
[ 7197.250483]  ? exc_general_protection+0x1db/0x480
[ 7197.250488]  ? drm_suballoc_new+0x13c/0x93d [drm_suballoc_helper]
[ 7197.250493]  ? asm_exc_general_protection+0x27/0x30
[ 7197.250498]  ? __alloc_range+0x8b/0x340 [amddrm_buddy]
[ 7197.250501]  ? __alloc_range+0x109/0x340 [amddrm_buddy]
[ 7197.250506]  amddrm_buddy_block_trim+0x1b5/0x260 [amddrm_buddy]
[ 7197.250511]  amdgpu_vram_mgr_new+0x4f5/0x590 [amdgpu]
[ 7197.250682]  amdttm_resource_alloc+0x46/0xb0 [amdttm]
[ 7197.250689]  ttm_bo_alloc_resource+0xe4/0x370 [amdttm]
[ 7197.250696]  amdttm_bo_validate+0x9d/0x180 [amdttm]
[ 7197.250701]  amdgpu_bo_pin+0x15a/0x2f0 [amdgpu]
[ 7197.250831]  amdgpu_dm_plane_helper_prepare_fb+0xb2/0x360 [amdgpu]
[ 7197.251025]  ? try_wait_for_completion+0x59/0x70
[ 7197.251030]  drm_atomic_helper_prepare_planes.part.0+0x2f/0x1e0
[ 7197.251035]  drm_atomic_helper_prepare_planes+0x5d/0x70
[ 7197.251037]  drm_atomic_helper_commit+0x84/0x160
[ 7197.251040]  drm_atomic_nonblocking_commit+0x59/0x70
[ 7197.251043]  drm_mode_atomic_ioctl+0x720/0x850
[ 7197.251047]  ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
[ 7197.251049]  drm_ioctl_kernel+0xb9/0x120
[ 7197.251053]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 7197.251056]  drm_ioctl+0x2d4/0x550
[ 7197.251058]  ? __pfx_drm_mode_atomic_ioctl+0x10/0x10
[ 7197.251063]  amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
[ 7197.251186]  __x64_sys_ioctl+0xa0/0xf0
[ 7197.251190]  x64_sys_call+0x143b/0x25c0
[ 7197.251193]  do_syscall_64+0x7f/0x180
[ 7197.251197]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 7197.251199]  ? amdgpu_display_user_framebuffer_create+0x215/0x320 [amdgpu]
[ 7197.251329]  ? drm_internal_framebuffer_create+0xb7/0x1a0
[ 7197.251332]  ? srso_alias_return_thunk+0x5/0xfbef5

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Fixes: 4a5ad08f53 ("drm/amdgpu: Add address alignment support to DCC buffers")
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-06 14:44:29 -05:00
Prike Liang
2b11179e18 drm/amdgpu: reduce RLC safe mode request for gfx clock gating
The driver can only request one time for the power safe mode instead of
polling and disabling the power feature each time prior to program the
GFX clock gating control registers. This update will reduce the latency
on the GFX clock gating entry.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-06 14:44:29 -05:00
Srinivasan Shanmugam
8b248b9045 drm/amdgpu/gfx10: Add cleaner shader for GFX10.3.0
This commit adds the cleaner shader microcode for GFX10.3.0 GPUs. The
cleaner shader is a piece of GPU code that is used to clear or
initialize certain GPU resources, such as Local Data Share (LDS), Vector
General Purpose Registers (VGPRs), and Scalar General Purpose Registers
(SGPRs).

Clearing these resources is important for ensuring data isolation
between different workloads running on the GPU. Without the cleaner
shader, residual data from a previous workload could potentially be
accessed by a subsequent workload, leading to data leaks and incorrect
computation results.

The cleaner shader microcode is represented as an array of 32-bit words
(`gfx_10_3_0_cleaner_shader_hex`). This array is the binary
representation of the cleaner shader code, which is written in a
low-level GPU instruction set.

When the cleaner shader feature is enabled, the AMDGPU driver loads this
array into a specific location in the GPU memory. The GPU then reads
this memory location to fetch and execute the cleaner shader
instructions.

The cleaner shader is executed automatically by the GPU at the end of
each workload, before the next workload starts. This ensures that all
GPU resources are in a clean state before the start of each workload.

This addition is part of the cleaner shader feature implementation. The
cleaner shader feature helps resource utilization by cleaning up GPU
resources after they are used. It also enhances security and reliability
by preventing data leaks between workloads.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-06 14:44:28 -05:00
Srinivasan Shanmugam
9095567bc3 drm/amdgpu: Fix error handling in amdgpu_ras_add_bad_pages
It ensures that appropriate error codes are returned when an error
condition is detected

Fixes the below;
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2849 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_umc_pages_in_a_row()' failed.
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2884 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_ras_mca2pa()' failed.

v2: s/-EIO/-EINVAL, retained the use of -EINVAL from
    amdgpu_umc_pages_in_a_row & and amdgpu_ras_mca2pa_by_idx, when the
    RAS context is not initialized or the convert_ras_err_addr function is
    unavailable. (Thomas)

V3: Returning 0 as the absence of eh_data is acceptable. (Tao)

Fixes: a8d133e625 ("drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: YiPeng Chai <yipeng.chai@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-06 14:44:28 -05:00
yfeng1
62bf9fe6fa drm/amdgpu: Fix for MEC SJT FW Load Fail on VF
Users might switch to ROCM build does not include MEC SJT FW and driver
needs to consider this case.w

Signed-off-by: yfeng1 <yfeng1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-06 14:44:28 -05:00
Dave Airlie
8368e9719d Merge tag 'amd-drm-next-6.14-2024-12-18' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.14-2024-12-18:

amdgpu:
- RAS updates
- ISP updates
- SDMA queue reset support
- Rework DPM powergating interfaces
- Documentation updates and cleanups
- Panel replay fixes
- DCN 3.5 updates
- DP tunneling fixes
- Use a pm notifier to more gracefully handle VRAM eviction on suspend or hibernate
- Add debugfs interfaces for forcing scheduling to specific engine instances
- GG 9.5 updates
- IH 4.4 updates
- Make missing optional firmware less noisy
- PSP 13.x updates
- SMU 13.x updates
- VCN 5.x updates
- JPEG 5.x updates
- Misc cleanups
- GC 12.x updates
- DRM panic support
- DC FAMS updates
- DSC fixes
- job handling fixes

amdkfd:
- GG 9.5 updates
- Logging improvements
- Misc cleanups
- Various Optimizations

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241218201758.2580723-1-alexander.deucher@amd.com
2024-12-20 07:57:01 +10:00
Yunxiang Li
74ef9527bd drm/amdgpu: track bo memory stats at runtime
Before, every time fdinfo is queried we try to lock all the BOs in the
VM and calculate memory usage from scratch. This works okay if the
fdinfo is rarely read and the VMs don't have a ton of BOs. If either of
these conditions is not true, we get a massive performance hit.

In this new revision, we track the BOs as they change states. This way
when the fdinfo is queried we only need to take the status lock and copy
out the usage stats with minimal impact to the runtime performance. With
this new approach however, we would no longer be able to track active
buffers.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241219151411.1150-6-Yunxiang.Li@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
2024-12-19 16:56:28 +01:00
Yunxiang Li
a541a6e865 drm/amdgpu: remove unused function parameter
amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all
callers have a reference to adev handy, so remove it for cleanliness.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241219151411.1150-5-Yunxiang.Li@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
2024-12-19 16:56:25 +01:00
Yunxiang Li
bebf2ebd70 drm: make drm-active- stats optional
When memory stats is generated fresh everytime by going though all the
BOs, their active information is quite easy to get. But if the stats are
tracked with BO's state this becomes harder since the job scheduling
part doesn't really deal with individual buffers.

Make drm-active- optional to enable amdgpu to switch to the second
method.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241219151411.1150-3-Yunxiang.Li@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
2024-12-19 16:56:17 +01:00
Dave Airlie
38e961097e Merge tag 'v6.13-rc3' into drm-next
Backmerge linux 6.13-rc3 as amd next has some dependencies on fixes in it.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2024-12-19 12:00:02 +10:00
Michel Dänzer
695c2c745e drm/amdgpu: Handle NULL bo->tbo.resource (again) in amdgpu_vm_bo_update
Third time's the charm, I hope?

Fixes: d3116756a7 ("drm/ttm: rename bo->mem and make it a pointer")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3837
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:08 -05:00
Mirsad Todorovac
a21ab06b8c drm/admgpu: replace kmalloc() and memcpy() with kmemdup()
The static analyser tool gave the following advice:

./drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c:1266:7-14: WARNING opportunity for kmemdup

 → 1266         tmp = kmalloc(used_size, GFP_KERNEL);
   1267         if (!tmp)
   1268                 return -ENOMEM;
   1269
 → 1270         memcpy(tmp, &host_telemetry->body.error_count, used_size);

Replacing kmalloc() + memcpy() with kmemdump() doesn't change semantics.
Original code works without fault, so this is not a bug fix but proposed improvement.

Link: https://lwn.net/Articles/198928/
Fixes: 84a2947ecc ("drm/amdgpu: Implement virt req_ras_err_count")
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Xinhui Pan <Xinhui.Pan@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Zhigang Luo <Zhigang.Luo@amd.com>
Cc: Victor Skvortsov <victor.skvortsov@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Yunxiang Li <Yunxiang.Li@amd.com>
Cc: Jack Xiao <Jack.Xiao@amd.com>
Cc: Vignesh Chander <Vignesh.Chander@amd.com>
Cc: Danijel Slivka <danijel.slivka@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:08 -05:00
Philip Yang
e37ccf44ac drm/amdgpu: Show warning message if IH ring overflow
If IH primary ring and KFD ih fifo overflows, we may miss CP, SDMA
interrupts and cause application soft hang. Show warning message with
ring name if overflow happens.

Add function to get ih ring name to avoid duplicating it. To keep
warning message consistent between GPU generations, change all
*_ih.c except ASICs older than Vega which has only one ih ring.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:07 -05:00
Philip Yang
1b00143231 drm/amdgpu: Optimize gfx v9 GPU page fault handling
After GPU page fault, there are lots of page fault interrupts generated
at short period even with CAM filter enabled because the fault address
is different. Each page fault copy to KFD ih fifo to send event to user
space by KFD interrupt worker, this could cause KFD ih fifo overflow
while other processes generate events at same time.

KFD process is aborted after GPU page fault, we only need one GPU page
fault interrupt sent to KFD ih fifo to send memory exception event to
user space.

Incease KFD ih fifo size to 2 times of IH primary ring size, to handle
the burst events case.

This patch handle the gfx v9 path, cover retry on/off and CAM filter
on/off cases.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:07 -05:00
Christian König
11815bb0e3 drm/amdgpu: partially revert "reduce reset time"
This partially reverts commit 194eb174cb.

This commit introduced a new state variable into adev without even
remotely worrying about CPU barriers.

Since we already have the amdgpu_in_reset() function exactly for this
use case partially revert that.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:07 -05:00
Christian König
26c95e838e drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare
As soon as the prepare phase is completed the VM might be released,
better set it to NULL.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:07 -05:00
Christian König
57f812d171 drm/amdgpu: fix amdgpu_coredump
The VM pointer might already be outdated when that function is called.
Use the PASID instead to gather the information instead.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:07 -05:00
Candice Li
d1ebe307b4 drm/amdgpu: Enable psp v14_0_3 RAS support for non-SRIOV configurations.
Enable psp v14_0_3 RAS support for non-SRIOV configurations.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:07 -05:00
Philip Yang
b4b7271e5c drm/amdgpu: Don't enable sdma 4.4.5 CTXEMPTY interrupt
The sdma context empty interrupt is dropped in amdgpu_irq_dispatch
as unregistered interrupt src_id 243, this interrupt accounts to 1/3 of
total interrupts and causes IH primary ring overflow when running
stressful benchmark application. Disable this interrupt has no side
effect found.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:39:07 -05:00
Karol Przybylski
34c4eb7d4e drm/amdgpu: Fix potential integer overflow in scheduler mask calculations
The use of 1 << i in scheduler mask calculations can result in an
unintentional integer overflow due to the expression being
evaluated as a 32-bit signed integer.

This patch replaces 1 << i with 1ULL << i to ensure the operation
is performed as a 64-bit unsigned integer, preventing overflow

Discovered in coverity scan, CID 1636393, 1636175, 1636007, 1635853

Fixes: c5c63d9cb5 ("drm/amdgpu: add amdgpu_gfx_sched_mask and amdgpu_compute_sched_mask debugfs")
Signed-off-by: Karol Przybylski <karprzy7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:38:42 -05:00
Alex Deucher
f1fd1d0f40 drm/amdgpu/gfx12: fix IP version check
Use the helper function rather than reading it directly.

Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:22:24 -05:00
Alex Deucher
63bfd24088 drm/amdgpu/mmhub4.1: fix IP version check
Use the helper function rather than reading it directly.

Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:22:22 -05:00
Alex Deucher
2c8eeaaa0f drm/amdgpu/nbio7.11: fix IP version check
Use the helper function rather than reading it directly.

Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:22:19 -05:00
Alex Deucher
0ec43fbece drm/amdgpu/nbio7.0: fix IP version check
Use the helper function rather than reading it directly.

Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:22:16 -05:00
Alex Deucher
22b9555bc9 drm/amdgpu/nbio7.7: fix IP version check
Use the helper function rather than reading it directly.

Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:22:03 -05:00
Andrew Martin
357ef5b3b7 drm/amdgpu: Failed to check various return code
Clean up code to quiet the compiler on us failing to check the return
code.

Signed-off-by: Andrew Martin <Andrew.Martin@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:18:20 -05:00
Lijo Lazar
635c659fce drm/amdgpu: Use dbg level for VBIOS check messages
Driver has different ways to fetch VBIOS. If one of the methods doesn't
find an authentic one, it will show misleading info messages eventhough
a subsequent method finds a valid VBIOS. Keep the message level at debug
and add device context.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:18:04 -05:00
Pierre-Eric Pelloux-Prayer
54a1b36d4b drm/amdgpu: remove useless init from amdgpu_job_alloc
This init is useless because base.sched will be cleared to 0 in drm_sched_job_init
because of commit 2320c9e6a7 ("drm/sched: memset() 'job' in drm_sched_job_init()").

Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18 12:17:46 -05:00