YiPeng Chai
3fdcd0a31d
drm/amdgpu: Prepare for asynchronous processing of umc page retirement
...
Preparing for asynchronous processing of umc page retirement.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-22 17:13:25 -05:00
Stanley.Yang
2c7a1560e8
drm/amdgpu: Show deferred error count for UMC
...
Show deferred error count for UMC syfs node
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-18 15:47:07 -05:00
Yang Wang
7ed97155b2
drm/amdgpu: fix UBSAN array-index-out-of-bounds for ras_block_string[]
...
fix array index out of bounds issue for ras_block_string[] array.
Fixes: 30df05fb74 ("drm/amdgpu: Align ras block enum with firmware")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-18 15:46:07 -05:00
Hawking Zhang
4e2965bd3b
drm/amdgpu: Centralize ras cap query to amdgpu_ras_check_supported
...
Move ras capablity check to amdgpu_ras_check_supported.
Driver will query ras capablity through psp interace, or
vbios interface, or specific ip callbacks.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:39 -05:00
Candice Li
46e2231ce0
drm/amdgpu: Log deferred error separately
...
Separate deferred error from UE and CE and log it
individually.
Signed-off-by: Candice Li <candice.li@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:37 -05:00
Yang Wang
37973b69ea
drm/amdgpu: add aca sysfs support
...
add aca sysfs node support
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:36 -05:00
Yang Wang
04c4fcd263
drm/amdgpu: add amdgpu ras aca query interface
...
v1:
add ACA error query interface
v2:
Add a new helper function to determine whether to use ACA or MCA.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:36 -05:00
Yang Wang
33dcda51e9
drm/amdgpu: add ACA bank dump debugfs support
...
add ACA bank dump debugfs support
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:35 -05:00
Hawking Zhang
cce4febb27
drm/amdgpu: Add ras helper to query boot errors v2
...
Add ras helper function to query boot time gpu
errors.
v2: use aqua_vanjaram smn addressing pattern
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Le Ma <le.ma@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:35 -05:00
Hawking Zhang
73cb81dc54
drm/amdgpu: Packed socket_id to ras feature mask
...
Initialize RAS feature mask bit[31:29] with socket_id.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-09 15:44:13 -05:00
Candice Li
fb1e917199
drm/amdgpu: Support poison error injection via ras_ctrl debugfs
...
Support poison error injection.
Signed-off-by: Candice Li <candice.li@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-09 15:44:13 -05:00
Candice Li
90bd01471d
drm/amdgpu: Drop unnecessary sentences about CE and deferred error.
...
Remove "no user action is needed" for correctable and deferred error
to avoid confusion.
Signed-off-by: Candice Li <candice.li@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-09 15:43:54 -05:00
Hawking Zhang
6697dbf0af
Revert "drm/amdgpu: enable mca debug mode on APU by default"
...
Not needed any more with firmware fixes
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-05 16:04:36 -05:00
Srinivasan Shanmugam
b8d55a90fd
drm/amdgpu: Fix possible NULL dereference in amdgpu_ras_query_error_status_helper()
...
Return invalid error code -EINVAL for invalid block id.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1183 amdgpu_ras_query_error_status_helper() error: we previously assumed 'info' could be null (see line 1176)
Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com >
Cc: Tao Zhou <tao.zhou1@amd.com >
Cc: Hawking Zhang <Hawking.Zhang@amd.com >
Cc: Christian König <christian.koenig@amd.com >
Cc: Alex Deucher <alexander.deucher@amd.com >
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-03 11:16:06 -05:00
Srinivasan Shanmugam
091411be7a
drm/amdgpu: Use kzalloc instead of kmalloc+__GFP_ZERO in amdgpu_ras.c
...
Fixes the below smatch warnings:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2543 amdgpu_ras_recovery_init() warn: Please consider using kzalloc instead of kmalloc
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2830 amdgpu_ras_init() warn: Please consider using kzalloc instead of kmalloc
Cc: Christian König <christian.koenig@amd.com >
Cc: Alex Deucher <alexander.deucher@amd.com >
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-03 11:16:06 -05:00
YiPeng Chai
9f91e983ee
drm/amdgpu: MCA supports recording umc address information
...
MCA supports recording umc address information.
V2:
Move err_addr variable from struct ras_err_node to
struct ras_err_info.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-12-19 14:59:03 -05:00
Dave Airlie
c1ee197d64
Backmerge tag 'v6.7-rc5' into drm-next
...
Linux 6.7-rc5
Alex requested this for some amdkfd work relying on the symbols exports.
Signed-off-by: Dave Airlie <airlied@redhat.com >
2023-12-12 11:32:33 +10:00
Yang Wang
dbf3850d12
drm/amdgpu: optimize the printing order of error data
...
sort error data list to optimize the printing order.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-12-06 16:05:32 -05:00
Yang Wang
f875f61b1f
drm/amdgpu: enable mca debug mode on APU by default
...
enable MCA debug mode on APU device by default.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-30 18:26:31 -05:00
Lijo Lazar
201761b5eb
drm/amdgpu: Move mca debug mode decision to ras
...
Refactor code such that ras block decides the default mca debug mode,
and not swsmu block.
By default mca debug mode is set to false.
v2: squash in uninitialized value fix (Alex)
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-29 16:49:01 -05:00
Yang Wang
07ee43faeb
drm/amdgpu: fix ras err_data null pointer issue in amdgpu_ras.c
...
fix ras err_data null pointer issue in amdgpu_ras.c
Fixes: 8cc0f5669e ("drm/amdgpu: Support multiple error query modes")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-17 00:51:58 -05:00
Vitaly Prosyak
4638e0c29a
drm/amdgpu: fix software pci_unplug on some chips
...
When software 'pci unplug' using IGT is executed we got a sysfs directory
entry is NULL for differant ras blocks like hdp, umc, etc.
Before call 'sysfs_remove_file_from_group' and 'sysfs_remove_group'
check that 'sd' is not NULL.
[ +0.000001] RIP: 0010:sysfs_remove_group+0x83/0x90
[ +0.000002] Code: 31 c0 31 d2 31 f6 31 ff e9 9a a8 b4 00 4c 89 e7 e8 f2 a2 ff ff eb c2 49 8b 55 00 48 8b 33 48 c7 c7 80 65 94 82 e8 cd 82 bb ff <0f> 0b eb cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
[ +0.000001] RSP: 0018:ffffc90002067c90 EFLAGS: 00010246
[ +0.000002] RAX: 0000000000000000 RBX: ffffffff824ea180 RCX: 0000000000000000
[ +0.000001] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ +0.000001] RBP: ffffc90002067ca8 R08: 0000000000000000 R09: 0000000000000000
[ +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ +0.000001] R13: ffff88810a395f48 R14: ffff888101aab0d0 R15: 0000000000000000
[ +0.000001] FS: 00007f5ddaa43a00(0000) GS:ffff88841e800000(0000) knlGS:0000000000000000
[ +0.000002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000001] CR2: 00007f8ffa61ba50 CR3: 0000000106432000 CR4: 0000000000350ef0
[ +0.000001] Call Trace:
[ +0.000001] <TASK>
[ +0.000001] ? show_regs+0x72/0x90
[ +0.000002] ? sysfs_remove_group+0x83/0x90
[ +0.000002] ? __warn+0x8d/0x160
[ +0.000001] ? sysfs_remove_group+0x83/0x90
[ +0.000001] ? report_bug+0x1bb/0x1d0
[ +0.000003] ? handle_bug+0x46/0x90
[ +0.000001] ? exc_invalid_op+0x19/0x80
[ +0.000002] ? asm_exc_invalid_op+0x1b/0x20
[ +0.000003] ? sysfs_remove_group+0x83/0x90
[ +0.000001] dpm_sysfs_remove+0x61/0x70
[ +0.000002] device_del+0xa3/0x3d0
[ +0.000002] ? ktime_get_mono_fast_ns+0x46/0xb0
[ +0.000002] device_unregister+0x18/0x70
[ +0.000001] i2c_del_adapter+0x26d/0x330
[ +0.000002] arcturus_i2c_control_fini+0x25/0x50 [amdgpu]
[ +0.000236] smu_sw_fini+0x38/0x260 [amdgpu]
[ +0.000241] amdgpu_device_fini_sw+0x116/0x670 [amdgpu]
[ +0.000186] ? mutex_lock+0x13/0x50
[ +0.000003] amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
[ +0.000192] drm_minor_release+0x4f/0x80 [drm]
[ +0.000025] drm_release+0xfe/0x150 [drm]
[ +0.000027] __fput+0x9f/0x290
[ +0.000002] ____fput+0xe/0x20
[ +0.000002] task_work_run+0x61/0xa0
[ +0.000002] exit_to_user_mode_prepare+0x150/0x170
[ +0.000002] syscall_exit_to_user_mode+0x2a/0x50
Cc: Hawking Zhang <hawking.zhang@amd.com >
Cc: Luben Tuikov <luben.tuikov@amd.com >
Cc: Alex Deucher <alexander.deucher@amd.com >
Cc: Christian Koenig <christian.koenig@amd.com >
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com >
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-09 17:02:49 -05:00
Hawking Zhang
8cc0f5669e
drm/amdgpu: Support multiple error query modes
...
Direct error query mode and firmware error query mode
are supported for now.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Yang Wang <kevinyang.wang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-09 17:01:58 -05:00
Tao Zhou
18eae367cb
drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_count
...
Handle xgmi hive case.
Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-03 12:18:32 -04:00
Tao Zhou
d1d4c0b7b6
drm/amdgpu: check RAS supported first in ras_reset_error_count
...
Not all platforms support RAS.
Fixes: 73582be11a ("drm/amdgpu: bypass RAS error reset in some conditions")
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Yang Wang <kevinyang.wang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-31 16:40:15 -04:00
Tao Zhou
73582be11a
drm/amdgpu: bypass RAS error reset in some conditions
...
PMFW is responsible for RAS error reset in some conditions, driver can
skip the operation.
v2: add check for ras->in_recovery, it's set earlier than
amdgpu_in_reset.
v3: fix error in gpu reset check.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-26 18:41:22 -04:00
Tao Zhou
f3a3bbf156
drm/amdgpu: enable RAS poison mode for APU
...
Enable it by default on APU platform.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-26 18:41:22 -04:00
Yang Wang
ec3e0a9167
drm/amdgpu: refine ras error kernel log print
...
refine ras error kernel log to avoid user-ridden ambiguity.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-26 18:41:21 -04:00
Yang Wang
53d4d77927
drm/amdgpu: fix find ras error node error
...
the origin function might return the wrong node.
Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-26 18:41:21 -04:00
Tao Zhou
8096df7664
drm/amdgpu: add set/get mca debug mode operations
...
Record the debug mode status in RAS.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:28 -04:00
Stanley.Yang
66d64e4e03
drm/amdgpu: Enable RAS feature by default for APU
...
Enable RAS feature by default for aqua vanjaram on apu
platform.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:28 -04:00
Yang Wang
49c260bef3
drm/amdgpu: fix typo for amdgpu ras error data print
...
typo fix.
Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Candice Li <candice.li@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:28 -04:00
Stanley.Yang
8a65661114
drm/amdgpu: Fix delete nodes that have been relesed
...
Fix delete nodes that it has been freed.
Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com >
Reviewed-by: Yang Wang <kevinyang.wang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:27 -04:00
Hawking Zhang
9248462d7e
drm/amdgpu: Enable software RAS in vcn v4_0_3
...
Set VCN/JPEG RAS masks to enable software RAS for
VCN and JPEG.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:26 -04:00
Tao Zhou
472c5fb297
drm/amdgpu: define ras_reset_error_count function
...
Make the code architecture more simple.
v2: reuse ras_reset_error_count in ras_reset_error_status.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:26 -04:00
Asad Kamal
53dd920c1f
drm/amdgpu : Add hive ras recovery check
...
If one of the devices in the hive detects a
fatal error, need to send ras recovery reset
message to PMFW of all devices in the hive.
For that add a flag in hive to indicate that
it's undergoing ras recovery
Signed-off-by: Asad Kamal <asad.kamal@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-19 18:26:51 -04:00
Yang Wang
5b1270beb3
drm/amdgpu: add ras_err_info to identify RAS error source
...
introduced "ras_err_info" to better identify a RAS ERROR source.
NOTE:
For legacy chips, keep the original RAS error print format.
v1:
RAS errors may come from different dies during a RAS error query,
therefore, need a new data structure to identify the source of RAS ERROR.
v2:
- use new data structure 'amdgpu_smuio_mcm_config_info' instead of
ras_err_id (in v1 patch)
- refine ras error dump function name
- refine ras error dump log format
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-13 11:35:35 -04:00
Asad Kamal
625e5f3851
drm/amdgpu: Expose ras version & schema info
...
Expose ras table version & schema info to sysfs
v2: Updated schema to get poison support info
from ras context, removed asic specific checks
Signed-off-by: Asad Kamal <asad.kamal@amd.com >
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-13 11:02:54 -04:00
Stanley.Yang
a83f2bf1f4
drm/amdgpu: Fix false positive error log
...
It should first check block ras obj whether be set, it should
return 0 directly if block ras obj or hw_ops is not set.
If block doesn't support RAS just return 0 is fine.
Changed from V1:
return 0 directly if block ras obj or hw ops is not set
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-09-20 16:24:09 -04:00
Cong Liu
5838f74c29
drm/amdgpu: fix a memory leak in amdgpu_ras_feature_enable
...
This patch fixes a memory leak in the amdgpu_ras_feature_enable() function.
The leak occurs when the function sends a command to the firmware to enable
or disable a RAS feature for a GFX block. If the command fails, the kfree()
function is not called to free the info memory.
Fixes: 9f051d6ff1 ("drm/amdgpu: Free ras cmd input buffer properly")
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Cong Liu <liucong2@kylinos.cn >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-09-20 16:24:07 -04:00
Yang Wang
4051844c66
drm/amdgpu: add amdgpu mca debug sysfs support
...
add amdgpu mca debug sysfs support.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-09-20 12:25:19 -04:00
Lijo Lazar
4e8303cf2c
drm/amdgpu: Use function for IP version check
...
Use an inline function for version check. Gives more flexibility to
handle any format changes.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com >
Reviewed-by: Alex Deucher <alexander.deucher@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-09-20 12:23:28 -04:00
Hawking Zhang
9f9d4651f7
drm/amdgpu: fallback to old RAS error message for aqua_vanjaram
...
So driver doesn't generate incorrect message until
the new format is settled down for aqua_vanjaram
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Yang Wang <kevinyang.wang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-09-11 17:15:02 -04:00
Hawking Zhang
bf7aa8bea9
drm/amdgpu: Free ras cmd input buffer properly
...
Do not access the pointer for ras input cmd buffer
if it is even not allocated.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-08-30 15:51:16 -04:00
Hawking Zhang
ec70578c83
drm/amdgpu: Allow issue disable gfx ras cmd to firmware
...
Disable gfx ras command is needed in some use cases
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-08-30 15:25:54 -04:00
YiPeng Chai
80578f1641
drm/amdgpu: Enable ras for mp0 v13_0_6 sriov
...
Enable ras for mp0 v13_0_6 sriov
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com >
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-08-30 14:58:16 -04:00
YiPeng Chai
1b98a5f8e0
drm/amdgpu: mode1 reset needs to recover mp1 for mp0 v13_0_10
...
Mode1 reset needs to recover mp1 in fatal error case
for mp0 v13_0_10.
v2:
Define a macro to wrap psp function calls.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-08-15 18:07:41 -04:00
Hawking Zhang
8b3a7a707c
drm/amdgpu: Remove unnecessary ras cap check
...
RAS global isr will only be invoked by hardware
interrupt. Don't need to query ras capability in isr
In addition, amdgpu_ras_interrupt_fatal_error_handler
ensures the isr won't be called from guest linux
side by accident. The RAS cap check in isr that
introduced to fix sriov crash is not needed any more
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-08-15 18:07:41 -04:00
Candice Li
bc0f80802d
drm/amdgpu: Extend poison mode check to SDMA/VCN/JPEG
...
Treat SDMA/VCN/JPEG as RAS capable IP blocks in poison mode.
Signed-off-by: Candice Li <candice.li@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-08-09 09:46:05 -04:00
Tao Zhou
7692e1ee24
drm/amdgpu: add RAS fatal error handler for NBIO v7.9
...
Register RAS fatal error interrupt and add handler.
v2: only register NBIO RAS for dGPU platform.
change nbio_v7_9_set_ras_controller_irq_state and nbio_v7_9_set_ras_err_event_athub_irq_state
to dummy functions.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-08-09 09:46:04 -04:00