Stanley.Yang
7ec11c2f65
drm/amdgpu: Fix ineffective ras_mask settings
...
Check amdgpu_ras_mask to fix ineffective ras_mask setting
due to special asic without sram ecc enable but with poison
supported.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-02-26 11:14:37 -05:00
Lijo Lazar
1b6ef74b2b
drm/amdgpu: Add fatal error detected flag
...
For a RAS error that needs a full reset to recover, set the fatal error
status. Clear the status once the device is reset.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com >
Reviewed-by: Asad Kamal <asad.kamal@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-02-26 11:14:24 -05:00
Tao Zhou
edfdde9013
drm/amdgpu: disable RAS feature when fini
...
Send RAS disable feature command in fini.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-31 14:05:18 -05:00
Hawking Zhang
1731ba9b64
drm/amdgpu: Update boot time errors polling sequence
...
Update boot time errors polling sequence to align with
the latest firmware change.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Frank Min <Frank.Min@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-31 14:04:55 -05:00
YiPeng Chai
ed1e1e42fd
drm/amdgpu: Support passing poison consumption ras block to SRIOV
...
Support passing poison consumption ras blocks
to SRIOV.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-25 14:58:03 -05:00
Yang Wang
c0c48f0d61
drm/amdgpu: adjust aca init/fini sequence to match gpu reset
...
- move aca init/fini function into ras init/fini to adapt gpu reset
sequence.
- add new function amdgpu_aca_reset()
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-25 14:58:02 -05:00
Mukul Joshi
c84a7e21db
drm/amdgpu: Fix module unload hang with RAS enabled
...
The driver unload hangs because the page retirement
kthread cannot be stopped as it is sleeping and waiting
on page retirement event to occur. Add kthread_should_stop()
to the event condition to wake up the kthread when kthread
stop is called during driver unload.
Fixes: 3fdcd0a31d ("drm/amdgpu: Prepare for asynchronous processing of umc page retirement")
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-25 14:57:52 -05:00
Yang Wang
2866a4549c
drm/amdgpu: skip call ras_late_init if ras block is not supported
...
skip call ras_late_init callback if ras block is not supported.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-22 17:13:28 -05:00
YiPeng Chai
0795b5d234
drm/amdgpu:Support retiring multiple MCA error address pages
...
Support retiring multiple MCA error address pages in
one in-band query for umc v12_0.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-22 17:13:25 -05:00
YiPeng Chai
6c23f3d12a
drm/amdgpu: Use asynchronous polling to handle umc_v12_0 poisoning
...
Use asynchronous polling to handle umc_v12_0 poisoning.
v2:
1. Change function name.
2. Change the debugging information content.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-22 17:13:25 -05:00
Stanley.Yang
ee9c3031d0
drm/amdgpu: Fix ras features value calltrace
...
The high three bits of ras features mask indicate socket
id, it should skip to check high three bits of ras features
mask before disable all ras features.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-22 17:13:25 -05:00
YiPeng Chai
3fdcd0a31d
drm/amdgpu: Prepare for asynchronous processing of umc page retirement
...
Preparing for asynchronous processing of umc page retirement.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-22 17:13:25 -05:00
Stanley.Yang
2c7a1560e8
drm/amdgpu: Show deferred error count for UMC
...
Show deferred error count for UMC syfs node
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-18 15:47:07 -05:00
Yang Wang
7ed97155b2
drm/amdgpu: fix UBSAN array-index-out-of-bounds for ras_block_string[]
...
fix array index out of bounds issue for ras_block_string[] array.
Fixes: 30df05fb74 ("drm/amdgpu: Align ras block enum with firmware")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-18 15:46:07 -05:00
Hawking Zhang
4e2965bd3b
drm/amdgpu: Centralize ras cap query to amdgpu_ras_check_supported
...
Move ras capablity check to amdgpu_ras_check_supported.
Driver will query ras capablity through psp interace, or
vbios interface, or specific ip callbacks.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:39 -05:00
Candice Li
46e2231ce0
drm/amdgpu: Log deferred error separately
...
Separate deferred error from UE and CE and log it
individually.
Signed-off-by: Candice Li <candice.li@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:37 -05:00
Yang Wang
37973b69ea
drm/amdgpu: add aca sysfs support
...
add aca sysfs node support
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:36 -05:00
Yang Wang
04c4fcd263
drm/amdgpu: add amdgpu ras aca query interface
...
v1:
add ACA error query interface
v2:
Add a new helper function to determine whether to use ACA or MCA.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:36 -05:00
Yang Wang
33dcda51e9
drm/amdgpu: add ACA bank dump debugfs support
...
add ACA bank dump debugfs support
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:35 -05:00
Hawking Zhang
cce4febb27
drm/amdgpu: Add ras helper to query boot errors v2
...
Add ras helper function to query boot time gpu
errors.
v2: use aqua_vanjaram smn addressing pattern
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Le Ma <le.ma@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-15 18:35:35 -05:00
Hawking Zhang
73cb81dc54
drm/amdgpu: Packed socket_id to ras feature mask
...
Initialize RAS feature mask bit[31:29] with socket_id.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-09 15:44:13 -05:00
Candice Li
fb1e917199
drm/amdgpu: Support poison error injection via ras_ctrl debugfs
...
Support poison error injection.
Signed-off-by: Candice Li <candice.li@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-09 15:44:13 -05:00
Candice Li
90bd01471d
drm/amdgpu: Drop unnecessary sentences about CE and deferred error.
...
Remove "no user action is needed" for correctable and deferred error
to avoid confusion.
Signed-off-by: Candice Li <candice.li@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-09 15:43:54 -05:00
Hawking Zhang
6697dbf0af
Revert "drm/amdgpu: enable mca debug mode on APU by default"
...
Not needed any more with firmware fixes
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-05 16:04:36 -05:00
Srinivasan Shanmugam
b8d55a90fd
drm/amdgpu: Fix possible NULL dereference in amdgpu_ras_query_error_status_helper()
...
Return invalid error code -EINVAL for invalid block id.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1183 amdgpu_ras_query_error_status_helper() error: we previously assumed 'info' could be null (see line 1176)
Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com >
Cc: Tao Zhou <tao.zhou1@amd.com >
Cc: Hawking Zhang <Hawking.Zhang@amd.com >
Cc: Christian König <christian.koenig@amd.com >
Cc: Alex Deucher <alexander.deucher@amd.com >
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-03 11:16:06 -05:00
Srinivasan Shanmugam
091411be7a
drm/amdgpu: Use kzalloc instead of kmalloc+__GFP_ZERO in amdgpu_ras.c
...
Fixes the below smatch warnings:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2543 amdgpu_ras_recovery_init() warn: Please consider using kzalloc instead of kmalloc
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2830 amdgpu_ras_init() warn: Please consider using kzalloc instead of kmalloc
Cc: Christian König <christian.koenig@amd.com >
Cc: Alex Deucher <alexander.deucher@amd.com >
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2024-01-03 11:16:06 -05:00
YiPeng Chai
9f91e983ee
drm/amdgpu: MCA supports recording umc address information
...
MCA supports recording umc address information.
V2:
Move err_addr variable from struct ras_err_node to
struct ras_err_info.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-12-19 14:59:03 -05:00
Dave Airlie
c1ee197d64
Backmerge tag 'v6.7-rc5' into drm-next
...
Linux 6.7-rc5
Alex requested this for some amdkfd work relying on the symbols exports.
Signed-off-by: Dave Airlie <airlied@redhat.com >
2023-12-12 11:32:33 +10:00
Yang Wang
dbf3850d12
drm/amdgpu: optimize the printing order of error data
...
sort error data list to optimize the printing order.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-12-06 16:05:32 -05:00
Yang Wang
f875f61b1f
drm/amdgpu: enable mca debug mode on APU by default
...
enable MCA debug mode on APU device by default.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-30 18:26:31 -05:00
Lijo Lazar
201761b5eb
drm/amdgpu: Move mca debug mode decision to ras
...
Refactor code such that ras block decides the default mca debug mode,
and not swsmu block.
By default mca debug mode is set to false.
v2: squash in uninitialized value fix (Alex)
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-29 16:49:01 -05:00
Yang Wang
07ee43faeb
drm/amdgpu: fix ras err_data null pointer issue in amdgpu_ras.c
...
fix ras err_data null pointer issue in amdgpu_ras.c
Fixes: 8cc0f5669e ("drm/amdgpu: Support multiple error query modes")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-17 00:51:58 -05:00
Vitaly Prosyak
4638e0c29a
drm/amdgpu: fix software pci_unplug on some chips
...
When software 'pci unplug' using IGT is executed we got a sysfs directory
entry is NULL for differant ras blocks like hdp, umc, etc.
Before call 'sysfs_remove_file_from_group' and 'sysfs_remove_group'
check that 'sd' is not NULL.
[ +0.000001] RIP: 0010:sysfs_remove_group+0x83/0x90
[ +0.000002] Code: 31 c0 31 d2 31 f6 31 ff e9 9a a8 b4 00 4c 89 e7 e8 f2 a2 ff ff eb c2 49 8b 55 00 48 8b 33 48 c7 c7 80 65 94 82 e8 cd 82 bb ff <0f> 0b eb cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
[ +0.000001] RSP: 0018:ffffc90002067c90 EFLAGS: 00010246
[ +0.000002] RAX: 0000000000000000 RBX: ffffffff824ea180 RCX: 0000000000000000
[ +0.000001] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ +0.000001] RBP: ffffc90002067ca8 R08: 0000000000000000 R09: 0000000000000000
[ +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ +0.000001] R13: ffff88810a395f48 R14: ffff888101aab0d0 R15: 0000000000000000
[ +0.000001] FS: 00007f5ddaa43a00(0000) GS:ffff88841e800000(0000) knlGS:0000000000000000
[ +0.000002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000001] CR2: 00007f8ffa61ba50 CR3: 0000000106432000 CR4: 0000000000350ef0
[ +0.000001] Call Trace:
[ +0.000001] <TASK>
[ +0.000001] ? show_regs+0x72/0x90
[ +0.000002] ? sysfs_remove_group+0x83/0x90
[ +0.000002] ? __warn+0x8d/0x160
[ +0.000001] ? sysfs_remove_group+0x83/0x90
[ +0.000001] ? report_bug+0x1bb/0x1d0
[ +0.000003] ? handle_bug+0x46/0x90
[ +0.000001] ? exc_invalid_op+0x19/0x80
[ +0.000002] ? asm_exc_invalid_op+0x1b/0x20
[ +0.000003] ? sysfs_remove_group+0x83/0x90
[ +0.000001] dpm_sysfs_remove+0x61/0x70
[ +0.000002] device_del+0xa3/0x3d0
[ +0.000002] ? ktime_get_mono_fast_ns+0x46/0xb0
[ +0.000002] device_unregister+0x18/0x70
[ +0.000001] i2c_del_adapter+0x26d/0x330
[ +0.000002] arcturus_i2c_control_fini+0x25/0x50 [amdgpu]
[ +0.000236] smu_sw_fini+0x38/0x260 [amdgpu]
[ +0.000241] amdgpu_device_fini_sw+0x116/0x670 [amdgpu]
[ +0.000186] ? mutex_lock+0x13/0x50
[ +0.000003] amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
[ +0.000192] drm_minor_release+0x4f/0x80 [drm]
[ +0.000025] drm_release+0xfe/0x150 [drm]
[ +0.000027] __fput+0x9f/0x290
[ +0.000002] ____fput+0xe/0x20
[ +0.000002] task_work_run+0x61/0xa0
[ +0.000002] exit_to_user_mode_prepare+0x150/0x170
[ +0.000002] syscall_exit_to_user_mode+0x2a/0x50
Cc: Hawking Zhang <hawking.zhang@amd.com >
Cc: Luben Tuikov <luben.tuikov@amd.com >
Cc: Alex Deucher <alexander.deucher@amd.com >
Cc: Christian Koenig <christian.koenig@amd.com >
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com >
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-09 17:02:49 -05:00
Hawking Zhang
8cc0f5669e
drm/amdgpu: Support multiple error query modes
...
Direct error query mode and firmware error query mode
are supported for now.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Yang Wang <kevinyang.wang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-09 17:01:58 -05:00
Tao Zhou
18eae367cb
drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_count
...
Handle xgmi hive case.
Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-11-03 12:18:32 -04:00
Tao Zhou
d1d4c0b7b6
drm/amdgpu: check RAS supported first in ras_reset_error_count
...
Not all platforms support RAS.
Fixes: 73582be11a ("drm/amdgpu: bypass RAS error reset in some conditions")
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Yang Wang <kevinyang.wang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-31 16:40:15 -04:00
Tao Zhou
73582be11a
drm/amdgpu: bypass RAS error reset in some conditions
...
PMFW is responsible for RAS error reset in some conditions, driver can
skip the operation.
v2: add check for ras->in_recovery, it's set earlier than
amdgpu_in_reset.
v3: fix error in gpu reset check.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-26 18:41:22 -04:00
Tao Zhou
f3a3bbf156
drm/amdgpu: enable RAS poison mode for APU
...
Enable it by default on APU platform.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-26 18:41:22 -04:00
Yang Wang
ec3e0a9167
drm/amdgpu: refine ras error kernel log print
...
refine ras error kernel log to avoid user-ridden ambiguity.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-26 18:41:21 -04:00
Yang Wang
53d4d77927
drm/amdgpu: fix find ras error node error
...
the origin function might return the wrong node.
Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-26 18:41:21 -04:00
Tao Zhou
8096df7664
drm/amdgpu: add set/get mca debug mode operations
...
Record the debug mode status in RAS.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:28 -04:00
Stanley.Yang
66d64e4e03
drm/amdgpu: Enable RAS feature by default for APU
...
Enable RAS feature by default for aqua vanjaram on apu
platform.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:28 -04:00
Yang Wang
49c260bef3
drm/amdgpu: fix typo for amdgpu ras error data print
...
typo fix.
Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Candice Li <candice.li@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:28 -04:00
Stanley.Yang
8a65661114
drm/amdgpu: Fix delete nodes that have been relesed
...
Fix delete nodes that it has been freed.
Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com >
Reviewed-by: Yang Wang <kevinyang.wang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:27 -04:00
Hawking Zhang
9248462d7e
drm/amdgpu: Enable software RAS in vcn v4_0_3
...
Set VCN/JPEG RAS masks to enable software RAS for
VCN and JPEG.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:26 -04:00
Tao Zhou
472c5fb297
drm/amdgpu: define ras_reset_error_count function
...
Make the code architecture more simple.
v2: reuse ras_reset_error_count in ras_reset_error_status.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-20 15:11:26 -04:00
Asad Kamal
53dd920c1f
drm/amdgpu : Add hive ras recovery check
...
If one of the devices in the hive detects a
fatal error, need to send ras recovery reset
message to PMFW of all devices in the hive.
For that add a flag in hive to indicate that
it's undergoing ras recovery
Signed-off-by: Asad Kamal <asad.kamal@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-19 18:26:51 -04:00
Yang Wang
5b1270beb3
drm/amdgpu: add ras_err_info to identify RAS error source
...
introduced "ras_err_info" to better identify a RAS ERROR source.
NOTE:
For legacy chips, keep the original RAS error print format.
v1:
RAS errors may come from different dies during a RAS error query,
therefore, need a new data structure to identify the source of RAS ERROR.
v2:
- use new data structure 'amdgpu_smuio_mcm_config_info' instead of
ras_err_id (in v1 patch)
- refine ras error dump function name
- refine ras error dump log format
Signed-off-by: Yang Wang <kevinyang.wang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-13 11:35:35 -04:00
Asad Kamal
625e5f3851
drm/amdgpu: Expose ras version & schema info
...
Expose ras table version & schema info to sysfs
v2: Updated schema to get poison support info
from ras context, removed asic specific checks
Signed-off-by: Asad Kamal <asad.kamal@amd.com >
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-10-13 11:02:54 -04:00
Stanley.Yang
a83f2bf1f4
drm/amdgpu: Fix false positive error log
...
It should first check block ras obj whether be set, it should
return 0 directly if block ras obj or hw_ops is not set.
If block doesn't support RAS just return 0 is fine.
Changed from V1:
return 0 directly if block ras obj or hw ops is not set
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com >
Reviewed-by: Tao Zhou <tao.zhou1@amd.com >
Signed-off-by: Alex Deucher <alexander.deucher@amd.com >
2023-09-20 16:24:09 -04:00