Yang Wang
9262f411dc
drm/amdgpu: skip to create ras xxx_err_count node when ACA is enabled
...
Skip creating the 'xxx_err_count' node when ACA is enabled.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23 15:08:25 -04:00
Yang Wang
062a7ce676
drm/amdgpu: fix ACA no query result after gpu reset
...
Fix ACA returning no query result after GPU reset.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-17 17:40:39 -04:00
Yang Wang
b712d7c201
drm/amdgpu: fix compiler 'side-effect' check issue for RAS_EVENT_LOG()
...
Create a new helper function to avoid the compiler 'side-effect'
check on the RAS_EVENT_LOG() macro.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-17 17:40:37 -04:00
Ma Jun
4c11d30c95
drm/amdgpu: Fix the null pointer dereference to ras_manager
...
Check ras_manager before using it.
Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-17 17:40:36 -04:00
Ma Jun
01b3297336
drm/amdgpu: Remove dead code in amdgpu_ras_add_mca_err_addr
...
Remove dead code in amdgpu_ras_add_mca_err_addr
Signed-off-by: Ma Jun <Jun.Ma2@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-17 17:09:55 -04:00
YiPeng Chai
2b3b9d2150
drm/amdgpu: change log level
...
Change log level.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-08 15:17:05 -04:00
Yang Wang
329cec8f18
drm/amdgpu: fix RAS unload driver issue in SRIOV
...
Fix a null pointer issue when unloading the driver in SRIOV mode.
Adjust the function position to ensure that the amdgpu_mca/aca_xxx_init()
related functions can be initialized properly.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-08 15:17:05 -04:00
Hawking Zhang
1dbd59f3f4
drm/amdgpu: Add psp v13_0_14 ip block
...
Add psp v13_0_14 ip block support.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02 15:49:05 -04:00
YiPeng Chai
3ca73073f4
drm/amdgpu: Remove redundant function call
...
Remove redundant function call.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-30 09:59:14 -04:00
Yang Wang
76ad30f51a
drm/amdgpu: add MCA smu cache support
...
v1:
Because a valid SMU CE MCA bank is cleared after it is read,
this patch adds an MCA cache at the driver level so that the MCA bank is not lost.
v2:
Refine the amdgpu_mca_init/fini/reset() function names.
v3:
Add mca_cache.lock support.
Only add CE banks to the MCA bank cache.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-30 09:58:41 -04:00
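The caching idea in the commit above can be sketched in plain userspace C. This is a minimal illustration under stated assumptions, not the driver's code: `fw_banks`, `bank_cache`, `fw_read_bank()`, `cache_refill()` and the capacity `CACHE_MAX` are all hypothetical names, and the locking added in v3 is left out for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_MAX 8 /* hypothetical capacity, chosen only for the sketch */

enum bank_type { BANK_CE, BANK_UE };

struct bank {
    enum bank_type type;
    uint64_t status;
};

/* Hypothetical firmware-side store: reading a valid CE bank clears it. */
struct fw_banks {
    struct bank slots[CACHE_MAX];
    bool valid[CACHE_MAX];
};

/* Driver-side cache that keeps CE banks across destructive reads. */
struct bank_cache {
    struct bank banks[CACHE_MAX];
    size_t count;
};

/* Read one bank; a CE bank becomes invalid as a side effect of the read. */
bool fw_read_bank(struct fw_banks *fw, size_t idx, struct bank *out)
{
    if (idx >= CACHE_MAX || !fw->valid[idx])
        return false;
    *out = fw->slots[idx];
    if (out->type == BANK_CE)
        fw->valid[idx] = false;
    return true;
}

/* Read every valid bank once and keep only CE banks in the cache.
 * Returns the number of banks added by this call. */
size_t cache_refill(struct bank_cache *c, struct fw_banks *fw)
{
    size_t added = 0;
    struct bank b;

    for (size_t i = 0; i < CACHE_MAX; i++) {
        if (!fw_read_bank(fw, i, &b))
            continue;
        if (b.type != BANK_CE || c->count >= CACHE_MAX)
            continue; /* non-CE banks are not cached; a full cache drops */
        c->banks[c->count++] = b;
        added++;
    }
    return added;
}
```

Later queries would then be served from `bank_cache` rather than from the firmware, so a bank that has already been consumed by one reader is still visible to the next.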
YiPeng Chai
6f3b69139c
drm/amdgpu: Fix ras mode2 reset failure in ras aca mode
...
Fix ras mode2 reset failure in ras aca mode.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:43 -04:00
YiPeng Chai
48fa90718b
drm/amdgpu: Use new interface to reserve bad page
...
Use new interface to reserve bad page.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:43 -04:00
YiPeng Chai
bcc0934885
drm/amdgpu: Fix address translation defect
...
retired_page is a page frame number and should be expanded
to the full address when querying status.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:42 -04:00
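The conversion described in the commit above amounts to a shift between a page frame number and a byte address. The sketch below assumes 4 KiB pages (a shift of 12); the macro and function names are hypothetical and are not taken from the driver.

```c
#include <stdint.h>

/* Assumed page size of 4 KiB; the real shift depends on the platform. */
#define SKETCH_PAGE_SHIFT 12

/* Expand a page frame number to the address of the first byte in the page. */
uint64_t pfn_to_addr(uint64_t pfn)
{
    return pfn << SKETCH_PAGE_SHIFT;
}

/* Reduce a byte address to the page frame number that contains it. */
uint64_t addr_to_pfn(uint64_t addr)
{
    return addr >> SKETCH_PAGE_SHIFT;
}
```

Passing a page frame number where a full address is expected makes the lookup land on an unrelated page, which is the kind of defect the commit title refers to.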
YiPeng Chai
370fbff4cc
drm/amdgpu: add poison consumption handler
...
Add poison consumption handler.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:42 -04:00
YiPeng Chai
2cf8e50ec3
drm/amdgpu: Add delay work to retire bad pages
...
Add delay work to retire bad pages.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
95b4063de4
drm/amdgpu: add interface to update umc v12_0 ecc status
...
Add interface to update umc v12_0 ecc status.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
a734adfbcd
drm/amdgpu: add poison creation handler
...
Add poison creation handler.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
f493dd64ee
drm/amdgpu: prepare for logging ecc errors
...
Prepare for logging ecc errors.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
98b5bc878d
drm/amdgpu: add message fifo to handle RAS poison events
...
Add message fifo to handle RAS poison events.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
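A message FIFO of the kind named in the commit above lets an interrupt path queue poison events for later processing. The following is a generic fixed-size ring buffer in userspace C, offered only as a sketch: `poison_msg`, its fields, `FIFO_SIZE`, and both functions are hypothetical, and the sketch is single-threaded with no locking.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_SIZE 4 /* hypothetical depth; must be a power of two */

/* Hypothetical message layout, invented for this sketch. */
struct poison_msg {
    uint32_t block;
    uint16_t pasid;
};

struct poison_fifo {
    struct poison_msg slots[FIFO_SIZE];
    unsigned int head; /* total messages ever queued */
    unsigned int tail; /* total messages ever dequeued */
};

/* Queue one message; returns false when the FIFO is full. */
bool fifo_put(struct poison_fifo *f, struct poison_msg m)
{
    if (f->head - f->tail == FIFO_SIZE)
        return false;
    f->slots[f->head++ % FIFO_SIZE] = m;
    return true;
}

/* Dequeue the oldest message; returns false when the FIFO is empty. */
bool fifo_get(struct poison_fifo *f, struct poison_msg *out)
{
    if (f->head == f->tail)
        return false;
    *out = f->slots[f->tail++ % FIFO_SIZE];
    return true;
}
```

The producer only touches `head` and the consumer only touches `tail`, which is what makes this layout a common starting point for handing work from an event source to a worker.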
YiPeng Chai
af730e0820
drm/amdgpu: Add interface to reserve bad page
...
Add interface to reserve bad page.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:39 -04:00
Lijo Lazar
b41f742d6f
drm/amdgpu: Set fatal error detected flag earlier
...
In case of fatal errors, set FED status when interrupt is received. Set
the flag on other devices in the hive before RAS recovery work.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:13:36 -04:00
Yang Wang
31fd330b97
drm/amdgpu: add ras event id support for ACA
...
Add ras event id support for ACA.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:48:18 -04:00
Yang Wang
865d339763
drm/amdgpu: add aca deferred error type support
...
Add aca deferred error type support.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Tao Zhou
2fc46e0b2f
drm/amdgpu: make reset method configurable for RAS poison
...
Each RAS block has a different requirement for gpu reset in poison
consumption handling.
Add support for mmhub RAS poison consumption handling.
v2: remove the mmhub poison support for kfd int v10.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
9dc57c2adf
drm/amdgpu: add ras event id support
...
Add amdgpu ras event id support to better distinguish different
error information sources in dmesg logs.
The following logs will be identified by event id:
{event_id} interrupt to inform RAS event
{event_id} ACA logs
{event_id} error statistics since the current injection/error query
{event_id} error statistics since gpu load
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:13 -04:00
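The `{event_id}` prefix listed in the commit above can be illustrated with a small formatter. This is a userspace sketch under assumptions: `acquire_event_id()` and `format_ras_event()` are hypothetical helpers, the counter is a plain non-atomic integer, and the exact format string used by the driver is not shown in this log.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hand out increasing ids so that log lines from one event share a tag.
 * Not thread-safe; a real driver would need an atomic counter. */
uint64_t acquire_event_id(void)
{
    static uint64_t seq;
    return ++seq;
}

/* Write "{id} message" into buf; returns what snprintf() returns. */
int format_ras_event(char *buf, size_t len, uint64_t event_id,
                     const char *msg)
{
    return snprintf(buf, len, "{%" PRIu64 "} %s", event_id, msg);
}
```

With every line from one event carrying the same id, a reader of dmesg can group the interrupt notice, the ACA logs, and the error statistics that belong together.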
Stanley.Yang
7ec11c2f65
drm/amdgpu: Fix ineffective ras_mask settings
...
Check amdgpu_ras_mask to fix the ineffective ras_mask setting
on special ASICs that do not have SRAM ECC enabled but do
support poison.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-26 11:14:37 -05:00
Lijo Lazar
1b6ef74b2b
drm/amdgpu: Add fatal error detected flag
...
For a RAS error that needs a full reset to recover, set the fatal error
status. Clear the status once the device is reset.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-26 11:14:24 -05:00
Tao Zhou
edfdde9013
drm/amdgpu: disable RAS feature when fini
...
Send RAS disable feature command in fini.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-31 14:05:18 -05:00
Hawking Zhang
1731ba9b64
drm/amdgpu: Update boot time errors polling sequence
...
Update boot time errors polling sequence to align with
the latest firmware change.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Frank Min <Frank.Min@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-31 14:04:55 -05:00
YiPeng Chai
ed1e1e42fd
drm/amdgpu: Support passing poison consumption ras block to SRIOV
...
Support passing poison consumption ras blocks
to SRIOV.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25 14:58:03 -05:00
Yang Wang
c0c48f0d61
drm/amdgpu: adjust aca init/fini sequence to match gpu reset
...
- move aca init/fini function into ras init/fini to adapt gpu reset
sequence.
- add new function amdgpu_aca_reset()
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25 14:58:02 -05:00
Mukul Joshi
c84a7e21db
drm/amdgpu: Fix module unload hang with RAS enabled
...
The driver unload hangs because the page retirement
kthread cannot be stopped while it is sleeping and waiting
for a page retirement event to occur. Add kthread_should_stop()
to the event condition to wake up the kthread when kthread
stop is called during driver unload.
Fixes: 3fdcd0a31d ("drm/amdgpu: Prepare for asynchronous processing of umc page retirement")
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25 14:57:52 -05:00
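The hang described in the commit above comes from a wait condition that only checks for work. A userspace analogue with POSIX threads is sketched below; it is not kernel code, and `retire_worker` with its functions is hypothetical. The `stop` flag plays the role that kthread_should_stop() plays in the commit.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the page retirement thread state. */
struct retire_worker {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t wake;
    int pending;  /* queued retirement events */
    int handled;  /* events processed so far */
    bool stop;    /* analogue of kthread_should_stop() */
};

static void *retire_thread(void *arg)
{
    struct retire_worker *w = arg;

    pthread_mutex_lock(&w->lock);
    for (;;) {
        /* Wake on a queued event OR a stop request. Waiting only on
         * "pending != 0" would sleep forever during shutdown. */
        while (w->pending == 0 && !w->stop)
            pthread_cond_wait(&w->wake, &w->lock);
        if (w->stop)
            break;
        w->handled += w->pending;
        w->pending = 0;
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

int retire_worker_start(struct retire_worker *w)
{
    w->pending = 0;
    w->handled = 0;
    w->stop = false;
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->wake, NULL);
    return pthread_create(&w->thread, NULL, retire_thread, w);
}

/* Queue one event and wake the worker. */
void retire_worker_post(struct retire_worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->pending++;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
}

/* Request a stop, wake the worker, and wait for it to exit. */
void retire_worker_stop(struct retire_worker *w)
{
    pthread_mutex_lock(&w->lock);
    w->stop = true;
    pthread_cond_signal(&w->wake);
    pthread_mutex_unlock(&w->lock);
    pthread_join(w->thread, NULL);
}
```

Because the stop flag is part of the wait condition, `retire_worker_stop()` returns even when no event ever arrives.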
Yang Wang
2866a4549c
drm/amdgpu: skip call ras_late_init if ras block is not supported
...
Skip calling the ras_late_init callback if the ras block is not supported.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:28 -05:00
YiPeng Chai
0795b5d234
drm/amdgpu: Support retiring multiple MCA error address pages
...
Support retiring multiple MCA error address pages in
one in-band query for umc v12_0.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
YiPeng Chai
6c23f3d12a
drm/amdgpu: Use asynchronous polling to handle umc_v12_0 poisoning
...
Use asynchronous polling to handle umc_v12_0 poisoning.
v2:
1. Change function name.
2. Change the debugging information content.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
Stanley.Yang
ee9c3031d0
drm/amdgpu: Fix ras features value calltrace
...
The high three bits of the ras features mask indicate the socket
id, so the check of the ras features mask should skip those
three bits before disabling all ras features.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
YiPeng Chai
3fdcd0a31d
drm/amdgpu: Prepare for asynchronous processing of umc page retirement
...
Prepare for asynchronous processing of umc page retirement.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
Stanley.Yang
2c7a1560e8
drm/amdgpu: Show deferred error count for UMC
...
Show deferred error count for the UMC sysfs node.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-18 15:47:07 -05:00
Yang Wang
7ed97155b2
drm/amdgpu: fix UBSAN array-index-out-of-bounds for ras_block_string[]
...
Fix an array index out of bounds issue for the ras_block_string[] array.
Fixes: 30df05fb74 ("drm/amdgpu: Align ras block enum with firmware")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-18 15:46:07 -05:00
Hawking Zhang
4e2965bd3b
drm/amdgpu: Centralize ras cap query to amdgpu_ras_check_supported
...
Move the ras capability check to amdgpu_ras_check_supported.
The driver will query ras capability through the psp interface, the
vbios interface, or specific ip callbacks.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:39 -05:00
Candice Li
46e2231ce0
drm/amdgpu: Log deferred error separately
...
Separate deferred error from UE and CE and log it
individually.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:37 -05:00
Yang Wang
37973b69ea
drm/amdgpu: add aca sysfs support
...
add aca sysfs node support
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:36 -05:00
Yang Wang
04c4fcd263
drm/amdgpu: add amdgpu ras aca query interface
...
v1:
add ACA error query interface
v2:
Add a new helper function to determine whether to use ACA or MCA.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:36 -05:00
Yang Wang
33dcda51e9
drm/amdgpu: add ACA bank dump debugfs support
...
add ACA bank dump debugfs support
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:35 -05:00
Hawking Zhang
cce4febb27
drm/amdgpu: Add ras helper to query boot errors v2
...
Add ras helper function to query boot time gpu
errors.
v2: use aqua_vanjaram smn addressing pattern
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:35 -05:00
Hawking Zhang
73cb81dc54
drm/amdgpu: Packed socket_id to ras feature mask
...
Initialize RAS feature mask bit[31:29] with socket_id.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:44:13 -05:00
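Packing a socket id into bits [31:29] of a 32-bit feature mask, as the commit above describes, can be sketched as follows. The macro and function names are hypothetical; only the bit positions come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define SOCKET_ID_SHIFT 29
#define SOCKET_ID_MASK  (0x7u << SOCKET_ID_SHIFT) /* bits [31:29] */

/* Store a 3-bit socket id in the top bits, leaving feature bits intact. */
uint32_t pack_socket_id(uint32_t features, uint32_t socket_id)
{
    assert(socket_id < 8); /* only three bits are available */
    return (features & ~SOCKET_ID_MASK) | (socket_id << SOCKET_ID_SHIFT);
}

/* Recover the socket id from a packed mask. */
uint32_t get_socket_id(uint32_t mask)
{
    return (mask & SOCKET_ID_MASK) >> SOCKET_ID_SHIFT;
}

/* Recover the feature bits with the socket id stripped off. */
uint32_t get_features(uint32_t mask)
{
    return mask & ~SOCKET_ID_MASK;
}
```

Any code that tests whether features are still enabled has to compare `get_features(mask)` rather than the raw mask, since a nonzero socket id would otherwise look like an enabled feature. The "Fix ras features value calltrace" commit listed earlier in this log addresses that case.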
Candice Li
fb1e917199
drm/amdgpu: Support poison error injection via ras_ctrl debugfs
...
Support poison error injection.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:44:13 -05:00
Candice Li
90bd01471d
drm/amdgpu: Drop unnecessary sentences about CE and deferred error.
...
Remove "no user action is needed" for correctable and deferred errors
to avoid confusion.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:43:54 -05:00
Hawking Zhang
6697dbf0af
Revert "drm/amdgpu: enable mca debug mode on APU by default"
...
Not needed any more with firmware fixes.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-05 16:04:36 -05:00
Srinivasan Shanmugam
b8d55a90fd
drm/amdgpu: Fix possible NULL dereference in amdgpu_ras_query_error_status_helper()
...
Return the error code -EINVAL for an invalid block id.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1183 amdgpu_ras_query_error_status_helper() error: we previously assumed 'info' could be null (see line 1176)
Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-03 11:16:06 -05:00
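The pattern behind the fix above, validating a lookup result before dereferencing it and returning -EINVAL otherwise, can be sketched generically. Everything here is hypothetical (`block_table`, `find_block()`, `query_ce_count()`, the enum values); it illustrates the guard and does not reproduce amdgpu_ras_query_error_status_helper().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical block ids and per-block state for the sketch. */
enum { BLOCK_UMC, BLOCK_GFX, BLOCK_COUNT };

struct block_obj {
    unsigned long ce_count;
};

/* Registered blocks; an unregistered slot stays NULL. */
struct block_obj *block_table[BLOCK_COUNT];

/* Returns NULL for an out-of-range id or an unregistered block. */
static struct block_obj *find_block(unsigned int id)
{
    return id < BLOCK_COUNT ? block_table[id] : NULL;
}

/* Reject invalid ids and NULL pointers before any dereference. */
int query_ce_count(unsigned int id, unsigned long *out)
{
    struct block_obj *obj = find_block(id);

    if (!obj || !out)
        return -EINVAL;
    *out = obj->ce_count;
    return 0;
}
```

Returning an error code at the point where the lookup fails keeps the NULL check and the dereference on the same path, which is what static checkers of the kind quoted in the commit look for.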