Yang Wang
671af06690
drm/amdgpu: remove RAS unused parameter 'err_addr'
...
- amdgpu_ras_error_statistic_ue_count()
- amdgpu_ras_error_statistic_ce_count()
- amdgpu_ras_error_statistic_de_count()
The parameter 'err_addr' is no longer used since the following patch.
Fixes: a7e8467fbe ("drm/amdgpu: Remove unused code")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-06 11:11:01 -04:00
YiPeng Chai
56631dee29
drm/amdgpu: optimize logging deferred error info
...
1. Use pa_pfn as the radix-tree key index to log
deferred error info.
2. Use local array to store a row of bad pages.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23 17:32:14 -04:00
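The pa_pfn keying described in the commit above can be illustrated with a small user-space sketch. This is not the driver code: a linear array stands in for the kernel radix tree, and the 4 KiB page shift, the `ecc_log` structure, and the helper names are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12u /* assumption: 4 KiB pages */
#define MAX_LOGGED    64u /* hypothetical capacity for this sketch */

/* Hypothetical log keyed by page frame number (pfn). */
struct ecc_log {
    uint64_t pfn[MAX_LOGGED];
    size_t count;
};

/* Derive the key: every address inside one page maps to the same pfn. */
static uint64_t pa_to_pfn(uint64_t pa)
{
    return pa >> PAGE_SHIFT_4K;
}

/* Returns true if the page was newly logged, false if it was
 * already present or the log is full. */
static bool ecc_log_add(struct ecc_log *log, uint64_t pa)
{
    uint64_t key = pa_to_pfn(pa);

    for (size_t i = 0; i < log->count; i++)
        if (log->pfn[i] == key)
            return false;
    if (log->count == MAX_LOGGED)
        return false;
    log->pfn[log->count++] = key;
    return true;
}
```

Keying by pfn rather than by full address means repeated errors within one page collapse into a single entry, which is the property the sketch is meant to show.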
YiPeng Chai
27cdf8c3ca
drm/amdgpu: optimize umc v12 address conversion function
...
Split into 3 parts:
1. Convert soc physical address via ras ta.
2. Expand bad pages from soc physical address.
3. Dump bad address info.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23 17:31:59 -04:00
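Step 2 of the commit above (expanding bad pages from one soc physical address) might look roughly like the following sketch. The two column bit positions and the four-pages-per-row count are hypothetical values picked for illustration, not the real UMC v12 address layout, and `expand_row` is an invented name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address bits that vary between pages of one physical row. */
#define COL_BIT_LO    15u
#define COL_BIT_HI    21u
#define PAGES_PER_ROW 4u

/* Fill a caller-provided local array with every page address that
 * shares the row of 'pa'. Returns the number of entries written. */
static size_t expand_row(uint64_t pa, uint64_t out[PAGES_PER_ROW])
{
    /* Clear the varying bits to get the first page of the row. */
    uint64_t base = pa & ~((1ULL << COL_BIT_LO) | (1ULL << COL_BIT_HI));

    for (unsigned int i = 0; i < PAGES_PER_ROW; i++) {
        uint64_t lo = (uint64_t)(i & 1u) << COL_BIT_LO;
        uint64_t hi = (uint64_t)((i >> 1) & 1u) << COL_BIT_HI;

        out[i] = base | lo | hi;
    }
    return PAGES_PER_ROW;
}
```

Writing into a fixed-size local array keeps the whole row together so that later steps, such as dumping or retiring the pages, can walk it without extra allocation.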
YiPeng Chai
e23300dfff
drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed
...
The problem case is as follows:
1. GPU A triggers a gpu ras reset, and GPU A drives
GPU B to also perform a gpu ras reset.
2. After GPU B's ras reset started, GPU B queried
DE data. Since the DE data was queried in the ras reset
thread instead of the page retirement thread, bad
page retirement work was not triggered. Even after
all gpu resets are completed, the bad pages
stay cached in RAM until GPU B's bad page retirement
work is triggered again and saves them to eeprom.
This patch saves the bad pages to eeprom promptly after the gpu
ras reset is completed.
v2:
1. Add the above description to code comments.
2. Reuse existing function.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-10 10:13:41 -04:00
YiPeng Chai
78146c1dcd
drm/amdgpu: add variable to record the deferred error number read by driver
...
Add variable to record the deferred error
number read by driver.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-27 17:31:20 -04:00
YiPeng Chai
2b3b9d2150
drm/amdgpu: change log level
...
Change log level.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-08 15:17:05 -04:00
YiPeng Chai
2c0410fbee
drm/amdgpu: Remove unused code
...
Remove unused code.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-30 09:59:08 -04:00
YiPeng Chai
e023874081
drm/amdgpu: support ACA logging ecc errors
...
support ACA logging ecc errors.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:42 -04:00
YiPeng Chai
314c38cde6
drm/amdgpu: retire bad pages for umc v12_0
...
Retire bad pages for umc v12_0.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:42 -04:00
YiPeng Chai
f27defca68
drm/amdgpu: umc v12_0 logs ecc errors
...
1. umc v12_0 logs ecc errors.
2. Reserve newly detected ecc error pages.
3. Add tag for bad pages, so that they can
be retired later.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
b2aa6b108d
drm/amdgpu: umc v12_0 converts error address
...
Umc v12_0 converts error address.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
YiPeng Chai
95b4063de4
drm/amdgpu: add interface to update umc v12_0 ecc status
...
Add interface to update umc v12_0 ecc status.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-26 17:22:41 -04:00
Tao Zhou
4b0cb230bd
drm/amdgpu: retire UMC v12 mca_addr_to_pa
...
RAS TA will handle it, so the function is no longer needed.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09 22:09:15 -04:00
Tao Zhou
8e4617c25d
drm/amdgpu: simplify convert_error_address interface for UMC v12
...
Replace separate parameters with struct ta_ras_query_address_input.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:56:18 -04:00
Tao Zhou
8b3495eafb
drm/amdgpu: add socket id parameter for psp query address cmd
...
And set the socket id.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:54:54 -04:00
Yang Wang
f7bcfb7a56
drm/amdgpu: retrieve umc odecc error count for aca umc v12.0
...
retrieve umc odecc error count for aca umc v12.0
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-22 15:48:03 -04:00
Yang Wang
b93d759f54
drm/amdgpu: add umc v12.0.0 deferred error support
...
add umc v12.0.0 deferred error support.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
e3d4de8d8b
drm/amdgpu: retire unused aca_bank_report data structure
...
retire unused aca_bank_report data structure.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
69bf42fbb2
drm/amdgpu: refine aca error cache for umc v12.0
...
refine aca error cache for umc v12.0
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:15 -04:00
Yang Wang
abc3b5d21d
drm/amdgpu: add new aca_smu_type support
...
Add new types to distinguish between ACA error type and smu mca type.
e.g.:
ACA_ERROR_TYPE_DEFERRED does not match any valid smu mca bank
channel, so add the new type 'aca_smu_type' to distinguish the aca error
type from the smu mca type.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:14 -04:00
Yang Wang
9dc57c2adf
drm/amdgpu: add ras event id support
...
add amdgpu ras event id support to better distinguish different
error information sources in dmesg logs.
the following logs will be identified by event id:
{event_id} interrupt to inform RAS event
{event_id} ACA logs
{event_id} errors statistic since from current injection/error query
{event_id} errors statistic since from gpu load
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20 13:38:13 -04:00
Tao Zhou
2c684b9342
drm/amdgpu: add deferred error check for UMC v12 address query
...
Both RAS UE and deferred errors need page retirement.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-29 20:35:14 -05:00
Tao Zhou
01087a1974
drm/amdgpu: use PSP address query command
...
Get UMC physical address from PSP in RAS error address conversion.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-31 14:05:19 -05:00
YiPeng Chai
0795b5d234
drm/amdgpu: Support retiring multiple MCA error address pages
...
Support retiring multiple MCA error address pages in
one in-band query for umc v12_0.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
YiPeng Chai
afb617f38f
drm/amdgpu: add interface to check mca umc status
...
Add interface to check mca umc status.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
YiPeng Chai
22f6e3e112
drm/amdgpu: Add log info for umc_v12_0
...
Add log info for umc_v12_0.
v2:
Delete redundant logs.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-22 17:13:25 -05:00
Tao Zhou
a9e4f61df1
drm/amdgpu: update error condition check for umc_v12_0_query_error_address
...
Deferred error is also taken into account.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-18 15:47:24 -05:00
Candice Li
46e2231ce0
drm/amdgpu: Log deferred error separately
...
Separate deferred error from UE and CE and log it
individually.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:37 -05:00
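The separation of deferred errors from UE and CE in the commit above can be sketched as a status-word classifier. This is a simplified illustration, not the driver's logic: the bit positions are assumed from the commonly documented AMD MCA status layout, the check order is a choice made for this sketch, and the replay-mode and poison-mode conditions mentioned in other commits are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed MCA status bit positions (not taken from the driver source). */
#define ST_VAL      (1ULL << 63) /* status word is valid */
#define ST_UC       (1ULL << 61) /* uncorrected error */
#define ST_PCC      (1ULL << 57) /* processor context corrupt */
#define ST_TCC      (1ULL << 55) /* task context corrupt */
#define ST_CECC     (1ULL << 46) /* corrected by ECC */
#define ST_DEFERRED (1ULL << 44) /* deferred error */

enum err_kind { ERR_NONE, ERR_CE, ERR_UE, ERR_DE };

/* Classify one status word; deferred errors get their own bucket
 * instead of being folded into UE or CE. */
static enum err_kind classify(uint64_t status)
{
    if (!(status & ST_VAL))
        return ERR_NONE;
    if (status & ST_DEFERRED)
        return ERR_DE;
    if (status & (ST_PCC | ST_UC | ST_TCC))
        return ERR_UE;
    if (status & ST_CECC)
        return ERR_CE;
    return ERR_NONE;
}
```

A caller could keep three separate counters and bump the one matching the returned kind, which mirrors the idea of logging deferred errors individually.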
Yang Wang
f38765de83
drm/amdgpu: add umc v12.0 ACA support
...
add umc v12.0 ACA driver support
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:36 -05:00
YiPeng Chai
99cab331a4
drm/amdgpu: Add umc page retirement for umc v12_0
...
Add umc page retirement for umc v12_0.
V2:
1. Changed umc page retirement check condition
to call umc_v12_0_is_uncorrectable_error.
2. Use memset to clear the contents of the umc
error address structure.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-19 14:59:03 -05:00
YiPeng Chai
a8c77a121c
drm/amdgpu: Add poison mode check error condition for umc v12_0
...
Add poison mode check error condition for umc v12_0.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-19 14:59:03 -05:00
YiPeng Chai
9f91e983ee
drm/amdgpu: MCA supports recording umc address information
...
MCA supports recording umc address information.
V2:
Move err_addr variable from struct ras_err_node to
struct ras_err_info.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-19 14:59:03 -05:00
Yang Wang
bf13da6ae1
drm/amdgpu: correct smu v13.0.6 umc ras error check
...
correct smu v13.0.6 umc ras error check
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09 17:01:20 -05:00
Candice Li
e020d01575
drm/amdgpu: Drop deferred error in uncorrectable error check
...
Drop checking deferred error which can be handled by poison
consumption.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31 16:40:15 -04:00
Candice Li
d59fcfb084
drm/amdgpu: Identify data parity error corrected in replay mode
...
Use ErrorCodeExt field to identify data parity error in replay mode.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-27 14:15:03 -04:00
Candice Li
afcf949cf3
drm/amdgpu: Log UE corrected by replay as correctable error
...
Support replay mode where UE could be converted to CE.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:26 -04:00
Yang Wang
3bba4bc6a0
drm/amdgpu: add RAS error info support for umc_v12_0
...
add RAS error info support for umc_v12_0.
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-13 11:36:11 -04:00
Tao Zhou
f8754f58d6
drm/amdgpu: print channel index for UMC bad page
...
Print channel index for UMC v12.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 16:25:17 -04:00
Tao Zhou
ced575203a
drm/amdgpu: print more address info of UMC bad page
...
Print out row, column and bank value of UMC error address for UMC v12.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:15:15 -04:00
Tao Zhou
3cb9ebc9d6
drm/amdgpu: add channel index table for UMC v12
...
Get UMC physical channel index according to node id, umc instance and
channel instance.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:10:58 -04:00
Tao Zhou
40a08fe890
drm/amdgpu: add address conversion for UMC v12
...
Convert MCA error address to physical address and find out all pages in
one physical row.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:10:35 -04:00
Candice Li
7e6ec09974
drm/amdgpu: Add umc v12_0 ras functions
...
Add umc v12_0 ras error querying.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-06 14:38:00 -04:00