[ Upstream commit 70982eef4d ]
The ret value might be -EBUSY, caller will think lru lock is still
locked but actually NOT. So return -ENOSPC instead. Otherwise we hit
list corruption.
ttm_bo_cleanup_refs might fail too if BO is not idle. If we return 0,
caller(ttm_tt_populate -> ttm_global_swapout ->ttm_device_swapout) will
be stuck as we actually did not free any BO memory. This usually happens
when the fence is not signaled for a long time.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Fixes: ebd59851c7 ("drm/ttm: move swapout logic around v3")
Link: https://patchwork.freedesktop.org/patch/msgid/20210907040832.1107747-1-xinhui.pan@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
My local syzbot instance hit GPF in ttm_bo_release().
Unfortunately, syzbot didn't produce a reproducer for this, but I
found out possible scenario:
drm_gem_vram_create() <-- drm_gem_vram_object kzalloced
(bo embedded in this object)
ttm_bo_init()
ttm_bo_init_reserved()
ttm_resource_alloc()
man->func->alloc() <-- allocation failure
ttm_bo_put()
ttm_bo_release()
ttm_mem_io_free() <-- bo->resource == NULL passed
as second argument
*GPF*
Added NULL check inside ttm_mem_io_free() to prevent reported GPF and
make this function NULL save in future.
Same problem was in ttm_bo_move_to_lru_tail() as Christian reported.
ttm_bo_move_to_lru_tail() is called in ttm_bo_release() and mem pointer
can be NULL as well as in ttm_mem_io_free().
Fail log:
KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
...
RIP: 0010:ttm_mem_io_free+0x28/0x170 drivers/gpu/drm/ttm/ttm_bo_util.c:66
..
Call Trace:
ttm_bo_release+0xd94/0x10a0 drivers/gpu/drm/ttm/ttm_bo.c:422
kref_put include/linux/kref.h:65 [inline]
ttm_bo_put drivers/gpu/drm/ttm/ttm_bo.c:470 [inline]
ttm_bo_init_reserved+0x7cb/0x960 drivers/gpu/drm/ttm/ttm_bo.c:1050
ttm_bo_init+0x105/0x270 drivers/gpu/drm/ttm/ttm_bo.c:1074
drm_gem_vram_create+0x332/0x4c0 drivers/gpu/drm/drm_gem_vram_helper.c:228
Fixes: d3116756a7 ("drm/ttm: rename bo->mem and make it a pointer")
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210708112518.17271-1-paskripkin@gmail.com
TTM implements a rather extensive accounting of allocated memory.
There are two reasons for this:
1. It tries to block userspace allocating a huge number of very small
BOs without accounting for the kmalloced memory.
2. Make sure we don't over allocate and run into an OOM situation
during swapout while trying to handle the memory shortage.
This is only partially a good idea. First of all it is perfectly
valid for an application to use all of system memory, limiting it to
50% is not really acceptable.
What we need to take care of is that the application is held
accountable for the memory it allocated. This is what control
mechanisms like memcg and the normal Linux page accounting already do.
Making sure that we don't run into an OOM situation while trying to
cope with a memory shortage is still a good idea, but this is also
not very well implemented since it means another opportunity of
recursion from the driver back into TTM.
So start to rework all of this by implementing a shrinker callback which
allows for TT object to be swapped out if necessary.
v2: Switch from limit to shrinker callback.
v3: fix gfp mask handling, use atomic for swapable_pages, add debugfs
v4: drop the extra gfp_mask checks
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210208133226.36955-1-christian.koenig@amd.com
As far as I can tell the buffer_count was never used by an
userspace application.
The number of BOs in the system is far better suited in
debugfs than sysfs and we now should be able to add other
information here as well.
v2: add that additionally to sysfs
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/414954/
Recently a regression was introduced which caused TTM's buffer eviction to
attempt to evict already-pinned BOs, causing issues with buffer eviction
under memory pressure along with suspend/resume:
nouveau 0000:1f:00.0: DRM: evicting buffers...
nouveau 0000:1f:00.0: DRM: Moving pinned object 00000000c428c3ff!
nouveau 0000:1f:00.0: fifo: fault 00 [READ] at 0000000000200000 engine 04
[BAR1] client 07 [HUB/HOST_CPU] reason 02 [PTE] on channel -1 [00ffeaa000
unknown]
nouveau 0000:1f:00.0: fifo: DROPPED_MMU_FAULT 00001000
nouveau 0000:1f:00.0: fifo: fault 01 [WRITE] at 0000000000020000 engine
0c [HOST6] client 07 [HUB/HOST_CPU] reason 02 [PTE] on channel 1
[00ffb28000 DRM]
nouveau 0000:1f:00.0: fifo: channel 1: killed
nouveau 0000:1f:00.0: fifo: runlist 0: scheduled for recovery
[TTM] Buffer eviction failed
nouveau 0000:1f:00.0: DRM: waiting for kernel channels to go idle...
nouveau 0000:1f:00.0: DRM: failed to idle channel 1 [DRM]
nouveau 0000:1f:00.0: DRM: resuming display...
After some bisection and investigation, it appears this resulted from the
recent changes to ttm_bo_move_to_lru_tail(). Previously when a buffer was
pinned, the buffer would be removed from the LRU once ttm_bo_unreserve
to maintain the LRU list when pinning or unpinning BOs. However, since:
commit 3d1a88e105 ("drm/ttm: cleanup LRU handling further")
We've been exiting from ttm_bo_move_to_lru_tail() at the very beginning of
the function if the bo we're looking at is pinned, resulting in the pinned
BO never getting removed from the lru and as a result - causing issues when
it eventually becomes time for eviction.
So, let's fix this by calling ttm_bo_del_from_lru() from
ttm_bo_move_to_lru_tail() in the event that we're dealing with a pinned
buffer.
v2 (chk): reduce to only the fixing one liner since we always want to
call the callback whenever we would move on the LRU.
Fixes: 3d1a88e105 ("drm/ttm: cleanup LRU handling further")
Cc: Dave Airlie <airlied@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210105114505.38210-1-christian.koenig@amd.com
Based on an idea from Dave, but cleaned up a bit.
We had multiple fields for essentially the same thing.
Now bo->base.size is the original size of the BO in
arbitrary units, usually bytes.
bo->mem.num_pages is the size in number of pages in the
resource domain of bo->mem.mem_type.
v2: use the GEM object size instead of the BO size
v3: fix printks in some places
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com> (v1)
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/406831/
Fixes the following W=1 kernel build warning(s):
drivers/gpu/drm/ttm/ttm_bo.c:51: warning: Function parameter or member 'ttm_global_mutex' not described in 'DEFINE_MUTEX'
drivers/gpu/drm/ttm/ttm_bo.c:286: warning: Function parameter or member 'bo' not described in 'ttm_bo_cleanup_memtype_use'
drivers/gpu/drm/ttm/ttm_bo.c:359: warning: Function parameter or member 'bo' not described in 'ttm_bo_cleanup_refs'
drivers/gpu/drm/ttm/ttm_bo.c:359: warning: Function parameter or member 'interruptible' not described in 'ttm_bo_cleanup_refs'
drivers/gpu/drm/ttm/ttm_bo.c:359: warning: Function parameter or member 'no_wait_gpu' not described in 'ttm_bo_cleanup_refs'
drivers/gpu/drm/ttm/ttm_bo.c:359: warning: Function parameter or member 'unlock_resv' not described in 'ttm_bo_cleanup_refs'
drivers/gpu/drm/ttm/ttm_bo.c:424: warning: Function parameter or member 'bdev' not described in 'ttm_bo_delayed_delete'
drivers/gpu/drm/ttm/ttm_bo.c:424: warning: Function parameter or member 'remove_all' not described in 'ttm_bo_delayed_delete'
drivers/gpu/drm/ttm/ttm_bo.c:635: warning: Function parameter or member 'bo' not described in 'ttm_bo_evict_swapout_allowable'
drivers/gpu/drm/ttm/ttm_bo.c:635: warning: Function parameter or member 'ctx' not described in 'ttm_bo_evict_swapout_allowable'
drivers/gpu/drm/ttm/ttm_bo.c:635: warning: Function parameter or member 'locked' not described in 'ttm_bo_evict_swapout_allowable'
drivers/gpu/drm/ttm/ttm_bo.c:635: warning: Function parameter or member 'busy' not described in 'ttm_bo_evict_swapout_allowable'
drivers/gpu/drm/ttm/ttm_bo.c:769: warning: Function parameter or member 'bo' not described in 'ttm_bo_add_move_fence'
drivers/gpu/drm/ttm/ttm_bo.c:769: warning: Function parameter or member 'man' not described in 'ttm_bo_add_move_fence'
drivers/gpu/drm/ttm/ttm_bo.c:769: warning: Function parameter or member 'mem' not described in 'ttm_bo_add_move_fence'
drivers/gpu/drm/ttm/ttm_bo.c:769: warning: Function parameter or member 'no_wait_gpu' not described in 'ttm_bo_add_move_fence'
drivers/gpu/drm/ttm/ttm_bo.c:806: warning: Function parameter or member 'bo' not described in 'ttm_bo_mem_force_space'
drivers/gpu/drm/ttm/ttm_bo.c:806: warning: Function parameter or member 'place' not described in 'ttm_bo_mem_force_space'
drivers/gpu/drm/ttm/ttm_bo.c:806: warning: Function parameter or member 'mem' not described in 'ttm_bo_mem_force_space'
drivers/gpu/drm/ttm/ttm_bo.c:806: warning: Function parameter or member 'ctx' not described in 'ttm_bo_mem_force_space'
drivers/gpu/drm/ttm/ttm_bo.c:872: warning: Function parameter or member 'bo' not described in 'ttm_bo_mem_space'
drivers/gpu/drm/ttm/ttm_bo.c:872: warning: Function parameter or member 'placement' not described in 'ttm_bo_mem_space'
drivers/gpu/drm/ttm/ttm_bo.c:872: warning: Function parameter or member 'mem' not described in 'ttm_bo_mem_space'
drivers/gpu/drm/ttm/ttm_bo.c:872: warning: Function parameter or member 'ctx' not described in 'ttm_bo_mem_space'
drivers/gpu/drm/ttm/ttm_bo.c:1387: warning: Function parameter or member 'ctx' not described in 'ttm_bo_swapout'
Reviewed-by: Christian König <christian.koenig@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: dri-devel@lists.freedesktop.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20201116174112.1833368-32-lee.jones@linaro.org
Currently drivers get called to move a buffer, but if they have to
move it temporarily through another space (SYSTEM->VRAM via TT)
then they can end up with a lot of ttm->driver->ttm call stacks,
if the temprorary space moves requires eviction.
Instead of letting the driver do all the placement/space for the
temporary, allow it to report back (-EMULTIHOP) and a placement (hop)
to the move code, which will then do the temporary move, and the
correct placement move afterwards.
This removes a lot of code from drivers, at the expense of
adding some midlayering. I've some further ideas on how to turn
it inside out, but I think this is a good solution to the call
stack problems.
v2: separate out the driver patches, add WARN for getting
MULTHOP in paths we shouldn't (Daniel)
v3: use memset (Christian)
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: hristian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20201109005432.861936-2-airlied@gmail.com