Only call synchronize_rcu_expedited() after unlocking struct_mutex to
avoid a deadlock, as the workqueues that the RCU wait relies on may
themselves be blocked on struct_mutex.
From the original patch by Andrea:
synchronize_rcu/synchronize_sched/synchronize_rcu_expedited() will
hang until its own workqueues are run. The i915 gem workqueues will
wait on the struct_mutex to be released. So we cannot wait for a
quiescent state using those rcu primitives while holding the
struct_mutex or it creates a circular lock dependency resulting in
kernel hangs (which is reproducible but goes undetected by lockdep).
kswapd0 D 0 700 2 0x00000000
Call Trace:
? __schedule+0x1a5/0x660
? schedule+0x36/0x80
? _synchronize_rcu_expedited.constprop.65+0x2ef/0x300
? wake_up_bit+0x20/0x20
? rcu_stall_kick_kthreads.part.54+0xc0/0xc0
? rcu_exp_wait_wake+0x530/0x530
? i915_gem_shrink+0x34b/0x4b0
? i915_gem_shrinker_scan+0x7c/0x90
? i915_gem_shrinker_scan+0x7c/0x90
? shrink_slab.part.61.constprop.72+0x1c1/0x3a0
? shrink_zone+0x154/0x160
? kswapd+0x40a/0x720
? kthread+0xf4/0x130
? try_to_free_pages+0x450/0x450
? kthread_create_on_node+0x40/0x40
? ret_from_fork+0x23/0x30
plasmashell D 0 4657 4614 0x00000000
Call Trace:
? __schedule+0x1a5/0x660
? schedule+0x36/0x80
? schedule_preempt_disabled+0xe/0x10
? __mutex_lock.isra.4+0x1c9/0x790
? i915_gem_close_object+0x26/0xc0
? i915_gem_close_object+0x26/0xc0
? drm_gem_object_release_handle+0x48/0x90
? drm_gem_handle_delete+0x50/0x80
? drm_ioctl+0x1fa/0x420
? drm_gem_handle_create+0x40/0x40
? pipe_write+0x391/0x410
? __vfs_write+0xc6/0x120
? do_vfs_ioctl+0x8b/0x5d0
? SyS_ioctl+0x3b/0x70
? entry_SYSCALL_64_fastpath+0x13/0x94
kworker/0:0 D 0 29186 2 0x00000000
Workqueue: events __i915_gem_free_work
Call Trace:
? __schedule+0x1a5/0x660
? schedule+0x36/0x80
? schedule_preempt_disabled+0xe/0x10
? __mutex_lock.isra.4+0x1c9/0x790
? del_timer_sync+0x44/0x50
? update_curr+0x57/0x110
? __i915_gem_free_objects+0x31/0x300
? __i915_gem_free_objects+0x31/0x300
? __i915_gem_free_work+0x2d/0x40
? process_one_work+0x13a/0x3b0
? worker_thread+0x4a/0x460
? kthread+0xf4/0x130
? process_one_work+0x3b0/0x3b0
? kthread_create_on_node+0x40/0x40
? ret_from_fork+0x23/0x30
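The deadlock is broken simply by reordering the expedited wait against
the unlock. A minimal sketch of the shape of the fix follows; the helper
name and shrink flags are illustrative assumptions, not the exact
upstream diff:

    /* Wait for the expedited grace period only after dropping
     * struct_mutex, so the RCU workqueues cannot end up queued
     * behind workers that are themselves blocked on struct_mutex. */
    static unsigned long i915_shrink_then_sync(struct drm_i915_private *i915)
    {
            unsigned long freed;

            mutex_lock(&i915->drm.struct_mutex);
            freed = i915_gem_shrink(i915, -1UL,
                                    I915_SHRINK_BOUND | I915_SHRINK_UNBOUND);
            mutex_unlock(&i915->drm.struct_mutex);

            synchronize_rcu_expedited(); /* now safe: no lock held */

            return freed;
    }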
Fixes: 3d3d18f086 ("drm/i915: Avoid rcu_barrier() from reclaim paths (shrinker)")
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
(cherry picked from commit 8f612d0551)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Pull drm updates from Dave Airlie:
"This is the main pull request for drm for 4.10 kernel.
New drivers:
- ZTE VOU display driver (zxdrm)
- Amlogic Meson Graphic Controller GXBB/GXL/GXM SoCs (meson)
- MXSFB support (mxsfb)
Core:
- Format handling has been reworked
- Better atomic state debugging
- drm_mm leak debugging
- Atomic explicit fencing support
- fbdev helper ops
- Documentation updates
- MST fbcon fixes
Bridge:
- Silicon Image SiI8620 driver
Panel:
- Add support for new simple panels
i915:
- GVT Device model
- Better HDMI 2.0 support on Skylake
- More watermark fixes
- GPU idling rework for suspend/resume
- DP Audio workarounds
- Scheduler prep-work
- Opregion CADL handling
- GPU scheduler and priority boosting
amdgpu/radeon:
- Support for virtual devices
- New VM manager for non-contig VRAM buffers
- UVD powergating
- SI register header cleanup
- Cursor fixes
- Powermanagement fixes
nouveau:
- Power management reworks for better voltage/clock changes
- Atomic modesetting support
- DisplayPort Multistream (MST) support.
- GP102/104 hang and cursor fixes
- GP106 support
hisilicon:
- hibmc support (BMC chip for aarch64 servers)
armada:
- add tracing support for overlay change
- refactor plane support
- de-midlayer the driver
omapdrm:
- Timing code cleanups
rcar-du:
- R8A7792/R8A7796 support
- Misc fixes.
sunxi:
- A31 SoC display engine support
imx-drm:
- YUV format support
- Cleanup plane atomic update
mali-dp:
- Misc fixes
dw-hdmi:
- Add support for HDMI i2c master controller
tegra:
- IOMMU support fixes
- Error handling fixes
tda998x:
- Fix connector registration
- Improved robustness
- Fix infoframe/audio compliance
virtio:
- fix busid issues
- allocate more vbufs
qxl:
- misc fixes and cleanups.
vc4:
- Fragment shader threading
- ETC1 support
- VEC (tv-out) support
msm:
- A5XX GPU support
- Lots of atomic changes
tilcdc:
- Misc fixes and cleanups.
etnaviv:
- Fix dma-buf export path
- DRAW_INSTANCED support
- fix driver on i.MX6SX
exynos:
- HDMI refactoring
fsl-dcu:
- fbdev changes"
* tag 'drm-for-v4.10' of git://people.freedesktop.org/~airlied/linux: (1343 commits)
drm/nouveau/kms/nv50: fix atomic regression on original G80
drm/nouveau/bl: Do not register interface if Apple GMUX detected
drm/nouveau/bl: Assign different names to interfaces
drm/nouveau/bios/dp: fix handling of LevelEntryTableIndex on DP table 4.2
drm/nouveau/ltc: protect clearing of comptags with mutex
drm/nouveau/gr/gf100-: handle GPC/TPC/MPC trap
drm/nouveau/core: recognise GP106 chipset
drm/nouveau/ttm: wait for bo fence to signal before unmapping vmas
drm/nouveau/gr/gf100-: FECS intr handling is not relevant on proprietary ucode
drm/nouveau/gr/gf100-: properly ack all FECS error interrupts
drm/nouveau/fifo/gf100-: recover from host mmu faults
drm: Add fake controlD* symlinks for backwards compat
drm/vc4: Don't use drm_put_dev
drm/vc4: Document VEC DT binding
drm/vc4: Add support for the VEC (Video Encoder) IP
drm: Add TV connector states to drm_connector_state
drm: Turn DRM_MODE_SUBCONNECTOR_xx definitions into an enum
drm/vc4: Fix ->clock_select setting for the VEC encoder
drm/amdgpu/dce6: Set MASTER_UPDATE_MODE to 0 in resume_mc_access as well
drm/amdgpu: use pin rather than pin_restricted in a few cases
...
The shrinker may appear to recurse into obj->mm.lock as the shrinker may
be called from a direct reclaim path whilst handling get_pages. We
filter out recursing on the same obj->mm.lock by inspecting
obj->mm.pages, but we do want to take the lock on a second object in
order to reap their pages. lockdep spots the recursion on the same
lockclass and needs annotation to avoid a false positive. To keep the
two paths distinct, create an enum to indicate which subclass of
obj->mm.lock we are using. This removes the false positive and avoids
masking real bugs.
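A hedged sketch of the annotation; the enum and reaper helper below
follow the description above but are assumptions, not the verbatim patch:

    enum i915_mm_subclass { /* lockdep subclass for obj->mm.lock */
            I915_MM_NORMAL = 0,     /* ordinary get_pages path */
            I915_MM_SHRINKER        /* a second object, from reclaim */
    };

    static void reap_object(struct drm_i915_gem_object *obj)
    {
            /* mutex_lock_nested() marks this acquisition as a distinct
             * subclass, so taking object B's mm.lock from reclaim while
             * holding object A's is no longer flagged as recursion. */
            mutex_lock_nested(&obj->mm.lock, I915_MM_SHRINKER);
            __i915_gem_object_put_pages(obj); /* hypothetical reaper */
            mutex_unlock(&obj->mm.lock);
    }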
Suggested-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161101121134.27504-1-chris@chris-wilson.co.uk
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
During shrinking, we walk over the list of objects searching for
victims. Any that are not removed are put back into the global list.
Currently, they are put back in order (at the front) which means they
will be first to be scanned again. If we instead move them to the rear
of the list, we will scan new potential victims on the next pass and
waste less time rescanning unshrinkable objects. Normally the lists are
kept in rough order for shrinking (with the least frequently used objects
at the start); by moving just-scanned objects to the rear we acknowledge
that they are still in use.
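A sketch of the rotation, using a private holding list so that one pass
never revisits an object it has already scanned (list and helper names
are assumptions):

    LIST_HEAD(still_in_list);
    struct drm_i915_gem_object *obj;
    unsigned long count = 0;

    while (count < target &&
           (obj = list_first_entry_or_null(&dev_priv->mm.unbound_list,
                                           typeof(*obj), global_link))) {
            list_move_tail(&obj->global_link, &still_in_list);
            count += try_to_shrink(obj); /* hypothetical */
    }
    /* Return survivors to the *tail* of the global list, so the next
     * pass starts with fresh candidates instead of rescanning these. */
    list_splice_tail(&still_in_list, &dev_priv->mm.unbound_list);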
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161101084843.3961-3-chris@chris-wilson.co.uk
We want to hide the latency of releasing objects and their backing
storage from the submission, so we move the actual free to a worker.
This allows us to switch to struct_mutex freeing of the object in the
next patch.
Furthermore, if we know that the object we are dereferencing remains valid
for the duration of our access, we can forgo the usual synchronisation
barriers and atomic reference counting. To ensure this we defer freeing
an object until after an RCU grace period, such that any lookup of the
object within an RCU read-side critical section will remain valid until
after we exit that critical section. We also employ this delay for
rate-limiting the serialisation on reallocation - we have to slow down
object creation in order to prevent resource starvation (in particular,
files).
v2: Return early in i915_gem_tiling() ioctl to skip over superfluous
work on error.
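A hedged sketch of the two-stage free; the rcu head and worker fields
are assumed names:

    static void __free_object_rcu(struct rcu_head *head)
    {
            struct drm_i915_gem_object *obj =
                    container_of(head, typeof(*obj), rcu);

            /* The grace period has elapsed: no RCU reader can still be
             * looking at obj. Punt the heavyweight release to a worker
             * so the submission path never pays for it. */
            queue_work(system_wq, &obj->free_work);
    }

    void i915_gem_object_free_deferred(struct drm_i915_gem_object *obj)
    {
            call_rcu(&obj->rcu, __free_object_rcu);
    }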
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-19-chris@chris-wilson.co.uk
Break the allocation of the backing storage away from struct_mutex into
a per-object lock. This allows parallel page allocation, provided we can
do so outside of struct_mutex (i.e. set-domain-ioctl, pwrite, GTT
fault), i.e. before execbuf! The increased cost of the atomic counters
is hidden behind i915_vma_pin() for the typical case of execbuf, i.e.
as the object is typically bound between execbufs, the page_pin_count is
static. The cost will be felt around set-domain and pwrite, but offset
by the improvement from reduced struct_mutex contention.
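A sketch of the fast and slow paths under the new per-object lock
(field and helper names are assumptions):

    int i915_gem_object_pin_pages(struct drm_i915_gem_object *obj)
    {
            int err = 0;

            /* Typical execbuf case: the object is already bound, so a
             * single atomic increment suffices - no locks at all. */
            if (atomic_inc_not_zero(&obj->mm.pages_pin_count))
                    return 0;

            /* Slow path (set-domain, pwrite, GTT fault): allocate the
             * backing storage under obj->mm.lock, not struct_mutex. */
            mutex_lock(&obj->mm.lock);
            if (!obj->mm.pages)
                    err = allocate_backing_store(obj); /* hypothetical */
            if (err == 0)
                    atomic_inc(&obj->mm.pages_pin_count);
            mutex_unlock(&obj->mm.lock);

            return err;
    }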
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20161028125858.23563-14-chris@chris-wilson.co.uk
In the next patch we want to handle reset directly by a locked waiter in
order to avoid issues with returning before the reset is handled. To
handle the reset, we must first know whether we hold the struct_mutex.
If we do not hold the struct_mutex we cannot perform the reset, but we do
not block the reset worker either (and so we can just continue to wait for
request completion); otherwise we must relinquish the mutex.
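A sketch of the waiter's protocol; the flag name and helpers are
assumptions, and the wait-queue plumbing of a real wait is elided:

    long wait_for_request(struct drm_i915_gem_request *req,
                          unsigned int flags, long timeout)
    {
            for (;;) {
                    if (i915_gem_request_completed(req))
                            return timeout;

                    if ((flags & I915_WAIT_LOCKED) &&
                        i915_reset_in_progress(&req->i915->gpu_error))
                            /* We hold struct_mutex, so the reset worker
                             * cannot run until we release it: back off. */
                            return -EAGAIN;

                    /* (The real wait sleeps on the request's wait-queue.) */
                    timeout = io_schedule_timeout(timeout);
                    if (!timeout)
                            return -ETIME;
            }
    }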
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20160909131201.16673-10-chris@chris-wilson.co.uk
We can now wait for the GPU (all engines) to become idle without
requiring the struct_mutex. Inside the shrinker, we currently need to
take the struct_mutex in order to purge objects, and to purge those
objects we need the GPU to be idle - causing a stall whilst we hold the
struct_mutex. We can hide most of that stall by performing the wait
before taking the struct_mutex and only doing essential waits for
new rendering on objects to be freed.
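A sketch of the reordered shrinker, with assumed helper names:

    static unsigned long shrinker_scan(struct drm_i915_private *i915,
                                       unsigned long nr_to_scan)
    {
            unsigned long freed;

            /* The all-engine wait no longer needs struct_mutex, so the
             * bulk of the stall is absorbed here, outside the lock. */
            i915_gem_wait_for_idle(i915);

            mutex_lock(&i915->drm.struct_mutex);
            freed = i915_gem_shrink(i915, nr_to_scan,
                                    I915_SHRINK_BOUND | I915_SHRINK_UNBOUND);
            mutex_unlock(&i915->drm.struct_mutex);

            return freed;
    }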
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1470388464-28458-8-git-send-email-chris@chris-wilson.co.uk
The principal motivation for this was to try and eliminate the
struct_mutex from i915_gem_suspend - but we currently still need to hold
the mutex for i915_gem_context_lost(). (The issue there is that there
may be an indirect lockdep cycle between cpu_hotplug (i.e. suspend) and
struct_mutex via the stop_machine().) For the moment, enabling last
request tracking for the engine, allows us to do busyness checking and
waiting without requiring the struct_mutex - which is useful in its own
right.
As a side-effect of having a robust means for tracking engine busyness,
we can replace our other busyness heuristic, that of comparing against
the last submitted seqno. For paranoid reasons, we have a semi-ordered
check of that seqno inside the hangchecker, which we can now improve to
an ordered check of the engine's busyness (removing a locked xchg in the
process).
v2: Pass along "bool interruptible", as being unlocked we cannot rely on
i915->mm.interruptible being stable or even under our control.
v3: Replace the Ironlake i915_gpu_busy() check with the common precalculated value
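A sketch of the resulting lockless busyness check; the seqno field
names are assumed, not the exact upstream ones:

    static bool engine_is_busy(const struct intel_engine_cs *engine)
    {
            /* Busy while the last submitted request has yet to retire.
             * Both seqnos are published with ordered stores, so a pair
             * of READ_ONCE()s replaces the old locked xchg in the
             * hangchecker. */
            return READ_ONCE(engine->last_submitted_seqno) !=
                   READ_ONCE(engine->last_retired_seqno);
    }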
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1470388464-28458-6-git-send-email-chris@chris-wilson.co.uk
If we enable RCU for the requests (providing a grace period where we can
inspect a "dead" request before it is freed), we can allow callers to
carefully perform lockless lookup of an active request.
However, by enabling deferred freeing of requests, we can potentially
hog a lot of memory when dealing with tens of thousands of requests per
second - with a quick insertion of a synchronize_rcu() inside our
shrinker callback, that issue disappears.
v2: Currently, it is our responsibility to handle reclaim i.e. to avoid
hogging memory with the delayed slab frees. At the moment, we wait for a
grace period in the shrinker, and block for all RCU callbacks on oom.
Suggested alternatives focus on flushing our RCU callback when we have a
certain number of outstanding request frees, and blocking on that flush
after a second high watermark. (So rather than wait for the system to
run out of memory, we stop issuing requests - both are nondeterministic.)
Paul E. McKenney wrote:
Another approach is synchronize_rcu() after some largish number of
requests. The advantage of this approach is that it throttles the
production of callbacks at the source. The corresponding disadvantage
is that it slows things up.
Another approach is to use call_rcu(), but if the previous call_rcu()
is still in flight, block waiting for it. Yet another approach is
the get_state_synchronize_rcu() / cond_synchronize_rcu() pair. The
idea is to do something like this:
cond_synchronize_rcu(cookie);
cookie = get_state_synchronize_rcu();
You would of course do an initial get_state_synchronize_rcu() to
get things going. This would not block unless there was less than
one grace period's worth of time between invocations. But this
assumes a busy system, where there is almost always a grace period
in flight. But you can make that happen as follows:
cond_synchronize_rcu(cookie);
cookie = get_state_synchronize_rcu();
call_rcu(&my_rcu_head, noop_function);
Note that you need additional code to make sure that the old callback
has completed before doing a new one. Setting and clearing a flag
with appropriate memory ordering control suffices (e.g., smp_load_acquire()
and smp_store_release()).
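A sketch of the cookie-based throttle described above; the RCU calls
are the real kernel API, while the allocator is hypothetical:

    static unsigned long gp_cookie;

    struct drm_i915_gem_request *alloc_request_throttled(void)
    {
            /* Blocks only if less than one grace period has elapsed
             * since the cookie was taken, throttling callback
             * production at the source. */
            cond_synchronize_rcu(gp_cookie);
            gp_cookie = get_state_synchronize_rcu();

            return alloc_request(); /* hypothetical */
    }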
v3: More comments on compiler and processor order of operations within
the RCU lookup and discover we can use rcu_access_pointer() here instead.
v4: Wrap i915_gem_active_get_rcu() to take the rcu_read_lock itself.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: "Goel, Akash" <akash.goel@intel.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: http://patchwork.freedesktop.org/patch/msgid/1470324762-2545-25-git-send-email-chris@chris-wilson.co.uk
We are motivated to avoid using a bitfield for obj->active for a couple
of reasons. Firstly, we wish to document our lockless read of obj->active
using READ_ONCE inside i915_gem_busy_ioctl() and that requires an
integral type (i.e. not a bitfield). Secondly, gcc produces abysmal code
when presented with a bitfield and that shows up high on the profiles of
request tracking (mainly due to excess memory traffic as it converts
the bitfield to a register and back and generates frequent AGI in the
process).
v2: Use BIT(), break up a long line in computing the other engines, new
paint for i915_gem_object_is_active (now i915_gem_object_get_active).
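A sketch of the flag-based replacement (the bit layout is assumed):

    #define I915_BO_ACTIVE_SHIFT 0
    #define I915_BO_ACTIVE_MASK ((1 << I915_NUM_ENGINES) - 1)

    static unsigned int
    i915_gem_object_get_active(const struct drm_i915_gem_object *obj)
    {
            /* obj->flags is a plain integer, so the lockless peek in
             * i915_gem_busy_ioctl() can legally use READ_ONCE(). */
            return (READ_ONCE(obj->flags) >> I915_BO_ACTIVE_SHIFT) &
                   I915_BO_ACTIVE_MASK;
    }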
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1470324762-2545-23-git-send-email-chris@chris-wilson.co.uk
When we call i915_vma_unbind(), we will wait upon outstanding rendering.
This will also trigger a retirement phase, which may update the object
lists. If we extend request tracking to the VMA itself (rather than
keeping it at the encompassing object), then there is a potential for
obj->vma_list to be modified for other elements upon i915_vma_unbind(). As
a result, if we walk over the object list and call i915_vma_unbind(), we
need to be prepared for that list to change.
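A sketch of a restart-safe walk (the loop shape is the point; names are
assumed):

    int unbind_all_vmas(struct drm_i915_gem_object *obj)
    {
            struct i915_vma *vma;
            int ret;

            /* i915_vma_unbind() may retire requests and so edit
             * obj->vma_list behind our back: re-fetch the head each
             * time instead of trusting a list_for_each cursor. */
            while ((vma = list_first_entry_or_null(&obj->vma_list,
                                                   typeof(*vma),
                                                   obj_link))) {
                    ret = i915_vma_unbind(vma);
                    if (ret)
                            return ret;
            }
            return 0;
    }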
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1470293567-10811-8-git-send-email-chris@chris-wilson.co.uk
Since drm_i915_private is now a subclass of drm_device we do not need to
chase the drm_i915_private->dev backpointer and can instead simply
access drm_i915_private->drm directly.
   text    data  bss      dec     hex filename
1068757    4565  416  1073738  10624a drivers/gpu/drm/i915/i915.ko
1066949    4565  416  1071930  105b3a drivers/gpu/drm/i915/i915.ko
Created by the coccinelle script:
@@
struct drm_i915_private *d;
identifier i;
@@
(
- d->dev->i
+ d->drm.i
|
- d->dev
+ &d->drm
)
and for good measure the dev_priv->dev backpointer was removed entirely.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1467711623-2905-4-git-send-email-chris@chris-wilson.co.uk
By tracking the iomapping on the VMA itself, we can share that area
between multiple users. Also by only revoking the iomapping upon
unbinding from the mappable portion of the GGTT, we can keep that iomap
across multiple invocations (e.g. execlists context pinning).
Note that by moving the iounmap tracking to the VMA, we actually end up
fixing a leak of the iomapping in intel_fbdev.
v1.5: Rebase prompted by Tvrtko
v2: Drop dev_priv parameter, we can recover the i915_ggtt from the vma.
v3: Move handling of ioremap space exhaustion to vmap_purge and also
allow vmallocs to recover old iomaps. Add Tvrtko's kerneldoc.
v4: Fix a use-after-free in shrinker and rearrange i915_vma_iomap
v5: Back to i915_vm_to_ggtt
v6: Use i915_vma_pin_iomap and i915_vma_unpin_iomap to mark critical
sections and ensure the VMA cannot be reaped whilst mapped.
v7: Move i915_vma_iounmap so that consumers of the API are not tempted,
and add iomem annotations
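A sketch of the pin/unpin discipline; the signatures and fields are
assumptions, not the exact upstream API:

    void __iomem *i915_vma_pin_iomap(struct i915_vma *vma)
    {
            if (vma->iomap == NULL)
                    /* Cached on the VMA and kept until unbind, so the
                     * mapping is shared across users and invocations. */
                    vma->iomap = io_mapping_map_wc(&vma->vm->mappable,
                                                   vma->node.start);
            if (vma->iomap == NULL)
                    return NULL;

            /* Pin so the VMA cannot be reaped while the map is in use. */
            __i915_vma_pin(vma);
            return vma->iomap;
    }

    void i915_vma_unpin_iomap(struct i915_vma *vma)
    {
            __i915_vma_unpin(vma); /* the iomap itself stays cached */
    }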
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1461833819-3991-5-git-send-email-chris@chris-wilson.co.uk
Elsewhere we have adopted the convention of using '_link' to denote
elements in the list (and '_list' for the actual list_head itself), and
that the name should indicate which list the link belongs to (and
preferably not just where the link is being stored).
s/vma_link/obj_link/ (we iterate over obj->vma_list)
s/mm_list/vm_link/ (we iterate over vm->[in]active_list)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
If the system has no available swap pages, we cannot make forward
progress in the shrinker by releasing active pages, only by releasing
purgeable pages which are immediately reaped. Take total_swap_pages into
account when counting up available objects to be shrunk and subsequently
shrinking them. By doing so, we avoid unbinding objects that cannot be
shrunk and so wasting CPU cycles flushing those objects from the GPU to
the system and then immediately back again (as they will more than
likely be reused shortly after).
Based on a patch by Akash Goel.
v2: frontswap registers extra swap pages available for the system, so they
are already included in the count of available swap pages.
v3: Use get_nr_swap_pages() to query the currently available amount of
swap space. This should also stop us from shrinking the GPU buffers if
we ever run out of swap space. Though at that point, we would expect the
oom-notifier to be running and failing miserably...
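A sketch of the resulting test; the madvise field is assumed, while
get_nr_swap_pages() is the real mm helper:

    static bool can_release_pages(const struct drm_i915_gem_object *obj)
    {
            /* Purgeable buffers are reaped outright; anything else must
             * be written to swap, which is futile with no swap free. */
            if (obj->madv == I915_MADV_DONTNEED)
                    return true;

            return get_nr_swap_pages() > 0;
    }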
Reported-by: Akash Goel <akash.goel@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: linux-mm@kvack.org
Cc: Akash Goel <akash.goel@intel.com>
Cc: sourab.gupta@intel.com
Link: http://patchwork.freedesktop.org/patch/msgid/1449244734-25733-2-git-send-email-chris@chris-wilson.co.uk
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Exclude active GPU pages from the purview of the background shrinker
(kswapd), as these cause uncontrollable GPU stalls. Given that the
shrinker is rerun until the freelists are satisfied, we should have
opportunity in subsequent passes to recover the pages once idle. If the
machine does run out of memory entirely, we have the forced idling in the
oom-notifier as a means of releasing all the pages we can before an oom
is prematurely executed.
Note that this relies upon an up-front retire_requests to keep the
inactive list in shape, which was added in a previous patch, mostly as
execlist ctx pinning band-aids.
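A sketch of the background scan; the flags mirror the description above
and the signature is assumed:

    static unsigned long
    background_scan(struct drm_i915_private *dev_priv, unsigned long nr)
    {
            /* I915_SHRINK_ACTIVE is deliberately omitted: kswapd must
             * never stall the GPU. Only the oom-notifier idles the GPU
             * and reclaims active pages, as a last resort. */
            return i915_gem_shrink(dev_priv, nr,
                                   I915_SHRINK_BOUND | I915_SHRINK_UNBOUND);
    }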
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Damien Lespiau <damien.lespiau@intel.com>
[danvet: Add note about retire_requests.]
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>