As the engine->kernel_context is used within the engine-pm barrier, we
have to be careful when emitting requests outside of the barrier, as the
strict timeline locking rules do not apply. Instead, we must ensure the
engine_park() cannot be entered as we build the request, which is
simplest by taking an explicit engine-pm wakeref around the request
construction.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191125105858.1718307-1-chris@chris-wilson.co.uk
As drm now exports a method to create an anonymous struct file around a
drm_device for internal use, make use of it to avoid our horrible hacks.
Danial suggested that the mock_file_put() wrapper was suitable for
drm-core, along with the mock_drm_getfile() [and that the vestigal
mock_drm_file() in this patch should perhaps be the drm interface
itself]. However, the eventual goal is to remove the mock_drm_file() and
use the struct file and fput() directly, in this patch we take a simple
transition in that direction.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20191107180601.30815-3-chris@chris-wilson.co.uk
The trouble with having a plain nesting flag for locks which do not
naturally nest (unlike block devices and their partitions, which is
the original motivation for nesting levels) is that lockdep will
never spot a true deadlock if you screw up.
This patch is an attempt at trying better, by highlighting a bit more
of the actual nature of the nesting that's going on. Essentially we
have two kinds of objects:
- objects without pages allocated, which cannot be on any lru and are
hence inaccessible to the shrinker.
- objects which have pages allocated, which are on an lru, and which
the shrinker can decide to throw out.
For the former type of object, memory allocations while holding
obj->mm.lock are permissible. For the latter they are not. And
get/put_pages transitions between the two types of objects.
This is still not entirely fool-proof since the rules might change.
But as long as we run such a code ever at runtime lockdep should be
able to observe the inconsistency and complain (like with any other
lockdep class that we've split up in multiple classes). But there are
a few clear benefits:
- We can drop the nesting flag parameter from
__i915_gem_object_put_pages, because that function by definition is
never going allocate memory, and calling it on an object which
doesn't have its pages allocated would be a bug.
- We strictly catch more bugs, since there's not only one place in the
entire tree which is annotated with the special class. All the
other places that had explicit lockdep nesting annotations we're now
going to leave up to lockdep again.
- Specifically this catches stuff like calling get_pages from
put_pages (which isn't really a good idea, if we can call get_pages
so could the shrinker). I've seen patches do exactly that.
Of course I fully expect CI will show me for the fool I am with this
one here :-)
v2: There can only be one (lockdep only has a cache for the first
subclass, not for deeper ones, and we don't want to make these locks
even slower). Still separate enums for better documentation.
Real fix: don't forget about phys objs and pin_map(), and fix the
shrinker to have the right annotations ... silly me.
v3: Forgot usertptr too ...
v4: Improve comment for pages_pin_count, drop the IMPORTANT comment
and instead prime lockdep (Chris).
v5: Appease checkpatch, no double empty lines (Chris)
v6: More rebasing over selftest changes. Also somehow I forgot to
push this patch :-/
Also format comments consistently while at it.
v7: Fix typo in commit message (Joonas)
Also drop the priming, with the lmem merge we now have allocations
while holding the lmem lock, which wreaks the generic priming I've
done in earlier patches. Should probably be resurrected when lmem is
fixed. See
commit 232a6ebae4
Author: Matthew Auld <matthew.auld@intel.com>
Date: Tue Oct 8 17:01:14 2019 +0100
drm/i915: introduce intel_memory_region
I'm keeping the priming patch locally so it wont get lost.
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Tang, CQ" <cq.tang@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> (v5)
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> (v6)
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191105090148.30269-1-daniel.vetter@ffwll.ch
[mlankhorst: Fix commit typos pointed out by Michael Ruhl]
Our existing behaviour is to allow contexts and their GPU requests to
persist past the point of closure until the requests are complete. This
allows clients to operate in a 'fire-and-forget' manner where they can
setup a rendering pipeline and hand it over to the display server and
immediately exit. As the rendering pipeline is kept alive until
completion, the display server (or other consumer) can use the results
in the future and present them to the user.
The compute model is a little different. They have little to no buffer
sharing between processes as their kernels tend to operate on a
continuous stream, feeding the results back to the client application.
These kernels operate for an indeterminate length of time, with many
clients wishing that the kernel was always running for as long as they
keep feeding in the data, i.e. acting like a DSP.
Not all clients want this persistent "desktop" behaviour and would prefer
that the contexts are cleaned up immediately upon closure. This ensures
that when clients are run without hangchecking (e.g. for compute kernels
of indeterminate runtime), any GPU hang or other unexpected workloads
are terminated with the process and does not continue to hog resources.
The default behaviour for new contexts is the legacy persistence mode,
as some desktop applications are dependent upon the existing behaviour.
New clients will have to opt in to immediate cleanup on context
closure. If the hangchecking modparam is disabled, so is persistent
context support -- all contexts will be terminated on closure.
We expect this behaviour change to be welcomed by compute users, who
have often been caught between a rock and a hard place. They disable
hangchecking to avoid their kernels being "unfairly" declared hung, but
have also experienced true hangs that the system was then unable to
clean up. Naturally, this leads to bug reports.
Testcase: igt/gem_ctx_persistence
Link: https://github.com/intel/compute-runtime/pull/228
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Jon Bloomfield <jon.bloomfield@intel.com>
Reviewed-by: Jon Bloomfield <jon.bloomfield@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Acked-by: Jason Ekstrand <jason@jlekstrand.net>
Link: https://patchwork.freedesktop.org/patch/msgid/20191029202338.8841-1-chris@chris-wilson.co.uk
With the introduction of ctx->engines[] we allow multiple logical
contexts to be used on the same engine (e.g. with virtual engines).
According to bspec, aach logical context requires a unique tag in order
for context-switching to occur correctly between them. [Simple
experiments show that it is not so easy to trick the HW into performing
a lite-restore with matching logical IDs, though my memory from early
Broadwell experiments do suggest that it should be generating
lite-restores.]
We only need to keep a unique tag for the active lifetime of the
context, and for as long as we need to identify that context. The HW
uses the tag to determine if it should use a lite-restore (why not the
LRCA?) and passes the tag back for various status identifies. The only
status we need to track is for OA, so when using perf, we assign the
specific context a unique tag.
v2: Calculate required number of tags to fill ELSP.
Fixes: 976b55f0e1 ("drm/i915: Allow a context to define its set of engines")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111895
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-14-chris@chris-wilson.co.uk
Replace the struct_mutex requirement for pinning the i915_vma with the
local vm->mutex instead. Note that the vm->mutex is tainted by the
shrinker (we require unbinding from inside fs-reclaim) and so we cannot
allocate while holding that mutex. Instead we have to preallocate
workers to do allocate and apply the PTE updates after we have we
reserved their slot in the drm_mm (using fences to order the PTE writes
with the GPU work and with later unbind).
In adding the asynchronous vma binding, one subtle requirement is to
avoid coupling the binding fence into the backing object->resv. That is
the asynchronous binding only applies to the vma timeline itself and not
to the pages as that is a more global timeline (the binding of one vma
does not need to be ordered with another vma, nor does the implicit GEM
fencing depend on a vma, only on writes to the backing store). Keeping
the vma binding distinct from the backing store timelines is verified by
a number of async gem_exec_fence and gem_exec_schedule tests. The way we
do this is quite simple, we keep the fence for the vma binding separate
and only wait on it as required, and never add it to the obj->resv
itself.
Another consequence in reducing the locking around the vma is the
destruction of the vma is no longer globally serialised by struct_mutex.
A natural solution would be to add a kref to i915_vma, but that requires
decoupling the reference cycles, possibly by introducing a new
i915_mm_pages object that is own by both obj->mm and vma->pages.
However, we have not taken that route due to the overshadowing lmem/ttm
discussions, and instead play a series of complicated games with
trylocks to (hopefully) ensure that only one destruction path is called!
v2: Add some commentary, and some helpers to reduce patch churn.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-4-chris@chris-wilson.co.uk
Currently, if there is time remaining before the start of the loop, we
do one full iteration over many possible different chunks within the
object. A full loop may take 50+s (depending on speed of indirect GTT
mmapings) and we try separately with LINEAR, X and Y -- at which point
igt times out. If we check more frequently, we will interrupt the loop
upon our timeout -- it is hard to argue for as this significantly reduces
the test coverage as we dramatically reduce the runtime. In practical
terms, the coverage we should prioritise is in using different fence
setups, forcing verification of the tile row computations over the
current preference of checking extracting chunks. Though the exhaustive
search is great given an infinite timeout, to improve our current
coverage, we also add a randomised smoketest of partial mmaps. So let's
do both, add a randomised smoketest of partial tiling chunks and the
exhaustive (though time limited) search for failures.
Even in adding another subtest, we should shave 100s off BAT! (With,
hopefully, no loss in coverage, at least over multiple runs.)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190910121009.13431-1-chris@chris-wilson.co.uk