Commit Graph

866 Commits

Author SHA1 Message Date
Sean Christopherson
64d5cd99f7 KVM: x86/mmu: Put direct prefetched pages via kvm_release_page_clean()
Use kvm_release_page_clean() to put prefeteched pages instead of calling
put_page() directly.  This will allow de-duplicating the prefetch code
between indirect and direct MMUs.

Note, there's a small functional change as kvm_release_page_clean() marks
the page/folio as accessed.  While it's not strictly guaranteed that the
guest will access the page, KVM won't intercept guest accesses, i.e. won't
mark the page accessed if it _is_ accessed by the guest (unless A/D bits
are disabled, but running without A/D bits is effectively limited to
pre-HSW Intel CPUs).

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-39-seanjc@google.com>
2024-10-25 12:59:08 -04:00
Sean Christopherson
447c375c91 KVM: x86/mmu: Add "mmu" prefix fault-in helpers to free up generic names
Prefix x86's faultin_pfn helpers with "mmu" so that the mmu-less names can
be used by common KVM for similar APIs.

No functional change intended.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-38-seanjc@google.com>
2024-10-25 12:59:08 -04:00
Sean Christopherson
cccefb0a0d KVM: Drop unused "hva" pointer from __gfn_to_pfn_memslot()
Drop @hva from __gfn_to_pfn_memslot() now that all callers pass NULL.

No functional change intended.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-19-seanjc@google.com>
2024-10-25 12:57:58 -04:00
Sean Christopherson
084ecf95a0 KVM: x86/mmu: Drop kvm_page_fault.hva, i.e. don't track intermediate hva
Remove kvm_page_fault.hva as it is never read, only written.  This will
allow removing the @hva param from __gfn_to_pfn_memslot().

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-18-seanjc@google.com>
2024-10-25 12:57:58 -04:00
David Stevens
6769d1bcd3 KVM: Replace "async" pointer in gfn=>pfn with "no_wait" and error code
Add a pfn error code to communicate that hva_to_pfn() failed because I/O
was needed and disallowed, and convert @async to a constant @no_wait
boolean.  This will allow eliminating the @no_wait param by having callers
pass in FOLL_NOWAIT along with other FOLL_* flags.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: David Stevens <stevensd@chromium.org>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-17-seanjc@google.com>
2024-10-25 12:57:58 -04:00
Sean Christopherson
e2d2ca71ac KVM: Drop @atomic param from gfn=>pfn and hva=>pfn APIs
Drop @atomic from the myriad "to_pfn" APIs now that all callers pass
"false", and remove a comment blurb about KVM running only the "GUP fast"
part in atomic context.

No functional change intended.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-13-seanjc@google.com>
2024-10-25 12:57:58 -04:00
Sean Christopherson
6419bc5207 KVM: Rename gfn_to_page_many_atomic() to kvm_prefetch_pages()
Rename gfn_to_page_many_atomic() to kvm_prefetch_pages() to try and
communicate its true purpose, as the "atomic" aspect is essentially a
side effect of the fact that x86 uses the API while holding mmu_lock.
E.g. even if mmu_lock weren't held, KVM wouldn't want to fault-in pages,
as the goal is to opportunistically grab surrounding pages that have
already been accessed and/or dirtied by the host, and to do so quickly.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-12-seanjc@google.com>
2024-10-25 12:55:12 -04:00
Sean Christopherson
5f6a3badbb KVM: x86/mmu: Mark page/folio accessed only when zapping leaf SPTEs
Now that KVM doesn't clobber Accessed bits of shadow-present SPTEs,
e.g. when prefetching, mark folios as accessed only when zapping leaf
SPTEs, which is a rough heuristic for "only in response to an mmu_notifier
invalidation".  Page aging and LRUs are tolerant of false negatives, i.e.
KVM doesn't need to be precise for correctness, and re-marking folios as
accessed when zapping entire roots or when zapping collapsible SPTEs is
expensive and adds very little value.

E.g. when a VM is dying, all of its memory is being freed; marking folios
accessed at that time provides no known value.  Similarly, because KVM
marks folios as accessed when creating SPTEs, marking all folios as
accessed when userspace happens to delete a memslot doesn't add value.
The folio was marked access when the old SPTE was created, and will be
marked accessed yet again if a vCPU accesses the pfn again after reloading
a new root.  Zapping collapsible SPTEs is a similar story; marking folios
accessed just because userspace disable dirty logging is a side effect of
KVM behavior, not a deliberate goal.

As an intermediate step, a.k.a. bisection point, towards *never* marking
folios accessed when dropping SPTEs, mark folios accessed when the primary
MMU might be invalidating mappings, as such zappings are not KVM initiated,
i.e. might actually be related to page aging and LRU activity.

Note, x86 is the only KVM architecture that "double dips"; every other
arch marks pfns as accessed only when mapping into the guest, not when
mapping into the guest _and_ when removing from the guest.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-10-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
aa85986e71 KVM: x86/mmu: Mark folio dirty when creating SPTE, not when zapping/modifying
Mark pages/folios dirty when creating SPTEs to map PFNs into the guest,
not when zapping or modifying SPTEs, as marking folios dirty when zapping
or modifying SPTEs can be extremely inefficient.  E.g. when KVM is zapping
collapsible SPTEs to reconstitute a hugepage after disbling dirty logging,
KVM will mark every 4KiB pfn as dirty, even though _at least_ 512 pfns are
guaranteed to be in a single folio (the SPTE couldn't potentially be huge
if that weren't the case).  The problem only becomes worse for 1GiB
HugeTLB pages, as KVM can mark a single folio dirty 512*512 times.

Marking a folio dirty when mapping is functionally safe as KVM drops all
relevant SPTEs in response to an mmu_notifier invalidation, i.e. ensures
that the guest can't dirty a folio after access has been removed.

And because KVM already marks folios dirty when zapping/modifying SPTEs
for KVM reasons, i.e. not in response to an mmu_notifier invalidation,
there is no danger of "prematurely" marking a folio dirty.  E.g. if a
filesystems cleans a folio without first removing write access, then there
already exists races where KVM could mark a folio dirty before remote TLBs
are flushed, i.e. before guest writes are guaranteed to stop.  Furthermore,
x86 is literally the only architecture that marks folios dirty on the
backend; every other KVM architecture marks folios dirty at map time.

x86's unique behavior likely stems from the fact that x86's MMU predates
mmu_notifiers.  Long, long ago, before mmu_notifiers were added, marking
pages dirty when zapping SPTEs was logical, and perhaps even necessary, as
KVM held references to pages, i.e. kept a page's refcount elevated while
the page was mapped into the guest.  At the time, KVM's rmap_remove()
simply did:

        if (is_writeble_pte(*spte))
                kvm_release_pfn_dirty(pfn);
        else
                kvm_release_pfn_clean(pfn);

i.e. dropped the refcount and marked the page dirty at the same time.
After mmu_notifiers were introduced, commit acb66dd051 ("KVM: MMU:
don't hold pagecount reference for mapped sptes pages") removed the
refcount logic, but kept the dirty logic, i.e. converted the above to:

	if (is_writeble_pte(*spte))
		kvm_release_pfn_dirty(pfn);

And for KVM x86, that's essentially how things have stayed over the last
~15 years, without anyone revisiting *why* KVM marks pages/folios dirty at
zap/modification time, e.g. the behavior was blindly carried forward to
the TDP MMU.

Practically speaking, the only downside to marking a folio dirty during
mapping is that KVM could trigger writeback of memory that was never
actually written.  Except that can't actually happen if KVM marks folios
dirty if and only if a writable SPTE is created (as done here), because
KVM always marks writable SPTEs as dirty during make_spte().  See commit
9b51a63024 ("KVM: MMU: Explicitly set D-bit for writable spte."), circa
2015.

Note, KVM's access tracking logic for prefetched SPTEs is a bit odd.  If a
guest PTE is dirty and writable, KVM will create a writable SPTE, but then
mark the SPTE for access tracking.  Which isn't wrong, just a bit odd, as
it results in _more_ precise dirty tracking for MMUs _without_ A/D bits.

To keep things simple, mark the folio dirty before access tracking comes
into play, as an access-tracked SPTE can be restored in the fast page
fault path, i.e. without holding mmu_lock.  While writing SPTEs and
accessing memslots outside of mmu_lock is safe, marking a folio dirty is
not.  E.g. if the fast path gets interrupted _just_ after setting a SPTE,
the primary MMU could theoretically invalidate and free a folio before KVM
marks it dirty.  Unlike the shadow MMU, which waits for CPUs to respond to
an IPI, the TDP MMU only guarantees the page tables themselves won't be
freed (via RCU).

Opportunistically update a few stale comments.

Cc: David Matlack <dmatlack@google.com>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-9-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
63c5754472 KVM: x86/mmu: Invert @can_unsync and renamed to @synchronizing
Invert the polarity of "can_unsync" and rename the parameter to
"synchronizing" to allow a future change to set the Accessed bit if KVM
is synchronizing an existing SPTE.  Querying "can_unsync" in that case is
nonsensical, as the fact that KVM can't unsync SPTEs doesn't provide any
justification for setting the Accessed bit.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-7-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
6385d01eec KVM: x86/mmu: Don't overwrite shadow-present MMU SPTEs when prefaulting
Treat attempts to prefetch/prefault MMU SPTEs as spurious if there's an
existing shadow-present SPTE, as overwriting a SPTE that may have been
create by a "real" fault is at best confusing, and at worst potentially
harmful.  E.g. mmu_try_to_unsync_pages() doesn't unsync when prefetching,
which creates a scenario where KVM could try to replace a Writable SPTE
with a !Writable SPTE, as sp->unsync is checked prior to acquiring
mmu_unsync_pages_lock.

Note, this applies to three of the four flavors of "prefetch" in KVM:

  - KVM_PRE_FAULT_MEMORY
  - Async #PF (host or PV)
  - Prefetching

The fourth flavor, SPTE synchronization, i.e. FNAME(sync_spte), _only_
overwrites shadow-present SPTEs when calling make_spte().  But SPTE
synchronization specifically uses mmu_spte_update(), and so naturally
avoids the @prefetch check in mmu_set_spte().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-6-seanjc@google.com>
2024-10-25 12:54:42 -04:00
Sean Christopherson
28cf497881 KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either
mmu_invalidate_in_progress is elevated, or that the range is being zapped
due to memslot removal (loosely detected by slots_lock being held).
Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe
as KVM's page fault path snapshots state before acquiring mmu_lock, and
thus can create SPTEs with stale information if vCPUs aren't forced to
retry faults (due to seeing an in-progress or past MMU invalidation).

Memslot removal is a special case, as the memslot is retrieved outside of
mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and
instead relies on SRCU synchronization to ensure any in-flight page faults
are fully resolved before zapping SPTEs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241009192345.1148353-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20 07:31:05 -04:00
Sean Christopherson
58a20a9435 KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot
When performing a targeted zap on memslot removal, zap only MMU pages that
shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and
unnecessary.  Furthermore, for_each_gfn_valid_sp() arguably shouldn't
exist, because it doesn't do what most people would it expect it to do.
The "round gfn for level" adjustment that is done for direct SPs (no gPTE)
means that the exact gfn comparison will not get a match, even when a SP
does "cover" a gfn, or was even created specifically for a gfn.

For memslot deletion specifically, KVM's behavior will vary significantly
based on the size and alignment of a memslot, and in weird ways.  E.g. for
a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if
it's only 4KiB aligned.  And as described below, zapping SPs in the
aligned case overzaps for direct MMUs, as odds are good the upper-level
SPs are serving other memslots.

To iterate over all potentially-relevant gfns, KVM would need to make a
pass over the hash table for each level, with the gfn used for lookup
rounded for said level.  And then check that the SP is of the correct
level, too, e.g. to avoid over-zapping.

But even then, KVM would massively overzap, as processing every level is
all but guaranteed to zap SPs that serve other memslots, especially if the
memslot being removed is relatively small.  KVM could mitigate that issue
by processing only levels that can be possible guest huge pages, i.e. are
less likely to be re-used for other memslot, but while somewhat logical,
that's quite arbitrary and would be a bit of a mess to implement.

So, zap only SPs with gPTEs, as the resulting behavior is easy to describe,
is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that
absolutely must be zapped.

Cc: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241009192345.1148353-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20 07:08:17 -04:00
Paolo Bonzini
c8d430db8e Merge tag 'kvmarm-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.12, take #1

- Fix pKVM error path on init, making sure we do not change critical
  system registers as we're about to fail

- Make sure that the host's vector length is at capped by a value
  common to all CPUs

- Fix kvm_has_feat*() handling of "negative" features, as the current
  code is pretty broken

- Promote Joey to the status of official reviewer, while James steps
  down -- hopefully only temporarly
2024-10-06 03:59:22 -04:00
Paolo Bonzini
fcd1ec9cb5 KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU
As was tried in commit 4e103134b8 ("KVM: x86/mmu: Zap only the relevant
pages when removing a memslot"), all shadow pages, i.e. non-leaf SPTEs,
need to be zapped.  All of the accounting for a shadow page is tied to the
memslot, i.e. the shadow page holds a reference to the memslot, for all
intents and purposes.  Deleting the memslot without removing all relevant
shadow pages, as is done when KVM_X86_QUIRK_SLOT_ZAP_ALL is disabled,
results in NULL pointer derefs when tearing down the VM.

Reintroduce from that commit the code that walks the whole memslot when
there are active shadow MMU pages.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-03 18:51:13 -04:00
Linus Torvalds
3efc57369a Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull x86 kvm updates from Paolo Bonzini:
 "x86:

   - KVM currently invalidates the entirety of the page tables, not just
     those for the memslot being touched, when a memslot is moved or
     deleted.

     This does not traditionally have particularly noticeable overhead,
     but Intel's TDX will require the guest to re-accept private pages
     if they are dropped from the secure EPT, which is a non starter.

     Actually, the only reason why this is not already being done is a
     bug which was never fully investigated and caused VM instability
     with assigned GeForce GPUs, so allow userspace to opt into the new
     behavior.

   - Advertise AVX10.1 to userspace (effectively prep work for the
     "real" AVX10 functionality that is on the horizon)

   - Rework common MSR handling code to suppress errors on userspace
     accesses to unsupported-but-advertised MSRs

     This will allow removing (almost?) all of KVM's exemptions for
     userspace access to MSRs that shouldn't exist based on the vCPU
     model (the actual cleanup is non-trivial future work)

   - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC)
     splits the 64-bit value into the legacy ICR and ICR2 storage,
     whereas Intel (APICv) stores the entire 64-bit value at the ICR
     offset

   - Fix a bug where KVM would fail to exit to userspace if one was
     triggered by a fastpath exit handler

   - Add fastpath handling of HLT VM-Exit to expedite re-entering the
     guest when there's already a pending wake event at the time of the
     exit

   - Fix a WARN caused by RSM entering a nested guest from SMM with
     invalid guest state, by forcing the vCPU out of guest mode prior to
     signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the
     nested guest)

   - Overhaul the "unprotect and retry" logic to more precisely identify
     cases where retrying is actually helpful, and to harden all retry
     paths against putting the guest into an infinite retry loop

   - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping
     rmaps in the shadow MMU

   - Refactor pieces of the shadow MMU related to aging SPTEs in
     prepartion for adding multi generation LRU support in KVM

   - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is
     enabled, i.e. when the CPU has already flushed the RSB

   - Trace the per-CPU host save area as a VMCB pointer to improve
     readability and cleanup the retrieval of the SEV-ES host save area

   - Remove unnecessary accounting of temporary nested VMCB related
     allocations

   - Set FINAL/PAGE in the page fault error code for EPT violations if
     and only if the GVA is valid. If the GVA is NOT valid, there is no
     guest-side page table walk and so stuffing paging related metadata
     is nonsensical

   - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit
     instead of emulating posted interrupt delivery to L2

   - Add a lockdep assertion to detect unsafe accesses of vmcs12
     structures

   - Harden eVMCS loading against an impossible NULL pointer deref
     (really truly should be impossible)

   - Minor SGX fix and a cleanup

   - Misc cleanups

  Generic:

   - Register KVM's cpuhp and syscore callbacks when enabling
     virtualization in hardware, as the sole purpose of said callbacks
     is to disable and re-enable virtualization as needed

   - Enable virtualization when KVM is loaded, not right before the
     first VM is created

     Together with the previous change, this simplifies a lot the logic
     of the callbacks, because their very existence implies
     virtualization is enabled

   - Fix a bug that results in KVM prematurely exiting to userspace for
     coalesced MMIO/PIO in many cases, clean up the related code, and
     add a testcase

   - Fix a bug in kvm_clear_guest() where it would trigger a buffer
     overflow _if_ the gpa+len crosses a page boundary, which thankfully
     is guaranteed to not happen in the current code base. Add WARNs in
     more helpers that read/write guest memory to detect similar bugs

  Selftests:

   - Fix a goof that caused some Hyper-V tests to be skipped when run on
     bare metal, i.e. NOT in a VM

   - Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES
     guest

   - Explicitly include one-off assets in .gitignore. Past Sean was
     completely wrong about not being able to detect missing .gitignore
     entries

   - Verify userspace single-stepping works when KVM happens to handle a
     VM-Exit in its fastpath

   - Misc cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
  Documentation: KVM: fix warning in "make htmldocs"
  s390: Enable KVM_S390_UCONTROL config in debug_defconfig
  selftests: kvm: s390: Add VM run test case
  KVM: SVM: let alternatives handle the cases when RSB filling is required
  KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid
  KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent
  KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals
  KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()
  KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
  KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed
  KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot
  KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()
  KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()
  KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn
  KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
  KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version
  KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()
  KVM: x86: Update retry protection fields when forcing retry on emulation failure
  KVM: x86: Apply retry protection to "unprotect on failure" path
  KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn
  ...
2024-09-28 09:20:14 -07:00
Paolo Bonzini
5d55a052e3 Merge tag 'kvm-x86-mmu-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.12:

 - Overhaul the "unprotect and retry" logic to more precisely identify cases
   where retrying is actually helpful, and to harden all retry paths against
   putting the guest into an infinite retry loop.

 - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in
   the shadow MMU.

 - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for
   adding MGLRU support in KVM.

 - Misc cleanups
2024-09-17 12:39:53 -04:00
Paolo Bonzini
55f50b2f86 Merge branch 'kvm-memslot-zap-quirk' into HEAD
Today whenever a memslot is moved or deleted, KVM invalidates the entire
page tables and generates fresh ones based on the new memslot layout.

This behavior traditionally was kept because of a bug which was never
fully investigated and caused VM instability with assigned GeForce
GPUs.  It generally does not have a huge overhead, because the old
MMU is able to reuse cached page tables and the new one is more
scalabale and can resolve EPT violations/nested page faults in parallel,
but it has worse performance if the guest frequently deletes and
adds small memslots, and it's entirely not viable for TDX.  This is
because TDX requires re-accepting of private pages after page dropping.

For non-TDX VMs, this series therefore introduces the
KVM_X86_QUIRK_SLOT_ZAP_ALL quirk, enabling users to control the behavior
of memslot zapping when a memslot is moved/deleted.  The quirk is turned
on by default, leading to the zapping of all SPTEs when a memslot is
moved/deleted; users however have the option to turn off the quirk,
which limits the zapping only to those SPTEs hat lie within the range
of memslot being moved/deleted.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-17 11:38:19 -04:00
Paolo Bonzini
9d70f3fec1 Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
This reverts commit 377b2f359d.

This caused a regression with the bochsdrm driver, which used ioremap()
instead of ioremap_wc() to map the video RAM.  After the commit, the
WB memory type is used without the IGNORE_PAT, resulting in the slower
UC memory type.  In fact, UC is slow enough to basically cause guests
to not boot... but only on new processors such as Sapphire Rapids and
Cascade Lake.  Coffee Lake for example works properly, though that might
also be an effect of being on a larger, more NUMA system.

The driver has been fixed but that does not help older guests.  Until we
figure out whether Cascade Lake and newer processors are working as
intended, revert the commit.  Long term we might add a quirk, but the
details depend on whether the processors are working as intended: for
example if they are, the quirk might reference bochs-compatible devices,
e.g. in the name and documentation, so that userspace can disable the
quirk by default and only leave it enabled if such a device is being
exposed to the guest.

If instead this is actually a bug in CLX+, then the actions we need to
take are different and depend on the actual cause of the bug.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-15 02:49:33 -04:00
Sean Christopherson
9a5bff7f5e KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent
Use KVM_PAGES_PER_HPAGE() instead of open coding equivalent logic that is
anything but obvious.

No functional change intended, and verified by compiling with the below
assertions:

        BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K)) !=
                      KVM_PAGES_PER_HPAGE(PG_LEVEL_4K));

        BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M)) !=
                      KVM_PAGES_PER_HPAGE(PG_LEVEL_2M));

        BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G)) !=
                      KVM_PAGES_PER_HPAGE(PG_LEVEL_1G));

Link: https://lore.kernel.org/r/20240809194335.1726916-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:08 -07:00
Sean Christopherson
7645829145 KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals
Replace all of the open coded '1' literals used to mark a PTE list as
having many/multiple entries with a proper define.  It's hard enough to
read the code with one magic bit, and a future patch to support "locking"
a single rmap will add another.

No functional change intended.

Link: https://lore.kernel.org/r/20240809194335.1726916-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:07 -07:00
Sean Christopherson
7aac9dc680 KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()
Fold mmu_spte_age() into its sole caller now that aging and testing for
young SPTEs is handled in a common location, i.e. doesn't require more
helpers.

Opportunistically remove the use of mmu_spte_get_lockless(), as mmu_lock
is held (for write!), and marking SPTEs for access tracking outside of
mmu_lock is unsafe (at least, as written).  I.e. using the lockless
accessor is quite misleading.

No functional change intended.

Link: https://lore.kernel.org/r/20240809194335.1726916-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:06 -07:00
Sean Christopherson
c17f150000 KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
Rework kvm_handle_gfn_range() into an aging-specic helper,
kvm_rmap_age_gfn_range().  In addition to purging a bunch of unnecessary
boilerplate code, this sets the stage for aging rmap SPTEs outside of
mmu_lock.

Note, there's a small functional change, as kvm_test_age_gfn() will now
return immediately if a young SPTE is found, whereas previously KVM would
continue iterating over other levels.

Link: https://lore.kernel.org/r/20240809194335.1726916-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:05 -07:00
Sean Christopherson
548f87f667 KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed
Convert kvm_unmap_gfn_range(), which is the helper that zaps rmap SPTEs in
response to an mmu_notifier invalidation, to use __kvm_rmap_zap_gfn_range()
and feed in range->may_block.  In other words, honor NEED_RESCHED by way of
cond_resched() when zapping rmaps.  This fixes a long-standing issue where
KVM could process an absurd number of rmap entries without ever yielding,
e.g. if an mmu_notifier fired on a PUD (or larger) range.

Opportunistically rename __kvm_zap_rmap() to kvm_zap_rmap(), and drop the
old kvm_zap_rmap().  Ideally, the shuffling would be done in a different
patch, but that just makes the compiler unhappy, e.g.

  arch/x86/kvm/mmu/mmu.c:1462:13: error: ‘kvm_zap_rmap’ defined but not used

Reported-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20240809194335.1726916-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:04 -07:00
Sean Christopherson
dd9eaad744 KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot
Add a dedicated helper to walk and zap rmaps for a given memslot so that
the code can be shared between KVM-initiated zaps and mmu_notifier
invalidations.

No functional change intended.

Link: https://lore.kernel.org/r/20240809194335.1726916-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:03 -07:00
Sean Christopherson
5b1fb116e1 KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()
Add a @can_yield param to __walk_slot_rmaps() to control whether or not
dropping mmu_lock and conditionally rescheduling is allowed.  This will
allow using __walk_slot_rmaps() and thus cond_resched() to handle
mmu_notifier invalidations, which usually allow blocking/yielding, but not
when invoked by the OOM killer.

Link: https://lore.kernel.org/r/20240809194335.1726916-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:02 -07:00
Sean Christopherson
0a37fffda1 KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()
Move walk_slot_rmaps() and friends up near for_each_slot_rmap_range() so
that the walkers can be used to handle mmu_notifier invalidations, and so
that similar function has some amount of locality in code.

No functional change intended.

Link: https://lore.kernel.org/r/20240809194335.1726916-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:22:01 -07:00
Sean Christopherson
98a69b96ca KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn
WARN if KVM gets an MMIO cache hit on a RET_PF_WRITE_PROTECTED fault, as
KVM should return RET_PF_WRITE_PROTECTED if and only if there is a memslot,
and creating a memslot is supposed to invalidate the MMIO cache by virtue
of changing the memslot generation.

Keep the code around mainly to provide a convenient location to document
why emulated MMIO should be impossible.

Suggested-by: Yuan Yao <yuan.yao@linux.intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:36 -07:00
Sean Christopherson
d859b16161 KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
Explicitly query the list of to-be-zapped shadow pages when checking to
see if unprotecting a gfn for retry has succeeded, i.e. if KVM should
retry the faulting instruction.

Add a comment to explain why the list needs to be checked before zapping,
which is the primary motivation for this change.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:35 -07:00
Sean Christopherson
6b3dcabc10 KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version
Fold kvm_mmu_unprotect_page() into kvm_mmu_unprotect_gfn_and_retry() now
that all other direct usage is gone.

No functional change intended.

Link: https://lore.kernel.org/r/20240831001538.336683-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:34 -07:00
Sean Christopherson
4df685664b KVM: x86: Update retry protection fields when forcing retry on emulation failure
When retrying the faulting instruction after emulation failure, refresh
the infinite loop protection fields even if no shadow pages were zapped,
i.e. avoid hitting an infinite loop even when retrying the instruction as
a last-ditch effort to avoid terminating the guest.

Link: https://lore.kernel.org/r/20240831001538.336683-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:32 -07:00
Sean Christopherson
b299c273c0 KVM: x86/mmu: Move event re-injection unprotect+retry into common path
Move the event re-injection unprotect+retry logic into
kvm_mmu_write_protect_fault(), i.e. unprotect and retry if and only if
the #PF actually hit a write-protected gfn.  Note, there is a small
possibility that the gfn was unprotected by a different tasking between
hitting the #PF and acquiring mmu_lock, but in that case, KVM will resume
the guest immediately anyways because KVM will treat the fault as spurious.

As a bonus, unprotecting _after_ handling the page fault also addresses the
case where the installing a SPTE to handle fault encounters a shadowed PTE,
i.e. *creates* a read-only SPTE.

Opportunstically add a comment explaining what on earth the intent of the
code is, as based on the changelog from commit 577bdc4966 ("KVM: Avoid
instruction emulation when event delivery is pending").

Link: https://lore.kernel.org/r/20240831001538.336683-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:29 -07:00
Sean Christopherson
29e495bdf8 KVM: x86/mmu: Always walk guest PTEs with WRITE access when unprotecting
When getting a gpa from a gva to unprotect the associated gfn when an
event is awating reinjection, walk the guest PTEs for WRITE as there's no
point in unprotecting the gfn if the guest is unable to write the page,
i.e. if write-protection can't trigger emulation.

Note, the entire flow should be guarded on the access being a write, and
even better should be conditioned on actually triggering a write-protect
fault.  This will be addressed in a future commit.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:28 -07:00
Sean Christopherson
b7e948898e KVM: x86/mmu: Don't try to unprotect an INVALID_GPA
If getting the gpa for a gva fails, e.g. because the gva isn't mapped in
the guest page tables, don't try to unprotect the invalid gfn.  This is
mostly a performance fix (avoids unnecessarily taking mmu_lock), as
for_each_gfn_valid_sp_with_gptes() won't explode on garbage input, it's
simply pointless.

Link: https://lore.kernel.org/r/20240831001538.336683-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:27 -07:00
Sean Christopherson
dfaae8447c KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs
Try to unprotect shadow pages if and only if indirect_shadow_pages is non-
zero, i.e. iff there is at least one protected such shadow page.  Pre-
checking indirect_shadow_pages avoids taking mmu_lock for write when the
gfn is write-protected by a third party, i.e. not for KVM shadow paging,
and in the *extremely* unlikely case that a different task has already
unprotected the last shadow page.

Link: https://lore.kernel.org/r/20240831001538.336683-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:24 -07:00
Sean Christopherson
01dd4d3192 KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path
Move the anti-infinite-loop protection provided by last_retry_{eip,addr}
into kvm_mmu_write_protect_fault() so that it guards unprotect+retry that
never hits the emulator, as well as reexecute_instruction(), which is the
last ditch "might as well try it" logic that kicks in when emulation fails
on an instruction that faulted on a write-protected gfn.

Add a new helper, kvm_mmu_unprotect_gfn_and_retry(), to set the retry
fields and deduplicate other code (with more to come).

Link: https://lore.kernel.org/r/20240831001538.336683-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:23 -07:00
Sean Christopherson
2fb2b7877b KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected
When doing "fast unprotection" of nested TDP page tables, skip emulation
if and only if at least one gfn was unprotected, i.e. continue with
emulation if simply resuming is likely to hit the same fault and risk
putting the vCPU into an infinite loop.

Note, it's entirely possible to get a false negative, e.g. if a different
vCPU faults on the same gfn and unprotects the gfn first, but that's a
relatively rare edge case, and emulating is still functionally ok, i.e.
saving a few cycles by avoiding emulation isn't worth the risk of putting
the vCPU into an infinite loop.

Opportunistically rewrite the relevant comment to document in gory detail
exactly what scenario the "fast unprotect" logic is handling.

Fixes: 147277540b ("kvm: svm: Add support for additional SVM NPF error codes")
Cc: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:20 -07:00
Sean Christopherson
989a84c93f KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults
Trigger KVM's various "unprotect gfn" paths if and only if the page fault
was a write to a write-protected gfn.  To do so, add a new page fault
return code, RET_PF_WRITE_PROTECTED, to explicitly and precisely track
such page faults.

If a page fault requires emulation for any MMIO (or any reason besides
write-protection), trying to unprotect the gfn is pointless and risks
putting the vCPU into an infinite loop.  E.g. KVM will put the vCPU into
an infinite loop if the vCPU manages to trigger MMIO on a page table walk.

Fixes: 147277540b ("kvm: svm: Add support for additional SVM NPF error codes")
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240831001538.336683-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:19 -07:00
Sean Christopherson
4ececec19a KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper
Drop the globally visible PFERR_NESTED_GUEST_PAGE and replace it with a
more appropriately named is_write_to_guest_page_table().  The macro name
is misleading, because while all nNPT walks match PAGE|WRITE|PRESENT, the
reverse is not true.

No functional change intended.

Link: https://lore.kernel.org/r/20240831001538.336683-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09 20:16:18 -07:00
Sean Christopherson
acf2923271 KVM: x86/mmu: Clean up function comments for dirty logging APIs
Rework the function comment for kvm_arch_mmu_enable_log_dirty_pt_masked()
into the body of the function, as it has gotten a bit stale, is harder to
read without the code context, and is the last source of warnings for W=1
builds in KVM x86 due to using a kernel-doc comment without documenting
all parameters.

Opportunistically subsume the functions comments for
kvm_mmu_write_protect_pt_masked() and kvm_mmu_clear_dirty_pt_masked(), as
there is no value in regurgitating similar information at a higher level,
and capturing the differences between write-protection and PML-based dirty
logging is best done in a common location.

No functional change intended.

Cc: David Matlack <dmatlack@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20240802202006.340854-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:40:04 -07:00
Sean Christopherson
28cec7f08b KVM: x86/mmu: Check that root is valid/loaded when pre-faulting SPTEs
Error out if kvm_mmu_reload() fails when pre-faulting memory, as trying to
fault-in SPTEs will fail miserably due to root.hpa pointing at garbage.

Note, kvm_mmu_reload() can return -EIO and thus trigger the WARN on -EIO
in kvm_vcpu_pre_fault_memory(), but all such paths also WARN, i.e. the
WARN isn't user-triggerable and won't run afoul of warn-on-panic because
the kernel would already be panicking.

  BUG: unable to handle page fault for address: 000029ffffffffe8
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] PREEMPT SMP
  CPU: 22 PID: 1069 Comm: pre_fault_memor Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #548
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:is_page_fault_stale+0x3e/0xe0 [kvm]
  RSP: 0018:ffffc9000114bd48 EFLAGS: 00010206
  RAX: 00003fffffffffc0 RBX: ffff88810a07c080 RCX: ffffc9000114bd78
  RDX: ffff88810a07c080 RSI: ffffea0000000000 RDI: ffff88810a07c080
  RBP: ffffc9000114bd78 R08: 00007fa3c8c00000 R09: 8000000000000225
  R10: ffffea00043d7d80 R11: 0000000000000000 R12: ffff88810a07c080
  R13: 0000000100000000 R14: ffffc9000114be58 R15: 0000000000000000
  FS:  00007fa3c9da0740(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000029ffffffffe8 CR3: 000000011d698000 CR4: 0000000000352eb0
  Call Trace:
   <TASK>
   kvm_tdp_page_fault+0xcc/0x160 [kvm]
   kvm_mmu_do_page_fault+0xfb/0x1f0 [kvm]
   kvm_arch_vcpu_pre_fault_memory+0xd0/0x1a0 [kvm]
   kvm_vcpu_ioctl+0x761/0x8c0 [kvm]
   __x64_sys_ioctl+0x82/0xb0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  Modules linked in: kvm_intel kvm
  CR2: 000029ffffffffe8
  ---[ end trace 0000000000000000 ]---

Fixes: 6e01b7601d ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()")
Reported-by: syzbot+23786faffb695f17edaa@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000002b84dc061dd73544@google.com
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: xingwei lee <xrivendell7@gmail.com>
Tested-by: yuxin wang <wang1315768607@163.com>
Link: https://lore.kernel.org/r/20240723000211.3352304-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:25:32 -07:00
Yan Zhao
aa8d1f48d3 KVM: x86/mmu: Introduce a quirk to control memslot zap behavior
Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
VMs. Make sure KVM behave as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs.

The KVM_X86_QUIRK_SLOT_ZAP_ALL quirk offers two behavior options:
- when enabled:  Invalidate/zap all SPTEs ("zap-all"),
- when disabled: Precisely zap only the leaf SPTEs within the range of the
                 moving/deleting memory slot ("zap-slot-leafs-only").

"zap-all" is today's KVM behavior to work around a bug [1] where the
changing the zapping behavior of memslot move/deletion would cause VM
instability for VMs with an Nvidia GPU assigned; while
"zap-slot-leafs-only" allows for more precise zapping of SPTEs within the
memory slot range, improving performance in certain scenarios [2], and
meeting the functional requirements for TDX.

Previous attempts to select "zap-slot-leafs-only" include a per-VM
capability approach [3] (which was not preferred because the root cause of
the bug remained unidentified) and a per-memslot flag approach [4]. Sean
and Paolo finally recommended the implementation of this quirk and
explained that it's the least bad option [5].

By default, the quirk is enabled on KVM_X86_DEFAULT_VM VMs to use
"zap-all". Users have the option to disable the quirk to select
"zap-slot-leafs-only" for specific KVM_X86_DEFAULT_VM VMs that are
unaffected by this bug.

For non-KVM_X86_DEFAULT_VM VMs, the "zap-slot-leafs-only" behavior is
always selected without user's opt-in, regardless of if the user opts for
"zap-all".
This is because it is assumed until proven otherwise that non-
KVM_X86_DEFAULT_VM VMs will not be exposed to the bug [1], and most
importantly, it's because TDX must have "zap-slot-leafs-only" always
selected. In TDX's case a memslot's GPA range can be a mixture of "private"
or "shared" memory. Shared is roughly analogous to how EPT is handled for
normal VMs, but private GPAs need lots of special treatment:
1) "zap-all" would require to zap private root page or non-leaf entries or
   at least leaf-entries beyond the deleting memslot scope. However, TDX
   demands that the root page of the private page table remains unchanged,
   with leaf entries being zapped before non-leaf entries, and any dropped
   private guest pages must be re-accepted by the guest.
2) if "zap-all" zaps only shared page tables, it would result in private
   pages still being mapped when the memslot is gone. This may affect even
   other processes if later the gmem fd was whole punched, causing the
   pages being freed on the host while still mapped in the TD, because
   there's no pgoff to the gfn information to zap the private page table
   after memslot is gone.

So, simply go "zap-slot-leafs-only" as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs to avoid manual opt-in for every VM type [6] or
complicating quirk disabling interface (current quirk disabling interface
is limited, no way to query quirks, or force them to be disabled).

Add a new function kvm_mmu_zap_memslot_leafs() to implement
"zap-slot-leafs-only". This function does not call kvm_unmap_gfn_range(),
bypassing special handling to APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, as
1) The APIC_ACCESS_PAGE_PRIVATE_MEMSLOT cannot be created by users, nor can
   it be moved. It is only deleted by KVM when APICv is permanently
   inhibited.
2) kvm_vcpu_reload_apic_access_page() effectively does nothing when
   APIC_ACCESS_PAGE_PRIVATE_MEMSLOT is deleted.
3) Avoid making all cpus request of KVM_REQ_APIC_PAGE_RELOAD can save on
   costly IPIs.

Suggested-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com/#25054908 [2]
Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [3]
Link: https://lore.kernel.org/all/20240515005952.3410568-3-rick.p.edgecombe@intel.com [4]
Link: https://lore.kernel.org/all/7df9032d-83e4-46a1-ab29-6c7973a2ab0b@redhat.com [5]
Link: https://lore.kernel.org/all/ZnGa550k46ow2N3L@google.com [6]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20240703021043.13881-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-14 12:29:11 -04:00
Ackerley Tng
aca0ec970d KVM: x86/mmu: fix determination of max NPT mapping level for private pages
The `if (req_max_level)` test was meant ignore req_max_level if
PG_LEVEL_NONE was returned. Hence, this function should return
max_level instead of the ignored req_max_level.

This is only a latent issue for now, since guest_memfd does not
support large pages.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Message-ID: <20240801173955.1975034-1-ackerleytng@google.com>
Fixes: f32fb32820 ("KVM: x86: Add hook for determining max NPT mapping level")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-01 14:13:11 -04:00
Paolo Bonzini
4b5f67120a KVM: extend kvm_range_has_memory_attributes() to check subset of attributes
While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE,
KVM code such as kvm_mem_is_private() is written to expect their existence.
Allow using kvm_range_has_memory_attributes() as a multi-page version of
kvm_mem_is_private(), without it breaking later when more attributes are
introduced.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 14:46:15 -04:00
Paolo Bonzini
5932ca411e KVM: x86: disallow pre-fault for SNP VMs before initialization
KVM_PRE_FAULT_MEMORY for an SNP guest can race with
sev_gmem_post_populate() in bad ways. The following sequence for
instance can potentially trigger an RMP fault:

  thread A, sev_gmem_post_populate: called
  thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP
  thread A, sev_gmem_post_populate: *vaddr = kmap_local_pfn(pfn + i);
  thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE);
  RMP #PF

Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's
initial private memory contents have been finalized via
KVM_SEV_SNP_LAUNCH_FINISH.

Beyond fixing this issue, it just sort of makes sense to enforce this,
since the KVM_PRE_FAULT_MEMORY documentation states:

  "KVM maps memory as if the vCPU generated a stage-2 read page fault"

which sort of implies we should be acting on the same guest state that a
vCPU would see post-launch after the initial guest memory is all set up.

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 14:46:14 -04:00
Wei Wang
896046474f KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
Introduces kvm_x86_call(), to streamline the usage of static calls of
kvm_x86_ops. The current implementation of these calls is verbose and
could lead to alignment challenges. This makes the code susceptible to
exceeding the "80 columns per single line of code" limit as defined in
the coding-style document. Another issue with the existing implementation
is that the addition of kvm_x86_ prefix to hooks at the static_call sites
hinders code readability and navigation. kvm_x86_call() is added to
improve code readability and maintainability, while adhering to the coding
style guidelines.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-16 12:14:12 -04:00
Paolo Bonzini
5c5ddf7107 Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MTRR virtualization removal

Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD
hack, and instead always honor guest PAT on CPUs that support self-snoop.
2024-07-16 09:54:57 -04:00
Paolo Bonzini
34b69edecb Merge tag 'kvm-x86-mmu-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.11

 - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't
   hold leafs SPTEs.

 - Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager
   page splitting to avoid stalling vCPUs when splitting huge pages.

 - Misc cleanups
2024-07-16 09:53:28 -04:00
Paolo Bonzini
5dcc1e7614 Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.11

 - Add a global struct to consolidate tracking of host values, e.g. EFER, and
   move "shadow_phys_bits" into the structure as "maxphyaddr".

 - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
   bus frequency, because TDX.

 - Print the name of the APICv/AVIC inhibits in the relevant tracepoint.

 - Clean up KVM's handling of vendor specific emulation to consistently act on
   "compatible with Intel/AMD", versus checking for a specific vendor.

 - Misc cleanups
2024-07-16 09:53:05 -04:00
Paolo Bonzini
f3996d4d79 Merge branch 'kvm-prefault' into HEAD
Pre-population has been requested several times to mitigate KVM page faults
during guest boot or after live migration.  It is also required by TDX
before filling in the initial guest memory with measured contents.
Introduce it as a generic API.
2024-07-12 11:18:45 -04:00