Commit Graph

3098 Commits

Author SHA1 Message Date
Pawan Gupta
2a0180129d KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
Mitigation for RFDS requires RFDS_CLEAR capability which is enumerated
by MSR_IA32_ARCH_CAPABILITIES bit 27. If the host has it set, export it
to guests so that they can deploy the mitigation.

RFDS_NO indicates that the system is not vulnerable to RFDS, export it
to guests so that they don't deploy the mitigation unnecessarily. When
the host is not affected by X86_BUG_RFDS, but has RFDS_NO=0, synthesize
RFDS_NO to the guest.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-11 13:13:50 -07:00
Paolo Bonzini
e9a2bba476 Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM Xen and pfncache changes for 6.9:

 - Rip out the half-baked support for using gfn_to_pfn caches to manage pages
   that are "mapped" into guests via physical addresses.

 - Add support for using gfn_to_pfn caches with only a host virtual address,
   i.e. to bypass the "gfn" stage of the cache.  The primary use case is
   overlay pages, where the guest may change the gfn used to reference the
   overlay page, but the backing hva+pfn remains the same.

 - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
   of a gpa, so that userspace doesn't need to reconfigure and invalidate the
   cache/mapping if the guest changes the gpa (but userspace keeps the resolved
   hva the same).

 - When possible, use a single host TSC value when computing the deadline for
   Xen timers in order to improve the accuracy of the timer emulation.

 - Inject pending upcall events when the vCPU software-enables its APIC to fix
   a bug where an upcall can be lost (and to follow Xen's behavior).

 - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
   events fails, e.g. if the guest has aliased xAPIC IDs.

 - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
   refresh), and drop a now-redundant acquisition of xen_lock (that was
   protecting the shared_info cache) to fix a deadlock due to recursively
   acquiring xen_lock.
2024-03-11 10:42:55 -04:00
Paolo Bonzini
e9025cdd8c Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.9:

 - Fix several bugs where KVM speciously prevents the guest from utilizing
   fixed counters and architectural event encodings based on whether or not
   guest CPUID reports support for the _architectural_ encoding.

 - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
   priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.

 - Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
   and a variety of other PMC-related behaviors that depend on guest CPUID,
   i.e. are difficult to validate via KVM-Unit-Tests.

 - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
   cycles, e.g. when checking if a PMC event needs to be synthesized when
   skipping an instruction.

 - Optimize triggering of emulated events, e.g. for "count instructions" events
   when skipping an instruction, which yields a ~10% performance improvement in
   VM-Exit microbenchmarks when a vPMU is exposed to the guest.

 - Tighten the check for "PMI in guest" to reduce false positives if an NMI
   arrives in the host while KVM is handling an IRQ VM-Exit.
2024-03-11 10:41:09 -04:00
Paolo Bonzini
41ebae2ecd Merge tag 'kvm-x86-mmu-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.9:

 - Clean up code related to unprotecting shadow pages when retrying a guest
   instruction after failed #PF-induced emulation.

 - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
   a reschedule is needed, e.g. if a high priority task needs to run.  Because
   KVM doesn't support yielding in the middle of processing a zapped non-leaf
   SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
   attempting to schedule in a high priority.

 - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
   read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.

 - Allocate write-tracking metadata on-demand to avoid the memory overhead when
   running kernels built with KVMGT support (external write-tracking enabled),
   but for workloads that don't use nested virtualization (shadow paging) or
   KVMGT.
2024-03-11 10:29:22 -04:00
Paolo Bonzini
c9cd0beae9 Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.9:

 - Explicitly initialize a variety of on-stack variables in the emulator that
   triggered KMSAN false positives (though in fairness in KMSAN, it's comically
   difficult to see that the uninitialized memory is never truly consumed).

 - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
   DR6 and DR7.

 - Rework the "force immediate exit" code so that vendor code ultimately
   decides how and when to force the exit.  This allows VMX to further optimize
   handling preemption timer exits, and allows SVM to avoid sending a duplicate
   IPI (SVM also has a need to force an exit).

 - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
   vCPU creation ultimately failed, and add WARN to guard against similar bugs.

 - Provide a dedicated arch hook for checking if a different vCPU was in-kernel
   (for directed yield), and simplify the logic for checking if the currently
   loaded vCPU is in-kernel.

 - Misc cleanups and fixes.
2024-03-11 10:24:56 -04:00
Paolo Bonzini
39fee313fd Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM GUEST_MEMFD fixes for 6.8:

 - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
   avoid creating ABI that KVM can't sanely support.

 - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
   clear that such VMs are purely a development and testing vehicle, and
   come with zero guarantees.

 - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
   is to support confidential VMs with deterministic private memory (SNP
   and TDX) only in the TDP MMU.

 - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
   when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
2024-03-09 11:42:17 -05:00
David Woodhouse
451a707813 KVM: x86/xen: improve accuracy of Xen timers
A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.

But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...

The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).

Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.

This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.

Fixes: 5363952605 ("KVM: x86/xen: handle PV timers oneshot mode")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org
[sean: massage moved comment, tweak if statement formatting]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-03-04 16:22:32 -08:00
Sean Christopherson
a1176ef5c9 KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU
Advertise and support software-protected VMs if and only if the TDP MMU is
enabled, i.e. disallow KVM_SW_PROTECTED_VM if TDP is enabled for KVM's
legacy/shadow MMU.  TDP support for the shadow MMU is maintenance-only,
e.g. support for TDX and SNP will also be restricted to the TDP MMU.

Fixes: 89ea60c2c7 ("KVM: x86: Add support for "protected VMs" that can utilize private memory")
Link: https://lore.kernel.org/r/20240222190612.2942589-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 17:07:06 -08:00
Sean Christopherson
322d79f1db KVM: x86: Clean up directed yield API for "has pending interrupt"
Directly return the boolean result of whether or not a vCPU has a pending
interrupt instead of effectively doing:

  if (true)
	return true;

  return false;

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240110003938.490206-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:27:40 -08:00
Sean Christopherson
9b8615c5d3 KVM: x86: Rely solely on preempted_in_kernel flag for directed yield
Snapshot preempted_in_kernel using kvm_arch_vcpu_in_kernel() so that the
flag is "accurate" (or rather, consistent and deterministic within KVM)
for guests with protected state, and explicitly use preempted_in_kernel
when checking if a vCPU was preempted in kernel mode instead of bouncing
through kvm_arch_vcpu_in_kernel().

Drop the gnarly logic in kvm_arch_vcpu_in_kernel() that redirects to
preempted_in_kernel if the target vCPU is not the "running", i.e. loaded,
vCPU, as the only reason that code existed was for the directed yield case
where KVM wants to check the CPL of a vCPU that may or may not be loaded
on the current pCPU.

Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240110003938.490206-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:27:03 -08:00
Sean Christopherson
77bcd9e623 KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel
Plumb in a dedicated hook for querying whether or not a vCPU was preempted
in-kernel.  Unlike literally every other architecture, x86's VMX can check
if a vCPU is in kernel context if and only if the vCPU is loaded on the
current pCPU.

x86's kvm_arch_vcpu_in_kernel() works around the limitation by querying
kvm_get_running_vcpu() and redirecting to vcpu->arch.preempted_in_kernel
as needed.  But that's unnecessary, confusing, and fragile, e.g. x86 has
had at least one bug where KVM incorrectly used a stale
preempted_in_kernel.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240110003938.490206-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:26:26 -08:00
Sean Christopherson
fc3c94142b KVM: x86: Sanity check that kvm_has_noapic_vcpu is zero at module_exit()
WARN if kvm.ko is unloaded with an elevated kvm_has_noapic_vcpu to guard
against incorrect management of the key, e.g. to detect if KVM fails to
decrement the key in error paths.  Because kvm_has_noapic_vcpu is purely
an optimization, in all likelihood KVM could completely botch handling of
kvm_has_noapic_vcpu and no one would notice (which is a good argument for
deleting the key entirely, but that's a problem for another day).

Note, ideally the sanity check would be performance when kvm_usage_count
goes to zero, but adding an arch callback just for this sanity check isn't
at all worth doing.

Link: https://lore.kernel.org/r/20240209222047.394389-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:24:26 -08:00
Sean Christopherson
a78d904669 KVM: x86: Move "KVM no-APIC vCPU" key management into local APIC code
Move incrementing and decrementing of kvm_has_noapic_vcpu into
kvm_create_lapic() and kvm_free_lapic() respectively to fix a benign bug
where KVM fails to decrement the count if vCPU creation ultimately fails,
e.g. due to a memory allocation failing.

Note, the bug is benign as kvm_has_noapic_vcpu is used purely to optimize
lapic_in_kernel() checks, and that optimization is quite dubious.  That,
and practically speaking no setup that cares at all about performance runs
with a userspace local APIC.

Reported-by: Li RongQing <lirongqing@baidu.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Link: https://lore.kernel.org/r/20240209222047.394389-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:24:09 -08:00
Sean Christopherson
0ec3d6d1f1 KVM: x86: Fully defer to vendor code to decide how to force immediate exit
Now that vmx->req_immediate_exit is used only in the scope of
vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp
the VMX preemption to force a VM-Exit and let vendor code fully handle
forcing a VM-Exit.

Opportunsitically drop __kvm_request_immediate_exit() and just have
vendor code call smp_send_reschedule() directly.  SVM already does this
when injecting an event while also trying to single-step an IRET, i.e.
it's not exactly secret knowledge that KVM uses a reschedule IPI to force
an exit.

Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:41 -08:00
Sean Christopherson
9c9025ea00 KVM: x86: Plumb "force_immediate_exit" into kvm_entry() tracepoint
Annotate the kvm_entry() tracepoint with "immediate exit" when KVM is
forcing a VM-Exit immediately after VM-Enter, e.g. when KVM wants to
inject an event but needs to first complete some other operation.
Knowing that KVM is (or isn't) forcing an exit is useful information when
debugging issues related to event injection.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240110012705.506918-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:22:36 -08:00
Sean Christopherson
dfeef3d3f3 KVM: x86: Drop superfluous check on direct MMU vs. WRITE_PF_TO_SP flag
Remove reexecute_instruction()'s final check on the MMU being direct, as
EMULTYPE_WRITE_PF_TO_SP is only ever set if the MMU is indirect, i.e. is a
shadow MMU.  Prior to commit 93c05d3ef2 ("KVM: x86: improve
reexecute_instruction"), the flag simply didn't exist (and KVM actually
returned "true" unconditionally for both types of MMUs).  I.e. the
explicit check for a direct MMU is simply leftover artifact from old code.

Link: https://lore.kernel.org/r/20240203002343.383056-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:19:06 -08:00
Sean Christopherson
515c18a64e KVM: x86: Drop dedicated logic for direct MMUs in reexecute_instruction()
Now that KVM doesn't pointlessly acquire mmu_lock for direct MMUs, drop
the dedicated path entirely and always query indirect_shadow_pages when
deciding whether or not to try unprotecting the gfn.  For indirect, a.k.a.
shadow MMUs, checking indirect_shadow_pages is harmless; unless *every*
shadow page was somehow zapped while KVM was attempting to emulate the
instruction, indirect_shadow_pages is guaranteed to be non-zero.

Well, unless the instruction used a direct hugepage with 2-level paging
for its code page, but in that case, there's obviously nothing to
unprotect.  And in the extremely unlikely case all shadow pages were
zapped, there's again obviously nothing to unprotect.

Link: https://lore.kernel.org/r/20240203002343.383056-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:19:06 -08:00
Mingwei Zhang
474b99ed70 KVM: x86/mmu: Don't acquire mmu_lock when using indirect_shadow_pages as a heuristic
Drop KVM's completely pointless acquisition of mmu_lock when deciding
whether or not to unprotect any shadow pages residing at the gfn before
resuming the guest to let it retry an instruction that KVM failed to
emulated.  In this case, indirect_shadow_pages is used as a coarse-grained
heuristic to check if there is any chance of there being a relevant shadow
page to unprotected.  But acquiring mmu_lock largely defeats any benefit
to the heuristic, as taking mmu_lock for write is likely far more costly
to the VM as a whole than unnecessarily walking mmu_page_hash.

Furthermore, the current code is already prone to false negatives and
false positives, as it drops mmu_lock before checking the flag and
unprotecting shadow pages.  And as evidenced by the lack of bug reports,
neither false positives nor false negatives are problematic.  A false
positive simply means that KVM will try to unprotect shadow pages that
have already been zapped.  And a false negative means that KVM will
resume the guest without unprotecting the gfn, i.e. if a shadow page was
_just_ created, the vCPU will hit the same page fault and do the whole
dance all over again, and detect and unprotect the shadow page the second
time around (or not, if something else zaps it first).

Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
[sean: drop READ_ONCE() and comment change, rewrite changelog]
Link: https://lore.kernel.org/r/20240203002343.383056-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:19:06 -08:00
Sean Christopherson
2a5f091ce1 KVM: x86: Open code all direct reads to guest DR6 and DR7
Bite the bullet, and open code all direct reads of DR6 and DR7.  KVM
currently has a mix of open coded accesses and calls to kvm_get_dr(),
which is confusing and ugly because there's no rhyme or reason as to why
any particular chunk of code uses kvm_get_dr().

The obvious alternative is to force all accesses through kvm_get_dr(),
but it's not at all clear that doing so would be a net positive, e.g. even
if KVM ends up wanting/needing to force all reads through a common helper,
e.g. to play caching games, the cost of reverting this change is likely
lower than the ongoing cost of maintaining weird, arbitrary code.

No functional change intended.

Cc: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240209220752.388160-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:14:47 -08:00
Sean Christopherson
fc5375dd8c KVM: x86: Make kvm_get_dr() return a value, not use an out parameter
Convert kvm_get_dr()'s output parameter to a return value, and clean up
most of the mess that was created by forcing callers to provide a pointer.

No functional change intended.

Acked-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240209220752.388160-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 16:14:47 -08:00
Paul Durrant
615451d8cb KVM: x86/xen: advertize the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA capability
Now that all relevant kernel changes and selftests are in place, enable the
new capability.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-17-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 07:01:19 -08:00
Paul Durrant
a4bff3df51 KVM: pfncache: remove KVM_GUEST_USES_PFN usage
As noted in [1] the KVM_GUEST_USES_PFN usage flag is never set by any
callers of kvm_gpc_init(), and for good reason: the implementation is
incomplete/broken.  And it's not clear that there will ever be a user of
KVM_GUEST_USES_PFN, as coordinating vCPUs with mmu_notifier events is
non-trivial.

Remove KVM_GUEST_USES_PFN and all related code, e.g. dropping
KVM_GUEST_USES_PFN also makes the 'vcpu' argument redundant, to avoid
having to reason about broken code as __kvm_gpc_refresh() evolves.

Moreover, all existing callers specify KVM_HOST_USES_PFN so the usage
check in hva_to_pfn_retry() and hence the 'usage' argument to
kvm_gpc_init() are also redundant.

[1] https://lore.kernel.org/all/ZQiR8IpqOZrOpzHC@google.com

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-6-paul@xen.org
[sean: explicitly call out that guest usage is incomplete]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:43 -08:00
Paul Durrant
78b74638eb KVM: pfncache: add a mark-dirty helper
At the moment pages are marked dirty by open-coded calls to
mark_page_dirty_in_slot(), directly deferefencing the gpa and memslot
from the cache. After a subsequent patch these may not always be set
so add a helper now so that caller will protected from the need to know
about this detail.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-5-paul@xen.org
[sean: decrease indentation, use gpa_to_gfn()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:42 -08:00
Sean Christopherson
910c57dfa4 KVM: x86: Mark target gfn of emulated atomic instruction as dirty
When emulating an atomic access on behalf of the guest, mark the target
gfn dirty if the CMPXCHG by KVM is attempted and doesn't fault.  This
fixes a bug where KVM effectively corrupts guest memory during live
migration by writing to guest memory without informing userspace that the
page is dirty.

Marking the page dirty got unintentionally dropped when KVM's emulated
CMPXCHG was converted to do a user access.  Before that, KVM explicitly
mapped the guest page into kernel memory, and marked the page dirty during
the unmap phase.

Mark the page dirty even if the CMPXCHG fails, as the old data is written
back on failure, i.e. the page is still written.  The value written is
guaranteed to be the same because the operation is atomic, but KVM's ABI
is that all writes are dirty logged regardless of the value written.  And
more importantly, that's what KVM did before the buggy commit.

Huge kudos to the folks on the Cc list (and many others), who did all the
actual work of triaging and debugging.

Fixes: 1c2361f667 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Cc: Pasha Tatashin <tatashin@google.com>
Cc: Michael Krebs <mkrebs@google.com>
base-commit: 6769ea8da8a93ed4630f1ce64df6aafcaabfce64
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240215010004.1456078-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-16 16:56:01 -08:00
Paolo Bonzini
2f8ebe43a0 Merge tag 'kvm-x86-selftests-6.8-rcN' of https://github.com/kvm-x86/linux into HEAD
KVM selftests fixes/cleanups (and one KVM x86 cleanup) for 6.8:

 - Remove redundant newlines from error messages.

 - Delete an unused variable in the AMX test (which causes build failures when
   compiling with -Werror).

 - Fail instead of skipping tests if open(), e.g. of /dev/kvm, fails with an
   error code other than ENOENT (a Hyper-V selftest bug resulted in an EMFILE,
   and the test eventually got skipped).

 - Fix TSC related bugs in several Hyper-V selftests.

 - Fix a bug in the dirty ring logging test where a sem_post() could be left
   pending across multiple runs, resulting in incorrect synchronization between
   the main thread and the vCPU worker thread.

 - Relax the dirty log split test's assertions on 4KiB mappings to fix false
   positives due to the number of mappings for memslot 0 (used for code and
   data that is NOT being dirty logged) changing, e.g. due to NUMA balancing.

 - Have KVM's gtod_is_based_on_tsc() return "bool" instead of an "int" (the
   function generates boolean values, and all callers treat the return value as
   a bool).
2024-02-14 12:34:58 -05:00
Paolo Bonzini
22d0bc0721 Merge tag 'kvm-x86-fixes-6.8-rcN' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.8:

 - Make a KVM_REQ_NMI request while handling KVM_SET_VCPU_EVENTS if and only
   if the incoming events->nmi.pending is non-zero.  If the target vCPU is in
   the UNITIALIZED state, the spurious request will result in KVM exiting to
   userspace, which in turn causes QEMU to constantly acquire and release
   QEMU's global mutex, to the point where the BSP is unable to make forward
   progress.

 - Fix a type (u8 versus u64) goof that results in pmu->fixed_ctr_ctrl being
   incorrectly truncated, and ultimately causes KVM to think a fixed counter
   has already been disabled (KVM thinks the old value is '0').

 - Fix a stack leak in KVM_GET_MSRS where a failed MSR read from userspace
   that is ultimately ignored due to ignore_msrs=true doesn't zero the output
   as intended.
2024-02-14 12:34:43 -05:00
Mathias Krause
e1dda3afe2 KVM: x86: Fix broken debugregs ABI for 32 bit kernels
The ioctl()s to get and set KVM's debug registers are broken for 32 bit
kernels as they'd only copy half of the user register state because of a
UAPI and in-kernel type mismatch (__u64 vs. unsigned long; 8 vs. 4
bytes).

This makes it impossible for userland to set anything but DR0 without
resorting to bit folding tricks.

Switch to a loop for copying debug registers that'll implicitly do the
type conversion for us, if needed.

There are likely no users (left) for 32bit KVM, fix the bug nonetheless.

Fixes: a1efbe77c1 ("KVM: x86: Add support for saving&restoring debug registers")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240203124522.592778-4-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-05 15:40:54 -08:00
Mathias Krause
3376ca3f1a KVM: x86: Fix KVM_GET_MSRS stack info leak
Commit 6abe9c1386 ("KVM: X86: Move ignore_msrs handling upper the
stack") changed the 'ignore_msrs' handling, including sanitizing return
values to the caller. This was fine until commit 12bc2132b1 ("KVM:
X86: Do the same ignore_msrs check for feature msrs") which allowed
non-existing feature MSRs to be ignored, i.e. to not generate an error
on the ioctl() level. It even tried to preserve the sanitization of the
return value. However, the logic is flawed, as '*data' will be
overwritten again with the uninitialized stack value of msr.data.

Fix this by simplifying the logic and always initializing msr.data,
vanishing the need for an additional error exit path.

Fixes: 12bc2132b1 ("KVM: X86: Do the same ignore_msrs check for feature msrs")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240203124522.592778-2-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-05 11:20:51 -08:00
Sean Christopherson
f19063b1ca KVM: x86/pmu: Snapshot event selectors that KVM emulates in software
Snapshot the event selectors for the events that KVM emulates in software,
which is currently instructions retired and branch instructions retired.
The event selectors a tied to the underlying CPU, i.e. are constant for a
given platform even though perf doesn't manage the mappings as such.

Getting the event selectors from perf isn't exactly cheap, especially if
mitigations are enabled, as at least one indirect call is involved.

Snapshot the values in KVM instead of optimizing perf as working with the
raw event selectors will be required if KVM ever wants to emulate events
that aren't part of perf's uABI, i.e. that don't have an "enum perf_hw_id"
entry.

Link: https://lore.kernel.org/r/20231110022857.1273836-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 09:35:48 -08:00
Vitaly Kuznetsov
9e62797fd7 KVM: x86: Make gtod_is_based_on_tsc() return 'bool'
gtod_is_based_on_tsc() is boolean in nature, i.e. it returns '1' for good
clocksources and '0' otherwise. Moreover, its result is used raw by
kvm_get_time_and_clockread()/kvm_get_walltime_and_clockread() which are
'bool'.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240109141121.1619463-6-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-01 08:58:16 -08:00
Maciej S. Szmigiero
d52734d00b KVM: x86: Give a hint when Win2016 might fail to boot due to XSAVES erratum
Since commit b0563468ee ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs.

Because KVM CPU caps are initialized from the kernel boot CPU features this
makes the XSAVES feature also unavailable for KVM guests in this case.
At the same time the XSAVEC feature is left enabled.

Unfortunately, having XSAVEC but no XSAVES in CPUID breaks Hyper-V enabled
Windows Server 2016 VMs that have more than one vCPU.

Let's at least give users hint in the kernel log what could be wrong since
these VMs currently simply hang at boot with a black screen - giving no
clue what suddenly broke them and how to make them work again.

Trigger the kernel message hint based on the particular guest ID written to
the Guest OS Identity Hyper-V MSR implemented by KVM.

Defer this check to when the L1 Hyper-V hypervisor enables SVM in EFER
since we want to limit this message to Hyper-V enabled Windows guests only
(Windows session running nested as L2) but the actual Guest OS Identity MSR
write is done by L1 and happens before it enables SVM.

Fixes: b0563468ee ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <b83ab45c5e239e5d148b0ae7750133a67ac9575c.1706127425.git.maciej.szmigiero@oracle.com>
[Move some checks before mutex_lock(), rename function. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-31 16:21:00 -05:00
Tengfei Yu
9e05d9b067 KVM: x86: Check irqchip mode before create PIT
As the kvm api(https://docs.kernel.org/virt/kvm/api.html) reads,
KVM_CREATE_PIT2 call is only valid after enabling in-kernel irqchip
support via KVM_CREATE_IRQCHIP.

Without this check, I can create PIT first and enable irqchip-split
then, which may cause the PIT invalid because of lacking of in-kernel
PIC to inject the interrupt.

Signed-off-by: Tengfei Yu <moehanabichan@gmail.com>
Message-Id: <20240125050823.4893-1-moehanabichan@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-31 16:21:00 -05:00
Prasad Pandit
6231c9e1a9 KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu
kvm_vcpu_ioctl_x86_set_vcpu_events() routine makes 'KVM_REQ_NMI'
request for a vcpu even when its 'events->nmi.pending' is zero.
Ex:
    qemu_thread_start
     kvm_vcpu_thread_fn
      qemu_wait_io_event
       qemu_wait_io_event_common
        process_queued_cpu_work
         do_kvm_cpu_synchronize_post_init/_reset
          kvm_arch_put_registers
           kvm_put_vcpu_events (cpu, level=[2|3])

This leads vCPU threads in QEMU to constantly acquire & release the
global mutex lock, delaying the guest boot due to lock contention.
Add check to make KVM_REQ_NMI request only if vcpu has NMI pending.

Fixes: bdedff2631 ("KVM: x86: Route pending NMIs from userspace through process_nmi()")
Cc: stable@vger.kernel.org
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Link: https://lore.kernel.org/r/20240103075343.549293-1-ppandit@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-31 07:35:07 -08:00
Sean Christopherson
7bb7fce136 KVM: x86/pmu: Prioritize VMX interception over #GP on RDPMC due to bad index
Apply the pre-intercepts RDPMC validity check only to AMD, and rename all
relevant functions to make it as clear as possible that the check is not a
standard PMC index check.  On Intel, the basic rule is that only invalid
opcodes and privilege/permission/mode checks have priority over VM-Exit,
i.e. RDPMC with an invalid index should VM-Exit, not #GP.  While the SDM
doesn't explicitly call out RDPMC, it _does_ explicitly use RDMSR of a
non-existent MSR as an example where VM-Exit has priority over #GP, and
RDPMC is effectively just a variation of RDMSR.

Manually testing on various Intel CPUs confirms this behavior, and the
inverted priority was introduced for SVM compatibility, i.e. was not an
intentional change for Intel PMUs.  On AMD, *all* exceptions on RDPMC have
priority over VM-Exit.

Check for a NULL kvm_pmu_ops.check_rdpmc_early instead of using a RET0
static call so as to provide a convenient location to document the
difference between Intel and AMD, and to again try to make it as obvious
as possible that the early check is a one-off thing, not a generic "is
this PMC valid?" helper.

Fixes: 8061252ee0 ("KVM: SVM: Add intercept checks for remaining twobyte instructions")
Cc: Jim Mattson <jmattson@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-30 15:28:02 -08:00
Nikolay Borisov
955997e880 KVM: x86: Use mutex guards to eliminate __kvm_x86_vendor_init()
Use the recently introduced guard(mutex) infrastructure acquire and
automatically release vendor_module_lock when the guard goes out of scope.
Drop the inner __kvm_x86_vendor_init(), its sole purpose was to simplify
releasing vendor_module_lock in error paths.

No functional change intended.

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20231030141728.1406118-1-nik.borisov@suse.com
[sean: rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-29 08:37:47 -08:00
Linus Torvalds
09d1c6a80f Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "Generic:

   - Use memdup_array_user() to harden against overflow.

   - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
     architectures.

   - Clean up Kconfigs that all KVM architectures were selecting

   - New functionality around "guest_memfd", a new userspace API that
     creates an anonymous file and returns a file descriptor that refers
     to it. guest_memfd files are bound to their owning virtual machine,
     cannot be mapped, read, or written by userspace, and cannot be
     resized. guest_memfd files do however support PUNCH_HOLE, which can
     be used to switch a memory area between guest_memfd and regular
     anonymous memory.

   - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
     per-page attributes for a given page of guest memory; right now the
     only attribute is whether the guest expects to access memory via
     guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
     TDX or ARM64 pKVM is checked by firmware or hypervisor that
     guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
     the case of pKVM).

  x86:

   - Support for "software-protected VMs" that can use the new
     guest_memfd and page attributes infrastructure. This is mostly
     useful for testing, since there is no pKVM-like infrastructure to
     provide a meaningfully reduced TCB.

   - Fix a relatively benign off-by-one error when splitting huge pages
     during CLEAR_DIRTY_LOG.

   - Fix a bug where KVM could incorrectly test-and-clear dirty bits in
     non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
     a non-huge SPTE.

   - Use more generic lockdep assertions in paths that don't actually
     care about whether the caller is a reader or a writer.

   - let Xen guests opt out of having PV clock reported as "based on a
     stable TSC", because some of them don't expect the "TSC stable" bit
     (added to the pvclock ABI by KVM, but never set by Xen) to be set.

   - Revert a bogus, made-up nested SVM consistency check for
     TLB_CONTROL.

   - Advertise flush-by-ASID support for nSVM unconditionally, as KVM
     always flushes on nested transitions, i.e. always satisfies flush
     requests. This allows running bleeding edge versions of VMware
     Workstation on top of KVM.

   - Sanity check that the CPU supports flush-by-ASID when enabling SEV
     support.

   - On AMD machines with vNMI, always rely on hardware instead of
     intercepting IRET in some cases to detect unmasking of NMIs

   - Support for virtualizing Linear Address Masking (LAM)

   - Fix a variety of vPMU bugs where KVM fail to stop/reset counters
     and other state prior to refreshing the vPMU model.

   - Fix a double-overflow PMU bug by tracking emulated counter events
     using a dedicated field instead of snapshotting the "previous"
     counter. If the hardware PMC count triggers overflow that is
     recognized in the same VM-Exit that KVM manually bumps an event
     count, KVM would pend PMIs for both the hardware-triggered overflow
     and for KVM-triggered overflow.

   - Turn off KVM_WERROR by default for all configs so that it's not
     inadvertantly enabled by non-KVM developers, which can be
     problematic for subsystems that require no regressions for W=1
     builds.

   - Advertise all of the host-supported CPUID bits that enumerate
     IA32_SPEC_CTRL "features".

   - Don't force a masterclock update when a vCPU synchronizes to the
     current TSC generation, as updating the masterclock can cause
     kvmclock's time to "jump" unexpectedly, e.g. when userspace
     hotplugs a pre-created vCPU.

   - Use RIP-relative address to read kvm_rebooting in the VM-Enter
     fault paths, partly as a super minor optimization, but mostly to
     make KVM play nice with position independent executable builds.

   - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
     CONFIG_HYPERV as a minor optimization, and to self-document the
     code.

   - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
     "emulation" at build time.

  ARM64:

   - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
     granule sizes. Branch shared with the arm64 tree.

   - Large Fine-Grained Trap rework, bringing some sanity to the
     feature, although there is more to come. This comes with a prefix
     branch shared with the arm64 tree.

   - Some additional Nested Virtualization groundwork, mostly
     introducing the NV2 VNCR support and retargetting the NV support to
     that version of the architecture.

   - A small set of vgic fixes and associated cleanups.

  Loongarch:

   - Optimization for memslot hugepage checking

   - Cleanup and fix some HW/SW timer issues

   - Add LSX/LASX (128bit/256bit SIMD) support

  RISC-V:

   - KVM_GET_REG_LIST improvement for vector registers

   - Generate ISA extension reg_list using macros in get-reg-list
     selftest

   - Support for reporting steal time along with selftest

  s390:

   - Bugfixes

  Selftests:

   - Fix an annoying goof where the NX hugepage test prints out garbage
     instead of the magic token needed to run the test.

   - Fix build errors when a header is delete/moved due to a missing
     flag in the Makefile.

   - Detect if KVM bugged/killed a selftest's VM and print out a helpful
     message instead of complaining that a random ioctl() failed.

   - Annotate the guest printf/assert helpers with __printf(), and fix
     the various bugs that were lurking due to lack of said annotation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  KVM: x86: add missing "depends on KVM"
  KVM: fix direction of dependency on MMU notifiers
  KVM: introduce CONFIG_KVM_COMMON
  KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
  KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
  RISC-V: KVM: selftests: Add get-reg-list test for STA registers
  RISC-V: KVM: selftests: Add steal_time test support
  RISC-V: KVM: selftests: Add guest_sbi_probe_extension
  RISC-V: KVM: selftests: Move sbi_ecall to processor.c
  RISC-V: KVM: Implement SBI STA extension
  RISC-V: KVM: Add support for SBI STA registers
  RISC-V: KVM: Add support for SBI extension registers
  RISC-V: KVM: Add SBI STA info to vcpu_arch
  RISC-V: KVM: Add steal-update vcpu request
  RISC-V: KVM: Add SBI STA extension skeleton
  RISC-V: paravirt: Implement steal-time support
  RISC-V: Add SBI STA extension definitions
  RISC-V: paravirt: Add skeleton for pv-time support
  RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
  ...
2024-01-17 13:03:37 -08:00
Linus Torvalds
b51cc5d028 Merge tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:

 - Change global variables to local

 - Add missing kernel-doc function parameter descriptions

 - Remove unused parameter from a macro

 - Remove obsolete Kconfig entry

 - Fix comments

 - Fix typos, mostly scripted, manually reviewed

and a micro-optimization got misplaced as a cleanup:

 - Micro-optimize the asm code in secondary_startup_64_no_verify()

* tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  arch/x86: Fix typos
  x86/head_64: Use TESTB instead of TESTL in secondary_startup_64_no_verify()
  x86/docs: Remove reference to syscall trampoline in PTI
  x86/Kconfig: Remove obsolete config X86_32_SMP
  x86/io: Remove the unused 'bw' parameter from the BUILDIO() macro
  x86/mtrr: Document missing function parameters in kernel-doc
  x86/setup: Make relocated_ramdisk a local variable of relocate_initrd()
2024-01-08 17:23:32 -08:00
Paolo Bonzini
3115d2de39 Merge tag 'kvm-x86-xen-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM Xen change for 6.8:

To workaround Xen guests that don't expect Xen PV clocks to be marked as being
based on a stable TSC, add a Xen config knob to allow userspace to opt out of
KVM setting the "TSC stable" bit in Xen PV clocks.  Note, the "TSC stable" bit
was added to the PVCLOCK ABI by KVM without an ack from Xen, i.e. KVM isn't
entirely blameless for the buggy guest behavior.
2024-01-08 08:10:20 -05:00
Paolo Bonzini
8ecb10bcbf Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 support for virtualizing Linear Address Masking (LAM)

Add KVM support for Linear Address Masking (LAM).  LAM tweaks the canonicality
checks for most virtual address usage in 64-bit mode, such that only the most
significant bit of the untranslated address bits must match the polarity of the
last translated address bit.  This allows software to use ignored, untranslated
address bits for metadata, e.g. to efficiently tag pointers for address
sanitization.

LAM can be enabled separately for user pointers and supervisor pointers, and
for userspace LAM can be select between 48-bit and 57-bit masking

 - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
 - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.

For user pointers, LAM enabling utilizes two previously-reserved high bits from
CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
61 respectively.  Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:

 - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
 - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
 - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers

For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:

 - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
 - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
 - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers

The modified LAM canonicality checks:
 - LAM_S48                : [ 1 ][ metadata ][ 1 ]
                              63               47
 - LAM_U48                : [ 0 ][ metadata ][ 0 ]
                              63               47
 - LAM_S57                : [ 1 ][ metadata ][ 1 ]
                              63               56
 - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
                              63               56
 - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
                              63               56..47

The bulk of KVM support for LAM is to emulate LAM's modified canonicality
checks.  The approach taken by KVM is to "fill" the metadata bits using the
highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
to bits 62:48.  The most significant bit, 63, is *not* modified, i.e. its value
from the raw, untagged virtual address is kept for the canonicality check. This
untagging allows

Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
enabling via CR3 and CR4 bits, etc.
2024-01-08 08:10:12 -05:00
Paolo Bonzini
01edb1cfbd Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.8:

 - Fix a variety of bugs where KVM fail to stop/reset counters and other state
   prior to refreshing the vPMU model.

 - Fix a double-overflow PMU bug by tracking emulated counter events using a
   dedicated field instead of snapshotting the "previous" counter.  If the
   hardware PMC count triggers overflow that is recognized in the same VM-Exit
   that KVM manually bumps an event count, KVM would pend PMIs for both the
   hardware-triggered overflow and for KVM-triggered overflow.
2024-01-08 08:10:08 -05:00
Paolo Bonzini
33d0403fda Merge tag 'kvm-x86-misc-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.8:

 - Turn off KVM_WERROR by default for all configs so that it's not
   inadvertantly enabled by non-KVM developers, which can be problematic for
   subsystems that require no regressions for W=1 builds.

 - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
   "features".

 - Don't force a masterclock update when a vCPU synchronizes to the current TSC
   generation, as updating the masterclock can cause kvmclock's time to "jump"
   unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.

 - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
   partly as a super minor optimization, but mostly to make KVM play nice with
   position independent executable builds.
2024-01-08 08:10:04 -05:00
Paolo Bonzini
0afdfd85e3 Merge tag 'kvm-x86-hyperv-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 Hyper-V changes for 6.8:

 - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
   CONFIG_HYPERV as a minor optimization, and to self-document the code.

 - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
   at build time.
2024-01-08 08:10:01 -05:00
Bjorn Helgaas
54aa699e80 arch/x86: Fix typos
Fix typos, most reported by "codespell arch/x86".  Only touches comments,
no code changes.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2024-01-03 11:46:22 +01:00
Paolo Bonzini
136292522e Merge tag 'loongarch-kvm-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.8

1. Optimization for memslot hugepage checking.
2. Cleanup and fix some HW/SW timer issues.
3. Add LSX/LASX (128bit/256bit SIMD) support.
2024-01-02 13:16:29 -05:00
Paolo Bonzini
6254eebad4 Merge tag 'kvm-x86-fixes-6.7-rcN' of https://github.com/kvm-x86/linux into kvm-master
KVM fixes for 6.7-rcN:

 - When checking if a _running_ vCPU is "in-kernel", i.e. running at CPL0,
   get the CPL directly instead of relying on preempted_in_kernel, which
   is valid if and only if the vCPU was preempted, i.e. NOT running.

 - Set .owner for various KVM file_operations so that files refcount the
   KVM module until KVM is done executing _all_ code, including the last
   few instructions of kvm_put_kvm().  And then revert the misguided
   attempt to rely on "struct kvm" refcounts to pin KVM-the-module.

 - Fix a benign "return void" that was recently introduced.
2023-12-08 13:13:45 -05:00
Paul Durrant
6d72283526 KVM x86/xen: add an override for PVCLOCK_TSC_STABLE_BIT
Unless explicitly told to do so (by passing 'clocksource=tsc' and
'tsc=stable:socket', and then jumping through some hoops concerning
potential CPU hotplug) Xen will never use TSC as its clocksource.
Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set
in either the primary or secondary pvclock memory areas. This has
led to bugs in some guest kernels which only become evident if
PVCLOCK_TSC_STABLE_BIT *is* set in the pvclocks. Hence, to support
such guests, give the VMM a new Xen HVM config flag to tell KVM to
forcibly clear the bit in the Xen pvclocks.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20231102162128.2353459-1-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 15:52:57 -08:00
Vitaly Kuznetsov
b4f69df0f6 KVM: x86: Make Hyper-V emulation optional
Hyper-V emulation in KVM is a fairly big chunk and in some cases it may be
desirable to not compile it in to reduce module sizes as well as the attack
surface. Introduce CONFIG_KVM_HYPERV option to make it possible.

Note, there's room for further nVMX/nSVM code optimizations when
!CONFIG_KVM_HYPERV, this will be done in follow-up patches.

Reorganize Makefile a bit so all CONFIG_HYPERV and CONFIG_KVM_HYPERV files
are grouped together.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:57 -08:00
Vitaly Kuznetsov
cfef5af3cb KVM: x86: Move Hyper-V partition assist page out of Hyper-V emulation context
Hyper-V partition assist page is used when KVM runs on top of Hyper-V and
is not used for Windows/Hyper-V guests on KVM, this means that 'hv_pa_pg'
placement in 'struct kvm_hv' is unfortunate. As a preparation to making
Hyper-V emulation optional, move 'hv_pa_pg' to 'struct kvm_arch' and put it
under CONFIG_HYPERV.

While on it, introduce hv_get_partition_assist_page() helper to allocate
partition assist page. Move the comment explaining why we use a single page
for all vCPUs from VMX and expand it a bit.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:01 -08:00
Like Xu
ef8d89033c KVM: x86: Remove 'return void' expression for 'void function'
The requested info will be stored in 'guest_xsave->region' referenced by
the incoming pointer "struct kvm_xsave *guest_xsave", thus there is no need
to explicitly use return void expression for a void function "static void
kvm_vcpu_ioctl_x86_get_xsave(...)". The issue is caught with [-Wpedantic].

Fixes: 2d287ec65e79 ("x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231007064019.17472-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01 08:14:27 -08:00
Sean Christopherson
f2f63f7ec6 KVM: x86/pmu: Stop calling kvm_pmu_reset() at RESET (it's redundant)
Drop kvm_vcpu_reset()'s call to kvm_pmu_reset(), the call is performed
only for RESET, which is really just the same thing as vCPU creation,
and kvm_arch_vcpu_create() *just* called kvm_pmu_init(), i.e. there can't
possibly be any work to do.

Unlike Intel, AMD's amd_pmu_refresh() does fill all_valid_pmc_idx even if
guest CPUID is empty, but everything that is at all dynamic is guaranteed
to be '0'/NULL, e.g. it should be impossible for KVM to have already
created a perf event.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:54 -08:00