Commit Graph

1177 Commits

Author SHA1 Message Date
Paolo Bonzini
8ecb10bcbf Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 support for virtualizing Linear Address Masking (LAM)

Add KVM support for Linear Address Masking (LAM).  LAM tweaks the canonicality
checks for most virtual address usage in 64-bit mode, such that only the most
significant bit of the untranslated address bits must match the polarity of the
last translated address bit.  This allows software to use ignored, untranslated
address bits for metadata, e.g. to efficiently tag pointers for address
sanitization.

LAM can be enabled separately for user pointers and supervisor pointers, and
for userspace LAM can be select between 48-bit and 57-bit masking

 - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
 - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.

For user pointers, LAM enabling utilizes two previously-reserved high bits from
CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
61 respectively.  Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:

 - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
 - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
 - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers

For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:

 - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
 - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
 - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers

The modified LAM canonicality checks:
 - LAM_S48                : [ 1 ][ metadata ][ 1 ]
                              63               47
 - LAM_U48                : [ 0 ][ metadata ][ 0 ]
                              63               47
 - LAM_S57                : [ 1 ][ metadata ][ 1 ]
                              63               56
 - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
                              63               56
 - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
                              63               56..47

The bulk of KVM support for LAM is to emulate LAM's modified canonicality
checks.  The approach taken by KVM is to "fill" the metadata bits using the
highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
to bits 62:48.  The most significant bit, 63, is *not* modified, i.e. its value
from the raw, untagged virtual address is kept for the canonicality check. This
untagging allows

Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
enabling via CR3 and CR4 bits, etc.
2024-01-08 08:10:12 -05:00
Paolo Bonzini
01edb1cfbd Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.8:

 - Fix a variety of bugs where KVM fail to stop/reset counters and other state
   prior to refreshing the vPMU model.

 - Fix a double-overflow PMU bug by tracking emulated counter events using a
   dedicated field instead of snapshotting the "previous" counter.  If the
   hardware PMC count triggers overflow that is recognized in the same VM-Exit
   that KVM manually bumps an event count, KVM would pend PMIs for both the
   hardware-triggered overflow and for KVM-triggered overflow.
2024-01-08 08:10:08 -05:00
Paolo Bonzini
33d0403fda Merge tag 'kvm-x86-misc-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.8:

 - Turn off KVM_WERROR by default for all configs so that it's not
   inadvertantly enabled by non-KVM developers, which can be problematic for
   subsystems that require no regressions for W=1 builds.

 - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
   "features".

 - Don't force a masterclock update when a vCPU synchronizes to the current TSC
   generation, as updating the masterclock can cause kvmclock's time to "jump"
   unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.

 - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
   partly as a super minor optimization, but mostly to make KVM play nice with
   position independent executable builds.
2024-01-08 08:10:04 -05:00
Paolo Bonzini
0afdfd85e3 Merge tag 'kvm-x86-hyperv-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM x86 Hyper-V changes for 6.8:

 - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
   CONFIG_HYPERV as a minor optimization, and to self-document the code.

 - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
   at build time.
2024-01-08 08:10:01 -05:00
Bjorn Helgaas
54aa699e80 arch/x86: Fix typos
Fix typos, most reported by "codespell arch/x86".  Only touches comments,
no code changes.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2024-01-03 11:46:22 +01:00
Michael Roth
a26b7cd225 KVM: SEV: Do not intercept accesses to MSR_IA32_XSS for SEV-ES guests
When intercepts are enabled for MSR_IA32_XSS, the host will swap in/out
the guest-defined values while context-switching to/from guest mode.
However, in the case of SEV-ES, vcpu->arch.guest_state_protected is set,
so the guest-defined value is effectively ignored when switching to
guest mode with the understanding that the VMSA will handle swapping
in/out this register state.

However, SVM is still configured to intercept these accesses for SEV-ES
guests, so the values in the initial MSR_IA32_XSS are effectively
read-only, and a guest will experience undefined behavior if it actually
tries to write to this MSR. Fortunately, only CET/shadowstack makes use
of this register on SEV-ES-capable systems currently, which isn't yet
widely used, but this may become more of an issue in the future.

Additionally, enabling intercepts of MSR_IA32_XSS results in #VC
exceptions in the guest in certain paths that can lead to unexpected #VC
nesting levels. One example is SEV-SNP guests when handling #VC
exceptions for CPUID instructions involving leaf 0xD, subleaf 0x1, since
they will access MSR_IA32_XSS as part of servicing the CPUID #VC, then
generate another #VC when accessing MSR_IA32_XSS, which can lead to
guest crashes if an NMI occurs at that point in time. Running perf on a
guest while it is issuing such a sequence is one example where these can
be problematic.

Address this by disabling intercepts of MSR_IA32_XSS for SEV-ES guests
if the host/guest configuration allows it. If the host/guest
configuration doesn't allow for MSR_IA32_XSS, leave it intercepted so
that it can be caught by the existing checks in
kvm_{set,get}_msr_common() if the guest still attempts to access it.

Fixes: 376c6d2850 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Cc: Alexey Kardashevskiy <aik@amd.com>
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20231016132819.1002933-4-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-13 12:46:07 -05:00
Sean Christopherson
4cdf351d36 KVM: SVM: Update EFER software model on CR0 trap for SEV-ES
In general, activating long mode involves setting the EFER_LME bit in
the EFER register and then enabling the X86_CR0_PG bit in the CR0
register. At this point, the EFER_LMA bit will be set automatically by
hardware.

In the case of SVM/SEV guests where writes to CR0 are intercepted, it's
necessary for the host to set EFER_LMA on behalf of the guest since
hardware does not see the actual CR0 write.

In the case of SEV-ES guests where writes to CR0 are trapped instead of
intercepted, the hardware *does* see/record the write to CR0 before
exiting and passing the value on to the host, so as part of enabling
SEV-ES support commit f1c6366e30 ("KVM: SVM: Add required changes to
support intercepts under SEV-ES") dropped special handling of the
EFER_LMA bit with the understanding that it would be set automatically.

However, since the guest never explicitly sets the EFER_LMA bit, the
host never becomes aware that it has been set. This becomes problematic
when userspace tries to get/set the EFER values via
KVM_GET_SREGS/KVM_SET_SREGS, since the EFER contents tracked by the host
will be missing the EFER_LMA bit, and when userspace attempts to pass
the EFER value back via KVM_SET_SREGS it will fail a sanity check that
asserts that EFER_LMA should always be set when X86_CR0_PG and EFER_LME
are set.

Fix this by always inferring the value of EFER_LMA based on X86_CR0_PG
and EFER_LME, regardless of whether or not SEV-ES is enabled.

Fixes: f1c6366e30 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Reported-by: Peter Gonda <pgonda@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210507165947.2502412-2-seanjc@google.com>
[A two year old patch that was revived after we noticed the failure in
 KVM_SET_SREGS and a similar patch was posted by Michael Roth.  This is
 Sean's patch, but with Michael's more complete commit message. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-08 13:37:05 -05:00
Vitaly Kuznetsov
017a99a966 KVM: nSVM: Hide more stuff under CONFIG_KVM_HYPERV/CONFIG_HYPERV
'struct hv_vmcb_enlightenments' in VMCB only make sense when either
CONFIG_KVM_HYPERV or CONFIG_HYPERV is enabled.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-17-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:35:26 -08:00
Vitaly Kuznetsov
b4f69df0f6 KVM: x86: Make Hyper-V emulation optional
Hyper-V emulation in KVM is a fairly big chunk and in some cases it may be
desirable to not compile it in to reduce module sizes as well as the attack
surface. Introduce CONFIG_KVM_HYPERV option to make it possible.

Note, there's room for further nVMX/nSVM code optimizations when
!CONFIG_KVM_HYPERV, this will be done in follow-up patches.

Reorganize Makefile a bit so all CONFIG_HYPERV and CONFIG_KVM_HYPERV files
are grouped together.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:57 -08:00
Vitaly Kuznetsov
af9d544a45 KVM: x86: Introduce helper to handle Hyper-V paravirt TLB flush requests
As a preparation to making Hyper-V emulation optional, introduce a helper
to handle pending KVM_REQ_HV_TLB_FLUSH requests.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-8-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:23 -08:00
Vitaly Kuznetsov
cfef5af3cb KVM: x86: Move Hyper-V partition assist page out of Hyper-V emulation context
Hyper-V partition assist page is used when KVM runs on top of Hyper-V and
is not used for Windows/Hyper-V guests on KVM, this means that 'hv_pa_pg'
placement in 'struct kvm_hv' is unfortunate. As a preparation to making
Hyper-V emulation optional, move 'hv_pa_pg' to 'struct kvm_arch' and put it
under CONFIG_HYPERV.

While on it, introduce hv_get_partition_assist_page() helper to allocate
partition assist page. Move the comment explaining why we use a single page
for all vCPUs from VMX and expand it a bit.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:01 -08:00
Sean Christopherson
89acf1237b KVM: x86/pmu: Update sample period in pmc_write_counter()
Update a PMC's sample period in pmc_write_counter() to deduplicate code
across all callers of pmc_write_counter().  Opportunistically move
pmc_write_counter() into pmc.c now that it's doing more work.  WRMSR isn't
such a hot path that an extra CALL+RET pair will be problematic, and the
order of function definitions needs to be changed anyways, i.e. now is a
convenient time to eat the churn.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:55 -08:00
Sean Christopherson
cbb359d81a KVM: x86/pmu: Move PMU reset logic to common x86 code
Move the common (or at least "ignored") aspects of resetting the vPMU to
common x86 code, along with the stop/release helpers that are no used only
by the common pmu.c.

There is no need to manually handle fixed counters as all_valid_pmc_idx
tracks both fixed and general purpose counters, and resetting the vPMU is
far from a hot path, i.e. the extra bit of overhead to the PMC from the
index is a non-issue.

Zero fixed_ctr_ctrl in common code even though it's Intel specific.
Ensuring it's zero doesn't harm AMD/SVM in any way, and stopping the fixed
counters via all_valid_pmc_idx, but not clearing the associated control
bits, would be odd/confusing.

Make the .reset() hook optional as SVM no longer needs vendor specific
handling.

Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:54 -08:00
Uros Bizjak
15223c4f97 KVM: SVM,VMX: Use %rip-relative addressing to access kvm_rebooting
Instruction with %rip-relative address operand is one byte shorter than
its absolute address counterpart and is also compatible with position
independent executable (-fpie) build.

No functional changes intended.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231031075312.47525-1-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:51:54 -08:00
Sean Christopherson
72046d0a07 KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled
When vNMI is enabled, rely entirely on hardware to correctly handle NMI
blocking, i.e. don't intercept IRET to detect when NMIs are no longer
blocked.  KVM already correctly ignores svm->nmi_masked when vNMI is
enabled, so the effect of the bug is essentially an unnecessary VM-Exit.

KVM intercepts IRET for two reasons:
 - To track NMI masking to be able to know at any point of time if NMI
   is masked.
 - To track NMI windows (to inject another NMI after the guest executes
   IRET, i.e. unblocks NMIs)

When vNMI is enabled, both cases are handled by hardware:
- NMI masking state resides in int_ctl.V_NMI_BLOCKING and can be read by
  KVM at will.
- Hardware automatically "injects" pending virtual NMIs when virtual NMIs
  become unblocked.

However, even though pending a virtual NMI for hardware to handle is the
most common way to synthesize a guest NMI, KVM may still directly inject
an NMI via when KVM is handling two "simultaneous" NMIs (see comments in
process_nmi() for details on KVM's simultaneous NMI handling).  Per AMD's
APM, hardware sets the BLOCKING flag when software directly injects an NMI
as well, i.e. KVM doesn't need to manually mark vNMIs as blocked:

  If Event Injection is used to inject an NMI when NMI Virtualization is
  enabled, VMRUN sets V_NMI_MASK in the guest state.

Note, it's still possible that KVM could trigger a spurious IRET VM-Exit.
When running a nested guest, KVM disables vNMI for L2 and thus will enable
IRET interception (in both vmcb01 and vmcb02) while running L2 reason.  If
a nested VM-Exit happens before L2 executes IRET, KVM can end up running
L1 with vNMI enable and IRET intercepted.  This is also a benign bug, and
even less likely to happen, i.e. can be safely punted to a future fix.

Fixes: fa4c027a79 ("KVM: x86: Add support for SVM's Virtual NMI")
Link: https://lore.kernel.org/all/ZOdnuDZUd4mevCqe@google.como
Cc: Santosh Shukla <santosh.shukla@amd.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <santosh.shukla@amd.com>
Link: https://lore.kernel.org/r/20231018192021.1893261-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:51:22 -08:00
Sean Christopherson
770d6aa2e4 KVM: SVM: Explicitly require FLUSHBYASID to enable SEV support
Add a sanity check that FLUSHBYASID is available if SEV is supported in
hardware, as SEV (and beyond) guests are bound to a single ASID, i.e. KVM
can't "flush" by assigning a new, fresh ASID to the guest.  If FLUSHBYASID
isn't supported for some bizarre reason, KVM would completely fail to do
TLB flushes for SEV+ guests (see pre_svm_run() and pre_sev_run()).

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20231018193617.1895752-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:51:14 -08:00
Sean Christopherson
176bfc5b17 KVM: nSVM: Advertise support for flush-by-ASID
Advertise support for FLUSHBYASID when nested SVM is enabled, as KVM can
always emulate flushing TLB entries for a vmcb12 ASID, e.g. by running L2
with a new, fresh ASID in vmcb02.  Some modern hypervisors, e.g. VMWare
Workstation 17, require FLUSHBYASID support and will refuse to run if it's
not present.

Punt on proper support, as "Honor L1's request to flush an ASID on nested
VMRUN" is one of the TODO items in the (incomplete) list of issues that
need to be addressed in order for KVM to NOT do a full TLB flush on every
nested SVM transition (see nested_svm_transition_tlb_flush()).

Reported-by: Stefan Sterz <s.sterz@proxmox.com>
Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231018194104.1896415-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:50:25 -08:00
Sean Christopherson
a484755ab2 Revert "nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB"
Revert KVM's made-up consistency check on SVM's TLB control.  The APM says
that unsupported encodings are reserved, but the APM doesn't state that
VMRUN checks for a supported encoding.  Unless something is called out
in "Canonicalization and Consistency Checks" or listed as MBZ (Must Be
Zero), AMD behavior is typically to let software shoot itself in the foot.

This reverts commit 174a921b69.

Fixes: 174a921b69 ("nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB")
Reported-by: Stefan Sterz <s.sterz@proxmox.com>
Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231018194104.1896415-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:47:43 -08:00
Binbin Wu
2c49db455e KVM: x86: Add & use kvm_vcpu_is_legal_cr3() to check CR3's legality
Add and use kvm_vcpu_is_legal_cr3() to check CR3's legality to provide
a clear distinction between CR3 and GPA checks.  This will allow exempting
bits from kvm_vcpu_is_legal_cr3() without affecting general GPA checks,
e.g. for upcoming features that will use high bits in CR3 for feature
enabling.

No functional change intended.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-7-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28 17:54:04 -08:00
Linus Torvalds
6803bd7956 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Generalized infrastructure for 'writable' ID registers, effectively
     allowing userspace to opt-out of certain vCPU features for its
     guest

   - Optimization for vSGI injection, opportunistically compressing
     MPIDR to vCPU mapping into a table

   - Improvements to KVM's PMU emulation, allowing userspace to select
     the number of PMCs available to a VM

   - Guest support for memory operation instructions (FEAT_MOPS)

   - Cleanups to handling feature flags in KVM_ARM_VCPU_INIT, squashing
     bugs and getting rid of useless code

   - Changes to the way the SMCCC filter is constructed, avoiding wasted
     memory allocations when not in use

   - Load the stage-2 MMU context at vcpu_load() for VHE systems,
     reducing the overhead of errata mitigations

   - Miscellaneous kernel and selftest fixes

  LoongArch:

   - New architecture for kvm.

     The hardware uses the same model as x86, s390 and RISC-V, where
     guest/host mode is orthogonal to supervisor/user mode. The
     virtualization extensions are very similar to MIPS, therefore the
     code also has some similarities but it's been cleaned up to avoid
     some of the historical bogosities that are found in arch/mips. The
     kernel emulates MMU, timer and CSR accesses, while interrupt
     controllers are only emulated in userspace, at least for now.

  RISC-V:

   - Support for the Smstateen and Zicond extensions

   - Support for virtualizing senvcfg

   - Support for virtualized SBI debug console (DBCN)

  S390:

   - Nested page table management can be monitored through tracepoints
     and statistics

  x86:

   - Fix incorrect handling of VMX posted interrupt descriptor in
     KVM_SET_LAPIC, which could result in a dropped timer IRQ

   - Avoid WARN on systems with Intel IPI virtualization

   - Add CONFIG_KVM_MAX_NR_VCPUS, to allow supporting up to 4096 vCPUs
     without forcing more common use cases to eat the extra memory
     overhead.

   - Add virtualization support for AMD SRSO mitigation (IBPB_BRTYPE and
     SBPB, aka Selective Branch Predictor Barrier).

   - Fix a bug where restoring a vCPU snapshot that was taken within 1
     second of creating the original vCPU would cause KVM to try to
     synchronize the vCPU's TSC and thus clobber the correct TSC being
     set by userspace.

   - Compute guest wall clock using a single TSC read to avoid
     generating an inaccurate time, e.g. if the vCPU is preempted
     between multiple TSC reads.

   - "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which
     complain about a "Firmware Bug" if the bit isn't set for select
     F/M/S combos. Likewise "virtualize" (ignore) MSR_AMD64_TW_CFG to
     appease Windows Server 2022.

   - Don't apply side effects to Hyper-V's synthetic timer on writes
     from userspace to fix an issue where the auto-enable behavior can
     trigger spurious interrupts, i.e. do auto-enabling only for guest
     writes.

   - Remove an unnecessary kick of all vCPUs when synchronizing the
     dirty log without PML enabled.

   - Advertise "support" for non-serializing FS/GS base MSR writes as
     appropriate.

   - Harden the fast page fault path to guard against encountering an
     invalid root when walking SPTEs.

   - Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.

   - Use the fast path directly from the timer callback when delivering
     Xen timer events, instead of waiting for the next iteration of the
     run loop. This was not done so far because previously proposed code
     had races, but now care is taken to stop the hrtimer at critical
     points such as restarting the timer or saving the timer information
     for userspace.

   - Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future
     flag.

   - Optimize injection of PMU interrupts that are simultaneous with
     NMIs.

   - Usual handful of fixes for typos and other warts.

  x86 - MTRR/PAT fixes and optimizations:

   - Clean up code that deals with honoring guest MTRRs when the VM has
     non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.

   - Zap EPT entries when non-coherent DMA assignment stops/start to
     prevent using stale entries with the wrong memtype.

   - Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y

     This was done as a workaround for virtual machine BIOSes that did
     not bother to clear CR0.CD (because ancient KVM/QEMU did not bother
     to set it, in turn), and there's zero reason to extend the quirk to
     also ignore guest PAT.

  x86 - SEV fixes:

   - Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts
     SHUTDOWN while running an SEV-ES guest.

   - Clean up the recognition of emulation failures on SEV guests, when
     KVM would like to "skip" the instruction but it had already been
     partially emulated. This makes it possible to drop a hack that
     second guessed the (insufficient) information provided by the
     emulator, and just do the right thing.

  Documentation:

   - Various updates and fixes, mostly for x86

   - MTRR and PAT fixes and optimizations"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (164 commits)
  KVM: selftests: Avoid using forced target for generating arm64 headers
  tools headers arm64: Fix references to top srcdir in Makefile
  KVM: arm64: Add tracepoint for MMIO accesses where ISV==0
  KVM: arm64: selftest: Perform ISB before reading PAR_EL1
  KVM: arm64: selftest: Add the missing .guest_prepare()
  KVM: arm64: Always invalidate TLB for stage-2 permission faults
  KVM: x86: Service NMI requests after PMI requests in VM-Enter path
  KVM: arm64: Handle AArch32 SPSR_{irq,abt,und,fiq} as RAZ/WI
  KVM: arm64: Do not let a L1 hypervisor access the *32_EL2 sysregs
  KVM: arm64: Refine _EL2 system register list that require trap reinjection
  arm64: Add missing _EL2 encodings
  arm64: Add missing _EL12 encodings
  KVM: selftests: aarch64: vPMU test for validating user accesses
  KVM: selftests: aarch64: vPMU register test for unimplemented counters
  KVM: selftests: aarch64: vPMU register test for implemented counters
  KVM: selftests: aarch64: Introduce vpmu_counter_access test
  tools: Import arm_pmuv3.h
  KVM: arm64: PMU: Allow userspace to limit PMCR_EL0.N for the guest
  KVM: arm64: Sanitize PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR} before first run
  KVM: arm64: Add {get,set}_user for PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
  ...
2023-11-02 15:45:15 -10:00
Paolo Bonzini
be47941980 Merge tag 'kvm-x86-svm-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.7:

 - Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
   running an SEV-ES guest.

 - Clean up handling "failures" when KVM detects it can't emulate the "skip"
   action for an instruction that has already been partially emulated.  Drop a
   hack in the SVM code that was fudging around the emulator code not giving
   SVM enough information to do the right thing.
2023-10-31 10:22:43 -04:00
Paolo Bonzini
f292dc8aad Merge tag 'kvm-x86-misc-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.7:

 - Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
   forcing more common use cases to eat the extra memory overhead.

 - Add IBPB and SBPB virtualization support.

 - Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
   creating the original vCPU would cause KVM to try to synchronize the vCPU's
   TSC and thus clobber the correct TSC being set by userspace.

 - Compute guest wall clock using a single TSC read to avoid generating an
   inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.

 - "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
    about a "Firmware Bug" if the bit isn't set for select F/M/S combos.

 - Don't apply side effects to Hyper-V's synthetic timer on writes from
   userspace to fix an issue where the auto-enable behavior can trigger
   spurious interrupts, i.e. do auto-enabling only for guest writes.

 - Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
   without PML enabled.

 - Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.

 - Use octal notation for file permissions through KVM x86.

 - Fix a handful of typo fixes and warts.
2023-10-31 10:15:15 -04:00
Linus Torvalds
ca2e9c3bee Merge tag 'x86_cpu_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:

 - Make sure the "svm" feature flag is cleared from /proc/cpuinfo when
   virtualization support is disabled in the BIOS on AMD and Hygon
   platforms

 - A minor cleanup

* tag 'x86_cpu_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu/amd: Remove redundant 'break' statement
  x86/cpu: Clear SVM feature if disabled by BIOS
2023-10-30 12:10:24 -10:00
Peng Hao
26951ec862 KVM: x86: Use octal for file permission
Convert all module params to octal permissions to improve code readability
and to make checkpatch happy:

  WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using
           octal permissions '0444'.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Link: https://lore.kernel.org/r/20231013113020.77523-1-flyingpeng@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-17 10:29:10 -07:00
Paolo Bonzini
88e4cd893f Merge tag 'kvm-x86-pmu-6.6-fixes' of https://github.com/kvm-x86/linux into HEAD
KVM x86/pmu fixes for 6.6:

 - Truncate writes to PMU counters to the counter's width to avoid spurious
   overflows when emulating counter events in software.

 - Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined
   architectural behavior).

 - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to
   kick the guest out of emulated halt.
2023-10-15 08:24:18 -04:00
Tom Lendacky
3e93467346 KVM: SVM: Fix build error when using -Werror=unused-but-set-variable
Commit 916e3e5f26 ("KVM: SVM: Do not use user return MSR support for
virtualized TSC_AUX") introduced a local variable used for the rdmsr()
function for the high 32-bits of the MSR value. This variable is not used
after being set and triggers a warning or error, when treating warnings
as errors, when the unused-but-set-variable flag is set. Mark this
variable as __maybe_unused to fix this.

Fixes: 916e3e5f26 ("KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <0da9874b6e9fcbaaa5edeb345d7e2a7c859fc818.1696271334.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:09:36 -04:00
Maxim Levitsky
3fdc6087df x86: KVM: SVM: refresh AVIC inhibition in svm_leave_nested()
svm_leave_nested() similar to a nested VM exit, get the vCPU out of nested
mode and thus should end the local inhibition of AVIC on this vCPU.

Failure to do so, can lead to hangs on guest reboot.

Raise the KVM_REQ_APICV_UPDATE request to refresh the AVIC state of the
current vCPU in this case.

Fixes: f44509f849 ("KVM: x86: SVM: allow AVIC to co-exist with a nested guest running")
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928173354.217464-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:09:00 -04:00
Maxim Levitsky
2dcf37abf9 x86: KVM: SVM: add support for Invalid IPI Vector interception
In later revisions of AMD's APM, there is a new 'incomplete IPI' exit code:

"Invalid IPI Vector - The vector for the specified IPI was set to an
illegal value (VEC < 16)"

Note that tests on Zen2 machine show that this VM exit doesn't happen and
instead AVIC just does nothing.

Add support for this exit code by doing nothing, instead of filling
the kernel log with errors.

Also replace an unthrottled 'pr_err()' if another unknown incomplete
IPI exit happens with vcpu_unimpl()

(e.g in case AMD adds yet another 'Invalid IPI' exit reason)

Cc: <stable@vger.kernel.org>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928173354.217464-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:08:59 -04:00
Maxim Levitsky
b65235f6e1 x86: KVM: SVM: always update the x2avic msr interception
The following problem exists since x2avic was enabled in the KVM:

svm_set_x2apic_msr_interception is called to enable the interception of
the x2apic msrs.

In particular it is called at the moment the guest resets its apic.

Assuming that the guest's apic was in x2apic mode, the reset will bring
it back to the xapic mode.

The svm_set_x2apic_msr_interception however has an erroneous check for
'!apic_x2apic_mode()' which prevents it from doing anything in this case.

As a result of this, all x2apic msrs are left unintercepted, and that
exposes the bare metal x2apic (if enabled) to the guest.
Oops.

Remove the erroneous '!apic_x2apic_mode()' check to fix that.

This fixes CVE-2023-5090

Fixes: 4d1d7942e3 ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode")
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928173354.217464-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:08:59 -04:00
Sean Christopherson
0068299540 KVM: SVM: Treat all "skip" emulation for SEV guests as outright failures
Treat EMULTYPE_SKIP failures on SEV guests as unhandleable emulation
instead of simply resuming the guest, and drop the hack-a-fix which
effects that behavior for the INT3/INTO injection path.  If KVM can't
skip an instruction for which KVM has already done partial emulation,
resuming the guest is undesirable as doing so may corrupt guest state.

Link: https://lore.kernel.org/r/20230825013621.2845700-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 15:08:53 -07:00
Sean Christopherson
aeb904f6b9 KVM: x86: Refactor can_emulate_instruction() return to be more expressive
Refactor and rename can_emulate_instruction() to allow vendor code to
return more than true/false, e.g. to explicitly differentiate between
"retry", "fault", and "unhandleable".  For now, just do the plumbing, a
future patch will expand SVM's implementation to signal outright failure
if KVM attempts EMULTYPE_SKIP on an SEV guest.

No functional change intended (or rather, none that are visible to the
guest or userspace).

Link: https://lore.kernel.org/r/20230825013621.2845700-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 15:08:53 -07:00
Peter Gonda
bc3d7c5570 KVM: SVM: Update SEV-ES shutdown intercepts with more metadata
Currently if an SEV-ES VM shuts down userspace sees KVM_RUN struct with
only errno=EINVAL. This is a very limited amount of information to debug
the situation. Instead return KVM_EXIT_SHUTDOWN to alert userspace the VM
is shutting down and is not usable any further.

Signed-off-by: Peter Gonda <pgonda@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230907162449.1739785-1-pgonda@google.com
[sean: tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-28 10:15:00 -07:00
Roman Kagan
b29a2acd36 KVM: x86/pmu: Truncate counter value to allowed width on write
Performance counters are defined to have width less than 64 bits.  The
vPMU code maintains the counters in u64 variables but assumes the value
to fit within the defined width.  However, for Intel non-full-width
counters (MSR_IA32_PERFCTRx) the value receieved from the guest is
truncated to 32 bits and then sign-extended to full 64 bits.  If a
negative value is set, it's sign-extended to 64 bits, but then in
kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
previous value for overflow detection.

That previous value is not truncated, so it always evaluates bigger than
the truncated new one, and a PMI is injected.  If the PMI handler writes
a negative counter value itself, the vCPU never quits the PMI loop.

Turns out that Linux PMI handler actually does write the counter with
the value just read with RDPMC, so when no full-width support is exposed
via MSR_IA32_PERF_CAPABILITIES, and the guest initializes the counter to
a negative value, it locks up.

This has been observed in the field, for example, when the guest configures
atop to use perfevents and runs two instances of it simultaneously.

To address the problem, maintain the invariant that the counter value
always fits in the defined bit width, by truncating the received value
in the respective set_msr methods.  For better readability, factor the
out into a helper function, pmc_write_counter(), shared by vmx and svm
parts.

Fixes: 9cd803d496 ("KVM: x86: Update vPMCs when retiring instructions")
Cc: stable@vger.kernel.org
Signed-off-by: Roman Kagan <rkagan@amazon.de>
Link: https://lore.kernel.org/all/20230504120042.785651-1-rkagan@amazon.de
Tested-by: Like Xu <likexu@tencent.com>
[sean: tweak changelog, s/set/write in the helper]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-25 14:30:44 -07:00
Tom Lendacky
916e3e5f26 KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX
When the TSC_AUX MSR is virtualized, the TSC_AUX value is swap type "B"
within the VMSA. This means that the guest value is loaded on VMRUN and
the host value is restored from the host save area on #VMEXIT.

Since the value is restored on #VMEXIT, the KVM user return MSR support
for TSC_AUX can be replaced by populating the host save area with the
current host value of TSC_AUX. And, since TSC_AUX is not changed by Linux
post-boot, the host save area can be set once in svm_hardware_enable().
This eliminates the two WRMSR instructions associated with the user return
MSR support.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <d381de38eb0ab6c9c93dda8503b72b72546053d7.1694811272.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23 05:35:49 -04:00
Tom Lendacky
e0096d01c4 KVM: SVM: Fix TSC_AUX virtualization setup
The checks for virtualizing TSC_AUX occur during the vCPU reset processing
path. However, at the time of initial vCPU reset processing, when the vCPU
is first created, not all of the guest CPUID information has been set. In
this case the RDTSCP and RDPID feature support for the guest is not in
place and so TSC_AUX virtualization is not established.

This continues for each vCPU created for the guest. On the first boot of
an AP, vCPU reset processing is executed as a result of an APIC INIT
event, this time with all of the guest CPUID information set, resulting
in TSC_AUX virtualization being enabled, but only for the APs. The BSP
always sees a TSC_AUX value of 0 which probably went unnoticed because,
at least for Linux, the BSP TSC_AUX value is 0.

Move the TSC_AUX virtualization enablement out of the init_vmcb() path and
into the vcpu_after_set_cpuid() path to allow for proper initialization of
the support after the guest CPUID information has been set.

With the TSC_AUX virtualization support now in the vcpu_set_after_cpuid()
path, the intercepts must be either cleared or set based on the guest
CPUID input.

Fixes: 296d5a17e7 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <4137fbcb9008951ab5f0befa74a0399d2cce809a.1694811272.git.thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23 05:35:49 -04:00
Paolo Bonzini
e8d93d5d93 KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway
svm_recalc_instruction_intercepts() is always called at least once
before the vCPU is started, so the setting or clearing of the RDTSCP
intercept can be dropped from the TSC_AUX virtualization support.

Extracted from a patch by Tom Lendacky.

Cc: stable@vger.kernel.org
Fixes: 296d5a17e7 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23 05:35:49 -04:00
Paolo Bonzini
7deda2ce5b x86/cpu: Clear SVM feature if disabled by BIOS
When SVM is disabled by BIOS, one cannot use KVM but the
SVM feature is still shown in the output of /proc/cpuinfo.
On Intel machines, VMX is cleared by init_ia32_feat_ctl(),
so do the same on AMD and Hygon processors.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230921114940.957141-1-pbonzini@redhat.com
2023-09-22 10:55:26 +02:00
Paolo Bonzini
6d5e3c318a Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEAD
KVM x86 changes for 6.6:

 - Misc cleanups

 - Retry APIC optimized recalculation if a vCPU is added/enabled

 - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
   "emergency disabling" behavior to KVM actually being loaded, and move all of
   the logic within KVM

 - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
   ratio MSR can diverge from the default iff TSC scaling is enabled, and clean
   up related code

 - Add a framework to allow "caching" feature flags so that KVM can check if
   the guest can use a feature without needing to search guest CPUID
2023-08-31 13:36:33 -04:00
Paolo Bonzini
bd7fe98b35 Merge tag 'kvm-x86-svm-6.6' of https://github.com/kvm-x86/linux into HEAD
KVM: x86: SVM changes for 6.6:

 - Add support for SEV-ES DebugSwap, i.e. allow SEV-ES guests to use debug
   registers and generate/handle #DBs

 - Clean up LBR virtualization code

 - Fix a bug where KVM fails to set the target pCPU during an IRTE update

 - Fix fatal bugs in SEV-ES intrahost migration

 - Fix a bug where the recent (architecturally correct) change to reinject
   #BP and skip INT3 broke SEV guests (can't decode INT3 to skip it)
2023-08-31 13:32:40 -04:00
Sean Christopherson
80d0f521d5 KVM: SVM: Require nrips support for SEV guests (and beyond)
Disallow SEV (and beyond) if nrips is disabled via module param, as KVM
can't read guest memory to partially emulate and skip an instruction.  All
CPUs that support SEV support NRIPS, i.e. this is purely stopping the user
from shooting themselves in the foot.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230825013621.2845700-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25 09:00:40 -07:00
Sean Christopherson
cb49631ad1 KVM: SVM: Don't inject #UD if KVM attempts to skip SEV guest insn
Don't inject a #UD if KVM attempts to "emulate" to skip an instruction
for an SEV guest, and instead resume the guest and hope that it can make
forward progress.  When commit 04c40f344d ("KVM: SVM: Inject #UD on
attempted emulation for SEV guest w/o insn buffer") added the completely
arbitrary #UD behavior, there were no known scenarios where a well-behaved
guest would induce a VM-Exit that triggered emulation, i.e. it was thought
that injecting #UD would be helpful.

However, now that KVM (correctly) attempts to re-inject INT3/INTO, e.g. if
a #NPF is encountered when attempting to deliver the INT3/INTO, an SEV
guest can trigger emulation without a buffer, through no fault of its own.
Resuming the guest and retrying the INT3/INTO is architecturally wrong,
e.g. the vCPU will incorrectly re-hit code #DBs, but for SEV guests there
is literally no other option that has a chance of making forward progress.

Drop the #UD injection for all "skip" emulation, not just those related to
INT3/INTO, even though that means that the guest will likely end up in an
infinite loop instead of getting a #UD (the vCPU may also crash, e.g. if
KVM emulated everything about an instruction except for advancing RIP).
There's no evidence that suggests that an unexpected #UD is actually
better than hanging the vCPU, e.g. a soft-hung vCPU can still respond to
IRQs and NMIs to generate a backtrace.

Reported-by: Wu Zongyo <wuzongyo@mail.ustc.edu.cn>
Closes: https://lore.kernel.org/all/8eb933fd-2cf3-d7a9-32fe-2a1d82eac42a@mail.ustc.edu.cn
Fixes: 6ef88d6e36 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Cc: stable@vger.kernel.org
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230825013621.2845700-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25 09:00:40 -07:00
Sean Christopherson
1952e74da9 KVM: SVM: Skip VMSA init in sev_es_init_vmcb() if pointer is NULL
Skip initializing the VMSA physical address in the VMCB if the VMSA is
NULL, which occurs during intrahost migration as KVM initializes the VMCB
before copying over state from the source to the destination (including
the VMSA and its physical address).

In normal builds, __pa() is just math, so the bug isn't fatal, but with
CONFIG_DEBUG_VIRTUAL=y, the validity of the virtual address is verified
and passing in NULL will make the kernel unhappy.

Fixes: 6defa24d3b ("KVM: SEV: Init target VMCBs in sev_migrate_from")
Cc: stable@vger.kernel.org
Cc: Peter Gonda <pgonda@google.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20230825022357.2852133-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25 09:00:16 -07:00
Sean Christopherson
f1187ef24e KVM: SVM: Get source vCPUs from source VM for SEV-ES intrahost migration
Fix a goof where KVM tries to grab source vCPUs from the destination VM
when doing intrahost migration.  Grabbing the wrong vCPU not only hoses
the guest, it also crashes the host due to the VMSA pointer being left
NULL.

  BUG: unable to handle page fault for address: ffffe38687000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP NOPTI
  CPU: 39 PID: 17143 Comm: sev_migrate_tes Tainted: GO       6.5.0-smp--fff2e47e6c3b-next #151
  Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 34.28.0 07/10/2023
  RIP: 0010:__free_pages+0x15/0xd0
  RSP: 0018:ffff923fcf6e3c78 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffffe38687000000 RCX: 0000000000000100
  RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffffe38687000000
  RBP: ffff923fcf6e3c88 R08: ffff923fcafb0000 R09: 0000000000000000
  R10: 0000000000000000 R11: ffffffff83619b90 R12: ffff923fa9540000
  R13: 0000000000080007 R14: ffff923f6d35d000 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff929d0d7c0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffe38687000000 CR3: 0000005224c34005 CR4: 0000000000770ee0
  PKRU: 55555554
  Call Trace:
   <TASK>
   sev_free_vcpu+0xcb/0x110 [kvm_amd]
   svm_vcpu_free+0x75/0xf0 [kvm_amd]
   kvm_arch_vcpu_destroy+0x36/0x140 [kvm]
   kvm_destroy_vcpus+0x67/0x100 [kvm]
   kvm_arch_destroy_vm+0x161/0x1d0 [kvm]
   kvm_put_kvm+0x276/0x560 [kvm]
   kvm_vm_release+0x25/0x30 [kvm]
   __fput+0x106/0x280
   ____fput+0x12/0x20
   task_work_run+0x86/0xb0
   do_exit+0x2e3/0x9c0
   do_group_exit+0xb1/0xc0
   __x64_sys_exit_group+0x1b/0x20
   do_syscall_64+0x41/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd
   </TASK>
  CR2: ffffe38687000000

Fixes: 6defa24d3b ("KVM: SEV: Init target VMCBs in sev_migrate_from")
Cc: stable@vger.kernel.org
Cc: Peter Gonda <pgonda@google.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20230825022357.2852133-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25 09:00:16 -07:00
Sean Christopherson
ee785c870d KVM: nSVM: Use KVM-governed feature framework to track "vNMI enabled"
Track "virtual NMI exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.

Note, checking KVM's capabilities instead of the "vnmi" param means that
the code isn't strictly equivalent, as vnmi_enabled could have been set
if nested=false where as that the governed feature cannot.  But that's a
glorified nop as the feature/flag is consumed only by paths that are
gated by nSVM being enabled.

Link: https://lore.kernel.org/r/20230815203653.519297-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:31 -07:00
Sean Christopherson
b89456aee7 KVM: nSVM: Use KVM-governed feature framework to track "vGIF enabled"
Track "virtual GIF exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.

Note, checking KVM's capabilities instead of the "vgif" param means that
the code isn't strictly equivalent, as vgif_enabled could have been set
if nested=false where as that the governed feature cannot.  But that's a
glorified nop as the feature/flag is consumed only by paths that are

Link: https://lore.kernel.org/r/20230815203653.519297-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:31 -07:00
Sean Christopherson
59d67fc1f0 KVM: nSVM: Use KVM-governed feature framework to track "Pause Filter enabled"
Track "Pause Filtering is exposed to L1" via governed feature flags
instead of using dedicated bits/flags in vcpu_svm.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:30 -07:00
Sean Christopherson
e183d17ac3 KVM: nSVM: Use KVM-governed feature framework to track "LBRv enabled"
Track "LBR virtualization exposed to L1" via a governed feature flag
instead of using a dedicated bit/flag in vcpu_svm.

Note, checking KVM's capabilities instead of the "lbrv" param means that
the code isn't strictly equivalent, as lbrv_enabled could have been set
if nested=false where as that the governed feature cannot.  But that's a
glorified nop as the feature/flag is consumed only by paths that are
gated by nSVM being enabled.

Link: https://lore.kernel.org/r/20230815203653.519297-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:30 -07:00
Sean Christopherson
4d2a1560ff KVM: nSVM: Use KVM-governed feature framework to track "vVM{SAVE,LOAD} enabled"
Track "virtual VMSAVE/VMLOAD exposed to L1" via a governed feature flag
instead of using a dedicated bit/flag in vcpu_svm.

Opportunistically add a comment explaining why KVM disallows virtual
VMLOAD/VMSAVE when the vCPU model is Intel.

No functional change intended.

Link: https://lore.kernel.org/r/20230815203653.519297-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:29 -07:00
Sean Christopherson
4365a45571 KVM: nSVM: Use KVM-governed feature framework to track "TSC scaling enabled"
Track "TSC scaling exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.

Note, this fixes a benign bug where KVM would mark TSC scaling as exposed
to L1 even if overall nested SVM supported is disabled, i.e. KVM would let
L1 write MSR_AMD64_TSC_RATIO even when KVM didn't advertise TSCRATEMSR
support to userspace.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:28 -07:00
Sean Christopherson
7a6a6a3bf5 KVM: nSVM: Use KVM-governed feature framework to track "NRIPS enabled"
Track "NRIPS exposed to L1" via a governed feature flag instead of using
a dedicated bit/flag in vcpu_svm.

No functional change intended.

Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17 11:43:28 -07:00