Commit Graph

261 Commits

Author SHA1 Message Date
Mauro Carvalho Chehab
7ac3945d8e Documentation: KVM: update amd-memory-encryption.rst references
Changeset daec8d4083 ("Documentation: KVM: add separate directories for architecture-specific documentation")
renamed: Documentation/virt/kvm/amd-memory-encryption.rst
to: Documentation/virt/kvm/x86/amd-memory-encryption.rst.

Update the cross-references accordingly.

Fixes: daec8d4083 ("Documentation: KVM: add separate directories for architecture-specific documentation")
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Link: https://lore.kernel.org/r/fd80db889e34aae87a4ca88cad94f650723668f4.1656234456.git.mchehab@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-07-07 13:09:59 -06:00
Ben Gardon
084cc29f8b KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
In some cases, the NX hugepage mitigation for iTLB multihit is not
needed for all guests on a host. Allow disabling the mitigation on a
per-VM basis to avoid the performance hit of NX hugepages on trusted
workloads.

In order to disable NX hugepages on a VM, ensure that the userspace
actor has permission to reboot the system. Since disabling NX hugepages
would allow a guest to crash the system, it is similar to reboot
permissions.

Ideally, KVM would require userspace to prove it has access to KVM's
nx_huge_pages module param, e.g. so that userspace can opt out without
needing full reboot permissions.  But getting access to the module param
file info is difficult because it is buried in layers of sysfs and module
glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases.

Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-9-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:51:49 -04:00
Sean Christopherson
bfbcc81bb8 KVM: x86: Add a quirk for KVM's "MONITOR/MWAIT are NOPs!" behavior
Add a quirk for KVM's behavior of emulating intercepted MONITOR/MWAIT
instructions a NOPs regardless of whether or not they are supported in
guest CPUID.  KVM's current behavior was likely motiviated by a certain
fruity operating system that expects MONITOR/MWAIT to be supported
unconditionally and blindly executes MONITOR/MWAIT without first checking
CPUID.  And because KVM does NOT advertise MONITOR/MWAIT to userspace,
that's effectively the default setup for any VMM that regurgitates
KVM_GET_SUPPORTED_CPUID to KVM_SET_CPUID2.

Note, this quirk interacts with KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT.  The
behavior is actually desirable, as userspace VMMs that want to
unconditionally hide MONITOR/MWAIT from the guest can leave the
MISC_ENABLE quirk enabled.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220608224516.3788274-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 11:50:42 -04:00
Sean Christopherson
8deb03e75f KVM: Fix references to non-existent KVM_CAP_TRIPLE_FAULT_EVENT
The x86-only KVM_CAP_TRIPLE_FAULT_EVENT was (appropriately) renamed to
KVM_CAP_X86_TRIPLE_FAULT_EVENT when the patches were applied, but the
docs and selftests got left behind.  Fix them.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-11 10:14:28 -04:00
Paul Durrant
b172862241 KVM: x86: PIT: Preserve state of speaker port data bit
Currently the state of the speaker port (0x61) data bit (bit 1) is not
saved in the exported state (kvm_pit_state2) and hence is lost when
re-constructing guest state.

This patch removes the 'speaker_data_port' field from kvm_kpit_state and
instead tracks the state using a new KVM_PIT_FLAGS_SPEAKER_DATA_ON flag
defined in the API.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Message-Id: <20220531124421.1427-1-pdurrant@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 13:06:20 -04:00
Tao Xu
2f4073e08f KVM: VMX: Enable Notify VM exit
There are cases that malicious virtual machines can cause CPU stuck (due
to event windows don't open up), e.g., infinite loop in microcode when
nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and
IRQ) can be delivered. It leads the CPU to be unavailable to host or
other VMs.

VMM can enable notify VM exit that a VM exit generated if no event
window occurs in VM non-root mode for a specified amount of time (notify
window).

Feature enabling:
- The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to
  enable this feature. VMM can set NOTIFY_WINDOW vmcs field to adjust
  the expected notify window.
- Add a new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT so that user space
  can query and enable this feature in per-VM scope. The argument is a
  64bit value: bits 63:32 are used for notify window, and bits 31:0 are
  for flags. Current supported flags:
  - KVM_X86_NOTIFY_VMEXIT_ENABLED: enable the feature with the notify
    window provided.
  - KVM_X86_NOTIFY_VMEXIT_USER: exit to userspace once the exits happen.
- It's safe to even set notify window to zero since an internal hardware
  threshold is added to vmcs.notify_window.

VM exit handling:
- Introduce a vcpu state notify_window_exits to records the count of
  notify VM exits and expose it through the debugfs.
- Notify VM exit can happen incident to delivery of a vector event.
  Allow it in KVM.
- Exit to userspace unconditionally for handling when VM_CONTEXT_INVALID
  bit is set.

Nested handling
- Nested notify VM exits are not supported yet. Keep the same notify
  window control in vmcs02 as vmcs01, so that L1 can't escape the
  restriction of notify VM exits through launching L2 VM.

Notify VM exit is defined in latest Intel Architecture Instruction Set
Extensions Programming Reference, chapter 9.2.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20220524135624.22988-5-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 05:56:24 -04:00
Chenyi Qiang
ed2351174e KVM: x86: Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple fault
For the triple fault sythesized by KVM, e.g. the RSM path or
nested_vmx_abort(), if KVM exits to userspace before the request is
serviced, userspace could migrate the VM and lose the triple fault.

Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple fault with a
new event KVM_VCPUEVENT_VALID_FAULT_FAULT so that userspace can save and
restore the triple fault event. This extension is guarded by a new KVM
capability KVM_CAP_TRIPLE_FAULT_EVENT.

Note that in the set_vcpu_events path, userspace is able to set/clear
the triple fault request through triple_fault.pending field.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20220524135624.22988-2-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 05:20:53 -04:00
Zeng Guang
3587531638 KVM: x86: Allow userspace to set maximum VCPU id for VM
Introduce new max_vcpu_ids in KVM for x86 architecture. Userspace
can assign maximum possible vcpu id for current VM session using
KVM_CAP_MAX_VCPU_ID of KVM_ENABLE_CAP ioctl().

This is done for x86 only because the sole use case is to guide
memory allocation for PID-pointer table, a structure needed to
enable VMX IPI.

By default, max_vcpu_ids set as KVM_MAX_VCPU_IDS.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419154444.11888-1-guang.zeng@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:47:31 -04:00
Paolo Bonzini
5552de7b92 Merge tag 'kvm-s390-next-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: pvdump and selftest improvements

- add an interface to provide a hypervisor dump for secure guests
- improve selftests to show tests
2022-06-07 12:28:53 -04:00
Janosch Frank
b0f46280d3 Documentation/virt/kvm/api.rst: Explain rc/rrc delivery
Let's explain in which situations the rc/rrc will set in struct
kvm_pv_cmd so it's clear that the struct members should be set to
0. rc/rrc are independent of the IOCTL return code.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20220517163629.3443-12-frankja@linux.ibm.com
Message-Id: <20220517163629.3443-12-frankja@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-06-01 16:57:14 +02:00
Janosch Frank
437cfd714d Documentation/virt/kvm/api.rst: Add protvirt dump/info api descriptions
Time to add the dump API changes to the api documentation file.
Also some minor cleanup.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20220517163629.3443-11-frankja@linux.ibm.com
Message-Id: <20220517163629.3443-11-frankja@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-06-01 16:57:14 +02:00
Linus Torvalds
bf9095424d Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "S390:

   - ultravisor communication device driver

   - fix TEID on terminating storage key ops

  RISC-V:

   - Added Sv57x4 support for G-stage page table

   - Added range based local HFENCE functions

   - Added remote HFENCE functions based on VCPU requests

   - Added ISA extension registers in ONE_REG interface

   - Updated KVM RISC-V maintainers entry to cover selftests support

  ARM:

   - Add support for the ARMv8.6 WFxT extension

   - Guard pages for the EL2 stacks

   - Trap and emulate AArch32 ID registers to hide unsupported features

   - Ability to select and save/restore the set of hypercalls exposed to
     the guest

   - Support for PSCI-initiated suspend in collaboration with userspace

   - GICv3 register-based LPI invalidation support

   - Move host PMU event merging into the vcpu data structure

   - GICv3 ITS save/restore fixes

   - The usual set of small-scale cleanups and fixes

  x86:

   - New ioctls to get/set TSC frequency for a whole VM

   - Allow userspace to opt out of hypercall patching

   - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr

  AMD SEV improvements:

   - Add KVM_EXIT_SHUTDOWN metadata for SEV-ES

   - V_TSC_AUX support

  Nested virtualization improvements for AMD:

   - Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE,
     nested vGIF)

   - Allow AVIC to co-exist with a nested guest running

   - Fixes for LBR virtualizations when a nested guest is running, and
     nested LBR virtualization support

   - PAUSE filtering for nested hypervisors

  Guest support:

   - Decoupling of vcpu_is_preempted from PV spinlocks"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (199 commits)
  KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest
  KVM: selftests: x86: Sync the new name of the test case to .gitignore
  Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND
  x86, kvm: use correct GFP flags for preemption disabled
  KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer
  x86/kvm: Alloc dummy async #PF token outside of raw spinlock
  KVM: x86: avoid calling x86 emulator without a decoded instruction
  KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak
  x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave)
  s390/uv_uapi: depend on CONFIG_S390
  KVM: selftests: x86: Fix test failure on arch lbr capable platforms
  KVM: LAPIC: Trace LAPIC timer expiration on every vmentry
  KVM: s390: selftest: Test suppression indication on key prot exception
  KVM: s390: Don't indicate suppression on dirtying, failing memop
  selftests: drivers/s390x: Add uvdevice tests
  drivers/s390/char: Add Ultravisor io device
  MAINTAINERS: Update KVM RISC-V entry to cover selftests support
  RISC-V: KVM: Introduce ISA extension register
  RISC-V: KVM: Cleanup stale TLB entries when host CPU changes
  RISC-V: KVM: Add remote HFENCE functions based on VCPU requests
  ...
2022-05-26 14:20:14 -07:00
Paolo Bonzini
186af6bb40 Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25 05:15:43 -04:00
Paolo Bonzini
1644e27059 Merge tag 'kvm-s390-next-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Fix and feature for 5.19

- ultravisor communication device driver
- fix TEID on terminating storage key ops
2022-05-25 05:11:21 -04:00
Paolo Bonzini
47e8eec832 Merge tag 'kvmarm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 5.19

- Add support for the ARMv8.6 WFxT extension

- Guard pages for the EL2 stacks

- Trap and emulate AArch32 ID registers to hide unsupported features

- Ability to select and save/restore the set of hypercalls exposed
  to the guest

- Support for PSCI-initiated suspend in collaboration with userspace

- GICv3 register-based LPI invalidation support

- Move host PMU event merging into the vcpu data structure

- GICv3 ITS save/restore fixes

- The usual set of small-scale cleanups and fixes

[Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated
 from 4 to 6. - Paolo]
2022-05-25 05:09:23 -04:00
Linus Torvalds
143a6252e1 Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:

 - Initial support for the ARMv9 Scalable Matrix Extension (SME).

   SME takes the approach used for vectors in SVE and extends this to
   provide architectural support for matrix operations. No KVM support
   yet, SME is disabled in guests.

 - Support for crashkernel reservations above ZONE_DMA via the
   'crashkernel=X,high' command line option.

 - btrfs search_ioctl() fix for live-lock with sub-page faults.

 - arm64 perf updates: support for the Hisilicon "CPA" PMU for
   monitoring coherent I/O traffic, support for Arm's CMN-650 and
   CMN-700 interconnect PMUs, minor driver fixes, kerneldoc cleanup.

 - Kselftest updates for SME, BTI, MTE.

 - Automatic generation of the system register macros from a 'sysreg'
   file describing the register bitfields.

 - Update the type of the function argument holding the ESR_ELx register
   value to unsigned long to match the architecture register size
   (originally 32-bit but extended since ARMv8.0).

 - stacktrace cleanups.

 - ftrace cleanups.

 - Miscellaneous updates, most notably: arm64-specific huge_ptep_get(),
   avoid executable mappings in kexec/hibernate code, drop TLB flushing
   from get_clear_flush() (and rename it to get_clear_contig()),
   ARCH_NR_GPIO bumped to 2048 for ARCH_APPLE.

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (145 commits)
  arm64/sysreg: Generate definitions for FAR_ELx
  arm64/sysreg: Generate definitions for DACR32_EL2
  arm64/sysreg: Generate definitions for CSSELR_EL1
  arm64/sysreg: Generate definitions for CPACR_ELx
  arm64/sysreg: Generate definitions for CONTEXTIDR_ELx
  arm64/sysreg: Generate definitions for CLIDR_EL1
  arm64/sve: Move sve_free() into SVE code section
  arm64: Kconfig.platforms: Add comments
  arm64: Kconfig: Fix indentation and add comments
  arm64: mm: avoid writable executable mappings in kexec/hibernate code
  arm64: lds: move special code sections out of kernel exec segment
  arm64/hugetlb: Implement arm64 specific huge_ptep_get()
  arm64/hugetlb: Use ptep_get() to get the pte value of a huge page
  arm64: kdump: Do not allocate crash low memory if not needed
  arm64/sve: Generate ZCR definitions
  arm64/sme: Generate defintions for SVCR
  arm64/sme: Generate SMPRI_EL1 definitions
  arm64/sme: Automatically generate SMPRIMAP_EL2 definitions
  arm64/sme: Automatically generate SMIDR_EL1 defines
  arm64/sme: Automatically generate defines for SMCR
  ...
2022-05-23 21:06:11 -07:00
Janis Schoetterl-Glausch
c783631b0b KVM: s390: Don't indicate suppression on dirtying, failing memop
If user space uses a memop to emulate an instruction and that
memop fails, the execution of the instruction ends.
Instruction execution can end in different ways, one of which is
suppression, which requires that the instruction execute like a no-op.
A writing memop that spans multiple pages and fails due to key
protection may have modified guest memory, as a result, the likely
correct ending is termination. Therefore, do not indicate a
suppressing instruction ending in this case.

Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20220512131019.2594948-2-scgl@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2022-05-20 16:38:42 +02:00
Marc Zyngier
3b8e21e3c3 Merge branch kvm-arm64/psci-suspend into kvmarm-master/next
* kvm-arm64/psci-suspend:
  : .
  : Add support for PSCI SYSTEM_SUSPEND and allow userspace to
  : filter the wake-up events.
  :
  : Patches courtesy of Oliver.
  : .
  Documentation: KVM: Fix title level for PSCI_SUSPEND
  selftests: KVM: Test SYSTEM_SUSPEND PSCI call
  selftests: KVM: Refactor psci_test to make it amenable to new tests
  selftests: KVM: Use KVM_SET_MP_STATE to power off vCPU in psci_test
  selftests: KVM: Create helper for making SMCCC calls
  selftests: KVM: Rename psci_cpu_on_test to psci_test
  KVM: arm64: Implement PSCI SYSTEM_SUSPEND
  KVM: arm64: Add support for userspace to suspend a vCPU
  KVM: arm64: Return a value from check_vcpu_requests()
  KVM: arm64: Rename the KVM_REQ_SLEEP handler
  KVM: arm64: Track vCPU power state using MP state values
  KVM: arm64: Dedupe vCPU power off helpers
  KVM: arm64: Don't depend on fallthrough to hide SYSTEM_RESET2

Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-05-16 17:48:20 +01:00
Marc Zyngier
0586e28aaa Merge branch kvm-arm64/hcall-selection into kvmarm-master/next
* kvm-arm64/hcall-selection:
  : .
  : Introduce a new set of virtual sysregs for userspace to
  : select the hypercalls it wants to see exposed to the guest.
  :
  : Patches courtesy of Raghavendra and Oliver.
  : .
  KVM: arm64: Fix hypercall bitmap writeback when vcpus have already run
  KVM: arm64: Hide KVM_REG_ARM_*_BMAP_BIT_COUNT from userspace
  Documentation: Fix index.rst after psci.rst renaming
  selftests: KVM: aarch64: Add the bitmap firmware registers to get-reg-list
  selftests: KVM: aarch64: Introduce hypercall ABI test
  selftests: KVM: Create helper for making SMCCC calls
  selftests: KVM: Rename psci_cpu_on_test to psci_test
  tools: Import ARM SMCCC definitions
  Docs: KVM: Add doc for the bitmap firmware registers
  Docs: KVM: Rename psci.rst to hypercalls.rst
  KVM: arm64: Add vendor hypervisor firmware register
  KVM: arm64: Add standard hypervisor firmware register
  KVM: arm64: Setup a framework for hypercall bitmap firmware registers
  KVM: arm64: Factor out firmware register handling from psci.c

Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-05-16 17:47:03 +01:00
Stephen Rothwell
582eb04e05 Documentation: KVM: Fix title level for PSCI_SUSPEND
The htmldoc build breaks in a funny way with:

<quote>
Sphinx parallel build error:
docutils.utils.SystemMessage: /home/sfr/next/next/Documentation/virt/kvm/api.rst:6175: (SEVERE/4) Title level inconsistent:

For arm/arm64:
^^^^^^^^^^^^^^
</quote>

Swap the ^^s for a bunch of --s...

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
[maz: commit message]
Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-05-05 12:24:21 +01:00
Oliver Upton
bfbab44568 KVM: arm64: Implement PSCI SYSTEM_SUSPEND
ARM DEN0022D.b 5.19 "SYSTEM_SUSPEND" describes a PSCI call that allows
software to request that a system be placed in the deepest possible
low-power state. Effectively, software can use this to suspend itself to
RAM.

Unfortunately, there really is no good way to implement a system-wide
PSCI call in KVM. Any precondition checks done in the kernel will need
to be repeated by userspace since there is no good way to protect a
critical section that spans an exit to userspace. SYSTEM_RESET and
SYSTEM_OFF are equally plagued by this issue, although no users have
seemingly cared for the relatively long time these calls have been
supported.

The solution is to just make the whole implementation userspace's
problem. Introduce a new system event, KVM_SYSTEM_EVENT_SUSPEND, that
indicates to userspace a calling vCPU has invoked PSCI SYSTEM_SUSPEND.
Additionally, add a CAP to get buy-in from userspace for this new exit
type.

Only advertise the SYSTEM_SUSPEND PSCI call if userspace has opted in.
If a vCPU calls SYSTEM_SUSPEND, punt straight to userspace. Provide
explicit documentation of userspace's responsibilites for the exit and
point to the PSCI specification to describe the actual PSCI call.

Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220504032446.4133305-8-oupton@google.com
2022-05-04 09:28:45 +01:00
Oliver Upton
7b33a09d03 KVM: arm64: Add support for userspace to suspend a vCPU
Introduce a new MP state, KVM_MP_STATE_SUSPENDED, which indicates a vCPU
is in a suspended state. In the suspended state the vCPU will block
until a wakeup event (pending interrupt) is recognized.

Add a new system event type, KVM_SYSTEM_EVENT_WAKEUP, to indicate to
userspace that KVM has recognized one such wakeup event. It is the
responsibility of userspace to then make the vCPU runnable, or leave it
suspended until the next wakeup event.

Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220504032446.4133305-7-oupton@google.com
2022-05-04 09:28:45 +01:00
Raghavendra Rao Ananta
fa246c68a0 Docs: KVM: Add doc for the bitmap firmware registers
Add the documentation for the bitmap firmware registers in
hypercalls.rst and api.rst. This includes the details for
KVM_REG_ARM_STD_BMAP, KVM_REG_ARM_STD_HYP_BMAP, and
KVM_REG_ARM_VENDOR_HYP_BMAP registers.

Since the document is growing to carry other hypercall related
information, make necessary adjustments to present the document
in a generic sense, rather than being PSCI focused.

Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
[maz: small scale reformat, move things about, random typo fixes]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220502233853.1233742-7-rananta@google.com
2022-05-03 21:30:19 +01:00
Alexandru Elisei
18f3976fdb KVM: arm64: uapi: Add kvm_debug_exit_arch.hsr_high
When userspace is debugging a VM, the kvm_debug_exit_arch part of the
kvm_run struct contains arm64 specific debug information: the ESR_EL2
value, encoded in the field "hsr", and the address of the instruction
that caused the exception, encoded in the field "far".

Linux has moved to treating ESR_EL2 as a 64-bit register, but unfortunately
kvm_debug_exit_arch.hsr cannot be changed because that would change the
memory layout of the struct on big endian machines:

Current layout:			| Layout with "hsr" extended to 64 bits:
				|
offset 0: ESR_EL2[31:0] (hsr)   | offset 0: ESR_EL2[61:32] (hsr[61:32])
offset 4: padding		| offset 4: ESR_EL2[31:0]  (hsr[31:0])
offset 8: FAR_EL2[61:0] (far)	| offset 8: FAR_EL2[61:0]  (far)

which breaks existing code.

The padding is inserted by the compiler because the "far" field must be
aligned to 8 bytes (each field must be naturally aligned - aapcs64 [1],
page 18), and the struct itself must be aligned to 8 bytes (the struct must
be aligned to the maximum alignment of its fields - aapcs64, page 18),
which means that "hsr" must be aligned to 8 bytes as it is the first field
in the struct.

To avoid changing the struct size and layout for the existing fields, add a
new field, "hsr_high", which replaces the existing padding. "hsr_high" will
be used to hold the ESR_EL2[61:32] bits of the register. The memory layout,
both on big and little endian machine, becomes:

offset 0: ESR_EL2[31:0]  (hsr)
offset 4: ESR_EL2[61:32] (hsr_high)
offset 8: FAR_EL2[61:0]  (far)

The padding that the compiler inserts for the current struct layout is
unitialized. To prevent an updated userspace running on an old kernel
mistaking the padding for a valid "hsr_high" value, add a new flag,
KVM_DEBUG_ARCH_HSR_HIGH_VALID, to kvm_run->flags to let userspace know that
"hsr_high" holds a valid ESR_EL2[61:32] value.

[1] https://github.com/ARM-software/abi-aa/releases/download/2021Q3/aapcs64.pdf

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220425114444.368693-6-alexandru.elisei@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2022-04-29 19:26:27 +01:00
Paolo Bonzini
71d7c575a6 Merge branch 'kvm-fixes-for-5.18-rc5' into HEAD
Fixes for (relatively) old bugs, to be merged in both the -rc and next
development trees.

The merge reconciles the ABI fixes for KVM_EXIT_SYSTEM_EVENT between
5.18 and commit c24a950ec7 ("KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata
for SEV-ES", 2022-04-13).
2022-04-29 12:47:59 -04:00
Paolo Bonzini
d495f942f4 KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT
When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
member that at the time was unused.  Unfortunately this extensibility
mechanism has several issues:

- x86 is not writing the member, so it would not be possible to use it
  on x86 except for new events

- the member is not aligned to 64 bits, so the definition of the
  uAPI struct is incorrect for 32- on 64-bit userspace.  This is a
  problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately
  usage of flags was only introduced in 5.18.

Since padding has to be introduced, place a new field in there
that tells if the flags field is valid.  To allow further extensibility,
in fact, change flags to an array of 16 values, and store how many
of the values are valid.  The availability of the new ndata field
is tied to a system capability; all architectures are changed to
fill in the field.

To avoid breaking compilation of userspace that was using the flags
field, provide a userspace-only union to overlap flags with data[0].
The new field is placed at the same offset for both 32- and 64-bit
userspace.

Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:38:22 -04:00
Peter Gonda
c24a950ec7 KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata for SEV-ES
If an SEV-ES guest requests termination, exit to userspace with
KVM_EXIT_SYSTEM_EVENT and a dedicated SEV_TERM type instead of -EINVAL
so that userspace can take appropriate action.

See AMD's GHCB spec section '4.1.13 Termination Request' for more details.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Gonda <pgonda@google.com>

Reported-by: kernel test robot <lkp@intel.com>
Message-Id: <20220407210233.782250-1-pgonda@google.com>
[Add documentatino. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:46 -04:00
Paolo Bonzini
a4cfff3f0f Merge branch 'kvm-older-features' into HEAD
Merge branch for features that did not make it into 5.18:

* New ioctls to get/set TSC frequency for a whole VM

* Allow userspace to opt out of hypercall patching

Nested virtualization improvements for AMD:

* Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE,
  nested vGIF)

* Allow AVIC to co-exist with a nested guest running

* Fixes for LBR virtualizations when a nested guest is running,
  and nested LBR virtualization support

* PAUSE filtering for nested hypervisors

Guest support:

* Decoupling of vcpu_is_preempted from PV spinlocks

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:17 -04:00
Bagas Sanjaya
c1be1ef1b4 Documentation: kvm: Add missing line break in api.rst
Add missing line break separator between literal block and description
of KVM_EXIT_RISCV_SBI.

This fixes:
</path/to/linux>/Documentation/virt/kvm/api.rst:6118: WARNING: Literal block ends without a blank line; unexpected unindent.

Fixes: da40d85805 (RISC-V: KVM: Document RISC-V specific parts of KVM API, 2021-09-27)
Cc: Anup Patel <anup.patel@wdc.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Message-Id: <20220403065735.23859-1-bagasdotme@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-05 08:11:37 -04:00
David Woodhouse
ffbb61d09f KVM: x86: Accept KVM_[GS]ET_TSC_KHZ as a VM ioctl.
This sets the default TSC frequency for subsequently created vCPUs.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20220225145304.36166-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:19 -04:00
David Woodhouse
661a20fab7 KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SEND
At the end of the patch series adding this batch of event channel
acceleration features, finally add the feature bit which advertises
them and document it all.

For SCHEDOP_poll we need to wake a polling vCPU when a given port
is triggered, even when it's masked — and we want to implement that
in the kernel, for efficiency. So we want the kernel to know that it
has sole ownership of event channel delivery. Thus, we allow
userspace to make the 'promise' by setting the corresponding feature
bit in its KVM_XEN_HVM_CONFIG call. As we implement SCHEDOP_poll
bypass later, we will do so only if that promise has been made by
userspace.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-16-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:17 -04:00
Oliver Upton
f1a9761fbb KVM: x86: Allow userspace to opt out of hypercall patching
KVM handles the VMCALL/VMMCALL instructions very strangely. Even though
both of these instructions really should #UD when executed on the wrong
vendor's hardware (i.e. VMCALL on SVM, VMMCALL on VMX), KVM replaces the
guest's instruction with the appropriate instruction for the vendor.
Nonetheless, older guest kernels without commit c1118b3602 ("x86: kvm:
use alternatives for VMCALL vs. VMMCALL if kernel text is read-only")
do not patch in the appropriate instruction using alternatives, likely
motivating KVM's intervention.

Add a quirk allowing userspace to opt out of hypercall patching. If the
quirk is disabled, KVM synthesizes a #UD in the guest.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220316005538.2282772-2-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:41:10 -04:00
Paolo Bonzini
fe5f691413 KVM: MIPS: remove reference to trap&emulate virtualization
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220313140522.1307751-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:47 -04:00
Paolo Bonzini
ce2f72e26c KVM: x86: document limitations of MSR filtering
MSR filtering requires an exit to userspace that is hard to implement and
would be very slow in the case of nested VMX vmexit and vmentry MSR
accesses.  Document the limitation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:47 -04:00
David Woodhouse
cf1d88b36b KVM: Remove dirty handling from gfn_to_pfn_cache completely
It isn't OK to cache the dirty status of a page in internal structures
for an indefinite period of time.

Any time a vCPU exits the run loop to userspace might be its last; the
VMM might do its final check of the dirty log, flush the last remaining
dirty pages to the destination and complete a live migration. If we
have internal 'dirty' state which doesn't get flushed until the vCPU
is finally destroyed on the source after migration is complete, then
we have lost data because that will escape the final copy.

This problem already exists with the use of kvm_vcpu_unmap() to mark
pages dirty in e.g. VMX nesting.

Note that the actual Linux MM already considers the page to be dirty
since we have a writeable mapping of it. This is just about the KVM
dirty logging.

For the nesting-style use cases (KVM_GUEST_USES_PFN) we will need to
track which gfn_to_pfn_caches have been used and explicitly mark the
corresponding pages dirty before returning to userspace. But we would
have needed external tracking of that anyway, rather than walking the
full list of GPCs to find those belonging to this vCPU which are dirty.

So let's rely *solely* on that external tracking, and keep it simple
rather than laying a tempting trap for callers to fall into.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220303154127.202856-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02 05:34:41 -04:00
Paolo Bonzini
cde363ab7c Documentation: KVM: add API issues section
Add a section to document all the different ways in which the KVM API sucks.

I am sure there are way more, give people a place to vent so that userspace
authors are aware.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220322110712.222449-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-29 13:21:20 -04:00
Oliver Upton
6d8491910f KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
KVM_CAP_DISABLE_QUIRKS is irrevocably broken. The capability does not
advertise the set of quirks which may be disabled to userspace, so it is
impossible to predict the behavior of KVM. Worse yet,
KVM_CAP_DISABLE_QUIRKS will tolerate any value for cap->args[0], meaning
it fails to reject attempts to set invalid quirk bits.

The only valid workaround for the quirky quirks API is to add a new CAP.
Actually advertise the set of quirks that can be disabled to userspace
so it can predict KVM's behavior. Reject values for cap->args[0] that
contain invalid bits.

Finally, add documentation for the new capability and describe the
existing quirks.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20220301060351.442881-5-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-21 09:28:41 -04:00
Paolo Bonzini
714797c98e Merge tag 'kvmarm-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 5.18

- Proper emulation of the OSLock feature of the debug architecture

- Scalibility improvements for the MMU lock when dirty logging is on

- New VMID allocator, which will eventually help with SVA in VMs

- Better support for PMUs in heterogenous systems

- PSCI 1.1 support, enabling support for SYSTEM_RESET2

- Implement CONFIG_DEBUG_LIST at EL2

- Make CONFIG_ARM64_ERRATUM_2077057 default y

- Reduce the overhead of VM exit when no interrupt is pending

- Remove traces of 32bit ARM host support from the documentation

- Updated vgic selftests

- Various cleanups, doc updates and spelling fixes
2022-03-18 12:43:24 -04:00
Marc Zyngier
7297a8bcc0 Merge branch kvm-arm64/misc-5.18 into kvmarm-master/next
* kvm-arm64/misc-5.18:
  : .
  : Misc fixes for KVM/arm64 5.18:
  :
  : - Drop unused kvm parameter to kvm_psci_version()
  :
  : - Implement CONFIG_DEBUG_LIST at EL2
  :
  : - Make CONFIG_ARM64_ERRATUM_2077057 default y
  :
  : - Only do the interrupt dance if we have exited because of an interrupt
  :
  : - Remove traces of 32bit ARM host support from the documentation
  : .
  Documentation: KVM: Update documentation to indicate KVM is arm64-only
  KVM: arm64: Only open the interrupt window on exit due to an interrupt
  KVM: arm64: Enable Cortex-A510 erratum 2077057 by default

Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-03-09 11:16:48 +00:00
Oliver Upton
3fbf4207dc Documentation: KVM: Update documentation to indicate KVM is arm64-only
KVM support for 32-bit ARM hosts (KVM/arm) has been removed from the
kernel since commit 541ad0150c ("arm: Remove 32bit KVM host
support"). There still exists some remnants of the old architecture in
the KVM documentation.

Remove all traces of 32-bit host support from the documentation. Note
that AArch32 guests are still supported.

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220308172856.2997250-1-oupton@google.com
2022-03-09 11:15:24 +00:00
Paolo Bonzini
0564eeb71b Merge branch 'kvm-bugfixes' into HEAD
Merge bugfixes from 5.17 before merging more tricky work.
2022-03-04 18:39:29 -05:00
David Dunn
ba7bb663f5 KVM: x86: Provide per VM capability for disabling PMU virtualization
Add a new capability, KVM_CAP_PMU_CAPABILITY, that takes a bitmask of
settings/features to allow userspace to configure PMU virtualization on
a per-VM basis.  For now, support a single flag, KVM_PMU_CAP_DISABLE,
to allow disabling PMU virtualization for a VM even when KVM is configured
with enable_pmu=true a module level.

To keep KVM simple, disallow changing VM's PMU configuration after vCPUs
have been created.

Signed-off-by: David Dunn <daviddunn@google.com>
Message-Id: <20220223225743.2703915-2-daviddunn@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 08:20:14 -05:00
Paolo Bonzini
4dfc4ec2b7 Merge branch 'kvm-ppc-cap-210' into kvm-next-5.18 2022-02-24 08:49:56 -05:00
Paolo Bonzini
1e2277ed70 Merge branch 'kvm-ppc-cap-210' into kvm-master
By request of Nick Piggin:

> Patch 3 requires a KVM_CAP_PPC number allocated. QEMU maintainers are
> happy with it (link in changelog) just waiting on KVM upstreaming. Do
> you have objections to the series going to ppc/kvm tree first, or
> another option is you could take patch 3 alone first (it's relatively
> independent of the other 2) and ppc/kvm gets it from you?
2022-02-22 09:07:16 -05:00
Nicholas Piggin
93b71801a8 KVM: PPC: reserve capability 210 for KVM_CAP_PPC_AIL_MODE_3
Add KVM_CAP_PPC_AIL_MODE_3 to advertise the capability to set the AIL
resource mode to 3 with the H_SET_MODE hypercall. This capability
differs between processor types and KVM types (PR, HV, Nested HV), and
affects guest-visible behaviour.

QEMU will implement a cap-ail-mode-3 to control this behaviour[1], and
use the KVM CAP if available to determine KVM support[2].

Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-22 09:06:54 -05:00
Janis Schoetterl-Glausch
cbf9b8109d KVM: s390: Clarify key argument for MEM_OP in api docs
Clarify that the key argument represents the access key, not the whole
storage key.

Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Link: https://lore.kernel.org/r/20220221143657.3712481-1-scgl@linux.ibm.com
Fixes: 5e35d0eb47 ("KVM: s390: Update api documentation for memop ioctl")
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-02-22 09:16:18 +01:00
Will Deacon
34739fd95f KVM: arm64: Indicate SYSTEM_RESET2 in kvm_run::system_event flags field
When handling reset and power-off PSCI calls from the guest, we
initialise X0 to PSCI_RET_INTERNAL_FAILURE in case the VMM tries to
re-run the vCPU after issuing the call.

Unfortunately, this also means that the VMM cannot see which PSCI call
was issued and therefore cannot distinguish between PSCI SYSTEM_RESET
and SYSTEM_RESET2 calls, which is necessary in order to determine the
validity of the "reset_type" in X1.

Allocate bit 0 of the previously unused 'flags' field of the
system_event structure so that we can indicate the PSCI call used to
initiate the reset.

Cc: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220221153524.15397-4-will@kernel.org
2022-02-21 16:02:55 +00:00
Aaron Lewis
127770ac0d KVM: x86: Add KVM_CAP_ENABLE_CAP to x86
Follow the precedent set by other architectures that support the VCPU
ioctl, KVM_ENABLE_CAP, and advertise the VM extension, KVM_CAP_ENABLE_CAP.
This way, userspace can ensure that KVM_ENABLE_CAP is available on a
vcpu before using it.

Fixes: 5c919412fe ("kvm/x86: Hyper-V synthetic interrupt controller")
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Message-Id: <20220214212950.1776943-1-aaronlewis@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17 09:52:50 -05:00
Janis Schoetterl-Glausch
5e35d0eb47 KVM: s390: Update api documentation for memop ioctl
Document all currently existing operations, flags and explain under
which circumstances they are available. Document the recently
introduced absolute operations and the storage key protection flag,
as well as the existing SIDA operations.

Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20220211182215.2730017-10-scgl@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-02-14 16:12:57 +01:00
Paolo Bonzini
dd6e631220 KVM: x86: add system attribute to retrieve full set of supported xsave states
Because KVM_GET_SUPPORTED_CPUID is meant to be passed (by simple-minded
VMMs) to KVM_SET_CPUID2, it cannot include any dynamic xsave states that
have not been enabled.  Probing those, for example so that they can be
passed to ARCH_REQ_XCOMP_GUEST_PERM, requires a new ioctl or arch_prctl.
The latter is in fact worse, even though that is what the rest of the
API uses, because it would require supported_xcr0 to be moved from the
KVM module to the kernel just for this use.  In addition, the value
would be nonsensical (or an error would have to be returned) until
the KVM module is loaded in.

Therefore, to limit the growth of system ioctls, add a /dev/kvm
variant of KVM_{GET,HAS}_DEVICE_ATTR, and implement it in x86
with just one group (0) and attribute (KVM_X86_XCOMP_GUEST_SUPP).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-28 07:33:32 -05:00