Commit Graph

526 Commits

Author SHA1 Message Date
Linus Torvalds
2c9b351240 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Initial infrastructure for shadow stage-2 MMUs, as part of nested
     virtualization enablement

   - Support for userspace changes to the guest CTR_EL0 value, enabling
     (in part) migration of VMs between heterogenous hardware

   - Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1
     of the protocol

   - FPSIMD/SVE support for nested, including merged trap configuration
     and exception routing

   - New command-line parameter to control the WFx trap behavior under
     KVM

   - Introduce kCFI hardening in the EL2 hypervisor

   - Fixes + cleanups for handling presence/absence of FEAT_TCRX

   - Miscellaneous fixes + documentation updates

  LoongArch:

   - Add paravirt steal time support

   - Add support for KVM_DIRTY_LOG_INITIALLY_SET

   - Add perf kvm-stat support for loongarch

  RISC-V:

   - Redirect AMO load/store access fault traps to guest

   - perf kvm stat support

   - Use guest files for IMSIC virtualization, when available

  s390:

   - Assortment of tiny fixes which are not time critical

  x86:

   - Fixes for Xen emulation

   - Add a global struct to consolidate tracking of host values, e.g.
     EFER

   - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the
     effective APIC bus frequency, because TDX

   - Print the name of the APICv/AVIC inhibits in the relevant
     tracepoint

   - Clean up KVM's handling of vendor specific emulation to
     consistently act on "compatible with Intel/AMD", versus checking
     for a specific vendor

   - Drop MTRR virtualization, and instead always honor guest PAT on
     CPUs that support self-snoop

   - Update to the newfangled Intel CPU FMS infrastructure

   - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as
     it reads '0' and writes from userspace are ignored

   - Misc cleanups

  x86 - MMU:

   - Small cleanups, renames and refactoring extracted from the upcoming
     Intel TDX support

   - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages
     that can't hold leafs SPTEs

   - Unconditionally drop mmu_lock when allocating TDP MMU page tables
     for eager page splitting, to avoid stalling vCPUs when splitting
     huge pages

   - Bug the VM instead of simply warning if KVM tries to split a SPTE
     that is non-present or not-huge. KVM is guaranteed to end up in a
     broken state because the callers fully expect a valid SPTE, it's
     all but dangerous to let more MMU changes happen afterwards

  x86 - AMD:

   - Make per-CPU save_area allocations NUMA-aware

   - Force sev_es_host_save_area() to be inlined to avoid calling into
     an instrumentable function from noinstr code

   - Base support for running SEV-SNP guests. API-wise, this includes a
     new KVM_X86_SNP_VM type, encrypting/measure the initial image into
     guest memory, and finalizing it before launching it. Internally,
     there are some gmem/mmu hooks needed to prepare gmem-allocated
     pages before mapping them into guest private memory ranges

     This includes basic support for attestation guest requests, enough
     to say that KVM supports the GHCB 2.0 specification

     There is no support yet for loading into the firmware those signing
     keys to be used for attestation requests, and therefore no need yet
     for the host to provide certificate data for those keys.

     To support fetching certificate data from userspace, a new KVM exit
     type will be needed to handle fetching the certificate from
     userspace.

     An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS
     exit type to handle this was introduced in v1 of this patchset, but
     is still being discussed by community, so for now this patchset
     only implements a stub version of SNP Extended Guest Requests that
     does not provide certificate data

  x86 - Intel:

   - Remove an unnecessary EPT TLB flush when enabling hardware

   - Fix a series of bugs that cause KVM to fail to detect nested
     pending posted interrupts as valid wake eents for a vCPU executing
     HLT in L2 (with HLT-exiting disable by L1)

   - KVM: x86: Suppress MMIO that is triggered during task switch
     emulation

     Explicitly suppress userspace emulated MMIO exits that are
     triggered when emulating a task switch as KVM doesn't support
     userspace MMIO during complex (multi-step) emulation

     Silently ignoring the exit request can result in the
     WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace
     for some other reason prior to purging mmio_needed

     See commit 0dc902267c ("KVM: x86: Suppress pending MMIO write
     exits if emulator detects exception") for more details on KVM's
     limitations with respect to emulated MMIO during complex emulator
     flows

  Generic:

   - Rename the AS_UNMOVABLE flag that was introduced for KVM to
     AS_INACCESSIBLE, because the special casing needed by these pages
     is not due to just unmovability (and in fact they are only
     unmovable because the CPU cannot access them)

   - New ioctl to populate the KVM page tables in advance, which is
     useful to mitigate KVM page faults during guest boot or after live
     migration. The code will also be used by TDX, but (probably) not
     through the ioctl

   - Enable halt poll shrinking by default, as Intel found it to be a
     clear win

   - Setup empty IRQ routing when creating a VM to avoid having to
     synchronize SRCU when creating a split IRQCHIP on x86

   - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with
     a flag that arch code can use for hooking both sched_in() and
     sched_out()

   - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
     truncating a bogus value from userspace, e.g. to help userspace
     detect bugs

   - Mark a vCPU as preempted if and only if it's scheduled out while in
     the KVM_RUN loop, e.g. to avoid marking it preempted and thus
     writing guest memory when retrieving guest state during live
     migration blackout

  Selftests:

   - Remove dead code in the memslot modification stress test

   - Treat "branch instructions retired" as supported on all AMD Family
     17h+ CPUs

   - Print the guest pseudo-RNG seed only when it changes, to avoid
     spamming the log for tests that create lots of VMs

   - Make the PMU counters test less flaky when counting LLC cache
     misses by doing CLFLUSH{OPT} in every loop iteration"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
  crypto: ccp: Add the SNP_VLEK_LOAD command
  KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops
  KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
  KVM: x86: Replace static_call_cond() with static_call()
  KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
  x86/sev: Move sev_guest.h into common SEV header
  KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event
  KVM: x86: Suppress MMIO that is triggered during task switch emulation
  KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro
  KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE
  KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY
  KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()
  KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level
  KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault"
  KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler
  KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory
  KVM: Document KVM_PRE_FAULT_MEMORY ioctl
  mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE
  perf kvm: Add kvm-stat for loongarch64
  LoongArch: KVM: Add PV steal time support in guest side
  ...
2024-07-20 12:41:03 -07:00
Paolo Bonzini
5dcc1e7614 Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.11

 - Add a global struct to consolidate tracking of host values, e.g. EFER, and
   move "shadow_phys_bits" into the structure as "maxphyaddr".

 - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
   bus frequency, because TDX.

 - Print the name of the APICv/AVIC inhibits in the relevant tracepoint.

 - Clean up KVM's handling of vendor specific emulation to consistently act on
   "compatible with Intel/AMD", versus checking for a specific vendor.

 - Misc cleanups
2024-07-16 09:53:05 -04:00
Thomas Prescher
85542adb65 KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flag
When a vCPU is interrupted by a signal while running a nested guest,
KVM will exit to userspace with L2 state. However, userspace has no
way to know whether it sees L1 or L2 state (besides calling
KVM_GET_STATS_FD, which does not have a stable ABI).

This causes multiple problems:

The simplest one is L2 state corruption when userspace marks the sregs
as dirty. See this mailing list thread [1] for a complete discussion.

Another problem is that if userspace decides to continue by emulating
instructions, it will unknowingly emulate with L2 state as if L1
doesn't exist, which can be considered a weird guest escape.

Introduce a new flag, KVM_RUN_X86_GUEST_MODE, in the kvm_run data
structure, which is set when the vCPU exited while running a nested
guest.  Also introduce a new capability, KVM_CAP_X86_GUEST_MODE, to
advertise the functionality to userspace.

[1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20240508132502.184428-1-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 09:24:31 -07:00
Tom Lendacky
34ff659017 x86/sev: Use kernel provided SVSM Calling Areas
The SVSM Calling Area (CA) is used to communicate between Linux and the
SVSM. Since the firmware supplied CA for the BSP is likely to be in
reserved memory, switch off that CA to a kernel provided CA so that access
and use of the CA is available during boot. The CA switch is done using
the SVSM core protocol SVSM_CORE_REMAP_CA call.

An SVSM call is executed by filling out the SVSM CA and setting the proper
register state as documented by the SVSM protocol. The SVSM is invoked by
by requesting the hypervisor to run VMPL0.

Once it is safe to allocate/reserve memory, allocate a CA for each CPU.
After allocating the new CAs, the BSP will switch from the boot CA to the
per-CPU CA. The CA for an AP is identified to the SVSM when creating the
VMSA in preparation for booting the AP.

  [ bp: Heavily simplify svsm_issue_call() asm, other touchups. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/fa8021130bcc3bcf14d722a25548cb0cdf325456.1717600736.git.thomas.lendacky@amd.com
2024-06-11 07:22:46 +02:00
Brijesh Singh
ad27ce1555 KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command
Add a KVM_SEV_SNP_LAUNCH_FINISH command to finalize the cryptographic
launch digest which stores the measurement of the guest at launch time.
Also extend the existing SNP firmware data structures to support
disabling the use of Versioned Chip Endorsement Keys (VCEK) by guests as
part of this command.

While finalizing the launch flow, the code also issues the LAUNCH_UPDATE
SNP firmware commands to encrypt/measure the initial VMSA pages for each
configured vCPU, which requires setting the RMP entries for those pages
to private, so also add handling to clean up the RMP entries for these
pages whening freeing vCPUs during shutdown.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Harald Hoyer <harald@profian.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-8-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:30 -04:00
Brijesh Singh
dee5a47cc7 KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command
A key aspect of a launching an SNP guest is initializing it with a
known/measured payload which is then encrypted into guest memory as
pre-validated private pages and then measured into the cryptographic
launch context created with KVM_SEV_SNP_LAUNCH_START so that the guest
can attest itself after booting.

Since all private pages are provided by guest_memfd, make use of the
kvm_gmem_populate() interface to handle this. The general flow is that
guest_memfd will handle allocating the pages associated with the GPA
ranges being initialized by each particular call of
KVM_SEV_SNP_LAUNCH_UPDATE, copying data from userspace into those pages,
and then the post_populate callback will do the work of setting the
RMP entries for these pages to private and issuing the SNP firmware
calls to encrypt/measure them.

For more information see the SEV-SNP specification.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-7-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:29 -04:00
Brijesh Singh
136d8bc931 KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest.
The command initializes a cryptographic digest context used to construct
the measurement of the guest. Other commands can then at that point be
used to load/encrypt data into the guest's initial launch image.

For more information see the SEV-SNP specification.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-6-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:29 -04:00
Brijesh Singh
1dfe571c12 KVM: SEV: Add initial SEV-SNP support
SEV-SNP builds upon existing SEV and SEV-ES functionality while adding
new hardware-based security protection. SEV-SNP adds strong memory
encryption and integrity protection to help prevent malicious
hypervisor-based attacks such as data replay, memory re-mapping, and
more, to create an isolated execution environment.

Define a new KVM_X86_SNP_VM type which makes use of these capabilities
and extend the KVM_SEV_INIT2 ioctl to support it. Also add a basic
helper to check whether SNP is enabled and set PFERR_PRIVATE_ACCESS for
private #NPFs so they are handled appropriately by KVM MMU.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240501085210.2213060-5-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-12 04:09:28 -04:00
Michael Roth
4af663c2f6 KVM: SEV: Allow per-guest configuration of GHCB protocol version
The GHCB protocol version may be different from one guest to the next.
Add a field to track it for each KVM instance and extend KVM_SEV_INIT2
to allow it to be configured by userspace.

Now that all SEV-ES support for GHCB protocol version 2 is in place, go
ahead and default to it when creating SEV-ES guests through the new
KVM_SEV_INIT2 interface. Keep the older KVM_SEV_ES_INIT interface
restricted to GHCB protocol version 1.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501071048.2208265-5-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-07 13:28:05 -04:00
Paolo Bonzini
4f5defae70 KVM: SEV: introduce KVM_SEV_INIT2 operation
The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic.  In fact, in some sense it's
already a parameter whether SEV or SEV-ES is desired.  Another possible
source of variability is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.

Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct,
and put the new op to work by including the VMSA features as a field of the
struct.  The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility.

The struct also includes the usual bells and whistles for future
extensibility: a flags field that must be zero for now, and some padding
at the end.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-13-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:25 -04:00
Paolo Bonzini
26c44aa9e0 KVM: SEV: define VM types for SEV and SEV-ES
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-11-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:25 -04:00
Paolo Bonzini
ac5c48027b KVM: SEV: publish supported VMSA features
Compute the set of features to be stored in the VMSA when KVM is
initialized; move it from there into kvm_sev_info when SEV is initialized,
and then into the initial VMSA.

The new variable can then be used to return the set of supported features
to userspace, via the KVM_GET_DEVICE_ATTR ioctl.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11 13:08:22 -04:00
Paolo Bonzini
f3c80061c0 KVM: SEV: fix compat ABI for KVM_MEMORY_ENCRYPT_OP
The data structs for KVM_MEMORY_ENCRYPT_OP have different sizes for 32- and 64-bit
userspace, but they do not make any attempt to convert from one ABI to the other
when 32-bit userspace is running on 64-bit kernels.  This configuration never
worked, and SEV is only for 64-bit kernels so we're not breaking ABI on 32-bit
kernels.

Fix this by adding the appropriate padding; no functional change intended
for 64-bit userspace.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-03-18 19:03:52 -04:00
Paolo Bonzini
c822a075ab Merge tag 'kvm-x86-asyncpf_abi-6.9' of https://github.com/kvm-x86/linux into HEAD
Guest-side KVM async #PF ABI cleanup for 6.9

Delete kvm_vcpu_pv_apf_data.enabled to fix a goof in KVM's async #PF ABI where
the enabled field pushes the size of "struct kvm_vcpu_pv_apf_data" from 64 to
68 bytes, i.e. beyond a single cache line.

The enabled field is purely a guest-side flag that Linux-as-a-guest uses to
track whether or not the guest has enabled async #PF support.  The actual flag
that is passed to the host, i.e. to KVM proper, is a single bit in a synthetic
MSR, MSR_KVM_ASYNC_PF_EN, i.e. is in a location completely unrelated to the
shared kvm_vcpu_pv_apf_data structure.

Simply drop the the field and use a dedicated guest-side per-CPU variable to
fix the ABI, as opposed to fixing the documentation to match reality.  KVM has
never consumed kvm_vcpu_pv_apf_data.enabled, so the odds of the ABI change
breaking anything are extremely low.
2024-03-18 19:03:42 -04:00
Linus Torvalds
4f712ee0cb Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "S390:

   - Changes to FPU handling came in via the main s390 pull request

   - Only deliver to the guest the SCLP events that userspace has
     requested

   - More virtual vs physical address fixes (only a cleanup since
     virtual and physical address spaces are currently the same)

   - Fix selftests undefined behavior

  x86:

   - Fix a restriction that the guest can't program a PMU event whose
     encoding matches an architectural event that isn't included in the
     guest CPUID. The enumeration of an architectural event only says
     that if a CPU supports an architectural event, then the event can
     be programmed *using the architectural encoding*. The enumeration
     does NOT say anything about the encoding when the CPU doesn't
     report support the event *in general*. It might support it, and it
     might support it using the same encoding that made it into the
     architectural PMU spec

   - Fix a variety of bugs in KVM's emulation of RDPMC (more details on
     individual commits) and add a selftest to verify KVM correctly
     emulates RDMPC, counter availability, and a variety of other
     PMC-related behaviors that depend on guest CPUID and therefore are
     easier to validate with selftests than with custom guests (aka
     kvm-unit-tests)

   - Zero out PMU state on AMD if the virtual PMU is disabled, it does
     not cause any bug but it wastes time in various cases where KVM
     would check if a PMC event needs to be synthesized

   - Optimize triggering of emulated events, with a nice ~10%
     performance improvement in VM-Exit microbenchmarks when a vPMU is
     exposed to the guest

   - Tighten the check for "PMI in guest" to reduce false positives if
     an NMI arrives in the host while KVM is handling an IRQ VM-Exit

   - Fix a bug where KVM would report stale/bogus exit qualification
     information when exiting to userspace with an internal error exit
     code

   - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support

   - Rework TDP MMU root unload, free, and alloc to run with mmu_lock
     held for read, e.g. to avoid serializing vCPUs when userspace
     deletes a memslot

   - Tear down TDP MMU page tables at 4KiB granularity (used to be
     1GiB). KVM doesn't support yielding in the middle of processing a
     zap, and 1GiB granularity resulted in multi-millisecond lags that
     are quite impolite for CONFIG_PREEMPT kernels

   - Allocate write-tracking metadata on-demand to avoid the memory
     overhead when a kernel is built with i915 virtualization support
     but the workloads use neither shadow paging nor i915 virtualization

   - Explicitly initialize a variety of on-stack variables in the
     emulator that triggered KMSAN false positives

   - Fix the debugregs ABI for 32-bit KVM

   - Rework the "force immediate exit" code so that vendor code
     ultimately decides how and when to force the exit, which allowed
     some optimization for both Intel and AMD

   - Fix a long-standing bug where kvm_has_noapic_vcpu could be left
     elevated if vCPU creation ultimately failed, causing extra
     unnecessary work

   - Cleanup the logic for checking if the currently loaded vCPU is
     in-kernel

   - Harden against underflowing the active mmu_notifier invalidation
     count, so that "bad" invalidations (usually due to bugs elsehwere
     in the kernel) are detected earlier and are less likely to hang the
     kernel

  x86 Xen emulation:

   - Overlay pages can now be cached based on host virtual address,
     instead of guest physical addresses. This removes the need to
     reconfigure and invalidate the cache if the guest changes the gpa
     but the underlying host virtual address remains the same

   - When possible, use a single host TSC value when computing the
     deadline for Xen timers in order to improve the accuracy of the
     timer emulation

   - Inject pending upcall events when the vCPU software-enables its
     APIC to fix a bug where an upcall can be lost (and to follow Xen's
     behavior)

   - Fall back to the slow path instead of warning if "fast" IRQ
     delivery of Xen events fails, e.g. if the guest has aliased xAPIC
     IDs

  RISC-V:

   - Support exception and interrupt handling in selftests

   - New self test for RISC-V architectural timer (Sstc extension)

   - New extension support (Ztso, Zacas)

   - Support userspace emulation of random number seed CSRs

  ARM:

   - Infrastructure for building KVM's trap configuration based on the
     architectural features (or lack thereof) advertised in the VM's ID
     registers

   - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
     x86's WC) at stage-2, improving the performance of interacting with
     assigned devices that can tolerate it

   - Conversion of KVM's representation of LPIs to an xarray, utilized
     to address serialization some of the serialization on the LPI
     injection path

   - Support for _architectural_ VHE-only systems, advertised through
     the absence of FEAT_E2H0 in the CPU's ID register

   - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
     selftests

  LoongArch:

   - Set reserved bits as zero in CPUCFG

   - Start SW timer only when vcpu is blocking

   - Do not restart SW timer when it is expired

   - Remove unnecessary CSR register saving during enter guest

   - Misc cleanups and fixes as usual

  Generic:

   - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
     always true on all architectures except MIPS (where Kconfig
     determines the available depending on CPU capabilities). It is
     replaced either by an architecture-dependent symbol for MIPS, and
     IS_ENABLED(CONFIG_KVM) everywhere else

   - Factor common "select" statements in common code instead of
     requiring each architecture to specify it

   - Remove thoroughly obsolete APIs from the uapi headers

   - Move architecture-dependent stuff to uapi/asm/kvm.h

   - Always flush the async page fault workqueue when a work item is
     being removed, especially during vCPU destruction, to ensure that
     there are no workers running in KVM code when all references to
     KVM-the-module are gone, i.e. to prevent a very unlikely
     use-after-free if kvm.ko is unloaded

   - Grab a reference to the VM's mm_struct in the async #PF worker
     itself instead of gifting the worker a reference, so that there's
     no need to remember to *conditionally* clean up after the worker

  Selftests:

   - Reduce boilerplate especially when utilize selftest TAP
     infrastructure

   - Add basic smoke tests for SEV and SEV-ES, along with a pile of
     library support for handling private/encrypted/protected memory

   - Fix benign bugs where tests neglect to close() guest_memfd files"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
  selftests: kvm: remove meaningless assignments in Makefiles
  KVM: riscv: selftests: Add Zacas extension to get-reg-list test
  RISC-V: KVM: Allow Zacas extension for Guest/VM
  KVM: riscv: selftests: Add Ztso extension to get-reg-list test
  RISC-V: KVM: Allow Ztso extension for Guest/VM
  RISC-V: KVM: Forward SEED CSR access to user space
  KVM: riscv: selftests: Add sstc timer test
  KVM: riscv: selftests: Change vcpu_has_ext to a common function
  KVM: riscv: selftests: Add guest helper to get vcpu id
  KVM: riscv: selftests: Add exception handling support
  LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
  LoongArch: KVM: Do not restart SW timer when it is expired
  LoongArch: KVM: Start SW timer only when vcpu is blocking
  LoongArch: KVM: Set reserved bits as zero in CPUCFG
  KVM: selftests: Explicitly close guest_memfd files in some gmem tests
  KVM: x86/xen: fix recursive deadlock in timer injection
  KVM: pfncache: simplify locking and make more self-contained
  KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
  KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
  KVM: x86/xen: improve accuracy of Xen timers
  ...
2024-03-15 13:03:13 -07:00
Ingo Molnar
2e2bc42c83 Merge branch 'linus' into x86/boot, to resolve conflict
There's a new conflict with Linus's upstream tree, because
in the following merge conflict resolution in <asm/coco.h>:

  38b334fc76 Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Linus has resolved the conflicting placement of 'cc_mask' better
than the original commit:

  1c811d403a x86/sev: Fix position dependent variable references in startup code

... which was also done by an internal merge resolution:

  2e5fc4786b Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up dependent tree

But Linus is right in 38b334fc76, the 'cc_mask' declaration is sufficient
within the #ifdef CONFIG_ARCH_HAS_CC_PLATFORM block.

So instead of forcing Linus to do the same resolution again, merge in Linus's
tree and follow his conflict resolution.

 Conflicts:
	arch/x86/include/asm/coco.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-03-12 09:55:57 +01:00
Linus Torvalds
d69ad12c78 Merge tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Ingo Molnar:

 - Reduce <asm/bootparam.h> dependencies

 - Simplify <asm/efi.h>

 - Unify *_setup_data definitions into <asm/setup_data.h>

 - Reduce the size of <asm/bootparam.h>

* tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Do not include <asm/bootparam.h> in several files
  x86/efi: Implement arch_ima_efi_boot_mode() in source file
  x86/setup: Move internal setup_data structures into setup_data.h
  x86/setup: Move UAPI setup structures into setup_data.h
2024-03-11 19:23:16 -07:00
Ard Biesheuvel
cd0d9d92c8 x86/boot: Move mem_encrypt= parsing to the decompressor
The early SME/SEV code parses the command line very early, in order to
decide whether or not memory encryption should be enabled, which needs
to occur even before the initial page tables are created.

This is problematic for a number of reasons:
- this early code runs from the 1:1 mapping provided by the decompressor
  or firmware, which uses a different translation than the one assumed by
  the linker, and so the code needs to be built in a special way;
- parsing external input while the entire kernel image is still mapped
  writable is a bad idea in general, and really does not belong in
  security minded code;
- the current code ignores the built-in command line entirely (although
  this appears to be the case for the entire decompressor)

Given that the decompressor/EFI stub is an intrinsic part of the x86
bootable kernel image, move the command line parsing there and out of
the core kernel. This removes the need to build lib/cmdline.o in a
special way, or to use RIP-relative LEA instructions in inline asm
blocks.

This involves a new xloadflag in the setup header to indicate
that mem_encrypt=on appeared on the kernel command line.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-17-ardb+git@google.com
2024-03-04 18:12:28 +01:00
Paul Durrant
3991f35805 KVM: x86/xen: allow vcpu_info to be mapped by fixed HVA
If the guest does not explicitly set the GPA of vcpu_info structure in
memory then, for guests with 32 vCPUs or fewer, the vcpu_info embedded
in the shared_info page may be used. As described in a previous commit,
the shared_info page is an overlay at a fixed HVA within the VMM, so in
this case it also more optimal to activate the vcpu_info cache with a
fixed HVA to avoid unnecessary invalidation if the guest memory layout
is modified.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-14-paul@xen.org
[sean: use kvm_gpc_is_{gpa,hva}_active()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 07:01:17 -08:00
Paul Durrant
b9220d3279 KVM: x86/xen: allow shared_info to be mapped by fixed HVA
The shared_info page is not guest memory as such. It is a dedicated page
allocated by the VMM and overlaid onto guest memory in a GFN chosen by the
guest and specified in the XENMEM_add_to_physmap hypercall. The guest may
even request that shared_info be moved from one GFN to another by
re-issuing that hypercall, but the HVA is never going to change.

Because the shared_info page is an overlay the memory slots need to be
updated in response to the hypercall. However, memory slot adjustment is
not atomic and, whilst all vCPUs are paused, there is still the possibility
that events may be delivered (which requires the shared_info page to be
updated) whilst the shared_info GPA is absent. The HVA is never absent
though, so it makes much more sense to use that as the basis for the
kernel's mapping.

Hence add a new KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA attribute type for this
purpose and a KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag to advertize its
availability. Don't actually advertize it yet though. That will be done in
a subsequent patch, which will also add tests for the new attribute type.

Also update the KVM API documentation with the new attribute and also fix
it up to consistently refer to 'shared_info' (with the underscore).

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240215152916.1158-13-paul@xen.org
[sean: store "hva" as a user pointer, use kvm_gpc_is_{gpa,hva}_active()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22 07:00:52 -08:00
Paolo Bonzini
6bda055d62 KVM: define __KVM_HAVE_GUEST_DEBUG unconditionally
Since all architectures (for historical reasons) have to define
struct kvm_guest_debug_arch, and since userspace has to check
KVM_CHECK_EXTENSION(KVM_CAP_SET_GUEST_DEBUG) anyway, there is
no advantage in masking the capability #define itself.  Remove
the #define __KVM_HAVE_GUEST_DEBUG from architecture-specific
headers.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:41:06 -05:00
Paolo Bonzini
8886640dad kvm: replace __KVM_HAVE_READONLY_MEM with Kconfig symbol
KVM uses __KVM_HAVE_* symbols in the architecture-dependent uapi/asm/kvm.h to mask
unused definitions in include/uapi/linux/kvm.h.  __KVM_HAVE_READONLY_MEM however
was nothing but a misguided attempt to define KVM_CAP_READONLY_MEM only on
architectures where KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM) could possibly
return nonzero.  This however does not make sense, and it prevented userspace
from supporting this architecture-independent feature without recompilation.

Therefore, these days __KVM_HAVE_READONLY_MEM does not mask anything and
is only used in virt/kvm/kvm_main.c.  Userspace does not need to test it
and there should be no need for it to exist.  Remove it and replace it
with a Kconfig symbol within Linux source code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:41:06 -05:00
Paolo Bonzini
bcac047727 KVM: x86: move x86-specific structs to uapi/asm/kvm.h
Several capabilities that exist only on x86 nevertheless have their
structs defined in include/uapi/linux/kvm.h.  Move them to
arch/x86/include/uapi/asm/kvm.h for cleanliness.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:41:04 -05:00
Paolo Bonzini
458822416a kvm: x86: use a uapi-friendly macro for GENMASK
Change uapi header uses of GENMASK to instead use the uapi/linux/bits.h bit
macros, since GENMASK is not defined in uapi headers.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:41:04 -05:00
Dionna Glaze
882dd4aee3 kvm: x86: use a uapi-friendly macro for BIT
Change uapi header uses of BIT to instead use the uapi/linux/const.h bit
macros, since BIT is not defined in uapi headers.

The PMU mask uses _BITUL since it targets a 32 bit flag field, whereas
the longmode definition is meant for a 64 bit flag field.

Cc: Sean Christophersen <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Dionna Glaze <dionnaglaze@google.com>
Message-Id: <20231207001142.3617856-1-dionnaglaze@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:41:03 -05:00
Xiaoyao Li
ccb2280ec2 x86/kvm: Use separate percpu variable to track the enabling of asyncpf
Refer to commit fd10cde929 ("KVM paravirt: Add async PF initialization
to PV guest") and commit 344d9588a9 ("KVM: Add PV MSR to enable
asynchronous page faults delivery"). It turns out that at the time when
asyncpf was introduced, the purpose was defining the shared PV data 'struct
kvm_vcpu_pv_apf_data' with the size of 64 bytes. However, it made a mistake
and defined the size to 68 bytes, which failed to make fit in a cache line
and made the code inconsistent with the documentation.

Below justification quoted from Sean[*]

  KVM (the host side) has *never* read kvm_vcpu_pv_apf_data.enabled, and
  the documentation clearly states that enabling is based solely on the
  bit in the synthetic MSR.

  So rather than update the documentation, fix the goof by removing the
  enabled filed and use the separate percpu variable instread.
  KVM-as-a-host obviously doesn't enforce anything or consume the size,
  and changing the header will only affect guests that are rebuilt against
  the new header, so there's no chance of ABI breakage between KVM and its
  guests. The only possible breakage is if some other hypervisor is
  emulating KVM's async #PF (LOL) and relies on the guest to set
  kvm_vcpu_pv_apf_data.enabled. But (a) I highly doubt such a hypervisor
  exists, (b) that would arguably be a violation of KVM's "spec", and
  (c) the worst case scenario is that the guest would simply lose async
  #PF functionality.

[*] https://lore.kernel.org/all/ZS7ERnnRqs8Fl0ZF@google.com/T/#u

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20231025055914.1201792-2-xiaoyao.li@intel.com
[sean: use true/false instead of 1/0 for booleans]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06 10:58:56 -08:00
H. Peter Anvin (Intel)
ff45746fbf x86/cpu: Add X86_CR4_FRED macro
Add X86_CR4_FRED macro for the FRED bit in %cr4. This bit must not be
changed after initialization, so add it to the pinned CR4 bits.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-12-xin3.li@intel.com
2024-01-31 22:00:38 +01:00
Thomas Zimmermann
efd7def004 x86/setup: Move UAPI setup structures into setup_data.h
The type definition of struct pci_setup_rom in <asm/pci.h> requires
struct setup_data from <asm/bootparam.h>. Many drivers include
<linux/pci.h>, but do not use boot parameters. Changes to bootparam.h or
its included header files could easily trigger a large, unnecessary
rebuild of the kernel.

Moving struct setup_data and related code into its own header file
avoids including <asm/bootparam.h> in <asm/pci.h>. Instead include the
new header <asm/screen_data.h> and remove the include statement for
x86_init.h, which is unnecessary but pulls in bootparams.h.

Suggested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240112095000.8952-2-tzimmermann@suse.de
2024-01-30 15:17:06 +01:00
Linus Torvalds
09d1c6a80f Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "Generic:

   - Use memdup_array_user() to harden against overflow.

   - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
     architectures.

   - Clean up Kconfigs that all KVM architectures were selecting

   - New functionality around "guest_memfd", a new userspace API that
     creates an anonymous file and returns a file descriptor that refers
     to it. guest_memfd files are bound to their owning virtual machine,
     cannot be mapped, read, or written by userspace, and cannot be
     resized. guest_memfd files do however support PUNCH_HOLE, which can
     be used to switch a memory area between guest_memfd and regular
     anonymous memory.

   - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
     per-page attributes for a given page of guest memory; right now the
     only attribute is whether the guest expects to access memory via
     guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
     TDX or ARM64 pKVM is checked by firmware or hypervisor that
     guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
     the case of pKVM).

  x86:

   - Support for "software-protected VMs" that can use the new
     guest_memfd and page attributes infrastructure. This is mostly
     useful for testing, since there is no pKVM-like infrastructure to
     provide a meaningfully reduced TCB.

   - Fix a relatively benign off-by-one error when splitting huge pages
     during CLEAR_DIRTY_LOG.

   - Fix a bug where KVM could incorrectly test-and-clear dirty bits in
     non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
     a non-huge SPTE.

   - Use more generic lockdep assertions in paths that don't actually
     care about whether the caller is a reader or a writer.

   - let Xen guests opt out of having PV clock reported as "based on a
     stable TSC", because some of them don't expect the "TSC stable" bit
     (added to the pvclock ABI by KVM, but never set by Xen) to be set.

   - Revert a bogus, made-up nested SVM consistency check for
     TLB_CONTROL.

   - Advertise flush-by-ASID support for nSVM unconditionally, as KVM
     always flushes on nested transitions, i.e. always satisfies flush
     requests. This allows running bleeding edge versions of VMware
     Workstation on top of KVM.

   - Sanity check that the CPU supports flush-by-ASID when enabling SEV
     support.

   - On AMD machines with vNMI, always rely on hardware instead of
     intercepting IRET in some cases to detect unmasking of NMIs

   - Support for virtualizing Linear Address Masking (LAM)

   - Fix a variety of vPMU bugs where KVM fail to stop/reset counters
     and other state prior to refreshing the vPMU model.

   - Fix a double-overflow PMU bug by tracking emulated counter events
     using a dedicated field instead of snapshotting the "previous"
     counter. If the hardware PMC count triggers overflow that is
     recognized in the same VM-Exit that KVM manually bumps an event
     count, KVM would pend PMIs for both the hardware-triggered overflow
     and for KVM-triggered overflow.

   - Turn off KVM_WERROR by default for all configs so that it's not
     inadvertantly enabled by non-KVM developers, which can be
     problematic for subsystems that require no regressions for W=1
     builds.

   - Advertise all of the host-supported CPUID bits that enumerate
     IA32_SPEC_CTRL "features".

   - Don't force a masterclock update when a vCPU synchronizes to the
     current TSC generation, as updating the masterclock can cause
     kvmclock's time to "jump" unexpectedly, e.g. when userspace
     hotplugs a pre-created vCPU.

   - Use RIP-relative address to read kvm_rebooting in the VM-Enter
     fault paths, partly as a super minor optimization, but mostly to
     make KVM play nice with position independent executable builds.

   - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
     CONFIG_HYPERV as a minor optimization, and to self-document the
     code.

   - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
     "emulation" at build time.

  ARM64:

   - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
     granule sizes. Branch shared with the arm64 tree.

   - Large Fine-Grained Trap rework, bringing some sanity to the
     feature, although there is more to come. This comes with a prefix
     branch shared with the arm64 tree.

   - Some additional Nested Virtualization groundwork, mostly
     introducing the NV2 VNCR support and retargetting the NV support to
     that version of the architecture.

   - A small set of vgic fixes and associated cleanups.

  Loongarch:

   - Optimization for memslot hugepage checking

   - Cleanup and fix some HW/SW timer issues

   - Add LSX/LASX (128bit/256bit SIMD) support

  RISC-V:

   - KVM_GET_REG_LIST improvement for vector registers

   - Generate ISA extension reg_list using macros in get-reg-list
     selftest

   - Support for reporting steal time along with selftest

  s390:

   - Bugfixes

  Selftests:

   - Fix an annoying goof where the NX hugepage test prints out garbage
     instead of the magic token needed to run the test.

   - Fix build errors when a header is delete/moved due to a missing
     flag in the Makefile.

   - Detect if KVM bugged/killed a selftest's VM and print out a helpful
     message instead of complaining that a random ioctl() failed.

   - Annotate the guest printf/assert helpers with __printf(), and fix
     the various bugs that were lurking due to lack of said annotation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  KVM: x86: add missing "depends on KVM"
  KVM: fix direction of dependency on MMU notifiers
  KVM: introduce CONFIG_KVM_COMMON
  KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
  KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
  RISC-V: KVM: selftests: Add get-reg-list test for STA registers
  RISC-V: KVM: selftests: Add steal_time test support
  RISC-V: KVM: selftests: Add guest_sbi_probe_extension
  RISC-V: KVM: selftests: Move sbi_ecall to processor.c
  RISC-V: KVM: Implement SBI STA extension
  RISC-V: KVM: Add support for SBI STA registers
  RISC-V: KVM: Add support for SBI extension registers
  RISC-V: KVM: Add SBI STA info to vcpu_arch
  RISC-V: KVM: Add steal-update vcpu request
  RISC-V: KVM: Add SBI STA extension skeleton
  RISC-V: paravirt: Implement steal-time support
  RISC-V: Add SBI STA extension definitions
  RISC-V: paravirt: Add skeleton for pv-time support
  RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
  ...
2024-01-17 13:03:37 -08:00
Linus Torvalds
78273df7f6 Merge tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs
Pull header cleanups from Kent Overstreet:
 "The goal is to get sched.h down to a type only header, so the main
  thing happening in this patchset is splitting out various _types.h
  headers and dependency fixups, as well as moving some things out of
  sched.h to better locations.

  This is prep work for the memory allocation profiling patchset which
  adds new sched.h interdepencencies"

* tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits)
  Kill sched.h dependency on rcupdate.h
  kill unnecessary thread_info.h include
  Kill unnecessary kernel.h include
  preempt.h: Kill dependency on list.h
  rseq: Split out rseq.h from sched.h
  LoongArch: signal.c: add header file to fix build error
  restart_block: Trim includes
  lockdep: move held_lock to lockdep_types.h
  sem: Split out sem_types.h
  uidgid: Split out uidgid_types.h
  seccomp: Split out seccomp_types.h
  refcount: Split out refcount_types.h
  uapi/linux/resource.h: fix include
  x86/signal: kill dependency on time.h
  syscall_user_dispatch.h: split out *_types.h
  mm_types_task.h: Trim dependencies
  Split out irqflags_types.h
  ipc: Kill bogus dependency on spinlock.h
  shm: Slim down dependencies
  workqueue: Split out workqueue_types.h
  ...
2024-01-10 16:43:55 -08:00
Bjorn Helgaas
54aa699e80 arch/x86: Fix typos
Fix typos, most reported by "codespell arch/x86".  Only touches comments,
no code changes.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
2024-01-03 11:46:22 +01:00
Kent Overstreet
3447066982 x86/signal: kill dependency on time.h
this is unecessary, and was pulling in printk.h from uapi headers

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-12-20 19:26:31 -05:00
Paolo Bonzini
6c370dc653 Merge branch 'kvm-guestmemfd' into HEAD
Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd.  Guest-first memory allows KVM
to provide features, enhancements, and optimizations that are kludgly
or outright impossible to implement in a generic memory subsystem.

The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which
similar to the generic memfd_create(), creates an anonymous file and
returns a file descriptor that refers to it.  Again like "regular"
memfd files, guest_memfd files live in RAM, have volatile storage,
and are automatically released when the last reference is dropped.
The key differences between memfd files (and every other memory subystem)
is that guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
convert a guest memory area between the shared and guest-private states.

A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to
specify attributes for a given page of guest memory.  In the long term,
it will likely be extended to allow userspace to specify per-gfn RWX
protections, including allowing memory to be writable in the guest
without it also being writable in host userspace.

The immediate and driving use case for guest_memfd are Confidential
(CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.
For such use cases, being able to map memory into KVM guests without
requiring said memory to be mapped into the host is a hard requirement.
While SEV+ and TDX prevent untrusted software from reading guest private
data by encrypting guest memory, pKVM provides confidentiality and
integrity *without* relying on memory encryption.  In addition, with
SEV-SNP and especially TDX, accessing guest private memory can be fatal
to the host, i.e. KVM must be prevent host userspace from accessing
guest memory irrespective of hardware behavior.

Long term, guest_memfd may be useful for use cases beyond CoCo VMs,
for example hardening userspace against unintentional accesses to guest
memory.  As mentioned earlier, KVM's ABI uses userspace VMA protections to
define the allow guest protection (with an exception granted to mapping
guest memory executable), and similarly KVM currently requires the guest
mapping size to be a strict subset of the host userspace mapping size.
Decoupling the mappings sizes would allow userspace to precisely map
only what is needed and with the required permissions, without impacting
guest performance.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to DMA from or into guest memory).

guest_memfd is the result of 3+ years of development and exploration;
taking on memory management responsibilities in KVM was not the first,
second, or even third choice for supporting CoCo VMs.  But after many
failed attempts to avoid KVM-specific backing memory, and looking at
where things ended up, it is quite clear that of all approaches tried,
guest_memfd is the simplest, most robust, and most extensible, and the
right thing to do for KVM and the kernel at-large.

The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.

Pending post-merge work includes:
- hugepage support
- looking into using the restrictedmem framework for guest memory
- introducing a testing mechanism to poison memory, possibly using
  the same memory attributes introduced here
- SNP and TDX support

There are two non-KVM patches buried in the middle of this series:

  fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
2023-11-14 08:31:31 -05:00
Sean Christopherson
89ea60c2c7 KVM: x86: Add support for "protected VMs" that can utilize private memory
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that's isn't universally accessible.  I.e. relying SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-24-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-14 08:01:05 -05:00
Suma Hegde
5150542b8e platform/x86/amd/hsmp: add support for metrics tbl
AMD MI300 MCM provides GET_METRICS_TABLE message to retrieve
all the system management information from SMU.

The metrics table is made available as hexadecimal sysfs binary file
under per socket sysfs directory created at
/sys/devices/platform/amd_hsmp/socket%d/metrics_bin

Metrics table definitions will be documented as part of Public PPR.
The same is defined in the amd_hsmp.h header.

Signed-off-by: Suma Hegde <suma.hegde@amd.com>
Reviewed-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Link: https://lore.kernel.org/r/20231010120310.3464066-2-suma.hegde@amd.com
[ij: lseek -> lseek(), dram -> DRAM in dev_err()]
[ij: added period to terminate a documentation sentence]
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2023-10-12 16:29:58 +03:00
Thomas Huth
659df86a7b x86: Remove the arch_calc_vm_prot_bits() macro from the UAPI
The arch_calc_vm_prot_bits() macro uses VM_PKEY_BIT0 etc. which are
not part of the UAPI, so the macro is completely useless for userspace.

It is also hidden behind the CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
config switch which we shouldn't expose to userspace. Thus let's move
this macro into a new internal header instead.

Fixes: 8f62c88322 ("x86/mm/pkeys: Add arch-specific VMA protection bits")
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lore.kernel.org/r/20230906162658.142511-1-thuth@redhat.com
2023-09-06 23:50:46 +02:00
Rick Edgecombe
67840ad0fa x86/shstk: Add ARCH_SHSTK_STATUS
CRIU and GDB need to get the current shadow stack and WRSS enablement
status. This information is already available via /proc/pid/status, but
this is inconvenient for CRIU because it involves parsing the text output
in an area of the code where this is difficult. Provide a status
arch_prctl(), ARCH_SHSTK_STATUS for retrieving the status. Have arg2 be a
userspace address, and make the new arch_prctl simply copy the features
out to userspace.

Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-43-rick.p.edgecombe%40intel.com
2023-08-02 15:01:51 -07:00
Mike Rapoport
680ed2f15e x86/shstk: Add ARCH_SHSTK_UNLOCK
Userspace loaders may lock features before a CRIU restore operation has
the chance to set them to whatever state is required by the process
being restored. Allow a way for CRIU to unlock features. Add it as an
arch_prctl() like the other shadow stack operations, but restrict it being
called by the ptrace arch_pctl() interface.

[Merged into recent API changes, added commit log and docs]

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-42-rick.p.edgecombe%40intel.com
2023-08-02 15:01:51 -07:00
Rick Edgecombe
1d62c65372 x86/shstk: Support WRSS for userspace
For the current shadow stack implementation, shadow stacks contents can't
easily be provisioned with arbitrary data. This property helps apps
protect themselves better, but also restricts any potential apps that may
want to do exotic things at the expense of a little security.

The x86 shadow stack feature introduces a new instruction, WRSS, which
can be enabled to write directly to shadow stack memory from userspace.
Allow it to get enabled via the prctl interface.

Only enable the userspace WRSS instruction, which allows writes to
userspace shadow stacks from userspace. Do not allow it to be enabled
independently of shadow stack, as HW does not support using WRSS when
shadow stack is disabled.

>From a fault handler perspective, WRSS will behave very similar to WRUSS,
which is treated like a user access from a #PF err code perspective.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-36-rick.p.edgecombe%40intel.com
2023-08-02 15:01:51 -07:00
Rick Edgecombe
c35559f94e x86/shstk: Introduce map_shadow_stack syscall
When operating with shadow stacks enabled, the kernel will automatically
allocate shadow stacks for new threads, however in some cases userspace
will need additional shadow stacks. The main example of this is the
ucontext family of functions, which require userspace allocating and
pivoting to userspace managed stacks.

Unlike most other user memory permissions, shadow stacks need to be
provisioned with special data in order to be useful. They need to be setup
with a restore token so that userspace can pivot to them via the RSTORSSP
instruction. But, the security design of shadow stacks is that they
should not be written to except in limited circumstances. This presents a
problem for userspace, as to how userspace can provision this special
data, without allowing for the shadow stack to be generally writable.

Previously, a new PROT_SHADOW_STACK was attempted, which could be
mprotect()ed from RW permissions after the data was provisioned. This was
found to not be secure enough, as other threads could write to the
shadow stack during the writable window.

The kernel can use a special instruction, WRUSS, to write directly to
userspace shadow stacks. So the solution can be that memory can be mapped
as shadow stack permissions from the beginning (never generally writable
in userspace), and the kernel itself can write the restore token.

First, a new madvise() flag was explored, which could operate on the
PROT_SHADOW_STACK memory. This had a couple of downsides:
1. Extra checks were needed in mprotect() to prevent writable memory from
   ever becoming PROT_SHADOW_STACK.
2. Extra checks/vma state were needed in the new madvise() to prevent
   restore tokens being written into the middle of pre-used shadow stacks.
   It is ideal to prevent restore tokens being added at arbitrary
   locations, so the check was to make sure the shadow stack had never been
   written to.
3. It stood out from the rest of the madvise flags, as more of direct
   action than a hint at future desired behavior.

So rather than repurpose two existing syscalls (mmap, madvise) that don't
quite fit, just implement a new map_shadow_stack syscall to allow
userspace to map and setup new shadow stacks in one step. While ucontext
is the primary motivator, userspace may have other unforeseen reasons to
setup its own shadow stacks using the WRSS instruction. Towards this
provide a flag so that stacks can be optionally setup securely for the
common case of ucontext without enabling WRSS. Or potentially have the
kernel set up the shadow stack in some new way.

The following example demonstrates how to create a new shadow stack with
map_shadow_stack:
void *shstk = map_shadow_stack(addr, stack_size, SHADOW_STACK_SET_TOKEN);

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-35-rick.p.edgecombe%40intel.com
2023-08-02 15:01:51 -07:00
Rick Edgecombe
2d39a6add4 x86/shstk: Add user-mode shadow stack support
Introduce basic shadow stack enabling/disabling/allocation routines.
A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag
and has a fixed size of min(RLIMIT_STACK, 4GB).

Keep the task's shadow stack address and size in thread_struct. This will
be copied when cloning new threads, but needs to be cleared during exec,
so add a function to do this.

32 bit shadow stack is not expected to have many users and it will
complicate the signal implementation. So do not support IA32 emulation
or x32.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-29-rick.p.edgecombe%40intel.com
2023-08-02 15:01:50 -07:00
Rick Edgecombe
98cfa46309 x86: Introduce userspace API for shadow stack
Add three new arch_prctl() handles:

 - ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified
   feature. Returns 0 on success or a negative value on error.

 - ARCH_SHSTK_LOCK prevents future disabling or enabling of the
   specified feature. Returns 0 on success or a negative value
   on error.

The features are handled per-thread and inherited over fork(2)/clone(2),
but reset on exec().

Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-27-rick.p.edgecombe%40intel.com
2023-08-02 15:01:50 -07:00
Rick Edgecombe
29f890d105 x86/mm: Introduce MAP_ABOVE4G
The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which require some core mm changes to function
properly.

One of the properties is that the shadow stack pointer (SSP), which is a
CPU register that points to the shadow stack like the stack pointer points
to the stack, can't be pointing outside of the 32 bit address space when
the CPU is executing in 32 bit mode. It is desirable to prevent executing
in 32 bit mode when shadow stack is enabled because the kernel can't easily
support 32 bit signals.

On x86 it is possible to transition to 32 bit mode without any special
interaction with the kernel, by doing a "far call" to a 32 bit segment.
So the shadow stack implementation can use this address space behavior
as a feature, by enforcing that shadow stack memory is always mapped
outside of the 32 bit address space. This way userspace will trigger a
general protection fault which will in turn trigger a segfault if it
tries to transition to 32 bit mode with shadow stack enabled.

This provides a clean error generating border for the user if they try
attempt to do 32 bit mode shadow stack, rather than leave the kernel in a
half working state for userspace to be surprised by.

So to allow future shadow stack enabling patches to map shadow stacks
out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior
is pretty much like MAP_32BIT, except that it has the opposite address
range. The are a few differences though.

If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the
MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit
syscall.

Since the default search behavior is top down, the normal kaslr base can
be used for MAP_ABOVE4G. This is unlike MAP_32BIT which has to add its
own randomization in the bottom up case.

For MAP_32BIT, only the bottom up search path is used. For MAP_ABOVE4G
both are potentially valid, so both are used. In the bottomup search
path, the default behavior is already consistent with MAP_ABOVE4G since
mmap base should be above 4GB.

Without MAP_ABOVE4G, the shadow stack will already normally be above 4GB.
So without introducing MAP_ABOVE4G, trying to transition to 32 bit mode
with shadow stack enabled would usually segfault anyway. This is already
pretty decent guard rails. But the addition of MAP_ABOVE4G is some small
complexity spent to make it make it more complete.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-21-rick.p.edgecombe%40intel.com
2023-07-11 14:12:19 -07:00
Juergen Gross
973df19420 x86/mtrr: Don't let mtrr_type_lookup() return MTRR_TYPE_INVALID
mtrr_type_lookup() should always return a valid memory type. In case
there is no information available, it should return the default UC.

This will remove the last case where mtrr_type_lookup() can return
MTRR_TYPE_INVALID, so adjust the comment in include/uapi/asm/mtrr.h.

Note that removing the MTRR_TYPE_INVALID #define from that header
could break user code, so it has to stay.

At the same time the mtrr_type_lookup() stub for the !CONFIG_MTRR
case should set uniform to 1, as if the memory range would be
covered by no MTRR at all.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-15-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-06-01 15:04:33 +02:00
Juergen Gross
d053b481a5 x86/mtrr: Replace size_or_mask and size_and_mask with a much easier concept
Replace size_or_mask and size_and_mask with the much easier concept of
high reserved bits.

While at it, instead of using constants in the MTRR code, use some new

  [ bp:
   - Drop mtrr_set_mask()
   - Unbreak long lines
   - Move struct mtrr_state_type out of the uapi header as it doesn't
     belong there. It also fixes a HDRTEST breakage "unknown type name ‘bool’"
     as Reported-by: kernel test robot <lkp@intel.com>
   - Massage.
  ]

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230502120931.20719-3-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-06-01 15:04:23 +02:00
Linus Torvalds
c8c655c34e Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "s390:

   - More phys_to_virt conversions

   - Improvement of AP management for VSIE (nested virtualization)

  ARM64:

   - Numerous fixes for the pathological lock inversion issue that
     plagued KVM/arm64 since... forever.

   - New framework allowing SMCCC-compliant hypercalls to be forwarded
     to userspace, hopefully paving the way for some more features being
     moved to VMMs rather than be implemented in the kernel.

   - Large rework of the timer code to allow a VM-wide offset to be
     applied to both virtual and physical counters as well as a
     per-timer, per-vcpu offset that complements the global one. This
     last part allows the NV timer code to be implemented on top.

   - A small set of fixes to make sure that we don't change anything
     affecting the EL1&0 translation regime just after having having
     taken an exception to EL2 until we have executed a DSB. This
     ensures that speculative walks started in EL1&0 have completed.

   - The usual selftest fixes and improvements.

  x86:

   - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
     enabled, and by giving the guest control of CR0.WP when EPT is
     enabled on VMX (VMX-only because SVM doesn't support per-bit
     controls)

   - Add CR0/CR4 helpers to query single bits, and clean up related code
     where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
     return as a bool

   - Move AMD_PSFD to cpufeatures.h and purge KVM's definition

   - Avoid unnecessary writes+flushes when the guest is only adding new
     PTEs

   - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
     optimizations when emulating invalidations

   - Clean up the range-based flushing APIs

   - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
     single A/D bit using a LOCK AND instead of XCHG, and skip all of
     the "handle changed SPTE" overhead associated with writing the
     entire entry

   - Track the number of "tail" entries in a pte_list_desc to avoid
     having to walk (potentially) all descriptors during insertion and
     deletion, which gets quite expensive if the guest is spamming
     fork()

   - Disallow virtualizing legacy LBRs if architectural LBRs are
     available, the two are mutually exclusive in hardware

   - Disallow writes to immutable feature MSRs (notably
     PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features

   - Overhaul the vmx_pmu_caps selftest to better validate
     PERF_CAPABILITIES

   - Apply PMU filters to emulated events and add test coverage to the
     pmu_event_filter selftest

   - AMD SVM:
       - Add support for virtual NMIs
       - Fixes for edge cases related to virtual interrupts

   - Intel AMX:
       - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
         XTILE_DATA is not being reported due to userspace not opting in
         via prctl()
       - Fix a bug in emulation of ENCLS in compatibility mode
       - Allow emulation of NOP and PAUSE for L2
       - AMX selftests improvements
       - Misc cleanups

  MIPS:

   - Constify MIPS's internal callbacks (a leftover from the hardware
     enabling rework that landed in 6.3)

  Generic:

   - Drop unnecessary casts from "void *" throughout kvm_main.c

   - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
     struct size by 8 bytes on 64-bit kernels by utilizing a padding
     hole

  Documentation:

   - Fix goof introduced by the conversion to rST"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
  KVM: s390: pci: fix virtual-physical confusion on module unload/load
  KVM: s390: vsie: clarifications on setting the APCB
  KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
  KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
  KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
  KVM: selftests: Test the PMU event "Instructions retired"
  KVM: selftests: Copy full counter values from guest in PMU event filter test
  KVM: selftests: Use error codes to signal errors in PMU event filter test
  KVM: selftests: Print detailed info in PMU event filter asserts
  KVM: selftests: Add helpers for PMC asserts in PMU event filter test
  KVM: selftests: Add a common helper for the PMU event filter guest code
  KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
  KVM: arm64: vhe: Drop extra isb() on guest exit
  KVM: arm64: vhe: Synchronise with page table walker on MMU update
  KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
  KVM: arm64: nvhe: Synchronise with page table walker on TLBI
  KVM: arm64: Handle 32bit CNTPCTSS traps
  KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
  KVM: arm64: vgic: Don't acquire its_lock before config_lock
  KVM: selftests: Add test to verify KVM's supported XCR0
  ...
2023-05-01 12:06:20 -07:00
Linus Torvalds
22b8cc3e78 Merge tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 LAM (Linear Address Masking) support from Dave Hansen:
 "Add support for the new Linear Address Masking CPU feature.

  This is similar to ARM's Top Byte Ignore and allows userspace to store
  metadata in some bits of pointers without masking it out before use"

* tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside
  x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
  selftests/x86/lam: Add test cases for LAM vs thread creation
  selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking
  selftests/x86/lam: Add inherit test cases for linear-address masking
  selftests/x86/lam: Add io_uring test cases for linear-address masking
  selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
  selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
  x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
  iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
  mm: Expose untagging mask in /proc/$PID/status
  x86/mm: Provide arch_prctl() interface for LAM
  x86/mm: Reduce untagged_addr() overhead for systems without LAM
  x86/uaccess: Provide untagged_addr() and remove tags before address check
  mm: Introduce untagged_addr_remote()
  x86/mm: Handle LAM on context switch
  x86: CPUID and CR3/CR4 flags for Linear Address Masking
  x86: Allow atomic MM_CONTEXT flags setting
  x86/mm: Rework address range check in get_user() and put_user()
2023-04-28 09:43:49 -07:00
Oliver Upton
e65733b5c5 KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL
The 'longmode' field is a bit annoying as it blows an entire __u32 to
represent a boolean value. Since other architectures are looking to add
support for KVM_EXIT_HYPERCALL, now is probably a good time to clean it
up.

Redefine the field (and the remaining padding) as a set of flags.
Preserve the existing ABI by using bit 0 to indicate if the guest was in
long mode and requiring that the remaining 31 bits must be zero.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-2-oliver.upton@linux.dev
2023-04-05 12:07:41 +01:00
Chang S. Bae
a03c376eba x86/arch_prctl: Add AMX feature numbers as ABI constants
Each distinct XSAVE feature has a number assigned to it.  Among other
things, the number determines the ordering of features in the XSAVE
buffer and is also used to generate XSAVE bitmasks like the value
for XCR0.

AMX state is dynamically enabled by the architecture-specific prctl().
This prctl() takes one XSAVE feature number as an argument.  However, the
feature numbers are not defined in any readily available userspace headers.
The means that each userspace app trying to use dynamic feature prctl()s
will likely end up defining their own constants for each feature.

Since these feature numbers are a part of the uabi, expose them in the
prctl() uabi header.  Save everyone the trouble of looking them up and
defining their own.

[ dhansen: expand changelog a bit ]

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20230121001900.14900-3-chang.seok.bae%40intel.com
2023-03-22 13:02:33 -07:00
Kirill A. Shutemov
23e5d9ec2b x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
IOMMU and SVA-capable devices know nothing about LAM and only expect
canonical addresses. An attempt to pass down tagged pointer will lead
to address translation failure.

By default do not allow to enable both LAM and use SVA in the same
process.

The new ARCH_FORCE_TAGGED_SVA arch_prctl() overrides the limitation.
By using the arch_prctl() userspace takes responsibility to never pass
tagged address to the device.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20230312112612.31869-12-kirill.shutemov%40linux.intel.com
2023-03-16 13:08:40 -07:00