Pull KVM updates from Paolo Bonzini:
"ARM:
- Move a lot of state that was previously stored on a per vcpu basis
into a per-CPU area, because it is only pertinent to the host while
the vcpu is loaded. This results in better state tracking, and a
smaller vcpu structure.
- Add full handling of the ERET/ERETAA/ERETAB instructions in nested
virtualisation. The last two instructions also require emulating
part of the pointer authentication extension. As a result, the trap
handling of pointer authentication has been greatly simplified.
- Turn the global (and not very scalable) LPI translation cache into
a per-ITS, scalable cache, making non directly injected LPIs much
cheaper to make visible to the vcpu.
- A batch of pKVM patches, mostly fixes and cleanups, as the
upstreaming process seems to be resuming. Fingers crossed!
- Allocate PPIs and SGIs outside of the vcpu structure, allowing for
smaller EL2 mapping and some flexibility in implementing more or
less than 32 private IRQs.
- Purge stale mpidr_data if a vcpu is created after the MPIDR map has
been created.
- Preserve vcpu-specific ID registers across a vcpu reset.
- Various minor cleanups and improvements.
LoongArch:
- Add ParaVirt IPI support
- Add software breakpoint support
- Add mmio trace events support
RISC-V:
- Support guest breakpoints using ebreak
- Introduce per-VCPU mp_state_lock and reset_cntx_lock
- Virtualize SBI PMU snapshot and counter overflow interrupts
- New selftests for SBI PMU and Guest ebreak
- Some preparatory work for both TDX and SNP page fault handling.
This also cleans up the page fault path, so that the priorities of
various kinds of fauls (private page, no memory, write to read-only
slot, etc.) are easier to follow.
x86:
- Minimize amount of time that shadow PTEs remain in the special
REMOVED_SPTE state.
This is a state where the mmu_lock is held for reading but
concurrent accesses to the PTE have to spin; shortening its use
allows other vCPUs to repopulate the zapped region while the zapper
finishes tearing down the old, defunct page tables.
- Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID
field, which is defined by hardware but left for software use.
This lets KVM communicate its inability to map GPAs that set bits
51:48 on hosts without 5-level nested page tables. Guest firmware
is expected to use the information when mapping BARs; this avoids
that they end up at a legal, but unmappable, GPA.
- Fixed a bug where KVM would not reject accesses to MSR that aren't
supposed to exist given the vCPU model and/or KVM configuration.
- As usual, a bunch of code cleanups.
x86 (AMD):
- Implement a new and improved API to initialize SEV and SEV-ES VMs,
which will also be extendable to SEV-SNP.
The new API specifies the desired encryption in KVM_CREATE_VM and
then separately initializes the VM. The new API also allows
customizing the desired set of VMSA features; the features affect
the measurement of the VM's initial state, and therefore enabling
them cannot be done tout court by the hypervisor.
While at it, the new API includes two bugfixes that couldn't be
applied to the old one without a flag day in userspace or without
affecting the initial measurement. When a SEV-ES VM is created with
the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected
once the VMSA has been encrypted. Also, the FPU and AVX state will
be synchronized and encrypted too.
- Support for GHCB version 2 as applicable to SEV-ES guests.
This, once more, is only accessible when using the new
KVM_SEV_INIT2 flow for initialization of SEV-ES VMs.
x86 (Intel):
- An initial bunch of prerequisite patches for Intel TDX were merged.
They generally don't do anything interesting. The only somewhat
user visible change is a new debugging mode that checks that KVM's
MMU never triggers a #VE virtualization exception in the guest.
- Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig
VM-Exit to L1, as per the SDM.
Generic:
- Use vfree() instead of kvfree() for allocations that always use
vcalloc() or __vcalloc().
- Remove .change_pte() MMU notifier - the changes to non-KVM code are
small and Andrew Morton asked that I also take those through the
KVM tree.
The callback was only ever implemented by KVM (which was also the
original user of MMU notifiers) but it had been nonfunctional ever
since calls to set_pte_at_notify were wrapped with
invalidate_range_start and invalidate_range_end... in 2012.
Selftests:
- Enhance the demand paging test to allow for better reporting and
stressing of UFFD performance.
- Convert the steal time test to generate TAP-friendly output.
- Fix a flaky false positive in the xen_shinfo_test due to comparing
elapsed time across two different clock domains.
- Skip the MONITOR/MWAIT test if the host doesn't actually support
MWAIT.
- Avoid unnecessary use of "sudo" in the NX hugepage test wrapper
shell script, to play nice with running in a minimal userspace
environment.
- Allow skipping the RSEQ test's sanity check that the vCPU was able
to complete a reasonable number of KVM_RUNs, as the assert can fail
on a completely valid setup.
If the test is run on a large-ish system that is otherwise idle,
and the test isn't affined to a low-ish number of CPUs, the vCPU
task can be repeatedly migrated to CPUs that are in deep sleep
states, which results in the vCPU having very little net runtime
before the next migration due to high wakeup latencies.
- Define _GNU_SOURCE for all selftests to fix a warning that was
introduced by a change to kselftest_harness.h late in the 6.9
cycle, and because forcing every test to #define _GNU_SOURCE is
painful.
- Provide a global pseudo-RNG instance for all tests, so that library
code can generate random, but determinstic numbers.
- Use the global pRNG to randomly force emulation of select writes
from guest code on x86, e.g. to help validate KVM's emulation of
locked accesses.
- Allocate and initialize x86's GDT, IDT, TSS, segments, and default
exception handlers at VM creation, instead of forcing tests to
manually trigger the related setup.
Documentation:
- Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits)
selftests/kvm: remove dead file
KVM: selftests: arm64: Test vCPU-scoped feature ID registers
KVM: selftests: arm64: Test that feature ID regs survive a reset
KVM: selftests: arm64: Store expected register value in set_id_regs
KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope
KVM: arm64: Only reset vCPU-scoped feature ID regs once
KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs()
KVM: arm64: Rename is_id_reg() to imply VM scope
KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
KVM: arm64: Fix hvhe/nvhe early alias parsing
KVM: SEV: Allow per-guest configuration of GHCB protocol version
KVM: SEV: Add GHCB handling for termination requests
KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests
KVM: SEV: Add support to handle AP reset MSR protocol
KVM: x86: Explicitly zero kvm_caps during vendor module load
KVM: x86: Fully re-initialize supported_mce_cap on vendor module load
KVM: x86: Fully re-initialize supported_vm_types on vendor module load
KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns
KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values
...
SBI v2.0 SBI introduced PMU snapshot feature which adds the following
features.
1. Read counter values directly from the shared memory instead of
csr read.
2. Start multiple counters with initial values with one SBI call.
These functionalities optimizes the number of traps to the higher
privilege mode. If the kernel is in VS mode while the hypervisor
deploy trap & emulate method, this would minimize all the hpmcounter
CSR read traps. If the kernel is running in S-mode, the benefits
reduced to CSR latency vs DRAM/cache latency as there is no trap
involved while accessing the hpmcounter CSRs.
In both modes, it does saves the number of ecalls while starting
multiple counter together with an initial values. This is a likely
scenario if multiple counters overflow at the same time.
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20240420151741.962500-10-atishp@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Pull RISC-V updates from Palmer Dabbelt:
- Support for various vector-accelerated crypto routines
- Hibernation is now enabled for portable kernel builds
- mmap_rnd_bits_max is larger on systems with larger VAs
- Support for fast GUP
- Support for membarrier-based instruction cache synchronization
- Support for the Andes hart-level interrupt controller and PMU
- Some cleanups around unaligned access speed probing and Kconfig
settings
- Support for ACPI LPI and CPPC
- Various cleanus related to barriers
- A handful of fixes
* tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (66 commits)
riscv: Fix syscall wrapper for >word-size arguments
crypto: riscv - add vector crypto accelerated AES-CBC-CTS
crypto: riscv - parallelize AES-CBC decryption
riscv: Only flush the mm icache when setting an exec pte
riscv: Use kcalloc() instead of kzalloc()
riscv/barrier: Add missing space after ','
riscv/barrier: Consolidate fence definitions
riscv/barrier: Define RISCV_FULL_BARRIER
riscv/barrier: Define __{mb,rmb,wmb}
RISC-V: defconfig: Enable CONFIG_ACPI_CPPC_CPUFREQ
cpufreq: Move CPPC configs to common Kconfig and add RISC-V
ACPI: RISC-V: Add CPPC driver
ACPI: Enable ACPI_PROCESSOR for RISC-V
ACPI: RISC-V: Add LPI driver
cpuidle: RISC-V: Move few functions to arch/riscv
riscv: Introduce set_compat_task() in asm/compat.h
riscv: Introduce is_compat_thread() into compat.h
riscv: add compile-time test into is_compat_task()
riscv: Replace direct thread flag check with is_compat_task()
riscv: Improve arch_get_mmap_end() macro
...
This is really just a single patch, but since the offending fix hasn't
yet made it to my for-next I'm merging it here.
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
We used to unconditionnally expose the cycle and instret csrs to
userspace, which gives rise to security concerns.
So now we only allow access to hw counters from userspace through the perf
framework which will handle context switches, per-task events...etc. A
sysctl allows to revert the behaviour to the legacy mode so that userspace
applications which are not ready for this change do not break.
But the default value is to allow userspace only through perf: this will
break userspace applications which rely on direct access to rdcycle.
This choice was made for security reasons [1][2]: most of the applications
which use rdcycle can instead use rdtime to count the elapsed time.
[1] https://groups.google.com/a/groups.riscv.org/g/sw-dev/c/REWcwYnzsKE?pli=1
[2] https://www.youtube.com/watch?v=3-c4C_L2PRQ&ab_channel=IEEESymposiumonSecurityandPrivacy
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Pull more RISC-V updates from Palmer Dabbelt:
- Some cleanups and fixes for the Zbb-optimized string routines
- Support for custom (vendor or implementation defined) perf events
- COMMAND_LINE_SIZE has been increased to 1024
* tag 'riscv-for-linus-6.3-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Bump COMMAND_LINE_SIZE value to 1024
drivers/perf: RISC-V: Allow programming custom firmware events
riscv, lib: Fix Zbb strncmp
RISC-V: improve string-function assembly
Pull kvm updates from Paolo Bonzini:
"ARM:
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogenous systems. Non secure
software has no practical need to traverse the caches by set/way in
the first place
- Add support for taking stage-2 access faults in parallel. This was
an accidental omission in the original parallel faults
implementation, but should provide a marginal improvement to
machines w/o FEAT_HAFDBS (such as hardware from the fruit company)
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception
handling and masking unsupported features for nested guests
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at
reducing the trap overhead of running nested
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems
- Avoid VM-wide stop-the-world operations when a vCPU accesses its
own redistributor
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected
exceptions in the host
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add [Oliver]
as co-maintainer
RISC-V:
- Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE
- Correctly place the guest in S-mode after redirecting a trap to the
guest
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
s390:
- Sort out confusion between virtual and physical addresses, which
currently are the same on s390
- A new ioctl that performs cmpxchg on guest memory
- A few fixes
x86:
- Change tdp_mmu to a read-only parameter
- Separate TDP and shadow MMU page fault paths
- Enable Hyper-V invariant TSC control
- Fix a variety of APICv and AVIC bugs, some of them real-world, some
of them affecting architecurally legal but unlikely to happen in
practice
- Mark APIC timer as expired if its in one-shot mode and the count
underflows while the vCPU task was being migrated
- Advertise support for Intel's new fast REP string features
- Fix a double-shootdown issue in the emergency reboot code
- Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give
SVM similar treatment to VMX
- Update Xen's TSC info CPUID sub-leaves as appropriate
- Add support for Hyper-V's extended hypercalls, where "support" at
this point is just forwarding the hypercalls to userspace
- Clean up the kvm->lock vs. kvm->srcu sequences when updating the
PMU and MSR filters
- One-off fixes and cleanups
- Fix and cleanup the range-based TLB flushing code, used when KVM is
running on Hyper-V
- Add support for filtering PMU events using a mask. If userspace
wants to restrict heavily what events the guest can use, it can now
do so without needing an absurd number of filter entries
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel Sapphire Rapids
- Fix a mostly benign overflow bug in SEV's
send|receive_update_data()
- Move several SVM-specific flags into vcpu_svm
x86 Intel:
- Handle NMI VM-Exits before leaving the noinstr region
- A few trivial cleanups in the VM-Enter flows
- Stop enabling VMFUNC for L1 purely to document that KVM doesn't
support EPTP switching (or any other VM function) for L1
- Fix a crash when using eVMCS's enlighted MSR bitmaps
Generic:
- Clean up the hardware enable and initialization flow, which was
scattered around multiple arch-specific hooks. Instead, just let
the arch code call into generic code. Both x86 and ARM should
benefit from not having to fight common KVM code's notion of how to
do initialization
- Account allocations in generic kvm_arch_alloc_vm()
- Fix a memory leak if coalesced MMIO unregistration fails
selftests:
- On x86, cache the CPU vendor (AMD vs. Intel) and use the info to
emit the correct hypercall instruction instead of relying on KVM to
patch in VMMCALL
- Use TAP interface for kvm_binary_stats_test and tsc_msrs_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (325 commits)
KVM: SVM: hyper-v: placate modpost section mismatch error
KVM: x86/mmu: Make tdp_mmu_allowed static
KVM: arm64: nv: Use reg_to_encoding() to get sysreg ID
KVM: arm64: nv: Only toggle cache for virtual EL2 when SCTLR_EL2 changes
KVM: arm64: nv: Filter out unsupported features from ID regs
KVM: arm64: nv: Emulate EL12 register accesses from the virtual EL2
KVM: arm64: nv: Allow a sysreg to be hidden from userspace only
KVM: arm64: nv: Emulate PSTATE.M for a guest hypervisor
KVM: arm64: nv: Add accessors for SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
KVM: arm64: nv: Handle SMCs taken from virtual EL2
KVM: arm64: nv: Handle trapped ERET from virtual EL2
KVM: arm64: nv: Inject HVC exceptions to the virtual EL2
KVM: arm64: nv: Support virtual EL2 exceptions
KVM: arm64: nv: Handle HCR_EL2.NV system register traps
KVM: arm64: nv: Add nested virt VCPU primitives for vEL2 VCPU state
KVM: arm64: nv: Add EL2 system registers to vcpu context
KVM: arm64: nv: Allow userspace to set PSR_MODE_EL2x
KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set
KVM: arm64: nv: Introduce nested virtualization VCPU feature
KVM: arm64: Use the S2 MMU context to iterate over S2 table
...
Currently, the host driver doesn't have any method to identify if the
requested perf event is from kvm or bare metal. As KVM runs in HS
mode, there are no separate hypervisor privilege mode to distinguish
between the attributes for guest/host.
Improve the privilege mode filtering by using the event specific
config1 field.
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
KVM module needs to know how many hardware counters and the counter
width that the platform supports. Otherwise, it will not be able to show
optimal value of virtual counters to the guest. The virtual hardware
counters also need to have the same width as the logical hardware
counters for simplicity. However, there shouldn't be mapping between
virtual hardware counters and logical hardware counters. As we don't
support hetergeneous harts or counters with different width as of now,
the implementation relies on the counter width of the first available
programmable counter.
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>
With the T-HEAD C9XX cores being designed before or during the ratification
to the SSCOFPMF extension, it implements functionality very similar but
not equal to it.
It implements overflow handling and also some privilege-mode filtering.
While SSCOFPMF supports this for all modes, the C9XX only implements the
filtering for M-mode and S-mode but not user-mode.
So add some adaptions to allow the C9XX to still handle
its PMU through the regular SBI PMU interface instead of defining new
interfaces or drivers.
To work properly, this requires a matching change in SBI, though the actual
interface between kernel and SBI does not change.
The main differences are a the overflow CSR and irq number.
As the reading of the overflow-csr is in the hot-path during irq handling,
use an errata and alternatives to not introduce new conditionals there.
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://lore.kernel.org/all/20221011231841.2951264-2-heiko@sntech.de/
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Pull RISC-V updates from Palmer Dabbelt:
- Improvements to the CPU topology subsystem, which fix some issues
where RISC-V would report bad topology information.
- The default NR_CPUS has increased to XLEN, and the maximum
configurable value is 512.
- The CD-ROM filesystems have been enabled in the defconfig.
- Support for THP_SWAP has been added for rv64 systems.
There are also a handful of cleanups and fixes throughout the tree.
* tag 'riscv-for-linus-6.1-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: enable THP_SWAP for RV64
RISC-V: Print SSTC in canonical order
riscv: compat: s/failed/unsupported if compat mode isn't supported
RISC-V: Increase range and default value of NR_CPUS
cpuidle: riscv-sbi: Fix CPU_PM_CPU_IDLE_ENTER_xyz() macro usage
perf: RISC-V: throttle perf events
perf: RISC-V: exclude invalid pmu counters from SBI calls
riscv: enable CD-ROM file systems in defconfig
riscv: topology: fix default topology reporting
arm64: topology: move store_cpu_topology() to shared code
SBI firmware should report total number of firmware and hardware counters
including unused ones or special ones. In this case the kernel doesn't need
to make any assumptions about gaps in reported counters, e.g. excluded timer
counter. That was fixed in OpenSBI v1.1 by commit 3f66465fb6bf ("lib: pmu:
allow to use the highest available counter"). This kernel patch has no effect
if SBI firmware behaves correctly. However it eliminates access beyond the
allocated pmu_ctr_list if the kernel is used with OpenSBI older than v1.1.
Fixes: e999143459 ("RISC-V: Add perf platform driver based on SBI PMU extension")
Signed-off-by: Sergey Matyukevich <sergey.matyukevich@syntacore.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220830155306.301714-2-geomatsi@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Some of the SBI PMU calls does not pass 64bit arguments
correctly and not under RV32 compile time flags. Currently,
this doesn't create any incorrect results as RV64 ignores
any value in the additional register and qemu doesn't support
raw events.
Fix those SBI calls in order to set correct values for RV32.
Fixes: e999143459 ("RISC-V: Add perf platform driver based on SBI PMU extension")
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220711174632.4186047-4-atishp@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Currently, riscv_pmu_event_set_period updates the userpage mapping.
However, the caller of riscv_pmu_event_set_period should update
the userpage mapping because the counter can not be updated/started
from set_period function in counter overflow path.
Invoke the perf_event_update_userpage at the caller so that it
doesn't get invoked twice during counter start path.
Fixes: f5bfa23f57 ("RISC-V: Add a perf core library for pmu drivers")
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Guo Ren <guoren@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220711174632.4186047-3-atishp@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Currently, when the CPU is doing suspend to ram, we don't
save pmu counter register and its content will be lost.
To ensure perf profiling is not affected by suspend to ram,
this patch is based on arm_pmu CPU_PM notifier and implements riscv
pmu pm notifier. In the pm notifier, we stop the counter and update
the counter value before suspend and start the counter after resume.
Signed-off-by: Eric Lin <eric.lin@sifive.com>
Link: https://lore.kernel.org/r/20220705091920.27432-1-eric.lin@sifive.com
Signed-off-by: Will Deacon <will@kernel.org>
The sscofpmf extension allows counter overflow and filtering for
programmable counters. Enable the perf driver to handle the overflow
interrupt. The overflow interrupt is a hart local interrupt.
Thus, per cpu overflow interrupts are setup as a child under the root
INTC irq domain.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
RISC-V SBI specification added a PMU extension that allows to configure
start/stop any pmu counter. The RISC-V perf can use most of the generic
perf features except interrupt overflow and event filtering based on
privilege mode which will be added in future.
It also allows to monitor a handful of firmware counters that can provide
insights into firmware activity during a performance analysis.
Signed-off-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>