Commit Graph

710 Commits

Author SHA1 Message Date
Sean Christopherson
23ad135ae6 KVM: Call kvm_arch_memslots_updated() before updating memslots
commit 152482580a upstream.

kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.

Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots.  Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.

Fixes: e59dbe09f8 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-03-23 20:10:13 +01:00
Christoffer Dall
04131dfcb9 KVM: arm/arm64: vgic: Always initialize the group of private IRQs
[ Upstream commit ab2d5eb03d ]

We currently initialize the group of private IRQs during
kvm_vgic_vcpu_init, and the value of the group depends on the GIC model
we are emulating.  However, CPUs created before creating (and
initializing) the VGIC might end up with the wrong group if the VGIC
is created as GICv3 later.

Since we have no enforced ordering of creating the VGIC and creating
VCPUs, we can end up with part the VCPUs being properly intialized and
the remaining incorrectly initialized.  That also means that we have no
single place to do the per-cpu data structure initialization which
depends on knowing the emulated GIC model (which is only the group
field).

This patch removes the incorrect comment from kvm_vgic_vcpu_init and
initializes the group of all previously created VCPUs's private
interrupts in vgic_init in addition to the existing initialization in
kvm_vgic_vcpu_init.

Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:43 +01:00
Marc Zyngier
b78379c337 arm/arm64: KVM: Allow a VCPU to fully reset itself
[ Upstream commit 358b28f09f ]

The current kvm_psci_vcpu_on implementation will directly try to
manipulate the state of the VCPU to reset it.  However, since this is
not done on the thread that runs the VCPU, we can end up in a strangely
corrupted state when the source and target VCPUs are running at the same
time.

Fix this by factoring out all reset logic from the PSCI implementation
and forwarding the required information along with a request to the
target VCPU.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:43 +01:00
Julien Thierry
5f4a64b040 KVM: arm/arm64: vgic: Make vgic_dist->lpi_list_lock a raw_spinlock
[ Upstream commit fc3bc47523 ]

vgic_dist->lpi_list_lock must always be taken with interrupts disabled as
it is used in interrupt context.

For configurations such as PREEMPT_RT_FULL, this means that it should
be a raw_spinlock since RT spinlocks are interruptible.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23 20:09:42 +01:00
Mark Rutland
c709eeb02c arm64: KVM: Skip MMIO insn after emulation
[ Upstream commit 0d640732db ]

When we emulate an MMIO instruction, we advance the CPU state within
decode_hsr(), before emulating the instruction effects.

Having this logic in decode_hsr() is opaque, and advancing the state
before emulation is problematic. It gets in the way of applying
consistent single-step logic, and it prevents us from being able to fail
an MMIO instruction with a synchronous exception.

Clean this up by only advancing the CPU state *after* the effects of the
instruction are emulated.

Cc: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-12 19:47:12 +01:00
Christoffer Dall
4f14f446d1 KVM: arm/arm64: Fix VMID alloc race by reverting to lock-less
commit fb544d1ca6 upstream.

We recently addressed a VMID generation race by introducing a read/write
lock around accesses and updates to the vmid generation values.

However, kvm_arch_vcpu_ioctl_run() also calls need_new_vmid_gen() but
does so without taking the read lock.

As far as I can tell, this can lead to the same kind of race:

  VM 0, VCPU 0			VM 0, VCPU 1
  ------------			------------
  update_vttbr (vmid 254)
  				update_vttbr (vmid 1) // roll over
				read_lock(kvm_vmid_lock);
				force_vm_exit()
  local_irq_disable
  need_new_vmid_gen == false //because vmid gen matches

  enter_guest (vmid 254)
  				kvm_arch.vttbr = <PGD>:<VMID 1>
				read_unlock(kvm_vmid_lock);

  				enter_guest (vmid 1)

Which results in running two VCPUs in the same VM with different VMIDs
and (even worse) other VCPUs from other VMs could now allocate clashing
VMID 254 from the new generation as long as VCPU 0 is not exiting.

Attempt to solve this by making sure vttbr is updated before another CPU
can observe the updated VMID generation.

Cc: stable@vger.kernel.org
Fixes: f0cf47d939 "KVM: arm/arm64: Close VMID generation race"
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16 22:04:37 +01:00
Gustavo A. R. Silva
f318d0cf26 KVM: arm/arm64: vgic: Fix off-by-one bug in vgic_get_irq()
commit c23b2e6fc4 upstream.

When using the nospec API, it should be taken into account that:

"...if the CPU speculates past the bounds check then
 * array_index_nospec() will clamp the index within the range of [0,
 * size)."

The above is part of the header for macro array_index_nospec() in
linux/nospec.h

Now, in this particular case, if intid evaluates to exactly VGIC_MAX_SPI
or to exaclty VGIC_MAX_PRIVATE, the array_index_nospec() macro ends up
returning VGIC_MAX_SPI - 1 or VGIC_MAX_PRIVATE - 1 respectively, instead
of VGIC_MAX_SPI or VGIC_MAX_PRIVATE, which, based on the original logic:

	/* SGIs and PPIs */
	if (intid <= VGIC_MAX_PRIVATE)
 		return &vcpu->arch.vgic_cpu.private_irqs[intid];

 	/* SPIs */
	if (intid <= VGIC_MAX_SPI)
 		return &kvm->arch.vgic.spis[intid - VGIC_NR_PRIVATE_IRQS];

are valid values for intid.

Fix this by calling array_index_nospec() macro with VGIC_MAX_PRIVATE + 1
and VGIC_MAX_SPI + 1 as arguments for its parameter size.

Fixes: 41b87599c7 ("KVM: arm/arm64: vgic: fix possible spectre-v1 in vgic_get_irq()")
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
[dropped the SPI part which was fixed separately]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:49 +01:00
Christoffer Dall
47ffaa7dec KVM: arm/arm64: vgic-v2: Set active_source to 0 when restoring state
commit 60c3ab30d8 upstream.

When restoring the active state from userspace, we don't know which CPU
was the source for the active state, and this is not architecturally
exposed in any of the register state.

Set the active_source to 0 in this case.  In the future, we can expand
on this and exposse the information as additional information to
userspace for GICv2 if anyone cares.

Cc: stable@vger.kernel.org
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:49 +01:00
Marc Zyngier
6318b1b7c9 KVM: arm/arm64: vgic: Cap SPIs to the VM-defined maximum
commit bea2ef803a upstream.

SPIs should be checked against the VMs specific configuration, and
not the architectural maximum.

Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:49 +01:00
Julien Thierry
f0fcc4d17c KVM: arm/arm64: vgic: Do not cond_resched_lock() with IRQs disabled
commit 2e2f6c3c0b upstream.

To change the active state of an MMIO, halt is requested for all vcpus of
the affected guest before modifying the IRQ state. This is done by calling
cond_resched_lock() in vgic_mmio_change_active(). However interrupts are
disabled at this point and we cannot reschedule a vcpu.

We actually don't need any of this, as kvm_arm_halt_guest ensures that
all the other vcpus are out of the guest. Let's just drop that useless
code.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Suggested-by: Christoffer Dall <christoffer.dall@arm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:49 +01:00
Marc Zyngier
0fa6851804 arm/arm64: KVM: vgic: Force VM halt when changing the active state of GICv3 PPIs/SGIs
commit 107352a249 upstream.

We currently only halt the guest when a vCPU messes with the active
state of an SPI. This is perfectly fine for GICv2, but isn't enough
for GICv3, where all vCPUs can access the state of any other vCPU.

Let's broaden the condition to include any GICv3 interrupt that
has an active state (i.e. all but LPIs).

Cc: stable@vger.kernel.org
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09 17:38:48 +01:00
Mark Rutland
5957178501 KVM: arm64: Fix caching of host MDCR_EL2 value
commit da5a3ce66b upstream.

At boot time, KVM stashes the host MDCR_EL2 value, but only does this
when the kernel is not running in hyp mode (i.e. is non-VHE). In these
cases, the stashed value of MDCR_EL2.HPMN happens to be zero, which can
lead to CONSTRAINED UNPREDICTABLE behaviour.

Since we use this value to derive the MDCR_EL2 value when switching
to/from a guest, after a guest have been run, the performance counters
do not behave as expected. This has been observed to result in accesses
via PMXEVTYPER_EL0 and PMXEVCNTR_EL0 not affecting the relevant
counters, resulting in events not being counted. In these cases, only
the fixed-purpose cycle counter appears to work as expected.

Fix this by always stashing the host MDCR_EL2 value, regardless of VHE.

Cc: Christopher Dall <christoffer.dall@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@vger.kernel.org
Fixes: 1e947bad0b ("arm64: KVM: Skip HYP setup when already running in HYP")
Tested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13 11:08:47 -08:00
Punit Agrawal
3e286d39cc KVM: arm/arm64: Ensure only THP is candidate for adjustment
commit fd2ef35828 upstream.

PageTransCompoundMap() returns true for hugetlbfs and THP
hugepages. This behaviour incorrectly leads to stage 2 faults for
unsupported hugepage sizes (e.g., 64K hugepage with 4K pages) to be
treated as THP faults.

Tighten the check to filter out hugetlbfs pages. This also leads to
consistently mapping all unsupported hugepage sizes as PTE level
entries at stage 2.

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Reviewed-by: Suzuki Poulose <suzuki.poulose@arm.com>
Cc: Christoffer Dall <christoffer.dall@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13 11:08:47 -08:00
Marc Zyngier
a35381e10d KVM: Remove obsolete kvm_unmap_hva notifier backend
kvm_unmap_hva is long gone, and we only have kvm_unmap_hva_range to
deal with. Drop the now obsolete code.

Fixes: fb1522e099 ("KVM: update to new mmu_notifier semantic v2")
Cc: James Hogan <jhogan@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
2018-09-07 15:06:02 +02:00
Marc Zyngier
694556d54f KVM: arm/arm64: Clean dcache to PoC when changing PTE due to CoW
When triggering a CoW, we unmap the RO page via an MMU notifier
(invalidate_range_start), and then populate the new PTE using another
one (change_pte). In the meantime, we'll have copied the old page
into the new one.

The problem is that the data for the new page is sitting in the
cache, and should the guest have an uncached mapping to that page
(or its MMU off), following accesses will bypass the cache.

In a way, this is similar to what happens on a translation fault:
We need to clean the page to the PoC before mapping it. So let's just
do that.

This fixes a KVM unit test regression observed on a HiSilicon platform,
and subsequently reproduced on Seattle.

Fixes: a9c0e12ebe ("KVM: arm/arm64: Only clean the dcache on translation fault")
Cc: stable@vger.kernel.org # v4.16+
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
2018-09-07 15:05:40 +02:00
Paolo Bonzini
631989303b Merge tag 'kvmarm-for-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm updates for 4.19

- Support for Group0 interrupts in guests
- Cache management optimizations for ARMv8.4 systems
- Userspace interface for RAS, allowing error retrival and injection
- Fault path optimization
- Emulated physical timer fixes
- Random cleanups
2018-08-22 14:07:56 +02:00
Linus Torvalds
1202f4fdbc Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
 "A bunch of good stuff in here. Worth noting is that we've pulled in
  the x86/mm branch from -tip so that we can make use of the core
  ioremap changes which allow us to put down huge mappings in the
  vmalloc area without screwing up the TLB. Much of the positive
  diffstat is because of the rseq selftest for arm64.

  Summary:

   - Wire up support for qspinlock, replacing our trusty ticket lock
     code

   - Add an IPI to flush_icache_range() to ensure that stale
     instructions fetched into the pipeline are discarded along with the
     I-cache lines

   - Support for the GCC "stackleak" plugin

   - Support for restartable sequences, plus an arm64 port for the
     selftest

   - Kexec/kdump support on systems booting with ACPI

   - Rewrite of our syscall entry code in C, which allows us to zero the
     GPRs on entry from userspace

   - Support for chained PMU counters, allowing 64-bit event counters to
     be constructed on current CPUs

   - Ensure scheduler topology information is kept up-to-date with CPU
     hotplug events

   - Re-enable support for huge vmalloc/IO mappings now that the core
     code has the correct hooks to use break-before-make sequences

   - Miscellaneous, non-critical fixes and cleanups"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (90 commits)
  arm64: alternative: Use true and false for boolean values
  arm64: kexec: Add comment to explain use of __flush_icache_range()
  arm64: sdei: Mark sdei stack helper functions as static
  arm64, kaslr: export offset in VMCOREINFO ELF notes
  arm64: perf: Add cap_user_time aarch64
  efi/libstub: Only disable stackleak plugin for arm64
  arm64: drop unused kernel_neon_begin_partial() macro
  arm64: kexec: machine_kexec should call __flush_icache_range
  arm64: svc: Ensure hardirq tracing is updated before return
  arm64: mm: Export __sync_icache_dcache() for xen-privcmd
  drivers/perf: arm-ccn: Use devm_ioremap_resource() to map memory
  arm64: Add support for STACKLEAK gcc plugin
  arm64: Add stack information to on_accessible_stack
  drivers/perf: hisi: update the sccl_id/ccl_id when MT is supported
  arm64: fix ACPI dependencies
  rseq/selftests: Add support for arm64
  arm64: acpi: fix alignment fault in accessing ACPI
  efi/arm: map UEFI memory map even w/o runtime services enabled
  efi/arm: preserve early mapping of UEFI memory map longer for BGRT
  drivers: acpi: add dependency of EFI for arm64
  ...
2018-08-14 16:39:13 -07:00
Punit Agrawal
976d34e2da KVM: arm/arm64: Skip updating PTE entry if no change
When there is contention on faulting in a particular page table entry
at stage 2, the break-before-make requirement of the architecture can
lead to additional refaulting due to TLB invalidation.

Avoid this by skipping a page table update if the new value of the PTE
matches the previous value.

Cc: stable@vger.kernel.org
Fixes: d5d8184d35 ("KVM: ARM: Memory virtualization setup")
Reviewed-by: Suzuki Poulose <suzuki.poulose@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-08-13 15:32:01 +01:00
Punit Agrawal
86658b819c KVM: arm/arm64: Skip updating PMD entry if no change
Contention on updating a PMD entry by a large number of vcpus can lead
to duplicate work when handling stage 2 page faults. As the page table
update follows the break-before-make requirement of the architecture,
it can lead to repeated refaults due to clearing the entry and
flushing the tlbs.

This problem is more likely when -

* there are large number of vcpus
* the mapping is large block mapping

such as when using PMD hugepages (512MB) with 64k pages.

Fix this by skipping the page table update if there is no change in
the entry being updated.

Cc: stable@vger.kernel.org
Fixes: ad361f093c ("KVM: ARM: Support hugetlbfs backed huge pages")
Reviewed-by: Suzuki Poulose <suzuki.poulose@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-08-13 15:31:35 +01:00
Jia He
d0823cb346 KVM: arm/arm64: vgic: Do not use spin_lock_irqsave/restore with irq disabled
kvm_vgic_sync_hwstate is only called with IRQ being disabled.
There is thus no need to call spin_lock_irqsave/restore in
vgic_fold_lr_state and vgic_prune_ap_list.

This patch replace them with the non irq-safe version.

Signed-off-by: Jia He <jia.he@hxt-semitech.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
[maz: commit message tidy-up]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-08-12 12:15:18 +01:00
Jia He
dc961e5395 KVM: arm/arm64: vgic: Move DEBUG_SPINLOCK_BUG_ON to vgic.h
DEBUG_SPINLOCK_BUG_ON can be used with both vgic-v2 and vgic-v3,
so let's move it to vgic.h

Signed-off-by: Jia He <jia.he@hxt-semitech.com>
[maz: commit message tidy-up]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-08-12 12:14:08 +01:00
Marc Zyngier
6249f2a479 KVM: arm/arm64: vgic-v3: Add core support for Group0 SGIs
Although vgic-v3 now supports Group0 interrupts, it still doesn't
deal with Group0 SGIs. As usually with the GIC, nothing is simple:

- ICC_SGI1R can signal SGIs of both groups, since GICD_CTLR.DS==1
  with KVM (as per 8.1.10, Non-secure EL1 access)

- ICC_SGI0R can only generate Group0 SGIs

- ICC_ASGI1R sees its scope refocussed to generate only Group0
  SGIs (as per the note at the bottom of Table 8-14)

We only support Group1 SGIs so far, so no material change.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-08-12 12:06:34 +01:00
Christoffer Dall
245715cbe8 KVM: arm/arm64: Fix lost IRQs from emulated physcial timer when blocked
When the VCPU is blocked (for example from WFI) we don't inject the
physical timer interrupt if it should fire while the CPU is blocked, but
instead we just wake up the VCPU and expect kvm_timer_vcpu_load to take
care of injecting the interrupt.

Unfortunately, kvm_timer_vcpu_load() doesn't actually do that, it only
has support to schedule a soft timer if the emulated phys timer is
expected to fire in the future.

Follow the same pattern as kvm_timer_update_state() and update the irq
state after potentially scheduling a soft timer.

Reported-by: Andre Przywara <andre.przywara@arm.com>
Cc: Stable <stable@vger.kernel.org> # 4.15+
Fixes: bbdd52cfcb ("KVM: arm/arm64: Avoid phys timer emulation in vcpu entry/exit")
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-31 07:53:20 +01:00
Christoffer Dall
7afc4ddbf2 KVM: arm/arm64: Fix potential loss of ptimer interrupts
kvm_timer_update_state() is called when changing the phys timer
configuration registers, either via vcpu reset, as a result of a trap
from the guest, or when userspace programs the registers.

phys_timer_emulate() is in turn called by kvm_timer_update_state() to
either cancel an existing software timer, or program a new software
timer, to emulate the behavior of a real phys timer, based on the change
in configuration registers.

Unfortunately, the interaction between these two functions left a small
race; if the conceptual emulated phys timer should actually fire, but
the soft timer hasn't executed its callback yet, we cancel the timer in
phys_timer_emulate without injecting an irq.  This only happens if the
check in kvm_timer_update_state is called before the timer should fire,
which is relatively unlikely, but possible.

The solution is to update the state of the phys timer after calling
phys_timer_emulate, which will pick up the pending timer state and
update the interrupt value.

Note that this leaves the opportunity of raising the interrupt twice,
once in the just-programmed soft timer, and once in
kvm_timer_update_state.  Since this always happens synchronously with
the VCPU execution, there is no harm in this, and the guest ever only
sees a single timer interrupt.

Cc: Stable <stable@vger.kernel.org> # 4.15+
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-31 07:53:16 +01:00
Mark Rutland
6b8b9a4854 KVM: arm/arm64: vgic: Fix possible spectre-v1 write in vgic_mmio_write_apr()
It's possible for userspace to control n. Sanitize n when using it as an
array index, to inhibit the potential spectre-v1 write gadget.

Note that while it appears that n must be bound to the interval [0,3]
due to the way it is extracted from addr, we cannot guarantee that
compiler transformations (and/or future refactoring) will ensure this is
the case, and given this is a slow path it's better to always perform
the masking.

Found by smatch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Christoffer Dall <christoffer.dall@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: kvmarm@lists.cs.columbia.edu
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-24 13:53:54 +01:00
James Morse
b0960b9569 KVM: arm: Add 32bit get/set events support
arm64's new use of KVMs get_events/set_events API calls isn't just
or RAS, it allows an SError that has been made pending by KVM as
part of its device emulation to be migrated.

Wire this up for 32bit too.

We only need to read/write the HCR_VA bit, and check that no esr has
been provided, as we don't yet support VDFSR.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Dongjiu Geng <gengdongjiu@huawei.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:32 +01:00
James Morse
539aee0edb KVM: arm64: Share the parts of get/set events useful to 32bit
The get/set events helpers to do some work to check reserved
and padding fields are zero. This is useful on 32bit too.

Move this code into virt/kvm/arm/arm.c, and give the arch
code some underscores.

This is temporarily hidden behind __KVM_HAVE_VCPU_EVENTS until
32bit is wired up.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Dongjiu Geng <gengdongjiu@huawei.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:31 +01:00
Dongjiu Geng
b7b27facc7 arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS
For the migrating VMs, user space may need to know the exception
state. For example, in the machine A, KVM make an SError pending,
when migrate to B, KVM also needs to pend an SError.

This new IOCTL exports user-invisible states related to SError.
Together with appropriate user space changes, user space can get/set
the SError exception state to do migrate/snapshot/suspend.

Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
Reviewed-by: James Morse <james.morse@arm.com>
[expanded documentation wording]
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:30 +01:00
Christoffer Dall
32f8777ed9 KVM: arm/arm64: vgic: Let userspace opt-in to writable v2 IGROUPR
Simply letting IGROUPR be writable from userspace would break
migration from old kernels to newer kernels, because old kernels
incorrectly report interrupt groups as group 1.  This would not be a big
problem if userspace wrote GICD_IIDR as read from the kernel, because we
could detect the incompatibility and return an error to userspace.
Unfortunately, this is not the case with current userspace
implementations and simply letting IGROUPR be writable from userspace for
an emulated GICv2 silently breaks migration and causes the destination
VM to no longer run after migration.

We now encourage userspace to write the read and expected value of
GICD_IIDR as the first part of a GIC register restore, and if we observe
a write to GICD_IIDR we know that userspace has been updated and has had
a chance to cope with older kernels (VGICv2 IIDR.Revision == 0)
incorrectly reporting interrupts as group 1, and therefore we now allow
groups to be user writable.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:29 +01:00
Christoffer Dall
d53c2c29ae KVM: arm/arm64: vgic: Allow configuration of interrupt groups
Implement the required MMIO accessors for GICv2 and GICv3 for the
IGROUPR distributor and redistributor registers.

This can allow guests to change behavior compared to running on previous
versions of KVM, but only to align with the architecture and hardware
implementations.

This also allows userspace to configure the interrupts groups for GICv3.
We don't allow userspace to write the groups on GICv2 just yet, because
that would result in GICv2 guests not receiving interrupts after
migrating from an older kernel that exposes GICv2 interrupts as group 1.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:29 +01:00
Christoffer Dall
b489edc361 KVM: arm/arm64: vgic: Return error on incompatible uaccess GICD_IIDR writes
If userspace attempts to write a GICD_IIDR that does not match the
kernel version, return an error to userspace.  The intention is to allow
implementation changes inside KVM while avoiding silently breaking
migration resulting in guests not running without any clear indication
of what went wrong.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:28 +01:00
Christoffer Dall
c6e0917b67 KVM: arm/arm64: vgic: Permit uaccess writes to return errors
Currently we do not allow any vgic mmio write operations to fail, which
makes sense from mmio traps from the guest.  However, we should be able
to report failures to userspace, if userspace writes incompatible values
to read-only registers.  Rework the internal interface to allow errors
to be returned on the write side for userspace writes.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:27 +01:00
Christoffer Dall
8732209905 KVM: arm/arm64: vgic: Signal IRQs using their configured group
Now when we have a group configuration on the struct IRQ, use this state
when populating the LR and signaling interrupts as either group 0 or
group 1 to the VM.  Depending on the model of the emulated GIC, and the
guest's configuration of the VMCR, interrupts may be signaled as IRQs or
FIQs to the VM.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:26 +01:00
Christoffer Dall
8df3c8f33f KVM: arm/arm64: vgic: Add group field to struct irq
In preparation for proper group 0 and group 1 support in the vgic, we
add a field in the struct irq to store the group of all interrupts.

We initialize the group to group 0 when emulating GICv2 and to group 1
when emulating GICv3, just like we treat them today.  LPIs are always
group 1.  We also continue to ignore writes from the guest, preserving
existing functionality, for now.

Finally, we also add this field to the vgic debug logic to show the
group for all interrupts.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:24 +01:00
Christoffer Dall
dd6251e463 KVM: arm/arm64: vgic: GICv2 IGROUPR should read as zero
We currently don't support grouping in the emulated VGIC, which is a
known defect on KVM (not hurting any currently used guests as far as
we're aware). This is currently handled by treating all interrupts as
group 0 interrupts for an emulated GICv2 and always signaling interrupts
as group 0 to the virtual CPU interface.

However, when reading which group interrupts belongs to in the guest
from the emulated VGIC, the VGIC currently reports group 1 instead of
group 0, which is misleading.  Fix this temporarily before introducing
full group support by changing the hander to _raz instead of _rao.

Fixes: fb848db396 "KVM: arm/arm64: vgic-new: Add GICv2 MMIO handling framework"
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:22 +01:00
Christoffer Dall
aa075b0f30 KVM: arm/arm64: vgic: Keep track of implementation revision
As we are about to tweak implementation aspects of the VGIC emulation,
while still preserving some level of backwards compatibility support,
add a field to keep track of the implementation revision field which is
reported to the VM and to userspace.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:21 +01:00
Christoffer Dall
a2dca217da KVM: arm/arm64: vgic: Define GICD_IIDR fields for GICv2 and GIv3
Instead of hardcoding the shifts and masks in the GICD_IIDR register
emulation, let's add the definition of these fields to the GIC header
files and use them.

This will make things more obvious when we're going to bump the revision
in the IIDR when we'll make guest-visible changes to the implementation.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:19 +01:00
Marc Zyngier
e294cb3a6d KVM: arm/arm64: vgic-debug: Show LPI status
The vgic debugfs file only knows about SGI/PPI/SPI interrupts, and
completely ignores LPIs. Let's fix that.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:16 +01:00
Kees Cook
2326aceebc KVM: arm64: vgic-its: Remove VLA usage
In the quest to remove all stack VLA usage from the kernel[1], this
switches to using a maximum size and adds sanity checks. Additionally
cleans up some of the int-vs-u32 usage and adds additional bounds checking.
As it currently stands, this will always be 8 bytes until the ABI changes.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Cc: Christoffer Dall <christoffer.dall@arm.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: kvmarm@lists.cs.columbia.edu
Signed-off-by: Kees Cook <keescook@chromium.org>
[maz: dropped WARN_ONs]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:14 +01:00
Christoffer Dall
1d47191de7 KVM: arm/arm64: Fix vgic init race
The vgic_init function can race with kvm_arch_vcpu_create() which does
not hold kvm_lock() and we therefore have no synchronization primitives
to ensure we're doing the right thing.

As the user is trying to initialize or run the VM while at the same time
creating more VCPUs, we just have to refuse to initialize the VGIC in
this case rather than silently failing with a broken VCPU.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-21 16:02:07 +01:00
Marc Zyngier
de73708915 KVM: arm/arm64: Enable adaptative WFE trapping
Trapping blocking WFE is extremely beneficial in situations where
the system is oversubscribed, as it allows another thread to run
while being blocked. In a non-oversubscribed environment, this is
the complete opposite, and trapping WFE is just unnecessary overhead.

Let's only enable WFE trapping if the CPU has more than a single task
to run (that is, more than just the vcpu thread).

Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-09 11:38:24 +01:00
Marc Zyngier
0a72a5ab9f KVM: arm/arm64: Remove unnecessary CMOs when creating HYP page tables
There is no need to perform cache maintenance operations when
creating the HYP page tables if we have the multiprocessing
extensions. ARMv7 mandates them with the virtualization support,
and ARMv8 just mandates them unconditionally.

Let's remove these operations.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-09 11:37:42 +01:00
Marc Zyngier
0db9dd8a0f KVM: arm/arm64: Stop using the kernel's {pmd,pud,pgd}_populate helpers
The {pmd,pud,pgd}_populate accessors usage have always been a bit weird
in KVM. We don't have a struct mm to pass (and neither does the kernel
most of the time, but still...), and the 32bit code has all kind of
cache maintenance that doesn't make sense on ARMv7+ when MP extensions
are mandatory (which is the case when the VEs are present).

Let's bite the bullet and provide our own implementations. The only bit
of architectural code left has to do with building the table entry
itself (arm64 having up to 52bit PA, arm lacking PUD level).

Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-09 11:37:42 +01:00
Marc Zyngier
88dc25e8ea KVM: arm/arm64: Consolidate page-table accessors
The arm and arm64 KVM page tables accessors are pointlessly different
between the two architectures, and likely both wrong one way or another:
arm64 lacks a dsb(), and arm doesn't use WRITE_ONCE.

Let's unify them.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-09 11:37:42 +01:00
Marc Zyngier
e48d53a91f arm64: KVM: Add support for Stage-2 control of memory types and cacheability
Up to ARMv8.3, the combinaison of Stage-1 and Stage-2 attributes
results in the strongest attribute of the two stages.  This means
that the hypervisor has to perform quite a lot of cache maintenance
just in case the guest has some non-cacheable mappings around.

ARMv8.4 solves this problem by offering a different mode (FWB) where
Stage-2 has total control over the memory attribute (this is limited
to systems where both I/O and instruction fetches are coherent with
the dcache). This is achieved by having a different set of memory
attributes in the page tables, and a new bit set in HCR_EL2.

On such a system, we can then safely sidestep any form of dcache
management.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-07-09 11:37:41 +01:00
Mark Rutland
256c0960b7 kvm/arm: use PSR_AA32 definitions
Some code cares about the SPSR_ELx format for exceptions taken from
AArch32 to inspect or manipulate the SPSR_ELx value, which is already in
the SPSR_ELx format, and not in the AArch32 PSR format.

To separate these from cases where we care about the AArch32 PSR format,
migrate these cases to use the PSR_AA32_* definitions rather than
COMPAT_PSR_*.

There should be no functional change as a result of this patch.

Note that arm64 KVM does not support a compat KVM API, and always uses
the SPSR_ELx format, even for AArch32 guests.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-07-05 17:24:15 +01:00
Ingo Molnar
4520843dfa Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-03 09:20:22 +02:00
Jia He
47a91b7232 KVM: arm/arm64: add WARN_ON if size is not PAGE_SIZE aligned in unmap_stage2_range
There is a panic in armv8a server(QDF2400) under memory pressure tests
(start 20 guests and run memhog in the host).

---------------------------------begin--------------------------------
[35380.800950] BUG: Bad page state in process qemu-kvm  pfn:dd0b6
[35380.805825] page:ffff7fe003742d80 count:-4871 mapcount:-2126053375
mapping:          (null) index:0x0
[35380.815024] flags: 0x1fffc00000000000()
[35380.818845] raw: 1fffc00000000000 0000000000000000 0000000000000000
ffffecf981470000
[35380.826569] raw: dead000000000100 dead000000000200 ffff8017c001c000
0000000000000000
[35380.805825] page:ffff7fe003742d80 count:-4871 mapcount:-2126053375
mapping:          (null) index:0x0
[35380.815024] flags: 0x1fffc00000000000()
[35380.818845] raw: 1fffc00000000000 0000000000000000 0000000000000000
ffffecf981470000
[35380.826569] raw: dead000000000100 dead000000000200 ffff8017c001c000
0000000000000000
[35380.834294] page dumped because: nonzero _refcount
[...]
--------------------------------end--------------------------------------

The root cause might be what was fixed at [1]. But from the KVM points of
view, it would be better if the issue was caught earlier.

If the size is not PAGE_SIZE aligned, unmap_stage2_range might unmap the
wrong(more or less) page range. Hence it caused the "BUG: Bad page
state"

Let's WARN in that case, so that the issue is obvious.

[1] https://lkml.org/lkml/2018/5/3/1042

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: jia.he@hxt-semitech.com
[maz: tidied up commit message]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-06-21 11:48:03 +01:00
Ard Biesheuvel
ba56bc3a07 KVM: arm/arm64: Drop resource size check for GICV window
When booting a 64 KB pages kernel on a ACPI GICv3 system that
implements support for v2 emulation, the following warning is
produced

  GICV size 0x2000 not a multiple of page size 0x10000

and support for v2 emulation is disabled, preventing GICv2 VMs
from being able to run on such hosts.

The reason is that vgic_v3_probe() performs a sanity check on the
size of the window (it should be a multiple of the page size),
while the ACPI MADT parsing code hardcodes the size of the window
to 8 KB. This makes sense, considering that ACPI does not bother
to describe the size in the first place, under the assumption that
platforms implementing ACPI will follow the architecture and not
put anything else in the same 64 KB window.

So let's just drop the sanity check altogether, and assume that
the window is at least 64 KB in size.

Fixes: 9097773245 ("KVM: arm/arm64: vgic-new: vgic_init: implement kvm_vgic_hyp_init")
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-06-21 09:14:44 +01:00
Peter Zijlstra
b3dae109fa sched/swait: Rename to exclusive
Since swait basically implemented exclusive waits only, make sure
the API reflects that.

  $ git grep -l -e "\<swake_up\>"
		-e "\<swait_event[^ (]*"
		-e "\<prepare_to_swait\>" | while read file;
    do
	sed -i -e 's/\<swake_up\>/&_one/g'
	       -e 's/\<swait_event[^ (]*/&_exclusive/g'
	       -e 's/\<prepare_to_swait\>/&_exclusive/g' $file;
    done

With a few manual touch-ups.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: bigeasy@linutronix.de
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180612083909.261946548@infradead.org
2018-06-20 11:35:56 +02:00