commit d4912215d1 upstream.
WARNING: CPU: 3 PID: 2840 at arch/x86/kvm/vmx.c:10966 nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel]
CPU: 3 PID: 2840 Comm: qemu-system-x86 Tainted: G OE 4.12.0-rc3+ #23
RIP: 0010:nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel]
Call Trace:
? kvm_check_async_pf_completion+0xef/0x120 [kvm]
? rcu_read_lock_sched_held+0x79/0x80
vmx_queue_exception+0x104/0x160 [kvm_intel]
? vmx_queue_exception+0x104/0x160 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x1171/0x1ce0 [kvm]
? kvm_arch_vcpu_load+0x47/0x240 [kvm]
? kvm_arch_vcpu_load+0x62/0x240 [kvm]
kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
? kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
? __fget+0xf3/0x210
do_vfs_ioctl+0xa4/0x700
? __fget+0x114/0x210
SyS_ioctl+0x79/0x90
do_syscall_64+0x81/0x220
entry_SYSCALL64_slow_path+0x25/0x25
This is triggered occasionally by running both win7 and win2016 in L2, in
addition, EPT is disabled on both L1 and L2. It can't be reproduced easily.
Commit 0b6ac343fc (KVM: nVMX: Correct handling of exception injection) mentioned
that "KVM wants to inject page-faults which it got to the guest. This function
assumes it is called with the exit reason in vmcs02 being a #PF exception".
Commit e011c663 (KVM: nVMX: Check all exceptions for intercept during delivery to
L2) allows to check all exceptions for intercept during delivery to L2. However,
there is no guarantee the exit reason is exception currently, when there is an
external interrupt occurred on host, maybe a time interrupt for host which should
not be injected to guest, and somewhere queues an exception, then the function
nested_vmx_check_exception() will be called and the vmexit emulation codes will
try to emulate the "Acknowledge interrupt on exit" behavior, the warning is
triggered.
Reusing the exit reason from the L2->L0 vmexit is wrong in this case,
the reason must always be EXCEPTION_NMI when injecting an exception into
L1 as a nested vmexit.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Fixes: e011c663b9 ("KVM: nVMX: Check all exceptions for intercept during delivery to L2")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f0367ee1d6 upstream.
Static checker noticed that base3 could be used uninitialized if the
segment was not present (useable). Random stack values probably would
not pass VMCS entry checks.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 1aa366163b ("KVM: x86 emulator: consolidate segment accessors")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 34b0dadbdf upstream.
Static analysis noticed that pmu->nr_arch_gp_counters can be 32
(INTEL_PMC_MAX_GENERIC) and therefore cannot be used to shift 'int'.
I didn't add BUILD_BUG_ON for it as we have a better checker.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 25462f7f52 ("KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6ed071f051 upstream.
On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
on hflags is reverted later on in x86_emulate_instruction where hflags are
overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.
Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
an instruction is emulated, this commit deletes emul_flags altogether and
makes the emulator access vcpu->arch.hflags using two new accessors. This
way all changes, on the emulator side as well as in functions called from
the emulator and accessing vcpu state with emul_to_vcpu, are preserved.
More details on the bug and its manifestation with Windows and OVMF:
It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
I believe that the SMM part explains why we started seeing this only with
OVMF.
KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
later on in x86_emulate_instruction we overwrite arch.hflags with
ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
The AMD-specific hflag of interest here is HF_NMI_MASK.
When rebooting the system, Windows sends an NMI IPI to all but the current
cpu to shut them down. Only after all of them are parked in HLT will the
initiating cpu finish the restart. If NMI is masked, other cpus never get
the memo and the initiating cpu spins forever, waiting for
hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.
Fixes: a584539b24 ("KVM: x86: pass the whole hflags field to emulator and back")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c8401dda2f upstream.
TF is handled a bit differently for syscall and sysret, compared
to the other instructions: TF is checked after the instruction completes,
so that the OS can disable #DB at a syscall by adding TF to FMASK.
When the sysret is executed the #DB is taken "as if" the syscall insn
just completed.
KVM emulates syscall so that it can trap 32-bit syscall on Intel processors.
Fix the behavior, otherwise you could get #DB on a user stack which is not
nice. This does not affect Linux guests, as they use an IST or task gate
for #DB.
This fixes CVE-2017-7518.
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9bc1f09f6f upstream.
INFO: task gnome-terminal-:1734 blocked for more than 120 seconds.
Not tainted 4.12.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gnome-terminal- D 0 1734 1015 0x00000000
Call Trace:
__schedule+0x3cd/0xb30
schedule+0x40/0x90
kvm_async_pf_task_wait+0x1cc/0x270
? __vfs_read+0x37/0x150
? prepare_to_swait+0x22/0x70
do_async_page_fault+0x77/0xb0
? do_async_page_fault+0x77/0xb0
async_page_fault+0x28/0x30
This is triggered by running both win7 and win2016 on L1 KVM simultaneously,
and then gives stress to memory on L1, I can observed this hang on L1 when
at least ~70% swap area is occupied on L0.
This is due to async pf was injected to L2 which should be injected to L1,
L2 guest starts receiving pagefault w/ bogus %cr2(apf token from the host
actually), and L1 guest starts accumulating tasks stuck in D state in
kvm_async_pf_task_wait() since missing PAGE_READY async_pfs.
This patch fixes the hang by doing async pf when executing L1 guest.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a3641631d1 upstream.
If "i" is the last element in the vcpu->arch.cpuid_entries[] array, it
potentially can be exploited the vulnerability. this will out-of-bounds
read and write. Luckily, the effect is small:
/* when no next entry is found, the current entry[i] is reselected */
for (j = i + 1; ; j = (j + 1) % nent) {
struct kvm_cpuid_entry2 *ej = &vcpu->arch.cpuid_entries[j];
if (ej->function == e->function) {
It reads ej->maxphyaddr, which is user controlled. However...
ej->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT;
After cpuid_entries there is
int maxphyaddr;
struct x86_emulate_ctxt emulate_ctxt; /* 16-byte aligned */
So we have:
- cpuid_entries at offset 1B50 (6992)
- maxphyaddr at offset 27D0 (6992 + 3200 = 10192)
- padding at 27D4...27DF
- emulate_ctxt at 27E0
And it writes in the padding. Pfew, writing the ops field of emulate_ctxt
would have been much worse.
This patch fixes it by modding the index to avoid the out-of-bounds
access. Worst case, i == j and ej->function == e->function,
the loop can bail out.
Reported-by: Moguofang <moguofang@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Guofang Mo <moguofang@huawei.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cbfc6c9184 upstream.
Huawei folks reported a read out-of-bounds vulnerability in kvm pio emulation.
- "inb" instruction to access PIT Mod/Command register (ioport 0x43, write only,
a read should be ignored) in guest can get a random number.
- "rep insb" instruction to access PIT register port 0x43 can control memcpy()
in emulator_pio_in_emulated() to copy max 0x400 bytes but only read 1 bytes,
which will disclose the unimportant kernel memory in host but no crash.
The similar test program below can reproduce the read out-of-bounds vulnerability:
void hexdump(void *mem, unsigned int len)
{
unsigned int i, j;
for(i = 0; i < len + ((len % HEXDUMP_COLS) ? (HEXDUMP_COLS - len % HEXDUMP_COLS) : 0); i++)
{
/* print offset */
if(i % HEXDUMP_COLS == 0)
{
printf("0x%06x: ", i);
}
/* print hex data */
if(i < len)
{
printf("%02x ", 0xFF & ((char*)mem)[i]);
}
else /* end of block, just aligning for ASCII dump */
{
printf(" ");
}
/* print ASCII dump */
if(i % HEXDUMP_COLS == (HEXDUMP_COLS - 1))
{
for(j = i - (HEXDUMP_COLS - 1); j <= i; j++)
{
if(j >= len) /* end of block, not really printing */
{
putchar(' ');
}
else if(isprint(((char*)mem)[j])) /* printable char */
{
putchar(0xFF & ((char*)mem)[j]);
}
else /* other char */
{
putchar('.');
}
}
putchar('\n');
}
}
}
int main(void)
{
int i;
if (iopl(3))
{
err(1, "set iopl unsuccessfully\n");
return -1;
}
static char buf[0x40];
/* test ioport 0x40,0x41,0x42,0x43,0x44,0x45 */
memset(buf, 0xab, sizeof(buf));
asm volatile("push %rdi;");
asm volatile("mov %0, %%rdi;"::"q"(buf));
asm volatile ("mov $0x40, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x41, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x42, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x43, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x44, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x45, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("pop %rdi;");
hexdump(buf, 0x40);
printf("\n");
/* ins port 0x40 */
memset(buf, 0xab, sizeof(buf));
asm volatile("push %rdi;");
asm volatile("mov %0, %%rdi;"::"q"(buf));
asm volatile ("mov $0x20, %rcx;");
asm volatile ("mov $0x40, %rdx;");
asm volatile ("rep insb;");
asm volatile ("pop %rdi;");
hexdump(buf, 0x40);
printf("\n");
/* ins port 0x43 */
memset(buf, 0xab, sizeof(buf));
asm volatile("push %rdi;");
asm volatile("mov %0, %%rdi;"::"q"(buf));
asm volatile ("mov $0x20, %rcx;");
asm volatile ("mov $0x43, %rdx;");
asm volatile ("rep insb;");
asm volatile ("pop %rdi;");
hexdump(buf, 0x40);
printf("\n");
return 0;
}
The vcpu->arch.pio_data buffer is used by both in/out instrutions emulation
w/o clear after using which results in some random datas are left over in
the buffer. Guest reads port 0x43 will be ignored since it is write only,
however, the function kernel_pio() can't distigush this ignore from successfully
reads data from device's ioport. There is no new data fill the buffer from
port 0x43, however, emulator_pio_in_emulated() will copy the stale data in
the buffer to the guest unconditionally. This patch fixes it by clearing the
buffer before in instruction emulation to avoid to grant guest the stale data
in the buffer.
In addition, string I/O is not supported for in kernel device. So there is no
iteration to read ioport %RCX times for string I/O. The function kernel_pio()
just reads one round, and then copy the io size * %RCX to the guest unconditionally,
actually it copies the one round ioport data w/ other random datas which are left
over in the vcpu->arch.pio_data buffer to the guest. This patch fixes it by
introducing the string I/O support for in kernel device in order to grant the right
ioport datas to the guest.
Before the patch:
0x000000: fe 38 93 93 ff ff ab ab .8......
0x000008: ab ab ab ab ab ab ab ab ........
0x000010: ab ab ab ab ab ab ab ab ........
0x000018: ab ab ab ab ab ab ab ab ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: f6 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
0x000018: 30 30 20 33 20 20 20 20 00 3
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: f6 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
0x000018: 30 30 20 33 20 20 20 20 00 3
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
After the patch:
0x000000: 1e 02 f8 00 ff ff ab ab ........
0x000008: ab ab ab ab ab ab ab ab ........
0x000010: ab ab ab ab ab ab ab ab ........
0x000018: ab ab ab ab ab ab ab ab ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: d2 e2 d2 df d2 db d2 d7 ........
0x000008: d2 d3 d2 cf d2 cb d2 c7 ........
0x000010: d2 c4 d2 c0 d2 bc d2 b8 ........
0x000018: d2 b4 d2 b0 d2 ac d2 a8 ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: 00 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 00 00 00 00 ........
0x000018: 00 00 00 00 00 00 00 00 ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
Reported-by: Moguofang <moguofang@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Moguofang <moguofang@huawei.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e2c2206a18 upstream.
BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/2809
caller is __this_cpu_preempt_check+0x13/0x20
CPU: 2 PID: 2809 Comm: qemu-system-x86 Not tainted 4.11.0+ #13
Call Trace:
dump_stack+0x99/0xce
check_preemption_disabled+0xf5/0x100
__this_cpu_preempt_check+0x13/0x20
get_kvmclock_ns+0x6f/0x110 [kvm]
get_time_ref_counter+0x5d/0x80 [kvm]
kvm_hv_process_stimers+0x2a1/0x8a0 [kvm]
? kvm_hv_process_stimers+0x2a1/0x8a0 [kvm]
? kvm_arch_vcpu_ioctl_run+0xac9/0x1ce0 [kvm]
kvm_arch_vcpu_ioctl_run+0x5bf/0x1ce0 [kvm]
kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
? kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
? __fget+0xf3/0x210
do_vfs_ioctl+0xa4/0x700
? __fget+0x114/0x210
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc2
RIP: 0033:0x7f9d164ed357
? __this_cpu_preempt_check+0x13/0x20
This can be reproduced by run kvm-unit-tests/hyperv_stimer.flat w/
CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT enabled.
Safe access to per-CPU data requires a couple of constraints, though: the
thread working with the data cannot be preempted and it cannot be migrated
while it manipulates per-CPU variables. If the thread is preempted, the
thread that replaces it could try to work with the same variables; migration
to another CPU could also cause confusion. However there is no preemption
disable when reads host per-CPU tsc rate to calculate the current kvmclock
timestamp.
This patch fixes it by utilizing get_cpu/put_cpu pair to guarantee both
__this_cpu_read() and rdtsc() are not preempted.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a575813bfe upstream.
Reported by syzkaller:
BUG: unable to handle kernel paging request at ffffffffc07f6a2e
IP: report_bug+0x94/0x120
PGD 348e12067
P4D 348e12067
PUD 348e14067
PMD 3cbd84067
PTE 80000003f7e87161
Oops: 0003 [#1] SMP
CPU: 2 PID: 7091 Comm: kvm_load_guest_ Tainted: G OE 4.11.0+ #8
task: ffff92fdfb525400 task.stack: ffffbda6c3d04000
RIP: 0010:report_bug+0x94/0x120
RSP: 0018:ffffbda6c3d07b20 EFLAGS: 00010202
do_trap+0x156/0x170
do_error_trap+0xa3/0x170
? kvm_load_guest_fpu.part.175+0x12a/0x170 [kvm]
? mark_held_locks+0x79/0xa0
? retint_kernel+0x10/0x10
? trace_hardirqs_off_thunk+0x1a/0x1c
do_invalid_op+0x20/0x30
invalid_op+0x1e/0x30
RIP: 0010:kvm_load_guest_fpu.part.175+0x12a/0x170 [kvm]
? kvm_load_guest_fpu.part.175+0x1c/0x170 [kvm]
kvm_arch_vcpu_ioctl_run+0xed6/0x1b70 [kvm]
kvm_vcpu_ioctl+0x384/0x780 [kvm]
? kvm_vcpu_ioctl+0x384/0x780 [kvm]
? sched_clock+0x13/0x20
? __do_page_fault+0x2a0/0x550
do_vfs_ioctl+0xa4/0x700
? up_read+0x1f/0x40
? __do_page_fault+0x2a0/0x550
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc2
SDM mentioned that "The MXCSR has several reserved bits, and attempting to write
a 1 to any of these bits will cause a general-protection exception(#GP) to be
generated". The syzkaller forks' testcase overrides xsave area w/ random values
and steps on the reserved bits of MXCSR register. The damaged MXCSR register
values of guest will be restored to SSEx MXCSR register before vmentry. This
patch fixes it by catching userspace override MXCSR register reserved bits w/
random values and bails out immediately.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4e335d9e7d upstream.
This reverts commit bbd6411513.
I've been sitting on this revert for too long and it unfortunately
missed 4.11. It's also the reason why I haven't merged ring-based
dirty tracking for 4.12.
Using kvm_vcpu_memslots in kvm_gfn_to_hva_cache_init and
kvm_vcpu_write_guest_offset_cached means that the MSR value can
now be used to access SMRAM, simply by making it point to an SMRAM
physical address. This is problematic because it lets the guest
OS overwrite memory that it shouldn't be able to touch.
Fixes: bbd6411513
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 28bf288879 upstream.
If we already entered/are about to enter SMM, don't allow switching to
INIT/SIPI_RECEIVED, otherwise the next call to kvm_apic_accept_events()
will report a warning.
Same applies if we are already in MP state INIT_RECEIVED and SMM is
requested to be turned on. Refuse to set the VCPU events in this case.
Fixes: cd7764fe9f ("KVM: x86: latch INITs while in system management mode")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
L2 was running with uninitialized PML fields which led to incomplete
dirty bitmap logging. This manifested as all kinds of subtle erratic
behavior of the nested guest.
Fixes: 843e433057 ("KVM: VMX: Add PML support in VMX")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The PML feature is not exposed to guests so we should not be forwarding
the vmexit either.
This commit fixes BSOD 0x20001 (HYPERVISOR_ERROR) when running Hyper-V
enabled Windows Server 2016 in L1 on hardware that supports PML.
Fixes: 843e433057 ("KVM: VMX: Add PML support in VMX")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The nested_ept_enabled flag introduced in commit 7ca29de213 was not
computed correctly. We are interested only in L1's EPT state, not the
the combined L0+L1 value.
In particular, if L0 uses EPT but L1 does not, nested_ept_enabled must
be false to make sure that PDPSTRs are loaded based on CR3 as usual,
because the special case described in 26.3.2.4 Loading Page-Directory-
Pointer-Table Entries does not apply.
Fixes: 7ca29de213 ("KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT")
Cc: qemu-stable@nongnu.org
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This can be reproduced by running L2 on L1, and disable VPID on L0
if w/o commit "KVM: nVMX: Fix nested VPID vmx exec control", the L2
crash as below:
KVM: entry failed, hardware error 0x7
EAX=00000000 EBX=00000000 ECX=00000000 EDX=000306c3
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT= 00000000 0000ffff
IDT= 00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Reference SDM 30.3 INVVPID:
Protected Mode Exceptions
- #UD
- If not in VMX operation.
- If the logical processor does not support VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=0).
- If the logical processor supports VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=1) but does
not support the INVVPID instruction (IA32_VMX_EPT_VPID_CAP[32]=0).
So we should check both VPID enable bit in vmx exec control and INVVPID support bit
in vmx capability MSRs to enable VPID. This patch adds the guarantee to not enable
VPID if either INVVPID or single-context/all-context invalidation is not exposed in
vmx capability MSRs.
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This can be reproduced by running kvm-unit-tests/vmx.flat on L0 w/ vpid disabled.
Test suite: VPID
Unhandled exception 6 #UD at ip 00000000004051a6
error_code=0000 rflags=00010047 cs=00000008
rax=0000000000000000 rcx=0000000000000001 rdx=0000000000000047 rbx=0000000000402f79
rbp=0000000000456240 rsi=0000000000000001 rdi=0000000000000000
r8=000000000000000a r9=00000000000003f8 r10=0000000080010011 r11=0000000000000000
r12=0000000000000003 r13=0000000000000708 r14=0000000000000000 r15=0000000000000000
cr0=0000000080010031 cr2=0000000000000000 cr3=0000000007fff000 cr4=0000000000002020
cr8=0000000000000000
STACK: @4051a6 40523e 400f7f 402059 40028f
We should hide and forbid VPID in L1 if it is disabled on L0. However, nested VPID
enable bit is set unconditionally during setup nested vmx exec controls though VPID
is not exposed through nested VMX capablity. This patch fixes it by don't set nested
VPID enable bit if it is disabled on L0.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 5c614b3583 (KVM: nVMX: nested VPID emulation)
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After async pf setup successfully, there is a broadcast wakeup w/ special
token 0xffffffff which tells vCPU that it should wake up all processes
waiting for APFs though there is no real process waiting at the moment.
The async page present tracepoint print prematurely and fails to catch the
special token setup. This patch fixes it by moving the async page present
tracepoint after the special token setup.
Before patch:
qemu-system-x86-8499 [006] ...1 5973.473292: kvm_async_pf_ready: token 0x0 gva 0x0
After patch:
qemu-system-x86-8499 [006] ...1 5973.473292: kvm_async_pf_ready: token 0xffffffff gva 0x0
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Quoting from the Intel SDM, volume 3, section 28.3.3.4: Guidelines for
Use of the INVEPT Instruction:
If EPT was in use on a logical processor at one time with EPTP X, it
is recommended that software use the INVEPT instruction with the
"single-context" INVEPT type and with EPTP X in the INVEPT descriptor
before a VM entry on the same logical processor that enables EPT with
EPTP X and either (a) the "virtualize APIC accesses" VM-execution
control was changed from 0 to 1; or (b) the value of the APIC-access
address was changed.
In the nested case, the burden falls on L1, unless L0 enables EPT in
vmcs02 when L1 doesn't enable EPT in vmcs12.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
We have specific destructors for pic/ioapic, we'd better use them when
destroying the VM as well.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Mostly used for split irqchip mode. In that case, these two things are
not inited at all, so no need to release.
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
kvm mmu is reset once successfully loading CR3 as part of emulating vmentry
in nested_vmx_load_cr3(). We should not reset kvm mmu twice.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
If avic is not enabled, avic_vm_init() does nothing and returns early.
However, avic_vm_destroy() still tries to destroy what hasn't been created.
The only bad consequence of this now is that avic_vm_destroy() uses
svm_vm_data_hash_lock that hasn't been initialized (and is not meant
to be used at all if avic is not enabled).
Return early from avic_vm_destroy() if avic is not enabled.
It has nothing to destroy.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kvm@vger.kernel.org
Cc: syzkaller@googlegroups.com
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Before trying to do nested_get_page() in nested_vmx_merge_msr_bitmap(),
we have already checked that the MSR bitmap address is valid (4k aligned
and within physical limits). SDM doesn't specify what happens if the
there is no memory mapped at the valid address, but Intel CPUs treat the
situation as if the bitmap was configured to trap all MSRs.
KVM already does that by returning false and a correct handling doesn't
need the guest-trigerrable warning that was reported by syzkaller:
(The warning was originally there to catch some possible bugs in nVMX.)
------------[ cut here ]------------
WARNING: CPU: 0 PID: 7832 at arch/x86/kvm/vmx.c:9709
nested_vmx_merge_msr_bitmap arch/x86/kvm/vmx.c:9709 [inline]
WARNING: CPU: 0 PID: 7832 at arch/x86/kvm/vmx.c:9709
nested_get_vmcs12_pages+0xfb6/0x15c0 arch/x86/kvm/vmx.c:9640
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 7832 Comm: syz-executor1 Not tainted 4.10.0+ #229
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:15 [inline]
dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
panic+0x1fb/0x412 kernel/panic.c:179
__warn+0x1c4/0x1e0 kernel/panic.c:540
warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
nested_vmx_merge_msr_bitmap arch/x86/kvm/vmx.c:9709 [inline]
nested_get_vmcs12_pages+0xfb6/0x15c0 arch/x86/kvm/vmx.c:9640
enter_vmx_non_root_mode arch/x86/kvm/vmx.c:10471 [inline]
nested_vmx_run+0x6186/0xaab0 arch/x86/kvm/vmx.c:10561
handle_vmlaunch+0x1a/0x20 arch/x86/kvm/vmx.c:7312
vmx_handle_exit+0xfc0/0x3f00 arch/x86/kvm/vmx.c:8526
vcpu_enter_guest arch/x86/kvm/x86.c:6982 [inline]
vcpu_run arch/x86/kvm/x86.c:7044 [inline]
kvm_arch_vcpu_ioctl_run+0x1418/0x4840 arch/x86/kvm/x86.c:7205
kvm_vcpu_ioctl+0x673/0x1120 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2570
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
[Jim Mattson explained the bare metal behavior: "I believe this behavior
would be documented in the chipset data sheet rather than the SDM,
since the chipset returns all 1s for an unclaimed read."]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
VMCLEAR should silently ignore a failure to clear the launch state of
the VMCS referenced by the operand.
Signed-off-by: Jim Mattson <jmattson@google.com>
[Changed "kvm_write_guest(vcpu->kvm" to "kvm_vcpu_write_guest(vcpu".]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Pull more KVM updates from Radim Krčmář:
"Second batch of KVM changes for the 4.11 merge window:
PPC:
- correct assumption about ASDR on POWER9
- fix MMIO emulation on POWER9
x86:
- add a simple test for ioperm
- cleanup TSS (going through KVM tree as the whole undertaking was
caused by VMX's use of TSS)
- fix nVMX interrupt delivery
- fix some performance counters in the guest
... and two cleanup patches"
* tag 'kvm-4.11-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: nVMX: Fix pending events injection
x86/kvm/vmx: remove unused variable in segment_base()
selftests/x86: Add a basic selftest for ioperm
x86/asm: Tidy up TSS limit code
kvm: convert kvm.users_count from atomic_t to refcount_t
KVM: x86: never specify a sample period for virtualized in_tx_cp counters
KVM: PPC: Book3S HV: Don't use ASDR for real-mode HPT faults on POWER9
KVM: PPC: Book3S HV: Fix software walk of guest process page tables
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
L2 fails to boot on a non-APICv box dues to 'commit 0ad3bed6c5
("kvm: nVMX: move nested events check to kvm_vcpu_running")'
KVM internal error. Suberror: 3
extra data[0]: 800000ef
extra data[1]: 1
RAX=0000000000000000 RBX=ffffffff81f36140 RCX=0000000000000000 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=ffff88007c92fe90 RSP=ffff88007c92fe90
R8 =ffff88007fccdca0 R9 =0000000000000000 R10=00000000fffedb3d R11=0000000000000000
R12=0000000000000003 R13=0000000000000000 R14=0000000000000000 R15=ffff88007c92c000
RIP=ffffffff810645e6 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0000 0000000000000000 ffffffff 00c00000
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 0000000000000000 ffffffff 00c00000
GS =0000 ffff88007fcc0000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 ffff88007fcd4200 00002087 00008b00 DPL=0 TSS64-busy
GDT= ffff88007fcc9000 0000007f
IDT= ffffffffff578000 00000fff
CR0=80050033 CR2=00000000ffffffff CR3=0000000001e0a000 CR4=003406e0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
We should try to reinject previous events if any before trying to inject
new event if pending. If vmexit is triggered by L2 guest and L0 interested
in, we should reinject IDT-vectoring info to L2 through vmcs02 if any,
otherwise, we can consider new IRQs/NMIs which can be injected and call
nested events callback to switch from L2 to L1 if needed and inject the
proper vmexit events. However, 'commit 0ad3bed6c5 ("kvm: nVMX: move
nested events check to kvm_vcpu_running")' results in the handle events
order reversely on non-APICv box. This patch fixes it by bailing out for
pending events and not consider new events in this scenario.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Fixes: 0ad3bed6c5 ("kvm: nVMX: move nested events check to kvm_vcpu_running")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The pointer 'struct desc_struct *d' is unused since commit 8c2e41f7ae
("x86/kvm/vmx: Simplify segment_base()") so let's remove it.
Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
pmc_reprogram_counter() always sets a sample period based on the value of
pmc->counter. However, hsw_hw_config() rejects sample periods less than
2^31 - 1. So for example, if a KVM guest does
struct perf_event_attr attr;
memset(&attr, 0, sizeof(attr));
attr.type = PERF_TYPE_RAW;
attr.size = sizeof(attr);
attr.config = 0x2005101c4; // conditional branches retired IN_TXCP
attr.sample_period = 0;
int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
the guest kernel counts some conditional branch events, then updates the
virtual PMU register with a nonzero count. The host reaches
pmc_reprogram_counter() with nonzero pmc->counter, triggers EOPNOTSUPP
in hsw_hw_config(), prints "kvm_pmu: event creation failed" in
pmc_reprogram_counter(), and silently (from the guest's point of view) stops
counting events.
We fix event counting by forcing attr.sample_period to always be zero for
in_tx_cp counters. Sampling doesn't work, but it already didn't work and
can't be fixed without major changes to the approach in hsw_hw_config().
Signed-off-by: Robert O'Callahan <robert@ocallahan.org>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Pull KVM updates from Paolo Bonzini:
"4.11 is going to be a relatively large release for KVM, with a little
over 200 commits and noteworthy changes for most architectures.
ARM:
- GICv3 save/restore
- cache flushing fixes
- working MSI injection for GICv3 ITS
- physical timer emulation
MIPS:
- various improvements under the hood
- support for SMP guests
- a large rewrite of MMU emulation. KVM MIPS can now use MMU
notifiers to support copy-on-write, KSM, idle page tracking,
swapping, ballooning and everything else. KVM_CAP_READONLY_MEM is
also supported, so that writes to some memory regions can be
treated as MMIO. The new MMU also paves the way for hardware
virtualization support.
PPC:
- support for POWER9 using the radix-tree MMU for host and guest
- resizable hashed page table
- bugfixes.
s390:
- expose more features to the guest
- more SIMD extensions
- instruction execution protection
- ESOP2
x86:
- improved hashing in the MMU
- faster PageLRU tracking for Intel CPUs without EPT A/D bits
- some refactoring of nested VMX entry/exit code, preparing for live
migration support of nested hypervisors
- expose yet another AVX512 CPUID bit
- host-to-guest PTP support
- refactoring of interrupt injection, with some optimizations thrown
in and some duct tape removed.
- remove lazy FPU handling
- optimizations of user-mode exits
- optimizations of vcpu_is_preempted() for KVM guests
generic:
- alternative signaling mechanism that doesn't pound on
tsk->sighand->siglock"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (195 commits)
x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
x86/paravirt: Change vcp_is_preempted() arg type to long
KVM: VMX: use correct vmcs_read/write for guest segment selector/base
x86/kvm/vmx: Defer TR reload after VM exit
x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss
x86/kvm/vmx: Simplify segment_base()
x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels
x86/kvm/vmx: Don't fetch the TSS base from the GDT
x86/asm: Define the kernel TSS limit in a macro
kvm: fix page struct leak in handle_vmon
KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for now
KVM: Return an error code only as a constant in kvm_get_dirty_log()
KVM: Return an error code only as a constant in kvm_get_dirty_log_protect()
KVM: Return directly after a failed copy_from_user() in kvm_vm_compat_ioctl()
KVM: x86: remove code for lazy FPU handling
KVM: race-free exit from KVM_RUN without POSIX signals
KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug message
KVM: PPC: Book3S PR: Ratelimit copy data failure error messages
KVM: Support vCPU-based gfn->hva cache
KVM: use separate generations for each address space
...
Guest segment selector is 16 bit field and guest segment base is natural
width field. Fix two incorrect invocations accordingly.
Without this patch, build fails when aggressive inlining is used with ICC.
Cc: stable@vger.kernel.org
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel's VMX is daft and resets the hidden TSS limit register to 0x67
on VMX reload, and the 0x67 is not configurable. KVM currently
reloads TR using the LTR instruction on every exit, but this is quite
slow because LTR is serializing.
The 0x67 limit is entirely harmless unless ioperm() is in use, so
defer the reload until a task using ioperm() is actually running.
Here's some poorly done benchmarking using kvm-unit-tests:
Before:
cpuid 1313
vmcall 1195
mov_from_cr8 11
mov_to_cr8 17
inl_from_pmtimer 6770
inl_from_qemu 6856
inl_from_kernel 2435
outl_to_kernel 1402
After:
cpuid 1291
vmcall 1181
mov_from_cr8 11
mov_to_cr8 16
inl_from_pmtimer 6457
inl_from_qemu 6209
inl_from_kernel 2339
outl_to_kernel 1391
Signed-off-by: Andy Lutomirski <luto@kernel.org>
[Force-reload TR in invalidate_tss_limit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use actual pointer types for pointers (instead of unsigned long) and
replace hardcoded constants with the appropriate self-documenting
macros.
The function is still a bit messy, but this seems a lot better than
before to me.
This is mostly borrowed from a patch by Thomas Garnier.
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It was a bit buggy (it didn't list all segment types that needed
64-bit fixups), but the bug was irrelevant because it wasn't called
in any interesting context on 64-bit kernels and was only used for
data segents on 32-bit kernels.
To avoid confusion, make it explicitly 32-bit only.
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The current CPU's TSS base is a foregone conclusion, so there's no need
to parse it out of the segment tables. This should save a couple cycles
(as STR is surely microcoded and poorly optimized) but, more importantly,
it's a cleanup and it means that segment_base() will never be called on
64-bit kernels.
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull scheduler updates from Ingo Molnar:
"The main changes in this (fairly busy) cycle were:
- There was a class of scheduler bugs related to forgetting to update
the rq-clock timestamp which can cause weird and hard to debug
problems, so there's a new debug facility for this: which uncovered
a whole lot of bugs which convinced us that we want to keep the
debug facility.
(Peter Zijlstra, Matt Fleming)
- Various cputime related updates: eliminate cputime and use u64
nanoseconds directly, simplify and improve the arch interfaces,
implement delayed accounting more widely, etc. - (Frederic
Weisbecker)
- Move code around for better structure plus cleanups (Ingo Molnar)
- Move IO schedule accounting deeper into the scheduler plus related
changes to improve the situation (Tejun Heo)
- ... plus a round of sched/rt and sched/deadline fixes, plus other
fixes, updats and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
sched/core: Remove unlikely() annotation from sched_move_task()
sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
sched/topology: Split out scheduler topology code from core.c into topology.c
sched/core: Remove unnecessary #include headers
sched/rq_clock: Consolidate the ordering of the rq_clock methods
delayacct: Include <uapi/linux/taskstats.h>
sched/core: Clean up comments
sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
sched/clock: Add dummy clear_sched_clock_stable() stub function
sched/cputime: Remove generic asm headers
sched/cputime: Remove unused nsec_to_cputime()
s390, sched/cputime: Remove unused cputime definitions
powerpc, sched/cputime: Remove unused cputime definitions
s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
ia64, sched/cputime: Remove unused cputime definitions
ia64: Convert vtime to use nsec units directly
ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
sched/cputime: Remove jiffies based cputime
sched/cputime, vtime: Return nsecs instead of cputime_t to account
sched/cputime: Complete nsec conversion of tick based accounting
...
handle_vmon gets a reference on VMXON region page,
but does not release it. Release the reference.
Found by syzkaller; based on a patch by Dmitry.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The purpose of the KVM_SET_SIGNAL_MASK API is to let userspace "kick"
a VCPU out of KVM_RUN through a POSIX signal. A signal is attached
to a dummy signal handler; by blocking the signal outside KVM_RUN and
unblocking it inside, this possible race is closed:
VCPU thread service thread
--------------------------------------------------------------
check flag
set flag
raise signal
(signal handler does nothing)
KVM_RUN
However, one issue with KVM_SET_SIGNAL_MASK is that it has to take
tsk->sighand->siglock on every KVM_RUN. This lock is often on a
remote NUMA node, because it is on the node of a thread's creator.
Taking this lock can be very expensive if there are many userspace
exits (as is the case for SMP Windows VMs without Hyper-V reference
time counter).
As an alternative, we can put the flag directly in kvm_run so that
KVM can see it:
VCPU thread service thread
--------------------------------------------------------------
raise signal
signal handler
set run->immediate_exit
KVM_RUN
check run->immediate_exit
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Provide versions of struct gfn_to_hva_cache functions that
take vcpu as a parameter instead of struct kvm. The existing functions
are not needed anymore, so delete them. This allows dirty pages to
be logged in the vcpu dirty ring, instead of the global dirty ring,
for ring-based dirty memory tracking.
Signed-off-by: Lei Cao <lei.cao@stratus.com>
Message-Id: <CY1PR08MB19929BD2AC47A291FD680E83F04F0@CY1PR08MB1992.namprd08.prod.outlook.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>