linux

mirror of https://github.com/raspberrypi/linux.git synced 2025-12-15 22:41:38 +00:00

Author	SHA1	Message	Date
Sean Christopherson	28cec7f08b	KVM: x86/mmu: Check that root is valid/loaded when pre-faulting SPTEs Error out if kvm_mmu_reload() fails when pre-faulting memory, as trying to fault-in SPTEs will fail miserably due to root.hpa pointing at garbage. Note, kvm_mmu_reload() can return -EIO and thus trigger the WARN on -EIO in kvm_vcpu_pre_fault_memory(), but all such paths also WARN, i.e. the WARN isn't user-triggerable and won't run afoul of warn-on-panic because the kernel would already be panicking. BUG: unable to handle page fault for address: 000029ffffffffe8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP CPU: 22 PID: 1069 Comm: pre_fault_memor Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #548 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:is_page_fault_stale+0x3e/0xe0 [kvm] RSP: 0018:ffffc9000114bd48 EFLAGS: 00010206 RAX: 00003fffffffffc0 RBX: ffff88810a07c080 RCX: ffffc9000114bd78 RDX: ffff88810a07c080 RSI: ffffea0000000000 RDI: ffff88810a07c080 RBP: ffffc9000114bd78 R08: 00007fa3c8c00000 R09: 8000000000000225 R10: ffffea00043d7d80 R11: 0000000000000000 R12: ffff88810a07c080 R13: 0000000100000000 R14: ffffc9000114be58 R15: 0000000000000000 FS: 00007fa3c9da0740(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000029ffffffffe8 CR3: 000000011d698000 CR4: 0000000000352eb0 Call Trace: <TASK> kvm_tdp_page_fault+0xcc/0x160 [kvm] kvm_mmu_do_page_fault+0xfb/0x1f0 [kvm] kvm_arch_vcpu_pre_fault_memory+0xd0/0x1a0 [kvm] kvm_vcpu_ioctl+0x761/0x8c0 [kvm] __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Modules linked in: kvm_intel kvm CR2: 000029ffffffffe8 ---[ end trace 0000000000000000 ]--- Fixes: `6e01b7601d` ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()") Reported-by: syzbot+23786faffb695f17edaa@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000002b84dc061dd73544@google.com Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: xingwei lee <xrivendell7@gmail.com> Tested-by: yuxin wang <wang1315768607@163.com> Link: https://lore.kernel.org/r/20240723000211.3352304-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:32 -07:00
Yan Zhao	e03a7caa53	KVM: x86/mmu: Fixup comments missed by the REMOVED_SPTE=>FROZEN_SPTE rename Replace "removed" with "frozen" in comments as appropriate to complete the rename of REMOVED_SPTE to FROZEN_SPTE. Fixes: `964cea8171` ("KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE") Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://lore.kernel.org/r/20240712233438.518591-1-rick.p.edgecombe@intel.com [sean: write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-08-22 11:25:31 -07:00
Ackerley Tng	aca0ec970d	KVM: x86/mmu: fix determination of max NPT mapping level for private pages The `if (req_max_level)` test was meant ignore req_max_level if PG_LEVEL_NONE was returned. Hence, this function should return max_level instead of the ignored req_max_level. This is only a latent issue for now, since guest_memfd does not support large pages. Signed-off-by: Ackerley Tng <ackerleytng@google.com> Message-ID: <20240801173955.1975034-1-ackerleytng@google.com> Fixes: `f32fb32820` ("KVM: x86: Add hook for determining max NPT mapping level") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-08-01 14:13:11 -04:00
Paolo Bonzini	4b5f67120a	KVM: extend kvm_range_has_memory_attributes() to check subset of attributes While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE, KVM code such as kvm_mem_is_private() is written to expect their existence. Allow using kvm_range_has_memory_attributes() as a multi-page version of kvm_mem_is_private(), without it breaking later when more attributes are introduced. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 14:46:15 -04:00
Paolo Bonzini	5932ca411e	KVM: x86: disallow pre-fault for SNP VMs before initialization KVM_PRE_FAULT_MEMORY for an SNP guest can race with sev_gmem_post_populate() in bad ways. The following sequence for instance can potentially trigger an RMP fault: thread A, sev_gmem_post_populate: called thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP thread A, sev_gmem_post_populate: vaddr = kmap_local_pfn(pfn + i); thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i PAGE_SIZE, PAGE_SIZE); RMP #PF Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's initial private memory contents have been finalized via KVM_SEV_SNP_LAUNCH_FINISH. Beyond fixing this issue, it just sort of makes sense to enforce this, since the KVM_PRE_FAULT_MEMORY documentation states: "KVM maps memory as if the vCPU generated a stage-2 read page fault" which sort of implies we should be acting on the same guest state that a vCPU would see post-launch after the initial guest memory is all set up. Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-26 14:46:14 -04:00
Wei Wang	896046474f	KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops Introduces kvm_x86_call(), to streamline the usage of static calls of kvm_x86_ops. The current implementation of these calls is verbose and could lead to alignment challenges. This makes the code susceptible to exceeding the "80 columns per single line of code" limit as defined in the coding-style document. Another issue with the existing implementation is that the addition of kvm_x86_ prefix to hooks at the static_call sites hinders code readability and navigation. kvm_x86_call() is added to improve code readability and maintainability, while adhering to the coding style guidelines. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 12:14:12 -04:00
Sean Christopherson	9fe17d2ada	KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro Tweak the definition of make_huge_page_split_spte() to eliminate an unnecessarily long line, and opportunistically initialize child_spte to make it more obvious that the child is directly derived from the huge parent. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240712151335.1242633-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 09:56:56 -04:00
Sean Christopherson	3d4415ed75	KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE Bug the VM instead of simply warning if KVM tries to split a SPTE that is non-present or not-huge. KVM is guaranteed to end up in a broken state as the callers fully expect a valid SPTE, e.g. the shadow MMU will add an rmap entry, and all MMUs will account the expected small page. Returning '0' is also technically wrong now that SHADOW_NONPRESENT_VALUE exists, i.e. would cause KVM to create a potential #VE SPTE. While it would be possible to have the callers gracefully handle failure, doing so would provide no practical value as the scenario really should be impossible, while the error handling would add a non-trivial amount of noise. Fixes: `a3fe5dbda0` ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled") Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240712151335.1242633-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-16 09:56:56 -04:00
Paolo Bonzini	5c5ddf7107	Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEAD KVM x86 MTRR virtualization removal Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD hack, and instead always honor guest PAT on CPUs that support self-snoop.	2024-07-16 09:54:57 -04:00
Paolo Bonzini	34b69edecb	Merge tag 'kvm-x86-mmu-6.11' of https://github.com/kvm-x86/linux into HEAD KVM x86 MMU changes for 6.11 - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't hold leafs SPTEs. - Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager page splitting to avoid stalling vCPUs when splitting huge pages. - Misc cleanups	2024-07-16 09:53:28 -04:00
Paolo Bonzini	5dcc1e7614	Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.11 - Add a global struct to consolidate tracking of host values, e.g. EFER, and move "shadow_phys_bits" into the structure as "maxphyaddr". - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX. - Print the name of the APICv/AVIC inhibits in the relevant tracepoint. - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor. - Misc cleanups	2024-07-16 09:53:05 -04:00
Paolo Bonzini	f3996d4d79	Merge branch 'kvm-prefault' into HEAD Pre-population has been requested several times to mitigate KVM page faults during guest boot or after live migration. It is also required by TDX before filling in the initial guest memory with measured contents. Introduce it as a generic API.	2024-07-12 11:18:45 -04:00
Paolo Bonzini	6e01b7601d	KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory() Wire KVM_PRE_FAULT_MEMORY ioctl to kvm_mmu_do_page_fault() to populate guest memory. It can be called right after KVM_CREATE_VCPU creates a vCPU, since at that point kvm_mmu_create() and kvm_init_mmu() are called and the vCPU is ready to invoke the KVM page fault handler. The helper function kvm_tdp_map_page() takes care of the logic to process RET_PF_* return values and convert them to success or errno. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-12 11:17:47 -04:00
Paolo Bonzini	58ef24699b	KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level The guest memory population logic will need to know what page size or level (4K, 2M, ...) is mapped. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <eabc3f3e5eb03b370cadf6e1901ea34d7a020adc.1712785629.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-12 11:17:36 -04:00
Sean Christopherson	f5e7f00cf1	KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault" Move the accounting of the result of kvm_mmu_do_page_fault() to its callers, as only pf_fixed is common to guest page faults and async #PFs, and upcoming support KVM_PRE_FAULT_MEMORY won't bump _any_ stats. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-12 11:17:35 -04:00
Sean Christopherson	5186ec223b	KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler Account stat.pf_taken in kvm_mmu_page_fault(), i.e. the actual page fault handler, instead of conditionally bumping it in kvm_mmu_do_page_fault(). The "real" page fault handler is the only path that should ever increment the number of taken page faults, as all other paths that "do page fault" are by definition not handling faults that occurred in the guest. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-07-12 11:17:35 -04:00
Rick Edgecombe	c2f38f75fc	KVM: x86/tdp_mmu: Take a GFN in kvm_tdp_mmu_fast_pf_get_last_sptep() Pass fault->gfn into kvm_tdp_mmu_fast_pf_get_last_sptep(), instead of passing fault->addr and then converting it to a GFN. Future changes will make fault->addr and fault->gfn differ when running TDX guests. The GFN will be conceptually the same as it is for normal VMs, but fault->addr may contain a TDX specific bit that differentiates between "shared" and "private" memory. This bit will be used to direct faults to be handled on different roots, either the normal "direct" root or a new type of root that handles private memory. The TDP iterators will process the traditional GFN concept and apply the required TDX specifics depending on the root type. For this reason, it needs to operate on regular GFN and not the addr, which may contain these special TDX specific bits. Today kvm_tdp_mmu_fast_pf_get_last_sptep() takes fault->addr and then immediately converts it to a GFN with a bit shift. However, this would unfortunately retain the TDX specific bits in what is supposed to be a traditional GFN. Excluding TDX's needs, it is also is unnecessary to pass fault->addr and convert it to a GFN when the GFN is already on hand. So instead just pass the GFN into kvm_tdp_mmu_fast_pf_get_last_sptep() and use it directly. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240619223614.290657-9-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-20 18:43:31 -04:00
Rick Edgecombe	964cea8171	KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE Rename REMOVED_SPTE to FROZEN_SPTE so that it can be used for other multi-part operations. REMOVED_SPTE is used as a non-present intermediate value for multi-part operations that can happen when a thread doesn't have an MMU write lock. Today these operations are when removing PTEs. However, future changes will want to use the same concept for setting a PTE. In that case the REMOVED_SPTE name does not quite fit. So rename it to FROZEN_SPTE so it can be used for both types of operations. Also rename the relevant helpers and comments that refer to "removed" within the context of the SPTE value. Take care to not update naming referring the "remove" operations, which are still distinct. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240619223614.290657-2-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-20 18:43:31 -04:00
Paolo Bonzini	02b0d3b9d4	Merge branch 'kvm-6.10-fixes' into HEAD	2024-06-20 17:31:50 -04:00
Isaku Yamahata	8a4e2742a5	KVM: x86/tdp_mmu: Sprinkle __must_check The TDP MMU function __tdp_mmu_set_spte_atomic uses a cmpxchg64 to replace the SPTE value and returns -EBUSY on failure. The caller must check the return value and retry. Add __must_check to it, as well as to two more functions that forward the return value of __tdp_mmu_set_spte_atomic to their caller. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-20 17:28:44 -04:00
David Matlack	0089c055b5	KVM: x86/mmu: Avoid reacquiring RCU if TDP MMU fails to allocate an SP Avoid needlessly reacquiring the RCU read lock if the TDP MMU fails to allocate a shadow page during eager page splitting. Opportunistically drop the local variable ret as well now that it's no longer necessary. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240611220512.2426439-5-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-14 09:25:03 -07:00
David Matlack	3d4a5a45ca	KVM: x86/mmu: Unnest TDP MMU helpers that allocate SPs for eager splitting Move the implementation of tdp_mmu_alloc_sp_for_split() to its one and only caller to reduce unnecessary nesting and make it more clear why the eager split loop continues after allocating a new SP. Opportunistically drop the double-underscores from __tdp_mmu_alloc_sp_for_split() now that its parent is gone. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240611220512.2426439-4-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-14 09:24:23 -07:00
David Matlack	e1c04f7a9f	KVM: x86/mmu: Hard code GFP flags for TDP MMU eager split allocations Now that the GFP_NOWAIT case is gone, hard code GFP_KERNEL_ACCOUNT when allocating shadow pages during eager page splitting in the TDP MMU. Opportunistically replace use of __GFP_ZERO with allocations that zero to improve readability. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240611220512.2426439-3-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-14 09:23:40 -07:00
David Matlack	cf3ff0ee24	KVM: x86/mmu: Always drop mmu_lock to allocate TDP MMU SPs for eager splitting Always drop mmu_lock to allocate shadow pages in the TDP MMU when doing eager page splitting. Dropping mmu_lock during eager page splitting is cheap since KVM does not have to flush remote TLBs, and avoids stalling vCPU threads that are taking page faults while KVM is eager splitting under mmu_lock held for write. This change reduces 20%+ dips in MySQL throughput during live migration in a 160 vCPU VM while userspace is issuing CLEAR_DIRTY_LOG ioctls (tested with 1GiB and 8GiB CLEARs). Userspace could issue finer-grained CLEARs, which would also reduce contention on mmu_lock, but doing so will increase the rate of remote TLB flushing, since KVM must flush TLBs before returning from CLEAR_DITY_LOG. When there isn't contention on mmu_lock[1], this change does not regress the time it takes to perform eager page splitting (the cost of releasing and re-acquiring an uncontended lock is minimal on x86). [1] Tested with dirty_log_perf_test, which does not run vCPUs during eager page splitting, and with a 16 vCPU VM Live Migration with manual-protect disabled (where mmu_lock is held for read). Cc: Bibo Mao <maobibo@loongson.cn> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240611220512.2426439-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-14 09:22:22 -07:00
Sean Christopherson	caa7278829	KVM: x86/mmu: Rephrase comment about synthetic PFERR flags in #PF handler Reword the BUILD_BUG_ON() comment in the legacy #PF handler to explicitly describe how asserting that synthetic PFERR flags are limited to bits 31:0 protects KVM against inadvertently passing a synthetic flag to the common page fault handler. No functional change intended. Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240608001108.3296879-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-14 09:20:47 -07:00
Sean Christopherson	377b2f359d	KVM: VMX: Always honor guest PAT on CPUs that support self-snoop Unconditionally honor guest PAT on CPUs that support self-snoop, as Intel has confirmed that CPUs that support self-snoop always snoop caches and store buffers. I.e. CPUs with self-snoop maintain cache coherency even in the presence of aliased memtypes, thus there is no need to trust the guest behaves and only honor PAT as a last resort, as KVM does today. Honoring guest PAT is desirable for use cases where the guest has access to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual (mediated, for all intents and purposes) GPU is exposed to the guest, along with buffers that are consumed directly by the physical GPU, i.e. which can't be proxied by the host to ensure writes from the guest are performed with the correct memory type for the GPU. Cc: Yiwei Zhang <zzyiwei@google.com> Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Suggested-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-07 07:18:03 -07:00
Sean Christopherson	0a7b73559b	KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR adds no value, negatively impacts guest performance, and is a maintenance burden due to it's complexity and oddities. KVM's approach to virtualizating MTRRs make no sense, at all. KVM only honors guest MTRR memtypes if EPT is enabled and the guest has a device that may perform non-coherent DMA access. From a hardware virtualization perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy shadowing paging doesn't magically account for guest MTRRs, nor does NPT. Unwinding and deciphering KVM's murky history, the MTRR virtualization code appears to be the result of misdiagnosed issues when EPT + VT-d with passthrough devices was enabled years and years ago. And importantly, the underlying bugs that were fudged around by honoring guest MTRR memtypes have since been fixed (though rather poorly in some cases). The zapping GFNs logic in the MTRR virtualization code came from: commit `efdfe536d8` Author: Xiao Guangrong <guangrong.xiao@linux.intel.com> Date: Wed May 13 14:42:27 2015 +0800 KVM: MMU: fix MTRR update Currently, whenever guest MTRR registers are changed kvm_mmu_reset_context is called to switch to the new root shadow page table, however, it's useless since: 1) the cache type is not cached into shadow page's attribute so that the original root shadow page will be reused 2) the cache type is set on the last spte, that means we should sync the last sptes when MTRR is changed This patch fixs this issue by drop all the spte in the gfn range which is being updated by MTRR which was a fix for: commit `0bed3b568b` Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Oct 9 16:01:54 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Wed Dec 31 16:51:44 2008 +0200 KVM: Improve MTRR structure As well as reset mmu context when set MTRR. which was part of a "MTRR/PAT support for EPT" series that also added: + if (mt_mask) { + mt_mask = get_memory_type(vcpu, gfn) << + kvm_x86_ops->get_mt_mask_shift(); + spte \|= mt_mask; + } where get_memory_type() was a truly gnarly helper to retrieve the guest MTRR memtype for a given memtype. And very subtly, at the time of that change, KVM always set VMX_EPT_IGMT_BIT, kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK \| VMX_EPT_WRITABLE_MASK \| VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT \| VMX_EPT_IGMT_BIT); which came in via: commit `928d4bf747` Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Nov 6 14:55:45 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Tue Nov 11 21:00:37 2008 +0200 KVM: VMX: Set IGMT bit in EPT entry There is a potential issue that, when guest using pagetable without vmexit when EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory, which would be inconsistent with host side and would cause host MCE due to inconsistent cache attribute. The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default memory type to protect host (notice that all memory mapped by KVM should be WB). Note the CommitDates! The AuthorDates strongly suggests Sheng Yang added the whole "ignoreIGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough enabling, but it was applied earlier because it was a generic fix. Jumping back to `0bed3b568b` ("KVM: Improve MTRR structure"), the other relevant code, or rather lack thereof, is the handling of host MMIO. That fix came in a bit later, but given the author and timing, it's safe to say it was all part of the same EPT+VT-d enabling mess. commit `2aaf69dcee` Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Wed Jan 21 16:52:16 2009 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Sun Feb 15 02:47:37 2009 +0200 KVM: MMU: Map device MMIO as UC in EPT Software are not allow to access device MMIO using cacheable memory type, the patch limit MMIO region with UC and WC(guest can select WC using PAT and PCD/PWT). In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization was obviously never tested on NPT until much later, which lends further credence to the theory/argument that this was all the result of misdiagnosed issues. Discussion from the EPT+MTRR enabling thread[] more or less confirms that Sheng Yang was trying to resolve issues with passthrough MMIO. Sheng Yang : Do you mean host(qemu) would access this memory and if we set it to guest : MTRR, host access would be broken? We would cover this in our shadow MTRR : patch, for we encountered this in video ram when doing some experiment with : VGA assignment. And in the same thread, there's also what appears to be confirmation of Intel running into issues with Windows XP related to a guest device driver mapping DMA with WC in the PAT. * Avi Kavity : Sheng Yang wrote: : > Yes... But it's easy to do with assigned devices' mmio, but what if guest : > specific some non-mmio memory's memory type? E.g. we have met one issue in : > Xen, that a assigned-device's XP driver specific one memory region as buffer, : > and modify the memory type then do DMA. : > : > Only map MMIO space can be first step, but I guess we can modify assigned : > memory region memory type follow guest's? : > : : With ept/npt, we can't, since the memory type is in the guest's : pagetable entries, and these are not accessible. [] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com So, for the most part, what likely happened is that 15 years ago, a few engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough device MMIO by emulating guest* MTRRs. Except for the below case, everything since then has been a result of those two intertwined changes. The one exception, which is actually yet more confirmation of all of the above, is the revert of Paolo's attempt at "full" virtualization of guest MTRRs: commit `606decd670` Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:12:47 2015 +0200 Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages" This reverts commit `fd717f1101`. It was reported to cause Machine Check Exceptions (bug 104091). ... commit `fd717f1101` Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:38:13 2015 +0200 KVM: x86: apply guest MTRR virtualization on host reserved pages Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true. However, the guest could prefer a different page type than UC for such pages. A good example is that pass-throughed VGA frame buffer is not always UC as host expected. This patch enables full use of virtual guest MTRRs. I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT" and got the same result: machine checks, likely due to the guest MTRRs not being trustworthy/sane at all times. Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that too got reverted. Unfortunately, it doesn't appear that anyone ever found a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times doesn't appear to have a definitive root cause. commit `fc07e76ac7` Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:20:22 2015 +0200 Revert "KVM: SVM: use NPT page attributes" This reverts commit `3c2e7f7de3`. Initializing the mapping from MTRR to PAT values was reported to fail nondeterministically, and it also caused extremely slow boot (due to caching getting disabled---bug 103321) with assigned devices. ... commit `3c2e7f7de3` Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:32:17 2015 +0200 KVM: SVM: use NPT page attributes Right now, NPT page attributes are not used, and the final page attribute depends solely on gPAT (which however is not synced correctly), the guest MTRRs and the guest page attributes. However, we can do better by mimicking what is done for VMX. In the absence of PCI passthrough, the guest PAT can be ignored and the page attributes can be just WB. If passthrough is being used, instead, keep respecting the guest PAT, and emulate the guest MTRRs through the PAT field of the nested page tables. The only snag is that WP memory cannot be emulated correctly, because Linux's default PAT setting only includes the other types. In short, honoring guest MTRRs for VMX was initially a workaround of sorts for KVM ignoring guest PAT and for KVM not forcing UC for host MMIO. And while there are known cases where honoring guest MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in that case is to get WC instead of UC, i.e. at this point it's for performance, not correctness. Furthermore, the complete absence of MTRR virtualization on NPT and shadow paging proves that, while KVM theoretically can do better, it's by no means necessary for correctnesss. Lastly, since kernels mostly rely on firmware to do MTRR setup, and the host typically provides guest firmware, honoring guest MTRRs is effectively honoring host userspace memtypes, which is also backwards. I.e. it would be far better for host userspace to communicate its desired memtype directly to KVM (or perhaps indirectly via VMAs in the host kernel), not through guest MTRRs. Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-05 08:13:14 -07:00
Tao Su	db574f2f96	KVM: x86/mmu: Don't save mmu_invalidate_seq after checking private attr Drop the second snapshot of mmu_invalidate_seq in kvm_faultin_pfn(). Before checking the mismatch of private vs. shared, mmu_invalidate_seq is saved to fault->mmu_seq, which can be used to detect an invalidation related to the gfn occurred, i.e. KVM will not install a mapping in page table if fault->mmu_seq != mmu_invalidate_seq. Currently there is a second snapshot of mmu_invalidate_seq, which may not be same as the first snapshot in kvm_faultin_pfn(), i.e. the gfn attribute may be changed between the two snapshots, but the gfn may be mapped in page table without hindrance. Therefore, drop the second snapshot as it has no obvious benefits. Fixes: `f6adeae81f` ("KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn()") Signed-off-by: Tao Su <tao1.su@linux.intel.com> Message-ID: <20240528102234.2162763-1-tao1.su@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-06-05 06:45:06 -04:00
Hou Wenlong	9ecc1c119b	KVM: x86/mmu: Only allocate shadowed translation cache for sp->role.level <= KVM_MAX_HUGEPAGE_LEVEL Only the indirect SP with sp->role.level <= KVM_MAX_HUGEPAGE_LEVEL might have leaf gptes, so allocation of shadowed translation cache is needed only for it. Then, it can use sp->shadowed_translation to determine whether to use the information in the shadowed translation cache or not. Also, extend the WARN in FNAME(sync_spte)() to ensure that this won't break shadow_mmu_get_sp_for_split(). Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: https://lore.kernel.org/r/5b0cda8a7456cda476b14fca36414a56f921dd52.1715398655.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 14:06:39 -07:00
Liang Chen	4f8973e65f	KVM: x86: invalid_list not used anymore in mmu_shrink_scan 'invalid_list' is now gathered in KVM_MMU_ZAP_OLDEST_MMU_PAGES. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Link: https://lore.kernel.org/r/20240509044710.18788-1-liangchen.linux@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 10:55:15 -07:00
Paolo Bonzini	ab978c62e7	Merge branch 'kvm-6.11-sev-snp' into HEAD Pull base x86 KVM support for running SEV-SNP guests from Michael Roth: * add some basic infrastructure and introduces a new KVM_X86_SNP_VM vm_type to handle differences versus the existing KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types. * implement the KVM API to handle the creation of a cryptographic launch context, encrypt/measure the initial image into guest memory, and finalize it before launching it. * implement handling for various guest-generated events such as page state changes, onlining of additional vCPUs, etc. * implement the gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges as well as cleaning them up prior to returning them to the host for use as normal memory. Because those cleanup hooks supplant certain activities like issuing WBINVDs during KVM MMU invalidations, avoid duplicating that work to avoid unecessary overhead. This merge leaves out support support for attestation guest requests and for loading the signing keys to be used for attestation requests.	2024-06-03 13:19:46 -04:00
Sean Christopherson	82897db912	KVM: x86: Move shadow_phys_bits into "kvm_host", as "maxphyaddr" Move shadow_phys_bits into "struct kvm_host_values", i.e. into KVM's global "kvm_host" variable, so that it is automatically exported for use in vendor modules. Rename the variable/field to maxphyaddr to more clearly capture what value it holds, now that it's used outside of the MMU (and because the "shadow" part is more than a bit misleading as the variable is not at all unique to shadow paging). Recomputing the raw/true host.MAXPHYADDR on every use can be subtly expensive, e.g. it will incur a VM-Exit on the CPUID if KVM is running as a nested hypervisor. Vendor code already has access to the information, e.g. by directly doing CPUID or by invoking kvm_get_shadow_phys_bits(), so there's no tangible benefit to making it MMU-only. Link: https://lore.kernel.org/r/20240423221521.2923759-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 08:58:55 -07:00
Sean Christopherson	c043eaaa6b	KVM: x86/mmu: Snapshot shadow_phys_bits when kvm.ko is loaded Snapshot shadow_phys_bits when kvm.ko is loaded, not when a vendor module is loaded, to guard against usage of shadow_phys_bits before it is initialized. The computation isn't vendor specific in any way, i.e. there there is no reason to wait to snapshot the value until a vendor module is loaded, nor is there any reason to recompute the value every time a vendor module is loaded. Opportunistically convert it from "read mostly" to "read-only after init". Link: https://lore.kernel.org/r/20240423221521.2923759-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-06-03 08:58:55 -07:00
Sean Christopherson	bca99c0356	KVM: x86/mmu: Print SPTEs on unexpected #VE Print the SPTEs that correspond to the faulting GPA on an unexpected EPT Violation #VE to help the user debug failures, e.g. to pinpoint which SPTE didn't have SUPPRESS_VE set. Opportunistically assert that the underlying exit reason was indeed an EPT Violation, as the CPU has really gone off the rails if a #VE occurs due to a completely unexpected exit reason. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-23 12:28:45 -04:00
Sean Christopherson	837d557aba	KVM: x86/mmu: Add sanity checks that KVM doesn't create EPT #VE SPTEs Assert that KVM doesn't set a SPTE to a value that could trigger an EPT Violation #VE on a non-MMIO SPTE, e.g. to help detect bugs even without KVM_INTEL_PROVE_VE enabled, and to help debug actual #VE failures. Note, this will run afoul of TDX support, which needs to reflect emulated MMIO accesses into the guest as #VEs (which was the whole point of adding EPT Violation #VE support in KVM). The obvious fix for that is to exempt MMIO SPTEs, but that's annoyingly difficult now that is_mmio_spte() relies on a per-VM value. However, resolving that conundrum is a future problem, whereas getting KVM_INTEL_PROVE_VE healthy is a current problem. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-23 12:27:26 -04:00
Isaku Yamahata	803482f472	KVM: x86/mmu: Use SHADOW_NONPRESENT_VALUE for atomic zap in TDP MMU Use SHADOW_NONPRESENT_VALUE when zapping TDP MMU SPTEs with mmu_lock held for read, tdp_mmu_zap_spte_atomic() was simply missed during the initial development. Fixes: `7f01cab849` ("KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE") Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [sean: write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240518000430.1118488-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-23 12:24:39 -04:00
Brijesh Singh	c63cf135cc	KVM: SEV: Add support to handle RMP nested page faults When SEV-SNP is enabled in the guest, the hardware places restrictions on all memory accesses based on the contents of the RMP table. When hardware encounters RMP check failure caused by the guest memory access it raises the #NPF. The error code contains additional information on the access type. See the APM volume 2 for additional information. When using gmem, RMP faults resulting from mismatches between the state in the RMP table vs. what the guest expects via its page table result in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This means the only expected case that needs to be handled in the kernel is when the page size of the entry in the RMP table is larger than the mapping in the nested page table, in which case a PSMASH instruction needs to be issued to split the large RMP entry into individual 4K entries so that subsequent accesses can succeed. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Message-ID: <20240501085210.2213060-12-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-12 04:09:31 -04:00
Michael Roth	b74d002d3d	KVM: MMU: Disable fast path if KVM_EXIT_MEMORY_FAULT is needed For hardware-protected VMs like SEV-SNP guests, certain conditions like attempting to perform a write to a page which is not in the state that the guest expects it to be in can result in a nested/extended #PF which can only be satisfied by the host performing an implicit page state change to transition the page into the expected shared/private state. This is generally handled by generating a KVM_EXIT_MEMORY_FAULT event that gets forwarded to userspace to handle via KVM_SET_MEMORY_ATTRIBUTES. However, the fast_page_fault() code might misconstrue this situation as being the result of a write-protected access, and treat it as a spurious case when it sees that writes are already allowed for the sPTE. This results in the KVM MMU trying to resume the guest rather than taking any action to satisfy the real source of the #PF such as generating a KVM_EXIT_MEMORY_FAULT, resulting in the guest spinning on nested #PFs. Check for this condition and bail out of the fast path if it is detected. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Cc: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-12 04:09:28 -04:00
Paolo Bonzini	7323260373	Merge branch 'kvm-coco-hooks' into HEAD Common patches for the target-independent functionality and hooks that are needed by SEV-SNP and TDX.	2024-05-12 04:07:01 -04:00
Paolo Bonzini	7d41e24da2	Merge tag 'kvm-x86-misc-6.10' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.10: - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which is unused by hardware, so that KVM can communicate its inability to map GPAs that set bits 51:48 due to lack of 5-level paging. Guest firmware is expected to use the information to safely remap BARs in the uppermost GPA space, i.e to avoid placing a BAR at a legal, but unmappable, GPA. - Use vfree() instead of kvfree() for allocations that always use vcalloc() or __vcalloc(). - Don't completely ignore same-value writes to immutable feature MSRs, as doing so results in KVM failing to reject accesses to MSR that aren't supposed to exist given the vCPU model and/or KVM configuration. - Don't mark APICv as being inhibited due to ABSENT if APICv is disabled KVM-wide to avoid confusing debuggers (KVM will never bother clearing the ABSENT inhibit, even if userspace enables in-kernel local APIC).	2024-05-12 03:18:44 -04:00
Paolo Bonzini	5a1c72e07e	Merge tag 'kvm-x86-mmu-6.10' of https://github.com/kvm-x86/linux into HEAD KVM x86 MMU changes for 6.10: - Process TDP MMU SPTEs that are are zapped while holding mmu_lock for read after replacing REMOVED_SPTE with '0' and flushing remote TLBs, which allows vCPU tasks to repopulate the zapped region while the zapper finishes tearing down the old, defunct page tables. - Fix a longstanding, likely benign-in-practice race where KVM could fail to detect a write from kvm_mmu_track_write() to a shadowed GPTE if the GPTE is first page table being shadowed.	2024-05-12 03:18:30 -04:00
Paolo Bonzini	31a6cd7f16	Merge tag 'kvm-x86-vmx-6.10' of https://github.com/kvm-x86/linux into HEAD KVM VMX changes for 6.10: - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to L1, as per the SDM. - Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is used only when synthesizing nested EPT violation, i.e. it's not the vCPU's "real" exit_qualification, which is tracked elsewhere. - Add a sanity check to assert that EPT Violations are the only sources of nested PML Full VM-Exits.	2024-05-12 03:17:17 -04:00
Paolo Bonzini	4232da23d7	Merge tag 'loongarch-kvm-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.10 1. Add ParaVirt IPI support. 2. Add software breakpoint support. 3. Add mmio trace events support.	2024-05-10 13:20:18 -04:00
Paolo Bonzini	f36508422a	Merge branch 'kvm-coco-pagefault-prep' into HEAD A combination of prep work for TDX and SNP, and a clean up of the page fault path to (hopefully) make it easier to follow the rules for private memory, noslot faults, writes to read-only slots, etc.	2024-05-10 13:18:48 -04:00
Michael Roth	f32fb32820	KVM: x86: Add hook for determining max NPT mapping level In the case of SEV-SNP, whether or not a 2MB page can be mapped via a 2MB mapping in the guest's nested page table depends on whether or not any subpages within the range have already been initialized as private in the RMP table. The existing mixed-attribute tracking in KVM is insufficient here, for instance: - gmem allocates 2MB page - guest issues PVALIDATE on 2MB page - guest later converts a subpage to shared - SNP host code issues PSMASH to split 2MB RMP mapping to 4K - KVM MMU splits NPT mapping to 4K - guest later converts that shared page back to private At this point there are no mixed attributes, and KVM would normally allow for 2MB NPT mappings again, but this is actually not allowed because the RMP table mappings are 4K and cannot be promoted on the hypervisor side, so the NPT mappings must still be limited to 4K to match this. Add a hook to determine the max NPT mapping size in situations like this. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240501085210.2213060-3-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-10 13:11:48 -04:00
Sean Christopherson	2b1f435505	KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns WARN if __kvm_faultin_pfn() generates a "no slot" pfn, and gracefully handle the unexpected behavior instead of continuing on with dangerous state, e.g. tdp_mmu_map_handle_target_level() _only_ checks fault->slot, and so could install a bogus PFN into the guest. The existing code is functionally ok, because kvm_faultin_pfn() pre-checks all of the cases that result in KVM_PFN_NOSLOT, but it is unnecessarily unsafe as it relies on __gfn_to_pfn_memslot() getting the _exact_ same memslot, i.e. not a re-retrieved pointer with KVM_MEMSLOT_INVALID set. And checking only fault->slot would fall apart if KVM ever added a flag or condition that forced emulation, similar to how KVM handles writes to read-only memslots. Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-07 11:59:24 -04:00
Sean Christopherson	f3310e622f	KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values Explicitly set "pfn" and "hva" to error values in kvm_mmu_do_page_fault() to harden KVM against using "uninitialized" values. In quotes because the fields are actually zero-initialized, and zero is a legal value for both page frame numbers and virtual addresses. E.g. failure to set "pfn" prior to creating an SPTE could result in KVM pointing at physical address '0', which is far less desirable than KVM generating a SPTE with reserved PA bits set and thus effectively killing the VM. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-07 11:59:23 -04:00
Sean Christopherson	36d4492765	KVM: x86/mmu: Set kvm_page_fault.hva to KVM_HVA_ERR_BAD for "no slot" faults Explicitly set fault->hva to KVM_HVA_ERR_BAD when handling a "no slot" fault to ensure that KVM doesn't use a bogus virtual address, e.g. if there was a slot but it's unusable (APIC access page), or if there really was no slot, in which case fault->hva will be '0' (which is a legal address for x86). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-07 11:59:23 -04:00
Sean Christopherson	f6adeae81f	KVM: x86/mmu: Handle no-slot faults at the beginning of kvm_faultin_pfn() Handle the "no memslot" case at the beginning of kvm_faultin_pfn(), just after the private versus shared check, so that there's no need to repeatedly query whether or not a slot exists. This also makes it more obvious that, except for private vs. shared attributes, the process of faulting in a pfn simply doesn't apply to gfns without a slot. Opportunistically stuff @fault's metadata in kvm_handle_noslot_fault() so that it doesn't need to be duplicated in all paths that invoke kvm_handle_noslot_fault(), and to minimize the probability of not stuffing the right fields. Leave the existing handle behind, but convert it to a WARN, to guard against __kvm_faultin_pfn() unexpectedly nullifying fault->slot. Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-07 11:59:22 -04:00
Sean Christopherson	cd272fc439	KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn() Move the checks related to the validity of an access to a memslot from the inner __kvm_faultin_pfn() to its sole caller, kvm_faultin_pfn(). This allows emulating accesses to the APIC access page, which don't need to resolve a pfn, even if there is a relevant in-progress mmu_notifier invalidation. Ditto for accesses to KVM internal memslots from L2, which KVM also treats as emulated MMIO. More importantly, this will allow for future cleanup by having the "no memslot" case bail from kvm_faultin_pfn() very early on. Go to rather extreme and gross lengths to make the change a glorified nop, e.g. call into __kvm_faultin_pfn() even when there is no slot, as the related code is very subtle. E.g. fault->slot can be nullified if it points at the APIC access page, some flows in KVM x86 expect fault->pfn to be KVM_PFN_NOSLOT, while others check only fault->slot, etc. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240228024147.41573-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-05-07 11:59:22 -04:00

1 2 3 4 5 ...

1059 Commits