linux

mirror of https://github.com/raspberrypi/linux.git synced 2025-12-20 16:52:28 +00:00

Author	SHA1	Message	Date
Paolo Bonzini	4286a3ec25	Merge tag 'kvm-x86-mmu-6.15' of https://github.com/kvm-x86/linux into HEAD KVM x86/mmu changes for 6.15 Add support for "fast" aging of SPTEs in both the TDP MMU and Shadow MMU, where "fast" means "without holding mmu_lock". Not taking mmu_lock allows multiple aging actions to run in parallel, and more importantly avoids stalling vCPUs, e.g. due to holding mmu_lock for an extended duration while a vCPU is faulting in memory. For the TDP MMU, protect aging via RCU; the page tables are RCU-protected and KVM doesn't need to access any metadata to age SPTEs. For the Shadow MMU, use bit 1 of rmap pointers (bit 0 is used to terminate a list of rmaps) to implement a per-rmap single-bit spinlock. When aging a gfn, acquire the rmap's spinlock with read-only permissions, which allows hardening and optimizing the locking and aging, e.g. locking an rmap for write requires mmu_lock to also be held. The lock is NOT a true R/W spinlock, i.e. multiple concurrent readers aren't supported. To avoid forcing all SPTE updates to use atomic operations (clearing the Accessed bit out of mmu_lock makes it inherently volatile), rework and rename spte_has_volatile_bits() to spte_needs_atomic_update() and deliberately exclude the Accessed bit. KVM (and mm/) already tolerates false positives/negatives for Accessed information, and all testing has shown that reducing the latency of aging is far more beneficial to overall system performance than providing "perfect" young/old information.	2025-03-19 09:04:33 -04:00
Keith Busch	916b7f42b3	kvm: retry nx_huge_page_recovery_thread creation A VMM may send a non-fatal signal to its threads, including vCPU tasks, at any time, and thus may signal vCPU tasks during KVM_RUN. If a vCPU task receives the signal while its trying to spawn the huge page recovery vhost task, then KVM_RUN will fail due to copy_process() returning -ERESTARTNOINTR. Rework call_once() to mark the call complete if and only if the called function succeeds, and plumb the function's true error code back to the call_once() invoker. This provides userspace with the correct, non-fatal error code so that the VMM doesn't terminate the VM on -ENOMEM, and allows subsequent KVM_RUN a succeed by virtue of retrying creation of the NX huge page task. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> [implemented the kvm user side] Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-3-kbusch@meta.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-03-01 02:54:18 -05:00
Keith Busch	cb380909ae	vhost: return task creation error instead of NULL Lets callers distinguish why the vhost task creation failed. No one currently cares why it failed, so no real runtime change from this patch, but that will not be the case for long. Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-2-kbusch@meta.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-03-01 02:52:52 -05:00
Sean Christopherson	af3b6a9eba	KVM: x86/mmu: Walk rmaps (shadow MMU) without holding mmu_lock when aging gfns Convert the shadow MMU to use per-rmap locking instead of the per-VM mmu_lock to protect rmaps when aging SPTEs. When A/D bits are enabled, it is safe to simply clear the Accessed bits, i.e. KVM just needs to ensure the parent page table isn't freed. The less obvious case is marking SPTEs for access tracking in the non-A/D case (for EPT only). Because aging a gfn means making the SPTE not-present, KVM needs to play nice with the case where the CPU has TLB entries for a SPTE that is not-present in memory. For example, when doing dirty tracking, if KVM encounters a non-present shadow accessed SPTE, KVM must know to do a TLB invalidation. Fortunately, KVM already provides (and relies upon) the necessary functionality. E.g. KVM doesn't flush TLBs when aging pages (even in the clear_flush_young() case), and when harvesting dirty bitmaps, KVM flushes based on the dirty bitmaps, not on SPTEs. Co-developed-by: James Houghton <jthoughton@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-12-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-14 07:17:47 -08:00
Sean Christopherson	bb6c7749cc	KVM: x86/mmu: Add support for lockless walks of rmap SPTEs Add a lockless version of for_each_rmap_spte(), which is pretty much the same as the normal version, except that it doesn't BUG() the host if a non-present SPTE is encountered. When mmu_lock is held, it should be impossible for a different task to zap a SPTE, _and_ zapped SPTEs must be removed from their rmap chain prior to dropping mmu_lock. Thus, the normal walker BUG()s if a non-present SPTE is encountered as something is wildly broken. When walking rmaps without holding mmu_lock, the SPTEs pointed at by the rmap chain can be zapped/dropped, and so a lockless walk can observe a non-present SPTE if it runs concurrently with a different operation that is zapping SPTEs. Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-11-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-14 07:17:44 -08:00
Sean Christopherson	4834eaded9	KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock Steal another bit from rmap entries (which are word aligned pointers, i.e. have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use the bit to implement a very rudimentary per-rmap spinlock. The only anticipated usage of the lock outside of mmu_lock is for aging gfns, and collisions between aging and other MMU rmap operations are quite rare, e.g. unless userspace is being silly and aging a tiny range over and over in a tight loop, time between contention when aging an actively running VM is O(seconds). In short, a more sophisticated locking scheme shouldn't be necessary. Note, the lock only protects the rmap structure itself, SPTEs that are pointed at by a locked rmap can still be modified and zapped by another task (KVM drops/zaps SPTEs before deleting the rmap entries) Co-developed-by: James Houghton <jthoughton@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-10-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-14 07:17:40 -08:00
Sean Christopherson	9fb13ba6b5	KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock Refactor the pte_list and rmap code to always read and write rmap_head->val exactly once, e.g. by collecting changes in a local variable and then propagating those changes back to rmap_head->val as appropriate. This will allow implementing a per-rmap rwlock (of sorts) by adding a LOCKED bit into the rmap value alongside the MANY bit. Signed-off-by: James Houghton <jthoughton@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Reviewed-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-9-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-14 07:17:33 -08:00
James Houghton	8c403cf231	KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0 When aging SPTEs and the TDP MMU is enabled, process the shadow MMU if and only if the VM has at least one shadow page, as opposed to checking if the VM has rmaps. Checking for rmaps will effectively yield a false positive if the VM ran nested TDP VMs in the past, but is not currently doing so. Signed-off-by: James Houghton <jthoughton@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-8-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-14 07:17:28 -08:00
James Houghton	e25c233234	KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young Reorder the processing of the TDP MMU versus the shadow MMU when aging SPTEs, and skip the shadow MMU entirely in the test-only case if the TDP MMU reports that the page is young, i.e. completely avoid taking mmu_lock if the TDP MMU SPTE is young. Swap the order for the test-and-age helper as well for consistency. Signed-off-by: James Houghton <jthoughton@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-7-jthoughton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-14 07:17:23 -08:00
Sean Christopherson	b146a9b34a	KVM: x86/mmu: Age TDP MMU SPTEs without holding mmu_lock Walk the TDP MMU in an RCU read-side critical section without holding mmu_lock when harvesting and potentially updating age information on TDP MMU SPTEs. Add a new macro to do RCU-safe walking of TDP MMU roots, and do all SPTE aging with atomic updates; while clobbering Accessed information is ok, KVM must not corrupt other bits, e.g. must not drop a Dirty or Writable bit when making a SPTE young.. If updating a SPTE to mark it for access tracking fails, leave it as is and treat it as if it were young. If the spte is being actively modified, it is most likely young. Acquire and release mmu_lock for write when harvesting age information from the shadow MMU, as the shadow MMU doesn't yet support aging outside of mmu_lock. Suggested-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-5-jthoughton@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-14 07:17:17 -08:00
James Houghton	61d65f2dc7	KVM: x86/mmu: Don't force atomic update if only the Accessed bit is volatile Don't force SPTE modifications to be done atomically if the only volatile bit in the SPTE is the Accessed bit. KVM and the primary MMU tolerate stale aging state, and the probability of an Accessed bit A/D assist being clobbered and affecting again is likely far lower than the probability of consuming stale information due to not flushing TLBs when aging. Rename spte_has_volatile_bits() to spte_needs_atomic_update() to better capture the nature of the helper. Opportunstically do s/write/update on the TDP MMU wrapper, as it's not simply the "write" that needs to be done atomically, it's the entire update, i.e. the entire read-modify-write operation needs to be done atomically so that KVM has an accurate view of the old SPTE. Leave kvm_tdp_mmu_write_spte_atomic() as is. While the name is imperfect, it pairs with kvm_tdp_mmu_write_spte(), which in turn pairs with kvm_tdp_mmu_read_spte(). And renaming all of those isn't obviously a net positive, and would require significant churn. Signed-off-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250204004038.1680123-6-jthoughton@google.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-14 07:16:45 -08:00
Sean Christopherson	46d6c6f3ef	KVM: nSVM: Enter guest mode before initializing nested NPT MMU When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest mode prior to initializing the MMU for nested NPT so that guest_mode is set in the MMU's role. KVM's model is that all L2 MMUs are tagged with guest_mode, as the behavior of hypervisor MMUs tends to be significantly different than kernel MMUs. Practically speaking, the bug is relatively benign, as KVM only directly queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths that are optimizations (mmu_page_zap_pte() and shadow_mmu_try_split_huge_pages()). And while the role is incorprated into shadow page usage, because nested NPT requires KVM to be using NPT for L1, reusing shadow pages across L1 and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs will have direct=0. Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning of the flow instead of forcing guest_mode in the MMU, as nothing in nested_vmcb02_prepare_control() between the old and new locations touches TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present inconsistent vCPU state to the MMU. Fixes: `69cb877487` ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control") Cc: stable@vger.kernel.org Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-02-12 08:57:55 -08:00
Sean Christopherson	43fb96ae78	KVM: x86/mmu: Ensure NX huge page recovery thread is alive before waking When waking a VM's NX huge page recovery thread, ensure the thread is actually alive before trying to wake it. Now that the thread is spawned on-demand during KVM_RUN, a VM without a recovery thread is reachable via the related module params. BUG: kernel NULL pointer dereference, address: 0000000000000040 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:vhost_task_wake+0x5/0x10 Call Trace: <TASK> set_nx_huge_pages+0xcc/0x1e0 [kvm] param_attr_store+0x8a/0xd0 module_attr_store+0x1a/0x30 kernfs_fop_write_iter+0x12f/0x1e0 vfs_write+0x233/0x3e0 ksys_write+0x60/0xd0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f3b52710104 </TASK> Modules linked in: kvm_intel kvm CR2: 0000000000000040 Fixes: `931656b9e2` ("kvm: defer huge page recovery vhost task to later") Cc: stable@vger.kernel.org Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250124234623.3609069-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-02-04 11:28:21 -05:00
Keith Busch	931656b9e2	kvm: defer huge page recovery vhost task to later Some libraries want to ensure they are single threaded before forking, so making the kernel's kvm huge page recovery process a vhost task of the user process breaks those. The minijail library used by crosvm is one such affected application. Defer the task to after the first VM_RUN call, which occurs after the parent process has forked all its jailed processes. This needs to happen only once for the kvm instance, so introduce some general-purpose infrastructure for that, too. It's similar in concept to pthread_once; except it is actually usable, because the callback takes a parameter. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Alyssa Ross <hi@alyssa.is> Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250123153543.2769928-1-kbusch@meta.com> [Move call_once API to include/linux. - Paolo] Cc: stable@vger.kernel.org Fixes: `d96c77bd4e` ("KVM: x86: switch hugepage recovery thread to vhost_task") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-01-24 10:53:56 -05:00
Paolo Bonzini	86eb1aef72	Merge branch 'kvm-mirror-page-tables' into HEAD As part of enabling TDX virtual machines, support support separation of private/shared EPT into separate roots. Confidential computing solutions almost invariably have concepts of private and shared memory, but they may different a lot in the details. In SEV, for example, the bit is handled more like a permission bit as far as the page tables are concerned: the private/shared bit is not included in the physical address. For TDX, instead, the bit is more like a physical address bit, with the host mapping private memory in one half of the address space and shared in another. Furthermore, the two halves are mapped by different EPT roots and only the shared half is managed by KVM; the private half (also called Secure EPT in Intel documentation) gets managed by the privileged TDX Module via SEAMCALLs. As a result, the operations that actually change the private half of the EPT are limited and relatively slow compared to reading a PTE. For this reason the design for KVM is to keep a mirror of the private EPT in host memory. This allows KVM to quickly walk the EPT and only perform the slower private EPT operations when it needs to actually modify mid-level private PTEs. There are thus three sets of EPT page tables: external, mirror and direct. In the case of TDX (the only user of this framework) the first two cover private memory, whereas the third manages shared memory: external EPT - Hidden within the TDX module, modified via TDX module calls. mirror EPT - Bookkeeping tree used as an optimization by KVM, not used by the processor. direct EPT - Normal EPT that maps unencrypted shared memory. Managed like the EPT of a normal VM. Modifying external EPT ---------------------- Modifications to the mirrored page tables need to also perform the same operations to the private page tables, which will be handled via kvm_x86_ops. Although this prep series does not interact with the TDX module at all to actually configure the private EPT, it does lay the ground work for doing this. In some ways updating the private EPT is as simple as plumbing PTE modifications through to also call into the TDX module; however, the locking is more complicated because inserting a single PTE cannot anymore be done atomically with a single CMPXCHG. For this reason, the existing FROZEN_SPTE mechanism is used whenever a call to the TDX module updates the private EPT. FROZEN_SPTE acts basically as a spinlock on a PTE. Besides protecting operation of KVM, it limits the set of cases in which the TDX module will encounter contention on its own PTE locks. Zapping external EPT -------------------- While the framework tries to be relatively generic, and to be understandable without knowing TDX much in detail, some requirements of TDX sometimes leak; for example the private page tables also cannot be zapped while the range has anything mapped, so the mirrored/private page tables need to be protected from KVM operations that zap any non-leaf PTEs, for example kvm_mmu_reset_context() or kvm_mmu_zap_all_fast(). For normal VMs, guest memory is zapped for several reasons: user memory getting paged out by the guest, memslots getting deleted, passthrough of devices with non-coherent DMA. Confidential computing adds to these the conversion of memory between shared and privates. These operations must not zap any private memory that is in use by the guest. This is possible because the only zapping that is out of the control of KVM/userspace is paging out userspace memory, which cannot apply to guestmemfd operations. Thus a TDX VM will only zap private memory from memslot deletion and from conversion between private and shared memory which is triggered by the guest. To avoid zapping too much memory, enums are introduced so that operations can choose to target only private or shared memory, and thus only direct or mirror EPT. For example: Memslot deletion - Private and shared MMU notifier based zapping - Shared only Conversion to shared - Private only Conversion to private - Shared only Other cases of zapping will not be supported for KVM, for example APICv update or non-coherent DMA status update; for the latter, TDX will simply require that the CPU supports self-snoop and honor guest PAT unconditionally for shared memory.	2025-01-20 07:15:58 -05:00
Paolo Bonzini	4f7ff70c05	Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.14: - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities instead of just those where KVM needs to manage state and/or explicitly enable the feature in hardware. Along the way, refactor the code to make it easier to add features, and to make it more self-documenting how KVM is handling each feature. - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes where KVM unintentionally puts the vCPU into infinite loops in some scenarios (e.g. if emulation is triggered by the exit), and brings parity between VMX and SVM. - Add pending request and interrupt injection information to the kvm_exit and kvm_entry tracepoints respectively. - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when loading guest/host PKRU, due to a refactoring of the kernel helpers that didn't account for KVM's pre-checking of the need to do WRPKRU.	2025-01-20 06:49:39 -05:00
Yan Zhao	7c54803863	KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault() Return RET_PF* (excluding RET_PF_EMULATE/RET_PF_CONTINUE/RET_PF_INVALID) instead of 1 in kvm_mmu_page_fault(). The callers of kvm_mmu_page_fault() are KVM page fault handlers (i.e., npf_interception(), handle_ept_misconfig(), __vmx_handle_ept_violation(), kvm_handle_page_fault()). They either check if the return value is > 0 (as in npf_interception()) or pass it further to vcpu_run() to decide whether to break out of the kernel loop and return to the user when r <= 0. Therefore, returning any positive value is equivalent to returning 1. Warn if r == RET_PF_CONTINUE (which should not be a valid value) to ensure a positive return value. This is a preparation to allow TDX's EPT violation handler to check the RET_PF* value and retry internally for RET_PF_RETRY. No functional changes are intended. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250113021138.18875-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-01-15 11:45:26 -05:00
Rick Edgecombe	2c3412e999	KVM: x86/mmu: Prevent aliased memslot GFNs Add a few sanity checks to prevent memslot GFNs from ever having alias bits set. Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly though calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables. For TDX, the shared half will be mapped in the higher alias, with a "shared bit" set in the GPA. However, KVM will still manage it with the same memslots as the private half. This means memslot looks ups and zapping operations will be provided with a GFN without the shared bit set. If these memslot GFNs ever had the bit that selects between the two aliases it could lead to unexpected behavior in the complicated code that directs faulting or zapping operations between the roots that map the two aliases. As a safety measure, prevent memslots from being set at a GFN range that contains the alias bit. Also, check in the kvm_faultin_pfn() for the fault path. This later check does less today, as the alias bits are specifically stripped from the GFN being checked, however future code could possibly call in to the fault handler in a way that skips this stripping. Since kvm_faultin_pfn() now has many references to vcpu->kvm, extract it to local variable. Link: https://lore.kernel.org/kvm/ZpbKqG_ZhCWxl-Fc@google.com/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-19-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-23 08:31:55 -05:00
Isaku Yamahata	a89ecbb56b	KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots() Rename kvm_tdp_mmu_invalidate_all_roots() to kvm_tdp_mmu_invalidate_roots(), and make it enum kvm_tdp_mmu_root_types as an argument. kvm_tdp_mmu_invalidate_roots() is called with different root types. For kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing down a VM it needs to invalidate all roots. Have the callers only invalidate the required roots instead of all roots. Within kvm_tdp_mmu_invalidate_roots(), respect the root type passed by checking the root type in root iterator. Suggested-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-17-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-23 08:31:54 -05:00
Isaku Yamahata	fabaa76501	KVM: x86/tdp_mmu: Support mirror root for TDP MMU Add the ability for the TDP MMU to maintain a mirror of a separate mapping. Like other Coco technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly through calls into a protected runtime environment called the TDX module, where the shared half is managed within KVM in normal page tables. In order to handle both shared and private memory, KVM needs to learn to handle faults and other operations on the correct root for the operation. KVM could learn the concept of private roots, and operate on them by calling out to operations that call into the TDX module. But there are two problems with that: 1. Calls into the TDX module are relatively slow compared to the simple accesses required to read a PTE managed directly by KVM. 2. Other Coco technologies deal with private memory completely differently and it will make the code confusing when being read from their perspective. Special operations added for TDX that set private or zap private memory will have nothing to do with these other private memory technologies. (SEV, etc). To handle these, instead teach the TDP MMU about a new concept "mirror roots". Such roots maintain page tables that are not actually mapped, and are just used to traverse quickly to determine if the mid level page tables need to be installed. When the memory be mirrored needs to actually be changed, calls can be made to via x86_ops. private KVM page fault \| \| \| V \| private GPA \| CPU protected EPTP \| \| \| V \| V mirror PT root \| external PT root \| \| \| V \| V mirror PT --hook to propagate-->external PT \| \| \| \--------------------+------\ \| \| \| \| \| V V \| private guest page \| \| non-encrypted memory \| encrypted memory \| Leave calling out to actually update the private page tables that are being mirrored for later changes. Just implement the handling of MMU operations on to mirrored roots. In order to direct operations to correct root, add root types KVM_DIRECT_ROOTS and KVM_MIRROR_ROOTS. Tie the usage of mirrored/direct roots to private/shared with conditionals. It could also be implemented by making the kvm_tdp_mmu_root_types and kvm_gfn_range_filter enum bits line up such that conversion could be a direct assignment with a case. Don't do this because the mapping of private to mirrored is confusing enough. So it is worth not hiding the logic in type casting. Cleanup the mirror root in kvm_mmu_destroy() instead of the normal place in kvm_mmu_free_roots(), because the private root that is being cannot be rebuilt like a normal root. It needs to persist for the lifetime of the VM. The TDX module will also need to be provided with page tables to use for the actual mapping being mirrored by the mirrored page tables. Allocate these in the mapping path using the recently added kvm_mmu_alloc_external_spt(). Don't support 2M page for now. This is avoided by forcing 4k pages in the fault. Add a KVM_BUG_ON() to verify. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-13-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-23 08:31:54 -05:00
Rick Edgecombe	243e13e810	KVM: x86/mmu: Make kvm_tdp_mmu_alloc_root() return void The kvm_tdp_mmu_alloc_root() function currently always returns 0. This allows for the caller, mmu_alloc_direct_roots(), to call kvm_tdp_mmu_alloc_root() and also return 0 in one line: return kvm_tdp_mmu_alloc_root(vcpu); So it is useful even though the return value of kvm_tdp_mmu_alloc_root() is always the same. However, in future changes, kvm_tdp_mmu_alloc_root() will be called twice in mmu_alloc_direct_roots(). This will force the first call to either awkwardly handle the return value that will always be zero or ignore it. So change kvm_tdp_mmu_alloc_root() to return void. Do it in a separate change so the future change will be cleaner. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240718211230.1492011-7-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-23 08:31:54 -05:00
Isaku Yamahata	3a4eb364a4	KVM: x86/mmu: Add an external pointer to struct kvm_mmu_page Add an external pointer to struct kvm_mmu_page for TDX's private page table and add helper functions to allocate/initialize/free a private page table page. TDX will only be supported with the TDP MMU. Because KVM TDP MMU doesn't use unsync_children and write_flooding_count, pack them to have room for a pointer and use a union to avoid memory overhead. For private GPA, CPU refers to a private page table whose contents are encrypted. The dedicated APIs to operate on it (e.g. updating/reading its PTE entry) are used, and their cost is expensive. When KVM resolves the KVM page fault, it walks the page tables. To reuse the existing KVM MMU code and mitigate the heavy cost of directly walking the private page table allocate two sets of page tables for the private half of the GPA space. For the page tables that KVM will walk, allocate them like normal and refer to them as mirror page tables. Additionally allocate one more page for the page tables the CPU will walk, and call them external page tables. Resolve the KVM page fault with the existing code, and do additional operations necessary for modifying the external page table in future patches. The relationship of the types of page tables in this scheme is depicted below: KVM page fault \| \| \| V \| -------------+---------- \| \| \| \| V V \| shared GPA private GPA \| \| \| \| V V \| shared PT root mirror PT root \| private PT root \| \| \| \| V V \| V shared PT mirror PT --propagate--> external PT \| \| \| \| \| \-----------------+------\ \| \| \| \| \| V \| V V shared guest page \| private guest page \| non-encrypted memory \| encrypted memory \| PT - Page table Shared PT - Visible to KVM, and the CPU uses it for shared mappings. External PT - The CPU uses it, but it is invisible to KVM. TDX module updates this table to map private guest pages. Mirror PT - It is visible to KVM, but the CPU doesn't use it. KVM uses it to propagate PT change to the actual private PT. Add a helper kvm_has_mirrored_tdp() to trigger this behavior and wire it to the TDX vm type. Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20240718211230.1492011-5-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-23 08:28:55 -05:00
Isaku Yamahata	dca6c88532	KVM: Add member to struct kvm_gfn_range to indicate private/shared Add new members to strut kvm_gfn_range to indicate which mapping (private-vs-shared) to operate on: enum kvm_gfn_range_filter attr_filter. Update the core zapping operations to set them appropriately. TDX utilizes two GPA aliases for the same memslots, one for memory that is for private memory and one that is for shared. For private memory, KVM cannot always perform the same operations it does on memory for default VMs, such as zapping pages and having them be faulted back in, as this requires guest coordination. However, some operations such as guest driven conversion of memory between private and shared should zap private memory. Internally to the MMU, private and shared mappings are tracked on separate roots. Mapping and zapping operations will operate on the respective GFN alias for each root (private or shared). So zapping operations will by default zap both aliases. Add fields in struct kvm_gfn_range to allow callers to specify which aliases so they can only target the aliases appropriate for their specific operation. There was feedback that target aliases should be specified such that the default value (0) is to operate on both aliases. Several options were considered. Several variations of having separate bools defined such that the default behavior was to process both aliases. They either allowed nonsensical configurations, or were confusing for the caller. A simple enum was also explored and was close, but was hard to process in the caller. Instead, use an enum with the default value (0) reserved as a disallowed value. Catch ranges that didn't have the target aliases specified by looking for that specific value. Set target alias with enum appropriately for these MMU operations: - For KVM's mmu notifier callbacks, zap shared pages only because private pages won't have a userspace mapping - For setting memory attributes, kvm_arch_pre_set_memory_attributes() chooses the aliases based on the attribute. - For guest_memfd invalidations, zap private only. Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/ Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-23 08:28:55 -05:00
Rick Edgecombe	35be969d1e	KVM: x86/mmu: Zap invalid roots with mmu_lock holding for write at uninit Prepare for a future TDX patch which asserts that atomic zapping (i.e. zapping with mmu_lock taken for read) don't operate on mirror roots. When tearing down a VM, all roots have to be zapped (including mirror roots once they're in place) so do that with the mmu_lock taken for write. kvm_mmu_uninit_tdp_mmu() is invoked either before or after executing any atomic operations on SPTEs by vCPU threads. Therefore, it will not impact vCPU threads performance if kvm_tdp_mmu_zap_invalidated_roots() acquires mmu_lock for write to zap invalid roots. Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240718211230.1492011-2-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-12-23 08:28:49 -05:00
Sean Christopherson	386d69f9f2	KVM: x86/mmu: Treat TDP MMU faults as spurious if access is already allowed Treat slow-path TDP MMU faults as spurious if the access is allowed given the existing SPTE to fix a benign warning (other than the WARN itself) due to replacing a writable SPTE with a read-only SPTE, and to avoid the unnecessary LOCK CMPXCHG and subsequent TLB flush. If a read fault races with a write fault, fast GUP fails for any reason when trying to "promote" the read fault to a writable mapping, and KVM resolves the write fault first, then KVM will end up trying to install a read-only SPTE (for a !map_writable fault) overtop a writable SPTE. Note, it's not entirely clear why fast GUP fails, or if that's even how KVM ends up with a !map_writable fault with a writable SPTE. If something else is going awry, e.g. due to a bug in mmu_notifiers, then treating read faults as spurious in this scenario could effectively mask the underlying problem. However, retrying the faulting access instead of overwriting an existing SPTE is functionally correct and desirable irrespective of the WARN, and fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed bit in primary MMU's PTE is toggled and causes a PTE value mismatch. The WARN was also recently added, specifically to track down scenarios where KVM is unnecessarily overwrites SPTEs, i.e. treating the fault as spurious doesn't regress KVM's bug-finding capabilities in any way. In short, letting the WARN linger because there's a tiny chance it's due to a bug elsewhere would be excessively paranoid. Fixes: `1a175082b1` ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable") Reported-by: Lei Yang <leiyang@redhat.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588 Tested-by: Lei Yang <leiyang@redhat.com> Link: https://lore.kernel.org/r/20241218213611.3181643-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-19 17:47:52 -08:00
Sean Christopherson	2c5e168e5c	KVM: x86: Rename "governed features" helpers to use "guest_cpu_cap" As the first step toward replacing KVM's so-called "governed features" framework with a more comprehensive, less poorly named implementation, replace the "kvm_governed_feature" function prefix with "guest_cpu_cap" and rename guest_can_use() to guest_cpu_cap_has(). The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and provides a more clear distinction between guest capabilities, which are KVM controlled (heh, or one might say "governed"), and guest CPUID, which with few exceptions is fully userspace controlled. Opportunistically rewrite the comment about XSS passthrough for SEV-ES guests to avoid referencing so many functions, as such comments are prone to becoming stale (case in point...). No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-12-18 14:20:03 -08:00
Paolo Bonzini	d96c77bd4e	KVM: x86: switch hugepage recovery thread to vhost_task kvm_vm_create_worker_thread() is meant to be used for kthreads that can consume significant amounts of CPU time on behalf of a VM or in response to how the VM behaves (for example how it accesses its memory). Therefore it wants to charge the CPU time consumed by that work to the VM's container. However, because of these threads, cgroups which have kvm instances inside never complete freezing. This can be trivially reproduced: root@test ~# mkdir /sys/fs/cgroup/test root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs root@test ~# qemu-system-x86_64 -nographic -enable-kvm and in another terminal: root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze root@test ~# cat /sys/fs/cgroup/test/cgroup.events populated 1 frozen 0 The cgroup freezing happens in the signal delivery path but kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never calls into the signal delivery path and thus never gets frozen. Because the cgroup freezer determines whether a given cgroup is frozen by comparing the number of frozen threads to the total number of threads in the cgroup, the cgroup never becomes frozen and users waiting for the state transition may hang indefinitely. Since the worker kthread is tied to a user process, it's better if it behaves similarly to user tasks as much as possible, including being able to send SIGSTOP and SIGCONT. In fact, vhost_task is all that kvm_vm_create_worker_thread() wanted to be and more: not only it inherits the userspace process's cgroups, it has other niceties like being parented properly in the process tree. Use it instead of the homegrown alternative. Incidentally, the new code is also better behaved when you flip recovery back and forth to disabled and back to enabled. If your recovery period is 1 minute, it will run the next recovery after 1 minute independent of how many times you flipped the parameter. (Commit message based on emails from Tejun). Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Luca Boccassi <bluca@debian.org> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: Luca Boccassi <bluca@debian.org> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-11-14 13:20:04 -05:00
Paolo Bonzini	bb4409a9e7	Merge tag 'kvm-x86-misc-6.13' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.13 - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE. - Quirk KVM's misguided behavior of initialized certain feature MSRs to their maximum supported feature set, which can result in KVM creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero value results in the vCPU having invalid state if userspace hides PDCM from the guest, which can lead to save/restore failures. - Fix KVM's handling of non-canonical checks for vCPUs that support LA57 to better follow the "architecture", in quotes because the actual behavior is poorly documented. E.g. most MSR writes and descriptor table loads ignore CR4.LA57 and operate purely on whether the CPU supports LA57. - Bypass the register cache when querying CPL from kvm_sched_out(), as filling the cache from IRQ context is generally unsafe, and harden the cache accessors to try to prevent similar issues from occuring in the future. - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM over-advertises SPEC_CTRL when trying to support cross-vendor VMs. - Minor cleanups	2024-11-13 06:33:00 -05:00
Vipin Sharma	4cf20d4254	KVM: x86/mmu: Drop per-VM zapped_obsolete_pages list Drop the per-VM zapped_obsolete_pages list now that the usage from the defunct mmu_shrinker is gone, and instead use a local list to track pages in kvm_zap_obsolete_pages(), the sole remaining user of zapped_obsolete_pages. Opportunistically add an assertion to verify and document that slots_lock must be held, i.e. that there can only be one active instance of kvm_zap_obsolete_pages() at any given time, and by doing so also prove that using a local list instead of a per-VM list doesn't change any functionality (beyond trivialities like list initialization). Signed-off-by: Vipin Sharma <vipinsh@google.com> Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com [sean: split to separate patch, write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-04 19:22:53 -08:00
Vipin Sharma	fe140e611d	KVM: x86/mmu: Remove KVM's MMU shrinker Remove KVM's MMU shrinker and (almost) all of its related code, as the current implementation is very disruptive to VMs (if it ever runs), without providing any meaningful benefit[1]. Alternatively, KVM could repurpose its shrinker, e.g. to reclaim pages from the per-vCPU caches[2], but given that no one has complained about lack of TDP MMU support for the shrinker in the 3+ years since the TDP MMU was enabled by default, it's safe to say that there is likely no real use case for initiating reclaim of KVM's page tables from the shrinker. And while clever/cute, reclaiming the per-vCPU caches doesn't scale the same way that reclaiming in-use page table pages does. E.g. the amount of memory being used by a VM doesn't always directly correlate with the number vCPUs, and even when it does, reclaiming a few pages from per-vCPU caches likely won't make much of a dent in the VM's total memory usage, especially for VMs with huge amounts of memory. Lastly, if it turns out that there is a strong use case for dropping the per-vCPU caches, re-introducing the shrinker registration is trivial compared to the complexity of actually reclaiming pages from the caches. [1] https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com [2] https://lore.kernel.org/kvm/20241004195540.210396-3-vipinsh@google.com Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Vipin Sharma <vipinsh@google.com> Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com [sean: keep zapped_obsolete_pages for now, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-04 19:18:22 -08:00
David Matlack	430e264b76	KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte() Rename make_huge_page_split_spte() to make_small_spte(). This ensures that the usage of "small_spte" and "huge_spte" are consistent between make_huge_spte() and make_small_spte(). This should also reduce some confusion as make_huge_page_split_spte() almost reads like it will create a huge SPTE, when in fact it is creating a small SPTE to split the huge SPTE. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-6-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-04 18:37:23 -08:00
David Matlack	13e2e4f62a	KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zapping Recover TDP MMU huge page mappings in-place instead of zapping them when dirty logging is disabled, and rename functions that recover huge page mappings when dirty logging is disabled to move away from the "zap collapsible spte" terminology. Before KVM flushes TLBs, guest accesses may be translated through either the (stale) small SPTE or the (new) huge SPTE. This is already possible when KVM is doing eager page splitting (where TLB flushes are also batched), and when vCPUs are faulting in huge mappings (where TLBs are flushed after the new huge SPTE is installed). Recovering huge pages reduces the number of page faults when dirty logging is disabled: $ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g Before: 393,599 kvm:kvm_page_fault After: 262,575 kvm:kvm_page_fault vCPU throughput and the latency of disabling dirty-logging are about equal compared to zapping, but avoiding faults can be beneficial to remove vCPU jitter in extreme scenarios. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-5-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-04 18:37:22 -08:00
Maxim Levitsky	9245fd6b85	KVM: x86: model canonical checks more precisely As a result of a recent investigation, it was determined that x86 CPUs which support 5-level paging, don't always respect CR4.LA57 when doing canonical checks. In particular: 1. MSRs which contain a linear address, allow full 57-bitcanonical address regardless of CR4.LA57 state. For example: MSR_KERNEL_GS_BASE. 2. All hidden segment bases and GDT/IDT bases also behave like MSRs. This means that full 57-bit canonical address can be loaded to them regardless of CR4.LA57, both using MSRS (e.g GS_BASE) and instructions (e.g LGDT). 3. TLB invalidation instructions also allow the user to use full 57-bit address regardless of the CR4.LA57. Finally, it must be noted that the CPU doesn't prevent the user from disabling 5-level paging, even when the full 57-bit canonical address is present in one of the registers mentioned above (e.g GDT base). In fact, this can happen without any userspace help, when the CPU enters SMM mode - some MSRs, for example MSR_KERNEL_GS_BASE are left to contain a non-canonical address in regard to the new mode. Since most of the affected MSRs and all segment bases can be read and written freely by the guest without any KVM intervention, this patch makes the emulator closely follow hardware behavior, which means that the emulator doesn't take in the account the guest CPUID support for 5-level paging, and only takes in the account the host CPU support. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-4-mlevitsk@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-11-01 09:22:26 -07:00
David Matlack	8ccd51cb59	KVM: x86/mmu: Drop @max_level from kvm_mmu_max_mapping_level() Drop the @max_level parameter from kvm_mmu_max_mapping_level(). All callers pass in PG_LEVEL_NUM, so @max_level can be replaced with PG_LEVEL_NUM in the function body. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 15:25:42 -07:00
Sean Christopherson	c9b625625b	KVM: x86/mmu: Dedup logic for detecting TLB flushes on leaf SPTE changes Now that the shadow MMU and TDP MMU have identical logic for detecting required TLB flushes when updating SPTEs, move said logic to a helper so that the TDP MMU code can benefit from the comments that are currently exclusive to the shadow MMU. No functional change intended. Link: https://lore.kernel.org/r/20241011021051.1557902-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 15:25:37 -07:00
Sean Christopherson	7971801b56	KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabled Use the Accessed bit in SPTEs even when A/D bits are disabled in hardware, i.e. propagate accessed information to SPTE.Accessed even when KVM is doing manual tracking by making SPTEs not-present. In addition to eliminating a small amount of code in is_accessed_spte(), this also paves the way for preserving Accessed information when a SPTE is zapped in response to a mmu_notifier PROTECTION event, e.g. if a SPTE is zapped because NUMA balancing kicks in. Note, EPT is the only flavor of paging in which A/D bits are conditionally enabled, and the Accessed (and Dirty) bit is software-available when A/D bits are disabled. Note #2, there are currently no concrete plans to preserve Accessed information. Explorations on that front were the initial catalyst, but the cleanup is the motivation for the actual commit. Link: https://lore.kernel.org/r/20241011021051.1557902-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 14:46:47 -07:00
Sean Christopherson	a5da5dde4b	KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally enabled Add a dedicated flag to track if KVM has enabled A/D bits at the module level, instead of inferring the state based on whether or not the MMU's shadow_accessed_mask is non-zero. This will allow defining and using shadow_accessed_mask even when A/D bits aren't used by hardware. Link: https://lore.kernel.org/r/20241011021051.1557902-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 14:46:46 -07:00
Sean Christopherson	67c9380292	KVM: x86/mmu: Fold mmu_spte_update_no_track() into mmu_spte_update() Fold the guts of mmu_spte_update_no_track() into mmu_spte_update() now that the latter doesn't flush when clearing A/D bits, i.e. now that there is no need to explicitly avoid TLB flushes when aging SPTEs. Opportunistically WARN if mmu_spte_update() requests a TLB flush when aging SPTEs, as aging should never modify a SPTE in such a way that KVM thinks a TLB flush is needed. Link: https://lore.kernel.org/r/20241011021051.1557902-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 14:46:46 -07:00
Sean Christopherson	856cf4a60c	KVM: x86/mmu: Don't flush TLBs when clearing Dirty bit in shadow MMU Don't force a TLB flush when an SPTE update in the shadow MMU happens to clear the Dirty bit, as KVM unconditionally flushes TLBs when enabling dirty logging, and when clearing dirty logs, KVM flushes based on its software structures, not the SPTEs. I.e. the flows that care about accurate Dirty bit information already ensure there are no stale TLB entries. Opportunistically drop is_dirty_spte() as mmu_spte_update() was the sole caller. Link: https://lore.kernel.org/r/20241011021051.1557902-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 14:46:46 -07:00
Sean Christopherson	b7ed46b201	KVM: x86/mmu: Don't force flush if SPTE update clears Accessed bit Don't force a TLB flush if mmu_spte_update() clears the Accessed bit, as access tracking tolerates false negatives, as evidenced by the mmu_notifier hooks that explicitly test and age SPTEs without doing a TLB flush. In practice, this is very nearly a nop. spte_write_protect() and spte_clear_dirty() never clear the Accessed bit. make_spte() always sets the Accessed bit for !prefetch scenarios. FNAME(sync_spte) only sets SPTE if the protection bits are changing, i.e. if a flush will be needed regardless of the Accessed bits. And FNAME(pte_prefetch) sets SPTE if and only if the old SPTE is !PRESENT. That leaves kvm_arch_async_page_ready() as the one path that will generate a !ACCESSED SPTE and overwrite a PRESENT SPTE. And that's very arguably a bug, as clobbering a valid SPTE in that case is nonsensical. Tested-by: Alex Bennée <alex.bennee@linaro.org> Link: https://lore.kernel.org/r/20241011021051.1557902-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 14:46:46 -07:00
Sean Christopherson	081976992f	KVM: x86/mmu: Flush remote TLBs iff MMU-writable flag is cleared from RO SPTE Don't force a remote TLB flush if KVM happens to effectively "refresh" a read-only SPTE that is still MMU-Writable, as KVM allows MMU-Writable SPTEs to have Writable TLB entries, even if the SPTE is !Writable. Remote TLBs need to be flushed only when creating a read-only SPTE for write-tracking, i.e. when installing a !MMU-Writable SPTE. In practice, especially now that KVM doesn't overwrite existing SPTEs when prefetching, KVM will rarely "refresh" a read-only, MMU-Writable SPTE, i.e. this is unlikely to eliminate many, if any, TLB flushes. But, more precisely flushing makes it easier to understand exactly when KVM does and doesn't need to flush. Note, x86 architecturally requires relevant TLB entries to be invalidated on a page fault, i.e. there is no risk of putting a vCPU into an infinite loop of read-only page faults. Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20241011021051.1557902-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-10-30 14:46:45 -07:00
Sean Christopherson	66bc627e7f	KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs Don't mark pages/folios as accessed in the primary MMU when zapping SPTEs, as doing so relies on kvm_pfn_to_refcounted_page(), and generally speaking is unnecessary and wasteful. KVM participates in page aging via mmu_notifiers, so there's no need to push "accessed" updates to the primary MMU. And if KVM zaps a SPTe in response to an mmu_notifier, marking it accessed _after_ the primary MMU has decided to zap the page is likely to go unnoticed, i.e. odds are good that, if the page is being zapped for reclaim, the page will be swapped out regardless of whether or not KVM marks the page accessed. Dropping x86's use of kvm_set_pfn_accessed() also paves the way for removing kvm_pfn_to_refcounted_page() and all its users. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-83-seanjc@google.com>	2024-10-25 13:01:35 -04:00
Sean Christopherson	dc06193532	KVM: Move x86's API to release a faultin page to common KVM Move KVM x86's helper that "finishes" the faultin process to common KVM so that the logic can be shared across all architectures. Note, not all architectures implement a fast page fault path, but the gist of the comment applies to all architectures. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-50-seanjc@google.com>	2024-10-25 13:00:47 -04:00
Sean Christopherson	8eaa98004b	KVM: x86/mmu: Don't mark unused faultin pages as accessed When finishing guest page faults, don't mark pages as accessed if KVM is resuming the guest _without_ installing a mapping, i.e. if the page isn't being used. While it's possible that marking the page accessed could avoid minor thrashing due to reclaiming a page that the guest is about to access, it's far more likely that the gfn=>pfn mapping was was invalidated, e.g. due a memslot change, or because the corresponding VMA is being modified. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-49-seanjc@google.com>	2024-10-25 13:00:47 -04:00
Sean Christopherson	8dd861cc07	KVM: x86/mmu: Put refcounted pages instead of blindly releasing pfns Now that all x86 page fault paths precisely track refcounted pages, use Use kvm_page_fault.refcounted_page to put references to struct page memory when finishing page faults. This is a baby step towards eliminating kvm_pfn_to_refcounted_page(). Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-48-seanjc@google.com>	2024-10-25 13:00:47 -04:00
Sean Christopherson	1fbee5b01a	KVM: guest_memfd: Provide "struct page" as output from kvm_gmem_get_pfn() Provide the "struct page" associated with a guest_memfd pfn as an output from __kvm_gmem_get_pfn() so that KVM guest page fault handlers can directly put the page instead of having to rely on kvm_pfn_to_refcounted_page(). Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-47-seanjc@google.com>	2024-10-25 13:00:47 -04:00
Sean Christopherson	54ba8c98a2	KVM: x86/mmu: Convert page fault paths to kvm_faultin_pfn() Convert KVM x86 to use the recently introduced __kvm_faultin_pfn(). Opportunstically capture the refcounted_page grabbed by KVM for use in future changes. No functional change intended. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-45-seanjc@google.com>	2024-10-25 13:00:47 -04:00
Sean Christopherson	0cad68cab1	KVM: x86/mmu: Mark pages/folios dirty at the origin of make_spte() Move the marking of folios dirty from make_spte() out to its callers, which have access to the _struct page_, not just the underlying pfn. Once all architectures follow suit, this will allow removing KVM's ugly hack where KVM elevates the refcount of VM_MIXEDMAP pfns that happen to be struct page memory. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-42-seanjc@google.com>	2024-10-25 12:59:08 -04:00
Sean Christopherson	7103853952	KVM: x86/mmu: Add helper to "finish" handling a guest page fault Add a helper to finish/complete the handling of a guest page, e.g. to mark the pages accessed and put any held references. In the near future, this will allow improving the logic without having to copy+paste changes into all page fault paths. And in the less near future, will allow sharing the "finish" API across all architectures. No functional change intended. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-41-seanjc@google.com>	2024-10-25 12:59:08 -04:00
Sean Christopherson	fa8fe58d1e	KVM: x86/mmu: Add common helper to handle prefetching SPTEs Deduplicate the prefetching code for indirect and direct MMUs. The core logic is the same, the only difference is that indirect MMUs need to prefetch SPTEs one-at-a-time, as contiguous guest virtual addresses aren't guaranteed to yield contiguous guest physical addresses. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-40-seanjc@google.com>	2024-10-25 12:59:08 -04:00

1 2 3 4 5 ...

866 Commits