[ Upstream commit 0f88130e8a ]
Normally __swp_entry_to_pte() is never called with a value translating
to a valid PTE. The only known exception is pte_swap_tests(), resulting
in a WARN splat in Xen PV guests, as __pte_to_swp_entry() translated
the PFN of the valid PTE to a guest-local PFN, while
__swp_entry_to_pte() doesn't do the reverse translation.
Fix that by using __pte() in __swp_entry_to_pte() instead of open
coding the native variant of it.
For correctness, do the same conversion for __swp_entry_to_pmd().
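As a hedged sketch of the resulting definitions (exact file layout
omitted), the conversion now goes through __pte()/__pmd() so that any
architecture-specific PFN translation is applied:

  /* before: open-coded native PTE, bypassing the PV PFN translation */
  #define __swp_entry_to_pte(x)  ((pte_t) { .pte = (x).val })

  /* after: __pte()/__pmd() apply the PFN translation on Xen PV */
  #define __swp_entry_to_pte(x)  (__pte((x).val))
  #define __swp_entry_to_pmd(x)  (__pmd((x).val))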
Fixes: 05289402d7 ("mm/debug_vm_pgtable: add tests validating arch helpers for core MM features")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230306123259.12461-1-jgross@suse.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2fad201fe3 ]
Although IBS pmus can be invoked via their own interface, indirect
IBS invocation via core pmu events is also supported, with a fixed set
of events: cpu-cycles:p, r076:p (same as cpu-cycles:p) and r0C1:p
(micro-ops), for user convenience.
This indirect IBS invocation has been broken since commit 66d258c5b0
("perf/core: Optimize perf_init_event()"), which added the RAW pmu to
the 'pmu_idr' list; thus, if event_init() fails with the RAW pmu, an
error is returned instead of trying other pmus.
Forward precise events from core pmu to IBS by overwriting 'type' and
'config' in the kernel copy of perf_event_attr. Overwriting will cause
perf_init_event() to retry with updated 'type' and 'config', which will
automatically forward event to IBS pmu.
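A hedged sketch of the mechanism (helper and variable names hypothetical,
not the literal upstream code):

  /* in the core pmu's event_init(): rewrite the kernel copy of the attr
   * and return -ENOENT; perf_init_event() sees that attr.type changed
   * and retries the lookup, which then lands on the IBS pmu */
  static int core_pmu_forward_event_to_ibs(struct perf_event *event)
  {
      event->attr.type   = ibs_op_pmu_type;      /* hypothetical */
      event->attr.config = remapped_ibs_config;  /* hypothetical */
      return -ENOENT;
  }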
Without patch:
$ sudo ./perf record -C 0 -e r076:p -- sleep 1
Error:
The r076:p event is not supported.
With patch:
$ sudo ./perf record -C 0 -e r076:p -- sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.341 MB perf.data (37 samples) ]
Fixes: 66d258c5b0 ("perf/core: Optimize perf_init_event()")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230504110003.2548-3-ravi.bangoria@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a153f254e5 ]
When running as Xen PV initial domain (aka dom0), MTRRs are disabled
by the hypervisor, but the system should nevertheless use correct
cache memory types. This has always kind of worked, as disabled MTRRs
resulted in disabled PAT, too, so that the kernel avoided code paths
resulting in inconsistencies. This bypassed all of the sanity checks
the kernel is doing with enabled MTRRs in order to avoid memory
mappings with conflicting memory types.
This has been changed recently, leading to PAT being accepted as
enabled while MTRRs stayed disabled. The result is that
mtrr_type_lookup() no longer accepts all memory type requests, but
started to return WB even if UC- was requested. This led to driver
failures during initialization of some devices.
In reality MTRRs are still in effect, but they are under complete
control of the Xen hypervisor. It is possible, however, to retrieve
the MTRR settings from the hypervisor.
In order to fix those problems, overwrite the MTRR state via
mtrr_overwrite_state() with the MTRR data from the hypervisor, if the
system is running as a Xen dom0.
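A hedged sketch of the dom0 side (the hypercall plumbing to read each
variable range is elided; the default type is assumed UC here):

  /* during early Xen PV setup, only for the initial domain */
  if (xen_initial_domain()) {
      struct mtrr_var_range var[MTRR_MAX_VAR_RANGES];
      unsigned int num_var = 0;

      /* fill 'var' and 'num_var' via XENPF_read_memtype hypercalls ... */

      mtrr_overwrite_state(var, num_var, MTRR_TYPE_UNCACHABLE);
  }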
Fixes: 72cbc8f04f ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-6-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 29055dc742 ]
When running virtualized, MTRR access can be reduced (e.g. in Xen PV
guests or when running as a SEV-SNP guest under Hyper-V). Typically, the
hypervisor will not advertise the MTRR feature in CPUID data, resulting
in no MTRR memory type information being available for the kernel.
This has turned out to result in problems (Link tags below):
- Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
- Xen PV dom0 mapping memory as WB which should be UC- instead
Solve those problems by allowing an MTRR static state override,
overwriting the empty state used today. In case such a state has been
set, don't call get_mtrr_state() in mtrr_bp_init().
The set state will only be used by mtrr_type_lookup(), as in all other
cases mtrr_enabled() is being checked, which will return false. Accept
the overwrite call only for selected cases when running as a guest.
Disable X86_FEATURE_MTRR so that any attempted MTRR modifications are
simply refused.
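A hedged sketch of the override entry point (guard and bookkeeping
details approximate):

  void __init mtrr_overwrite_state(struct mtrr_var_range *var,
                                   unsigned int num_var, mtrr_type def_type)
  {
      /* only acceptable when running as a guest */
      if (WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_HYPERVISOR)))
          return;

      /* ... copy var/num_var/def_type into mtrr_state, mark it valid ... */

      /* refuse any later MTRR modification attempts */
      setup_clear_cpu_cap(X86_FEATURE_MTRR);
  }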
[ bp: Massage. ]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/all/4fe9541e-4d4c-2b2a-f8c8-2d34a7284930@nerdbynature.de/
Link: https://lore.kernel.org/lkml/BYAPR21MB16883ABC186566BD4D2A1451D7FE9@BYAPR21MB1688.namprd21.prod.outlook.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Stable-dep-of: a153f254e5 ("x86/xen: Set MTRR state when running as Xen PV initial domain")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d053b481a5 ]
Replace size_or_mask and size_and_mask with the much easier concept of
high reserved bits.
While at it, instead of using constants in the MTRR code, use some new
[ bp:
- Drop mtrr_set_mask()
- Unbreak long lines
- Move struct mtrr_state_type out of the uapi header as it doesn't
belong there. It also fixes a HDRTEST breakage "unknown type name ‘bool’"
as Reported-by: kernel test robot <lkp@intel.com>
- Massage.
]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230502120931.20719-3-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Stable-dep-of: a153f254e5 ("x86/xen: Set MTRR state when running as Xen PV initial domain")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f6b980646b ]
The physical address width calculation in mtrr_bp_init() can easily be
replaced by using the already available value x86_phys_bits from
struct cpuinfo_x86.
The same information source can be used in mtrr/cleanup.c, removing the
need to pass that value on to mtrr_cleanup().
In print_mtrr_state() use x86_phys_bits instead of recalculating it
from size_or_mask.
Move setting of size_or_mask and size_and_mask into a dedicated new
function in mtrr/generic.c, making it possible to turn those two
variables static, as they are now used only in generic.c.
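A hedged sketch of the new helper (mask derivation simplified):

  static u64 size_or_mask, size_and_mask;  /* now static in generic.c */

  static void __init mtrr_set_mask(void)
  {
      unsigned int phys_bits = boot_cpu_data.x86_phys_bits;

      size_or_mask  = ~GENMASK_ULL(phys_bits - PAGE_SHIFT - 1, 0);
      size_and_mask = ~size_or_mask & GENMASK_ULL(39, 20);
  }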
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-2-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Stable-dep-of: a153f254e5 ("x86/xen: Set MTRR state when running as Xen PV initial domain")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 195edce08b ]
tl;dr: There is a race in the TDX private<=>shared conversion code
which could kill the TDX guest. Fix it by changing conversion
ordering to eliminate the window.
TDX hardware maintains metadata to track which pages are private and
shared. Additionally, TDX guests use the guest x86 page tables to
specify whether a given mapping is intended to be private or shared.
Bad things happen when the intent and metadata do not match.
So there are two things in play:
1. "the page" -- the physical TDX page metadata
2. "the mapping" -- the guest-controlled x86 page table intent
For instance, an unrecoverable exit to VMM occurs if a guest touches a
private mapping that points to a shared physical page.
In summary:
* Private mapping => Private Page == OK (obviously)
* Shared mapping => Shared Page == OK (obviously)
* Private mapping => Shared Page == BIG BOOM!
* Shared mapping => Private Page == OK-ish
(A read will generate a recoverable #VE via handle_mmio())
Enter load_unaligned_zeropad(). It can touch memory that is adjacent but
otherwise unrelated to the memory it needs to touch. It will cause one
of those unrecoverable exits (aka. BIG BOOM) if it blunders into a
shared mapping pointing to a private page.
This is a problem when __set_memory_enc_pgtable() converts pages from
shared to private. It first changes the mapping and only then modifies
the TDX page metadata. It's moving from:
* Shared mapping => Shared Page == OK
to:
* Private mapping => Shared Page == BIG BOOM!
This means that there is a window with a shared mapping pointing to a
private page where load_unaligned_zeropad() can strike.
Add a TDX handler for guest.enc_status_change_prepare(). This converts
the page from shared to private *before* the mapping becomes private,
ensuring that there is never a private mapping to a shared page.
Leave guest.enc_status_change_finish() in place, but only use it for
private=>shared conversions. This delays updating the TDX metadata
marking the page shared until *after* the mapping has been converted,
which likewise ensures that there is never a private mapping to a
shared page.
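A hedged sketch of the two TDX callbacks (shape per the description
above; error handling trimmed):

  static bool tdx_enc_status_change_prepare(unsigned long vaddr,
                                            int numpages, bool enc)
  {
      /* shared->private: fix the page metadata *before* the mapping
       * becomes private, so a private mapping never sees a shared page */
      if (enc)
          return tdx_enc_status_changed(vaddr, numpages, enc);
      return true;
  }

  static bool tdx_enc_status_change_finish(unsigned long vaddr,
                                           int numpages, bool enc)
  {
      /* private->shared: update the metadata only *after* the mapping
       * has been converted to shared */
      if (!enc)
          return tdx_enc_status_changed(vaddr, numpages, enc);
      return true;
  }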
[ dhansen: rewrite changelog ]
Fixes: 7dbde76316 ("x86/mm/cpa: Add support for TDX shared memory")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/all/20230606095622.1939-3-kirill.shutemov%40linux.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5dee19b6b2 ]
When calculating an end address based on an unsigned int number of pages,
any value greater than or equal to 0x100000 that is shifted left by
PAGE_SHIFT bits overflows the 32-bit type, resulting in an invalid end
address. Change the number of pages variable in various routines from an
unsigned int to an unsigned long to calculate the end address correctly.
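A minimal illustration of the truncation (values arbitrary):

  unsigned int  npages = 0x100000;  /* 2^20 pages == 4GB */
  unsigned long vaddr  = 0xffff800000000000UL;

  /* the shift is evaluated in 32 bits and wraps to 0 */
  unsigned long bad  = vaddr + (npages << PAGE_SHIFT);
  /* widening the count first yields the intended end address */
  unsigned long good = vaddr + ((unsigned long)npages << PAGE_SHIFT);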
Fixes: 5e5ccff60a ("x86/sev: Add helper for validating pages in early enc attribute changes")
Fixes: dc3f3d2474 ("x86/mm: Validate memory when changing the C-bit")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/6a6e4eea0e1414402bac747744984fa4e9c01bb6.1686063086.git.thomas.lendacky@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit c2508ec5a5 upstream.
.. and make x86 use it.
This basically extracts the existing x86 "find and expand faulting vma"
code, but extends it to also take the mmap lock for writing in case we
actually do need to expand the vma.
We've historically short-circuited that case, and have some rather ugly
special logic to serialize the stack segment expansion (since we only
hold the mmap lock for reading) that doesn't match the normal VM
locking.
That slight violation of locking worked well, right up until it didn't:
the maple tree code really does want proper locking even for simple
extension of an existing vma.
So extract the code for "look up the vma of the fault" from x86, fix it
up to do the necessary write locking, and make it available as a common
helper function for other architectures.
Note: I say "common helper", but it really only handles the normal
stack-grows-down case. Which is all architectures except for PA-RISC
and IA64. So some rare architectures can't use the helper, but if they
care they'll just need to open-code this logic.
It's also worth pointing out that this code really would like to have an
optimistic "mmap_upgrade_trylock()" to make it quicker to go from a
read-lock (for the common case) to taking the write lock (for having to
extend the vma) in the normal single-threaded situation where there is
no other locking activity.
But that _is_ all the very uncommon special case, so while it would be
nice to have such an operation, it probably doesn't matter in reality.
I did put in the skeleton code for such a possible future expansion,
even if it only acts as pseudo-documentation for what we're doing.
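A hedged sketch of what an architecture's fault handler looks like with
the helper (x86-style names; details vary per architecture):

  vma = lock_mm_and_find_vma(mm, address, regs);
  if (unlikely(!vma)) {
      /* no vma, or the stack could not be expanded */
      bad_area_nosemaphore(regs, error_code, address);
      return;
  }

  /* the mmap lock is held for reading here, even if the helper had to
   * take it for writing to expand the stack */
  fault = handle_mm_fault(vma, address, flags, regs);
  mmap_read_unlock(mm);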
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d7893093a7 upstream.
TLDR: It's a mess.
When kexec() is executed on a system with offline CPUs, which are parked
in mwait_play_dead(), it can end up in a triple fault during the bootup of
the kexec kernel or cause hard to diagnose data corruption.
The reason is that kexec() eventually overwrites the previous kernel's
text, page tables, data and stack. If it writes to the cache line which is
monitored by a previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernel's data.
Cure this by bringing the offlined CPUs out of MWAIT into HLT.
Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offlined CPUs to issue
HLT, which does not have the MWAIT problem.
That does not help if a stray NMI, MCE or SMI hits the offlined CPUs, as
those make them come out of HLT.
A follow up change will put them into INIT, which protects at least against
NMI and SMI.
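A hedged sketch of the kexec-side kick (names approximate; the real code
also waits for each CPU to acknowledge the state change):

  /* tell each CPU parked in mwait_play_dead() to switch to HLT */
  for_each_cpu(cpu, &cpus_offlined_mask) {  /* mask name hypothetical */
      struct mwait_cpu_dead *md = per_cpu_ptr(&mwait_cpu_dead, cpu);

      /* writing the monitored cache line makes MWAIT resume; the
       * control word then directs the CPU into a HLT loop */
      WRITE_ONCE(md->control, CPUDEAD_MWAIT_KEXEC_HLT);
  }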
Fixes: ea53069231 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.492257119@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f9c9987bf5 upstream.
Monitoring idletask::thread_info::flags in mwait_play_dead() has been an
obvious choice as all that is needed is a cache line which is not written
by other CPUs.
But there is a use case where a "dead" CPU needs to be brought out of
MWAIT: kexec().
This is required as kexec() can overwrite text, pagetables, stacks and the
monitored cacheline of the original kernel. The latter causes MWAIT to
resume execution which obviously causes havoc on the kexec kernel which
results usually in triple faults.
Use a dedicated per CPU storage to prepare for that.
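A hedged sketch of that storage (field types approximate):

  struct mwait_cpu_dead {
      unsigned int control;
      unsigned int status;
  };

  /* one cache-line-aligned instance per CPU; mwait_play_dead() now
   * monitors this instead of idletask::thread_info::flags */
  static DEFINE_PER_CPU_ALIGNED(struct mwait_cpu_dead, mwait_cpu_dead);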
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.434553750@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1f5e7eb786 upstream.
Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.
That wbinvd() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.
That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:
    CPU0                                  CPU1

     stop_other_cpus()
       send_IPIs(REBOOT);                 stop_this_cpu()
       while (num_online_cpus() > 1);       set_online(false);
       proceed... -> hang
                                            wbinvd()
WBINVD is an expensive operation and if multiple CPUs issue it at the same
time the resulting delays are even larger.
But CPU0 already observed num_online_cpus() going down to 1 and proceeds
which causes the system to hang.
This issue exists independent of WBINVD, but the delays caused by WBINVD
make it more prominent.
Make this more robust by adding a cpumask which is initialized to the
online CPU mask before sending the IPIs and CPUs clear their bit in
stop_this_cpu() after the WBINVD completed. Check for that cpumask to
become empty in stop_other_cpus() instead of watching num_online_cpus().
The cpumask cannot plug all holes either, but it's better than a raw
counter and allows restricting the NMI fallback IPI to be sent only to
the CPUs which have not reported within the timeout window.
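A hedged sketch of the handshake (IPI plumbing elided; timeout helper
hypothetical):

  static cpumask_t cpus_stop_mask;

  /* stop_other_cpus(): */
  cpumask_copy(&cpus_stop_mask, cpu_online_mask);
  cpumask_clear_cpu(smp_processor_id(), &cpus_stop_mask);
  /* ... send REBOOT vector IPIs to the mask, then: */
  while (!cpumask_empty(&cpus_stop_mask) && !timed_out())  /* hypothetical */
      cpu_relax();

  /* stop_this_cpu(), after the conditional WBINVD completed: */
  cpumask_clear_cpu(smp_processor_id(), &cpus_stop_mask);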
Fixes: 08f253ec37 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
Link: https://lore.kernel.org/r/87h6r770bv.ffs@tglx
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 85d38d5810 ]
When booting with "intremap=off" and "x2apic_phys" on the kernel command
line, the physical x2APIC driver ends up being used even when x2APIC
mode is disabled ("intremap=off" disables x2APIC mode). This happens
because the first compound condition check in x2apic_phys_probe() is
false due to x2apic_mode == 0, and so the following one returns true
since default_acpi_madt_oem_check() has already selected the physical
x2APIC driver.
This results in the following panic:
kernel BUG at arch/x86/kernel/apic/io_apic.c:2409!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.4.0-rc2-ver4.1rc2 #2
Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.3.6 07/06/2021
RIP: 0010:setup_IO_APIC+0x9c/0xaf0
Call Trace:
<TASK>
? native_read_msr
apic_intr_mode_init
x86_late_time_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel
secondary_startup_64_no_verify
</TASK>
which is:
setup_IO_APIC:
    apic_printk(APIC_VERBOSE, "ENABLING IO-APIC IRQs\n");
    for_each_ioapic(ioapic)
        BUG_ON(mp_irqdomain_create(ioapic));
Return 0 to denote that x2APIC has not been enabled when probing the
physical x2APIC driver.
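The fix, as a hedged sketch matching the description above:

  static int x2apic_phys_probe(void)
  {
      if (!x2apic_mode)
          return 0;  /* e.g. "intremap=off": don't claim the driver */

      if (x2apic_phys || x2apic_fadt_phys())
          return 1;

      return apic == &apic_x2apic_phys;
  }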
[ bp: Massage commit message heavily. ]
Fixes: 9ebd680bd0 ("x86, apic: Use probe routines to simplify apic selection")
Signed-off-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230616212236.1389-1-dheerajkumar.srivastava@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b9f174c811 ]
Commits ffb1b4a410 ("x86/unwind/orc: Add 'signal' field to ORC
metadata") and fb799447ae ("x86,objtool: Split UNWIND_HINT_EMPTY in
two") changed the ORC format. Although ORC is internal to the kernel,
it's the only way for external tools to get reliable kernel stack traces
on x86-64. In particular, the drgn debugger [1] uses ORC for stack
unwinding, and these format changes broke it [2]. As the drgn
maintainer, I don't care how often or how much the kernel changes the
ORC format as long as I have a way to detect the change.
It suffices to store a version identifier in the vmlinux and kernel
module ELF files (to use when parsing ORC sections from ELF), and in
kernel memory (to use when parsing ORC from a core dump+symbol table).
Rather than hard-coding a version number that needs to be manually
bumped, Peterz suggested hashing the definitions from orc_types.h. If
there is a format change that isn't caught by this, the hashing script
can be updated.
This patch adds an .orc_header allocated ELF section containing the
20-byte hash to vmlinux and kernel modules, along with the corresponding
__start_orc_header and __stop_orc_header symbols in vmlinux.
1: https://github.com/osandov/drgn
2: https://github.com/osandov/drgn/issues/303
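As a usage example, the emitted header of a built kernel can be
inspected with standard ELF tooling:

  $ readelf --hex-dump=.orc_header vmlinux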
Fixes: ffb1b4a410 ("x86/unwind/orc: Add 'signal' field to ORC metadata")
Fixes: fb799447ae ("x86,objtool: Split UNWIND_HINT_EMPTY in two")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lkml.kernel.org/r/aef9c8dc43915b886a8c48509a12ec1b006ca1ca.1686690801.git.osandov@osandov.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ad96f1c913 ]
The sysctl net/core/bpf_jit_enable does not work now due to commit
1022a5498f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc"). That
commit saved the jitted insns into 'rw_image' instead of 'image',
which caused bpf_jit_dump() to not dump the proper content.
With 'echo 2 > /proc/sys/net/core/bpf_jit_enable', run
'./test_progs -t fentry_test'. Without this patch, one of the jitted
images for a particular prog is:
flen=17 proglen=92 pass=4 image=0000000014c64883 from=test_progs pid=1807
00000000: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
00000010: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
00000020: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
00000030: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
00000040: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
00000050: cc cc cc cc cc cc cc cc cc cc cc cc
With this patch, the jitted image for the same prog is:
flen=17 proglen=92 pass=4 image=00000000b90254b7 from=test_progs pid=1809
00000000: f3 0f 1e fa 0f 1f 44 00 00 66 90 55 48 89 e5 f3
00000010: 0f 1e fa 31 f6 48 8b 57 00 48 83 fa 07 75 2b 48
00000020: 8b 57 10 83 fa 09 75 22 48 8b 57 08 48 81 e2 ff
00000030: 00 00 00 48 83 fa 08 75 11 48 8b 7f 18 be 01 00
00000040: 00 00 48 83 ff 0a 74 02 31 f6 48 bf 18 d0 14 00
00000050: 00 c9 ff ff 48 89 77 00 31 c0 c9 c3
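The one-line essence of the fix, as a hedged sketch: the jitted bytes
live in the RW staging buffer until they are copied into the final
image, so dump from there:

  /* in bpf_int_jit_compile(), when bpf_jit_enable > 1 */
  bpf_jit_dump(prog->len, proglen, pass + 1, rw_image);  /* was: image */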
Fixes: 1022a5498f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20230609005439.3173569-1-yhs@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit d082d48737 upstream.
KPTI keeps around two PGDs: one for userspace and another for the
kernel. Among other things, set_pgd() contains infrastructure to
ensure that updates to the kernel PGD are reflected in the user PGD
as well.
One side-effect of this is that set_pgd() expects to be passed whole
pages. Unfortunately, init_trampoline_kaslr() passes in a single entry:
'trampoline_pgd_entry'.
When KPTI is on, set_pgd() will update 'trampoline_pgd_entry' (an
8-byte variable stored globally in .bss) and will then proceed to
replicate that value into the non-existent neighboring user page
(located +4k away), leading to the corruption of other globals stored
in .bss.
Fix it by directly assigning 'trampoline_pgd_entry' and avoiding
set_pgd().
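The essence of the fix, as a hedged sketch in diff form (context
approximate):

  -	set_pgd(&trampoline_pgd_entry,
  -		__pgd(_KERNPG_TABLE | __pa(pud_page_tramp)));
  +	trampoline_pgd_entry =
  +		__pgd(_KERNPG_TABLE | __pa(pud_page_tramp));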
[ dhansen: tweak subject and changelog ]
Fixes: 0925dda596 ("x86/mm/KASLR: Use only one PUD entry for real mode trampoline")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lee Jones <lee@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/20230614163859.924309-1-lee@kernel.org/g
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a37f2699c3 ]
The call to startup_64_setup_env() will install a new GDT but does not
actually switch to using the KERNEL_CS entry until returning from the
function call.
Commit bcce829083 ("x86/sev: Detect/setup SEV/SME features earlier in
boot") moved the call to sme_enable() earlier in the boot process and in
between the call to startup_64_setup_env() and the switch to KERNEL_CS.
An SEV-ES or an SEV-SNP guest will trigger #VC exceptions during the call
to sme_enable(), and if the CS value pushed on the stack as part of the
exception and used by IRETQ is not mapped by the new GDT, then problems
occur.
Today, the current CS when entering startup_64 is the kernel CS value
because it was set up by the decompressor code, so no issue is seen.
However, a recent patchset that looked to avoid using the legacy
decompressor during an EFI boot exposed this bug. At entry to startup_64,
the CS value is that of EFI and is not mapped in the new kernel GDT. So
when a #VC exception occurs, the CS value used by IRETQ is not valid and
the guest boot crashes.
Fix this issue by moving the block that switches to the KERNEL_CS value to
be done immediately after returning from startup_64_setup_env().
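A hedged sketch of the reordering in head_64.S (the usual far-return
idiom; surrounding code elided):

  	call	startup_64_setup_env

  	/* switch to KERNEL_CS right away, before anything (like
  	 * sme_enable() under SEV-ES/SNP) can raise a #VC */
  	pushq	$__KERNEL_CS
  	leaq	.Lon_kernel_cs(%rip), %rax
  	pushq	%rax
  	lretq
  .Lon_kernel_cs: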
Fixes: bcce829083 ("x86/sev: Detect/setup SEV/SME features earlier in boot")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/all/6ff1f28af2829cc9aea357ebee285825f90a431f.1684340801.git.thomas.lendacky%40amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 4364b28798 upstream.
Bail from kvm_recalculate_phys_map() and disable the optimized map if the
target vCPU's x2APIC ID is out-of-bounds, i.e. if the vCPU was added
and/or enabled its local APIC after the map was allocated. This fixes an
out-of-bounds access bug in the !x2apic_format path where KVM would write
beyond the end of phys_map.
Check the x2APIC ID regardless of whether or not x2APIC is enabled,
as KVM hardcodes the x2APIC ID to be the vCPU ID, i.e. it can't change,
and the map allocation in kvm_recalculate_apic_map() doesn't check for
x2APIC being enabled, i.e. the check won't get false positives.
Note, this also affects the x2apic_format path, which previously just
ignored the "x2apic_id > new->max_apic_id" case. That too is arguably a
bug fix, as ignoring the vCPU meant that KVM would not send interrupts to
the vCPU until the next map recalculation. In practice, that "bug" is
likely benign as a newly present vCPU/APIC would immediately trigger a
recalc. But, there's no functional downside to disabling the map, and
a future patch will gracefully handle the -E2BIG case by retrying instead
of simply disabling the optimized map.
Opportunistically add a sanity check on the xAPIC ID size, along with a
comment explaining why the xAPIC ID is guaranteed to be "good".
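A hedged sketch of the added check (field names approximate):

  /* in kvm_recalculate_phys_map(): the vCPU may have been added and/or
   * enabled its APIC after the map was sized */
  if (x2apic_id > new->max_apic_id)
      return -E2BIG;  /* caller disables the optimized map (and may retry) */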
Reported-by: Michal Luczaj <mhal@rbox.co>
Fixes: 5b84b02917 ("KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDs")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230602233250.1014316-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 817fa99836 upstream.
Factor in the address space (non-SMM vs. SMM) of the target shadow page
when recovering potential NX huge pages, otherwise KVM will retrieve the
wrong memslot when zapping shadow pages that were created for SMM. The
bug most visibly manifests as a WARN on the memslot being non-NULL, but
the worst case scenario is that KVM could unaccount the shadow page
without ensuring KVM won't install a huge page, i.e. if the non-SMM slot
is being dirty logged, but the SMM slot is not.
------------[ cut here ]------------
WARNING: CPU: 1 PID: 3911 at arch/x86/kvm/mmu/mmu.c:7015
kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
CPU: 1 PID: 3911 Comm: kvm-nx-lpage-re
RIP: 0010:kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
RSP: 0018:ffff99b284f0be68 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff99b284edd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff9271397024e0 R08: 0000000000000000 R09: ffff927139702450
R10: 0000000000000000 R11: 0000000000000001 R12: ffff99b284f0be98
R13: 0000000000000000 R14: ffff9270991fcd80 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff927f9f640000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0aacad3ae0 CR3: 000000088fc2c005 CR4: 00000000003726e0
Call Trace:
<TASK>
__pfx_kvm_nx_huge_page_recovery_worker+0x10/0x10 [kvm]
kvm_vm_worker_thread+0x106/0x1c0 [kvm]
kthread+0xd9/0x100
ret_from_fork+0x2c/0x50
</TASK>
---[ end trace 0000000000000000 ]---
This bug was exposed by commit edbdb43fc9 ("KVM: x86: Preserve TDP MMU
roots until they are explicitly invalidated"), which allowed KVM to retain
SMM TDP MMU roots effectively indefinitely. Before commit edbdb43fc9,
KVM would zap all SMM TDP MMU roots and thus all SMM TDP MMU shadow pages
once all vCPUs exited SMM, which made the window where this bug (recovering
an SMM NX huge page) could be encountered quite tiny. To hit the bug, the
NX recovery thread would have to run while at least one vCPU was in SMM.
Most VMs typically only use SMM during boot, and so the problematic shadow
pages were gone by the time the NX recovery thread ran.
Now that KVM preserves TDP MMU roots until they are explicitly invalidated
(e.g. by a memslot deletion), the window to trigger the bug is effectively
never closed because most VMMs don't delete memslots after boot (except
for a handful of special scenarios).
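The core of the fix, as a hedged sketch: derive the memslots from the
shadow page's role (which includes the SMM address space bit) instead of
assuming the non-SMM address space:

  struct kvm_memslots *slots;

  slots = kvm_memslots_for_spte_role(kvm, sp->role);
  slot = __gfn_to_memslot(slots, sp->gfn);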
Fixes: eb29860570 ("KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages")
Reported-by: Fabio Coatti <fabio.coatti@gmail.com>
Closes: https://lore.kernel.org/all/CADpTngX9LESCdHVu_2mQkNGena_Ng2CphWNwsRGSMxzDsTjU2A@mail.gmail.com
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230602010137.784664-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3c845304d2 ]
After commit b752ea0c28 ("perf/x86/intel/ds: Flush PEBS DS when changing
PEBS_DATA_CFG"), cpuc->pebs_data_cfg may contain some bits that are not
supported by real hardware, such as PEBS_UPDATE_DS_SW. This would cause
the VMX hardware MSR switching mechanism to save/restore invalid values
for the PEBS_DATA_CFG MSR, thus crashing the host when PEBS is used for a
guest.
Fix it by using the active host value from cpuc->active_pebs_data_cfg.
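A hedged sketch of the affected spot in intel_guest_get_msrs()
(surrounding logic elided):

  arr[(*nr)++] = (struct perf_guest_switch_msr){
      .msr = MSR_PEBS_DATA_CFG,
      .host = cpuc->active_pebs_data_cfg,  /* was: cpuc->pebs_data_cfg */
      .guest = kvm_pmu->pebs_data_cfg,
  };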
Fixes: b752ea0c28 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
Signed-off-by: Like Xu <likexu@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20230517133808.67885-1-likexu@tencent.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b752ea0c28 ]
Several similar kernel warnings can be triggered:

  [56605.607840] CPU0 PEBS record size 0, expected 32, config 0 cpuc->record_size=208

when the below commands run in parallel for a while on SPR:

  while true;
  do
      perf record --no-buildid -a --intr-regs=AX \
          -e cpu/event=0xd0,umask=0x81/pp \
          -c 10003 -o /dev/null ./triad;
  done &

  while true;
  do
      perf record -o /tmp/out -W -d \
          -e '{ld_blocks.store_forward:period=1000000, \
               MEM_TRANS_RETIRED.LOAD_LATENCY:u:precise=2:ldlat=4}' \
          -c 1037 ./triad;
  done
The triad program is just the generation of loads/stores.
The warnings are triggered when an unexpected PEBS record (with a
different config and size) is found.
A system-wide PEBS event with the large PEBS config may be enabled
during a context switch. Some PEBS records for the system-wide PEBS
may be generated while the old task is sched out but the new one
hasn't been sched in yet. When the new task is sched in, the
cpuc->pebs_record_size may be updated for the per-task PEBS events. So
the existing system-wide PEBS records have a different size from the
later PEBS records.
The PEBS buffer should be flushed right before the hardware is
reprogrammed. The new size and threshold should be updated after the
old buffer has been flushed.
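A hedged sketch of the required ordering (condition simplified; the
drain helper exists in the PEBS code):

  if (cpuc->n_pebs && new_size != cpuc->pebs_record_size) {
      /* drain records written with the old config/size first */
      intel_pmu_drain_pebs_buffer();
      /* only now is it safe to switch size and threshold */
      cpuc->pebs_record_size = new_size;
      /* ... update the DS threshold accordingly ... */
  }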
Reported-by: Stephane Eranian <eranian@google.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230421184529.3320912-1-kan.liang@linux.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6ab39f9992 ]
The GFNI routines in the AVX version of the ARIA implementation now use
explicit VMOVDQA instructions to load the constant input vectors, which
means they must be 16-byte aligned. So ensure that this is the case, by
dropping the section split and the incorrect .align 8 directive, and
emitting the constants into the 16-byte aligned section instead.
Note that the AVX2 version of this code deviates from this pattern, and
does not require a similar fix, given that it loads these constants as
8-byte memory operands, for which AVX2 permits any alignment.
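A hedged sketch of the constant placement (section/label names
approximate, constant values elided):

  .section	.rodata.cst16, "aM", @progbits, 16
  .align 16
  .Lgfni_const_example:		/* hypothetical label */
  	.octa	0x0		/* real constant value elided */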
Cc: Taehee Yoo <ap420073@gmail.com>
Fixes: 8b84475318 ("crypto: x86/aria-avx - Do not use avx2 instructions")
Reported-by: syzbot+a6abcf08bad8b18fd198@syzkaller.appspotmail.com
Tested-by: syzbot+a6abcf08bad8b18fd198@syzkaller.appspotmail.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 2e4be0d011 upstream.
The commit e335bb51cc ("x86/unwind: Ensure stack pointer is aligned")
tried to align the stack pointer in show_trace_log_lvl(), otherwise the
"stack < stack_info.end" check can't guarantee that the last read does
not go past the end of the stack.
However, we have the same problem with the initial value of the stack
pointer; it can also be unaligned. So without this patch this trivial
kernel module

  #include <linux/module.h>

  static int init(void)
  {
      asm volatile("sub $0x4,%rsp");
      dump_stack();
      asm volatile("add $0x4,%rsp");
      return -EAGAIN;
  }
  module_init(init);
  MODULE_LICENSE("GPL");

crashes the kernel.
Fixes: e335bb51cc ("x86/unwind: Ensure stack pointer is aligned")
Signed-off-by: Vernon Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20230512104232.GA10227@redhat.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 335b422346 upstream.
Commit bf5e758f02 ("genirq/msi: Simplify sysfs handling") reworked the
creation of sysfs entries for MSI IRQs. The creation used to be in
msi_domain_alloc_irqs_descs_locked after calling ops->domain_alloc_irqs.
Then it moved into __msi_domain_alloc_irqs which is an implementation of
domain_alloc_irqs. However, Xen comes with the only other implementation
of domain_alloc_irqs and hence doesn't run the sysfs population code
anymore.
Commit 6c796996ee ("x86/pci/xen: Fixup fallout from the PCI/MSI
overhaul") set the flag MSI_FLAG_DEV_SYSFS for the xen msi_domain_info
but that doesn't actually have an effect because Xen uses its own
domain_alloc_irqs implementation.
Fix this by making use of the fallback functions for sysfs population.
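A hedged sketch of Xen's allocation path using those fallbacks (error
handling elided; both helpers exist in the generic MSI code):

  /* after Xen's own IRQ allocation succeeded */
  ret = msi_device_populate_sysfs(&dev->dev);

  /* and in the teardown path */
  msi_device_destroy_sysfs(&dev->dev);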
Fixes: bf5e758f02 ("genirq/msi: Simplify sysfs handling")
Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20230503131656.15928-1-mheyne@amazon.de
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit edc0a2b595 upstream.
Traditionally, all CPUs in a system have identical numbers of SMT
siblings. That changes with hybrid processors where some logical CPUs
have a sibling and others have none.
Today, the CPU boot code sets the global variable smp_num_siblings when
every CPU thread is brought up. The last thread to boot will overwrite
it with the number of siblings of *that* thread. That last thread to
boot will "win". If the thread is a Pcore, smp_num_siblings == 2. If it
is an Ecore, smp_num_siblings == 1.
smp_num_siblings describes if the *system* supports SMT. It should
specify the maximum number of SMT threads among all cores.
Ensure that smp_num_siblings represents the system-wide maximum number
of siblings by always increasing its value. Never allow it to decrease.
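A hedged sketch of the change in detect_extended_topology() (the macro
is from the existing topology code):

  /* never let a thread with fewer siblings shrink the system-wide max */
  smp_num_siblings = max_t(int, smp_num_siblings, LEVEL_MAX_SIBLINGS(ebx));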
On the MeteorLake-P platform, this fixes a problem where the Ecore CPUs
are not updated in any cpu sibling map because the system is treated as
a UP system when probing Ecore CPUs.
Below shows part of the CPU topology information before and after the
fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore).
...
-/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff
-/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11
+/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21
...
-/sys/devices/system/cpu/cpu12/topology/package_cpus:001000
-/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12
+/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff
+/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21
Notice that the "before" 'package_cpus_list' has only one CPU. This
means that userspace tools like lscpu will see a little laptop as an
11-socket system:
-Core(s) per socket: 1
-Socket(s): 11
+Core(s) per socket: 16
+Socket(s): 1
This is also expected to make the scheduler do rather wonky things.
[ dhansen: remove CPUID detail from changelog, add end user effects ]
CC: stable@kernel.org
Fixes: bbb65d2d36 ("x86: use cpuid vector 0xb when available for detecting cpu topology")
Fixes: 95f3d39ccf ("x86/cpu/topology: Provide detect_extended_topology_early()")
Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20230323015640.27906-1-rui.zhang%40intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ce0b15d11a upstream.
The INVLPG instruction is used to invalidate TLB entries for a
specified virtual address. When PCIDs are enabled, INVLPG is supposed
to invalidate TLB entries for the specified address for both the
current PCID *and* Global entries. (Note: Only kernel mappings set
Global=1.)
Unfortunately, some INVLPG implementations can leave Global
translations unflushed when PCIDs are enabled.
As a workaround, never enable PCIDs on affected processors.
I expect there to eventually be microcode mitigations to replace this
software workaround. However, the exact version numbers where that
will happen are not known today. Once the version numbers are set in
stone, the processor list can be tweaked to only disable PCIDs on
affected processors with affected microcode.
Note: if anyone wants a quick fix that doesn't require patching, just
stick 'nopcid' on your kernel command-line.
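A hedged sketch of the workaround's shape (the affected-model test is
hypothetical here):

  /* during early CPU setup, before PCID gets enabled */
  if (cpu_affected_by_invlpg_bug(c))  /* hypothetical helper */
      setup_clear_cpu_cap(X86_FEATURE_PCID);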
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This code no longer exists in mainline, because it was removed in
commit d2c95f9d68 ("x86: don't use REP_GOOD or ERMS for user memory
clearing") upstream.
However, rather than backport the full range of x86 memory clearing and
copying cleanups, fix the exception table annotation placement for the
final 'rep movsb' in clear_user_rep_good(): rather than pointing at the
actual instruction that did the user space access, it pointed to the
register move just before it.
That made sense from a code flow standpoint, but not from an actual
usage standpoint: it means that if user access takes an exception, the
exception handler won't actually find the instruction in the exception
tables.
As a result, rather than fixing it up and returning -EFAULT, it would
then turn it into a kernel oops report instead, something like:
BUG: unable to handle page fault for address: 0000000020081000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
...
RIP: 0010:clear_user_rep_good+0x1c/0x30 arch/x86/lib/clear_page_64.S:147
...
Call Trace:
__clear_user arch/x86/include/asm/uaccess_64.h:103 [inline]
clear_user arch/x86/include/asm/uaccess_64.h:124 [inline]
iov_iter_zero+0x709/0x1290 lib/iov_iter.c:800
iomap_dio_hole_iter fs/iomap/direct-io.c:389 [inline]
iomap_dio_iter fs/iomap/direct-io.c:440 [inline]
__iomap_dio_rw+0xe3d/0x1cd0 fs/iomap/direct-io.c:601
iomap_dio_rw+0x40/0xa0 fs/iomap/direct-io.c:689
ext4_dio_read_iter fs/ext4/file.c:94 [inline]
ext4_file_read_iter+0x4be/0x690 fs/ext4/file.c:145
call_read_iter include/linux/fs.h:2183 [inline]
do_iter_readv_writev+0x2e0/0x3b0 fs/read_write.c:733
do_iter_read+0x2f2/0x750 fs/read_write.c:796
vfs_readv+0xe5/0x150 fs/read_write.c:916
do_preadv+0x1b6/0x270 fs/read_write.c:1008
__do_sys_preadv2 fs/read_write.c:1070 [inline]
__se_sys_preadv2 fs/read_write.c:1061 [inline]
__x64_sys_preadv2+0xef/0x150 fs/read_write.c:1061
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
which then looks like a filesystem bug rather than the incorrect
exception annotation that it is.
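A hedged sketch of the one-liner (labels approximate): the exception
table entry must name the 'rep' instruction itself, not the register
move in front of it:

  .Lrep_good_bytes:
  	mov %edx, %ecx
  .Lrep_good_bytes_insn:		/* hypothetical label */
  	rep stosb
  	/* ... */
  -	_ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exception)
  +	_ASM_EXTABLE_UA(.Lrep_good_bytes_insn, .Lrep_good_exception)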
[ The alternative to this one-liner fix is to take the upstream series
that cleans this all up:
68674f94ff ("x86: don't use REP_GOOD or ERMS for small memory copies")
20f3337d35 ("x86: don't use REP_GOOD or ERMS for small memory clearing")
adfcf4231b ("x86: don't use REP_GOOD or ERMS for user memory copies")
* d2c95f9d68 ("x86: don't use REP_GOOD or ERMS for user memory clearing")
3639a53558 ("x86: move stac/clac from user copy routines into callers")
577e6a7fd5 ("x86: inline the 'rep movs' in user copies for the FSRM case")
8c9b6a88b7 ("x86: improve on the non-rep 'clear_user' function")
427fda2c8a ("x86: improve on the non-rep 'copy_user' function")
* e046fe5a36 ("x86: set FSRS automatically on AMD CPUs that have FSRM")
e1f2750edc ("x86: remove 'zerorest' argument from __copy_user_nocache()")
034ff37d34 ("x86: rewrite '__copy_user_nocache' function")
with either the whole series or at a minimum the two marked commits
being needed to fix this issue ]
Reported-by: syzbot <syzbot+401145a9a237779feb26@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=401145a9a237779feb26
Fixes: 0db7058e8e ("x86/clear_user: Make it faster")
Cc: Borislav Petkov <bp@alien8.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 23a5b8bb02 upstream.
Commit
310e782a99 ("platform/x86/amd: pmc: Utilize SMN index 0 for driver probe")
switched to using amd_smn_read() which relies upon the misc PCI ID used
by DF function 3 being included in a table. The ID for model 78h is
missing in that table, so amd_smn_read() doesn't work.
Add the missing ID into amd_nb, restoring s2idle on this system.
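A hedged sketch of the change (macro name approximate; the numeric
device ID is elided here):

  /* amd_nb.c: the misc IDs table gains the model 78h DF F3 entry */
  static const struct pci_device_id amd_nb_misc_ids[] = {
      /* ... existing entries ... */
      { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_19H_M78H_DF_F3) },
      {}
  };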
[ bp: Simplify commit message. ]
Fixes: 310e782a99 ("platform/x86/amd: pmc: Utilize SMN index 0 for driver probe")
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci_ids.h
Acked-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20230427053338.16653-2-mario.limonciello@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9a48d60467 upstream.
SYM_FUNC_START_LOCAL_NOALIGN() adds an endbr leading to this layout
(leaving only the last 2 bytes of the address):
  3bff <zen_untrain_ret>:
  3bff:       f3 0f 1e fa             endbr64
  3c03:       f6                      test   $0xcc,%bl

  3c04 <__x86_return_thunk>:
  3c04:       c3                      ret
  3c05:       cc                      int3
  3c06:       0f ae e8                lfence
However, "the RET at __x86_return_thunk must be on a 64 byte boundary,
for alignment within the BTB."
Use SYM_START instead.
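A hedged sketch (linkage argument assumed): SYM_START() with SYM_A_NONE
emits the bare symbol, with no alignment and no ENDBR decoration:

  -SYM_FUNC_START_LOCAL_NOALIGN(zen_untrain_ret)
  +SYM_START(zen_untrain_ret, SYM_L_GLOBAL, SYM_A_NONE)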
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cf9f4c0eb1 ]
Refresh the MMU's snapshot of the vCPU's CR0.WP prior to checking for
permission faults when emulating a guest memory access and CR0.WP may be
guest owned. If the guest toggles only CR0.WP and triggers emulation of
a supervisor write, e.g. when KVM is emulating UMIP, KVM may consume a
stale CR0.WP, i.e. use stale protection bits metadata.
Note, KVM passes through CR0.WP if and only if EPT is enabled as CR0.WP
is part of the MMU role for legacy shadow paging, and SVM (NPT) doesn't
support per-bit interception controls for CR0. Don't bother checking for
EPT vs. NPT as the "old == new" check will always be true under NPT, i.e.
the only cost is the read of vcpu->arch.cr4 (SVM unconditionally grabs CR0
from the VMCB on VM-Exit).
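A hedged sketch of the refresh helper (shape per the description above;
accessor names approximate):

  void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
                                        struct kvm_mmu *mmu)
  {
      const bool cr0_wp = kvm_is_cr0_bit_set(vcpu, X86_CR0_WP);

      /* "old == new": always true under NPT, cheap check under EPT */
      if (is_cr0_wp(mmu) == cr0_wp)
          return;

      mmu->cpu_role.base.cr0_wp = cr0_wp;
      reset_guest_paging_metadata(vcpu, mmu);  /* rebuild permission bits */
  }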
Reported-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lkml.kernel.org/r/677169b4-051f-fcae-756b-9a3e1bb9f8fe%40grsecurity.net
Fixes: fb509f76ac ("KVM: VMX: Make CR0.WP a guest owned bit")
Tested-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230405002608.418442-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net> # backport to v6.3.x
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fb509f76ac ]
Guests like grsecurity that make heavy use of CR0.WP to implement kernel
level W^X will suffer from the implied VMEXITs.
With EPT there is no need to intercept a guest change of CR0.WP, so
simply make it a guest owned bit if we can do so.
This implies that a read of a guest's CR0.WP bit might need a VMREAD.
However, the only potentially affected user seems to be kvm_init_mmu()
which is a heavy operation to begin with. But also most callers already
cache the full value of CR0 anyway, so no additional VMREAD is needed.
The only exception is nested_vmx_load_cr3().
This change is VMX-specific, as SVM has no such fine grained control
register intercept control.
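A hedged sketch of the VMX side (names per upstream conventions,
approximate):

  /* CR0.WP may be guest-owned only when EPT is in use */
  static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
  {
      unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;

      if (!enable_ept)
          bits &= ~X86_CR0_WP;  /* shadow paging must intercept WP */
      return bits;
  }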
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-7-minipli@grsecurity.net
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 01b31714bd ]
There is no need to unload the MMU roots with TDP enabled when only
CR0.WP has changed -- the paging structures are still valid, only the
permission bitmap needs to be updated.
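A hedged sketch of the fast path in kvm_post_set_cr0() (surrounding
checks elided):

  if ((cr0 ^ old_cr0) == X86_CR0_WP && tdp_enabled) {
      /* paging structures remain valid; just refresh the permission
       * bitmap instead of unloading the MMU roots */
      kvm_init_mmu(vcpu);
      return;
  }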
One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
implement kernel W^X.
The optimization brings a huge performance gain for this case as the
following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
grsecurity L1 VM shows (runtime in seconds, lower is better):
                         legacy     TDP    shadow
  kvm-x86/next@d8708b    8.43s     9.45s    70.3s
  +patch                 5.39s     5.63s    70.2s
For legacy MMU this is ~36% faster, for TDP MMU even ~40% faster. Also
TDP and legacy MMU now both have a similar runtime, which removes the
need to disable the TDP MMU for grsecurity.
Shadow MMU sees no measurable difference and is still slow, as expected.
[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-3-minipli@grsecurity.net
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5af507bef9 ]
arch_dynirq_lower_bound() is invoked by the core interrupt code to
retrieve the lowest possible Linux interrupt number for dynamically
allocated interrupts like MSI.
The x86 implementation uses this to exclude the IO/APIC GSI space.
This works correctly as long as there is an IO/APIC registered, but
returns 0 if not. This has been observed in VMs where the BIOS does
not advertise an IO/APIC.
0 is an invalid interrupt number except for the legacy timer interrupt
on x86. The return value is unchecked in the core code, so it ends up
allocating interrupt number 0, which is subsequently considered invalid
by the caller, e.g. the MSI allocation code.
The function already has a check for 0 in the case that an IO/APIC is
registered, as ioapic_dynirq_base is 0 in case of device tree setups.
Consolidate this and zero-check both ioapic_dynirq_base and gsi_top, the
latter being used in the case that no IO/APIC is registered.
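A hedged sketch of the consolidated checks (simplified):

  unsigned int arch_dynirq_lower_bound(unsigned int from)
  {
      unsigned int ret;

      /* ioapic_dynirq_base is 0 on DT setups, gsi_top is 0 when no
       * IO/APIC is registered; never hand out 0 as the lower bound */
      ret = ioapic_dynirq_base ? : gsi_top;
      return ret ? : from;
  }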
Fixes: 3e5bedc2c2 ("x86/apic: Fix arch_dynirq_lower_bound() bug for DT enabled machines")
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1679988604-20308-1-git-send-email-ssengar@linux.microsoft.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f96fb2df3e ]
The detection of atomic update failure in reserve_eilvt_offset() is
not correct. The value returned by atomic_cmpxchg() should be compared
to the old value from the location to be updated.
If these two are the same, then atomic update succeeded and
"eilvt_offsets[offset]" location is updated to "new" in an atomic way.
Otherwise, the atomic update failed and it should be retried with the
value from "eilvt_offsets[offset]" - exactly what atomic_try_cmpxchg()
does in a correct and more optimal way.
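A hedged sketch of the corrected loop (helpers per the existing code):
atomic_try_cmpxchg() refreshes 'rsvd' with the current value on failure,
so each retry operates on up-to-date data:

  rsvd = atomic_read(&eilvt_offsets[offset]);
  do {
      vector = rsvd & ~APIC_EILVT_MASKED;
      if (vector && !eilvt_entry_is_changeable(vector, new))
          return rsvd;
  } while (!atomic_try_cmpxchg(&eilvt_offsets[offset], &rsvd, new));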
Fixes: a68c439b19 ("apic, x86: Check if EILVT APIC registers are available (AMD only)")
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230227160917.107820-1-ubizjak@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4c1cdec319 ]
The maximum number of MCA banks is 64 (MAX_NR_BANKS), see
a0bc32b3ca ("x86/mce: Increase maximum number of banks to 64").
However, the bank_map which contains a bitfield of which banks to
initialize is of type unsigned int and that overflows when those bit
numbers are >= 32, leading to UBSAN complaining correctly:
UBSAN: shift-out-of-bounds in arch/x86/kernel/cpu/mce/amd.c:1365:38
shift exponent 32 is too large for 32-bit type 'int'
Change the bank_map to a u64 and use the proper BIT_ULL() macro when
modifying bits in there.
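A hedged sketch of the type change and its users (storage class per the
existing per-CPU code):

  static DEFINE_PER_CPU(u64, bank_map);  /* was: unsigned int */

  /* all bit tests/sets must use the 64-bit macro */
  if (!(this_cpu_read(bank_map) & BIT_ULL(bank)))
      continue;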
[ bp: Rewrite commit message. ]
Fixes: a0bc32b3ca ("x86/mce: Increase maximum number of banks to 64")
Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230127151601.1068324-1-muralimk@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit edbdb43fc9 upstream.
Preserve TDP MMU roots until they are explicitly invalidated by gifting
the TDP MMU itself a reference to a root when it is allocated. Keeping a
reference in the TDP MMU fixes a flaw where the TDP MMU exhibits terrible
performance, and can potentially even soft-hang a vCPU, if a vCPU
frequently unloads its roots, e.g. when KVM is emulating SMI+RSM.
When KVM emulates something that invalidates _all_ TLB entries, e.g. SMI
and RSM, KVM unloads all of the vCPU's roots (KVM keeps a small per-vCPU
cache of previous roots). Unloading roots is a simple way to ensure KVM
flushes and synchronizes all roots for the vCPU, as KVM flushes and syncs
when allocating a "new" root (from the vCPU's perspective).
In the shadow MMU, KVM keeps track of all shadow pages, roots included, in
a per-VM hash table. Unloading a shadow MMU root just wipes it from the
per-vCPU cache; the root is still tracked in the per-VM hash table. When
KVM loads a "new" root for the vCPU, KVM will find the old, unloaded root
in the per-VM hash table.
Unlike the shadow MMU, the TDP MMU doesn't track "inactive" roots in a
per-VM structure, where "active" in this case means a root is either
in-use or cached as a previous root by at least one vCPU. When a TDP MMU
root becomes inactive, i.e. the last vCPU reference to the root is put,
KVM immediately frees the root (asterisk on "immediately" as the actual
freeing may be done by a worker, but for all intents and purposes the root
is gone).
The TDP MMU behavior is especially problematic for 1-vCPU setups, as
unloading all roots effectively frees all roots. The issue is mitigated
to some degree in multi-vCPU setups as a different vCPU usually holds a
reference to an unloaded root and thus keeps the root alive, allowing the
vCPU to reuse its old root after unloading (with a flush+sync).
The TDP MMU flaw has been known for some time, as until very recently,
KVM's handling of CR0.WP also triggered unloading of all roots. The
CR0.WP toggling scenario was eventually addressed by not unloading roots
when _only_ CR0.WP is toggled, but such an approach doesn't Just Work
for emulating SMM as KVM must emulate a full TLB flush on entry and exit
to/from SMM. Given that the shadow MMU plays nice with unloading roots
at will, teaching the TDP MMU to do the same is far less complex than
modifying KVM to track which roots need to be flushed before reuse.
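A hedged sketch of the "gifted" reference at allocation time (function
names per the TDP MMU, approximate):

  root = tdp_mmu_alloc_sp(vcpu);
  tdp_mmu_init_sp(root, NULL, 0, role);

  /* one reference for the returned root, plus one held by the TDP MMU
   * itself so the root survives having no vCPU users */
  refcount_set(&root->tdp_mmu_root_count, 2);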
Note, preserving all possible TDP MMU roots is not a concern with respect
to memory consumption. Now that the role for direct MMUs doesn't include
information about the guest, e.g. CR0.PG, CR0.WP, CR4.SMEP, etc., there
are _at most_ six possible roots (where "guest_mode" here means L2):
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
And because each vCPU can track 4 valid roots, a VM can already have all
6 root combinations live at any given time. Not to mention that, in
practice, no sane VMM will advertise different guest.MAXPHYADDR values
across vCPUs, i.e. KVM won't ever use both 4-level and 5-level roots for
a single VM. Furthermore, the vast majority of modern hypervisors will
utilize EPT/NPT when available, thus the guest_mode=%true cases are also
unlikely to be utilized.
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
Link: https://lore.kernel.org/all/20230322013731.102955-1-minipli@grsecurity.net
Link: https://lore.kernel.org/all/000000000000a0bc2b05f9dd7fab@google.com
Link: https://lore.kernel.org/all/000000000000eca0b905fa0f7756@google.com
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20230426220323.3079789-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4984563823 upstream.
Extend VMX's nested intercept logic for emulated instructions to handle
"pause" interception, in quotes because KVM's emulator doesn't filter out
NOPs when checking for nested intercepts. Failure to allow emulation of
NOPs results in KVM injecting a #UD into L2 on any NOP that collides with
the emulator's definition of PAUSE, i.e. on all single-byte NOPs.
For PAUSE itself, honor L1's PAUSE-exiting control, but ignore PLE to
avoid unnecessarily injecting a #UD into L2. Per the SDM, the first
execution of PAUSE after VM-Entry is treated as the beginning of a new
loop, i.e. will never trigger a PLE VM-Exit, and so L1 can't expect any
given execution of PAUSE to deterministically exit.
  ... the processor considers this execution to be the first execution of
  PAUSE in a loop. (It also does so for the first execution of PAUSE at
  CPL 0 after VM entry.)
All that said, the PLE side of things is currently a moot point, as KVM
doesn't expose PLE to L1.
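A hedged sketch of the added handling in vmx_check_intercept() (guards
approximate):

  case x86_intercept_pause:
      /* single-byte NOPs (no REPE prefix) never VM-Exit; for real
       * PAUSE, honor L1's PAUSE-exiting control and ignore PLE */
      if (info->rep_prefix != REPE_PREFIX ||
          !nested_cpu_has(vmcs12, CPU_BASED_PAUSE_EXITING))
          return X86EMUL_CONTINUE;
      break;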
Note, vmx_check_intercept() is still wildly broken when L1 wants to
intercept an instruction, as KVM injects a #UD instead of synthesizing a
nested VM-Exit. That issue extends far beyond NOP/PAUSE and needs far
more effort to fix, i.e. is a problem for the future.
Fixes: 07721feee4 ("KVM: nVMX: Don't emulate instructions in guest mode")
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230405002359.418138-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>