Pull hyperv fixes from Wei Liu:
- Add a documentation overview of Confidential Computing VM support
(Michael Kelley)
- Use lapic timer in a TDX VM without paravisor (Dexuan Cui)
- Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
(Michael Kelley)
- Fix a kexec crash due to VP assist page corruption (Anirudh
Rayabharam)
- Python3 compatibility fix for lsvmbus (Anthony Nandaa)
- Misc fixes (Rachel Menge, Roman Kisel, zhang jiao, Hongbo Li)
* tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hv: vmbus: Constify struct kobj_type and struct attribute_group
tools: hv: rm .*.cmd when make clean
x86/hyperv: fix kexec crash due to VP assist page corruption
Drivers: hv: vmbus: Fix the misplaced function description
tools: hv: lsvmbus: change shebang to use python3
x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
Documentation: hyperv: Add overview of Confidential Computing VM support
clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor
Drivers: hv: Remove deprecated hv_fcopy declarations
Merge "hwmon fixes for v6.11-rc7" into review-hans to bring in
commit a54da9df75 ("hwmon: (hp-wmi-sensors) Check if WMI event
data exists").
This is a dependency for a set of WMI event data refactoring changes.
commit 9636be85cc ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when
CPUs go online/offline") introduces a new cpuhp state for hyperv
initialization.
cpuhp_setup_state() returns the state number if state is
CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN and 0 for all other states.
For the hyperv case, since a new cpuhp state was introduced it would
return 0. However, in hv_machine_shutdown(), the cpuhp_remove_state() call
is conditioned upon "hyperv_init_cpuhp > 0". This will never be true and
so hv_cpu_die() won't be called on all CPUs. This means the VP assist page
won't be reset. When the kexec kernel tries to setup the VP assist page
again, the hypervisor corrupts the memory region of the old VP assist page
causing a panic in case the kexec kernel is using that memory elsewhere.
This was originally fixed in commit dfe94d4086 ("x86/hyperv: Fix kexec
panic/hang issues").
Get rid of hyperv_init_cpuhp entirely since we are no longer using a
dynamic cpuhp state and use CPUHP_AP_HYPERV_ONLINE directly with
cpuhp_remove_state().
Cc: stable@vger.kernel.org
Fixes: 9636be85cc ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline")
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20240828112158.3538342-1-anirudh@anirudhrb.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240828112158.3538342-1-anirudh@anirudhrb.com>
Register the "disable virtualization in an emergency" callback just
before KVM enables virtualization in hardware, as there is no functional
need to keep the callbacks registered while KVM happens to be loaded, but
is inactive, i.e. if KVM hasn't enabled virtualization.
Note, unregistering the callback every time the last VM is destroyed could
have measurable latency due to the synchronize_rcu() needed to ensure all
references to the callback are dropped before KVM is unloaded. But the
latency should be a small fraction of the total latency of disabling
virtualization across all CPUs, and userspace can set enable_virt_at_load
to completely eliminate the runtime overhead.
Add a pointer in kvm_x86_ops to allow vendor code to provide its callback.
There is no reason to force vendor code to do the registration, and either
way KVM would need a new kvm_x86_ops hook.
Suggested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order be able to compute the sizes of tasks consistently across all
CPUs in a hybrid system, it is necessary to provide CPU capacity scaling
information to the scheduler via arch_scale_cpu_capacity(). Moreover,
the value returned by arch_scale_freq_capacity() for the given CPU must
correspond to the arch_scale_cpu_capacity() return value for it, or
utilization computations will be inaccurate.
Add support for it through per-CPU variables holding the capacity and
maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will
be returned by arch_scale_cpu_capacity() and used by scale_freq_tick()
to compute arch_freq_scale for the current CPU, respectively.
In order to avoid adding measurable overhead for non-hybrid x86 systems,
which are the vast majority in the field, whether or not the new hybrid
CPU capacity scaling will be in effect is controlled by a static key.
This static key is set by calling arch_enable_hybrid_capacity_scale()
which also allocates memory for the per-CPU data and initializes it.
Next, arch_set_cpu_capacity() is used to set the per-CPU variables
mentioned above for each CPU and arch_rebuild_sched_domains() needs
to be called for the scheduler to realize that capacity-aware
scheduling can be used going forward.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance
Link: https://patch.msgid.link/10523497.nUPlyArG6x@rjwysocki.net
[ rjw: Added parens to function kerneldoc comments ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There's an erratum that prevents the PAT from working correctly:
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-specification-update.pdf
# Document 316515 Version 010
The kernel currently disables PAT support on those CPUs, but it
does it with some magic numbers.
Replace the magic numbers with the new "IFM" macros.
Make the check refer to the last affected CPU (INTEL_CORE_YONAH)
rather than the first fixed one. This makes it easier to find the
documentation of the erratum since Intel documents where it is
broken and not where it is fixed.
I don't think the Pentium Pro (or Pentium II) is actually affected.
But the old check included them, so it can't hurt to keep doing the
same. I'm also not completely sure about the "Pentium M" CPUs
(models 0x9 and 0xd). But, again, they were included in in the
old checks and were close Pentium III derivatives, so are likely
affected.
While we're at it, revise the comment referring to the erratum name
and making sure it is a quote of the language from the actual errata
doc. That should make it easier to find in the future when the URL
inevitably changes.
Why bother with this in the first place? It actually gets rid of one
of the very few remaining direct references to c->x86{,_model}.
No change in functionality intended.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Link: https://lore.kernel.org/r/20240829220042.1007820-1-dave.hansen@linux.intel.com
Pull x86 fixes from Thomas Gleixner:
- x2apic_disable() clears x2apic_state and x2apic_mode unconditionally,
even when the state is X2APIC_ON_LOCKED, which prevents the kernel to
disable it thereby creating inconsistent state.
Reorder the logic so it actually works correctly
- The XSTATE logic for handling LBR is incorrect as it assumes that
XSAVES supports LBR when the CPU supports LBR. In fact both
conditions need to be true. Otherwise the enablement of LBR in the
IA32_XSS MSR fails and subsequently the machine crashes on the next
XRSTORS operation because IA32_XSS is not initialized.
Cache the XSTATE support bit during init and make the related
functions use this cached information and the LBR CPU feature bit to
cure this.
- Cure a long standing bug in KASLR
KASLR uses the full address space between PAGE_OFFSET and vaddr_end
to randomize the starting points of the direct map, vmalloc and
vmemmap regions. It thereby limits the size of the direct map by
using the installed memory size plus an extra configurable margin for
hot-plug memory. This limitation is done to gain more randomization
space because otherwise only the holes between the direct map,
vmalloc, vmemmap and vaddr_end would be usable for randomizing.
The limited direct map size is not exposed to the rest of the kernel,
so the memory hot-plug and resource management related code paths
still operate under the assumption that the available address space
can be determined with MAX_PHYSMEM_BITS.
request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards. That means the first allocation happens past the end of
the direct map and if unlucky this address is in the vmalloc space,
which causes high_memory to become greater than VMALLOC_START and
consequently causes iounmap() to fail for valid ioremap addresses.
Cure this by exposing the end of the direct map via PHYSMEM_END and
use that for the memory hot-plug and resource management related
places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case
PHYSMEM_END maps to a variable which is initialized by the KASLR
initialization and otherwise it is based on MAX_PHYSMEM_BITS as
before.
- Prevent a data leak in mmio_read(). The TDVMCALL exposes the value of
an initialized variabled on the stack to the VMM. The variable is
only required as output value, so it does not have to exposed to the
VMM in the first place.
- Prevent an array overrun in the resource control code on systems with
Sub-NUMA Clustering enabled because the code failed to adjust the
index by the number of SNC nodes per L3 cache.
* tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Fix arch_mbm_* array overrun on SNC
x86/tdx: Fix data leak in mmio_read()
x86/kaslr: Expose and use the end of the physical memory address space
x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported
x86/apic: Make x2apic_disable() work correctly
Exit to userspace if a fastpath handler triggers such an exit, which can
happen when skipping the instruction, e.g. due to userspace
single-stepping the guest via KVM_GUESTDBG_SINGLESTEP or because of an
emulation failure.
Fixes: 404d5d7bff ("KVM: X86: Introduce more exit_fastpath_completion enum values")
Link: https://lore.kernel.org/r/20240802195120.325560-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Incorporate the _host_ SEV-ES save area into the VMCB as a union with the
legacy save area. The SEV-ES variant used to save/load host state is
larger than the legacy save area, but resides at the same offset. Prefix
the field with "host" to make it as obvious as possible that the SEV-ES
variant in the VMCB is only ever used for host state. Guest state for
SEV-ES VMs is stored in a completely separate page (VMSA), albeit with
the same layout as the host state.
Add a compile-time assert to ensure the VMCB layout is correct, i.e. that
KVM's layout matches the architectural definitions.
No functional change intended.
Link: https://lore.kernel.org/r/20240802204511.352017-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's
IPI virtualization support, but only for AMD. While not stated anywhere
in the APM, despite stating the ICR is a single 64-bit register, AMD CPUs
store the 64-bit ICR as two separate 32-bit values in ICR and ICR2. When
IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled,
KVM needs to match CPU behavior as some ICR ICR writes will be handled by
the CPU, not by KVM.
Add a kvm_x86_ops knob to control the underlying format used by the CPU to
store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether
or not x2AVIC is enabled. If KVM is handling all ICR writes, the storage
format for x2APIC mode doesn't matter, and having the behavior follow AMD
versus Intel will provide better test coverage and ease debugging.
Fixes: 4d1d7942e3 ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When using resctrl on systems with Sub-NUMA Clustering enabled, monitoring
groups may be allocated RMID values which would overrun the
arch_mbm_{local,total} arrays.
This is due to inconsistencies in whether the SNC-adjusted num_rmid value or
the unadjusted value in resctrl_arch_system_num_rmid_idx() is used. The
num_rmid value for the L3 resource is currently:
resctrl_arch_system_num_rmid_idx() / snc_nodes_per_l3_cache
As a simple fix, make resctrl_arch_system_num_rmid_idx() return the
SNC-adjusted, L3 num_rmid value on x86.
Fixes: e13db55b5a ("x86/resctrl: Introduce snc_nodes_per_l3_cache")
Signed-off-by: Peter Newman <peternewman@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20240822190212.1848788-1-peternewman@google.com
Currently, struct snp_guest_msg includes a message header (96 bytes) and
a payload (4000 bytes). There is an implicit assumption here that the
SNP message header will always be 96 bytes, and with that assumption the
payload array size has been set to 4000 bytes - a magic number. If any
new member is added to the SNP message header, the SNP guest message
will span more than a page.
Instead of using a magic number for the payload, declare struct
snp_guest_msg in a way that payload plus the message header do not
exceed a page.
[ bp: Massage. ]
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240731150811.156771-5-nikunj@amd.com
The FRED RSP0 MSR points to the top of the kernel stack for user level
event delivery. As this is the task stack it needs to be updated when a
task is scheduled in.
The update is done at context switch. That means it's also done when
switching to kernel threads, which is pointless as those never go out to
user space. For KVM threads this means there are two writes to FRED_RSP0 as
KVM has to switch to the guest value before VMENTER.
Defer the update to the exit to user space path and cache the per CPU
FRED_RSP0 value, so redundant writes can be avoided.
Provide fred_sync_rsp0() for KVM to keep the cache in sync with the actual
MSR value after returning from guest to host mode.
[ tglx: Massage change log ]
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240822073906.2176342-4-xin@zytor.com
In most cases, ti_work values passed to arch_exit_to_user_mode_prepare()
are zeros, e.g., 99% in kernel build tests. So an obvious optimization is
to test ti_work for zero before processing individual bits in it.
Omit the optimization when FPU debugging is enabled, otherwise the
FPU consistency check is never executed.
Intel 0day tests did not find a perfermance regression with this change.
Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240822073906.2176342-2-xin@zytor.com
Modern (fortified) memcpy() prefers to avoid writing (or reading) beyond
the end of the addressed destination (or source) struct member:
In function ‘fortify_memcpy_chk’,
inlined from ‘syscall_get_arguments’ at ./arch/x86/include/asm/syscall.h:85:2,
inlined from ‘populate_seccomp_data’ at kernel/seccomp.c:258:2,
inlined from ‘__seccomp_filter’ at kernel/seccomp.c:1231:3:
./include/linux/fortify-string.h:580:25: error: call to ‘__read_overflow2_field’ declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror=attribute-warning]
580 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As already done for x86_64 and compat mode, do not use memcpy() to
extract syscall arguments from struct pt_regs but rather just perform
direct assignments. Binary output differences are negligible, and actually
ends up using less stack space:
- sub $0x84,%esp
+ sub $0x6c,%esp
and less text size:
text data bss dec hex filename
10794 252 0 11046 2b26 gcc-32b/kernel/seccomp.o.stock
10714 252 0 10966 2ad6 gcc-32b/kernel/seccomp.o.after
Closes: https://lore.kernel.org/lkml/9b69fb14-df89-4677-9c82-056ea9e706f5@gmail.com/
Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Link: https://lore.kernel.org/all/20240708202202.work.477-kees%40kernel.org
Rename all APIs related to feature MSRs from get_msr_feature() to
get_feature_msr(). The APIs get "feature MSRs", not "MSR features".
And unlike kvm_{g,s}et_msr_common(), the "feature" adjective doesn't
describe the helper itself.
No functional change intended.
Link: https://lore.kernel.org/r/20240802181935.292540-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Refactor get_msr_feature() to take the index and data pointer as distinct
parameters in anticipation of eliminating "struct kvm_msr_entry" usage
further up the primary callchain.
No functional change intended.
Link: https://lore.kernel.org/r/20240802181935.292540-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if
L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has
disallowed access to via an MSR filter. Intel already disallows including
a handful of "special" MSRs in the VMCS lists, so denying access isn't
completely without precedent.
More importantly, the behavior is well-defined _and_ can be communicated
the end user, e.g. to the customer that owns a VM running as L1 on top of
KVM. On the other hand, ignoring userspace MSR filters is all but
guaranteed to result in unexpected behavior as the access will hit KVM's
internal state, which is likely not up-to-date.
Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS
fields, the MSRs in the VMCS load/store lists are 100% guest controlled,
thus making it all but impossible to reason about the correctness of
ignoring the MSR filter. And if userspace *really* wants to deny access
to MSRs via the aforementioned scenarios, userspace can hide the
associated feature from the guest, e.g. by disabling the PMU to prevent
accessing PERF_GLOBAL_CTRL via its VMCS field. But for the MSR lists, KVM
is blindly processing MSRs; the MSR filters are the _only_ way for
userspace to deny access.
This partially reverts commit ac8d6cad3c ("KVM: x86: Only do MSR
filtering when access MSR by rdmsr/wrmsr").
Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the handful of MSR_IA32_VMX_MISC bit defines that are currently in
msr-indx.h to vmx.h so that all of the VMX_MISC defines and wrappers can
be found in a single location.
Opportunistically use BIT_ULL() instead of open coding hex values, add
defines for feature bits that are architecturally defined, and move the
defines down in the file so that they are colocated with the helpers for
getting fields from VMX_MISC.
No functional change intended.
Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a helper to encode the VMCS revision, size, and supported memory types
in MSR_IA32_VMX_BASIC, i.e. when synthesizing KVM's supported BASIC MSR
value, and delete the now unused VMCS size and memtype shift macros.
For a variety of reasons, KVM has shifted (pun intended) to using helpers
to *get* information from the VMX MSRs, as opposed to defined MASK and
SHIFT macros for direct use. Provide a similar helper for the nested VMX
code, which needs to *set* information, so that KVM isn't left with a mix
of SHIFT macros and dedicated helpers.
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Track the "basic" capabilities VMX MSR as a single u64 in vmcs_config
instead of splitting it across three fields, that obviously don't combine
into a single 64-bit value, so that KVM can use the macros that define MSR
bits using their absolute position. Replace all open coded shifts and
masks, many of which are relative to the "high" half, with the appropriate
macro.
Opportunistically use VMX_BASIC_32BIT_PHYS_ADDR_ONLY instead of an open
coded equivalent, and clean up the related comment to not reference a
specific SDM section (to the surprise of no one, the comment is stale).
No functional change intended (though obviously the code generation will
be quite different).
Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the bit defines for MSR_IA32_VMX_BASIC from msr-index.h to vmx.h so
that they are colocated with other VMX MSR bit defines, and with the
helpers that extract specific information from an MSR_IA32_VMX_BASIC value.
Opportunistically use BIT_ULL() instead of open coding hex values.
Opportunistically rename VMX_BASIC_64 to VMX_BASIC_32BIT_PHYS_ADDR_ONLY,
as "VMX_BASIC_64" is widly misleading. The flag enumerates that addresses
are limited to 32 bits, not that 64-bit addresses are allowed.
Last but not least, opportunistically #define DUAL_MONITOR_TREATMENT so
that all known single-bit feature flags are defined (this will allow
replacing open-coded literals in the future).
Cc: Shan Kang <shan.kang@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add defines for the architectural memory types that can be shoved into
various MSRs and registers, e.g. MTRRs, PAT, VMX capabilities MSRs, EPTPs,
etc. While most MSRs/registers support only a subset of all memory types,
the values themselves are architectural and identical across all users.
Leave the goofy MTRR_TYPE_* definitions as-is since they are in a uapi
header, but add compile-time assertions to connect the dots (and sanity
check that the msr-index.h values didn't get fat-fingered).
Keep the VMX_EPTP_MT_* defines so that it's slightly more obvious that the
EPTP holds a single memory type in 3 of its 64 bits; those bits just
happen to be 2:0, i.e. don't need to be shifted.
Opportunistically use X86_MEMTYPE_WB instead of an open coded '6' in
setup_vmcs_config().
No functional change intended.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240605231918.2915961-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Advertise AVX10.1 related CPUIDs, i.e. report AVX10 support bit via
CPUID.(EAX=07H, ECX=01H):EDX[bit 19] and new CPUID leaf 0x24H so that
guest OS and applications can query the AVX10.1 CPUIDs directly. Intel
AVX10 represents the first major new vector ISA since the introduction of
Intel AVX512, which will establish a common, converged vector instruction
set across all Intel architectures[1].
AVX10.1 is an early version of AVX10, that enumerates the Intel AVX512
instruction set at 128, 256, and 512 bits which is enabled on
Granite Rapids. I.e., AVX10.1 is only a new CPUID enumeration with no
new functionality. New features, e.g. Embedded Rounding and Suppress
All Exceptions (SAE) will be introduced in AVX10.2.
Advertising AVX10.1 is safe because there is nothing to enable for AVX10.1,
i.e. it's purely a new way to enumerate support, thus there will never be
anything for the kernel to enable. Note just the CPUID checking is changed
when using AVX512 related instructions, e.g. if using one AVX512
instruction needs to check (AVX512 AND AVX512DQ), it can check
((AVX512 AND AVX512DQ) OR AVX10.1) after checking XCR0[7:5].
The versions of AVX10 are expected to be inclusive, e.g. version N+1 is
a superset of version N. Per the spec, the version can never be 0, just
advertise AVX10.1 if it's supported in hardware. Moreover, advertising
AVX10_{128,256,512} needs to land in the same commit as advertising basic
AVX10.1 support, otherwise KVM would advertise an impossible CPU model.
E.g. a CPU with AVX512 but not AVX10.1/512 is impossible per the SDM.
As more and more AVX related CPUIDs are added (it would have resulted in
around 40-50 CPUID flags when developing AVX10), the versioning approach
is introduced. But incrementing version numbers are bad for virtualization.
E.g. if AVX10.2 has a feature that shouldn't be enumerated to guests for
whatever reason, then KVM can't enumerate any "later" features either,
because the only way to hide the problematic AVX10.2 feature is to set the
version to AVX10.1 or lower[2]. But most AVX features are just passed
through and don't have virtualization controls, so AVX10 should not be
problematic in practice, so long as Intel honors their promise that future
versions will be supersets of past versions.
[1] https://cdrdv2.intel.com/v1/dl/getContent/784267
[2] https://lore.kernel.org/all/Zkz5Ak0PQlAN8DxK@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Link: https://lore.kernel.org/r/20240819062327.3269720-1-tao1.su@linux.intel.com
[sean: minor changelog tweaks]
Signed-off-by: Sean Christopherson <seanjc@google.com>
iounmap() on x86 occasionally fails to unmap because the provided valid
ioremap address is not below high_memory. It turned out that this
happens due to KASLR.
KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
randomize the starting points of the direct map, vmalloc and vmemmap
regions. It thereby limits the size of the direct map by using the
installed memory size plus an extra configurable margin for hot-plug
memory. This limitation is done to gain more randomization space
because otherwise only the holes between the direct map, vmalloc,
vmemmap and vaddr_end would be usable for randomizing.
The limited direct map size is not exposed to the rest of the kernel, so
the memory hot-plug and resource management related code paths still
operate under the assumption that the available address space can be
determined with MAX_PHYSMEM_BITS.
request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards. That means the first allocation happens past the end of the
direct map and if unlucky this address is in the vmalloc space, which
causes high_memory to become greater than VMALLOC_START and consequently
causes iounmap() to fail for valid ioremap addresses.
MAX_PHYSMEM_BITS cannot be changed for that because the randomization
does not align with address bit boundaries and there are other places
which actually require to know the maximum number of address bits. All
remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found
to be correct.
Cure this by exposing the end of the direct map via PHYSMEM_END and use
that for the memory hot-plug and resource management related places
instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
maps to a variable which is initialized by the KASLR initialization and
otherwise it is based on MAX_PHYSMEM_BITS as before.
To prevent future hickups add a check into add_pages() to catch callers
trying to add memory above PHYSMEM_END.
Fixes: 0483e1fa6e ("x86/mm: Implement ASLR for kernel memory regions")
Reported-by: Max Ramanouski <max8rr8@gmail.com>
Reported-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-By: Max Ramanouski <max8rr8@gmail.com>
Tested-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
In any normal situation this really shouldn't matter, but in case the
address passed in to masked_user_access_begin() were to be some complex
expression, we should evaluate it fully before doing the 'stac'
instruction.
And even without that issue (which objdump would pick up on for any
really bad case), just in general we should strive to minimize the
amount of code we run with user accesses enabled.
For example, even for the trivial pselect6() case, the code generation
(obviously with a non-debug build) just diff with this ends up being
- stac
mov %rax,%rcx
sar $0x3f,%rcx
or %rax,%rcx
+ stac
mov (%rcx),%r13
mov 0x8(%rcx),%r14
clac
so the area delimeted by the 'stac / clac' pair is now literally just
the two user access instructions, and the address generation has been
moved out to before that code.
This will be much more noticeable if we end up deciding that we can go
back to just inlining "get_user()" using the new masked user access
model. The get_user() pointers can often be more complex expressions
involving kernel memory accesses or even function calls.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The Spectre-v1 mitigations made "access_ok()" much more expensive, since
it has to serialize execution with the test for a valid user address.
All the normal user copy routines avoid this by just masking the user
address with a data-dependent mask instead, but the fast
"unsafe_user_read()" kind of patterms that were supposed to be a fast
case got slowed down.
This introduces a notion of using
src = masked_user_access_begin(src);
to do the user address sanity using a data-dependent mask instead of the
more traditional conditional
if (user_read_access_begin(src, len)) {
model.
This model only works for dense accesses that start at 'src' and on
architectures that have a guard region that is guaranteed to fault in
between the user space and the kernel space area.
With this, the user access doesn't need to be manually checked, because
a bad address is guaranteed to fault (by some architecture masking
trick: on x86-64 this involves just turning an invalid user address into
all ones, since we don't map the top of address space).
This only converts a couple of examples for now. Example x86-64 code
generation for loading two words from user space:
stac
mov %rax,%rcx
sar $0x3f,%rcx
or %rax,%rcx
mov (%rcx),%r13
mov 0x8(%rcx),%r14
clac
where all the error handling and -EFAULT is now purely handled out of
line by the exception path.
Of course, if the micro-architecture does badly at 'clac' and 'stac',
the above is still pitifully slow. But at least we did as well as we
could.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two distinct CPU features related to the use of XSAVES and LBR:
whether LBR is itself supported and whether XSAVES supports LBR. The LBR
subsystem correctly checks both in intel_pmu_arch_lbr_init(), but the
XSTATE subsystem does not.
The LBR bit is only removed from xfeatures_mask_independent when LBR is not
supported by the CPU, but there is no validation of XSTATE support.
If XSAVES does not support LBR the write to IA32_XSS causes a #GP fault,
leaving the state of IA32_XSS unchanged, i.e. zero. The fault is handled
with a warning and the boot continues.
Consequently the next XRSTORS which tries to restore supervisor state fails
with #GP because the RFBM has zero for all supervisor features, which does
not match the XCOMP_BV field.
As XFEATURE_MASK_FPSTATE includes supervisor features setting up the FPU
causes a #GP, which ends up in fpu_reset_from_exception_fixup(). That fails
due to the same problem resulting in recursive #GPs until the kernel runs
out of stack space and double faults.
Prevent this by storing the supported independent features in
fpu_kernel_cfg during XSTATE initialization and use that cached value for
retrieving the independent feature bits to be written into IA32_XSS.
[ tglx: Massaged change log ]
Fixes: f0dccc9da4 ("x86/fpu/xstate: Support dynamic supervisor feature for LBR")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mitchell Levy <levymitchell0@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240812-xsave-lbr-fix-v3-1-95bac1bf62f4@gmail.com