Both INT (INT n, INT1, INT3, INTO) and UD (UD0, UD1, UD2) serve special
purposes in the kernel, e.g., INT3 is used by KGDB and UD2 is involved
in LLVM-KCFI instrumentation. At the same time, attaching kprobes on
these instructions (particularly UD) will pollute the stack trace dumped
in the kernel ring buffer, since the exception is triggered in the copy
buffer rather than the original location.
Check for INT and UD in can_probe() and reject any kprobes trying to
attach to these instructions.
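As a rough illustration of the check described above, the sketch below
classifies the leading opcode bytes of an instruction. It is not the
kernel's can_probe() code, which works on instructions decoded by the
in-kernel instruction decoder. The helper name is_int_or_ud() and its
raw-byte interface are assumptions made for this example, and instruction
prefixes are not handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if the opcode bytes at 'op' encode an
 * INT or UD instruction. Legacy and REX prefixes are not handled here.
 */
bool is_int_or_ud(const uint8_t *op, size_t len)
{
	if (len == 0)
		return false;

	switch (op[0]) {
	case 0xCC:	/* INT3 */
	case 0xCD:	/* INT n */
	case 0xCE:	/* INTO */
	case 0xF1:	/* INT1 */
		return true;
	case 0x0F:	/* two-byte opcode escape */
		if (len < 2)
			return false;
		/* UD2 (0F 0B), UD1 (0F B9), UD0 (0F FF) */
		return op[1] == 0x0B || op[1] == 0xB9 || op[1] == 0xFF;
	default:
		return false;
	}
}
```

A probe request would be rejected whenever such a predicate returns true
for the instruction at the target address.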
Link: https://lore.kernel.org/all/20240204031300.830475-3-jinghao7@illinois.edu/
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Jinghao Jia <jinghao7@illinois.edu>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Borislav reported that one of his systems has a broken MADT table which
advertises eight present APICs and 24 non-present APICs in the same
package.
The non-present ones are considered hot-pluggable by the topology
evaluation code, which is obviously bogus as there is no way to hot-plug
within the same package.
As the topology evaluation code accounts for hot-pluggable CPUs in a
package, the maximum number of cores per package is computed wrong, which
in turn causes the uncore performance counter driver to access non-existing
MSRs. It will probably confuse other entities which rely on the maximum
number of cores and threads per package too.
Cure this by ignoring hot-pluggable APIC IDs within a present package.
In theory it would be reasonable to just do this unconditionally, but then
there is this thing called reality^Wvirtualization which ruins
everything. Virtualization is the only existing user of "physical" hotplug
and the virtualization tools allow the above scenario. Whether that is
actually in use or not is unknown.
As it can be argued that the virtualization case is not affected by the
issues which exposed the reported problem, allow the bogosity if the kernel
determined that it is running in a VM for now.
Fixes: 89b0f15f40 ("x86/cpu/topology: Get rid of cpuinfo::x86_max_cores")
Reported-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/87a5nbvccx.ffs@tglx
According to x86 spec ([1] and [2]), MWAIT hint_address[7:4] plus 1 is
the corresponding C-state, and 0xF means C0.
ACPI C-state table usually only contains C1+, but nothing prevents ACPI
firmware from presenting a C-state (maybe C1+) but using MWAIT address C0
(i.e., 0xF in ACPI FFH MWAIT hint address). If this is the case, Linux
erroneously treats this C-state as C16, while per the specifications it
is a valid C0.
Since ACPI firmware is outside the scope of the Linux kernel, fix the
kernel to map 0xF to C0 in this situation. This was found when a tweaked
ACPI C-state table was presented by QEMU to a VM.
Also modify the intel_idle case for code consistency.
[1]. Intel SDM Vol 2, Table 4-11. MWAIT Hints
Register (EAX): "Value of 0 means C1; 1 means C2 and so on
Value of 01111B means C0".
[2]. AMD manual Vol 3, MWAIT: "The processor C-state is EAX[7:4]+1, so to
request C0 is to place the value F in EAX[7:4] and to request C1 is to
place the value 0 in EAX[7:4].".
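The mapping quoted from both manuals can be sketched as follows. This is
a standalone illustration, not the patched kernel code: the function name
mwait_hint_to_cstate() is hypothetical, and the two constants are defined
locally for the example.

```c
#include <stdint.h>

#define MWAIT_SUBSTATE_SIZE	4	/* EAX[3:0] holds the sub-state */
#define MWAIT_CSTATE_MASK	0xf	/* EAX[7:4] holds the C-state field */

/* Map an MWAIT hint (EAX value) to the C-state it requests. */
unsigned int mwait_hint_to_cstate(uint32_t hint)
{
	unsigned int field = (hint >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK;

	/* EAX[7:4] + 1 is the C-state; the final mask wraps 0xF to C0. */
	return (field + 1) & MWAIT_CSTATE_MASK;
}
```

Without the final mask, a field value of 0xF would yield 16, which is the
C16 misinterpretation described above.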
Signed-off-by: He Rongguang <herongguang@linux.alibaba.com>
[ rjw: Subject and changelog edits, whitespace fixups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some AMD processors only support CPPC V2 firmware and BIOS
implementations, and on them the amd_pstate driver fails to load at boot
with the following kernel warning:
[ 0.477523] amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
To allow the amd_pstate driver to load on those TR40 processors, match
x86_model values from 0x30 to 0x7F for family 17h.
With this change, the system loads the amd_pstate driver as expected.
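The family/model match described above amounts to a simple range check.
The sketch below only illustrates that range; the function name
fam17h_model_matches() is hypothetical and the real check lives in the
kernel's CPPC support code, which is not reproduced here.

```c
#include <stdbool.h>

/* Hypothetical predicate: family 17h models covered by the new match. */
bool fam17h_model_matches(unsigned int family, unsigned int model)
{
	return family == 0x17 && model >= 0x30 && model <= 0x7f;
}
```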
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Reported-by: Gino Badouri <badouri.g@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218171
Fixes: fbd74d1689 ("ACPI: CPPC: Fix enabling CPPC on AMD systems with shared memory")
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The secondary startup code is used on the primary boot path as well, but
in this case, the initial part runs from a 1:1 mapping, until an
explicit cross-jump is made to the kernel virtual mapping of the same
code.
On the secondary boot path, this jump is pointless as the code already
executes from the mapping targeted by the jump. So combine this
cross-jump with the jump from startup_64() into the common boot path.
This simplifies the execution flow, and clearly separates code that runs
from a 1:1 mapping from code that runs from the kernel virtual mapping.
Note that this requires a page table switch, so hoist the CR3 assignment
into startup_64() as well. And since absolute symbol references will no
longer be permitted in .head.text once we enable the associated build
time checks, a RIP-relative memory operand is used in the JMP
instruction, referring to an absolute constant in the .init.rodata
section.
Given that the secondary startup code does not require a special
placement inside the executable, move it to the .text section.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-15-ardb+git@google.com
Determining the address of the initial page table to program into CR3
involves:
- taking the physical address
- adding the SME encryption mask
On the primary entry path, the code is mapped using a 1:1 virtual to
physical translation, so the physical address can be taken directly
using a RIP-relative LEA instruction.
On the secondary entry path, the address can be obtained by taking the
offset from the virtual kernel base (__START_kernel_map) and adding the
physical kernel base.
This is implemented in a slightly confusing way, so clean this up.
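The two steps can be sketched in C as below. The actual code is assembly
in the startup path; the function names here are hypothetical, and
START_KERNEL_MAP is a local stand-in for __START_KERNEL_map using the
conventional x86-64 value.

```c
#include <stdint.h>

/* Base of the kernel's virtual mapping on x86-64. */
#define START_KERNEL_MAP	0xffffffff80000000ULL

/* Secondary path: turn a kernel virtual address into a physical one. */
uint64_t kernel_virt_to_phys(uint64_t vaddr, uint64_t phys_base)
{
	return vaddr - START_KERNEL_MAP + phys_base;
}

/* The CR3 value is the physical address plus the SME encryption mask. */
uint64_t cr3_value(uint64_t pgt_phys, uint64_t sme_mask)
{
	return pgt_phys + sme_mask;
}
```

On the primary path the first step is unnecessary, because the 1:1
mapping means the RIP-relative address already is the physical address.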
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-14-ardb+git@google.com
Assigning the 5-level paging related global variables from the earliest
C code using explicit references that use the 1:1 translation of memory
is unnecessary, as the startup code itself does not rely on them to
create the initial page tables, and this is all it should be doing. So
defer these assignments to the primary C entry code that executes via
the ordinary kernel virtual mapping.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-13-ardb+git@google.com
When paging is enabled, the CR4.PAE and CR4.LA57 control bits cannot be
changed, and so they can simply be preserved rather than reasoning about
whether or not they need to be set. CR4.MCE should be preserved unless
the kernel was built without CONFIG_X86_MCE, in which case it must be
cleared.
CR4.PSE should be set explicitly, regardless of whether or not it was
set before.
CR4.PGE is set explicitly, and then cleared and set again after
programming CR3 in order to flush TLB entries based on global
translations. This makes the first assignment redundant, so omit it:
clear PGE by leaving it out of the preserve mask, and set it again
explicitly after switching to the new page tables.
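The CR4 handling described above can be sketched with plain bit
operations. The real code is assembly; the function names and the
have_mce parameter (standing in for CONFIG_X86_MCE) are assumptions for
this example. The bit positions are the architectural ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_PSE		(1ULL << 4)
#define CR4_PAE		(1ULL << 5)
#define CR4_MCE		(1ULL << 6)
#define CR4_PGE		(1ULL << 7)
#define CR4_LA57	(1ULL << 12)

/* Value written before CR3 is programmed: PGE is deliberately dropped. */
uint64_t cr4_before_cr3_switch(uint64_t old_cr4, bool have_mce)
{
	uint64_t preserve = CR4_PAE | CR4_LA57;

	if (have_mce)
		preserve |= CR4_MCE;

	/* Keep only the preserved bits and set PSE unconditionally. */
	return (old_cr4 & preserve) | CR4_PSE;
}

/* Value written after CR3 is programmed: PGE is set again. */
uint64_t cr4_after_cr3_switch(uint64_t cr4)
{
	return cr4 | CR4_PGE;
}
```

Because PGE is absent from the preserve mask, the first write clears it,
and the write after the CR3 switch sets it again.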
[ bp: Document the exact operation of CR4.PGE ]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-12-ardb+git@google.com
The idle routine selection is done on every CPU bringup operation and
has a guard in place which is effective after the first invocation,
which is a pointless exercise.
Invoke it once on the boot CPU and mark the related functions __init.
The guard check has to stay as xen_set_default_idle() runs early.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/87edcu6vaq.ffs@tglx
Updating the static call for x86_idle() from idle_setup() is
counter-intuitive.
Let select_idle_routine() handle it like the other idle choices, which
allows simplifying the idle selection later on.
While at it, rewrite the comments and return a proper error code instead
of -1.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240229142248.455616019@linutronix.de
amd_e400_idle(), the idle routine for AMD CPUs which are affected by
erratum 400, violates the RCU constraints by invoking tick_broadcast_enter()
and tick_broadcast_exit() after the core code has marked RCU non-idle. The
functions can end up in lockdep or tracing, which rightfully triggers an
RCU warning.
The core code now provides a static branch conditional invocation of the
broadcast functions.
Remove amd_e400_idle(), enforce default_idle() and enable the static branch
on affected CPUs to cure this.
[ bp: Fold in a fix for an IS_ENABLED() check failing due to a missing
"CONFIG_" prefix, which tglx spotted. ]
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/877cim6sis.ffs@tglx
Sparse complains rightfully about the usage of EXPORT_SYMBOL_GPL() for per
CPU variables:
callthunks.c:346:20: sparse: warning: incorrect type in initializer (different address spaces)
callthunks.c:346:20: sparse: expected void const [noderef] __percpu *__vpp_verify
callthunks.c:346:20: sparse: got unsigned long long *
Use EXPORT_PER_CPU_SYMBOL_GPL() instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.841915535@linutronix.de
Sparse rightfully complains:
bugs.c:71:9: sparse: warning: incorrect type in initializer (different address spaces)
bugs.c:71:9: sparse: expected void const [noderef] __percpu *__vpp_verify
bugs.c:71:9: sparse: got unsigned long long *
The reason is that x86_spec_ctrl_current which is a per CPU variable is
exported with EXPORT_SYMBOL_GPL().
Use EXPORT_PER_CPU_SYMBOL_GPL() instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.732288812@linutronix.de
On UP builds Sparse complains rightfully about accesses to cpu_info with
per CPU accessors:
cacheinfo.c:282:30: sparse: warning: incorrect type in initializer (different address spaces)
cacheinfo.c:282:30: sparse: expected void const [noderef] __percpu *__vpp_verify
cacheinfo.c:282:30: sparse: got unsigned int *
The reason is that on UP builds cpu_info, which is a per CPU variable on SMP,
is mapped to boot_cpu_data, which is a regular variable. There is a hideous
accessor cpu_data() which tries to hide this, but it's not sufficient as
some places require raw accessors, and it generates worse code than the
regular per CPU accessors.
Waste sizeof(struct cpuinfo_x86) memory on UP and provide the per CPU
cpu_info unconditionally. This requires updating the CPU info on the boot
CPU as SMP does. (Ab)use the weakly defined smp_prepare_boot_cpu() function
and implement exactly that.
This allows using regular per CPU accessors unconditionally and paves the
way to remove the cpu_data() hackery.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.622511517@linutronix.de
There is no point in having seven architectures implementing the same empty
stub.
Provide a weak function in the init code and remove the stubs.
This also allows utilizing the function on UP, which is required to
sanitize the per CPU handling on X86 UP.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.567671691@linutronix.de
To clean up the per CPU insanity of UP which causes sparse to be rightfully
unhappy and prevents the usage of the generic per CPU accessors on cpu_info
it is necessary to include <linux/percpu.h> into <asm/msr.h>.
Including <linux/percpu.h> into <asm/msr.h> is currently impossible because
it ends up in header dependency hell. The problem is that <asm/processor.h>
includes <asm/msr.h>. The inclusion of <linux/percpu.h> results in a
compile failure where the compiler can no longer handle an include in
<asm/cpufeature.h> which references boot_cpu_data which is
defined in <asm/processor.h>.
The only reason why <asm/msr.h> is included in <asm/processor.h> is the
set/get_debugctlmsr() inlines. They are defined there because <asm/processor.h>
is such a nice dumping ground for everything. In fact they obviously belong
in <asm/debugreg.h>.
Move them to <asm/debugreg.h> and fix up the resulting damage which is just
exposing the reliance on random include chains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240304005104.454678686@linutronix.de
The HV_REGISTER_ defines are used as arguments to hv_set/get_register(), which
delegate to arch-specific mechanisms for getting/setting synthetic
Hyper-V MSRs.
On arm64, HV_REGISTER_ defines are synthetic VP registers accessed via
the get/set vp registers hypercalls. The naming matches the TLFS
document, although these register names are not specific to arm64.
However, on x86 the prefix HV_REGISTER_ indicates Hyper-V MSRs accessed
via rdmsrl()/wrmsrl(). This is not consistent with the TLFS doc, where
HV_REGISTER_ is *only* used for VP register names used by
the get/set register hypercalls.
To fix this inconsistency and prevent future confusion, change the
arch-generic aliases used by callers of hv_set/get_register() to have
the prefix HV_MSR_ instead of HV_REGISTER_.
Use the prefix HV_X64_MSR_ for the x86-only Hyper-V MSRs. On x86, the
generic HV_MSR_'s point to the corresponding HV_X64_MSR_.
Move the arm64 HV_REGISTER_* defines to the asm-generic hyperv-tlfs.h,
since these are not specific to arm64. On arm64, the generic HV_MSR_'s
point to the corresponding HV_REGISTER_.
While at it, rename hv_get/set_registers() and related functions to
hv_get/set_msr(), hv_get/set_nested_msr(), etc. These are only used for
Hyper-V MSRs and this naming makes that clear.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1708440933-27125-1-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1708440933-27125-1-git-send-email-nunodasneves@linux.microsoft.com>
SETUP_RNG_SEED in setup_data is supplied by kexec and should
not be reserved in the e820 map.
Doing so reserves 16 bytes of RAM when booting with kexec.
(16 bytes because data->len is zeroed by parse_setup_data so only
sizeof(setup_data) is reserved.)
When kexec is used repeatedly, each boot adds two entries in the
kexec-provided e820 map as the 16-byte range splits a larger
range of usable memory. Eventually all of the 128 available entries
get used up. The next split will result in losing usable memory
as the new entries cannot be added to the e820 map.
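The exhaustion described above follows from simple arithmetic: reserving
a small range inside a usable range turns one entry into three (usable,
reserved, usable), a net gain of two entries. The sketch below estimates
how many kexec boots fit before the table is full. The function name is
hypothetical, and it assumes every reservation lands strictly inside a
usable range and never merges with an earlier one.

```c
/* Number of entries available in the kexec-provided e820 map. */
#define E820_MAX_ENTRIES	128

/* Each boot splits one usable range, adding two entries to the map. */
unsigned int kexec_boots_until_full(unsigned int initial_entries)
{
	if (initial_entries >= E820_MAX_ENTRIES)
		return 0;

	return (E820_MAX_ENTRIES - initial_entries) / 2;
}
```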
Fixes: 68b8e9713c ("x86/setup: Use rng seeds from setup_data")
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/ZbmOjKnARGiaYBd5@dwarf.suse.cz
Dumping the SEV features enabled according to SEV_STATUS is useful, and
will be even more so in the future. Do so:
[ 0.542753] Memory Encryption Features active: AMD SEV SEV-ES SEV-SNP
[ 0.544425] SEV: Status: SEV SEV-ES SEV-SNP DebugSwap
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20240219094216.GAZdMieDHKiI8aaP3n@fat_crate.local
These functions are mostly pointless on UP, but nevertheless the
64-bit UP APIC build already depends on the existence of
topology_apply_cmdline_limits_early(), which caused a build bug.
Resolve it by making them available under CONFIG_X86_LOCAL_APIC,
as their prototypes already are.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The vDSO (and its initial randomization) was introduced in commit 2aae950b21
("x86_64: Add vDSO for x86-64 with gettimeofday/clock_gettime/getcpu"), but
had very low entropy. The entropy was improved in commit 394f56fe48
("x86_64, vdso: Fix the vdso address randomization algorithm"), but there
is still improvement to be made.
In principle there should not be executable code at a low entropy offset
from the stack, since the stack and executable code having separate
randomization is part of what makes ASLR stronger.
Remove the only executable code near the stack region and give the vDSO
the same randomized base as other mmap mappings including the linker
and other shared objects. This results in higher entropy being provided
and there's little to no advantage in separating this from the existing
executable code there. This is already how other architectures like
arm64 handle the vDSO.
As an aside, while it's sensible for userspace to reserve the initial mmap
base as a region for executable code with a random gap for other mmap
allocations, along with providing randomization within that region, there
isn't much the kernel can do to help due to how dynamic linkers load the
shared objects.
This was extracted from the PaX RANDMMAP feature.
[kees: updated commit log with historical details and other tweaks]
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Closes: https://github.com/KSPP/linux/issues/280
Link: https://lore.kernel.org/r/20240210091827.work.233-kees@kernel.org
Commit 344da544f1 ("x86/nmi: Print reasons why backtrace NMIs are
ignored") creates a super nice framework to diagnose NMIs.
Every time exc_nmi() is called, it increments a per-CPU counter
(nsp->idt_nmi_seq). At its exit, it also increments the same counter. By
reading this counter it can be seen how many times that function was called
(dividing by 2), and, if the function is still being executed, by checking
the idt_nmi_seq's least significant bit.
On the check side (nmi_backtrace_stall_check()), that variable is queried
to check if the NMI is still being executed, but, there is a mistake in the
bitwise operation. That code wants to check if the least significant bit of
the idt_nmi_seq is set or not, but does the opposite, and checks for all
the other bits, which will always be true after the first exc_nmi()
executed successfully.
This appends the misleading string "(CPU currently in NMI handler
function)" to the dump.
Fix it by checking the least significant bit, and if it is set, append the
string.
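The counter scheme and the difference between the intended and the
mistaken bit test can be sketched as below. This is a standalone model,
not the kernel code: the struct and function names are hypothetical, and
the buggy variant is reconstructed from the description above (testing
all bits except the least significant one).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU NMI statistics. */
struct nmi_seq_state {
	unsigned long idt_nmi_seq;
};

/* The handler bumps the counter once on entry and once on exit. */
void nmi_handler_enter(struct nmi_seq_state *s) { s->idt_nmi_seq++; }
void nmi_handler_exit(struct nmi_seq_state *s) { s->idt_nmi_seq++; }

/* Intended check: an odd counter means the handler is still running. */
bool nmi_in_progress(const struct nmi_seq_state *s)
{
	return (s->idt_nmi_seq & 0x1UL) != 0;
}

/* Mistaken check: true as soon as one NMI has completed. */
bool nmi_in_progress_buggy(const struct nmi_seq_state *s)
{
	return (s->idt_nmi_seq & ~0x1UL) != 0;
}

/* Number of handler invocations that have completed. */
unsigned long nmi_completed(const struct nmi_seq_state *s)
{
	return s->idt_nmi_seq / 2;
}
```

After one complete NMI the counter is 2, so the intended check reports
that no handler is running while the mistaken one reports the opposite.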
Fixes: 344da544f1 ("x86/nmi: Print reasons why backtrace NMIs are ignored")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240207165237.1048837-1-leitao@debian.org
MKTME repurposes the high bits of the physical address as the key ID for
the encryption key and, even though MAXPHYADDR in CPUID[0x80000008] remains
the same, the valid bits in the MTRR mask register are based on the reduced
number of physical address bits.
detect_tme() in arch/x86/kernel/cpu/intel.c detects TME and subtracts
the key ID bits from the total usable physical bits, but it is called too
late. Move the call to early_init_intel() so that it is called in
setup_arch(), before MTRRs are set up.
This fixes boot on TDX-enabled systems, which until now only worked with
"disable_mtrr_cleanup". Without the patch, the values written to the
MTRR mask registers were 52 bits wide (e.g. 0x000fffff_80000800) and
the writes failed; with the patch, the values are 46 bits wide, which
matches the reduced MAXPHYADDR that is shown in /proc/cpuinfo.
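The effect of the physical address width on the mask value can be
sketched as follows. This is an illustration only: the function names are
hypothetical, the 2 GiB range size is chosen so that the 52-bit result
reproduces the example value above, and the key ID width of 6 bits is an
assumption that merely fits the 52 to 46 bit reduction mentioned here.

```c
#include <stdint.h>

/* Bit 11 of an MTRR mask register marks the range as valid. */
#define PHYSMASK_VALID	(1ULL << 11)

/* Usable address bits once the key ID bits are taken away. */
unsigned int usable_phys_bits(unsigned int maxphyaddr, unsigned int keyid_bits)
{
	return maxphyaddr - keyid_bits;
}

/*
 * Mask register value for a power-of-two sized range.
 * 'phys_bits' must be below 64.
 */
uint64_t mtrr_physmask(uint64_t size, unsigned int phys_bits)
{
	uint64_t addr_mask = (1ULL << phys_bits) - 1;

	return (~(size - 1) & addr_mask) | PHYSMASK_VALID;
}
```

With 52 address bits the mask sets bits that the hardware treats as
reserved once key ID bits are carved out, which is why the write fails.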
Reported-by: Zixi Chen <zixchen@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240131230902.1867092-3-pbonzini%40redhat.com
In commit fbf6449f84 ("x86/sev-es: Set x86_virt_bits to the correct
value straight away, instead of a two-phase approach"), the initialization
of c->x86_phys_bits was moved after this_cpu->c_early_init(c). This is
incorrect because early_init_amd() expects to be able to reduce the
value according to the contents of CPUID leaf 0x8000001f.
Fortunately, the bug was negated by init_amd()'s call to early_init_amd(),
which does reduce x86_phys_bits in the end. However, this is very
late in the boot process and, most notably, the wrong value is used for
x86_phys_bits when setting up MTRRs.
To fix this, call get_cpu_address_sizes() as soon as X86_FEATURE_CPUID is
set/cleared, and c->extended_cpuid_level is retrieved.
Fixes: fbf6449f84 ("x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240131230902.1867092-2-pbonzini%40redhat.com
early_top_pgt[] is assigned from code that executes from a 1:1 mapping
so it cannot use a plain access from C. Replace the use of
fixup_pointer() with RIP_REL_REF(), which is better and simpler.
For legibility and to align with the code that populates the lower page
table levels, statically initialize the root level page table with an
entry pointing to level3_kernel_pgt[], and overwrite it when needed to
enable 5-level paging.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-24-ardb+git@google.com
'phys_base' is assigned from code that executes from a 1:1 mapping so it
cannot use a plain access from C. Replace the use of fixup_pointer()
with RIP_REL_REF(), which is better and simpler.
While at it, move the assignment to before the addition of the SME mask
so there is no need to subtract it again, and drop the unnecessary
addition ('phys_base' is statically initialized to 0x0).
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-20-ardb+git@google.com
There are two code paths in the startup code to program an IDT: one that
runs from the 1:1 mapping and one that runs from the virtual kernel
mapping. Currently, these are strictly separate because fixup_pointer()
is used on the 1:1 path, which will produce the wrong value when used
while executing from the virtual kernel mapping.
Switch to RIP_REL_REF() so that the two code paths can be merged. Also,
move the GDT and IDT descriptors to the stack so that they can be
referenced directly, rather than via RIP_REL_REF().
Rename startup_64_setup_env() to startup_64_setup_gdt_idt() while at it,
to make the call from assembler self-documenting.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240221113506.2565718-19-ardb+git@google.com
We are going to queue up a number of patches that depend
on fresh changes in x86/sev - merge in that branch to
reduce the number of conflicts going forward.
Also resolve a current conflict with x86/sev.
Conflicts:
arch/x86/include/asm/coco.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The recent restriction to invoke irqdomain_ops::select() only when the
domain bus token is not DOMAIN_BUS_ANY breaks the search for the parent MSI
domain of HPET and IO-APIC. The latter causes a full boot failure.
The restriction itself makes sense to avoid adding DOMAIN_BUS_ANY matches
into the various ARM specific select() callbacks. Reverting this change
would obviously break ARM platforms again and require DOMAIN_BUS_ANY
matches to be added in various places.
A simpler solution is to use the DOMAIN_BUS_GENERIC_MSI token for the HPET
and IO-APIC parent domain search. This works out of the box because the
affected parent domains check only for the firmware specification content
and not for the bus token.
Fixes: 5aa3c0cf5b ("genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens")
Reported-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/878r38cy8n.ffs@tglx
Add a VMX flag in /proc/cpuinfo, ept_5level, so that userspace can query
whether or not the CPU supports 5-level EPT paging. EPT capabilities are
enumerated via MSR, i.e. aren't accessible to userspace without help from
the kernel, and knowing whether or not 5-level EPT is supported is useful
for debug, triage, testing, etc.
For example, when EPT is enabled, bits 51:48 of guest physical addresses
are consumed by the CPU if and only if 5-level EPT is enabled. For CPUs
with MAXPHYADDR > 48, KVM *can't* map all legal guest memory without
5-level EPT, making 5-level EPT support valuable information for userspace.
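The relationship between MAXPHYADDR and the EPT level can be sketched as
below. The function names are hypothetical and the code only models the
48-bit boundary stated above; it does not query any MSR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Guest physical address bits covered by 4-level EPT. */
#define EPT_4LEVEL_GPA_BITS	48

/* True if the GPA has any of bits 51:48 (or higher) set. */
bool gpa_needs_5level_ept(uint64_t gpa)
{
	return (gpa >> EPT_4LEVEL_GPA_BITS) != 0;
}

/* True if all legal guest physical addresses can be mapped. */
bool can_map_all_guest_memory(unsigned int maxphyaddr, bool ept_5level)
{
	return maxphyaddr <= EPT_4LEVEL_GPA_BITS || ept_5level;
}
```

Userspace can make the second decision once the ept_5level flag is
visible in /proc/cpuinfo.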
Reported-by: Yi Lai <yi1.lai@intel.com>
Cc: Tao Su <tao1.su@linux.intel.com>
Cc: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20240110002340.485595-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
A relocatable kernel will relocate itself to pref_address if it is
loaded below pref_address. This means a booted kernel may be relocating
itself to an area with reserved memory on modern systems, potentially
clobbering arbitrary data that may be important to the system.
This is often the case, as the default value of PHYSICAL_START is
0x1000000 and kernels are typically loaded at 0x100000 or above by
bootloaders like iPXE or kexec. GRUB behaves like the approach
implemented here.
Also fix the documentation around pref_address and PHYSICAL_START to
be accurate.
[ dhansen: changelog tweak ]
Co-developed-by: Cloud Hsu <cloudhsu@google.com>
Signed-off-by: Cloud Hsu <cloudhsu@google.com>
Signed-off-by: Chris Koch <chrisko@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: https://lore.kernel.org/all/20231215190521.3796022-1-chrisko%40google.com
get_domain_from_cpu() walks a list of domains to find the one that
contains the specified CPU. This needs to be protected against races
with CPU hotplug when the list is modified. It has recently gained
a lockdep annotation to check this.
The lockdep annotation causes false positives when called via IPI as the
lock is held, but by another process. Remove it.
[ bp: Refresh it on top of x86/cache. ]
Fixes: fb700810d3 ("x86/resctrl: Separate arch and fs resctrl locks")
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/ZdUSwOM9UUNpw84Y@agluck-desk3