Commit Graph

19757 Commits

Author SHA1 Message Date
Saurabh Sengar
85900d0618 x86/of: Map NUMA node to CPUs as per DeviceTree
Currently for DeviceTree bootup, x86 code does the default mapping of
CPUs to NUMA, which is wrong. This can cause incorrect mapping and WARNs
on SMT enabled systems:

  CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
  WARNING: CPU: 1 PID: 0 at topology_sane.isra.0+0x5c/0x6d
  match_smt+0xf6/0xfc
  set_cpu_sibling_map.cold+0x24f/0x512
  start_secondary+0x5c/0x110

Call the set_apicid_to_node() function in dtb_cpu_setup() for setting
the NUMA to CPU mapping for DeviceTree platforms.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/1712068830-4513-4-git-send-email-ssengar@linux.microsoft.com
2024-04-03 08:49:15 +02:00
Saurabh Sengar
222408cde4 x86/of: Set the parse_smp_cfg for all the DeviceTree platforms by default
x86_dtb_parse_smp_config() must be set by DeviceTree platform for
parsing SMP configuration. Set the parse_smp_cfg pointer to
x86_dtb_parse_smp_config() by default so that all the dtb platforms
need not to assign it explicitly. Today there are only two platforms
using DeviceTree in x86, ce4100 and hv_vtl. Remove the explicit
assignment of x86_dtb_parse_smp_config() function from these.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/1712068830-4513-3-git-send-email-ssengar@linux.microsoft.com
2024-04-03 08:46:08 +02:00
Paolo Bonzini
52b761b48f Merge tag 'kvmarm-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.9, part #1

 - Ensure perf events programmed to count during guest execution
   are actually enabled before entering the guest in the nVHE
   configuration.

 - Restore out-of-range handler for stage-2 translation faults.

 - Several fixes to stage-2 TLB invalidations to avoid stale
   translations, possibly including partial walk caches.

 - Fix early handling of architectural VHE-only systems to ensure E2H is
   appropriately set.

 - Correct a format specifier warning in the arch_timer selftest.

 - Make the KVM banner message correctly handle all of the possible
   configurations.
2024-04-02 12:26:15 -04:00
Joan Bruguera Micó
6a53745300 x86/bpf: Fix IP for relocating call depth accounting
The commit:

  59bec00ace ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")

made PER_CPU_VAR() to use rip-relative addressing, hence
INCREMENT_CALL_DEPTH macro and skl_call_thunk_template got rip-relative
asm code inside of it. A follow up commit:

  17bce3b2ae ("x86/callthunks: Handle %rip-relative relocations in call thunk template")

changed x86_call_depth_emit_accounting() to use apply_relocation(),
but mistakenly assumed that the code is being patched in-place (where
the destination of the relocation matches the address of the code),
using *pprog as the destination ip. This is not true for the call depth
accounting, emitted by the BPF JIT, so the calculated address was wrong,
JIT-ed BPF progs on kernels with call depth tracking got broken and
usually caused a page fault.

Pass the destination IP when the BPF JIT emits call depth accounting.

Fixes: 17bce3b2ae ("x86/callthunks: Handle %rip-relative relocations in call thunk template")
Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240401185821.224068-3-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-01 20:37:56 -07:00
Linus Torvalds
448f828feb Merge tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fixes from Borislav Petkov:

 - Define the correct set of default hw events on AMD Zen4

 - Use the correct stalled cycles PMCs on AMD Zen2 and newer

 - Fix detection of the LBR freeze feature on AMD

* tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd/core: Define a proper ref-cycles event for Zen 4 and later
  perf/x86/amd/core: Update and fix stalled-cycles-* events for Zen 2 and later
  perf/x86/amd/lbr: Use freeze based on availability
  x86/cpufeatures: Add new word for scattered features
2024-03-31 10:43:11 -07:00
Julian Stecklina
4faa0e5d6d x86/boot: Move kernel cmdline setup earlier in the boot process (again)
When split_lock_detect=off (or similar) is specified in
CONFIG_CMDLINE, its effect is lost. The flow is currently this:

	setup_arch():
	  -> early_cpu_init()
	    -> early_identify_cpu()
	      -> sld_setup()
		-> sld_state_setup()
		  -> Looks for split_lock_detect in boot_command_line

	  -> e820__memory_setup()

	  -> Assemble final command line:
	     boot_command_line = builtin_cmdline + boot_cmdline

	  -> parse_early_param()

There were earlier attempts at fixing this in:

  8d48bf8206 ("x86/boot: Pull up cmdline preparation and early param parsing")

later reverted in:

  fbe6183998 ("Revert "x86/boot: Pull up cmdline preparation and early param parsing"")

... because parse_early_param() can't be called before
e820__memory_setup().

In this patch, we just move the command line concatenation to the
beginning of early_cpu_init(). This should fix sld_state_setup(), while
not running in the same issues as the earlier attempt.

The order is now:

	setup_arch():
	  -> Assemble final command line:
	     boot_command_line = builtin_cmdline + boot_cmdline

	  -> early_cpu_init()
	    -> early_identify_cpu()
	      -> sld_setup()
		-> sld_state_setup()
		  -> Looks for split_lock_detect in boot_command_line

	  -> e820__memory_setup()

	  -> parse_early_param()

Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-kernel@vger.kernel.org
2024-03-29 08:19:12 +01:00
Alex Shi
f9f62a877d x86/dumpstack: Use uniform "Oops: " prefix for die() messages
panic() prints a uniform prompt: "Kernel panic - not syncing:",
but die() messages don't have any of that, the message is the
raw user-defined message with no prefix.

There's companies that collect thousands of die() messages per week,
but w/o a prompt in dmesg, it's hard to write scripts to collect and
analize the reasons.

Add a uniform "Oops:" prefix like other architectures.

[ mingo: Rewrote changelog. ]

Signed-off-by: Alex Shi <alexs@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240327024419.471433-1-alexs@kernel.org
2024-03-27 08:45:19 +01:00
Kevin Loughlin
0f4a1e8098 x86/sev: Skip ROM range scans and validation for SEV-SNP guests
SEV-SNP requires encrypted memory to be validated before access.
Because the ROM memory range is not part of the e820 table, it is not
pre-validated by the BIOS. Therefore, if a SEV-SNP guest kernel wishes
to access this range, the guest must first validate the range.

The current SEV-SNP code does indeed scan the ROM range during early
boot and thus attempts to validate the ROM range in probe_roms().
However, this behavior is neither sufficient nor necessary for the
following reasons:

* With regards to sufficiency, if EFI_CONFIG_TABLES are not enabled and
  CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK is set, the kernel will
  attempt to access the memory at SMBIOS_ENTRY_POINT_SCAN_START (which
  falls in the ROM range) prior to validation.

  For example, Project Oak Stage 0 provides a minimal guest firmware
  that currently meets these configuration conditions, meaning guests
  booting atop Oak Stage 0 firmware encounter a problematic call chain
  during dmi_setup() -> dmi_scan_machine() that results in a crash
  during boot if SEV-SNP is enabled.

* With regards to necessity, SEV-SNP guests generally read garbage
  (which changes across boots) from the ROM range, meaning these scans
  are unnecessary. The guest reads garbage because the legacy ROM range
  is unencrypted data but is accessed via an encrypted PMD during early
  boot (where the PMD is marked as encrypted due to potentially mapping
  actually-encrypted data in other PMD-contained ranges).

In one exceptional case, EISA probing treats the ROM range as
unencrypted data, which is inconsistent with other probing.

Continuing to allow SEV-SNP guests to use garbage and to inconsistently
classify ROM range encryption status can trigger undesirable behavior.
For instance, if garbage bytes appear to be a valid signature, memory
may be unnecessarily reserved for the ROM range. Future code or other
use cases may result in more problematic (arbitrary) behavior that
should be avoided.

While one solution would be to overhaul the early PMD mapping to always
treat the ROM region of the PMD as unencrypted, SEV-SNP guests do not
currently rely on data from the ROM region during early boot (and even
if they did, they would be mostly relying on garbage data anyways).

As a simpler solution, skip the ROM range scans (and the otherwise-
necessary range validation) during SEV-SNP guest early boot. The
potential SEV-SNP guest crash due to lack of ROM range validation is
thus avoided by simply not accessing the ROM range.

In most cases, skip the scans by overriding problematic x86_init
functions during sme_early_init() to SNP-safe variants, which can be
likened to x86_init overrides done for other platforms (ex: Xen); such
overrides also avoid the spread of cc_platform_has() checks throughout
the tree.

In the exceptional EISA case, still use cc_platform_has() for the
simplest change, given (1) checks for guest type (ex: Xen domain status)
are already performed here, and (2) these checks occur in a subsys
initcall instead of an x86_init function.

  [ bp: Massage commit message, remove "we"s. ]

Fixes: 9704c07bf9 ("x86/kernel: Validate ROM memory before accessing when SEV-SNP is active")
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20240313121546.2964854-1-kevinloughlin@google.com
2024-03-26 15:22:35 +01:00
Tony Luck
108c6494bd x86/mce: Dynamically size space for machine check records
Systems with a large number of CPUs may generate a large number of
machine check records when things go seriously wrong. But Linux has
a fixed-size buffer that can only capture a few dozen errors.

Allocate space based on the number of CPUs (with a minimum value based
on the historical fixed buffer that could store 80 records).

  [ bp: Rename local var from tmpp to something more telling: gpool. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>
Link: https://lore.kernel.org/r/20240307192704.37213-1-tony.luck@intel.com
2024-03-26 12:40:42 +01:00
Paul E. McKenney
3186b61812 x86/nmi: Upgrade NMI backtrace stall checks & messages
The commit to improve NMI stall debuggability:

  344da544f1 ("x86/nmi: Print reasons why backtrace NMIs are ignored")

... has shown value, but widespread use has also identified a few
opportunities for improvement.

The systems have (as usual) shown far more creativity than that commit's
author, demonstrating yet again that failing CPUs can do whatever they want.

In addition, the current message format is less friendly than one might
like to those attempting to use these messages to identify failing CPUs.

Therefore, separately flag CPUs that, during the full time that the
stack-backtrace request was waiting, were always in an NMI handler,
were never in an NMI handler, or exited one NMI handler.

Also, split the message identifying the CPU and the time since that CPU's
last NMI-related activity so that a single line identifies the CPU without
any other variable information, greatly reducing the processing overhead
required to identify repeat-offender CPUs.

Co-developed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/ab4d70c8-c874-42dc-b206-643018922393@paulmck-laptop
2024-03-26 10:07:59 +01:00
Bingsong Si
cd2236c2f4 x86/cpu: Clear TME feature flag if TME is not enabled by BIOS
When TME is disabled by BIOS, the dmesg output is:

  x86/tme: not enabled by BIOS

... and TME functionality is not enabled by the kernel, but the TME feature
is still shown in /proc/cpuinfo.

Clear it.

[ mingo: Clarified changelog ]

Signed-off-by: Bingsong Si <sibs@chinatelecom.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "Huang, Kai" <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240311071938.13247-1-sibs@chinatelecom.cn
2024-03-26 09:49:32 +01:00
Yuntao Wang
c3262d3d19 x86/head: Simplify relative include path to xen-head.S
Fix the relative path specification in the include directives adding
xen-head.S to the kernel's head_*.S files since they both have
"arch/x86/" as prefix.

  [ bp: Rewrite commit message. ]

Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231231121904.24622-1-ytcoode@gmail.com
2024-03-25 15:24:10 +01:00
Borislav Petkov (AMD)
29ba89f189 x86/CPU/AMD: Improve the erratum 1386 workaround
Disable XSAVES only on machines which haven't loaded the microcode
revision containing the erratum fix.

This will come in handy when running archaic OSes as guests. OSes whose
brilliant programmers thought that CPUID is overrated and one should not
query it but use features directly, ala shoot first, ask questions
later... but only if you're alive after the shooting.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20240324200525.GBZgCHhYFsBj12PrKv@fat_crate.local
2024-03-25 14:05:24 +01:00
Sandipan Das
598c2fafc0 perf/x86/amd/lbr: Use freeze based on availability
Currently, the LBR code assumes that LBR Freeze is supported on all processors
when X86_FEATURE_AMD_LBR_V2 is available i.e. CPUID leaf 0x80000022[EAX]
bit 1 is set. This is incorrect as the availability of the feature is
additionally dependent on CPUID leaf 0x80000022[EAX] bit 2 being set,
which may not be set for all Zen 4 processors.

Define a new feature bit for LBR and PMC freeze and set the freeze enable bit
(FLBRI) in DebugCtl (MSR 0x1d9) conditionally.

It should still be possible to use LBR without freeze for profile-guided
optimization of user programs by using an user-only branch filter during
profiling. When the user-only filter is enabled, branches are no longer
recorded after the transition to CPL 0 upon PMI arrival. When branch
entries are read in the PMI handler, the branch stack does not change.

E.g.

  $ perf record -j any,u -e ex_ret_brn_tkn ./workload

Since the feature bit is visible under flags in /proc/cpuinfo, it can be
used to determine the feasibility of use-cases which require LBR Freeze
to be supported by the hardware such as profile-guided optimization of
kernels.

Fixes: ca5b7c0d96 ("perf/x86/amd/lbr: Add LbrExtV2 branch record support")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/69a453c97cfd11c6f2584b19f937fe6df741510f.1711091584.git.sandipan.das@amd.com
2024-03-25 11:16:55 +01:00
Linus Torvalds
5e74df2f8f Merge tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:

 - Ensure that the encryption mask at boot is properly propagated on
   5-level page tables, otherwise the PGD entry is incorrectly set to
   non-encrypted, which causes system crashes during boot.

 - Undo the deferred 5-level page table setup as it cannot work with
   memory encryption enabled.

 - Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset
   to the default value but the cached variable is not, so subsequent
   comparisons might yield the wrong result and as a consequence the
   result prevents updating the MSR.

 - Register the local APIC address only once in the MPPARSE enumeration
   to prevent triggering the related WARN_ONs() in the APIC and topology
   code.

 - Handle the case where no APIC is found gracefully by registering a
   fake APIC in the topology code. That makes all related topology
   functions work correctly and does not affect the actual APIC driver
   code at all.

 - Don't evaluate logical IDs during early boot as the local APIC IDs
   are not yet enumerated and the invoked function returns an error
   code. Nothing requires the logical IDs before the final CPUID
   enumeration takes place, which happens after the enumeration.

 - Cure the fallout of the per CPU rework on UP which misplaced the
   copying of boot_cpu_data to per CPU data so that the final update to
   boot_cpu_data got lost which caused inconsistent state and boot
   crashes.

 - Use copy_from_kernel_nofault() in the kprobes setup as there is no
   guarantee that the address can be safely accessed.

 - Reorder struct members in struct saved_context to work around another
   kmemleak false positive

 - Remove the buggy code which tries to update the E820 kexec table for
   setup_data as that is never passed to the kexec kernel.

 - Update the resource control documentation to use the proper units.

 - Fix a Kconfig warning observed with tinyconfig

* tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/64: Move 5-level paging global variable assignments back
  x86/boot/64: Apply encryption mask to 5-level pagetable update
  x86/cpu: Add model number for another Intel Arrow Lake mobile processor
  x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
  Documentation/x86: Document that resctrl bandwidth control units are MiB
  x86/mpparse: Register APIC address only once
  x86/topology: Handle the !APIC case gracefully
  x86/topology: Don't evaluate logical IDs during early boot
  x86/cpu: Ensure that CPU info updates are propagated on UP
  kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
  x86/pm: Work around false positive kmemleak report in msr_build_context()
  x86/kexec: Do not update E820 kexec table for setup_data
  x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
2024-03-24 11:13:56 -07:00
Tom Lendacky
9843231c97 x86/boot/64: Move 5-level paging global variable assignments back
Commit 63bed96604 ("x86/startup_64: Defer assignment of 5-level paging
global variables") moved assignment of 5-level global variables to later
in the boot in order to avoid having to use RIP relative addressing in
order to set them. However, when running with 5-level paging and SME
active (mem_encrypt=on), the variables are needed as part of the page
table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(),
etc.). Since the variables haven't been set, the page table manipulation
is done as if 4-level paging is active, causing the system to crash on
boot.

While only a subset of the assignments that were moved need to be set
early, move all of the assignments back into check_la57_support() so that
these assignments aren't spread between two locations. Instead of just
reverting the fix, this uses the new RIP_REL_REF() macro when assigning
the variables.

Fixes: 63bed96604 ("x86/startup_64: Defer assignment of 5-level paging global variables")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/2ca419f4d0de719926fd82353f6751f717590a86.1711122067.git.thomas.lendacky@amd.com
2024-03-24 05:00:36 +01:00
Tom Lendacky
4d0d7e7852 x86/boot/64: Apply encryption mask to 5-level pagetable update
When running with 5-level page tables, the kernel mapping PGD entry is
updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC,
which, when SME is active (mem_encrypt=on), results in a page table
entry without the encryption mask set, causing the system to crash on
boot.

Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so
that the encryption mask is set for the PGD entry.

Fixes: 533568e06b ("x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/8f20345cda7dbba2cf748b286e1bc00816fe649a.1711122067.git.thomas.lendacky@amd.com
2024-03-24 05:00:35 +01:00
Adamos Ttofari
10e4b5166d x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
Commit 672365477a ("x86/fpu: Update XFD state where required") and
commit 8bf26758ca ("x86/fpu: Add XFD state to fpstate") introduced a
per CPU variable xfd_state to keep the MSR_IA32_XFD value cached, in
order to avoid unnecessary writes to the MSR.

On CPU hotplug MSR_IA32_XFD is reset to the init_fpstate.xfd, which
wipes out any stale state. But the per CPU cached xfd value is not
reset, which brings them out of sync.

As a consequence a subsequent xfd_update_state() might fail to update
the MSR which in turn can result in XRSTOR raising a #NM in kernel
space, which crashes the kernel.

To fix this, introduce xfd_set_state() to write xfd_state together
with MSR_IA32_XFD, and use it in all places that set MSR_IA32_XFD.

Fixes: 672365477a ("x86/fpu: Update XFD state where required")
Signed-off-by: Adamos Ttofari <attofari@amazon.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240322230439.456571-1-chang.seok.bae@intel.com

Closes: https://lore.kernel.org/lkml/20230511152818.13839-1-attofari@amazon.de
2024-03-24 04:03:54 +01:00
Thomas Gleixner
f2208aa12c x86/mpparse: Register APIC address only once
The APIC address is registered twice. First during the early detection and
afterwards when actually scanning the table for APIC IDs. The APIC and
topology core warn about the second attempt.

Restrict it to the early detection call.

Fixes: 81287ad65d ("x86/apic: Sanitize APIC address setup")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20240322185305.297774848@linutronix.de
2024-03-23 12:41:48 +01:00
Thomas Gleixner
5e25eb25da x86/topology: Handle the !APIC case gracefully
If there is no local APIC enumerated and registered then the topology
bitmaps are empty. Therefore, topology_init_possible_cpus() will die with
a division by zero exception.

Prevent this by registering a fake APIC id to populate the topology
bitmap. This also allows to use all topology query interfaces
unconditionally. It does not affect the actual APIC code because either
the local APIC address was not registered or no local APIC could be
detected.

Fixes: f1f758a805 ("x86/topology: Add a mechanism to track topology via APIC IDs")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20240322185305.242709302@linutronix.de
2024-03-23 12:35:56 +01:00
Thomas Gleixner
7af541cee1 x86/topology: Don't evaluate logical IDs during early boot
The local APICs have not yet been enumerated so the logical ID evaluation
from the topology bitmaps does not work and would return an error code.

Skip the evaluation during the early boot CPUID evaluation and only apply
it on the final run.

Fixes: 380414be78 ("x86/cpu/topology: Use topology logical mapping mechanism")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20240322185305.186943142@linutronix.de
2024-03-23 12:28:06 +01:00
Thomas Gleixner
c90399fbd7 x86/cpu: Ensure that CPU info updates are propagated on UP
The boot sequence evaluates CPUID information twice:

  1) During early boot

  2) When finalizing the early setup right before
     mitigations are selected and alternatives are patched.

In both cases the evaluation is stored in boot_cpu_data, but on UP the
copying of boot_cpu_data to the per CPU info of the boot CPU happens
between #1 and #2. So any update which happens in #2 is never propagated to
the per CPU info instance.

Consolidate the whole logic and copy boot_cpu_data right before applying
alternatives as that's the point where boot_cpu_data is in it's final
state and not supposed to change anymore.

This also removes the voodoo mb() from smp_prepare_cpus_common() which
had absolutely no purpose.

Fixes: 71eb4893cf ("x86/percpu: Cure per CPU madness on UP")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20240322185305.127642785@linutronix.de
2024-03-23 12:22:04 +01:00
Masami Hiramatsu (Google)
4e51653d5d kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
Read from an unsafe address with copy_from_kernel_nofault() in
arch_adjust_kprobe_addr() because this function is used before checking
the address is in text or not. Syzcaller bot found a bug and reported
the case if user specifies inaccessible data area,
arch_adjust_kprobe_addr() will cause a kernel panic.

[ mingo: Clarified the comment. ]

Fixes: cc66bb9145 ("x86/ibt,kprobes: Cure sym+0 equals fentry woes")
Reported-by: Qiang Zhang <zzqq0103.hey@gmail.com>
Tested-by: Jinghao Jia <jinghao7@illinois.edu>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/171042945004.154897.2221804961882915806.stgit@devnote2
2024-03-22 11:40:56 +01:00
Thomas Gleixner
63edbaa48a x86/cpu/topology: Add support for the AMD 0x80000026 leaf
On AMD processors that support extended CPUID leaf 0x80000026, use the
extended leaf to parse the topology information. In case of a failure,
fall back to parsing the information from CPUID leaf 0xb.

CPUID leaf 0x80000026 exposes the "CCX" and "CCD (Die)" information on
AMD processors which have been mapped to TOPO_TILE_DOMAIN and
TOPO_DIE_DOMAIN respectively.

Since this information was previously not available via CPUID leaf 0xb
or 0x8000001e, the "die_id", "logical_die_id", "max_die_per_pkg",
"die_cpus", and "die_cpus_list" will differ with this addition on
AMD processors that support extended CPUID leaf 0x80000026 and contain
more than one "CCD (Die)" on the package.

For example, following are the changes in the values reported by
"/sys/kernel/debug/x86/topo/cpus/16" after applying this patch on a 4th
Generation AMD EPYC System (1 x 128C/256T):

  (CPU16 is the first CPU of the second CCD on the package)

		   tip:x86/apic      tip:x86/apic
				     + this patch

  online:              1                  1
  initial_apicid:      80                 80
  apicid:              80                 80
  pkg_id:              0                  0
  die_id:              0                  4       *
  cu_id:               255                255
  core_id:             64                 64
  logical_pkg_id:      0                  0
  logical_die_id:      0                  4       *
  llc_id:              8                  8
  l2c_id:              65535              65535
  amd_node_id:         0                  0
  amd_nodes_per_pkg:   1                  1
  num_threads:         256                256
  num_cores:           128                128
  max_dies_per_pkg:    1                  8       *
  max_threads_per_core:2                  2

[ prateek: commit log, updated comment in topoext_amd.c, changed has_0xb
  to has_topoext, rebased the changes on tip:x86/apic, tested the
  changes on 4th Gen AMD EPYC system ]

[ mingo: tidy up the changelog a bit more ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240314050432.1710-1-kprateek.nayak@amd.com
2024-03-22 11:22:14 +01:00
Valentin Schneider
79a4567b2e x86/tsc: Make __use_tsc __ro_after_init
__use_tsc is only ever enabled in __init tsc_enable_sched_clock(), so mark
it as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240313180106.2917308-5-vschneid@redhat.com
2024-03-22 11:18:20 +01:00
Valentin Schneider
ddd8afacc4 x86/kvm: Make kvm_async_pf_enabled __ro_after_init
kvm_async_pf_enabled is only ever enabled in __init kvm_guest_init(), so
mark it as __ro_after_init.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240313180106.2917308-4-vschneid@redhat.com
2024-03-22 11:18:19 +01:00
H.J. Lu
2883f01ec3 x86/shstk: Enable shadow stacks for x32
1. Add shadow stack support to x32 signal.
2. Use the 64-bit map_shadow_stack syscall for x32.
3. Set up shadow stack for x32.

Tested with shadow stack enabled x32 glibc on Intel Tiger Lake:

I configured x32 glibc with --enable-cet, build glibc and
run all glibc tests with shadow stack enabled.  There are
no regressions.  I verified that shadow stack is enabled
via /proc/pid/status.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: H.J. Lu <hjl.tools@gmail.com>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20240315140433.1966543-1-hjl.tools@gmail.com
2024-03-22 10:17:11 +01:00
Dave Young
fc7f27cda8 x86/kexec: Do not update E820 kexec table for setup_data
crashkernel reservation failed on a Thinkpad t440s laptop recently.
Actually the memblock reservation succeeded, but later insert_resource()
failed.

Test steps:
  kexec load -> /* make sure add crashkernel param eg. crashkernel=160M */
    kexec reboot ->
        dmesg|grep "crashkernel reserved";
            crashkernel memory range like below reserved successfully:
              0x00000000d0000000 - 0x00000000da000000
        But no such "Crash kernel" region in /proc/iomem

The background story:

Currently the E820 code reserves setup_data regions for both the current
kernel and the kexec kernel, and it inserts them into the resources list.

Before the kexec kernel reboots nobody passes the old setup_data, and
kexec only passes fresh SETUP_EFI/SETUP_IMA/SETUP_RNG_SEED if needed.
Thus the old setup data memory is not used at all.

Due to old kernel updates the kexec e820 table as well so kexec kernel
sees them as E820_TYPE_RESERVED_KERN regions, and later the old setup_data
regions are inserted into resources list in the kexec kernel by
e820__reserve_resources().

Note, due to no setup_data is passed in for those old regions they are not
early reserved (by function early_reserve_memory), and the crashkernel
memblock reservation will just treat them as usable memory and it could
reserve the crashkernel region which overlaps with the old setup_data
regions. And just like the bug I noticed here, kdump insert_resource
failed because e820__reserve_resources has added the overlapped chunks
in /proc/iomem already.

Finally, looking at the code, the old setup_data regions are not used
at all as no setup_data is passed in by the kexec boot loader. Although
something like SETUP_PCI etc could be needed, kexec should pass
the info as new setup_data so that kexec kernel can take care of them.
This should be taken care of in other separate patches if needed.

Thus drop the useless buggy code here.

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Eric DeVolder <eric.devolder@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/Zf0T3HCG-790K-pZ@darkstar.users.ipa.redhat.com
2024-03-22 10:07:45 +01:00
Brian Gerst
e2d168328e x86/syscall/compat: Remove ia32_unistd.h
This header is now just a wrapper for unistd_32_ia32.h.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240321211847.132473-3-brgerst@gmail.com
2024-03-22 09:37:09 +01:00
Xin Li (Intel)
8f69cba096 x86: Rename __{start,end}_init_task to __{start,end}_init_stack
The stack of a task has been separated from the memory of a task_struct
struture for a long time on x86, as a result __{start,end}_init_task no
longer mark the start and end of the init_task structure, but its stack
only.

Rename __{start,end}_init_task to __{start,end}_init_stack.

Note other architectures are not affected because __{start,end}_init_task
are used on x86 only.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20240322081616.3346181-1-xin@zytor.com
2024-03-22 09:32:41 +01:00
Borislav Petkov (AMD)
95bfb35269 x86/cpu: Get rid of an unnecessary local variable in get_cpu_address_sizes()
Drop 'vp_bits_from_cpuid' as it is not really needed.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240316120706.4352-1-bp@alien8.de
2024-03-21 21:13:56 +01:00
Rafael J. Wysocki
edf66a3c76 x86/cpu: Move leftover contents of topology.c to setup.c
The only useful piece of arch/x86/kernel/topology.c is the definition
of arch_cpu_is_hotpluggable() that can be moved elsewhere (other
architectures tend to put it into setup.c), so do that and delete
the rest of the file.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/12422874.O9o76ZdvQC@kreacher
2024-03-21 20:47:40 +01:00
Brian Gerst
2cb16181a1 x86/boot: Simplify boot stack setup
Define the symbol __top_init_kernel_stack instead of duplicating
the offset from __end_init_task in multiple places.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20240321180506.89030-1-brgerst@gmail.com
2024-03-21 20:17:54 +01:00
Linus Torvalds
cfce216e14 Merge tag 'hyperv-next-signed-20240320' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:

 - Use Hyper-V entropy to seed guest random number generator (Michael
   Kelley)

 - Convert to platform remove callback returning void for vmbus (Uwe
   Kleine-König)

 - Introduce hv_get_hypervisor_version function (Nuno Das Neves)

 - Rename some HV_REGISTER_* defines for consistency (Nuno Das Neves)

 - Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_* (Nuno Das
   Neves)

 - Cosmetic changes for hv_spinlock.c (Purna Pavan Chandra Aekkaladevi)

 - Use per cpu initial stack for vtl context (Saurabh Sengar)

* tag 'hyperv-next-signed-20240320' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: Use Hyper-V entropy to seed guest random number generator
  x86/hyperv: Cosmetic changes for hv_spinlock.c
  hyperv-tlfs: Rename some HV_REGISTER_* defines for consistency
  hv: vmbus: Convert to platform remove callback returning void
  mshyperv: Introduce hv_get_hypervisor_version function
  x86/hyperv: Use per cpu initial stack for vtl context
  hyperv-tlfs: Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_*
2024-03-21 10:01:02 -07:00
Uros Bizjak
5d31174f3c x86/fpu: Fix AMD X86_BUG_FXSAVE_LEAK fixup
The assembly snippet in restore_fpregs_from_fpstate() that implements
X86_BUG_FXSAVE_LEAK fixup loads the value from a random variable,
preferably the one that is already in the L1 cache.

However, the access to fpinit_state via *fpstate pointer is not
implemented correctly. The "m" asm constraint requires dereferenced
pointer variable, otherwise the compiler just reloads the value
via temporary stack slot. The current asm code reflects this:

     mov    %rdi,(%rsp)
     ...
     fildl  (%rsp)

With dereferenced pointer variable, the code does what the
comment above the asm snippet says:

     fildl  (%rdi)

Also, remove the pointless %P operand modifier. The modifier is
ineffective on non-symbolic references - it was used to prevent
%rip-relative addresses in .altinstr sections, but FILDL in the
.text section can use %rip-relative addresses without problems.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240315081849.5187-1-ubizjak@gmail.com
2024-03-19 14:02:29 +01:00
Paolo Bonzini
c822a075ab Merge tag 'kvm-x86-asyncpf_abi-6.9' of https://github.com/kvm-x86/linux into HEAD
Guest-side KVM async #PF ABI cleanup for 6.9

Delete kvm_vcpu_pv_apf_data.enabled to fix a goof in KVM's async #PF ABI where
the enabled field pushes the size of "struct kvm_vcpu_pv_apf_data" from 64 to
68 bytes, i.e. beyond a single cache line.

The enabled field is purely a guest-side flag that Linux-as-a-guest uses to
track whether or not the guest has enabled async #PF support.  The actual flag
that is passed to the host, i.e. to KVM proper, is a single bit in a synthetic
MSR, MSR_KVM_ASYNC_PF_EN, i.e. is in a location completely unrelated to the
shared kvm_vcpu_pv_apf_data structure.

Simply drop the the field and use a dedicated guest-side per-CPU variable to
fix the ABI, as opposed to fixing the documentation to match reality.  KVM has
never consumed kvm_vcpu_pv_apf_data.enabled, so the odds of the ABI change
breaking anything are extremely low.
2024-03-18 19:03:42 -04:00
Michael Kelley
f2580a907e x86/hyperv: Use Hyper-V entropy to seed guest random number generator
A Hyper-V host provides its guest VMs with entropy in a custom ACPI
table named "OEM0".  The entropy bits are updated each time Hyper-V
boots the VM, and are suitable for seeding the Linux guest random
number generator (rng). See a brief description of OEM0 in [1].

Generation 2 VMs on Hyper-V use UEFI to boot. Existing EFI code in
Linux seeds the rng with entropy bits from the EFI_RNG_PROTOCOL.
Via this path, the rng is seeded very early during boot with good
entropy. The ACPI OEM0 table provided in such VMs is an additional
source of entropy.

Generation 1 VMs on Hyper-V boot from BIOS. For these VMs, Linux
doesn't currently get any entropy from the Hyper-V host. While this
is not fundamentally broken because Linux can generate its own entropy,
using the Hyper-V host provided entropy would get the rng off to a
better start and would do so earlier in the boot process.

Improve the rng seeding for Generation 1 VMs by having Hyper-V specific
code in Linux take advantage of the OEM0 table to seed the rng. For
Generation 2 VMs, use the OEM0 table to provide additional entropy
beyond the EFI_RNG_PROTOCOL. Because the OEM0 table is custom to
Hyper-V, parse it directly in the Hyper-V code in the Linux kernel
and use add_bootloader_randomness() to add it to the rng. Once the
entropy bits are read from OEM0, zero them out in the table so
they don't appear in /sys/firmware/acpi/tables/OEM0 in the running
VM. The zero'ing is done out of an abundance of caution to avoid
potential security risks to the rng. Also set the OEM0 data length
to zero so a kexec or other subsequent use of the table won't try
to use the zero'ed bits.

[1] https://download.microsoft.com/download/1/c/9/1c9813b8-089c-4fef-b2ad-ad80e79403ba/Whitepaper%20-%20The%20Windows%2010%20random%20number%20generation%20infrastructure.pdf

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://lore.kernel.org/r/20240318155408.216851-1-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240318155408.216851-1-mhklinux@outlook.com>
2024-03-18 22:01:52 +00:00
Nuno Das Neves
b967df6293 hyperv-tlfs: Rename some HV_REGISTER_* defines for consistency
Rename HV_REGISTER_GUEST_OSID to HV_REGISTER_GUEST_OS_ID. This matches
the existing HV_X64_MSR_GUEST_OS_ID.

Rename HV_REGISTER_CRASH_* to HV_REGISTER_GUEST_CRASH_*. Including
GUEST_ is consistent with other #defines such as
HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE. The new names also match the TLFS
document more accurately, i.e. HvRegisterGuestCrash*.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Link: https://lore.kernel.org/r/1710285687-9160-1-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1710285687-9160-1-git-send-email-nunodasneves@linux.microsoft.com>
2024-03-18 04:58:49 +00:00
Borislav Petkov (AMD)
5c84b051bd x86/CPU/AMD: Update the Zenbleed microcode revisions
Update them to the correct revision numbers.

Fixes: 522b1d6921 ("x86/cpu/amd: Add a Zenbleed fix")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-16 09:04:09 -07:00
Linus Torvalds
4f712ee0cb Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "S390:

   - Changes to FPU handling came in via the main s390 pull request

   - Only deliver to the guest the SCLP events that userspace has
     requested

   - More virtual vs physical address fixes (only a cleanup since
     virtual and physical address spaces are currently the same)

   - Fix selftests undefined behavior

  x86:

   - Fix a restriction that the guest can't program a PMU event whose
     encoding matches an architectural event that isn't included in the
     guest CPUID. The enumeration of an architectural event only says
     that if a CPU supports an architectural event, then the event can
     be programmed *using the architectural encoding*. The enumeration
     does NOT say anything about the encoding when the CPU doesn't
     report support the event *in general*. It might support it, and it
     might support it using the same encoding that made it into the
     architectural PMU spec

   - Fix a variety of bugs in KVM's emulation of RDPMC (more details on
     individual commits) and add a selftest to verify KVM correctly
     emulates RDMPC, counter availability, and a variety of other
     PMC-related behaviors that depend on guest CPUID and therefore are
     easier to validate with selftests than with custom guests (aka
     kvm-unit-tests)

   - Zero out PMU state on AMD if the virtual PMU is disabled, it does
     not cause any bug but it wastes time in various cases where KVM
     would check if a PMC event needs to be synthesized

   - Optimize triggering of emulated events, with a nice ~10%
     performance improvement in VM-Exit microbenchmarks when a vPMU is
     exposed to the guest

   - Tighten the check for "PMI in guest" to reduce false positives if
     an NMI arrives in the host while KVM is handling an IRQ VM-Exit

   - Fix a bug where KVM would report stale/bogus exit qualification
     information when exiting to userspace with an internal error exit
     code

   - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support

   - Rework TDP MMU root unload, free, and alloc to run with mmu_lock
     held for read, e.g. to avoid serializing vCPUs when userspace
     deletes a memslot

   - Tear down TDP MMU page tables at 4KiB granularity (used to be
     1GiB). KVM doesn't support yielding in the middle of processing a
     zap, and 1GiB granularity resulted in multi-millisecond lags that
     are quite impolite for CONFIG_PREEMPT kernels

   - Allocate write-tracking metadata on-demand to avoid the memory
     overhead when a kernel is built with i915 virtualization support
     but the workloads use neither shadow paging nor i915 virtualization

   - Explicitly initialize a variety of on-stack variables in the
     emulator that triggered KMSAN false positives

   - Fix the debugregs ABI for 32-bit KVM

   - Rework the "force immediate exit" code so that vendor code
     ultimately decides how and when to force the exit, which allowed
     some optimization for both Intel and AMD

   - Fix a long-standing bug where kvm_has_noapic_vcpu could be left
     elevated if vCPU creation ultimately failed, causing extra
     unnecessary work

   - Cleanup the logic for checking if the currently loaded vCPU is
     in-kernel

   - Harden against underflowing the active mmu_notifier invalidation
     count, so that "bad" invalidations (usually due to bugs elsehwere
     in the kernel) are detected earlier and are less likely to hang the
     kernel

  x86 Xen emulation:

   - Overlay pages can now be cached based on host virtual address,
     instead of guest physical addresses. This removes the need to
     reconfigure and invalidate the cache if the guest changes the gpa
     but the underlying host virtual address remains the same

   - When possible, use a single host TSC value when computing the
     deadline for Xen timers in order to improve the accuracy of the
     timer emulation

   - Inject pending upcall events when the vCPU software-enables its
     APIC to fix a bug where an upcall can be lost (and to follow Xen's
     behavior)

   - Fall back to the slow path instead of warning if "fast" IRQ
     delivery of Xen events fails, e.g. if the guest has aliased xAPIC
     IDs

  RISC-V:

   - Support exception and interrupt handling in selftests

   - New self test for RISC-V architectural timer (Sstc extension)

   - New extension support (Ztso, Zacas)

   - Support userspace emulation of random number seed CSRs

  ARM:

   - Infrastructure for building KVM's trap configuration based on the
     architectural features (or lack thereof) advertised in the VM's ID
     registers

   - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
     x86's WC) at stage-2, improving the performance of interacting with
     assigned devices that can tolerate it

   - Conversion of KVM's representation of LPIs to an xarray, utilized
     to address serialization some of the serialization on the LPI
     injection path

   - Support for _architectural_ VHE-only systems, advertised through
     the absence of FEAT_E2H0 in the CPU's ID register

   - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
     selftests

  LoongArch:

   - Set reserved bits as zero in CPUCFG

   - Start SW timer only when vcpu is blocking

   - Do not restart SW timer when it is expired

   - Remove unnecessary CSR register saving during enter guest

   - Misc cleanups and fixes as usual

  Generic:

   - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
     always true on all architectures except MIPS (where Kconfig
     determines the available depending on CPU capabilities). It is
     replaced either by an architecture-dependent symbol for MIPS, and
     IS_ENABLED(CONFIG_KVM) everywhere else

   - Factor common "select" statements in common code instead of
     requiring each architecture to specify it

   - Remove thoroughly obsolete APIs from the uapi headers

   - Move architecture-dependent stuff to uapi/asm/kvm.h

   - Always flush the async page fault workqueue when a work item is
     being removed, especially during vCPU destruction, to ensure that
     there are no workers running in KVM code when all references to
     KVM-the-module are gone, i.e. to prevent a very unlikely
     use-after-free if kvm.ko is unloaded

   - Grab a reference to the VM's mm_struct in the async #PF worker
     itself instead of gifting the worker a reference, so that there's
     no need to remember to *conditionally* clean up after the worker

  Selftests:

   - Reduce boilerplate especially when utilize selftest TAP
     infrastructure

   - Add basic smoke tests for SEV and SEV-ES, along with a pile of
     library support for handling private/encrypted/protected memory

   - Fix benign bugs where tests neglect to close() guest_memfd files"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
  selftests: kvm: remove meaningless assignments in Makefiles
  KVM: riscv: selftests: Add Zacas extension to get-reg-list test
  RISC-V: KVM: Allow Zacas extension for Guest/VM
  KVM: riscv: selftests: Add Ztso extension to get-reg-list test
  RISC-V: KVM: Allow Ztso extension for Guest/VM
  RISC-V: KVM: Forward SEED CSR access to user space
  KVM: riscv: selftests: Add sstc timer test
  KVM: riscv: selftests: Change vcpu_has_ext to a common function
  KVM: riscv: selftests: Add guest helper to get vcpu id
  KVM: riscv: selftests: Add exception handling support
  LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
  LoongArch: KVM: Do not restart SW timer when it is expired
  LoongArch: KVM: Start SW timer only when vcpu is blocking
  LoongArch: KVM: Set reserved bits as zero in CPUCFG
  KVM: selftests: Explicitly close guest_memfd files in some gmem tests
  KVM: x86/xen: fix recursive deadlock in timer injection
  KVM: pfncache: simplify locking and make more self-contained
  KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
  KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
  KVM: x86/xen: improve accuracy of Xen timers
  ...
2024-03-15 13:03:13 -07:00
Linus Torvalds
ab522e1478 Merge tag 'devicetree-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree updates from Rob Herring:
 "DT core:

   - Add cleanup.h based auto release of struct device_node pointers via
     __free marking and new for_each_child_of_node_scoped() iterator to
     use it.

   - Always create a base skeleton DT when CONFIG_OF is enabled. This
     supports several usecases of adding DT data on non-DT booted
     systems.

   - Move around some /reserved-memory code in preparation for further
     improvements

   - Add a stub for_each_property_of_node() for !OF

   - Adjust the printk levels on some messages

   - Fix __be32 sparse warning

   - Drop RESERVEDMEM_OF_DECLARE usage from Freescale qbman driver
     (currently orphaned)

   - Add Saravana Kannan and drop Frank Rowand as DT maintainers

  DT bindings:

   - Convert Mediatek timer, Mediatek sysirq, fsl,imx6ul-tsc,
     fsl,imx6ul-pinctrl, Atmel AIC, Atmel HLCDC, FPGA region, and
     xlnx,sd-fec to DT schemas

   - Add existing, but undocumented fsl,imx-anatop binding

   - Add bunch of undocumented vendor prefixes used in compatible
     strings

   - Drop obsolete brcm,bcm2835-pm-wdt binding

   - Drop obsolete i2c.txt which as been replaced with schema in
     dtschema

   - Add DPS310 device and sort trivial-devices.yaml

   - Enable undocumented compatible checks on DT binding examples

   - More QCom maintainer fixes/updates

   - Updates to writing-schema.rst and DT submitting-patches.rst to
     cover some frequent review comments

   - Clean-up SPDX tags to use 'OR' rather than 'or'"

* tag 'devicetree-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (56 commits)
  dt-bindings: soc: imx: fsl,imx-anatop: add imx6q regulators
  of: unittest: Use for_each_child_of_node_scoped()
  of: Introduce for_each_*_child_of_node_scoped() to automate of_node_put() handling
  of: Add cleanup.h based auto release via __free(device_node) markings
  of: Move all FDT reserved-memory handling into of_reserved_mem.c
  of: Add KUnit test to confirm DTB is loaded
  of: unittest: treat missing of_root as error instead of fixing up
  x86/of: Unconditionally call unflatten_and_copy_device_tree()
  um: Unconditionally call unflatten_device_tree()
  of: Create of_root if no dtb provided by firmware
  of: Always unflatten in unflatten_and_copy_device_tree()
  dt-bindings: timer: mediatek: Convert to json-schema
  dt-bindings: interrupt-controller: fsl,intmux: Include power-domains support
  soc: fsl: qbman: Remove RESERVEDMEM_OF_DECLARE usage
  dt-bindings: fsl-imx-sdma: fix HDMI audio index
  dt-bindings: soc: imx: fsl,imx-iomuxc-gpr: add imx6
  dt-bindings: soc: imx: fsl,imx-anatop: add binding
  dt-bindings: input: touchscreen: fsl,imx6ul-tsc convert to YAML
  dt-bindings: pinctrl: fsl,imx6ul-pinctrl: convert to YAML
  of: make for_each_property_of_node() available to to !OF
  ...
2024-03-15 12:37:59 -07:00
Linus Torvalds
902861e34c Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
   from hotplugged memory rather than only from main memory. Series
   "implement "memmap on memory" feature on s390".

 - More folio conversions from Matthew Wilcox in the series

	"Convert memcontrol charge moving to use folios"
	"mm: convert mm counter to take a folio"

 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes. The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".

 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.

 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools". Measured improvements are modest.

 - zswap cleanups and simplifications from Yosry Ahmed in the series
   "mm: zswap: simplify zswap_swapoff()".

 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is
   hotplugged as system memory.

 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.

 - More DAMON work from SeongJae Park in the series

	"mm/damon: make DAMON debugfs interface deprecation unignorable"
	"selftests/damon: add more tests for core functionalities and corner cases"
	"Docs/mm/damon: misc readability improvements"
	"mm/damon: let DAMOS feeds and tame/auto-tune itself"

 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving
   policy wherein we allocate memory across nodes in a weighted fashion
   rather than uniformly. This is beneficial in heterogeneous memory
   environments appearing with CXL.

 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".

 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".

 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format. Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.

 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
   targeted at arm64, this significantly speeds up fork() when the
   process has a large number of pte-mapped folios.

 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
   implements batching during munmap() and other pte teardown
   situations. The microbenchmark improvements are nice.

 - And in the series "Transparent Contiguous PTEs for User Mappings"
   Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings"). Kernel build times on arm64 improved nicely. Ryan's
   series "Address some contpte nits" provides some followup work.

 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page
   faults. He has also added a reproducer under the selftest code.

 - In the series "selftests/mm: Output cleanups for the compaction
   test", Mark Brown did what the title claims.

 - Kinsey Ho has added the series "mm/mglru: code cleanup and
   refactoring".

 - Even more zswap material from Nhat Pham. The series "fix and extend
   zswap kselftests" does as claimed.

 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
   in our handling of DAX on archiecctures which have virtually aliasing
   data caches. The arm architecture is the main beneficiary.

 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides
   dramatic improvements in worst-case mmap_lock hold times during
   certain userfaultfd operations.

 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series

	"page_owner: print stacks and their outstanding allocations"
	"page_owner: Fixup and cleanup"

 - Uladzislau Rezki has contributed some vmalloc scalability
   improvements in his series "Mitigate a vmap lock contention". It
   realizes a 12x improvement for a certain microbenchmark.

 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".

 - Some zsmalloc maintenance work from Chengming Zhou in the series

	"mm/zsmalloc: fix and optimize objects/page migration"
	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"

 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0. This a step along the path to implementaton of the merging
   of large anonymous folios. The series is named "Enable >0 order folio
   memory compaction".

 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages()
   to an iterator".

 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".

 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios. The
   series is named "Split a folio to any lower order folios".

 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.

 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".

 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which
   are configured to use large numbers of hugetlb pages.

 - Matthew Wilcox's series "PageFlags cleanups" does that.

 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also. S390 is affected.

 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".

 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM
   Selftests".

 - Also, of course, many singleton patches to many things. Please see
   the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
2024-03-14 17:43:30 -07:00
Linus Torvalds
01732755ee Merge tag 'probes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:
 "x86 kprobes:

   - Use boolean for some function return instead of 0 and 1

   - Prohibit probing on INT/UD. This prevents user to put kprobe on
     INTn/INT1/INT3/INTO and UD0/UD1/UD2 because these are used for a
     special purpose in the kernel

   - Boost Grp instructions. Because a few percent of kernel
     instructions are Grp 2/3/4/5 and those are safe to be executed
     without ip register fixup, allow those to be boosted (direct
     execution on the trampoline buffer with a JMP)

  tracing:

   - Add function argument access from return events (kretprobe and
     fprobe). This allows user to compare how a data structure field is
     changed after executing a function. With BTF, return event also
     accepts function argument access by name.

   - Fix a wrong comment (using "Kretprobe" in fprobe)

   - Cleanup a big probe argument parser function into three parts, type
     parser, post-processing function, and main parser

   - Cleanup to set nr_args field when initializing trace_probe instead
     of counting up it while parsing

   - Cleanup a redundant #else block from tracefs/README source code

   - Update selftests to check entry argument access from return probes

   - Documentation update about entry argument access from return
     probes"

* tag 'probes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  Documentation: tracing: Add entry argument access at function exit
  selftests/ftrace: Add test cases for entry args at function exit
  tracing/probes: Support $argN in return probe (kprobe and fprobe)
  tracing: Remove redundant #else block for BTF args from README
  tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init
  tracing/probes: Cleanup probe argument parser
  tracing/fprobe-event: cleanup: Fix a wrong comment in fprobe event
  x86/kprobes: Boost more instructions from grp2/3/4/5
  x86/kprobes: Prohibit kprobing on INT and UD
  x86/kprobes: Refactor can_{probe,boost} return type to bool
2024-03-14 16:16:33 -07:00
Linus Torvalds
9434467959 Merge tag 'acpi-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
 "These modify the ACPI device events and processor enumeration code to
  take the 'enabled' _STA bit into account as mandated by the ACPI
  specification, convert several platform drivers to using a remove
  callback that returns void, add some new quirks for ACPI IRQ override
  and other things, address assorted issues and clean up code.

  Specifics:

   - Rearrange Device Check and Bus Check notification handling in the
     ACPI device hotplug code to make it get the "enabled" _STA bit into
     account (Rafael Wysocki)

   - Modify acpi_processor_add() to skip processors with the "enabled"
     _STA bit clear, as per the specification (Rafael Wysocki)

   - Stop failing Device Check notification handling without a valid
     reason (Rafael Wysocki)

   - Defer enumeration of devices that depend on a device with an ACPI
     device ID equalt to INTC10CF to address probe ordering issues on
     some platforms (Wentong Wu)

   - Constify acpi_bus_type (Ricardo Marliere)

   - Make the ACPI-specific suspend-to-idle code take the Low-Power S0
     Idle MSFT UUID into account on non-AMD systems (Rafael Wysocki)

   - Add ACPI IRQ override quirks for some new platforms (Sergey
     Kalinichev, Maxim Kudinov, Alexey Froloff, Sviatoslav Harasymchuk,
     Nicolas Haye)

   - Make the NFIT parsing code use acpi_evaluate_dsm_typed() (Andy
     Shevchenko)

   - Fix a memory leak in acpi_processor_power_exit() (Armin Wolf)

   - Make it possible to quirk the CSI-2 and MIPI DisCo for Imaging
     properties parsing and add a quirk for Dell XPS 9315 (Sakari Ailus)

   - Prevent false-positive static checker warnings from triggering by
     intializing some variables in the ACPI thermal code to zero (Colin
     Ian King)

   - Add DELL0501 handling to acpi_quirk_skip_serdev_enumeration() and
     make that function generic (Hans de Goede)

   - Make the ACPI backlight code handle fetching EDID that is longer
     than 256 bytes (Mario Limonciello)

   - Skip initialization of GHES_ASSIST structures for Machine Check
     Architecture in APEI (Avadhut Naik)

   - Convert several plaform drivers in the ACPI subsystem to using a
     remove callback that returns void (Uwe Kleine-König)

   - Drop the long-deprecated custom_method debugfs interface that is
     problematic from the security standpoint (Rafael Wysocki)

   - Use %pe in a couple of places in the ACPI code for easier error
     decoding (Onkarnath)

   - Fix register width information handling during system memory
     accesses in the ACPI CPPC library (Jarred White)

   - Add AMD CPPC V2 support for family 17h processors to the ACPI CPPC
     library (Perry Yuan)"

* tag 'acpi-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
  ACPI: resource: Use IRQ override on Maibenben X565
  ACPI: CPPC: Use access_width over bit_width for system memory accesses
  ACPI: CPPC: enable AMD CPPC V2 support for family 17h processors
  ACPI: APEI: Skip initialization of GHES_ASSIST structures for Machine Check Architecture
  ACPI: scan: Consolidate Device Check and Bus Check notification handling
  ACPI: scan: Rework Device Check and Bus Check notification handling
  ACPI: scan: Make acpi_processor_add() check the device enabled bit
  ACPI: scan: Relocate acpi_bus_trim_one()
  ACPI: scan: Fix device check notification handling
  ACPI: resource: Add MAIBENBEN X577 to irq1_edge_low_force_override
  ACPI: pfr_update: Convert to platform remove callback returning void
  ACPI: pfr_telemetry: Convert to platform remove callback returning void
  ACPI: fan: Convert to platform remove callback returning void
  ACPI: GED: Convert to platform remove callback returning void
  ACPI: DPTF: Convert to platform remove callback returning void
  ACPI: AGDI: Convert to platform remove callback returning void
  ACPI: TAD: Convert to platform remove callback returning void
  ACPI: APEI: GHES: Convert to platform remove callback returning void
  ACPI: property: Polish ignoring bad data nodes
  ACPI: thermal_lib: Initialize temp_decik to zero
  ...
2024-03-13 11:54:05 -07:00
Linus Torvalds
07abb19a9b Merge tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "From the functional perspective, the most significant change here is
  the addition of support for Energy Models that can be updated
  dynamically at run time.

  There is also the addition of LZ4 compression support for hibernation,
  the new preferred core support in amd-pstate, new platforms support in
  the Intel RAPL driver, new model-specific EPP handling in intel_pstate
  and more.

  Apart from that, the cpufreq default transition delay is reduced from
  10 ms to 2 ms (along with some related adjustments), the system
  suspend statistics code undergoes a significant rework and there is a
  usual bunch of fixes and code cleanups all over.

  Specifics:

   - Allow the Energy Model to be updated dynamically (Lukasz Luba)

   - Add support for LZ4 compression algorithm to the hibernation image
     creation and loading code (Nikhil V)

   - Fix and clean up system suspend statistics collection (Rafael
     Wysocki)

   - Simplify device suspend and resume handling in the power management
     core code (Rafael Wysocki)

   - Fix PCI hibernation support description (Yiwei Lin)

   - Make hibernation take set_memory_ro() return values into account as
     appropriate (Christophe Leroy)

   - Set mem_sleep_current during kernel command line setup to avoid an
     ordering issue with handling it (Maulik Shah)

   - Fix wake IRQs handling when pm_runtime_force_suspend() is used as a
     driver's system suspend callback (Qingliang Li)

   - Simplify pm_runtime_get_if_active() usage and add a replacement for
     pm_runtime_put_autosuspend() (Sakari Ailus)

   - Add a tracepoint for runtime_status changes tracking (Vilas Bhat)

   - Fix section title markdown in the runtime PM documentation (Yiwei
     Lin)

   - Enable preferred core support in the amd-pstate cpufreq driver
     (Meng Li)

   - Fix min_perf assignment in amd_pstate_adjust_perf() and make the
     min/max limit perf values in amd-pstate always stay within the
     (highest perf, lowest perf) range (Tor Vic, Meng Li)

   - Allow intel_pstate to assign model-specific values to strings used
     in the EPP sysfs interface and make it do so on Meteor Lake
     (Srinivas Pandruvada)

   - Drop long-unused cpudata::prev_cummulative_iowait from the
     intel_pstate cpufreq driver (Jiri Slaby)

   - Prevent scaling_cur_freq from exceeding scaling_max_freq when the
     latter is an inefficient frequency (Shivnandan Kumar)

   - Change default transition delay in cpufreq to 2ms (Qais Yousef)

   - Remove references to 10ms minimum sampling rate from comments in
     the cpufreq code (Pierre Gondois)

   - Honour transition_latency over transition_delay_us in cpufreq (Qais
     Yousef)

   - Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar)

   - General enhancements / cleanups to ARM cpufreq drivers (tianyu2,
     Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia
     Belova)

   - Update cpufreq-dt-platdev to block/approve devices (Richard Acayan)

   - Make the SCMI cpufreq driver get a transition delay value from
     firmware (Pierre Gondois)

   - Prevent the haltpoll cpuidle governor from shrinking guest
     poll_limit_ns below grow_start (Parshuram Sangle)

   - Avoid potential overflow in integer multiplication when computing
     cpuidle state parameters (C Cheng)

   - Adjust MWAIT hint target C-state computation in the ACPI cpuidle
     driver and in intel_idle to return a correct value for C0 (He
     Rongguang)

   - Address multiple issues in the TPMI RAPL driver and add support for
     new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui)

   - Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel
     Lezcano)

   - Fix kernel-doc for dtpm_create_hierarchy() (Yang Li)

   - Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth
     Norway Ananda)

   - Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil)

   - Fix a couple of warnings in the OPP core code related to W=1 builds
     (Viresh Kumar)

   - Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh
     Kumar)

   - Extend dev_pm_opp_data with turbo support (Sibi Sankar)

   - dt-bindings: drop maxItems from inner items (David Heidelberg)"

* tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (95 commits)
  dt-bindings: opp: drop maxItems from inner items
  OPP: debugfs: Fix warning around icc_get_name()
  OPP: debugfs: Fix warning with W=1 builds
  cpufreq: Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h
  OPP: Extend dev_pm_opp_data with turbo support
  Fix cpupower-frequency-info.1 man page typo
  cpufreq: scmi: Set transition_delay_us
  firmware: arm_scmi: Populate fast channel rate_limit
  firmware: arm_scmi: Populate perf commands rate_limit
  cpuidle: ACPI/intel: fix MWAIT hint target C-state computation
  PM: sleep: wakeirq: fix wake irq warning in system suspend
  powercap: dtpm: Fix kernel-doc for dtpm_create_hierarchy() function
  cpufreq: Don't unregister cpufreq cooling on CPU hotplug
  PM: suspend: Set mem_sleep_current during kernel command line setup
  cpufreq: Honour transition_latency over transition_delay_us
  cpufreq: Limit resolving a frequency to policy min/max
  Documentation: PM: Fix runtime_pm.rst markdown syntax
  cpufreq: amd-pstate: adjust min/max limit perf
  cpufreq: Remove references to 10ms min sampling rate
  cpufreq: intel_pstate: Update default EPPs for Meteor Lake
  ...
2024-03-13 11:40:06 -07:00
Wei Yang
9b67ce2c12 x86/vmlinux.lds.S: Take __START_KERNEL out conditional definition
If CONFIG_X86_32=y, the section start address is defined to be
"LOAD_OFFSET + LOAD_PHYSICAL_ADDR", which is the same as
__START_KERNEL_map.

Unify it with the 64-bit definition to simplify the code.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240313075839.8321-5-richard.weiyang@gmail.com
2024-03-13 11:29:11 +01:00
Wei Yang
a5cffd056e x86/vmlinux.lds.S: Remove conditional definition of LOAD_OFFSET
In vmlinux.lds.S, we define LOAD_OFFSET conditionally to __PAGE_OFFSET
or __START_KERNEL_map. While __START_KERNEL_map is already defined to
the same value with the same condition.

So it is fine to define LOAD_OFFSET to __START_KERNEL_map directly.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240313075839.8321-4-richard.weiyang@gmail.com
2024-03-13 11:29:10 +01:00
Linus Torvalds
b29f377119 Merge tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:

 - Continuing work by Ard Biesheuvel to improve the x86 early startup
   code, with the long-term goal to make it position independent:

      - Get rid of early accesses to global objects, either by moving
        them to the stack, deferring the access until later, or dropping
        the globals entirely

      - Move all code that runs early via the 1:1 mapping into
        .head.text, and move code that does not out of it, so that build
        time checks can be added later to ensure that no inadvertent
        absolute references were emitted into code that does not
        tolerate them

      - Remove fixup_pointer() and occurrences of __pa_symbol(), which
        rely on the compiler emitting absolute references, which is not
        guaranteed

 - Improve the early console code

 - Add early console message about ignored NMIs, so that users are at
   least warned about their existence - even if we cannot do anything
   about them

 - Improve the kexec code's kernel load address handling

 - Enable more X86S (simplified x86) bits

 - Simplify early boot GDT handling

 - Micro-optimize the boot code a bit

 - Misc cleanups

* tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  x86/sev: Move early startup code into .head.text section
  x86/sme: Move early SME kernel encryption handling into .head.text
  x86/boot: Move mem_encrypt= parsing to the decompressor
  efi/libstub: Add generic support for parsing mem_encrypt=
  x86/startup_64: Simplify virtual switch on primary boot
  x86/startup_64: Simplify calculation of initial page table address
  x86/startup_64: Defer assignment of 5-level paging global variables
  x86/startup_64: Simplify CR4 handling in startup code
  x86/boot: Use 32-bit XOR to clear registers
  efi/x86: Set the PE/COFF header's NX compat flag unconditionally
  x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]
  x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]
  x86/boot/64: Use RIP_REL_REF() to access early page tables
  x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask'
  x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[]
  x86/boot/64: Use RIP_REL_REF() to assign 'phys_base'
  x86/boot/64: Simplify global variable accesses in GDT/IDT programming
  x86/trampoline: Bypass compat mode in trampoline_start64() if not needed
  kexec: Allocate kernel above bzImage's pref_address
  x86/boot: Add a message about ignored early NMIs
  ...
2024-03-12 09:58:57 -07:00
Linus Torvalds
0e33cf955f Merge tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RFDS mitigation from Dave Hansen:
 "RFDS is a CPU vulnerability that may allow a malicious userspace to
  infer stale register values from kernel space. Kernel registers can
  have all kinds of secrets in them so the mitigation is basically to
  wait until the kernel is about to return to userspace and has user
  values in the registers. At that point there is little chance of
  kernel secrets ending up in the registers and the microarchitectural
  state can be cleared.

  This leverages some recent robustness fixes for the existing MDS
  vulnerability. Both MDS and RFDS use the VERW instruction for
  mitigation"

* tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
  x86/rfds: Mitigate Register File Data Sampling (RFDS)
  Documentation/hw-vuln: Add documentation for RFDS
  x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
2024-03-12 09:31:39 -07:00
Ingo Molnar
2e2bc42c83 Merge branch 'linus' into x86/boot, to resolve conflict
There's a new conflict with Linus's upstream tree, because
in the following merge conflict resolution in <asm/coco.h>:

  38b334fc76 Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Linus has resolved the conflicting placement of 'cc_mask' better
than the original commit:

  1c811d403a x86/sev: Fix position dependent variable references in startup code

... which was also done by an internal merge resolution:

  2e5fc4786b Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up dependent tree

But Linus is right in 38b334fc76, the 'cc_mask' declaration is sufficient
within the #ifdef CONFIG_ARCH_HAS_CC_PLATFORM block.

So instead of forcing Linus to do the same resolution again, merge in Linus's
tree and follow his conflict resolution.

 Conflicts:
	arch/x86/include/asm/coco.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-03-12 09:55:57 +01:00