Commit Graph

13012 Commits

Author SHA1 Message Date
Linus Torvalds
c88b9b4cde Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, bluetooth and bpf.

  Fairly usual collection of driver and core fixes. The large selftest
  accompanying one of the fixes is also becoming a common occurrence.

  Current release - regressions:

   - ipv6: fix infinite recursion in fib6_dump_done()

   - net/rds: fix possible null-deref in newly added error path

  Current release - new code bugs:

   - net: do not consume a full cacheline for system_page_pool

   - bpf: fix bpf_arena-related file descriptor leaks in the verifier

   - drv: ice: fix freeing uninitialized pointers, fixing misuse of the
     newfangled __free() auto-cleanup

  Previous releases - regressions:

   - x86/bpf: fixes the BPF JIT with retbleed=stuff

   - xen-netfront: add missing skb_mark_for_recycle, fix page pool
     accounting leaks, revealed by recently added explicit warning

   - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
     non-wildcard addresses

   - Bluetooth:
      - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with
        better workarounds to un-break some buggy Qualcomm devices
      - set conn encrypted before conn establishes, fix re-connecting to
        some headsets which use slightly unusual sequence of msgs

   - mptcp:
      - prevent BPF accessing lowat from a subflow socket
      - don't account accept() of non-MPC client as fallback to TCP

   - drv: mana: fix Rx DMA datasize and skb_over_panic

   - drv: i40e: fix VF MAC filter removal

  Previous releases - always broken:

   - gro: various fixes related to UDP tunnels - netns crossing
     problems, incorrect checksum conversions, and incorrect packet
     transformations which may lead to panics

   - bpf: support deferring bpf_link dealloc to after RCU grace period

   - nf_tables:
      - release batch on table validation from abort path
      - release mutex after nft_gc_seq_end from abort path
      - flush pending destroy work before exit_net release

   - drv: r8169: skip DASH fw status checks when DASH is disabled"

* tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  netfilter: validate user input for expected length
  net/sched: act_skbmod: prevent kernel-infoleak
  net: usb: ax88179_178a: avoid the interface always configured as random address
  net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
  net: ravb: Always update error counters
  net: ravb: Always process TX descriptor ring
  netfilter: nf_tables: discard table flag update with pending basechain deletion
  netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
  netfilter: nf_tables: reject new basechain after table flag update
  netfilter: nf_tables: flush pending destroy work before exit_net release
  netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
  netfilter: nf_tables: release batch on table validation from abort path
  Revert "tg3: Remove residual error handling in tg3_suspend"
  tg3: Remove residual error handling in tg3_suspend
  net: mana: Fix Rx DMA datasize and skb_over_panic
  net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()
  net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping
  net: stmmac: fix rx queue priority assignment
  net: txgbe: fix i2c dev name cannot match clkdev
  net: fec: Set mac_managed_pm during probe
  ...
2024-04-04 14:49:10 -07:00
Linus Torvalds
0f099dc9d1 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
 "ARM:

   - Ensure perf events programmed to count during guest execution are
     actually enabled before entering the guest in the nVHE
     configuration

   - Restore out-of-range handler for stage-2 translation faults

   - Several fixes to stage-2 TLB invalidations to avoid stale
     translations, possibly including partial walk caches

   - Fix early handling of architectural VHE-only systems to ensure E2H
     is appropriately set

   - Correct a format specifier warning in the arch_timer selftest

   - Make the KVM banner message correctly handle all of the possible
     configurations

  RISC-V:

   - Remove redundant semicolon in num_isa_ext_regs()

   - Fix APLIC setipnum_le/be write emulation

   - Fix APLIC in_clrip[x] read emulation

  x86:

   - Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID
     entries (old vs. new) and ultimately neglects to clear PV_UNHALT
     from vCPUs with HLT-exiting disabled

   - Documentation fixes for SEV

   - Fix compat ABI for KVM_MEMORY_ENCRYPT_OP

   - Fix a 14-year-old goof in a declaration shared by host and guest;
     the enabled field used by Linux when running as a guest pushes the
     size of "struct kvm_vcpu_pv_apf_data" from 64 to 68 bytes. This is
     really unconsequential because KVM never consumes anything beyond
     the first 64 bytes, but the resulting struct does not match the
     documentation

  Selftests:

   - Fix spelling mistake in arch_timer selftest"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: arm64: Rationalise KVM banner output
  arm64: Fix early handling of FEAT_E2H0 not being implemented
  KVM: arm64: Ensure target address is granule-aligned for range TLBI
  KVM: arm64: Use TLBI_TTL_UNKNOWN in __kvm_tlb_flush_vmid_range()
  KVM: arm64: Don't pass a TLBI level hint when zapping table entries
  KVM: arm64: Don't defer TLB invalidation when zapping table entries
  KVM: selftests: Fix __GUEST_ASSERT() format warnings in ARM's arch timer test
  KVM: arm64: Fix out-of-IPA space translation fault handling
  KVM: arm64: Fix host-programmed guest events in nVHE
  RISC-V: KVM: Fix APLIC in_clrip[x] read emulation
  RISC-V: KVM: Fix APLIC setipnum_le/be write emulation
  RISC-V: KVM: Remove second semicolon
  KVM: selftests: Fix spelling mistake "trigged" -> "triggered"
  Documentation: kvm/sev: clarify usage of KVM_MEMORY_ENCRYPT_OP
  Documentation: kvm/sev: separate description of firmware
  KVM: SEV: fix compat ABI for KVM_MEMORY_ENCRYPT_OP
  KVM: selftests: Check that PV_UNHALT is cleared when HLT exiting is disabled
  KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT
  KVM: x86: Introduce __kvm_get_hypervisor_cpuid() helper
  KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES
  ...
2024-04-03 10:26:37 -07:00
Paolo Bonzini
52b761b48f Merge tag 'kvmarm-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.9, part #1

 - Ensure perf events programmed to count during guest execution
   are actually enabled before entering the guest in the nVHE
   configuration.

 - Restore out-of-range handler for stage-2 translation faults.

 - Several fixes to stage-2 TLB invalidations to avoid stale
   translations, possibly including partial walk caches.

 - Fix early handling of architectural VHE-only systems to ensure E2H is
   appropriately set.

 - Correct a format specifier warning in the arch_timer selftest.

 - Make the KVM banner message correctly handle all of the possible
   configurations.
2024-04-02 12:26:15 -04:00
Joan Bruguera Micó
6a53745300 x86/bpf: Fix IP for relocating call depth accounting
The commit:

  59bec00ace ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")

made PER_CPU_VAR() to use rip-relative addressing, hence
INCREMENT_CALL_DEPTH macro and skl_call_thunk_template got rip-relative
asm code inside of it. A follow up commit:

  17bce3b2ae ("x86/callthunks: Handle %rip-relative relocations in call thunk template")

changed x86_call_depth_emit_accounting() to use apply_relocation(),
but mistakenly assumed that the code is being patched in-place (where
the destination of the relocation matches the address of the code),
using *pprog as the destination ip. This is not true for the call depth
accounting, emitted by the BPF JIT, so the calculated address was wrong,
JIT-ed BPF progs on kernels with call depth tracking got broken and
usually caused a page fault.

Pass the destination IP when the BPF JIT emits call depth accounting.

Fixes: 17bce3b2ae ("x86/callthunks: Handle %rip-relative relocations in call thunk template")
Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240401185821.224068-3-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-01 20:37:56 -07:00
Linus Torvalds
448f828feb Merge tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fixes from Borislav Petkov:

 - Define the correct set of default hw events on AMD Zen4

 - Use the correct stalled cycles PMCs on AMD Zen2 and newer

 - Fix detection of the LBR freeze feature on AMD

* tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd/core: Define a proper ref-cycles event for Zen 4 and later
  perf/x86/amd/core: Update and fix stalled-cycles-* events for Zen 2 and later
  perf/x86/amd/lbr: Use freeze based on availability
  x86/cpufeatures: Add new word for scattered features
2024-03-31 10:43:11 -07:00
Linus Torvalds
1aac9cb7e6 Merge tag 'x86_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:

 - Make sure single object builds in arch/x86/virt/ ala
      make ... arch/x86/virt/vmx/tdx/seamcall.o
   work again

 - Do not do ROM range scans and memory validation when the kernel is
   running as a SEV-SNP guest as those can get problematic and, before
   that, are not really needed in such a guest

 - Exclude the build-time generated vdso-image-x32.o object from objtool
   validation and in particular the return sites in there due to a
   warning which fires when an unpatched return thunk is being used

 - Improve the NMI CPUs stall message to show additional information
   about the state of each CPU wrt the NMI handler

 - Enable gcc named address spaces support only on !KCSAN configs due to
   compiler options incompatibility

 - Revert a change which was trying to use GB pages for mapping regions
   only when the regions would be large enough but that change lead to
   kexec failing

 - A documentation fixlet

* tag 'x86_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/build: Use obj-y to descend into arch/x86/virt/
  x86/sev: Skip ROM range scans and validation for SEV-SNP guests
  x86/vdso: Fix rethunk patching for vdso-image-x32.o too
  x86/nmi: Upgrade NMI backtrace stall checks & messages
  x86/percpu: Disable named address spaces for KCSAN
  Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
  Documentation/x86: Fix title underline length
2024-03-31 10:16:34 -07:00
Borislav Petkov (AMD)
4535e1a417 x86/bugs: Fix the SRSO mitigation on Zen3/4
The original version of the mitigation would patch in the calls to the
untraining routines directly.  That is, the alternative() in UNTRAIN_RET
will patch in the CALL to srso_alias_untrain_ret() directly.

However, even if commit e7c25c441e ("x86/cpu: Cleanup the untrain
mess") meant well in trying to clean up the situation, due to micro-
architectural reasons, the untraining routine srso_alias_untrain_ret()
must be the target of a CALL instruction and not of a JMP instruction as
it is done now.

Reshuffle the alternative macros to accomplish that.

Fixes: e7c25c441e ("x86/cpu: Cleanup the untrain mess")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-29 12:13:12 -07:00
Baoquan He
32fbe52465 crash: use macro to add crashk_res into iomem early for specific arch
There are regression reports[1][2] that crashkernel region on x86_64 can't
be added into iomem tree sometime.  This causes the later failure of kdump
loading.

This happened after commit 4a693ce65b ("kdump: defer the insertion of
crashkernel resources") was merged.

Even though, these reported issues are proved to be related to other
component, they are just exposed after above commmit applied, I still
would like to keep crashk_res and crashk_low_res being added into iomem
early as before because the early adding has been always there on x86_64
and working very well.  For safety of kdump, Let's change it back.

Here, add a macro HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY to limit that
only ARCH defining the macro can have the early adding
crashk_res/_low_res into iomem. Then define
HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY on x86 to enable it.

Note: In reserve_crashkernel_low(), there's a remnant of crashk_low_res
handling which was mistakenly added back in commit 85fcde402d ("kexec:
split crashkernel reservation code out from crash_core.c").

[1]
[PATCH V2] x86/kexec: do not update E820 kexec table for setup_data
https://lore.kernel.org/all/Zfv8iCL6CT2JqLIC@darkstar.users.ipa.redhat.com/T/#u

[2]
Question about Address Range Validation in Crash Kernel Allocation
https://lore.kernel.org/all/4eeac1f733584855965a2ea62fa4da58@huawei.com/T/#u

Link: https://lkml.kernel.org/r/ZgDYemRQ2jxjLkq+@MiWiFi-R3L-srv
Fixes: 4a693ce65b ("kdump: defer the insertion of crashkernel resources")
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Huacai Chen <chenhuacai@loongson.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:14:12 -07:00
Kevin Loughlin
0f4a1e8098 x86/sev: Skip ROM range scans and validation for SEV-SNP guests
SEV-SNP requires encrypted memory to be validated before access.
Because the ROM memory range is not part of the e820 table, it is not
pre-validated by the BIOS. Therefore, if a SEV-SNP guest kernel wishes
to access this range, the guest must first validate the range.

The current SEV-SNP code does indeed scan the ROM range during early
boot and thus attempts to validate the ROM range in probe_roms().
However, this behavior is neither sufficient nor necessary for the
following reasons:

* With regards to sufficiency, if EFI_CONFIG_TABLES are not enabled and
  CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK is set, the kernel will
  attempt to access the memory at SMBIOS_ENTRY_POINT_SCAN_START (which
  falls in the ROM range) prior to validation.

  For example, Project Oak Stage 0 provides a minimal guest firmware
  that currently meets these configuration conditions, meaning guests
  booting atop Oak Stage 0 firmware encounter a problematic call chain
  during dmi_setup() -> dmi_scan_machine() that results in a crash
  during boot if SEV-SNP is enabled.

* With regards to necessity, SEV-SNP guests generally read garbage
  (which changes across boots) from the ROM range, meaning these scans
  are unnecessary. The guest reads garbage because the legacy ROM range
  is unencrypted data but is accessed via an encrypted PMD during early
  boot (where the PMD is marked as encrypted due to potentially mapping
  actually-encrypted data in other PMD-contained ranges).

In one exceptional case, EISA probing treats the ROM range as
unencrypted data, which is inconsistent with other probing.

Continuing to allow SEV-SNP guests to use garbage and to inconsistently
classify ROM range encryption status can trigger undesirable behavior.
For instance, if garbage bytes appear to be a valid signature, memory
may be unnecessarily reserved for the ROM range. Future code or other
use cases may result in more problematic (arbitrary) behavior that
should be avoided.

While one solution would be to overhaul the early PMD mapping to always
treat the ROM region of the PMD as unencrypted, SEV-SNP guests do not
currently rely on data from the ROM region during early boot (and even
if they did, they would be mostly relying on garbage data anyways).

As a simpler solution, skip the ROM range scans (and the otherwise-
necessary range validation) during SEV-SNP guest early boot. The
potential SEV-SNP guest crash due to lack of ROM range validation is
thus avoided by simply not accessing the ROM range.

In most cases, skip the scans by overriding problematic x86_init
functions during sme_early_init() to SNP-safe variants, which can be
likened to x86_init overrides done for other platforms (ex: Xen); such
overrides also avoid the spread of cc_platform_has() checks throughout
the tree.

In the exceptional EISA case, still use cc_platform_has() for the
simplest change, given (1) checks for guest type (ex: Xen domain status)
are already performed here, and (2) these checks occur in a subsys
initcall instead of an x86_init function.

  [ bp: Massage commit message, remove "we"s. ]

Fixes: 9704c07bf9 ("x86/kernel: Validate ROM memory before accessing when SEV-SNP is active")
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20240313121546.2964854-1-kevinloughlin@google.com
2024-03-26 15:22:35 +01:00
Sandipan Das
598c2fafc0 perf/x86/amd/lbr: Use freeze based on availability
Currently, the LBR code assumes that LBR Freeze is supported on all processors
when X86_FEATURE_AMD_LBR_V2 is available i.e. CPUID leaf 0x80000022[EAX]
bit 1 is set. This is incorrect as the availability of the feature is
additionally dependent on CPUID leaf 0x80000022[EAX] bit 2 being set,
which may not be set for all Zen 4 processors.

Define a new feature bit for LBR and PMC freeze and set the freeze enable bit
(FLBRI) in DebugCtl (MSR 0x1d9) conditionally.

It should still be possible to use LBR without freeze for profile-guided
optimization of user programs by using an user-only branch filter during
profiling. When the user-only filter is enabled, branches are no longer
recorded after the transition to CPL 0 upon PMI arrival. When branch
entries are read in the PMI handler, the branch stack does not change.

E.g.

  $ perf record -j any,u -e ex_ret_brn_tkn ./workload

Since the feature bit is visible under flags in /proc/cpuinfo, it can be
used to determine the feasibility of use-cases which require LBR Freeze
to be supported by the hardware such as profile-guided optimization of
kernels.

Fixes: ca5b7c0d96 ("perf/x86/amd/lbr: Add LbrExtV2 branch record support")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/69a453c97cfd11c6f2584b19f937fe6df741510f.1711091584.git.sandipan.das@amd.com
2024-03-25 11:16:55 +01:00
Sandipan Das
7f274e609f x86/cpufeatures: Add new word for scattered features
Add a new word for scattered features because all free bits among the
existing Linux-defined auxiliary flags have been exhausted.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/8380d2a0da469a1f0ad75b8954a79fb689599ff6.1711091584.git.sandipan.das@amd.com
2024-03-25 11:16:54 +01:00
Linus Torvalds
5e74df2f8f Merge tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:

 - Ensure that the encryption mask at boot is properly propagated on
   5-level page tables, otherwise the PGD entry is incorrectly set to
   non-encrypted, which causes system crashes during boot.

 - Undo the deferred 5-level page table setup as it cannot work with
   memory encryption enabled.

 - Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset
   to the default value but the cached variable is not, so subsequent
   comparisons might yield the wrong result and as a consequence the
   result prevents updating the MSR.

 - Register the local APIC address only once in the MPPARSE enumeration
   to prevent triggering the related WARN_ONs() in the APIC and topology
   code.

 - Handle the case where no APIC is found gracefully by registering a
   fake APIC in the topology code. That makes all related topology
   functions work correctly and does not affect the actual APIC driver
   code at all.

 - Don't evaluate logical IDs during early boot as the local APIC IDs
   are not yet enumerated and the invoked function returns an error
   code. Nothing requires the logical IDs before the final CPUID
   enumeration takes place, which happens after the enumeration.

 - Cure the fallout of the per CPU rework on UP which misplaced the
   copying of boot_cpu_data to per CPU data so that the final update to
   boot_cpu_data got lost which caused inconsistent state and boot
   crashes.

 - Use copy_from_kernel_nofault() in the kprobes setup as there is no
   guarantee that the address can be safely accessed.

 - Reorder struct members in struct saved_context to work around another
   kmemleak false positive

 - Remove the buggy code which tries to update the E820 kexec table for
   setup_data as that is never passed to the kexec kernel.

 - Update the resource control documentation to use the proper units.

 - Fix a Kconfig warning observed with tinyconfig

* tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/64: Move 5-level paging global variable assignments back
  x86/boot/64: Apply encryption mask to 5-level pagetable update
  x86/cpu: Add model number for another Intel Arrow Lake mobile processor
  x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
  Documentation/x86: Document that resctrl bandwidth control units are MiB
  x86/mpparse: Register APIC address only once
  x86/topology: Handle the !APIC case gracefully
  x86/topology: Don't evaluate logical IDs during early boot
  x86/cpu: Ensure that CPU info updates are propagated on UP
  kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
  x86/pm: Work around false positive kmemleak report in msr_build_context()
  x86/kexec: Do not update E820 kexec table for setup_data
  x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
2024-03-24 11:13:56 -07:00
Tony Luck
8a8a9c9047 x86/cpu: Add model number for another Intel Arrow Lake mobile processor
This one is the regular laptop CPU.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240322161725.195614-1-tony.luck@intel.com
2024-03-24 04:08:10 +01:00
Anton Altaparmakov
e3f269ed0a x86/pm: Work around false positive kmemleak report in msr_build_context()
Since:

  7ee18d6779 ("x86/power: Make restore_processor_context() sane")

kmemleak reports this issue:

  unreferenced object 0xf68241e0 (size 32):
    comm "swapper/0", pid 1, jiffies 4294668610 (age 68.432s)
    hex dump (first 32 bytes):
      00 cc cc cc 29 10 01 c0 00 00 00 00 00 00 00 00  ....)...........
      00 42 82 f6 cc cc cc cc cc cc cc cc cc cc cc cc  .B..............
    backtrace:
      [<461c1d50>] __kmem_cache_alloc_node+0x106/0x260
      [<ea65e13b>] __kmalloc+0x54/0x160
      [<c3858cd2>] msr_build_context.constprop.0+0x35/0x100
      [<46635aff>] pm_check_save_msr+0x63/0x80
      [<6b6bb938>] do_one_initcall+0x41/0x1f0
      [<3f3add60>] kernel_init_freeable+0x199/0x1e8
      [<3b538fde>] kernel_init+0x1a/0x110
      [<938ae2b2>] ret_from_fork+0x1c/0x28

Which is a false positive.

Reproducer:

  - Run rsync of whole kernel tree (multiple times if needed).
  - start a kmemleak scan
  - Note this is just an example: a lot of our internal tests hit these.

The root cause is similar to the fix in:

  b0b592cf08 x86/pm: Fix false positive kmemleak report in msr_build_context()

ie. the alignment within the packed struct saved_context
which has everything unaligned as there is only "u16 gs;" at start of
struct where in the past there were four u16 there thus aligning
everything afterwards.  The issue is with the fact that Kmemleak only
searches for pointers that are aligned (see how pointers are scanned in
kmemleak.c) so when the struct members are not aligned it doesn't see
them.

Testing:

We run a lot of tests with our CI, and after applying this fix we do not
see any kmemleak issues any more whilst without it we see hundreds of
the above report. From a single, simple test run consisting of 416 individual test
cases on kernel 5.10 x86 with kmemleak enabled we got 20 failures due to this,
which is quite a lot. With this fix applied we get zero kmemleak related failures.

Fixes: 7ee18d6779 ("x86/power: Make restore_processor_context() sane")
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: stable@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240314142656.17699-1-anton@tuxera.com
2024-03-22 11:01:31 +01:00
Linus Torvalds
cfce216e14 Merge tag 'hyperv-next-signed-20240320' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:

 - Use Hyper-V entropy to seed guest random number generator (Michael
   Kelley)

 - Convert to platform remove callback returning void for vmbus (Uwe
   Kleine-König)

 - Introduce hv_get_hypervisor_version function (Nuno Das Neves)

 - Rename some HV_REGISTER_* defines for consistency (Nuno Das Neves)

 - Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_* (Nuno Das
   Neves)

 - Cosmetic changes for hv_spinlock.c (Purna Pavan Chandra Aekkaladevi)

 - Use per cpu initial stack for vtl context (Saurabh Sengar)

* tag 'hyperv-next-signed-20240320' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/hyperv: Use Hyper-V entropy to seed guest random number generator
  x86/hyperv: Cosmetic changes for hv_spinlock.c
  hyperv-tlfs: Rename some HV_REGISTER_* defines for consistency
  hv: vmbus: Convert to platform remove callback returning void
  mshyperv: Introduce hv_get_hypervisor_version function
  x86/hyperv: Use per cpu initial stack for vtl context
  hyperv-tlfs: Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_*
2024-03-21 10:01:02 -07:00
Linus Torvalds
0815d5cc7d Merge tag 'for-linus-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:

 - Xen event channel handling fix for a regression with a rare kernel
   config and some added hardening

 - better support of running Xen dom0 in PVH mode

 - a cleanup for the xen grant-dma-iommu driver

* tag 'for-linus-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/events: increment refcnt only if event channel is refcounted
  xen/evtchn: avoid WARN() when unbinding an event channel
  x86/xen: attempt to inflate the memory balloon on PVH
  xen/grant-dma-iommu: Convert to platform remove callback returning void
2024-03-19 08:48:09 -07:00
Paolo Bonzini
f3c80061c0 KVM: SEV: fix compat ABI for KVM_MEMORY_ENCRYPT_OP
The data structs for KVM_MEMORY_ENCRYPT_OP have different sizes for 32- and 64-bit
userspace, but they do not make any attempt to convert from one ABI to the other
when 32-bit userspace is running on 64-bit kernels.  This configuration never
worked, and SEV is only for 64-bit kernels so we're not breaking ABI on 32-bit
kernels.

Fix this by adding the appropriate padding; no functional change intended
for 64-bit userspace.

Reviewed-by: Michael Roth <michael.roth@amd.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-03-18 19:03:52 -04:00
Paolo Bonzini
c822a075ab Merge tag 'kvm-x86-asyncpf_abi-6.9' of https://github.com/kvm-x86/linux into HEAD
Guest-side KVM async #PF ABI cleanup for 6.9

Delete kvm_vcpu_pv_apf_data.enabled to fix a goof in KVM's async #PF ABI where
the enabled field pushes the size of "struct kvm_vcpu_pv_apf_data" from 64 to
68 bytes, i.e. beyond a single cache line.

The enabled field is purely a guest-side flag that Linux-as-a-guest uses to
track whether or not the guest has enabled async #PF support.  The actual flag
that is passed to the host, i.e. to KVM proper, is a single bit in a synthetic
MSR, MSR_KVM_ASYNC_PF_EN, i.e. is in a location completely unrelated to the
shared kvm_vcpu_pv_apf_data structure.

Simply drop the the field and use a dedicated guest-side per-CPU variable to
fix the ABI, as opposed to fixing the documentation to match reality.  KVM has
never consumed kvm_vcpu_pv_apf_data.enabled, so the odds of the ABI change
breaking anything are extremely low.
2024-03-18 19:03:42 -04:00
Linus Torvalds
4f712ee0cb Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "S390:

   - Changes to FPU handling came in via the main s390 pull request

   - Only deliver to the guest the SCLP events that userspace has
     requested

   - More virtual vs physical address fixes (only a cleanup since
     virtual and physical address spaces are currently the same)

   - Fix selftests undefined behavior

  x86:

   - Fix a restriction that the guest can't program a PMU event whose
     encoding matches an architectural event that isn't included in the
     guest CPUID. The enumeration of an architectural event only says
     that if a CPU supports an architectural event, then the event can
     be programmed *using the architectural encoding*. The enumeration
     does NOT say anything about the encoding when the CPU doesn't
     report support the event *in general*. It might support it, and it
     might support it using the same encoding that made it into the
     architectural PMU spec

   - Fix a variety of bugs in KVM's emulation of RDPMC (more details on
     individual commits) and add a selftest to verify KVM correctly
     emulates RDMPC, counter availability, and a variety of other
     PMC-related behaviors that depend on guest CPUID and therefore are
     easier to validate with selftests than with custom guests (aka
     kvm-unit-tests)

   - Zero out PMU state on AMD if the virtual PMU is disabled, it does
     not cause any bug but it wastes time in various cases where KVM
     would check if a PMC event needs to be synthesized

   - Optimize triggering of emulated events, with a nice ~10%
     performance improvement in VM-Exit microbenchmarks when a vPMU is
     exposed to the guest

   - Tighten the check for "PMI in guest" to reduce false positives if
     an NMI arrives in the host while KVM is handling an IRQ VM-Exit

   - Fix a bug where KVM would report stale/bogus exit qualification
     information when exiting to userspace with an internal error exit
     code

   - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support

   - Rework TDP MMU root unload, free, and alloc to run with mmu_lock
     held for read, e.g. to avoid serializing vCPUs when userspace
     deletes a memslot

   - Tear down TDP MMU page tables at 4KiB granularity (used to be
     1GiB). KVM doesn't support yielding in the middle of processing a
     zap, and 1GiB granularity resulted in multi-millisecond lags that
     are quite impolite for CONFIG_PREEMPT kernels

   - Allocate write-tracking metadata on-demand to avoid the memory
     overhead when a kernel is built with i915 virtualization support
     but the workloads use neither shadow paging nor i915 virtualization

   - Explicitly initialize a variety of on-stack variables in the
     emulator that triggered KMSAN false positives

   - Fix the debugregs ABI for 32-bit KVM

   - Rework the "force immediate exit" code so that vendor code
     ultimately decides how and when to force the exit, which allowed
     some optimization for both Intel and AMD

   - Fix a long-standing bug where kvm_has_noapic_vcpu could be left
     elevated if vCPU creation ultimately failed, causing extra
     unnecessary work

   - Cleanup the logic for checking if the currently loaded vCPU is
     in-kernel

   - Harden against underflowing the active mmu_notifier invalidation
     count, so that "bad" invalidations (usually due to bugs elsehwere
     in the kernel) are detected earlier and are less likely to hang the
     kernel

  x86 Xen emulation:

   - Overlay pages can now be cached based on host virtual address,
     instead of guest physical addresses. This removes the need to
     reconfigure and invalidate the cache if the guest changes the gpa
     but the underlying host virtual address remains the same

   - When possible, use a single host TSC value when computing the
     deadline for Xen timers in order to improve the accuracy of the
     timer emulation

   - Inject pending upcall events when the vCPU software-enables its
     APIC to fix a bug where an upcall can be lost (and to follow Xen's
     behavior)

   - Fall back to the slow path instead of warning if "fast" IRQ
     delivery of Xen events fails, e.g. if the guest has aliased xAPIC
     IDs

  RISC-V:

   - Support exception and interrupt handling in selftests

   - New self test for RISC-V architectural timer (Sstc extension)

   - New extension support (Ztso, Zacas)

   - Support userspace emulation of random number seed CSRs

  ARM:

   - Infrastructure for building KVM's trap configuration based on the
     architectural features (or lack thereof) advertised in the VM's ID
     registers

   - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
     x86's WC) at stage-2, improving the performance of interacting with
     assigned devices that can tolerate it

   - Conversion of KVM's representation of LPIs to an xarray, utilized
     to address serialization some of the serialization on the LPI
     injection path

   - Support for _architectural_ VHE-only systems, advertised through
     the absence of FEAT_E2H0 in the CPU's ID register

   - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
     selftests

  LoongArch:

   - Set reserved bits as zero in CPUCFG

   - Start SW timer only when vcpu is blocking

   - Do not restart SW timer when it is expired

   - Remove unnecessary CSR register saving during enter guest

   - Misc cleanups and fixes as usual

  Generic:

   - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
     always true on all architectures except MIPS (where Kconfig
     determines the available depending on CPU capabilities). It is
     replaced either by an architecture-dependent symbol for MIPS, and
     IS_ENABLED(CONFIG_KVM) everywhere else

   - Factor common "select" statements in common code instead of
     requiring each architecture to specify it

   - Remove thoroughly obsolete APIs from the uapi headers

   - Move architecture-dependent stuff to uapi/asm/kvm.h

   - Always flush the async page fault workqueue when a work item is
     being removed, especially during vCPU destruction, to ensure that
     there are no workers running in KVM code when all references to
     KVM-the-module are gone, i.e. to prevent a very unlikely
     use-after-free if kvm.ko is unloaded

   - Grab a reference to the VM's mm_struct in the async #PF worker
     itself instead of gifting the worker a reference, so that there's
     no need to remember to *conditionally* clean up after the worker

  Selftests:

   - Reduce boilerplate especially when utilize selftest TAP
     infrastructure

   - Add basic smoke tests for SEV and SEV-ES, along with a pile of
     library support for handling private/encrypted/protected memory

   - Fix benign bugs where tests neglect to close() guest_memfd files"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
  selftests: kvm: remove meaningless assignments in Makefiles
  KVM: riscv: selftests: Add Zacas extension to get-reg-list test
  RISC-V: KVM: Allow Zacas extension for Guest/VM
  KVM: riscv: selftests: Add Ztso extension to get-reg-list test
  RISC-V: KVM: Allow Ztso extension for Guest/VM
  RISC-V: KVM: Forward SEED CSR access to user space
  KVM: riscv: selftests: Add sstc timer test
  KVM: riscv: selftests: Change vcpu_has_ext to a common function
  KVM: riscv: selftests: Add guest helper to get vcpu id
  KVM: riscv: selftests: Add exception handling support
  LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
  LoongArch: KVM: Do not restart SW timer when it is expired
  LoongArch: KVM: Start SW timer only when vcpu is blocking
  LoongArch: KVM: Set reserved bits as zero in CPUCFG
  KVM: selftests: Explicitly close guest_memfd files in some gmem tests
  KVM: x86/xen: fix recursive deadlock in timer injection
  KVM: pfncache: simplify locking and make more self-contained
  KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
  KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
  KVM: x86/xen: improve accuracy of Xen timers
  ...
2024-03-15 13:03:13 -07:00
Linus Torvalds
902861e34c Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
   from hotplugged memory rather than only from main memory. Series
   "implement "memmap on memory" feature on s390".

 - More folio conversions from Matthew Wilcox in the series

	"Convert memcontrol charge moving to use folios"
	"mm: convert mm counter to take a folio"

 - Chengming Zhou has optimized zswap's rbtree locking, providing
   significant reductions in system time and modest but measurable
   reductions in overall runtimes. The series is "mm/zswap: optimize the
   scalability of zswap rb-tree".

 - Chengming Zhou has also provided the series "mm/zswap: optimize zswap
   lru list" which provides measurable runtime benefits in some
   swap-intensive situations.

 - And Chengming Zhou further optimizes zswap in the series "mm/zswap:
   optimize for dynamic zswap_pools". Measured improvements are modest.

 - zswap cleanups and simplifications from Yosry Ahmed in the series
   "mm: zswap: simplify zswap_swapoff()".

 - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
   contributed several DAX cleanups as well as adding a sysfs tunable to
   control the memmap_on_memory setting when the dax device is
   hotplugged as system memory.

 - Johannes Weiner has added the large series "mm: zswap: cleanups",
   which does that.

 - More DAMON work from SeongJae Park in the series

	"mm/damon: make DAMON debugfs interface deprecation unignorable"
	"selftests/damon: add more tests for core functionalities and corner cases"
	"Docs/mm/damon: misc readability improvements"
	"mm/damon: let DAMOS feeds and tame/auto-tune itself"

 - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
   extension" Rakie Kim has developed a new mempolicy interleaving
   policy wherein we allocate memory across nodes in a weighted fashion
   rather than uniformly. This is beneficial in heterogeneous memory
   environments appearing with CXL.

 - Christophe Leroy has contributed some cleanup and consolidation work
   against the ARM pagetable dumping code in the series "mm: ptdump:
   Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".

 - Luis Chamberlain has added some additional xarray selftesting in the
   series "test_xarray: advanced API multi-index tests".

 - Muhammad Usama Anjum has reworked the selftest code to make its
   human-readable output conform to the TAP ("Test Anything Protocol")
   format. Amongst other things, this opens up the use of third-party
   tools to parse and process out selftesting results.

 - Ryan Roberts has added fork()-time PTE batching of THP ptes in the
   series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
   targeted at arm64, this significantly speeds up fork() when the
   process has a large number of pte-mapped folios.

 - David Hildenbrand also gets in on the THP pte batching game in his
   series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
   implements batching during munmap() and other pte teardown
   situations. The microbenchmark improvements are nice.

 - And in the series "Transparent Contiguous PTEs for User Mappings"
   Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
   mappings"). Kernel build times on arm64 improved nicely. Ryan's
   series "Address some contpte nits" provides some followup work.

 - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
   fixed an obscure hugetlb race which was causing unnecessary page
   faults. He has also added a reproducer under the selftest code.

 - In the series "selftests/mm: Output cleanups for the compaction
   test", Mark Brown did what the title claims.

 - Kinsey Ho has added the series "mm/mglru: code cleanup and
   refactoring".

 - Even more zswap material from Nhat Pham. The series "fix and extend
   zswap kselftests" does as claimed.

 - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
   regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
   in our handling of DAX on archiecctures which have virtually aliasing
   data caches. The arm architecture is the main beneficiary.

 - Lokesh Gidra's series "per-vma locks in userfaultfd" provides
   dramatic improvements in worst-case mmap_lock hold times during
   certain userfaultfd operations.

 - Some page_owner enhancements and maintenance work from Oscar Salvador
   in his series

	"page_owner: print stacks and their outstanding allocations"
	"page_owner: Fixup and cleanup"

 - Uladzislau Rezki has contributed some vmalloc scalability
   improvements in his series "Mitigate a vmap lock contention". It
   realizes a 12x improvement for a certain microbenchmark.

 - Some kexec/crash cleanup work from Baoquan He in the series "Split
   crash out from kexec and clean up related config items".

 - Some zsmalloc maintenance work from Chengming Zhou in the series

	"mm/zsmalloc: fix and optimize objects/page migration"
	"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"

 - Zi Yan has taught the MM to perform compaction on folios larger than
   order=0. This a step along the path to implementaton of the merging
   of large anonymous folios. The series is named "Enable >0 order folio
   memory compaction".

 - Christoph Hellwig has done quite a lot of cleanup work in the
   pagecache writeback code in his series "convert write_cache_pages()
   to an iterator".

 - Some modest hugetlb cleanups and speedups in Vishal Moola's series
   "Handle hugetlb faults under the VMA lock".

 - Zi Yan has changed the page splitting code so we can split huge pages
   into sizes other than order-0 to better utilize large folios. The
   series is named "Split a folio to any lower order folios".

 - David Hildenbrand has contributed the series "mm: remove
   total_mapcount()", a cleanup.

 - Matthew Wilcox has sought to improve the performance of bulk memory
   freeing in his series "Rearrange batched folio freeing".

 - Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
   provides large improvements in bootup times on large machines which
   are configured to use large numbers of hugetlb pages.

 - Matthew Wilcox's series "PageFlags cleanups" does that.

 - Qi Zheng's series "minor fixes and supplement for ptdesc" does that
   also. S390 is affected.

 - Cleanups to our pagemap utility functions from Peter Xu in his series
   "mm/treewide: Replace pXd_large() with pXd_leaf()".

 - Nico Pache has fixed a few things with our hugepage selftests in his
   series "selftests/mm: Improve Hugepage Test Handling in MM
   Selftests".

 - Also, of course, many singleton patches to many things. Please see
   the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
2024-03-14 17:43:30 -07:00
Linus Torvalds
70ef654469 Merge tag 'efi-next-for-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:

 - Measure initrd and command line using the CC protocol if the ordinary
   TCG2 protocol is not implemented, typically on TDX confidential VMs

 - Avoid creating mappings that are both writable and executable while
   running in the EFI boot services. This is a prerequisite for getting
   the x86 shim loader signed by MicroSoft again, which allows the
   distros to install on x86 PCs that ship with EFI secure boot enabled.

 - API update for struct platform_driver::remove()

* tag 'efi-next-for-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  virt: efi_secret: Convert to platform remove callback returning void
  x86/efistub: Remap kernel text read-only before dropping NX attribute
  efi/libstub: Add get_event_log() support for CC platforms
  efi/libstub: Measure into CC protocol if TCG2 protocol is absent
  efi/libstub: Add Confidential Computing (CC) measurement typedefs
  efi/tpm: Use symbolic GUID name from spec for final events table
  efi/libstub: Use TPM event typedefs from the TCG PC Client spec
2024-03-13 12:37:41 -07:00
Roger Pau Monne
38620fc4e8 x86/xen: attempt to inflate the memory balloon on PVH
When running as PVH or HVM Linux will use holes in the memory map as scratch
space to map grants, foreign domain pages and possibly miscellaneous other
stuff.  However the usage of such memory map holes for Xen purposes can be
problematic.  The request of holesby Xen happen quite early in the kernel boot
process (grant table setup already uses scratch map space), and it's possible
that by then not all devices have reclaimed their MMIO space.  It's not
unlikely for chunks of Xen scratch map space to end up using PCI bridge MMIO
window memory, which (as expected) causes quite a lot of issues in the system.

At least for PVH dom0 we have the possibility of using regions marked as
UNUSABLE in the e820 memory map.  Either if the region is UNUSABLE in the
native memory map, or it has been converted into UNUSABLE in order to hide RAM
regions from dom0, the second stage translation page-tables can populate those
areas without issues.

PV already has this kind of logic, where the balloon driver is inflated at
boot.  Re-use the current logic in order to also inflate it when running as
PVH.  onvert UNUSABLE regions up to the ratio specified in EXTRA_MEM_RATIO to
RAM, while reserving them using xen_add_extra_mem() (which is also moved so
it's no longer tied to CONFIG_PV).

[jgross: fixed build for CONFIG_PVH without CONFIG_XEN_PVH]

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20240220174341.56131-1-roger.pau@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2024-03-13 17:48:26 +01:00
Linus Torvalds
216532e147 Merge tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
 "As is pretty normal for this tree, there are changes all over the
  place, especially for small fixes, selftest improvements, and improved
  macro usability.

  Some header changes ended up landing via this tree as they depended on
  the string header cleanups. Also, a notable set of changes is the work
  for the reintroduction of the UBSAN signed integer overflow sanitizer
  so that we can continue to make improvements on the compiler side to
  make this sanitizer a more viable future security hardening option.

  Summary:

   - string.h and related header cleanups (Tanzir Hasan, Andy
     Shevchenko)

   - VMCI memcpy() usage and struct_size() cleanups (Vasiliy Kovalev,
     Harshit Mogalapalli)

   - selftests/powerpc: Fix load_unaligned_zeropad build failure
     (Michael Ellerman)

   - hardened Kconfig fragment updates (Marco Elver, Lukas Bulwahn)

   - Handle tail call optimization better in LKDTM (Douglas Anderson)

   - Use long form types in overflow.h (Andy Shevchenko)

   - Add flags param to string_get_size() (Andy Shevchenko)

   - Add Coccinelle script for potential struct_size() use (Jacob
     Keller)

   - Fix objtool corner case under KCFI (Josh Poimboeuf)

   - Drop 13 year old backward compat CAP_SYS_ADMIN check (Jingzi Meng)

   - Add str_plural() helper (Michal Wajdeczko, Kees Cook)

   - Ignore relocations in .notes section

   - Add comments to explain how __is_constexpr() works

   - Fix m68k stack alignment expectations in stackinit Kunit test

   - Convert string selftests to KUnit

   - Add KUnit tests for fortified string functions

   - Improve reporting during fortified string warnings

   - Allow non-type arg to type_max() and type_min()

   - Allow strscpy() to be called with only 2 arguments

   - Add binary mode to leaking_addresses scanner

   - Various small cleanups to leaking_addresses scanner

   - Adding wrapping_*() arithmetic helper

   - Annotate initial signed integer wrap-around in refcount_t

   - Add explicit UBSAN section to MAINTAINERS

   - Fix UBSAN self-test warnings

   - Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL

   - Reintroduce UBSAN's signed overflow sanitizer"

* tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (51 commits)
  selftests/powerpc: Fix load_unaligned_zeropad build failure
  string: Convert helpers selftest to KUnit
  string: Convert selftest to KUnit
  sh: Fix build with CONFIG_UBSAN=y
  compiler.h: Explain how __is_constexpr() works
  overflow: Allow non-type arg to type_max() and type_min()
  VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler()
  lib/string_helpers: Add flags param to string_get_size()
  x86, relocs: Ignore relocations in .notes section
  objtool: Fix UNWIND_HINT_{SAVE,RESTORE} across basic blocks
  overflow: Use POD in check_shl_overflow()
  lib: stackinit: Adjust target string to 8 bytes for m68k
  sparc: vdso: Disable UBSAN instrumentation
  kernel.h: Move lib/cmdline.c prototypes to string.h
  leaking_addresses: Provide mechanism to scan binary files
  leaking_addresses: Ignore input device status lines
  leaking_addresses: Use File::Temp for /tmp files
  MAINTAINERS: Update LEAKING_ADDRESSES details
  fortify: Improve buffer overflow reporting
  fortify: Add KUnit tests for runtime overflows
  ...
2024-03-12 14:49:30 -07:00
Linus Torvalds
65d287c7eb Merge tag 'asm-generic-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
 "Just two small updates this time:

   - A series I did to unify the definition of PAGE_SIZE through
     Kconfig, intended to help with a vdso rework that needs the
     constant but cannot include the normal kernel headers when building
     the compat VDSO on arm64 and potentially others

   - a patch from Yan Zhao to remove the pfn_to_virt() definitions from
     a couple of architectures after finding they were both incorrect
     and entirely unused"

* tag 'asm-generic-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  arch: define CONFIG_PAGE_SIZE_*KB on all architectures
  arch: simplify architecture specific page size configuration
  arch: consolidate existing CONFIG_PAGE_SIZE_*KB definitions
  mm: Remove broken pfn_to_virt() on arch csky/hexagon/openrisc
2024-03-12 10:56:28 -07:00
Linus Torvalds
b29f377119 Merge tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:

 - Continuing work by Ard Biesheuvel to improve the x86 early startup
   code, with the long-term goal to make it position independent:

      - Get rid of early accesses to global objects, either by moving
        them to the stack, deferring the access until later, or dropping
        the globals entirely

      - Move all code that runs early via the 1:1 mapping into
        .head.text, and move code that does not out of it, so that build
        time checks can be added later to ensure that no inadvertent
        absolute references were emitted into code that does not
        tolerate them

      - Remove fixup_pointer() and occurrences of __pa_symbol(), which
        rely on the compiler emitting absolute references, which is not
        guaranteed

 - Improve the early console code

 - Add early console message about ignored NMIs, so that users are at
   least warned about their existence - even if we cannot do anything
   about them

 - Improve the kexec code's kernel load address handling

 - Enable more X86S (simplified x86) bits

 - Simplify early boot GDT handling

 - Micro-optimize the boot code a bit

 - Misc cleanups

* tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  x86/sev: Move early startup code into .head.text section
  x86/sme: Move early SME kernel encryption handling into .head.text
  x86/boot: Move mem_encrypt= parsing to the decompressor
  efi/libstub: Add generic support for parsing mem_encrypt=
  x86/startup_64: Simplify virtual switch on primary boot
  x86/startup_64: Simplify calculation of initial page table address
  x86/startup_64: Defer assignment of 5-level paging global variables
  x86/startup_64: Simplify CR4 handling in startup code
  x86/boot: Use 32-bit XOR to clear registers
  efi/x86: Set the PE/COFF header's NX compat flag unconditionally
  x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]
  x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]
  x86/boot/64: Use RIP_REL_REF() to access early page tables
  x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask'
  x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[]
  x86/boot/64: Use RIP_REL_REF() to assign 'phys_base'
  x86/boot/64: Simplify global variable accesses in GDT/IDT programming
  x86/trampoline: Bypass compat mode in trampoline_start64() if not needed
  kexec: Allocate kernel above bzImage's pref_address
  x86/boot: Add a message about ignored early NMIs
  ...
2024-03-12 09:58:57 -07:00
Linus Torvalds
e66c58f743 Merge tag 'x86-apic-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC fixup from Dave Hansen:
 "Revert VERW fixed addressing patch.

  The reverted commit is not x86/apic material and was cruft left over
  from a merge.

  I believe the sequence of events went something like this:

   - The commit in question was added to x86/urgent

   - x86/urgent was merged into x86/apic to resolve a conflict

   - The commit was zapped from x86/urgent, but *not* from x86/apic

   - x86/apic got pullled (yesterday)

  I think we need to be a bit more vigilant when zapping things to make
  sure none of the other branches are depending on the zapped material"

* tag 'x86-apic-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "x86/bugs: Use fixed addressing for VERW operand"
2024-03-12 09:45:34 -07:00
Linus Torvalds
0e33cf955f Merge tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RFDS mitigation from Dave Hansen:
 "RFDS is a CPU vulnerability that may allow a malicious userspace to
  infer stale register values from kernel space. Kernel registers can
  have all kinds of secrets in them so the mitigation is basically to
  wait until the kernel is about to return to userspace and has user
  values in the registers. At that point there is little chance of
  kernel secrets ending up in the registers and the microarchitectural
  state can be cleared.

  This leverages some recent robustness fixes for the existing MDS
  vulnerability. Both MDS and RFDS use the VERW instruction for
  mitigation"

* tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
  x86/rfds: Mitigate Register File Data Sampling (RFDS)
  Documentation/hw-vuln: Add documentation for RFDS
  x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
2024-03-12 09:31:39 -07:00
Dave Hansen
532a0c57d7 Revert "x86/bugs: Use fixed addressing for VERW operand"
This was reverts commit 8009479ee9.

It was originally in x86/urgent, but was deemed wrong so got zapped.
But in the meantime, x86/urgent had been merged into x86/apic to
resolve a conflict.  I didn't notice the merge so didn't zap it
from x86/apic and it managed to make it up with the x86/apic
material.

The reverted commit is known to cause some KASAN problems.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
2024-03-12 08:33:51 -07:00
Ingo Molnar
2e2bc42c83 Merge branch 'linus' into x86/boot, to resolve conflict
There's a new conflict with Linus's upstream tree, because
in the following merge conflict resolution in <asm/coco.h>:

  38b334fc76 Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Linus has resolved the conflicting placement of 'cc_mask' better
than the original commit:

  1c811d403a x86/sev: Fix position dependent variable references in startup code

... which was also done by an internal merge resolution:

  2e5fc4786b Merge branch 'x86/sev' into x86/boot, to resolve conflicts and to pick up dependent tree

But Linus is right in 38b334fc76, the 'cc_mask' declaration is sufficient
within the #ifdef CONFIG_ARCH_HAS_CC_PLATFORM block.

So instead of forcing Linus to do the same resolution again, merge in Linus's
tree and follow his conflict resolution.

 Conflicts:
	arch/x86/include/asm/coco.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-03-12 09:55:57 +01:00
Linus Torvalds
855684c7d9 Merge tag 'x86_tdx_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 tdx update from Dave Hansen:

 - Fix sparse warning from TDX use of movdir64b()

* tag 'x86_tdx_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/asm: Remove the __iomem annotation of movdir64b()'s dst argument
2024-03-11 20:20:36 -07:00
Linus Torvalds
555b684190 Merge tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Dave Hansen:

 - Add a warning when memory encryption conversions fail. These
   operations require VMM cooperation, even in CoCo environments where
   the VMM is untrusted. While it's _possible_ that memory pressure
   could trigger the new warning, the odds are that a guest would only
   see this from an attacking VMM.

 - Simplify page fault code by re-enabling interrupts unconditionally

 - Avoid truncation issues when pfns are passed in to pfn_to_kaddr()
   with small (<64-bit) types.

* tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/cpa: Warn for set_memory_XXcrypted() VMM fails
  x86/mm: Get rid of conditional IF flag handling in page fault path
  x86/mm: Ensure input to pfn_to_kaddr() is treated as a 64-bit type
2024-03-11 20:07:52 -07:00
Linus Torvalds
685d982112 Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:

 - The biggest change is the rework of the percpu code, to support the
   'Named Address Spaces' GCC feature, by Uros Bizjak:

      - This allows C code to access GS and FS segment relative memory
        via variables declared with such attributes, which allows the
        compiler to better optimize those accesses than the previous
        inline assembly code.

      - The series also includes a number of micro-optimizations for
        various percpu access methods, plus a number of cleanups of %gs
        accesses in assembly code.

      - These changes have been exposed to linux-next testing for the
        last ~5 months, with no known regressions in this area.

 - Fix/clean up __switch_to()'s broken but accidentally working handling
   of FPU switching - which also generates better code

 - Propagate more RIP-relative addressing in assembly code, to generate
   slightly better code

 - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
   make it easier for distros to follow & maintain these options

 - Rework the x86 idle code to cure RCU violations and to clean up the
   logic

 - Clean up the vDSO Makefile logic

 - Misc cleanups and fixes

* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/idle: Select idle routine only once
  x86/idle: Let prefer_mwait_c1_over_halt() return bool
  x86/idle: Cleanup idle_setup()
  x86/idle: Clean up idle selection
  x86/idle: Sanitize X86_BUG_AMD_E400 handling
  sched/idle: Conditionally handle tick broadcast in default_idle_call()
  x86: Increase brk randomness entropy for 64-bit systems
  x86/vdso: Move vDSO to mmap region
  x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
  x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
  x86/retpoline: Ensure default return thunk isn't used at runtime
  x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
  x86/vdso: Use $(addprefix ) instead of $(foreach )
  x86/vdso: Simplify obj-y addition
  x86/vdso: Consolidate targets and clean-files
  x86/bugs: Rename CONFIG_RETHUNK              => CONFIG_MITIGATION_RETHUNK
  x86/bugs: Rename CONFIG_CPU_SRSO             => CONFIG_MITIGATION_SRSO
  x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY       => CONFIG_MITIGATION_IBRS_ENTRY
  x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY      => CONFIG_MITIGATION_UNRET_ENTRY
  x86/bugs: Rename CONFIG_SLS                  => CONFIG_MITIGATION_SLS
  ...
2024-03-11 19:53:15 -07:00
Linus Torvalds
fcc196579a Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups, including a large series from Thomas Gleixner to cure
  sparse warnings"

* tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi: Drop unused declaration of proc_nmi_enabled()
  x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables
  x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
  x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current
  x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address()
  x86/percpu: Cure per CPU madness on UP
  smp: Consolidate smp_prepare_boot_cpu()
  x86/msr: Add missing __percpu annotations
  x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>
  perf/x86/amd/uncore: Fix __percpu annotation
  x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP)
  x86/apm_32: Remove dead function apm_get_battery_status()
  x86/insn-eval: Fix function param name in get_eff_addr_sib()
2024-03-11 19:37:56 -07:00
Linus Torvalds
d69ad12c78 Merge tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Ingo Molnar:

 - Reduce <asm/bootparam.h> dependencies

 - Simplify <asm/efi.h>

 - Unify *_setup_data definitions into <asm/setup_data.h>

 - Reduce the size of <asm/bootparam.h>

* tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Do not include <asm/bootparam.h> in several files
  x86/efi: Implement arch_ima_efi_boot_mode() in source file
  x86/setup: Move internal setup_data structures into setup_data.h
  x86/setup: Move UAPI setup structures into setup_data.h
2024-03-11 19:23:16 -07:00
Linus Torvalds
a5b1a017cb Merge tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:

 - Micro-optimize local_xchg() and the rtmutex code on x86

 - Fix percpu-rwsem contention tracepoints

 - Simplify debugging Kconfig dependencies

 - Update/clarify the documentation of atomic primitives

 - Misc cleanups

* tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rtmutex: Use try_cmpxchg_relaxed() in mark_rt_mutex_waiters()
  locking/x86: Implement local_xchg() using CMPXCHG without the LOCK prefix
  locking/percpu-rwsem: Trigger contention tracepoints only if contended
  locking/rwsem: Make DEBUG_RWSEMS and PREEMPT_RT mutually exclusive
  locking/rwsem: Clarify that RWSEM_READER_OWNED is just a hint
  locking/mutex: Simplify <linux/mutex.h>
  locking/qspinlock: Fix 'wait_early' set but not used warning
  locking/atomic: scripts: Clarify ordering of conditional atomics
2024-03-11 18:33:03 -07:00
Linus Torvalds
b0402403e5 Merge tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:

 - Add a FRU (Field Replaceable Unit) memory poison manager which
   collects and manages previously encountered hw errors in order to
   save them to persistent storage across reboots. Previously recorded
   errors are "replayed" upon reboot in order to poison memory which has
   caused said errors in the past.

   The main use case is stacked, on-chip memory which cannot simply be
   replaced so poisoning faulty areas of it and thus making them
   inaccessible is the only strategy to prolong its lifetime.

 - Add an AMD address translation library glue which converts the
   reported addresses of hw errors into system physical addresses in
   order to be used by other subsystems like memory failure, for
   example. Add support for MI300 accelerators to that library.

 - igen6: Add support for Alder Lake-N SoC

 - i10nm: Add Grand Ridge support

 - The usual fixlets and cleanups

* tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/versal: Convert to platform remove callback returning void
  RAS/AMD/FMPM: Fix off by one when unwinding on error
  RAS/AMD/FMPM: Add debugfs interface to print record entries
  RAS/AMD/FMPM: Save SPA values
  RAS: Export helper to get ras_debugfs_dir
  RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
  RAS: Introduce a FRU memory poison manager
  RAS/AMD/ATL: Add MI300 row retirement support
  Documentation: Move RAS section to admin-guide
  EDAC/versal: Make the bit position of injected errors configurable
  EDAC/i10nm: Add Intel Grand Ridge micro-server support
  EDAC/igen6: Add one more Intel Alder Lake-N SoC support
  RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
  RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
  RAS/AMD/ATL: Add MI300 support
  Documentation: RAS: Add index and address translation section
  EDAC/amd64: Use new AMD Address Translation Library
  RAS: Introduce AMD Address Translation Library
  EDAC/synopsys: Convert to devm_platform_ioremap_resource()
2024-03-11 18:14:06 -07:00
Linus Torvalds
38b334fc76 Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:

 - Add the x86 part of the SEV-SNP host support.

   This will allow the kernel to be used as a KVM hypervisor capable of
   running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP
   is the ultimate goal of the AMD confidential computing side,
   providing the most comprehensive confidential computing environment
   up to date.

   This is the x86 part and there is a KVM part which did not get ready
   in time for the merge window so latter will be forthcoming in the
   next cycle.

 - Rework the early code's position-dependent SEV variable references in
   order to allow building the kernel with clang and -fPIE/-fPIC and
   -mcmodel=kernel

 - The usual set of fixes, cleanups and improvements all over the place

* tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/sev: Disable KMSAN for memory encryption TUs
  x86/sev: Dump SEV_STATUS
  crypto: ccp - Have it depend on AMD_IOMMU
  iommu/amd: Fix failure return from snp_lookup_rmpentry()
  x86/sev: Fix position dependent variable references in startup code
  crypto: ccp: Make snp_range_list static
  x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
  Documentation: virt: Fix up pre-formatted text block for SEV ioctls
  crypto: ccp: Add the SNP_SET_CONFIG command
  crypto: ccp: Add the SNP_COMMIT command
  crypto: ccp: Add the SNP_PLATFORM_STATUS command
  x86/cpufeatures: Enable/unmask SEV-SNP CPU feature
  KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
  crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump
  iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown
  crypto: ccp: Handle legacy SEV commands when SNP is enabled
  crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled
  crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  x86/sev: Introduce an SNP leaked pages list
  crypto: ccp: Provide an API to issue SEV and SNP commands
  ...
2024-03-11 17:44:11 -07:00
Linus Torvalds
2edfd1046f Merge tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull resource control updates from Borislav Petkov:

 - Rework different aspects of the resctrl code like adding
   arch-specific accessors and splitting the locking, in order to
   accomodate ARM's MPAM implementation of hw resource control and be
   able to use the same filesystem control interface like on x86. Work
   by James Morse

 - Improve the memory bandwidth throttling heuristic to handle workloads
   with not too regular load levels which end up penalized unnecessarily

 - Use CPUID to detect the memory bandwidth enforcement limit on AMD

 - The usual set of fixes

* tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  x86/resctrl: Remove lockdep annotation that triggers false positive
  x86/resctrl: Separate arch and fs resctrl locks
  x86/resctrl: Move domain helper migration into resctrl_offline_cpu()
  x86/resctrl: Add CPU offline callback for resctrl work
  x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU
  x86/resctrl: Add CPU online callback for resctrl work
  x86/resctrl: Add helpers for system wide mon/alloc capable
  x86/resctrl: Make rdt_enable_key the arch's decision to switch
  x86/resctrl: Move alloc/mon static keys into helpers
  x86/resctrl: Make resctrl_mounted checks explicit
  x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()
  x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  x86/resctrl: Queue mon_event_read() instead of sending an IPI
  x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
  x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid
  x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding
  x86/resctrl: Track the number of dirty RMID a CLOSID has
  x86/resctrl: Allow RMID allocation to be scoped by CLOSID
  x86/resctrl: Access per-rmid structures by index
  ...
2024-03-11 17:29:55 -07:00
Linus Torvalds
720c857907 Merge tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FRED support from Thomas Gleixner:
 "Support for x86 Fast Return and Event Delivery (FRED).

  FRED is a replacement for IDT event delivery on x86 and addresses most
  of the technical nightmares which IDT exposes:

   1) Exception cause registers like CR2 need to be manually preserved
      in nested exception scenarios.

   2) Hardware interrupt stack switching is suboptimal for nested
      exceptions as the interrupt stack mechanism rewinds the stack on
      each entry which requires a massive effort in the low level entry
      of #NMI code to handle this.

   3) No hardware distinction between entry from kernel or from user
      which makes establishing kernel context more complex than it needs
      to be especially for unconditionally nestable exceptions like NMI.

   4) NMI nesting caused by IRET unconditionally reenabling NMIs, which
      is a problem when the perf NMI takes a fault when collecting a
      stack trace.

   5) Partial restore of ESP when returning to a 16-bit segment

   6) Limitation of the vector space which can cause vector exhaustion
      on large systems.

   7) Inability to differentiate NMI sources

  FRED addresses these shortcomings by:

   1) An extended exception stack frame which the CPU uses to save
      exception cause registers. This ensures that the meta information
      for each exception is preserved on stack and avoids the extra
      complexity of preserving it in software.

   2) Hardware interrupt stack switching is non-rewinding if a nested
      exception uses the currently interrupt stack.

   3) The entry points for kernel and user context are separate and GS
      BASE handling which is required to establish kernel context for
      per CPU variable access is done in hardware.

   4) NMIs are now nesting protected. They are only reenabled on the
      return from NMI.

   5) FRED guarantees full restore of ESP

   6) FRED does not put a limitation on the vector space by design
      because it uses a central entry points for kernel and user space
      and the CPUstores the entry type (exception, trap, interrupt,
      syscall) on the entry stack along with the vector number. The
      entry code has to demultiplex this information, but this removes
      the vector space restriction.

      The first hardware implementations will still have the current
      restricted vector space because lifting this limitation requires
      further changes to the local APIC.

   7) FRED stores the vector number and meta information on stack which
      allows having more than one NMI vector in future hardware when the
      required local APIC changes are in place.

  The series implements the initial FRED support by:

   - Reworking the existing entry and IDT handling infrastructure to
     accomodate for the alternative entry mechanism.

   - Expanding the stack frame to accomodate for the extra 16 bytes FRED
     requires to store context and meta information

   - Providing FRED specific C entry points for events which have
     information pushed to the extended stack frame, e.g. #PF and #DB.

   - Providing FRED specific C entry points for #NMI and #MCE

   - Implementing the FRED specific ASM entry points and the C code to
     demultiplex the events

   - Providing detection and initialization mechanisms and the necessary
     tweaks in context switching, GS BASE handling etc.

  The FRED integration aims for maximum code reuse vs the existing IDT
  implementation to the extent possible and the deviation in hot paths
  like context switching are handled with alternatives to minimalize the
  impact. The low level entry and exit paths are seperate due to the
  extended stack frame and the hardware based GS BASE swichting and
  therefore have no impact on IDT based systems.

  It has been extensively tested on existing systems and on the FRED
  simulation and as of now there are no outstanding problems"

* tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/fred: Fix init_task thread stack pointer initialization
  MAINTAINERS: Add a maintainer entry for FRED
  x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
  x86/fred: Invoke FRED initialization code to enable FRED
  x86/fred: Add FRED initialization functions
  x86/syscall: Split IDT syscall setup code into idt_syscall_init()
  KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
  x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
  x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
  x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
  x86/traps: Add sysvec_install() to install a system interrupt handler
  x86/fred: FRED entry/exit and dispatch code
  x86/fred: Add a machine check entry stub for FRED
  x86/fred: Add a NMI entry stub for FRED
  x86/fred: Add a debug fault entry stub for FRED
  x86/idtentry: Incorporate definitions/declarations of the FRED entries
  x86/fred: Make exc_page_fault() work for FRED
  x86/fred: Allow single-step trap and NMI when starting a new task
  x86/fred: No ESPFIX needed when FRED is enabled
  ...
2024-03-11 16:00:17 -07:00
Linus Torvalds
ca7e917769 Merge tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
 "Rework of APIC enumeration and topology evaluation.

  The current implementation has a couple of shortcomings:

   - It fails to handle hybrid systems correctly.

   - The APIC registration code which handles CPU number assignents is
     in the middle of the APIC code and detached from the topology
     evaluation.

   - The various mechanisms which enumerate APICs, ACPI, MPPARSE and
     guest specific ones, tweak global variables as they see fit or in
     case of XENPV just hack around the generic mechanisms completely.

   - The CPUID topology evaluation code is sprinkled all over the vendor
     code and reevaluates global variables on every hotplug operation.

   - There is no way to analyze topology on the boot CPU before bringing
     up the APs. This causes problems for infrastructure like PERF which
     needs to size certain aspects upfront or could be simplified if
     that would be possible.

   - The APIC admission and CPU number association logic is
     incomprehensible and overly complex and needs to be kept around
     after boot instead of completing this right after the APIC
     enumeration.

  This update addresses these shortcomings with the following changes:

   - Rework the CPUID evaluation code so it is common for all vendors
     and provides information about the APIC ID segments in a uniform
     way independent of the number of segments (Thread, Core, Module,
     ..., Die, Package) so that this information can be computed instead
     of rewriting global variables of dubious value over and over.

   - A few cleanups and simplifcations of the APIC, IO/APIC and related
     interfaces to prepare for the topology evaluation changes.

   - Seperation of the parser stages so the early evaluation which tries
     to find the APIC address can be seperately overridden from the late
     evaluation which enumerates and registers the local APIC as further
     preparation for sanitizing the topology evaluation.

   - A new registration and admission logic which

       - encapsulates the inner workings so that parsers and guest logic
         cannot longer fiddle in it

       - uses the APIC ID segments to build topology bitmaps at
         registration time

       - provides a sane admission logic

       - allows to detect the crash kernel case, where CPU0 does not run
         on the real BSP, automatically. This is required to prevent
         sending INIT/SIPI sequences to the real BSP which would reset
         the whole machine. This was so far handled by a tedious command
         line parameter, which does not even work in nested crash
         scenarios.

       - Associates CPU number after the enumeration completed and
         prevents the late registration of APICs, which was somehow
         tolerated before.

   - Converting all parsers and guest enumeration mechanisms over to the
     new interfaces.

     This allows to get rid of all global variable tweaking from the
     parsers and enumeration mechanisms and sanitizes the XEN[PV]
     handling so it can use CPUID evaluation for the first time.

   - Mopping up existing sins by taking the information from the APIC ID
     segment bitmaps.

     This evaluates hybrid systems correctly on the boot CPU and allows
     for cleanups and fixes in the related drivers, e.g. PERF.

  The series has been extensively tested and the minimal late fallout
  due to a broken ACPI/MADT table has been addressed by tightening the
  admission logic further"

* tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
  x86/topology: Ignore non-present APIC IDs in a present package
  x86/apic: Build the x86 topology enumeration functions on UP APIC builds too
  smp: Provide 'setup_max_cpus' definition on UP too
  smp: Avoid 'setup_max_cpus' namespace collision/shadowing
  x86/bugs: Use fixed addressing for VERW operand
  x86/cpu/topology: Get rid of cpuinfo::x86_max_cores
  x86/cpu/topology: Provide __num_[cores|threads]_per_package
  x86/cpu/topology: Rename topology_max_die_per_package()
  x86/cpu/topology: Rename smp_num_siblings
  x86/cpu/topology: Retrieve cores per package from topology bitmaps
  x86/cpu/topology: Use topology logical mapping mechanism
  x86/cpu/topology: Provide logical pkg/die mapping
  x86/cpu/topology: Simplify cpu_mark_primary_thread()
  x86/cpu/topology: Mop up primary thread mask handling
  x86/cpu/topology: Use topology bitmaps for sizing
  x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT
  x86/xen/smp_pv: Count number of vCPUs early
  x86/cpu/topology: Assign hotpluggable CPUIDs during init
  x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug
  x86/topology: Add a mechanism to track topology via APIC IDs
  ...
2024-03-11 15:45:55 -07:00
Linus Torvalds
80a76c60e5 Merge tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull clocksource updates from Thomas Gleixner:
 "Updates for timekeeping and PTP core.

  The cross-timestamp mechanism which allows to correlate hardware
  clocks uses clocksource pointers for describing the correlation.

  That's suboptimal as drivers need to obtain the pointer, which
  requires needless exports and exposing internals. This can all be
  completely avoided by assigning clocksource IDs and using them for
  describing the correlated clock source.

  So this adds clocksource IDs to all clocksources in the tree which can
  be exposed to this mechanism and removes the pointer and now needless
  exports.

  A related improvement for the core and the correlation handling has
  not made it this time, but is expected to get ready for the next
  round"

* tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kvmclock: Unexport kvmclock clocksource
  treewide: Remove system_counterval_t.cs, which is never read
  timekeeping: Evaluate system_counterval_t.cs_id instead of .cs
  ptp/kvm, arm_arch_timer: Set system_counterval_t.cs_id to constant
  x86/kvm, ptp/kvm: Add clocksource ID, set system_counterval_t.cs_id
  x86/tsc: Add clocksource ID, set system_counterval_t.cs_id
  timekeeping: Add clocksource ID to struct system_counterval_t
  x86/tsc: Correct kernel-doc notation
2024-03-11 14:25:18 -07:00
Linus Torvalds
4527e83780 Merge tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
 "Updates for the MSI interrupt subsystem and initial RISC-V MSI
  support.

  The core changes have been adopted from previous work which converted
  ARM[64] to the new per device MSI domain model, which was merged to
  support multiple MSI domain per device. The ARM[64] changes are being
  worked on too, but have not been ready yet. The core and platform-MSI
  changes have been split out to not hold up RISC-V and to avoid that
  RISC-V builds on the scheduled for removal interfaces.

  The core support provides new interfaces to handle wire to MSI bridges
  in a straight forward way and introduces new platform-MSI interfaces
  which are built on top of the per device MSI domain model.

  Once ARM[64] is converted over the old platform-MSI interfaces and the
  related ugliness in the MSI core code will be removed.

  The actual MSI parts for RISC-V were finalized late and have been
  post-poned for the next merge window.

  Drivers:

   - Add a new driver for the Andes hart-level interrupt controller

   - Rework the SiFive PLIC driver to prepare for MSI suport

   - Expand the RISC-V INTC driver to support the new RISC-V AIA
     controller which provides the basis for MSI on RISC-V

   - A few fixup for the fallout of the core changes"

* tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
  irqchip/riscv-intc: Fix low-level interrupt handler setup for AIA
  x86/apic/msi: Use DOMAIN_BUS_GENERIC_MSI for HPET/IO-APIC domain search
  genirq/matrix: Dynamic bitmap allocation
  irqchip/riscv-intc: Add support for RISC-V AIA
  irqchip/sifive-plic: Improve locking safety by using irqsave/irqrestore
  irqchip/sifive-plic: Parse number of interrupts and contexts early in plic_probe()
  irqchip/sifive-plic: Cleanup PLIC contexts upon irqdomain creation failure
  irqchip/sifive-plic: Use riscv_get_intc_hwnode() to get parent fwnode
  irqchip/sifive-plic: Use devm_xyz() for managed allocation
  irqchip/sifive-plic: Use dev_xyz() in-place of pr_xyz()
  irqchip/sifive-plic: Convert PLIC driver into a platform driver
  irqchip/riscv-intc: Introduce Andes hart-level interrupt controller
  irqchip/riscv-intc: Allow large non-standard interrupt number
  genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens
  irqchip/imx-intmux: Handle pure domain searches correctly
  genirq/msi: Provide MSI_FLAG_PARENT_PM_DEV
  genirq/irqdomain: Reroute device MSI create_mapping
  genirq/msi: Provide allocation/free functions for "wired" MSI interrupts
  genirq/msi: Optionally use dev->fwnode for device domain
  genirq/msi: Provide DOMAIN_BUS_WIRED_TO_MSI
  ...
2024-03-11 14:03:03 -07:00
Pawan Gupta
8076fcde01 x86/rfds: Mitigate Register File Data Sampling (RFDS)
RFDS is a CPU vulnerability that may allow userspace to infer kernel
stale data previously used in floating point registers, vector registers
and integer registers. RFDS only affects certain Intel Atom processors.

Intel released a microcode update that uses VERW instruction to clear
the affected CPU buffers. Unlike MDS, none of the affected cores support
SMT.

Add RFDS bug infrastructure and enable the VERW based mitigation by
default, that clears the affected buffers just before exiting to
userspace. Also add sysfs reporting and cmdline parameter
"reg_file_data_sampling" to control the mitigation.

For details see:
Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-11 13:13:48 -07:00
Paolo Bonzini
e9a2bba476 Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM Xen and pfncache changes for 6.9:

 - Rip out the half-baked support for using gfn_to_pfn caches to manage pages
   that are "mapped" into guests via physical addresses.

 - Add support for using gfn_to_pfn caches with only a host virtual address,
   i.e. to bypass the "gfn" stage of the cache.  The primary use case is
   overlay pages, where the guest may change the gfn used to reference the
   overlay page, but the backing hva+pfn remains the same.

 - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
   of a gpa, so that userspace doesn't need to reconfigure and invalidate the
   cache/mapping if the guest changes the gpa (but userspace keeps the resolved
   hva the same).

 - When possible, use a single host TSC value when computing the deadline for
   Xen timers in order to improve the accuracy of the timer emulation.

 - Inject pending upcall events when the vCPU software-enables its APIC to fix
   a bug where an upcall can be lost (and to follow Xen's behavior).

 - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
   events fails, e.g. if the guest has aliased xAPIC IDs.

 - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
   refresh), and drop a now-redundant acquisition of xen_lock (that was
   protecting the shared_info cache) to fix a deadlock due to recursively
   acquiring xen_lock.
2024-03-11 10:42:55 -04:00
Paolo Bonzini
e9025cdd8c Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.9:

 - Fix several bugs where KVM speciously prevents the guest from utilizing
   fixed counters and architectural event encodings based on whether or not
   guest CPUID reports support for the _architectural_ encoding.

 - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads,
   priority of VMX interception vs #GP, PMC types in architectural PMUs, etc.

 - Add a selftest to verify KVM correctly emulates RDMPC, counter availability,
   and a variety of other PMC-related behaviors that depend on guest CPUID,
   i.e. are difficult to validate via KVM-Unit-Tests.

 - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting
   cycles, e.g. when checking if a PMC event needs to be synthesized when
   skipping an instruction.

 - Optimize triggering of emulated events, e.g. for "count instructions" events
   when skipping an instruction, which yields a ~10% performance improvement in
   VM-Exit microbenchmarks when a vPMU is exposed to the guest.

 - Tighten the check for "PMI in guest" to reduce false positives if an NMI
   arrives in the host while KVM is handling an IRQ VM-Exit.
2024-03-11 10:41:09 -04:00
Paolo Bonzini
b00471a552 Merge tag 'kvm-x86-vmx-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.9:

 - Fix a bug where KVM would report stale/bogus exit qualification information
   when exiting to userspace due to an unexpected VM-Exit while the CPU was
   vectoring an exception.

 - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.

 - Clean up the logic for massaging the passthrough MSR bitmaps when userspace
   changes its MSR filter.
2024-03-11 10:31:29 -04:00
Paolo Bonzini
41ebae2ecd Merge tag 'kvm-x86-mmu-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.9:

 - Clean up code related to unprotecting shadow pages when retrying a guest
   instruction after failed #PF-induced emulation.

 - Zap TDP MMU roots at 4KiB granularity to minimize the delay in yielding if
   a reschedule is needed, e.g. if a high priority task needs to run.  Because
   KVM doesn't support yielding in the middle of processing a zapped non-leaf
   SPTE, zapping at 1GiB granularity can result in multi-millisecond lag when
   attempting to schedule in a high priority.

 - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
   read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.

 - Allocate write-tracking metadata on-demand to avoid the memory overhead when
   running kernels built with KVMGT support (external write-tracking enabled),
   but for workloads that don't use nested virtualization (shadow paging) or
   KVMGT.
2024-03-11 10:29:22 -04:00
Paolo Bonzini
c9cd0beae9 Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.9:

 - Explicitly initialize a variety of on-stack variables in the emulator that
   triggered KMSAN false positives (though in fairness in KMSAN, it's comically
   difficult to see that the uninitialized memory is never truly consumed).

 - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
   DR6 and DR7.

 - Rework the "force immediate exit" code so that vendor code ultimately
   decides how and when to force the exit.  This allows VMX to further optimize
   handling preemption timer exits, and allows SVM to avoid sending a duplicate
   IPI (SVM also has a need to force an exit).

 - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
   vCPU creation ultimately failed, and add WARN to guard against similar bugs.

 - Provide a dedicated arch hook for checking if a different vCPU was in-kernel
   (for directed yield), and simplify the logic for checking if the currently
   loaded vCPU is in-kernel.

 - Misc cleanups and fixes.
2024-03-11 10:24:56 -04:00
Paolo Bonzini
233d0bc4d8 Merge tag 'loongarch-kvm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.9

* Set reserved bits as zero in CPUCFG.
* Start SW timer only when vcpu is blocking.
* Do not restart SW timer when it is expired.
* Remove unnecessary CSR register saving during enter guest.
2024-03-11 09:56:54 -04:00
Paolo Bonzini
7d8942d8e7 Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEAD
KVM GUEST_MEMFD fixes for 6.8:

 - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
   avoid creating ABI that KVM can't sanely support.

 - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
   clear that such VMs are purely a development and testing vehicle, and
   come with zero guarantees.

 - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
   is to support confidential VMs with deterministic private memory (SNP
   and TDX) only in the TDP MMU.

 - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
   when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
2024-03-09 11:48:35 -05:00