Commit Graph

20098 Commits

Author SHA1 Message Date
Patrick Bellasi
318e8c339c x86/cpu/kvm: SRSO: Fix possible missing IBPB on VM-Exit
In [1] the meaning of the synthetic IBPB flags has been redefined for a
better separation of concerns:
 - ENTRY_IBPB     -- issue IBPB on entry only
 - IBPB_ON_VMEXIT -- issue IBPB on VM-Exit only
and the Retbleed mitigations have been updated to match this new
semantics.

Commit [2] was merged shortly before [1], and their interaction was not
handled properly. This resulted in IBPB not being triggered on VM-Exit
in all SRSO mitigation configs requesting an IBPB there.

Specifically, an IBPB on VM-Exit is triggered only when
X86_FEATURE_IBPB_ON_VMEXIT is set. However:

 - X86_FEATURE_IBPB_ON_VMEXIT is not set for "spec_rstack_overflow=ibpb",
   because before [1] having X86_FEATURE_ENTRY_IBPB was enough. Hence,
   an IBPB is triggered on entry but the expected IBPB on VM-exit is
   not.

 - X86_FEATURE_IBPB_ON_VMEXIT is not set also when
   "spec_rstack_overflow=ibpb-vmexit" if X86_FEATURE_ENTRY_IBPB is
   already set.

   That's because before [1] this was effectively redundant. Hence, e.g.
   a "retbleed=ibpb spec_rstack_overflow=bpb-vmexit" config mistakenly
   reports the machine still vulnerable to SRSO, despite an IBPB being
   triggered both on entry and VM-Exit, because of the Retbleed selected
   mitigation config.

 - UNTRAIN_RET_VM won't still actually do anything unless
   CONFIG_MITIGATION_IBPB_ENTRY is set.

For "spec_rstack_overflow=ibpb", enable IBPB on both entry and VM-Exit
and clear X86_FEATURE_RSB_VMEXIT which is made superfluous by
X86_FEATURE_IBPB_ON_VMEXIT. This effectively makes this mitigation
option similar to the one for 'retbleed=ibpb', thus re-order the code
for the RETBLEED_MITIGATION_IBPB option to be less confusing by having
all features enabling before the disabling of the not needed ones.

For "spec_rstack_overflow=ibpb-vmexit", guard this mitigation setting
with CONFIG_MITIGATION_IBPB_ENTRY to ensure UNTRAIN_RET_VM sequence is
effectively compiled in. Drop instead the CONFIG_MITIGATION_SRSO guard,
since none of the SRSO compile cruft is required in this configuration.
Also, check only that the required microcode is present to effectively
enabled the IBPB on VM-Exit.

Finally, update the KConfig description for CONFIG_MITIGATION_IBPB_ENTRY
to list also all SRSO config settings enabled by this guard.

Fixes: 864bcaa38e ("x86/cpu/kvm: Provide UNTRAIN_RET_VM") [1]
Fixes: d893832d0e ("x86/srso: Add IBPB on VMEXIT") [2]
Reported-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Patrick Bellasi <derkling@google.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-02-11 10:07:52 -08:00
Linus Torvalds
c545cd3276 Merge tag 'x86-mm-2025-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:

 - The biggest changes are the TLB flushing scalability optimizations,
   to update the mm_cpumask lazily and related changes.

   This feature has both a track record and a continued risk of
   performance regressions, so it was already delayed by a cycle - but
   it's all 100% perfect now™ (Rik van Riel)

 - Also miscellaneous fixes and cleanups. (Gautam Somani, Kirill
   Shutemov, Sebastian Andrzej Siewior)

* tag 'x86-mm-2025-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Remove unnecessary include of <linux/extable.h>
  x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
  x86/mm/selftests: Fix typo in lam.c
  x86/mm/tlb: Only trim the mm_cpumask once a second
  x86/mm/tlb: Also remove local CPU from mm_cpumask if stale
  x86/mm/tlb: Add tracepoint for TLB flush IPI to stale CPU
  x86/mm/tlb: Update mm_cpumask lazily
2025-01-31 10:39:07 -08:00
Linus Torvalds
2a9f04bde0 Merge tag 'rtc-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
 "Not much this cycle, there are multiple small fixes.

  Core:
   - use boolean values with device_init_wakeup()

  Drivers:
   - pcf2127: add BSM support
   - pcf85063: fix possible out of bounds write"

* tag 'rtc-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
  rtc: pcf2127: add BSM support
  rtc: Remove hpet_rtc_dropped_irq()
  dt-bindings: rtc: mxc: Document fsl,imx31-rtc
  rtc: stm32: Use syscon_regmap_lookup_by_phandle_args
  rtc: zynqmp: Fix optional clock name property
  rtc: loongson: clear TOY_MATCH0_REG in loongson_rtc_isr()
  rtc: pcf85063: fix potential OOB write in PCF85063 NVMEM read
  rtc: tps6594: Fix integer overflow on 32bit systems
  rtc: use boolean values with device_init_wakeup()
  rtc: RTC_DRV_SPEAR should not default to y when compile-testing
2025-01-30 17:50:02 -08:00
Linus Torvalds
47cbb41a2e Merge tag 'acpi-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
 "Add a new ACPI-related quirk for Vexia EDU ATLA 10 tablet 5V (Hans de
  Goede) and fix the MADT parsing code so that CPUs with different entry
  types (LAPIC and x2APIC) are initialized in the order in which they
  appear in the MADT as required by the ACPI specification (Zhang Rui)"

* tag 'acpi-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  x86/acpi: Fix LAPIC/x2APIC parsing order
  ACPI: x86: Add skip i2c clients quirk for Vexia EDU ATLA 10 tablet 5V
2025-01-30 15:22:18 -08:00
Joel Granados
1751f872cc treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-28 13:48:37 +01:00
Linus Torvalds
9c5968db9e Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
 "The various patchsets are summarized below. Plus of course many
  indivudual patches which are described in their changelogs.

   - "Allocate and free frozen pages" from Matthew Wilcox reorganizes
     the page allocator so we end up with the ability to allocate and
     free zero-refcount pages. So that callers (ie, slab) can avoid a
     refcount inc & dec

   - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
     use large folios other than PMD-sized ones

   - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
     and fixes for this small built-in kernel selftest

   - "mas_anode_descend() related cleanup" from Wei Yang tidies up part
     of the mapletree code

   - "mm: fix format issues and param types" from Keren Sun implements a
     few minor code cleanups

   - "simplify split calculation" from Wei Yang provides a few fixes and
     a test for the mapletree code

   - "mm/vma: make more mmap logic userland testable" from Lorenzo
     Stoakes continues the work of moving vma-related code into the
     (relatively) new mm/vma.c

   - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
     Hildenbrand cleans up and rationalizes handling of gfp flags in the
     page allocator

   - "readahead: Reintroduce fix for improper RA window sizing" from Jan
     Kara is a second attempt at fixing a readahead window sizing issue.
     It should reduce the amount of unnecessary reading

   - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
     addresses an issue where "huge" amounts of pte pagetables are
     accumulated:

       https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/

     Qi's series addresses this windup by synchronously freeing PTE
     memory within the context of madvise(MADV_DONTNEED)

   - "selftest/mm: Remove warnings found by adding compiler flags" from
     Muhammad Usama Anjum fixes some build warnings in the selftests
     code when optional compiler warnings are enabled

   - "mm: don't use __GFP_HARDWALL when migrating remote pages" from
     David Hildenbrand tightens the allocator's observance of
     __GFP_HARDWALL

   - "pkeys kselftests improvements" from Kevin Brodsky implements
     various fixes and cleanups in the MM selftests code, mainly
     pertaining to the pkeys tests

   - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
     estimate application working set size

   - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
     provides some cleanups to memcg's hugetlb charging logic

   - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
     removes the global swap cgroup lock. A speedup of 10% for a
     tmpfs-based kernel build was demonstrated

   - "zram: split page type read/write handling" from Sergey Senozhatsky
     has several fixes and cleaups for zram in the area of
     zram_write_page(). A watchdog softlockup warning was eliminated

   - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
     Brodsky cleans up the pagetable destructor implementations. A rare
     use-after-free race is fixed

   - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
     simplifies and cleans up the debugging code in the VMA merging
     logic

   - "Account page tables at all levels" from Kevin Brodsky cleans up
     and regularizes the pagetable ctor/dtor handling. This results in
     improvements in accounting accuracy

   - "mm/damon: replace most damon_callback usages in sysfs with new
     core functions" from SeongJae Park cleans up and generalizes
     DAMON's sysfs file interface logic

   - "mm/damon: enable page level properties based monitoring" from
     SeongJae Park increases the amount of information which is
     presented in response to DAMOS actions

   - "mm/damon: remove DAMON debugfs interface" from SeongJae Park
     removes DAMON's long-deprecated debugfs interfaces. Thus the
     migration to sysfs is completed

   - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
     Peter Xu cleans up and generalizes the hugetlb reservation
     accounting

   - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
     removes a never-used feature of the alloc_pages_bulk() interface

   - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
     extends DAMOS filters to support not only exclusion (rejecting),
     but also inclusion (allowing) behavior

   - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
     introduces a new memory descriptor for zswap.zpool that currently
     overlaps with struct page for now. This is part of the effort to
     reduce the size of struct page and to enable dynamic allocation of
     memory descriptors

   - "mm, swap: rework of swap allocator locks" from Kairui Song redoes
     and simplifies the swap allocator locking. A speedup of 400% was
     demonstrated for one workload. As was a 35% reduction for kernel
     build time with swap-on-zram

   - "mm: update mips to use do_mmap(), make mmap_region() internal"
     from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
     mmap_region() can be made MM-internal

   - "mm/mglru: performance optimizations" from Yu Zhao fixes a few
     MGLRU regressions and otherwise improves MGLRU performance

   - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
     Park updates DAMON documentation

   - "Cleanup for memfd_create()" from Isaac Manjarres does that thing

   - "mm: hugetlb+THP folio and migration cleanups" from David
     Hildenbrand provides various cleanups in the areas of hugetlb
     folios, THP folios and migration

   - "Uncached buffered IO" from Jens Axboe implements the new
     RWF_DONTCACHE flag which provides synchronous dropbehind for
     pagecache reading and writing. To permite userspace to address
     issues with massive buildup of useless pagecache when
     reading/writing fast devices

   - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
     Weißschuh fixes and optimizes some of the MM selftests"

* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm/compaction: fix UBSAN shift-out-of-bounds warning
  s390/mm: add missing ctor/dtor on page table upgrade
  kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
  tools: add VM_WARN_ON_VMG definition
  mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
  seqlock: add missing parameter documentation for raw_seqcount_try_begin()
  mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
  mm/page_alloc: remove the incorrect and misleading comment
  zram: remove zcomp_stream_put() from write_incompressible_page()
  mm: separate move/undo parts from migrate_pages_batch()
  mm/kfence: use str_write_read() helper in get_access_type()
  selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
  kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
  selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
  selftests/mm: vm_util: split up /proc/self/smaps parsing
  selftests/mm: virtual_address_range: unmap chunks after validation
  selftests/mm: virtual_address_range: mmap() without PROT_WRITE
  selftests/memfd/memfd_test: fix possible NULL pointer dereference
  mm: add FGP_DONTCACHE folio creation flag
  mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
  ...
2025-01-26 18:36:23 -08:00
Linus Torvalds
c159dfbdd4 Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
 "Mainly individually changelogged singleton patches. The patch series
  in this pull are:

   - "lib min_heap: Improve min_heap safety, testing, and documentation"
     from Kuan-Wei Chiu provides various tightenings to the min_heap
     library code

   - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
     some cleanup and Rust preparation in the xarray library code

   - "Update reference to include/asm-<arch>" from Geert Uytterhoeven
     fixes pathnames in some code comments

   - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
     the new secs_to_jiffies() in various places where that is
     appropriate

   - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
     switches two filesystems to the new mount API

   - "Convert ocfs2 to use folios" from Matthew Wilcox does that

   - "Remove get_task_comm() and print task comm directly" from Yafang
     Shao removes now-unneeded calls to get_task_comm() in various
     places

   - "squashfs: reduce memory usage and update docs" from Phillip
     Lougher implements some memory savings in squashfs and performs
     some maintainability work

   - "lib: clarify comparison function requirements" from Kuan-Wei Chiu
     tightens the sort code's behaviour and adds some maintenance work

   - "nilfs2: protect busy buffer heads from being force-cleared" from
     Ryusuke Konishi fixes an issues in nlifs when the fs is presented
     with a corrupted image

   - "nilfs2: fix kernel-doc comments for function return values" from
     Ryusuke Konishi fixes some nilfs kerneldoc

   - "nilfs2: fix issues with rename operations" from Ryusuke Konishi
     addresses some nilfs BUG_ONs which syzbot was able to trigger

   - "minmax.h: Cleanups and minor optimisations" from David Laight does
     some maintenance work on the min/max library code

   - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
     work on the xarray library code"

* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
  ocfs2: use str_yes_no() and str_no_yes() helper functions
  include/linux/lz4.h: add some missing macros
  Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
  Xarray: remove repeat check in xas_squash_marks()
  Xarray: distinguish large entries correctly in xas_split_alloc()
  Xarray: move forward index correctly in xas_pause()
  Xarray: do not return sibling entries from xas_find_marked()
  ipc/util.c: complete the kernel-doc function descriptions
  gcov: clang: use correct function param names
  latencytop: use correct kernel-doc format for func params
  minmax.h: remove some #defines that are only expanded once
  minmax.h: simplify the variants of clamp()
  minmax.h: move all the clamp() definitions after the min/max() ones
  minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
  minmax.h: reduce the #define expansion of min(), max() and clamp()
  minmax.h: update some comments
  minmax.h: add whitespace around operators and after commas
  nilfs2: do not update mtime of renamed directory that is not moved
  nilfs2: handle errors that nilfs_prepare_chunk() may return
  CREDITS: fix spelling mistake
  ...
2025-01-26 17:50:53 -08:00
Guo Weikang
c6f239796b mm/memblock: add memblock_alloc_or_panic interface
Before SLUB initialization, various subsystems used memblock_alloc to
allocate memory.  In most cases, when memory allocation fails, an
immediate panic is required.  To simplify this behavior and reduce
repetitive checks, introduce `memblock_alloc_or_panic`.  This function
ensures that memory allocation failures result in a panic automatically,
improving code readability and consistency across subsystems that require
this behavior.

[guoweikang.kernel@gmail.com: arch/s390: save_area_alloc default failure behavior changed to panic]
  Link: https://lkml.kernel.org/r/20250109033136.2845676-1-guoweikang.kernel@gmail.com
  Link: https://lore.kernel.org/lkml/Z2fknmnNtiZbCc7x@kernel.org/
Link: https://lkml.kernel.org/r/20250102072528.650926-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>	[s390]
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:38 -08:00
Qi Zheng
ee0934b035 x86: pgtable: move pagetable_dtor() to __tlb_remove_table()
Move pagetable_dtor() to __tlb_remove_table(), so that ptlock and page
table pages can be freed together (regardless of whether RCU is used). 
This prevents the use-after-free problem where the ptlock is freed
immediately but the page table pages is freed later via RCU.

Link: https://lkml.kernel.org/r/27b3cdc8786bebd4f748380bf82f796482718504.1736317725.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:22 -08:00
Qi Zheng
0b6476f939 x86: pgtable: convert __tlb_remove_table() to use struct ptdesc
Convert __tlb_remove_table() to use struct ptdesc, which will help to move
pagetable_dtor() to __tlb_remove_table().

And page tables shouldn't have swap cache, so use pagetable_free() instead
of free_page_and_swap_cache() to free page table pages.

Link: https://lkml.kernel.org/r/39f60f93143ff77cf5d6b3c3e75af0ffc1480adb.1736317725.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:22 -08:00
Linus Torvalds
382e391365 Merge tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:

 - Introduce a new set of Hyper-V headers in include/hyperv and replace
   the old hyperv-tlfs.h with the new headers (Nuno Das Neves)

 - Fixes for the Hyper-V VTL mode (Roman Kisel)

 - Fixes for cpu mask usage in Hyper-V code (Michael Kelley)

 - Document the guest VM hibernation behaviour (Michael Kelley)

 - Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain)

* tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Documentation: hyperv: Add overview of guest VM hibernation
  hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id()
  hyperv: Do not overlap the hvcall IO areas in get_vtl()
  hyperv: Enable the hypercall output page for the VTL mode
  hv_balloon: Fallback to generic_online_page() for non-HV hot added mem
  Drivers: hv: vmbus: Log on missing offers if any
  Drivers: hv: vmbus: Wait for boot-time offers during boot and resume
  uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup
  iommu/hyper-v: Don't assume cpu_possible_mask is dense
  Drivers: hv: Don't assume cpu_possible_mask is dense
  x86/hyperv: Don't assume cpu_possible_mask is dense
  hyperv: Remove the now unused hyperv-tlfs.h files
  hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
  hyperv: Add new Hyper-V headers in include/hyperv
  hyperv: Clean up unnecessary #includes
  hyperv: Move hv_connection_id to hyperv-tlfs.h
2025-01-25 09:22:55 -08:00
Linus Torvalds
5b7f7234ff Merge tag 'x86-boot-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:

 - A large and involved preparatory series to pave the way to add
   exception handling for relocate_kernel - which will be a debugging
   facility that has aided in the field to debug an exceptionally hard
   to debug early boot bug. Plus assorted cleanups and fixes that were
   discovered along the way, by David Woodhouse:

      - Clean up and document register use in relocate_kernel_64.S
      - Use named labels in swap_pages in relocate_kernel_64.S
      - Only swap pages for ::preserve_context mode
      - Allocate PGD for x86_64 transition page tables separately
      - Copy control page into place in machine_kexec_prepare()
      - Invoke copy of relocate_kernel() instead of the original
      - Move relocate_kernel to kernel .data section
      - Add data section to relocate_kernel
      - Drop page_list argument from relocate_kernel()
      - Eliminate writes through kernel mapping of relocate_kernel page
      - Clean up register usage in relocate_kernel()
      - Mark relocate_kernel page as ROX instead of RWX
      - Disable global pages before writing to control page
      - Ensure preserve_context flag is set on return to kernel
      - Use correct swap page in swap_pages function
      - Fix stack and handling of re-entry point for ::preserve_context
      - Mark machine_kexec() with __nocfi
      - Cope with relocate_kernel() not being at the start of the page
      - Use typedef for relocate_kernel_fn function prototype
      - Fix location of relocate_kernel with -ffunction-sections (fix by Nathan Chancellor)

 - A series to remove the last remaining absolute symbol references from
   .head.text, and enforce this at build time, by Ard Biesheuvel:

      - Avoid WARN()s and panic()s in early boot code
      - Don't hang but terminate on failure to remap SVSM CA
      - Determine VA/PA offset before entering C code
      - Avoid intentional absolute symbol references in .head.text
      - Disable UBSAN in early boot code
      - Move ENTRY_TEXT to the start of the image
      - Move .head.text into its own output section
      - Reject absolute references in .head.text

 - The above build-time enforcement uncovered a handful of bugs of
   essentially non-working code, and a wrokaround for a toolchain bug,
   fixed by Ard Biesheuvel as well:

      - Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
      - Disable UBSAN on SEV code that may execute very early
      - Disable ftrace branch profiling in SEV startup code

 - And miscellaneous cleanups:

      - kexec_core: Add and update comments regarding the KEXEC_JUMP flow (Rafael J. Wysocki)
      - x86/sysfs: Constify 'struct bin_attribute' (Thomas Weißschuh)"

* tag 'x86-boot-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  x86/sev: Disable ftrace branch profiling in SEV startup code
  x86/kexec: Use typedef for relocate_kernel_fn function prototype
  x86/kexec: Cope with relocate_kernel() not being at the start of the page
  kexec_core: Add and update comments regarding the KEXEC_JUMP flow
  x86/kexec: Mark machine_kexec() with __nocfi
  x86/kexec: Fix location of relocate_kernel with -ffunction-sections
  x86/kexec: Fix stack and handling of re-entry point for ::preserve_context
  x86/kexec: Use correct swap page in swap_pages function
  x86/kexec: Ensure preserve_context flag is set on return to kernel
  x86/kexec: Disable global pages before writing to control page
  x86/sev: Don't hang but terminate on failure to remap SVSM CA
  x86/sev: Disable UBSAN on SEV code that may execute very early
  x86/boot/64: Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
  x86/sysfs: Constify 'struct bin_attribute'
  x86/kexec: Mark relocate_kernel page as ROX instead of RWX
  x86/kexec: Clean up register usage in relocate_kernel()
  x86/kexec: Eliminate writes through kernel mapping of relocate_kernel page
  x86/kexec: Drop page_list argument from relocate_kernel()
  x86/kexec: Add data section to relocate_kernel
  x86/kexec: Move relocate_kernel to kernel .data section
  ...
2025-01-24 05:54:26 -08:00
Zhang Rui
0141978ae7 x86/acpi: Fix LAPIC/x2APIC parsing order
On some systems, the same CPU (with the same APIC ID) is assigned a
different logical CPU id after commit ec9aedb2aa ("x86/acpi: Ignore
invalid x2APIC entries").

This means that Linux enumerates the CPUs in a different order, which
violates ACPI specification[1] that states:

  "OSPM should initialize processors in the order that they appear in
   the MADT"

The problematic commit parses all LAPIC entries before any x2APIC
entries, aiming to ignore x2APIC entries with APIC ID < 255 when valid
LAPIC entries exist. However, it disrupts the CPU enumeration order on
systems where x2APIC entries precede LAPIC entries in the MADT.

Fix this problem by:

 1) Parsing LAPIC entries first without registering them in the
    topology to evaluate whether valid LAPIC entries exist.

 2) Restoring the MADT in order parser which invokes either the LAPIC
    or the X2APIC parser function depending on the entry type.

The X2APIC parser still ignores entries < 0xff in case that #1 found
valid LAPIC entries independent of their position in the MADT table.

Link: https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#madt-processor-local-apic-sapic-structure-entry-order
Cc: All applicable <stable@vger.kernel.org>
Reported-by: Jim Mattson <jmattson@google.com>
Closes: https://lore.kernel.org/all/20241010213136.668672-1-jmattson@google.com/
Fixes: ec9aedb2aa ("x86/acpi: Ignore invalid x2APIC entries")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Tested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20250117081420.4046737-1-rui.zhang@intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-24 14:27:54 +01:00
Linus Torvalds
2e04247f7c Merge tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ftrace updates from Steven Rostedt:

 - Have fprobes built on top of function graph infrastructure

   The fprobe logic is an optimized kprobe that uses ftrace to attach to
   functions when a probe is needed at the start or end of the function.
   The fprobe and kretprobe logic implements a similar method as the
   function graph tracer to trace the end of the function. That is to
   hijack the return address and jump to a trampoline to do the trace
   when the function exits. To do this, a shadow stack needs to be
   created to store the original return address. Fprobes and function
   graph do this slightly differently. Fprobes (and kretprobes) has
   slots per callsite that are reserved to save the return address. This
   is fine when just a few points are traced. But users of fprobes, such
   as BPF programs, are starting to add many more locations, and this
   method does not scale.

   The function graph tracer was created to trace all functions in the
   kernel. In order to do this, when function graph tracing is started,
   every task gets its own shadow stack to hold the return address that
   is going to be traced. The function graph tracer has been updated to
   allow multiple users to use its infrastructure. Now have fprobes be
   one of those users. This will also allow for the fprobe and kretprobe
   methods to trace the return address to become obsolete. With new
   technologies like CFI that need to know about these methods of
   hijacking the return address, going toward a solution that has only
   one method of doing this will make the kernel less complex.

 - Cleanup with guard() and free() helpers

   There were several places in the code that had a lot of "goto out" in
   the error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.

 - Remove disabling of interrupts in the function graph tracer

   When function graph tracer was first introduced, it could race with
   interrupts and NMIs. To prevent that race, it would disable
   interrupts and not trace NMIs. But the code has changed to allow NMIs
   and also interrupts. This change was done a long time ago, but the
   disabling of interrupts was never removed. Remove the disabling of
   interrupts in the function graph tracer is it is not needed. This
   greatly improves its performance.

 - Allow the :mod: command to enable tracing module functions on the
   kernel command line.

   The function tracer already has a way to enable functions to be
   traced in modules by writing ":mod:<module>" into set_ftrace_filter.
   That will enable either all the functions for the module if it is
   loaded, or if it is not, it will cache that command, and when the
   module is loaded that matches <module>, its functions will be
   enabled. This also allows init functions to be traced. But currently
   events do not have that feature.

   Because enabling function tracing can be done very early at boot up
   (before scheduling is enabled), the commands that can be done when
   function tracing is started is limited. Having the ":mod:" command to
   trace module functions as they are loaded is very useful. Update the
   kernel command line function filtering to allow it.

* tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits)
  ftrace: Implement :mod: cache filtering on kernel command line
  tracing: Adopt __free() and guard() for trace_fprobe.c
  bpf: Use ftrace_get_symaddr() for kprobe_multi probes
  ftrace: Add ftrace_get_symaddr to convert fentry_ip to symaddr
  Documentation: probes: Update fprobe on function-graph tracer
  selftests/ftrace: Add a test case for repeating register/unregister fprobe
  selftests: ftrace: Remove obsolate maxactive syntax check
  tracing/fprobe: Remove nr_maxactive from fprobe
  fprobe: Add fprobe_header encoding feature
  fprobe: Rewrite fprobe on function-graph tracer
  s390/tracing: Enable HAVE_FTRACE_GRAPH_FUNC
  ftrace: Add CONFIG_HAVE_FTRACE_GRAPH_FUNC
  bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled
  tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
  tracing: Add ftrace_fill_perf_regs() for perf event
  tracing: Add ftrace_partial_regs() for converting ftrace_regs to pt_regs
  fprobe: Use ftrace_regs in fprobe exit handler
  fprobe: Use ftrace_regs in fprobe entry handler
  fgraph: Pass ftrace_regs to retfunc
  fgraph: Replace fgraph_ret_regs with ftrace_regs
  ...
2025-01-21 15:15:28 -08:00
Linus Torvalds
4c551165e7 Merge tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull interrupt subsystem updates from Thomas Gleixner:

 - Consolidate the machine_kexec_mask_interrupts() by providing a
   generic implementation and replacing the copy & pasta orgy in the
   relevant architectures.

 - Prevent unconditional operations on interrupt chips during kexec
   shutdown, which can trigger warnings in certain cases when the
   underlying interrupt has been shut down before.

 - Make the enforcement of interrupt handling in interrupt context
   unconditionally available, so that it actually works for non x86
   related interrupt chips. The earlier enablement for ARM GIC chips set
   the required chip flag, but did not notice that the check was hidden
   behind a config switch which is not selected by ARM[64].

 - Decrapify the handling of deferred interrupt affinity setting.

   Some interrupt chips require that affinity changes are made from the
   context of handling an interrupt to avoid certain race conditions.
   For x86 this was the default, but with interrupt remapping this
   requirement was lifted and a flag was introduced which tells the core
   code that affinity changes can be done in any context. Unrestricted
   affinity changes are the default for the majority of interrupt chips.

   RISCV has the requirement to add the deferred mode to one of it's
   interrupt controllers, but with the original implementation this
   would require to add the any context flag to all other RISC-V
   interrupt chips. That's backwards, so reverse the logic and require
   that chips, which need the deferred mode have to be marked
   accordingly. That avoids chasing the 'sane' chips and marking them.

 - Add multi-node support to the Loongarch AVEC interrupt controller
   driver.

 - The usual tiny cleanups, fixes and improvements all over the place.

* tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set()
  genirq/timings: Add kernel-doc for a function parameter
  genirq: Remove IRQ_MOVE_PCNTXT and related code
  x86/apic: Convert to IRQCHIP_MOVE_DEFERRED
  genirq: Provide IRQCHIP_MOVE_DEFERRED
  hexagon: Remove GENERIC_PENDING_IRQ leftover
  ARC: Remove GENERIC_PENDING_IRQ
  genirq: Remove handle_enforce_irqctx() wrapper
  genirq: Make handle_enforce_irqctx() unconditionally available
  irqchip/loongarch-avec: Add multi-nodes topology support
  irqchip/ts4800: Replace seq_printf() by seq_puts()
  irqchip/ti-sci-inta : Add module build support
  irqchip/ti-sci-intr: Add module build support
  irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function
  irqchip: keystone: Use syscon_regmap_lookup_by_phandle_args
  genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown
  kexec: Consolidate machine_kexec_mask_interrupts() implementation
  genirq: Reuse irq_thread_fn() for forced thread case
  genirq: Move irq_thread_fn() further up in the code
2025-01-21 13:51:07 -08:00
Linus Torvalds
62de6e1685 Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "Fair scheduler (SCHED_FAIR) enhancements:

   - Behavioral improvements:
      - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)

   - Delayed-dequeue enhancements & fixes: (Vincent Guittot)
      - Rename h_nr_running into h_nr_queued
      - Add new cfs_rq.h_nr_runnable
      - Use the new cfs_rq.h_nr_runnable
      - Removed unsued cfs_rq.h_nr_delayed
      - Rename cfs_rq.idle_h_nr_running into h_nr_idle
      - Remove unused cfs_rq.idle_nr_running
      - Rename cfs_rq.nr_running into nr_queued
      - Do not try to migrate delayed dequeue task
      - Fix variable declaration position
      - Encapsulate set custom slice in a __setparam_fair() function

   - Fixes:
      - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
      - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal
        Chourasia)

   - Cleanups:
      - Clean up in migrate_degrades_locality() to improve readability
        (Peter Zijlstra)
      - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
      - Update comments after sched_tick() rename (Sebastian Andrzej
        Siewior)
      - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
        (Valentin Schneider)

  Deadline scheduler (SCHED_DL) enhancements:

   - Restore dl_server bandwidth on non-destructive root domain changes
     (Juri Lelli)

   - Correctly account for allocated bandwidth during hotplug (Juri
     Lelli)

   - Check bandwidth overflow earlier for hotplug (Juri Lelli)

   - Clean up goto label in pick_earliest_pushable_dl_task() (John
     Stultz)

   - Consolidate timer cancellation (Wander Lairson Costa)

  Load-balancer enhancements:

   - Improve performance by prioritizing migrating eligible tasks in
     sched_balance_rq() (Hao Jia)

   - Do not compute NUMA Balancing stats unnecessarily during
     load-balancing (K Prateek Nayak)

   - Do not compute overloaded status unnecessarily during
     load-balancing (K Prateek Nayak)

  Generic scheduling code enhancements:

   - Use READ_ONCE() in task_on_rq_queued(), to consistently use the
     WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)

  Isolated CPUs support enhancements: (Waiman Long)

   - Make "isolcpus=nohz" equivalent to "nohz_full"
   - Consolidate housekeeping cpumasks that are always identical
   - Remove HK_TYPE_SCHED
   - Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE

  RSEQ enhancements:

   - Validate read-only fields under DEBUG_RSEQ config (Mathieu
     Desnoyers)

  PSI enhancements:

   - Fix race when task wakes up before psi_sched_switch() adjusts flags
     (Chengming Zhou)

  IRQ time accounting performance enhancements: (Yafang Shao)

   - Define sched_clock_irqtime as static key
   - Don't account irq time if sched_clock_irqtime is disabled

  Virtual machine scheduling enhancements:

   - Don't try to catch up excess steal time (Suleiman Souhlal)

  Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)

   - Convert "sysctl_sched_itmt_enabled" to boolean
   - Use guard() for itmt_update_mutex
   - Move the "sched_itmt_enabled" sysctl to debugfs
   - Remove x86_smt_flags and use cpu_smt_flags directly
   - Use x86_sched_itmt_flags for PKG domain unconditionally

  Debugging code & instrumentation enhancements:

   - Change need_resched warnings to pr_err() (David Rientjes)
   - Print domain name in /proc/schedstat (K Prateek Nayak)
   - Fix value reported by hot tasks pulled in /proc/schedstat (Peter
     Zijlstra)
   - Report the different kinds of imbalances in /proc/schedstat
     (Swapnil Sapkal)
   - Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
   - Update Schedstat version to 17 (Swapnil Sapkal)"

* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  rseq: Fix rseq unregistration regression
  psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
  sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  sched: Don't account irq time if sched_clock_irqtime is disabled
  sched: Define sched_clock_irqtime as static key
  sched/fair: Do not compute overloaded status unnecessarily during lb
  sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
  x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
  x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
  x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
  x86/itmt: Use guard() for itmt_update_mutex
  x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
  sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
  sched/debug: Change need_resched warnings to pr_err
  sched/fair: Encapsulate set custom slice in a __setparam_fair() function
  sched: Fix race between yield_to() and try_to_wake_up()
  docs: Update Schedstat version to 17
  sched/stats: Print domain name in /proc/schedstat
  sched: Move sched domain name out of CONFIG_SCHED_DEBUG
  sched: Report the different kinds of imbalances in /proc/schedstat
  ...
2025-01-21 11:32:36 -08:00
Linus Torvalds
858df1de21 Merge tag 'x86-cleanups-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "Miscellaneous x86 cleanups and typo fixes, and also the removal of
  the 'disablelapic' boot parameter"

* tag 'x86-cleanups-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ioapic: Remove a stray tab in the IO-APIC type string
  x86/cpufeatures: Remove "AMD" from the comments to the AMD-specific leaf
  Documentation/kernel-parameters: Fix a typo in kvm.enable_virt_at_load text
  x86/cpu: Fix typo in x86_match_cpu()'s doc
  x86/apic: Remove "disablelapic" cmdline option
  Documentation: Merge x86-specific boot options doc into kernel-parameters.txt
  x86/ioremap: Remove unused size parameter in remapping functions
  x86/ioremap: Simplify setup_data mapping variants
  x86/boot/compressed: Remove unused header includes from kaslr.c
2025-01-21 11:15:29 -08:00
Linus Torvalds
6c4aa896eb Merge tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
 "Seqlock optimizations that arose in a perf context and were merged
  into the perf tree:

   - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan)
   - mm: Convert mm_lock_seq to a proper seqcount (Suren Baghdasaryan)
   - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren
     Baghdasaryan)
   - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra)

  Core perf enhancements:

   - Reduce 'struct page' footprint of perf by mapping pages in advance
     (Lorenzo Stoakes)
   - Save raw sample data conditionally based on sample type (Yabin Cui)
   - Reduce sampling overhead by checking sample_type in
     perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin
     Cui)
   - Export perf_exclude_event() (Namhyung Kim)

  Uprobes scalability enhancements: (Andrii Nakryiko)

   - Simplify find_active_uprobe_rcu() VMA checks
   - Add speculative lockless VMA-to-inode-to-uprobe resolution
   - Simplify session consumer tracking
   - Decouple return_instance list traversal and freeing
   - Ensure return_instance is detached from the list before freeing
   - Reuse return_instances between multiple uretprobes within task
   - Guard against kmemdup() failing in dup_return_instance()

  AMD core PMU driver enhancements:

   - Relax privilege filter restriction on AMD IBS (Namhyung Kim)

  AMD RAPL energy counters support: (Dhananjay Ugwekar)

   - Introduce topology_logical_core_id() (K Prateek Nayak)
   - Remove the unused get_rapl_pmu_cpumask() function
   - Remove the cpu_to_rapl_pmu() function
   - Rename rapl_pmu variables
   - Make rapl_model struct global
   - Add arguments to the init and cleanup functions
   - Modify the generic variable names to *_pkg*
   - Remove the global variable rapl_msrs
   - Move the cntr_mask to rapl_pmus struct
   - Add core energy counter support for AMD CPUs

  Intel core PMU driver enhancements:

   - Support RDPMC 'metrics clear mode' feature (Kan Liang)
   - Clarify adaptive PEBS processing (Kan Liang)
   - Factor out functions for PEBS records processing (Kan Liang)
   - Simplify the PEBS records processing for adaptive PEBS (Kan Liang)

  Intel uncore driver enhancements: (Kan Liang)

   - Convert buggy pmu->func_id use to pmu->registered
   - Support more units on Granite Rapids"

* tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  perf: map pages in advance
  perf/x86/intel/uncore: Support more units on Granite Rapids
  perf/x86/intel/uncore: Clean up func_id
  perf/x86/intel: Support RDPMC metrics clear mode
  uprobes: Guard against kmemdup() failing in dup_return_instance()
  perf/x86: Relax privilege filter restriction on AMD IBS
  perf/core: Export perf_exclude_event()
  uprobes: Reuse return_instances between multiple uretprobes within task
  uprobes: Ensure return_instance is detached from the list before freeing
  uprobes: Decouple return_instance list traversal and freeing
  uprobes: Simplify session consumer tracking
  uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  uprobes: simplify find_active_uprobe_rcu() VMA checks
  mm: introduce mmap_lock_speculate_{try_begin|retry}
  mm: convert mm_lock_seq to a proper seqcount
  mm/gup: Use raw_seqcount_try_begin()
  seqlock: add raw_seqcount_try_begin
  perf/x86/rapl: Add core energy counter support for AMD CPUs
  perf/x86/rapl: Move the cntr_mask to rapl_pmus struct
  perf/x86/rapl: Remove the global variable rapl_msrs
  ...
2025-01-21 10:52:03 -08:00
Linus Torvalds
a6640c8c2f Merge tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:

 - Introduce the generic section-based annotation infrastructure a.k.a.
   ASM_ANNOTATE/ANNOTATE (Peter Zijlstra)

 - Convert various facilities to ASM_ANNOTATE/ANNOTATE: (Peter Zijlstra)
    - ANNOTATE_NOENDBR
    - ANNOTATE_RETPOLINE_SAFE
    - instrumentation_{begin,end}()
    - VALIDATE_UNRET_BEGIN
    - ANNOTATE_IGNORE_ALTERNATIVE
    - ANNOTATE_INTRA_FUNCTION_CALL
    - {.UN}REACHABLE

 - Optimize the annotation-sections parsing code (Peter Zijlstra)

 - Centralize annotation definitions in <linux/objtool.h>

 - Unify & simplify the barrier_before_unreachable()/unreachable()
   definitions (Peter Zijlstra)

 - Convert unreachable() calls to BUG() in x86 code, as unreachable()
   has unreliable code generation (Peter Zijlstra)

 - Remove annotate_reachable() and annotate_unreachable(), as it's
   unreliable against compiler optimizations (Peter Zijlstra)

 - Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra)

 - Robustify the annotation code by warning about unknown annotation
   types (Peter Zijlstra)

 - Allow arch code to discover jump table size, in preparation of
   annotated jump table support (Ard Biesheuvel)

* tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Convert unreachable() to BUG()
  objtool: Allow arch code to discover jump table size
  objtool: Warn about unknown annotation types
  objtool: Fix ANNOTATE_REACHABLE to be a normal annotation
  objtool: Convert {.UN}REACHABLE to ANNOTATE
  objtool: Remove annotate_{,un}reachable()
  loongarch: Use ASM_REACHABLE
  x86: Convert unreachable() to BUG()
  unreachable: Unify
  objtool: Collect more annotations in objtool.h
  objtool: Collapse annotate sequences
  objtool: Convert ANNOTATE_INTRA_FUNCTION_CALL to ANNOTATE
  objtool: Convert ANNOTATE_IGNORE_ALTERNATIVE to ANNOTATE
  objtool: Convert VALIDATE_UNRET_BEGIN to ANNOTATE
  objtool: Convert instrumentation_{begin,end}() to ANNOTATE
  objtool: Convert ANNOTATE_RETPOLINE_SAFE to ANNOTATE
  objtool: Convert ANNOTATE_NOENDBR to ANNOTATE
  objtool: Generic annotation infrastructure
2025-01-21 10:13:11 -08:00
Linus Torvalds
b9d8a295ed Merge tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:

 - The first part of a restructuring of AMD's representation of a
   northbridge which is legacy now, and the creation of the new AMD node
   concept which represents the Zen architecture of having a collection
   of I/O devices within an SoC. Those nodes comprise the so-called data
   fabric on Zen.

   This has at least one practical advantage of not having to add a PCI
   ID each time a new data fabric PCI device releases. Eventually, the
   lot more uniform provider of data fabric functionality amd_node.c
   will be used by all the drivers which need it

 - Smaller cleanups

* tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/amd_node: Use defines for SMN register offsets
  x86/amd_node: Remove dependency on AMD_NB
  x86/amd_node: Update __amd_smn_rw() error paths
  x86/amd_nb: Move SMN access code to a new amd_node driver
  x86/amd_nb, hwmon: (k10temp): Simplify amd_pci_dev_to_node_id()
  x86/amd_nb: Simplify function 3 search
  x86/amd_nb: Use topology info to get AMD node count
  x86/amd_nb: Simplify root device search
  x86/amd_nb: Simplify function 4 search
  x86: Start moving AMD node functionality out of AMD_NB
  x86/amd_nb: Clean up early_is_amd_nb()
  x86/amd_nb: Restrict init function to AMD-based systems
  x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
2025-01-21 09:38:52 -08:00
Linus Torvalds
48795f90cb Merge tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:

 - Remove the less generic CPU matching infra around struct x86_cpu_desc
   and use the generic struct x86_cpu_id thing

 - Remove magic naked numbers for CPUID functions and use proper defines
   of the prefix CPUID_LEAF_*. Consolidate some of the crazy use around
   the tree

 - Smaller cleanups and improvements

* tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Make all all CPUID leaf names consistent
  x86/fpu: Remove unnecessary CPUID level check
  x86/fpu: Move CPUID leaf definitions to common code
  x86/tsc: Remove CPUID "frequency" leaf magic numbers.
  x86/tsc: Move away from TSC leaf magic numbers
  x86/cpu: Move TSC CPUID leaf definition
  x86/cpu: Refresh DCA leaf reading code
  x86/cpu: Remove unnecessary MwAIT leaf checks
  x86/cpu: Use MWAIT leaf definition
  x86/cpu: Move MWAIT leaf definition to common header
  x86/cpu: Remove 'x86_cpu_desc' infrastructure
  x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id'
  x86/cpu: Replace PEBS use of 'x86_cpu_desc' use with 'x86_cpu_id'
  x86/cpu: Expose only stepping min/max interface
  x86/cpu: Introduce new microcode matching helper
  x86/cpufeature: Document cpu_feature_enabled() as the default to use
  x86/paravirt: Remove the WBINVD callback
  x86/cpufeatures: Free up unused feature bits
2025-01-21 09:30:59 -08:00
Linus Torvalds
13b6931c44 Merge tag 'x86_sev_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:

 - A segmented Reverse Map table (RMP) is a across-nodes distributed
   table of sorts which contains per-node descriptors of each node-local
   4K page, denoting its ownership (hypervisor, guest, etc) in the realm
   of confidential computing. Add support for such a table in order to
   improve referential locality when accessing or modifying RMP table
   entries

 - Add support for reading the TSC in SNP guests by removing any
   interference or influence the hypervisor might have, with the goal of
   making a confidential guest even more independent from the hypervisor

* tag 'x86_sev_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev: Add the Secure TSC feature for SNP guests
  x86/tsc: Init the TSC for Secure TSC guests
  x86/sev: Mark the TSC in a secure TSC guest as reliable
  x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled guests
  x86/sev: Prevent GUEST_TSC_FREQ MSR interception for Secure TSC enabled guests
  x86/sev: Change TSC MSR behavior for Secure TSC enabled guests
  x86/sev: Add Secure TSC support for SNP guests
  x86/sev: Relocate SNP guest messaging routines to common code
  x86/sev: Carve out and export SNP guest messaging init routines
  virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL
  virt: sev-guest: Remove is_vmpck_empty() helper
  x86/sev/docs: Document the SNP Reverse Map Table (RMP)
  x86/sev: Add full support for a segmented RMP table
  x86/sev: Treat the contiguous RMP table as a single RMP segment
  x86/sev: Map only the RMP table entries instead of the full RMP range
  x86/sev: Move the SNP probe routine out of the way
  x86/sev: Require the RMPREAD instruction after Zen4
  x86/sev: Add support for the RMPREAD instruction
  x86/sev: Prepare for using the RMPREAD instruction to access the RMP
2025-01-21 09:00:31 -08:00
Linus Torvalds
254d763310 Merge tag 'x86_microcode_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loader updates from Borislav Petkov:

 - A bunch of minor cleanups

* tag 'x86_microcode_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode/AMD: Remove ret local var in early_apply_microcode()
  x86/microcode/AMD: Have __apply_microcode_amd() return bool
  x86/microcode/AMD: Make __verify_patch_size() return bool
  x86/microcode/AMD: Remove bogus comment from parse_container()
  x86/microcode/AMD: Return bool from find_blobs_in_containers()
2025-01-21 08:33:10 -08:00
Linus Torvalds
3357d1d1f9 Merge tag 'x86_cache_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:

 - Extend resctrl with the capability of total memory bandwidth
   monitoring, thus accomodating systems which support only total but
   not local memory bandwidth monitoring. Add the respective new mount
   options

 - The usual cleanups

* tag 'x86_cache_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Document the new "mba_MBps_event" file
  x86/resctrl: Add write option to "mba_MBps_event" file
  x86/resctrl: Add "mba_MBps_event" file to CTRL_MON directories
  x86/resctrl: Make mba_sc use total bandwidth if local is not supported
  x86/resctrl: Compute memory bandwidth for all supported events
  x86/resctrl: Modify update_mba_bw() to use per CTRL_MON group event
  x86/resctrl: Prepare for per-CTRL_MON group mba_MBps control
  x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags
  x86/resctrl: Use kthread_run_on_cpu()
2025-01-21 08:31:04 -08:00
Linus Torvalds
d80825ee4a Merge tag 'x86_bugs_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CPU speculation update from Borislav Petkov:

 - Add support for AMD hardware which is not affected by SRSO on the
   user/kernel attack vector and advertise it to guest userspace

* tag 'x86_bugs_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM: x86: Advertise SRSO_USER_KERNEL_NO to userspace
  x86/bugs: Add SRSO_USER_KERNEL_NO support
2025-01-21 08:22:40 -08:00
Linus Torvalds
d3504411a4 Merge tag 'ras_core_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:

 - Remove the shared threshold bank hack on AMD and streamline and
   simplify it

 - Cleanup and sanitize MCA code

* tag 'ras_core_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce/amd: Remove shared threshold bank plumbing
  x86/mce: Remove the redundant mce_hygon_feature_init()
  x86/mce: Convert family/model mixed checks to VFM-based checks
  x86/mce: Break up __mcheck_cpu_apply_quirks()
  x86/mce: Make four functions return bool
  x86/mce/threshold: Remove the redundant this_cpu_dec_return()
  x86/mce: Make several functions return bool
2025-01-21 08:16:24 -08:00
Thomas Gleixner
7d04319a05 x86/apic: Convert to IRQCHIP_MOVE_DEFERRED
Instead of marking individual interrupts as safe to be migrated in
arbitrary contexts, mark the interrupt chips, which require the interrupt
to be moved in actual interrupt context, with the new IRQCHIP_MOVE_DEFERRED
flag. This makes more sense because this is a per interrupt chip property
and not restricted to individual interrupts.

That flips the logic from the historical opt-out to a opt-in model. This is
simpler to handle for other architectures, which default to unrestricted
affinity setting. It also allows to cleanup the redundant core logic
significantly.

All interrupt chips, which belong to a top-level domain sitting directly on
top of the x86 vector domain are marked accordingly, unless the related
setup code marks the interrupts with IRQ_MOVE_PCNTXT, i.e. XEN.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/all/20241210103335.563277044@linutronix.de
2025-01-15 21:38:53 +01:00
Xin Li (Intel)
de31b3cd70 x86/fred: Fix the FRED RSP0 MSR out of sync with its per-CPU cache
The FRED RSP0 MSR is only used for delivering events when running
userspace.  Linux leverages this property to reduce expensive MSR
writes and optimize context switches.  The kernel only writes the
MSR when about to run userspace *and* when the MSR has actually
changed since the last time userspace ran.

This optimization is implemented by maintaining a per-CPU cache of
FRED RSP0 and then checking that against the value for the top of
current task stack before running userspace.

However cpu_init_fred_exceptions() writes the MSR without updating
the per-CPU cache.  This means that the kernel might return to
userspace with MSR_IA32_FRED_RSP0==0 when it needed to point to the
top of current task stack.  This would induce a double fault (#DF),
which is bad.

A context switch after cpu_init_fred_exceptions() can paper over
the issue since it updates the cached value.  That evidently
happens most of the time explaining how this bug got through.

Fix the bug through resynchronizing the FRED RSP0 MSR with its
per-CPU cache in cpu_init_fred_exceptions().

Fixes: fe85ee3919 ("x86/entry: Set FRED RSP0 on return to userspace instead of context switch")
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250110174639.1250829-1-xin%40zytor.com
2025-01-14 14:16:36 -08:00
David Woodhouse
7c61a3d8f7 x86/kexec: Use typedef for relocate_kernel_fn function prototype
Both i386 and x86_64 now copy the relocate_kernel() function into the control
page and execute it from there, using an open-coded function pointer.

Use a typedef for it instead.

  [ bp: Put relocate_kernel_ptr ptr arithmetic on a single line. ]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-10-dwmw2@infradead.org
2025-01-14 13:09:08 +01:00
David Woodhouse
e536057543 x86/kexec: Cope with relocate_kernel() not being at the start of the page
A few places in the kexec control code page make the assumption that the first
instruction of relocate_kernel is at the very start of the page.

To allow for Clang CFI information to be added to relocate_kernel(), as well
as the general principle of removing unwarranted assumptions, fix them to use
the external __relocate_kernel_start symbol that the linker adds. This means
using a separate addq and subq for calculating offsets, as the assembler can
no longer calculate the delta directly for itself and relocations aren't that
versatile. But those values can at least be used relative to a local label to
avoid absolute relocations.

Turn the jump from relocate_kernel() to identity_mapped() into a real indirect
'jmp *%rsi' too, while touching it. There was no real reason for it to be
a push+ret in the first place, and adding Clang CFI info will also give
objtool enough visibility to start complaining 'return with modified stack
frame' about it.

  [ bp: Massage commit message. ]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-9-dwmw2@infradead.org
2025-01-14 13:05:14 +01:00
David Woodhouse
2114796ca0 x86/kexec: Mark machine_kexec() with __nocfi
A recent commit caused the relocate_kernel() function to be invoked through
a function pointer, but it does not have CFI information. The resulting trap
occurs after the IDT and GDT have been invalidated, leading to a triple-fault
if CONFIG_CFI_CLANG is enabled.

Using SYM_TYPED_FUNC_START() to provide the CFI information looks like it will
require a prolonged battle with objtool. And is fairly pointless anyway, as
the actual signature comes from a __kcfi_typeid_… symbol emitted from the
C code based on the function prototype it thinks that relocate_kernel has,
rendering the check somewhat tautological.

The simple fix is just to mark machine_kexec() with __nocfi.

Fixes: eeebbde571 ("x86/kexec: Invoke copy of relocate_kernel() instead of the original")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-7-dwmw2@infradead.org
2025-01-14 13:02:40 +01:00
Nathan Chancellor
eeed915041 x86/kexec: Fix location of relocate_kernel with -ffunction-sections
After commit

  cb33ff9e06 ("x86/kexec: Move relocate_kernel to kernel .data section"),

kernels configured with an option that uses -ffunction-sections, such as
CONFIG_LTO_CLANG, crash when kexecing because the value of relocate_kernel
does not match the value of __relocate_kernel_start so incorrect code gets
copied via machine_kexec_prepare().

  $ llvm-nm good-vmlinux &| rg relocate_kernel
  ffffffff83280d41 T __relocate_kernel_end
  ffffffff83280b00 T __relocate_kernel_start
  ffffffff83280b00 T relocate_kernel

  $ llvm-nm bad-vmlinux &| rg relocate_kernel
  ffffffff83266100 D __relocate_kernel_end
  ffffffff83266100 D __relocate_kernel_start
  ffffffff8120b0d8 T relocate_kernel

When -ffunction-sections is enabled, TEXT_MAIN matches on
'.text.[0-9a-zA-Z_]*' to coalesce the function specific functions back
into .text during link time after they have been optimized. Due to the
placement of TEXT_TEXT before KEXEC_RELOCATE_KERNEL in the x86 linker
script, the .text.relocate_kernel section ends up in .text instead of
.data.

Use a second dot in the relocate_kernel section name to avoid matching
on TEXT_MAIN, which matches a similar situation that happened in
commit

  79cd2a1122 ("x86/retpoline,kprobes: Fix position of thunk sections with CONFIG_LTO_CLANG"),

which allows kexec to function properly.

While .data.relocate_kernel still ends up in the .data section via
DATA_MAIN -> DATA_DATA, ensure it is located with the
.text.relocate_kernel section as intended by performing the same
transformation.

Fixes: cb33ff9e06 ("x86/kexec: Move relocate_kernel to kernel .data section")
Fixes: 8dbec5c77b ("x86/kexec: Add data section to relocate_kernel")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-6-dwmw2@infradead.org
2025-01-14 13:00:18 +01:00
David Woodhouse
2cacf7f23a x86/kexec: Fix stack and handling of re-entry point for ::preserve_context
A ::preserve_context kimage can be invoked more than once, and the entry point
can be different every time. When the callee returns to the kernel, it leaves
the address of its entry point for next time on the stack.

That being the case, one might reasonably assume that the caller would
allocate space for it on the stack frame before actually performing the 'call'
into the callee.

Apparently not, though. Ever since the kjump code was first added in 2009, it
has set up a *new* stack at the top of the swap_page scratch page, then just
performed the 'call' without allocating any space for the re-entry address to
be returned. It then reads the re-entry point for next time from 0(%rsp) which
is actually the first qword of the page *after* the swap page, which might not
exist at all! And if the callee has written to that, then it will have
corrupted memory it doesn't own.

Correct this by pushing the entry point of the callee onto the stack before
calling it. The callee may then adjust it, or not, as it sees fit, and
subsequent invocations should work correctly either way.

Remove a stray push of zero to the *relocate_kernel* stack, which may have
been intended for this purpose, but which was actually just noise.

Also, loading the stack for the callee relied on the address of the swap page
being in %r10 without ever documenting that fact. Recent code changes made
that no longer true, so load it directly from the local kexec_pa_swap_page
variable instead.

Fixes: b3adabae8a ("x86/kexec: Drop page_list argument from relocate_kernel()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-5-dwmw2@infradead.org
2025-01-14 12:57:43 +01:00
David Woodhouse
85d724df8c x86/kexec: Use correct swap page in swap_pages function
The swap_pages function expects the swap page to be in %r10, but there
was no documentation to that effect. Once upon a time the setup code
used to load its value from a kernel virtual address and save it to an
address which is accessible in the identity-mapped page tables, and
*happened* to use %r10 to do so, with no comment that it was left there
on *purpose* instead of just being a scratch register. Once that was no
longer necessary, %r10 just holds whatever the kernel happened to leave
in it.

Now that the original value passed by the kernel is accessible via
%rip-relative addressing, load directly from there instead of using %r10
for it. But document the other parameters that the swap_pages function
*does* expect in registers.

Fixes: b3adabae8a ("x86/kexec: Drop page_list argument from relocate_kernel()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-4-dwmw2@infradead.org
2025-01-14 12:54:36 +01:00
David Woodhouse
4d5f1da98f x86/kexec: Ensure preserve_context flag is set on return to kernel
The swap_pages() function will only actually *swap*, as its name implies, if
the preserve_context flag in the %r11 register is non-zero. On the way back
from a ::preserve_context kexec, ensure that the %r11 register is non-zero so
that the pages get swapped back.

Fixes: 9e5683e2d0 ("x86/kexec: Only swap pages for ::preserve_context mode")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-3-dwmw2@infradead.org
2025-01-14 12:52:47 +01:00
David Woodhouse
d144d8a652 x86/kexec: Disable global pages before writing to control page
The kernel switches to a new set of page tables during kexec. The global
mappings (_PAGE_GLOBAL==1) can remain in the TLB after this switch. This
is generally not a problem because the new page tables use a different
portion of the virtual address space than the normal kernel mappings.

The critical exception to that generalisation (and the only mapping
which isn't an identity mapping) is the kexec control page itself —
which was ROX in the original kernel mapping, but should be RWX in the
new page tables. If there is a global TLB entry for that in its prior
read-only state, it definitely needs to be flushed before attempting to
write through that virtual mapping.

It would be possible to just avoid writing to the virtual address of the
page and defer all writes until they can be done through the identity
mapping. But there's no good reason to keep the old TLB entries around,
as they can cause nothing but trouble.

Clear the PGE bit in %cr4 early, before storing data in the control page.

Fixes: 5a82223e07 ("x86/kexec: Mark relocate_kernel page as ROX instead of RWX")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219592
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com>
Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com>
Link: https://lore.kernel.org/r/20250109140757.2841269-2-dwmw2@infradead.org
2025-01-14 12:46:17 +01:00
Guo Weikang
ccd582059a mm/early_ioremap: add null pointer checks to prevent NULL-pointer dereference
The early_ioremap interface can fail and return NULL in certain cases.  To
prevent NULL-pointer dereference crashes, fixed issues in the acpi_extlog
and copy_early_mem interfaces, improving robustness when handling early
memory.

Link: https://lkml.kernel.org/r/20241212101004.1544070-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13 22:40:59 -08:00
Qi Zheng
718b13861d x86: mm: free page table pages by RCU instead of semi RCU
Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
will be freed by semi RCU, that is:

 - batch table freeing: asynchronous free by RCU
 - single table freeing: IPI + synchronous free

In this way, the page table can be lockless traversed by disabling IRQ in
paths such as fast GUP.  But this is not enough to free the empty PTE page
table pages in paths other that munmap and exit_mmap path, because IPI
cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().

In preparation for supporting empty PTE page table pages reclaimation, let
single table also be freed by RCU like batch table freeing.  Then we can
also use pte_offset_map() etc to prevent PTE page from being freed.

Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
the page table pages:

 - The pt_rcu_head is unioned with pt_list and pmd_huge_pte.

 - For pt_list, it is used to manage the PGD page in x86. Fortunately
   tlb_remove_table() will not be used for free PGD pages, so it is safe
   to use pt_rcu_head.

 - For pmd_huge_pte, it is used for THPs, so it is safe.

After applying this patch, if CONFIG_PT_RECLAIM is enabled, the function
call of free_pte() is as follows:

free_pte
  pte_free_tlb
    __pte_free_tlb
      ___pte_free_tlb
        paravirt_tlb_remove_table
          tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
            [no-free-memory slowpath:]
              tlb_table_invalidate
              tlb_remove_table_one
                __tlb_remove_table_one [frees via RCU]
            [fastpath:]
              tlb_table_flush
                tlb_remove_table_free [frees via RCU]
          native_tlb_remove_table [CONFIG_PARAVIRT on native]
            tlb_remove_table [see above]

Link: https://lkml.kernel.org/r/0287d442a973150b0e1019cc406e6322d148277a.1733305182.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13 22:40:48 -08:00
Dr. David Alan Gilbert
58589c6a6e rtc: Remove hpet_rtc_dropped_irq()
hpet_rtc_dropped_irq() has been unused since
commit f52ef24be2 ("rtc/alpha: remove legacy rtc driver")

Remove it in rtc, and x86 hpet code.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241215022356.181625-1-linux@treblig.org
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2025-01-13 23:07:18 +01:00
K Prateek Nayak
e1bc026465 x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
x86_sched_itmt_flags() returns SD_ASYM_PACKING if ITMT support is
enabled by the system. Without ITMT support being enabled, it returns 0
similar to current x86_die_flags() on non-Hybrid systems
(!X86_HYBRID_CPU and !X86_FEATURE_AMD_HETEROGENEOUS_CORES)

On Intel systems that enable ITMT support, either the MC domain
coincides with the PKG domain, or in case of multiple MC groups
within a PKG domain, either Sub-NUMA Cluster (SNC) is enabled or the
processor features Hybrid core layout (X86_HYBRID_CPU) which leads to
three distinct possibilities:

o If PKG and MC domains coincide, PKG domain is degenerated by
  sd_parent_degenerate() when building sched domain topology.

o If SNC is enabled, PKG domain is never added since
  "x86_has_numa_in_package" is set and the topology will instead contain
  NODE and NUMA domains.

o On X86_HYBRID_CPU which contains multiple MC groups within the PKG,
  the PKG domain requires x86_sched_itmt_flags().

Thus, on Intel systems that contains multiple MC groups within the PKG
and enables ITMT support, the PKG domain requires
x86_sched_itmt_flags(). In all other cases PKG domain is either never
added or is degenerated. Thus, returning x86_sched_itmt_flags()
unconditionally at PKG domain on Intel systems should not lead to any
functional changes.

On AMD systems with multiple LLCs (MC groups) within a PKG domain,
enabling ITMT support requires setting SD_ASYM_PACKING to the PKG domain
since the core rankings are assigned PKG-wide.

Core rankings on AMD processors is currently set by the amd-pstate
driver when Preferred Core feature is supported. A subset of systems that
support Preferred Core feature can be detected using
X86_FEATURE_AMD_HETEROGENEOUS_CORES however, this does not cover all the
systems that support Preferred Core ranking.

Detecting Preferred Core support on AMD systems requires inspecting CPPC
Highest Perf on all present CPUs and checking if it differs on at least
one CPU. Previous suggestion to use a synthetic feature to detect
Preferred Core support [1] was found to be non-trivial to implement
since BSP alone cannot detect if Preferred Core is supported and by the
time AP comes up, alternatives are patched and setting a X86_FEATURE_*
then is not possible.

Since x86 processors enabling ITMT support that consists multiple
non-NUMA MC groups within a PKG requires SD_ASYM_PACKING flag set at the
PKG domain, return x86_sched_itmt_flags unconditionally for the PKG
domain.

Since x86_die_flags() would have just returned x86_sched_itmt_flags()
after the change, remove the unnecessary wrapper and pass
x86_sched_itmt_flags() directly as the flags function.

Fixes: f3a0523918 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-6-kprateek.nayak@amd.com
2025-01-13 14:10:24 +01:00
K Prateek Nayak
537e247879 x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
x86_*_flags() wrappers were introduced with commit d3d37d850d
("x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU") to add
x86_sched_itmt_flags() in addition to the default domain flags for SMT
and MC domain.

commit 995998ebde ("x86/sched: Remove SD_ASYM_PACKING from the
SMT domain flags") removed the ITMT flags for SMT domain but not the
x86_smt_flags() wrappers which directly returns cpu_smt_flags().

Remove x86_smt_flags() and directly use cpu_smt_flags() to derive the
flags for SMT domain. No functional changes intended.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-5-kprateek.nayak@amd.com
2025-01-13 14:10:24 +01:00
K Prateek Nayak
d04013a4b2 x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
"sched_itmt_enabled" was only introduced as a debug toggle for any funky
ITMT behavior. Move the sysctl controlled from
"/proc/sys/kernel/sched_itmt_enabled" to debugfs at
"/sys/kernel/debug/x86/sched_itmt_enabled" with a notable change that a
cat on the file will return "Y" or "N" instead of "1" or "0" to
indicate that feature is enabled or disabled respectively. Either "0" or
"N" (or any string that kstrtobool() interprets as false) can be written
to the file will disable the feature, and writing  either "1" or "Y" (or
any string that kstrtobool() interprets as true) will enable it back
when the platform supports ITMT ranking.

Since ITMT is x86 specific (and PowerPC uses SD_ASYM_PACKING too), the
toggle was moved to "/sys/kernel/debug/x86/" as opposed to
"/sys/kernel/debug/sched/"

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-4-kprateek.nayak@amd.com
2025-01-13 14:10:24 +01:00
K Prateek Nayak
fc1055d533 x86/itmt: Use guard() for itmt_update_mutex
Use guard() for itmt_update_mutex which avoids the extra mutex_unlock()
in the bailout and return paths.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-3-kprateek.nayak@amd.com
2025-01-13 14:10:23 +01:00
K Prateek Nayak
2f6f726bdd x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
In preparation to move "sysctl_sched_itmt_enabled" to debugfs, convert
the unsigned int to bool since debugfs readily exposes boolean fops
primitives (debugfs_read_file_bool, debugfs_write_file_bool) which can
streamline the conversion.

Since the current ctl_table initializes extra1 and extra2 to SYSCTL_ZERO
and SYSCTL_ONE respectively, the value of "sysctl_sched_itmt_enabled"
can only be 0 or 1 and this datatype conversion should not cause any
functional changes.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-2-kprateek.nayak@amd.com
2025-01-13 14:10:23 +01:00
Yafang Shao
52cd5c4b59 arch: remove get_task_comm() and print task comm directly
Since task->comm is guaranteed to be NUL-terminated, we can print it
directly without the need to copy it into a separate buffer.  This
simplifies the code and avoids unnecessary operations.

Link: https://lkml.kernel.org/r/20241219023452.69907-3-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "André Almeida" <andrealmeid@igalia.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Danilo Krummrich <dakr@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12 20:21:15 -08:00
Nuno Das Neves
ef5a3c92a8 hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
Switch to using hvhdk.h everywhere in the kernel. This header
includes all the new Hyper-V headers in include/hyperv, which form a
superset of the definitions found in hyperv-tlfs.h.

This makes it easier to add new Hyper-V interfaces without being
restricted to those in the TLFS doc (reflected in hyperv-tlfs.h).

To be more consistent with the original Hyper-V code, the names of
some definitions are changed slightly. Update those where needed.

Update comments in mshyperv.h files to point to include/hyperv for
adding new definitions.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/1732577084-2122-5-git-send-email-nunodasneves@linux.microsoft.com
Link: https://lore.kernel.org/r/20250108222138.1623703-3-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-01-10 00:54:21 +00:00
Nikunj A Dadhania
73bbf3b0fb x86/tsc: Init the TSC for Secure TSC guests
Use the GUEST_TSC_FREQ MSR to discover the TSC frequency instead of
relying on kvm-clock based frequency calibration.  Override both CPU and
TSC frequency calibration callbacks with securetsc_get_tsc_khz(). Since
the difference between CPU base and TSC frequency does not apply in this
case, the same callback is being used.

  [ bp: Carve out from
    https://lore.kernel.org/r/20250106124633.1418972-11-nikunj@amd.com ]

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250106124633.1418972-11-nikunj@amd.com
2025-01-08 21:26:19 +01:00
Yazen Ghannam
79821b907f x86/amd_node: Use defines for SMN register offsets
There are more than one SMN index/data pair available for software use.
The register offsets are different, but the protocol is the same.

Use defines for the SMN offset values and allow the index/data offsets
to be passed to the read/write helper function.

This eases code reuse with other SMN users in the kernel.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-14-yazen.ghannam@amd.com
2025-01-08 11:02:28 +01:00
Yazen Ghannam
77466b798d x86/amd_node: Remove dependency on AMD_NB
Cache the root devices locally so that there are no more dependencies on
AMD_NB.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-13-yazen.ghannam@amd.com
2025-01-08 11:02:22 +01:00
Yazen Ghannam
35df797665 x86/amd_node: Update __amd_smn_rw() error paths
Use guard(mutex) and convert PCI error codes to common ones.

Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-12-yazen.ghannam@amd.com
2025-01-08 11:01:46 +01:00