Pull powerpc updates from Michael Ellerman:
- Rework kfence support for the HPT MMU to work on systems with >= 16TB
of RAM.
- Remove the powerpc "maple" platform, used by the "Yellow Dog
Powerstation".
- Add support for DYNAMIC_FTRACE_WITH_CALL_OPS,
DYNAMIC_FTRACE_WITH_DIRECT_CALLS & BPF Trampolines.
- Add support for running KVM nested guests on Power11.
- Other small features, cleanups and fixes.
Thanks to Amit Machhiwal, Arnd Bergmann, Christophe Leroy, Costa
Shulyupin, David Hunter, David Wang, Disha Goel, Gautam Menghani, Geert
Uytterhoeven, Hari Bathini, Julia Lawall, Kajol Jain, Keith Packard,
Lukas Bulwahn, Madhavan Srinivasan, Markus Elfring, Michal Suchanek,
Ming Lei, Mukesh Kumar Chaurasiya, Nathan Chancellor, Naveen N Rao,
Nicholas Piggin, Nysal Jan K.A, Paulo Miguel Almeida, Pavithra Prakash,
Ritesh Harjani (IBM), Rob Herring (Arm), Sachin P Bappalige, Shen
Lichuan, Simon Horman, Sourabh Jain, Thomas Weißschuh, Thorsten Blum,
Thorsten Leemhuis, Venkat Rao Bagalkote, Zhang Zekun, and zhang jiao.
* tag 'powerpc-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (89 commits)
EDAC/powerpc: Remove PPC_MAPLE drivers
powerpc/perf: Add per-task/process monitoring to vpa_pmu driver
powerpc/kvm: Add vpa latency counters to kvm_vcpu_arch
docs: ABI: sysfs-bus-event_source-devices-vpa-pmu: Document sysfs event format entries for vpa_pmu
powerpc/perf: Add perf interface to expose vpa counters
MAINTAINERS: powerpc: Mark Maddy as "M"
powerpc/Makefile: Allow overriding CPP
powerpc-km82xx.c: replace of_node_put() with __free
ps3: Correct some typos in comments
powerpc/kexec: Fix return of uninitialized variable
macintosh: Use common error handling code in via_pmu_led_init()
powerpc/powermac: Use of_property_match_string() in pmac_has_backlight_type()
powerpc: remove dead config options for MPC85xx platform support
powerpc/xive: Use cpumask_intersects()
selftests/powerpc: Remove the path after initialization.
powerpc/xmon: symbol lookup length fixed
powerpc/ep8248e: Use %pa to format resource_size_t
powerpc/ps3: Reorganize kerneldoc parameter names
KVM: PPC: Book3S HV: Fix kmv -> kvm typo
powerpc/sstep: make emulate_vsx_load and emulate_vsx_store static
...
The systemcfg data only has minimal overlap with the vdso data.
Splitting the two avoids mapping the implementation-defined vdso data
into /proc/ppc64/systemcfg.
It is also a preparation for the standardization of vdso data storage.
The only field actually used by both systemcfg and vdso is
tb_ticks_per_sec and it is only changed once during time_init().
Initialize it in both structures there.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-26-b64f0842d512@linutronix.de
The page containing VDSO time data is swapped with the one containing
TIME namespace data when a process uses a non-root time namespace.
For other data like powerpc specific data and RNG data, it means
tracking whether time namespace is the root one or not to know which
page to use.
Simplify the logic behind by moving time data out of first data page
so that the first data page which contains everything else always
remains the first page. Time data is in the second or third page
depending on selected time namespace.
While we are playing with get_datapage macro, directly take into
account the data offset inside the macro instead of adding that offset
afterwards.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://patch.msgid.link/0557d3ec898c1d0ea2fc59fa8757618e524c5d94.1727858295.git.christophe.leroy@csgroup.eu
Linus noticed that the error handling in __arch_setup_additional_pages()
fails to clear the mm VDSO pointer if _install_special_mapping() fails.
In practice there should be no actual bug, because if there's an error the
VDSO pointer is cleared later in arch_setup_additional_pages().
However it's no longer necessary to set the pointer before installing the
mapping. Commit c1bab64360 ("powerpc/vdso: Move to
_install_special_mapping() and remove arch_vma_name()") reworked the code
so that the VMA name comes from the vm_special_mapping.name, rather than
relying on arch_vma_name().
So rework the code to only set the VDSO pointer once the mappings have
been installed correctly, and remove the stale comment.
Link: https://lkml.kernel.org/r/20240812082605.743814-4-mpe@ellerman.id.au
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
find_timens_vvar_page() is not architecture-specific, as can be seen from
how all five per-architecture versions of it are the same.
(arm64, powerpc and riscv are exactly the same; x86 and s390 have two
characters difference inside a comment, less blank lines, and mark the
!CONFIG_TIME_NS version as inline.)
Refactor the five copies into a central copy in kernel/time/namespace.c.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130115320.2918447-1-jannh@google.com
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
Implement syscall wrapper as per s390, x86, arm64. When enabled
cause handlers to accept parameters from a stack frame rather than
from user scratch register state. This allows for user registers to be
safely cleared in order to reduce caller influence on speculation
within syscall routine. The wrapper is a macro that emits syscall
handler symbols that call into the target handler, obtaining its
parameters from a struct pt_regs on the stack.
As registers are already saved to the stack prior to calling
system_call_exception, it appears that this function is executed more
efficiently with the new stack-pointer convention than with parameters
passed by registers, avoiding the allocation of a stack frame for this
method. On a 32-bit system, we see >20% performance increases on the
null_syscall microbenchmark, and on a Power 8 the performance gains
amortise the cost of clearing and restoring registers which is
implemented at the end of this series, seeing final result of ~5.6%
performance improvement on null_syscall.
Syscalls are wrapped in this fashion on all platforms except for the
Cell processor as this commit does not provide SPU support. This can be
quickly fixed in a successive patch, but requires spu_sys_callback to
allocate a pt_regs structure to satisfy the wrapped calling convention.
Co-developed-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmai.com>
[mpe: Make incompatible with COMPAT to retain clearing of high bits of args]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220921065605.1051927-22-rmclure@linux.ibm.com
Cause syscall handlers to be typed as follows when called indirectly
throughout the kernel. This is to allow for better type checking.
typedef long (*syscall_fn)(unsigned long, unsigned long, unsigned long,
unsigned long, unsigned long, unsigned long);
Since both 32 and 64-bit abis allow for at least the first six
machine-word length parameters to a function to be passed by registers,
even handlers which admit fewer than six parameters may be viewed as
having the above type.
Coercing syscalls to syscall_fn requires a cast to void* to avoid
-Wcast-function-type.
Fixup comparisons in VDSO to avoid pointer-integer comparison. Introduce
explicit cast on systems with SPUs.
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220921065605.1051927-19-rmclure@linux.ibm.com
Since commit 511157ab64 ("powerpc/vdso: Move vdso datapage up front")
VVAR page is in front of the VDSO area. In result it breaks CRIU
(Checkpoint Restore In Userspace) [1], where CRIU expects that "[vdso]"
from /proc/../maps points at ELF/vdso image, rather than at VVAR data page.
Laurent made a patch to keep CRIU working (by reading aux vector).
But I think it still makes sence to separate two mappings into different
VMAs. It will also make ppc64 less "special" for userspace and as
a side-bonus will make VVAR page un-writable by debugger (which previously
would COW page and can be unexpected).
I opportunistically Cc stable on it: I understand that usually such
stuff isn't a stable material, but that will allow us in CRIU have
one workaround less that is needed just for one release (v5.11) on
one platform (ppc64), which we otherwise have to maintain.
I wouldn't go as far as to say that the commit 511157ab64 is ABI
regression as no other userspace got broken, but I'd really appreciate
if it gets backported to v5.11 after v5.12 is released, so as not
to complicate already non-simple CRIU-vdso code. Thanks!
[1]: https://github.com/checkpoint-restore/criu/issues/1417
Cc: stable@vger.kernel.org # v5.11
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> # vDSO parts.
Acked-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f401eb1ebc0bfc4d8f0e10dc8e525fd409eb68e2.1617209142.git.christophe.leroy@csgroup.eu
Today vdso_data structure has:
- syscall_map_32[] and syscall_map_64[] on PPC64
- syscall_map_32[] on PPC32
On PPC32, syscall_map_32[] is populated using sys_call_table[].
On PPC64, syscall_map_64[] is populated using sys_call_table[]
and syscal_map_32[] is populated using compat_sys_call_table[].
To simplify vdso_setup_syscall_map(),
- On PPC32 rename syscall_map_32[] into syscall_map[],
- On PPC64 rename syscall_map_64[] into syscall_map[],
- On PPC64 rename syscall_map_32[] into compat_syscall_map[].
That way, syscall_map[] gets populated using sys_call_table[] and
compat_syscall_map[] gets population using compat_sys_call_table[].
Also define an empty compat_syscall_map[] on PPC32 to avoid ifdefs.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/472734be0d9991eee320a06824219a5b2663736b.1601197618.git.christophe.leroy@csgroup.eu
The code in vdso_cpu_init that exposes the cpu and numa node to
userspace via SPRG_VDSO incorrctly masks the cpu to 12 bits. This means
that any kernel running on a box with more than 4096 threads (NR_CPUS
advertises a limit of of 8192 cpus) would expose userspace to two cpu
contexts running at the same time with the same cpu number.
Note: I'm not aware of any distro shipping a kernel with support for more
than 4096 threads today, nor of any system image that currently exceeds
4096 threads. Found via code browsing.
Fixes: 18ad51dd34 ("powerpc: Add VDSO version of getcpu")
Signed-off-by: Milton Miller <miltonm@us.ibm.com>
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200715233704.1352257-1-anton@ozlabs.org
The original 2005 patch that introduced the powerpc vdso, pre-git
("ppc64: Implement a vDSO and use it for signal trampoline") notes that:
... symbols exposed by the vDSO aren't "normal" function symbols, apps
can't be expected to link against them directly, the vDSO's are both
seen as if they were linked at 0 and the symbols just contain offsets
to the various functions. This is done on purpose to avoid a
relocation step (ppc64 functions normally have descriptors with abs
addresses in them). When glibc uses those functions, it's expected to
use it's own trampolines that know how to reach them.
Despite that explanation, there remains dead #ifdef
VDS64_HAS_DESCRIPTORS code-blocks that provide alternate function
definitions that setup function descriptors.
Since VDS64_HAS_DESCRIPTORS has been unused for all these years, we
might as well finally remove it from the codebase.
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200224211848.26087-1-joe.lawrence@redhat.com
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 3029 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>