commit d799769188 upstream.
When ld detects unaligned relocations, it emits R_PPC64_UADDR64
relocations instead of R_PPC64_RELATIVE. Currently R_PPC64_UADDR64 are
detected by arch/powerpc/tools/relocs_check.sh and expected not to work.
Below is a simple chunk to trigger this behaviour (this disables
optimization for the demonstration purposes only, this also happens with
-O1/-O2 when CONFIG_PRINTK_INDEX=y, for example):
\#pragma GCC push_options
\#pragma GCC optimize ("O0")
struct entry {
const char *file;
int line;
} __attribute__((packed));
static const struct entry e1 = { .file = __FILE__, .line = __LINE__ };
static const struct entry e2 = { .file = __FILE__, .line = __LINE__ };
...
prom_printf("e1=%s %lx %lx\n", e1.file, (unsigned long) e1.file, mfmsr());
prom_printf("e2=%s %lx\n", e2.file, (unsigned long) e2.file);
\#pragma GCC pop_options
This adds support for UADDR64 for 64bit. This reuses __dynamic_symtab
from the 32bit code which supports more relocation types already.
Because RELACOUNT includes only R_PPC64_RELATIVE, this replaces it with
RELASZ which is the size of all relocation records.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20220309061822.168173-1-aik@ozlabs.ru
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bb523b406c upstream
Turn fault_in_pages_{readable,writeable} into versions that return the
number of bytes not faulted in, similar to copy_to_user, instead of
returning a non-zero value when any of the requested pages couldn't be
faulted in. This supports the existing users that require all pages to
be faulted in as well as new users that are happy if any pages can be
faulted in.
Rename the functions to fault_in_{readable,writeable} to make sure
this change doesn't silently break things.
Neither of these functions is entirely trivial and it doesn't seem
useful to inline them, so move them to mm/gup.c.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7c5ed82b80 ]
On large config LPARs (having 192 and more cores), Linux fails to boot
due to insufficient memory in the first memblock. It is due to the
memory reservation for the crash kernel which starts at 128MB offset of
the first memblock. This memory reservation for the crash kernel doesn't
leave enough space in the first memblock to accommodate other essential
system resources.
The crash kernel start address was set to 128MB offset by default to
ensure that the crash kernel get some memory below the RMA region which
is used to be of size 256MB. But given that the RMA region size can be
512MB or more, setting the crash kernel offset to mid of RMA size will
leave enough space for the kernel to allocate memory for other system
resources.
Since the above crash kernel offset change is only applicable to the LPAR
platform, the LPAR feature detection is pushed before the crash kernel
reservation. The rest of LPAR specific initialization will still
be done during pseries_probe_fw_features as usual.
This patch is dependent on changes to paca allocation for boot CPU. It
expect boot CPU to discover 1T segment support which is introduced by
the patch posted here:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html
Reported-by: Abdul haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220204085601.107257-1-sourabhjain@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 9bb162fa26 upstream.
Allthough kernel text is always mapped with BATs, we still have
inittext mapped with pages, so TLB miss handling is required
when CONFIG_DEBUG_PAGEALLOC or CONFIG_KFENCE is set.
The final solution should be to set a BAT that also maps inittext
but that BAT then needs to be cleared at end of init, and it will
require more changes to be able to do it properly.
As DEBUG_PAGEALLOC or KFENCE are debugging, performance is not a big
deal so let's fix it simply for now to enable easy stable application.
Fixes: 035b19a15a ("powerpc/32s: Always map kernel text and rodata with BATs")
Cc: stable@vger.kernel.org # v5.11+
Reported-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/aea33b4813a26bdb9378b5f273f00bd5d4abe240.1638857364.git.christophe.leroy@csgroup.eu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit aee101d7b9 ]
Commit 314f6c23dd ("powerpc/64s: Mask NIP before checking against
SRR0") masked off the low 2 bits of the NIP value in the interrupt
stack frame in case they are non-zero and mis-compare against a SRR0
register value of a CPU which always reads back 0 from the 2 low bits
which are reserved.
This now causes the opposite problem that an implementation which does
implement those bits in SRR0 will mis-compare against the masked NIP
value in which they have been cleared. QEMU is one such implementation,
and this is allowed by the architecture.
This can be triggered by sigfuz by setting low bits of PT_NIP in the
signal context.
Fix this for now by masking the SRR0 bits as well. Cleaner is probably
to sanitise these values before putting them in registers or stack, but
this is the quick and backportable fix.
Fixes: 314f6c23dd ("powerpc/64s: Mask NIP before checking against SRR0")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220117134403.2995059-1-npiggin@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 06e629c25d ]
In panic path, fadump is triggered via a panic notifier function.
Before calling panic notifier functions, smp_send_stop() gets called,
which stops all CPUs except the panic'ing CPU. Commit 8389b37dff
("powerpc: stop_this_cpu: remove the cpu from the online map.") and
again commit bab26238bb ("powerpc: Offline CPU in stop_this_cpu()")
started marking CPUs as offline while stopping them. So, if a kernel
has either of the above commits, vmcore captured with fadump via panic
path would not process register data for all CPUs except the panic'ing
CPU. Sample output of crash-utility with such vmcore:
# crash vmlinux vmcore
...
KERNEL: vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 1
DATE: Wed Nov 10 09:56:34 EST 2021
UPTIME: 00:00:42
LOAD AVERAGE: 2.27, 0.69, 0.24
TASKS: 183
NODENAME: XXXXXXXXX
RELEASE: 5.15.0+
VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
MACHINE: ppc64le (2500 Mhz)
MEMORY: 8 GB
PANIC: "Kernel panic - not syncing: sysrq triggered crash"
PID: 3394
COMMAND: "bash"
TASK: c0000000150a5f80 [THREAD_INFO: c0000000150a5f80]
CPU: 1
STATE: TASK_RUNNING (PANIC)
crash> p -x __cpu_online_mask
__cpu_online_mask = $1 = {
bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
crash>
crash>
crash> p -x __cpu_active_mask
__cpu_active_mask = $2 = {
bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
crash>
While this has been the case since fadump was introduced, the issue
was not identified for two probable reasons:
- In general, the bulk of the vmcores analyzed were from crash
due to exception.
- The above did change since commit 8341f2f222 ("sysrq: Use
panic() to force a crash") started using panic() instead of
deferencing NULL pointer to force a kernel crash. But then
commit de6e5d3841 ("powerpc: smp_send_stop do not offline
stopped CPUs") stopped marking CPUs as offline till kernel
commit bab26238bb ("powerpc: Offline CPU in stop_this_cpu()")
reverted that change.
To ensure post processing register data of all other CPUs happens
as intended, let panic() function take the crash friendly path (read
crash_smp_send_stop()) with the help of crash_kexec_post_notifiers
option. Also, as register data for all CPUs is captured by f/w, skip
IPI callbacks here for fadump, to avoid any complications in finding
the right backtraces.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211207103719.91117-2-hbathini@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 219572d2fc ]
Kdump can be triggered after panic_notifers since commit f06e5153f4
("kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump
after panic_notifers") introduced crash_kexec_post_notifiers option.
But using this option would mean smp_send_stop(), that marks all other
CPUs as offline, gets called before kdump is triggered. As a result,
kdump routines fail to save other CPUs' registers. To fix this, kdump
friendly crash_smp_send_stop() function was introduced with kernel
commit 0ee59413c9 ("x86/panic: replace smp_send_stop() with kdump
friendly version in panic path"). Override this kdump friendly weak
function to handle crash_kexec_post_notifiers option appropriately
on powerpc.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
[Fixed signature of crash_stop_this_cpu() - reported by lkp@intel.com]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211207103719.91117-1-hbathini@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5dad4ba68a ]
It is possible for all CPUs to miss the pending cpumask becoming clear,
and then nobody resetting it, which will cause the lockup detector to
stop working. It will eventually expire, but watchdog_smp_panic will
avoid doing anything if the pending mask is clear and it will never be
reset.
Order the cpumask clear vs the subsequent test to close this race.
Add an extra check for an empty pending mask when the watchdog fires and
finds its bit still clear, to try to catch any other possible races or
bugs here and keep the watchdog working. The extra test in
arch_touch_nmi_watchdog is required to prevent the new warning from
firing off.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Debugged-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211110025056.2084347-2-npiggin@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fd1eaaaaa6 ]
When CONFIG_PPC_RFI_SRR_DEBUG=y we check the SRR values before returning
from interrupts. This is done in asm using EMIT_BUG_ENTRY, and passing
BUGFLAG_WARNING.
However that fails to create an exception table entry for the warning,
and so do_program_check() fails the exception table search and proceeds
to call _exception(), resulting in an oops like:
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 2 PID: 1204 Comm: sigreturn_unali Tainted: P 5.16.0-rc2-00194-g91ca3d4f77c5 #12
NIP: c00000000000c5b0 LR: 0000000000000000 CTR: 0000000000000000
...
NIP [c00000000000c5b0] system_call_common+0x150/0x268
LR [0000000000000000] 0x0
Call Trace:
[c00000000db73e10] [c00000000000c558] system_call_common+0xf8/0x268 (unreliable)
...
Instruction dump:
7cc803a6 888d0931 2c240000 4082001c 38800000 988d0931 e8810170 e8a10178
7c9a03a6 7cbb03a6 7d7a02a6 e9810170 <7f0b6088> 7d7b02a6 e9810178 7f0b6088
We should instead use EMIT_WARN_ENTRY, which creates an exception table
entry for the warning, allowing the warning to be correctly recognised,
and the code to resume after printing the warning.
Note however that because this warning is buried deep in the interrupt
return path, we are not able to recover from it (due to MSR_RI being
clear), so we still end up in die() with an unrecoverable exception.
Fixes: 59dc5bfca0 ("powerpc/64s: avoid reloading (H)SRR registers if they are still valid")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211221135101.2085547-2-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 314f6c23dd ]
When CONFIG_PPC_RFI_SRR_DEBUG=y we check that NIP and SRR0 match when
returning from interrupts. This can trigger falsely if NIP has either of
its two low bits set via sigreturn or ptrace, while SRR0 has its low two
bits masked in hardware.
As a quick fix make sure to mask the low bits before doing the check.
Fixes: 59dc5bfca0 ("powerpc/64s: avoid reloading (H)SRR registers if they are still valid")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/20211221135101.2085547-1-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f1797e4de1 ]
module_alloc() first tries to allocate module text within 24 bits direct
jump from kernel text, and tries a wider allocation if first one fails.
When first allocation fails the following is observed in kernel logs:
vmap allocation for size 2400256 failed: use vmalloc=<size> to increase size
systemd-udevd: vmalloc error: size 2395133, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null)
CPU: 0 PID: 127 Comm: systemd-udevd Tainted: G W 5.15.5-gentoo-PowerMacG4 #9
Call Trace:
[e2a53a50] [c0ba0048] dump_stack_lvl+0x80/0xb0 (unreliable)
[e2a53a70] [c0540128] warn_alloc+0x11c/0x2b4
[e2a53b50] [c0531be8] __vmalloc_node_range+0xd8/0x64c
[e2a53c10] [c00338c0] module_alloc+0xa0/0xac
[e2a53c40] [c027a368] load_module+0x2ae0/0x8148
[e2a53e30] [c027fc78] sys_finit_module+0xfc/0x130
[e2a53f30] [c0035098] ret_from_syscall+0x0/0x28
...
Add __GFP_NOWARN flag to first allocation so that no warning appears
when it fails.
Reported-by: Erhard Furtner <erhard_f@mailbox.org>
Fixes: 2ec13df167 ("powerpc/modules: Load modules closer to kernel text")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/93c9b84d6ec76aaf7b4f03468e22433a6d308674.1638267035.git.christophe.leroy@csgroup.eu
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 08b0af5b2a ]
Some thread flags can be set remotely, and so even when IRQs are disabled,
the flags can change under our feet. Thus, when setting flags we must use
an atomic operation rather than a plain read-modify-write sequence, as a
plain read-modify-write may discard flags which are concurrently set by a
remote thread, e.g.
// task A // task B
tmp = A->thread_info.flags;
set_tsk_thread_flag(A, NEWFLAG_B);
tmp |= NEWFLAG_A;
A->thread_info.flags = tmp;
arch/powerpc/kernel/interrupt.c's system_call_exception() sets
_TIF_RESTOREALL in the thread info flags with a read-modify-write, which
may result in other flags being discarded.
Elsewhere in the file it uses clear_bits() to atomically remove flag bits,
so use set_bits() here for consistency with those.
There may be reasons (e.g. instrumentation) that prevent the use of
set_thread_flag() and clear_thread_flag() here, which would otherwise be
preferable.
Fixes: ae7aaecc3f ("powerpc/64s: system call rfscv workaround for TM bugs")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Eirik Fuller <efuller@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/r/20211129130653.2037928-10-mark.rutland@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 8734b41b3e upstream.
Livepatching a loaded module involves applying relocations through
apply_relocate_add(), which attempts to write to read-only memory when
CONFIG_STRICT_MODULE_RWX=y. Work around this by performing these
writes through the text poke area by using patch_instruction().
R_PPC_REL24 is the only relocation type generated by the kpatch-build
userspace tool or klp-convert kernel tree that I observed applying a
relocation to a post-init module.
A more comprehensive solution is planned, but using patch_instruction()
for R_PPC_REL24 on should serve as a sufficient fix.
This does have a performance impact, I observed ~15% overhead in
module_load() on POWER8 bare metal with checksum verification off.
Fixes: c35717c71e ("powerpc: Set ARCH_HAS_STRICT_MODULE_RWX")
Cc: stable@vger.kernel.org # v5.14+
Reported-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
[mpe: Check return codes from patch_instruction()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211214121248.777249-1-mpe@ellerman.id.au
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fcb116bc43 upstream.
Recently to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered SA_IMMUTABLE was added.
Unfortunately this broke debuggers[1][2] which reasonably expect
to be able to trap synchronous SIGTRAP and SIGSEGV even when
the target process is not configured to handle those signals.
Add force_exit_sig and use it instead of force_fatal_sig where
historically the code has directly called do_exit. This has the
implementation benefits of going through the signal exit path
(including generating core dumps) without the danger of allowing
userspace to ignore or change these signals.
This avoids userspace regressions as older kernels exited with do_exit
which debuggers also can not intercept.
In the future is should be possible to improve the quality of
implementation of the kernel by changing some of these force_exit_sig
calls to force_fatal_sig. That can be done where it matters on
a case-by-case basis with careful analysis.
Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Fixes: a3616a3c02 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die")
Fixes: 83a1f27ad7 ("signal/powerpc: On swapcontext failure force SIGSEGV")
Fixes: 9bc508cf07 ("signal/s390: Use force_sigsegv in default_trap_handler")
Fixes: 086ec444f8 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig")
Fixes: c317d306d5 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails")
Fixes: 695dd0d634 ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit")
Fixes: 1fbd60df8a ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.")
Fixes: 941edc5bf1 ("exit/syscall_user_dispatch: Send ordinary signals on failure")
Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Backlund <tmb@iki.fi>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 83a1f27ad7 upstream.
If the register state may be partial and corrupted instead of calling
do_exit, call force_sigsegv(SIGSEGV). Which properly kills the
process with SIGSEGV and does not let any more userspace code execute,
instead of just killing one thread of the process and potentially
confusing everything.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 756f1ae8a44e ("PPC32: Rework signal code and add a swapcontext system call.")
Fixes: 04879b04bf50 ("[PATCH] ppc64: VMX (Altivec) support & signal32 rework, from Ben Herrenschmidt")
Link: https://lkml.kernel.org/r/20211020174406.17889-7-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Thomas Backlund <tmb@iki.fi>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5d5e4522a7 upstream.
printk from NMI context relies on irq work being raised on the local CPU
to print to console. This can be a problem if the NMI was raised by a
lockup detector to print lockup stack and regs, because the CPU may not
enable irqs (because it is locked up).
Introduce printk_trigger_flush() that can be called another CPU to try
to get those messages to the console, call that where printk_safe_flush
was previously called.
Fixes: 93d102f094 ("printk: remove safe buffers")
Cc: stable@vger.kernel.org # 5.15
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20211107045116.1754411-1-npiggin@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1e35eba405 upstream.
As spotted and explained in commit c12ab8dbc4 ("powerpc/8xx: Fix
Oops with STRICT_KERNEL_RWX without DEBUG_RODATA_TEST"), the selection
of STRICT_KERNEL_RWX without selecting DEBUG_RODATA_TEST has spotted
the lack of the DIRTY bit in the pinned kernel data TLBs.
This problem should have been detected a lot earlier if things had
been working as expected. But due to an incredible level of chance or
mishap, this went undetected because of a set of bugs: In fact the
DTLBs were not pinned, because instead of setting the reserve bit
in MD_CTR, it was set in MI_CTR that is the register for ITLBs.
But then, another huge bug was there: the physical address was
reset to 0 at the boundary between RO and RW areas, leading to the
same physical space being mapped at both 0xc0000000 and 0xc8000000.
This had by miracle no consequence until now because the entry was
not really pinned so it was overwritten soon enough to go undetected.
Of course, now that we really pin the DTLBs, it must be fixed as well.
Fixes: f76c8f6d25 ("powerpc/8xx: Add function to set pinned TLBs")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Depends-on: c12ab8dbc4 ("powerpc/8xx: Fix Oops with STRICT_KERNEL_RWX without DEBUG_RODATA_TEST")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a21e9a057fe2d247a535aff0d157a54eefee017a.1636963688.git.christophe.leroy@csgroup.eu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5499802b22 upstream.
The conversion from __copy_from_user() to __get_user() by
commit d3ccc97815 ("powerpc/signal: Use __get_user() to copy
sigset_t") introduced a regression in __get_user_sigset() for
powerpc/32. The bug was subsequently moved into
unsafe_get_user_sigset().
The bug is due to the copied 64 bit value being truncated to
32 bits while being assigned to dst->sig[0]
The regression was reported by users of the Xorg packages distributed in
Debian/powerpc --
"The symptoms are that the fb screen goes blank, with the backlight
remaining on and no errors logged in /var/log; wdm (or startx) run
with no effect (I tried logging in in the blind, with no effect).
And they are hard to kill, requiring 'kill -KILL ...'"
Fix the regression by copying each word of the sigset, not only the
first one.
__get_user_sigset() was tentatively optimised to copy 64 bits at once
in order to minimise KUAP unlock/lock impact, but the unsafe variant
doesn't suffer that, so it can just copy words.
Fixes: 887f3ceb51 ("powerpc/signal32: Convert do_setcontext[_tm]() to user access block")
Cc: stable@vger.kernel.org # v5.13+
Reported-by: Finn Thain <fthain@linux-m68k.org>
Reported-and-tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/99ef38d61c0eb3f79c68942deb0c35995a93a777.1636966353.git.christophe.leroy@csgroup.eu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4a5cb51f3d upstream.
The check_return_regs_valid() can cause a false positive if the return
regs are marked as norestart and they are an HSRR type interrupt,
because the low bit in the bottom of regs->trap causes interrupt type
matching to fail.
This can occcur for example on bare metal with a HV privileged doorbell
interrupt that causes a signal, but do_signal returns early because
get_signal() fails, and takes the "No signal to deliver" path. In this
case no signal was delivered so the return location is not changed so
return SRRs are not invalidated, yet set_trap_norestart is called, which
messes up the match. Building go-1.16.6 is known to reproduce this.
Fix it by using the TRAP() accessor which masks out the low bit.
Fixes: 6eaaf9de35 ("powerpc/64s/interrupt: Check and fix srr_valid without crashing")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211026122531.3599918-1-npiggin@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 81291383ff upstream.
A e5500 machine running a 32-bit kernel sometimes hangs at boot,
seemingly going into an infinite loop of instruction storage interrupts.
The ESR (Exception Syndrome Register) has a value of 0x800000 (store)
when this happens, which is likely set by a previous store. An
instruction TLB miss interrupt would then leave ESR unchanged, and if no
PTE exists it calls directly to the instruction storage interrupt
handler without changing ESR.
access_error() does not cause a segfault due to a store to a read-only
vma because is_exec is true. Most subsequent fault handling does not
check for a write fault on a read-only vma, and might do strange things
like create a writeable PTE or call page_mkwrite on a read only vma or
file. It's not clear what happens here to cause the infinite faulting in
this case, a fault handler failure or low level PTE or TLB handling.
In any case this can be fixed by having the instruction storage
interrupt zero regs->dsisr rather than storing the ESR value to it.
Fixes: a01a3f2ddb ("powerpc: remove arguments from fault handler functions")
Cc: stable@vger.kernel.org # v5.12+
Reported-by: Jacques de Laval <jacques.delaval@protonmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Jacques de Laval <jacques.delaval@protonmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211028133043.4159501-1-npiggin@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull powerpc fixes from Michael Ellerman:
- Fix a bug exposed by a previous fix, where running guests with
certain SMT topologies could crash the host on Power8.
- Fix atomic sleep warnings when re-onlining CPUs, when PREEMPT is
enabled.
Thanks to Nathan Lynch, Srikar Dronamraju, and Valentin Schneider.
* tag 'powerpc-5.15-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/smp: do not decrement idle task preempt count in CPU offline
powerpc/idle: Don't corrupt back chain when going idle
With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
get:
BUG: scheduling while atomic: swapper/1/0/0x00000000
no locks held by swapper/1/0.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
Call Trace:
dump_stack_lvl+0xac/0x108
__schedule_bug+0xac/0xe0
__schedule+0xcf8/0x10d0
schedule_idle+0x3c/0x70
do_idle+0x2d8/0x4a0
cpu_startup_entry+0x38/0x40
start_secondary+0x2ec/0x3a0
start_secondary_prolog+0x10/0x14
This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
preempt count, for reasons explained in commit a7c2bb8279 ("powerpc:
Re-enable preemption before cpu_die()"), specifically "start_secondary()
expects a preempt_count() of 0."
However, since commit 2c669ef697 ("powerpc/preempt: Don't touch the idle
task's preempt_count during hotplug") and commit f1a0a376ca ("sched/core:
Initialize the idle task with preemption disabled"), that justification no
longer holds.
The idle task isn't supposed to re-enable preemption, so remove the
vestigial preempt_enable() from the CPU offline path.
Tested with pseries and powernv in qemu, and pseries on PowerVM.
Fixes: 2c669ef697 ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211015173902.2278118-1-nathanl@linux.ibm.com
In isa206_idle_insn_mayloss() we store various registers into the stack
red zone, which is allowed.
However inside the IDLE_STATE_ENTER_SEQ_NORET macro we save r2 again,
to 0(r1), which corrupts the stack back chain.
We used to do the same in isa206_idle_insn_mayloss() itself, but we
fixed that in 73287caa92 ("powerpc64/idle: Fix SP offsets when saving
GPRs"), however we missed that the macro also corrupts the back chain.
Corrupting the back chain is bad for debuggability but doesn't
necessarily cause a bug.
However we recently changed the stack handling in some KVM code, and it
now relies on the stack back chain being valid when it returns. The
corruption causes that code to return with r1 pointing somewhere in
kernel data, at some point LR is restored from the stack and we branch
to NULL or somewhere else invalid.
Only affects Power8 hosts running KVM guests, with dynamic_mt_modes
enabled (which it is by default).
The fixes tag below points to the commit that changed the KVM stack
handling, exposing this bug. The actual corruption of the back chain has
always existed since 948cf67c47 ("powerpc: Add NAP mode support on
Power7 in HV mode").
Fixes: 9b4416c509 ("KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020094826.3222052-1-mpe@ellerman.id.au
Pull powerpc fixes from Michael Ellerman:
"A bit of a big batch, partly because I didn't send any last week, and
also just because the BPF fixes happened to land this week.
Summary:
- Fix a regression hit by the IPR SCSI driver, introduced by the
recent addition of MSI domains on pseries.
- A big series including 8 BPF fixes, some with potential security
impact and the rest various code generation issues.
- Fix our program check assembler entry path, which was accidentally
jumping into a gas macro and generating strange stack frames, which
could confuse find_bug().
- A couple of fixes, and related changes, to fix corner cases in our
machine check handling.
- Fix our DMA IOMMU ops, which were not always returning the optimal
DMA mask, leading to at least one device falling back to 32-bit DMA
when it shouldn't.
- A fix for KUAP handling on 32-bit Book3S.
- Fix crashes seen when kdumping on some pseries systems.
Thanks to Naveen N. Rao, Nicholas Piggin, Alexey Kardashevskiy, Cédric
Le Goater, Christophe Leroy, Mahesh Salgaonkar, Abdul Haleem,
Christoph Hellwig, Johan Almbladh, Stan Johnson"
* tag 'powerpc-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
pseries/eeh: Fix the kdump kernel crash during eeh_pseries_init
powerpc/32s: Fix kuap_kernel_restore()
powerpc/pseries/msi: Add an empty irq_write_msi_msg() handler
powerpc/64s: Fix unrecoverable MCE calling async handler from NMI
powerpc/64/interrupt: Reconcile soft-mask state in NMI and fix false BUG
powerpc/64: warn if local irqs are enabled in NMI or hardirq context
powerpc/traps: do not enable irqs in _exception
powerpc/64s: fix program check interrupt emergency stack path
powerpc/bpf ppc32: Fix BPF_SUB when imm == 0x80000000
powerpc/bpf ppc32: Do not emit zero extend instruction for 64-bit BPF_END
powerpc/bpf ppc32: Fix JMP32_JSET_K
powerpc/bpf ppc32: Fix ALU32 BPF_ARSH operation
powerpc/bpf: Emit stf barrier instruction sequences for BPF_NOSPEC
powerpc/security: Add a helper to query stf_barrier type
powerpc/bpf: Fix BPF_SUB when imm == 0x80000000
powerpc/bpf: Fix BPF_MOD when imm == 1
powerpc/bpf: Validate branch ranges
powerpc/lib: Add helper to check if offset is within conditional branch range
powerpc/iommu: Report the correct most efficient DMA mask for PCI devices
The machine check handler is not considered NMI on 64s. The early
handler is the true NMI handler, and then it schedules the
machine_check_exception handler to run when interrupts are enabled.
This works fine except the case of an unrecoverable MCE, where the true
NMI is taken when MSR[RI] is clear, it can not recover, so it calls
machine_check_exception directly so something might be done about it.
Calling an async handler from NMI context can result in irq state and
other things getting corrupted. This can also trigger the BUG at
arch/powerpc/include/asm/interrupt.h:168
BUG_ON(!arch_irq_disabled_regs(regs) && !(regs->msr & MSR_EE));
Fix this by making an _async version of the handler which is called
in the normal case, and a NMI version that is called for unrecoverable
interrupts.
Fixes: 2b43dd7653 ("powerpc/64: enable MSR[EE] in irq replay pt_regs")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211004145642.1331214-6-npiggin@gmail.com
This can help catch bugs such as the one fixed by the previous change
to prevent _exception() from enabling irqs.
ppc32 could have a similar warning but it has no good config option to
debug this stuff (the test may be overkill to add for production
kernels).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211004145642.1331214-4-npiggin@gmail.com
_exception can be called by machine check handlers when the MCE hits
user code (e.g., pseries and powernv). This will enable local irqs
because, which is a dicey thing to do in NMI or hard irq context.
This seemed to worked out okay because a userspace MCE can basically be
treated like a synchronous interrupt (after async / imprecise MCEs are
filtered out). Since NMI and hard irq handlers have started growing
nmi_enter / irq_enter, and more irq state sanity checks, this has
started to cause problems (or at least trigger warnings).
The Fixes tag to the commit which introduced this rather than try to
work out exactly which commit was the first that could possibly cause a
problem because that may be difficult to prove.
Fixes: 9f2f79e3a3 ("powerpc: Disable interrupts in 64-bit kernel FP and vector faults")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211004145642.1331214-3-npiggin@gmail.com
According to dma-api.rst, the dma_get_required_mask() helper should return
"the mask that the platform requires to operate efficiently". Which in
the case of PPC64 means the bypass mask and not a mask from an IOMMU table
which is shorter and slower to use due to map/unmap operations (especially
expensive on "pseries").
However the existing implementation ignores the possibility of bypassing
and returns the IOMMU table mask on the pseries platform which makes some
drivers (mpt3sas is one example) choose 32bit DMA even though bypass is
supported. The powernv platform sort of handles it by having a bigger
default window with a mask >=40 but it only works as drivers choose
63/64bit if the required mask is >32 which is rather pointless.
This reintroduces the bypass capability check to let drivers make
a better choice of the DMA mask.
Fixes: f1565c24b5 ("powerpc: use the generic dma_ops_bypass mode")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210930034454.95794-1-aik@ozlabs.ru
Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now
that the two function are always called back-to-back by architectures
that have rseq. The rseq helper is stubbed out for architectures that
don't support rseq, i.e. this is a nop across the board.
Note, tracehook_notify_resume() is horribly named and arguably does not
belong in tracehook.h as literally every line of code in it has nothing
to do with tracing. But, that's been true since commit a42c6ded82
("move key_repace_session_keyring() into tracehook_notify_resume()")
first usurped tracehook_notify_resume() back in 2012. Punt cleaning that
mess up to future patches.
No functional change intended.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210901203030.1292304-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We queue an irq work for deferred processing of mce event in realmode
mce handler, where translation is disabled. Queuing of the work may
result in accessing memory outside RMO region, such access needs the
translation to be enabled for an LPAR running with hash mmu else the
kernel crashes.
After enabling translation in mce_handle_error() we used to leave it
enabled to avoid crashing here, but now with the commit
74c3354bc1 ("powerpc/pseries/mce: restore msr before returning from
handler") we are restoring the MSR to disable translation.
Hence to fix this enable the translation before queuing the work.
Without this change following trace is seen on injecting SLB multihit in
an LPAR running with hash mmu.
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 5 PID: 1883 Comm: insmod Tainted: G OE 5.14.0-mce+ #137
NIP: c000000000735d60 LR: c000000000318640 CTR: 0000000000000000
REGS: c00000001ebff9a0 TRAP: 0300 Tainted: G OE (5.14.0-mce+)
MSR: 8000000000001003 <SF,ME,RI,LE> CR: 28008228 XER: 00000001
CFAR: c00000000031863c DAR: c00000027fa8fe08 DSISR: 40000000 IRQMASK: 0
...
NIP llist_add_batch+0x0/0x40
LR __irq_work_queue_local+0x70/0xc0
Call Trace:
0xc00000001ebffc0c (unreliable)
irq_work_queue+0x40/0x70
machine_check_queue_event+0xbc/0xd0
machine_check_early_common+0x16c/0x1f4
Fixes: 74c3354bc1 ("powerpc/pseries/mce: restore msr before returning from handler")
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
[mpe: Fix comment formatting, trim oops in change log for readability]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210909064330.312432-1-ganeshgr@linux.ibm.com
The rfscv instruction does not work correctly with the fake-suspend mode
in POWER9, which can end up with the hypervisor restoring an incorrect
checkpoint.
Work around this by setting the _TIF_RESTOREALL flag if a system call
returns to a transaction active state, causing rfid to be used instead
of rfscv to return, which will do the right thing. The contents of the
registers are irrelevant because they will be overwritten in this case
anyway.
Fixes: 7fa95f9ada ("powerpc/64s: system call support for scv/rfscv instructions")
Reported-by: Eirik Fuller <efuller@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210908101718.118522-1-npiggin@gmail.com
If a system call is made with a transaction active, the kernel
immediately aborts it and returns. scv system calls disable irqs even
earlier in their interrupt handler, and tabort_syscall does not fix this
up.
This can result in irq soft-mask state being messed up on the next
kernel entry, and crashing at BUG_ON(arch_irq_disabled_regs(regs)) in
the kernel exit handlers, or possibly worse.
This can't easily be fixed in asm because at this point an async irq may
have hit, which is soft-masked and marked pending. The pending interrupt
has to be replayed before returning to userspace. The fix is to move the
tabort_syscall code to C in the main syscall handler, and just skip the
system call but otherwise return as usual, which will take care of the
pending irqs. This also does a bunch of other things including possible
signal delivery to the process, but the doomed transaction should still
be aborted when it is eventually returned to.
The sc system call path is changed to use the new C function as well to
reduce code and path differences. This slows down how quickly system
calls are aborted when called while a transaction is active, which could
potentially impact TM performance. But making any system call is already
bad for performance, and TM is on the way out, so go with simpler over
faster.
Fixes: 7fa95f9ada ("powerpc/64s: system call support for scv/rfscv instructions")
Reported-by: Eirik Fuller <efuller@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Use #ifdef rather than IS_ENABLED() to fix build error on 32-bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210903125707.1601269-1-npiggin@gmail.com
Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769.
Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...
Pull Kbuild updates from Masahiro Yamada:
- Add -s option (strict mode) to merge_config.sh to make it fail when
any symbol is redefined.
- Show a warning if a different compiler is used for building external
modules.
- Infer --target from ARCH for CC=clang to let you cross-compile the
kernel without CROSS_COMPILE.
- Make the integrated assembler default (LLVM_IAS=1) for CC=clang.
- Add <linux/stdarg.h> to the kernel source instead of borrowing
<stdarg.h> from the compiler.
- Add Nick Desaulniers as a Kbuild reviewer.
- Drop stale cc-option tests.
- Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG
to handle symbols in inline assembly.
- Show a warning if 'FORCE' is missing for if_changed rules.
- Various cleanups
* tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits)
kbuild: redo fake deps at include/ksym/*.h
kbuild: clean up objtool_args slightly
modpost: get the *.mod file path more simply
checkkconfigsymbols.py: Fix the '--ignore' option
kbuild: merge vmlinux_link() between ARCH=um and other architectures
kbuild: do not remove 'linux' link in scripts/link-vmlinux.sh
kbuild: merge vmlinux_link() between the ordinary link and Clang LTO
kbuild: remove stale *.symversions
kbuild: remove unused quiet_cmd_update_lto_symversions
gen_compile_commands: extract compiler command from a series of commands
x86: remove cc-option-yn test for -mtune=
arc: replace cc-option-yn uses with cc-option
s390: replace cc-option-yn uses with cc-option
ia64: move core-y in arch/ia64/Makefile to arch/ia64/Kbuild
sparc: move the install rule to arch/sparc/Makefile
security: remove unneeded subdir-$(CONFIG_...)
kbuild: sh: remove unused install script
kbuild: Fix 'no symbols' warning when CONFIG_TRIM_UNUSD_KSYMS=y
kbuild: Switch to 'f' variants of integrated assembler flag
kbuild: Shuffle blank line to improve comment meaning
...