Commit Graph

133280 Commits

Author SHA1 Message Date
Takashi Iwai
3895d353f4 ALSA: control: Use deferred fasync helper
[ Upstream commit 4a971e84a7 ]

To avoid the potential deadlock via the kill_fasync() call, use the
new fasync helpers to defer the invocation from the control API.  Note
that it's merely a workaround.

Another note: although we haven't received reports of the deadlock
with the control API, the deadlock is still theoretically possible, and
it's better to align the behavior with the other core APIs (PCM and
timer); so let's convert it along with them.

Link: https://lore.kernel.org/r/20220728125945.29533-5-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-25 11:40:44 +02:00
Takashi Iwai
60110fd266 ALSA: core: Add async signal helpers
[ Upstream commit ef34a0ae7a ]

Currently the call of kill_fasync() from an interrupt handler might
lead to potential spin deadlocks, as spotted by syzkaller.
Unfortunately, it's not trivial to fix this lock chain as it's
entangled with the tasklist_lock, which is touched all over the place.

As a temporary workaround, this patch provides a way to defer the
async signal notification to a work item.  The new helper functions,
snd_fasync_helper() and snd_kill_fasync(), are replacements for
fasync_helper() and kill_fasync(), respectively.  In addition,
snd_fasync_free() needs to be called in the destructor of the relevant
file object.
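
As a rough, illustrative sketch (the 'snd_foo' driver is hypothetical and
the snd_* signatures are paraphrased from this description), a conversion
in a driver might look like:

    /* Hedged sketch only; 'snd_foo' is a made-up driver. */
    struct snd_foo {
            struct snd_fasync *fasync;      /* was: struct fasync_struct *fasync */
    };

    static int snd_foo_fasync(int fd, struct file *file, int on)
    {
            struct snd_foo *foo = file->private_data;

            /* was: return fasync_helper(fd, file, on, &foo->fasync); */
            return snd_fasync_helper(fd, file, on, &foo->fasync);
    }

    /* may be called from an interrupt handler */
    static void snd_foo_notify(struct snd_foo *foo)
    {
            /* was: kill_fasync(&foo->fasync, SIGIO, POLL_IN); */
            snd_kill_fasync(foo->fasync, SIGIO, POLL_IN);
    }

    static int snd_foo_release(struct inode *inode, struct file *file)
    {
            struct snd_foo *foo = file->private_data;

            snd_fasync_free(foo->fasync);   /* new: release the deferred state */
            kfree(foo);
            return 0;
    }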

Link: https://lore.kernel.org/r/20220728125945.29533-2-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-25 11:40:44 +02:00
Laurent Dufour
6568e52b28 watchdog: export lockup_detector_reconfigure
[ Upstream commit 7c56a8733d ]

In some circumstances it may be useful to reconfigure the watchdog
from inside the kernel.

On PowerPC, this may be helpful before and after an LPAR migration (LPM) is
initiated, because the migration introduces some latencies and the watchdog,
especially the NMI watchdog, is expected to be triggered during this
operation. Reconfiguring the watchdog with a larger factor would prevent it
from firing too frequently during LPM.

Rename lockup_detector_reconfigure() as __lockup_detector_reconfigure() and
create a new function lockup_detector_reconfigure() calling
__lockup_detector_reconfigure() under the protection of watchdog_mutex.
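
A minimal sketch of the resulting split, assuming the existing
reconfiguration body simply moves into the double-underscore variant:

    /* sketch: unlocked worker plus a locked public wrapper */
    static void __lockup_detector_reconfigure(void)
    {
            /* previous body of lockup_detector_reconfigure() */
    }

    void lockup_detector_reconfigure(void)
    {
            mutex_lock(&watchdog_mutex);
            __lockup_detector_reconfigure();
            mutex_unlock(&watchdog_mutex);
    }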

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
[mpe: Squash in build fix from Laurent, reported by Sachin]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220713154729.80789-3-ldufour@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-25 11:40:43 +02:00
Yunfei Wang
2097c78351 iommu/io-pgtable-arm-v7s: Add a quirk to allow pgtable PA up to 35bit
[ Upstream commit bfdd231374 ]

The single memory zone feature will remove ZONE_DMA32 and ZONE_DMA, which
can cause the pgtable PA size to be larger than 32 bits.

Since the MediaTek IOMMU hardware supports at most a 35-bit PA in the
pgtable, add a quirk to allow pgtable PAs of up to 35 bits.

Signed-off-by: Ning Li <ning.li@mediatek.com>
Signed-off-by: Yunfei Wang <yf.wang@mediatek.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20220630092927.24925-2-yf.wang@mediatek.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-25 11:40:41 +02:00
Jean-Philippe Brucker
50de504581 uacce: Handle parent device removal or parent driver module rmmod
[ Upstream commit 80fc671bcc ]

The uacce driver must deal with removal of the parent device, or rmmod of
the parent driver module, happening at any time.

Although uacce_remove(), called on device removal and on driver unbind,
prevents future use of the uacce fops by removing the cdev, fops that
were called before that point may still be running.

Serialize uacce_fops_open() and uacce_remove() with uacce->mutex.
Serialize other fops against uacce_remove() with q->mutex.
Since we need to protect uacce_fops_poll() which gets called on the fast
path, replace uacce->queues_lock with q->mutex to improve scalability.
The other fops are only used during setup.

uacce_queue_is_valid(), checked under q->mutex or uacce->mutex, denotes
whether uacce_remove() has disabled all queues. If that is the case,
don't go any further since the parent device is being removed and
uacce->ops should not be called anymore.

Reported-by: Yang Shen <shenyang39@huawei.com>
Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Link: https://lore.kernel.org/r/20220701034843.7502-1-zhangfei.gao@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-25 11:40:34 +02:00
Chuck Lever
d3c262f584 SUNRPC: Fix xdr_encode_bool()
commit c770f31d8f upstream.

I discovered that xdr_encode_bool() was returning the same address
that was passed in the @p parameter. The documenting comment states
that the intent is to return the address of the next buffer
location, just like the other "xdr_encode_*" helpers.

The result was that the encoded results of NFSv3 PATHCONF operations were
not formed correctly.
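
For illustration, a hedged paraphrase of the intended helper (not the exact
SUNRPC code), honoring the "return the next buffer location" convention:

    static inline __be32 *xdr_encode_bool(__be32 *p, u32 n)
    {
            *p++ = n ? xdr_one : xdr_zero;
            return p;       /* next buffer location, not the passed-in @p */
    }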

Fixes: ded04a587f ("NFSD: Update the NFSv3 PATHCONF3res encoder to use struct xdr_stream")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-25 11:40:05 +02:00
Jinghao Jia
41fd6cc88a BPF: Fix potential bad pointer dereference in bpf_sys_bpf()
commit e2dcac2f58 upstream.

The bpf_sys_bpf() helper function allows an eBPF program to load another
eBPF program from within the kernel. In this case the argument union
bpf_attr pointer (as well as the insns and license pointers inside) is a
kernel address instead of a userspace address (which is the case of a
usual bpf() syscall). To make the memory copying process in the syscall
work in both cases, bpfptr_t was introduced to wrap around the pointer
and distinguish its origin. Specifically, when copying memory contents
from a bpfptr_t, a copy_from_user() is performed in case of a userspace
address and a memcpy() is performed for a kernel address.

This can lead to problems because the in-kernel pointer is never checked
for validity. The problem happens when an eBPF syscall program tries to
call bpf_sys_bpf() to load a program but provides a bad insns pointer --
say 0xdeadbeef -- in the bpf_attr union. The helper calls __sys_bpf()
which would then call bpf_prog_load() to load the program.
bpf_prog_load() is responsible for copying the eBPF instructions to the
newly allocated memory for the program; it creates a kernel bpfptr_t for
insns and invokes copy_from_bpfptr(). Internally, all bpfptr_t
operations are backed by the corresponding sockptr_t operations, which
perform a direct memcpy() on kernel pointers for the copy_from/strncpy_from
operations. Therefore, the code happily dereferences the bad
pointer, triggering an unhandleable page fault and in turn an oops.
However, this is not supposed to happen because at that point the eBPF
program is already verified and should not cause a memory error.

Sample KASAN trace:

[   25.685056][  T228] ==================================================================
[   25.685680][  T228] BUG: KASAN: user-memory-access in copy_from_bpfptr+0x21/0x30
[   25.686210][  T228] Read of size 80 at addr 00000000deadbeef by task poc/228
[   25.686732][  T228]
[   25.686893][  T228] CPU: 3 PID: 228 Comm: poc Not tainted 5.19.0-rc7 #7
[   25.687375][  T228] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS d55cb5a 04/01/2014
[   25.687991][  T228] Call Trace:
[   25.688223][  T228]  <TASK>
[   25.688429][  T228]  dump_stack_lvl+0x73/0x9e
[   25.688747][  T228]  print_report+0xea/0x200
[   25.689061][  T228]  ? copy_from_bpfptr+0x21/0x30
[   25.689401][  T228]  ? _printk+0x54/0x6e
[   25.689693][  T228]  ? _raw_spin_lock_irqsave+0x70/0xd0
[   25.690071][  T228]  ? copy_from_bpfptr+0x21/0x30
[   25.690412][  T228]  kasan_report+0xb5/0xe0
[   25.690716][  T228]  ? copy_from_bpfptr+0x21/0x30
[   25.691059][  T228]  kasan_check_range+0x2bd/0x2e0
[   25.691405][  T228]  ? copy_from_bpfptr+0x21/0x30
[   25.691734][  T228]  memcpy+0x25/0x60
[   25.692000][  T228]  copy_from_bpfptr+0x21/0x30
[   25.692328][  T228]  bpf_prog_load+0x604/0x9e0
[   25.692653][  T228]  ? cap_capable+0xb4/0xe0
[   25.692956][  T228]  ? security_capable+0x4f/0x70
[   25.693324][  T228]  __sys_bpf+0x3af/0x580
[   25.693635][  T228]  bpf_sys_bpf+0x45/0x240
[   25.693937][  T228]  bpf_prog_f0ec79a5a3caca46_bpf_func1+0xa2/0xbd
[   25.694394][  T228]  bpf_prog_run_pin_on_cpu+0x2f/0xb0
[   25.694756][  T228]  bpf_prog_test_run_syscall+0x146/0x1c0
[   25.695144][  T228]  bpf_prog_test_run+0x172/0x190
[   25.695487][  T228]  __sys_bpf+0x2c5/0x580
[   25.695776][  T228]  __x64_sys_bpf+0x3a/0x50
[   25.696084][  T228]  do_syscall_64+0x60/0x90
[   25.696393][  T228]  ? fpregs_assert_state_consistent+0x50/0x60
[   25.696815][  T228]  ? exit_to_user_mode_prepare+0x36/0xa0
[   25.697202][  T228]  ? syscall_exit_to_user_mode+0x20/0x40
[   25.697586][  T228]  ? do_syscall_64+0x6e/0x90
[   25.697899][  T228]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   25.698312][  T228] RIP: 0033:0x7f6d543fb759
[   25.698624][  T228] Code: 08 5b 89 e8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 a6 0e 00 f7 d8 64 89 01 48
[   25.699946][  T228] RSP: 002b:00007ffc3df78468 EFLAGS: 00000287 ORIG_RAX: 0000000000000141
[   25.700526][  T228] RAX: ffffffffffffffda RBX: 00007ffc3df78628 RCX: 00007f6d543fb759
[   25.701071][  T228] RDX: 0000000000000090 RSI: 00007ffc3df78478 RDI: 000000000000000a
[   25.701636][  T228] RBP: 00007ffc3df78510 R08: 0000000000000000 R09: 0000000000300000
[   25.702191][  T228] R10: 0000000000000005 R11: 0000000000000287 R12: 0000000000000000
[   25.702736][  T228] R13: 00007ffc3df78638 R14: 000055a1584aca68 R15: 00007f6d5456a000
[   25.703282][  T228]  </TASK>
[   25.703490][  T228] ==================================================================
[   25.704050][  T228] Disabling lock debugging due to kernel taint

Update copy_from_bpfptr() and strncpy_from_bpfptr() so that (see the sketch
below):
 - for a kernel pointer, they use the safe copy_from_kernel_nofault() and
   strncpy_from_kernel_nofault() functions.
 - for a userspace pointer, they perform copy_from_user() and
   strncpy_from_user().
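
A hedged sketch of the dispatch this describes; the field and helper names
are illustrative and may not match the in-tree code exactly:

    static inline int copy_from_bpfptr_offset(void *dst, bpfptr_t src,
                                              size_t offset, size_t size)
    {
            if (!bpfptr_is_kernel(src)) {
                    if (copy_from_user(dst, src.user + offset, size))
                            return -EFAULT;
                    return 0;
            }
            /* kernel pointer: never blindly memcpy(), fault safely instead */
            return copy_from_kernel_nofault(dst, src.kernel + offset, size);
    }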

Fixes: af2ac3e13e ("bpf: Prepare bpf syscall to be used from kernel and user space.")
Link: https://lore.kernel.org/bpf/20220727132905.45166-1-jinghao@linux.ibm.com/
Signed-off-by: Jinghao Jia <jinghao@linux.ibm.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20220729201713.88688-1-jinghao@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-25 11:40:02 +02:00
Hector Martin
1b7e0482ab locking/atomic: Make test_and_*_bit() ordered on failure
commit 415d832497 upstream.

These operations are documented as always ordered in
include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
type use cases where one side needs to ensure a flag is left pending
after some shared data was updated rely on this ordering, even in the
failure case.

This is the case with the workqueue code, which currently suffers from a
reproducible ordering violation on Apple M1 platforms (which are
notoriously out-of-order) that ends up causing the TTY layer to fail to
deliver data to userspace properly under the right conditions.  This
change fixes that bug.

Change the documentation to restrict the "no order on failure" story to
the _lock() variant (for which it makes sense), and remove the
early-exit from the generic implementation, which is what causes the
missing barrier semantics in that case.  Without this, the remaining
atomic op is fully ordered (including on ARM64 LSE, as of recent
versions of the architecture spec).
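
As a userspace analogue (C11 atomics, not the kernel implementation), the
early-exit variant versus the always-RMW variant looks like this; only the
latter orders surrounding accesses when the bit is already set:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* early-exit variant: a relaxed load gives no ordering at all on the
     * "bit already set" (failure) path */
    static bool test_and_set_bit_weak(atomic_ulong *word, unsigned long mask)
    {
            if (atomic_load_explicit(word, memory_order_relaxed) & mask)
                    return true;
            return atomic_fetch_or(word, mask) & mask;
    }

    /* always-RMW variant: the seq_cst fetch_or is fully ordered even
     * when the bit was already set */
    static bool test_and_set_bit_ordered(atomic_ulong *word, unsigned long mask)
    {
            return atomic_fetch_or(word, mask) & mask;
    }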

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: e986a0d6cb ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
Fixes: 61e02392d3 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
Signed-off-by: Hector Martin <marcan@marcan.st>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-25 11:39:54 +02:00
Kees Cook
3cfd07084b tracing/perf: Avoid -Warray-bounds warning for __rel_loc macro
commit c6d777acdf upstream.

As done for trace_events.h, also fix the __rel_loc macro in perf.h,
which silences the -Warray-bounds warning:

In file included from ./include/linux/string.h:253,
                 from ./include/linux/bitmap.h:11,
                 from ./include/linux/cpumask.h:12,
                 from ./include/linux/mm_types_task.h:14,
                 from ./include/linux/mm_types.h:5,
                 from ./include/linux/buildid.h:5,
                 from ./include/linux/module.h:14,
                 from samples/trace_events/trace-events-sample.c:2:
In function '__fortify_strcpy',
    inlined from 'perf_trace_foo_rel_loc' at samples/trace_events/./trace-events-sample.h:519:1:
./include/linux/fortify-string.h:47:33: warning: '__builtin_strcpy' offset 12 is out of the bounds [0, 4] [-Warray-bounds]
   47 | #define __underlying_strcpy     __builtin_strcpy
      |                                 ^
./include/linux/fortify-string.h:445:24: note: in expansion of macro '__underlying_strcpy'
  445 |                 return __underlying_strcpy(p, q);
      |                        ^~~~~~~~~~~~~~~~~~~

Also make __data struct member a proper flexible array to avoid future
problems.

Link: https://lkml.kernel.org/r/20220125220037.2738923-1-keescook@chromium.org

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 55de2c0b56 ("tracing: Add '__rel_loc' using trace event macros")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17 14:24:30 +02:00
Steven Rostedt (Google)
f467478d15 tracing: Use a struct alignof to determine trace event field alignment
commit 4c3d2f9388 upstream.

alignof() gives an alignment of types as they would be as standalone
variables. But alignment in structures might be different, and when
building the fields of events, the alignment must be the actual
alignment otherwise the field offsets may not match what they actually
are.

This caused trace-cmd to crash, as libtraceevent did not check if the
field offset was bigger than the event. The write_msr and read_msr
events on 32 bit had their fields incorrect, because they had a u64 field
between two ints. alignof(u64) would give 8, but the u64 field was at a
4-byte alignment.

Define a macro as:

   ALIGN_STRUCTFIELD(type) ((int)(offsetof(struct {char a; type b;}, b)))

which gives the actual alignment of types in a structure.
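
A small standalone demo (GCC/Clang, illustrative only) contrasting the two
views; on a 32-bit x86 build the two values typically differ (8 vs 4),
while on x86-64 both print 8:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    #define ALIGN_STRUCTFIELD(type) ((int)(offsetof(struct {char a; type b;}, b)))

    int main(void)
    {
            /* standalone alignment vs. alignment actually used in a struct */
            printf("__alignof__(uint64_t)       = %d\n", (int)__alignof__(uint64_t));
            printf("ALIGN_STRUCTFIELD(uint64_t) = %d\n", ALIGN_STRUCTFIELD(uint64_t));
            return 0;
    }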

Link: https://lkml.kernel.org/r/20220731015928.7ab3a154@rorschach.local.home

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 04ae87a520 ("ftrace: Rework event_create_dir()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17 14:24:29 +02:00
Huacai Chen
9e7dab7eda tpm: eventlog: Fix section mismatch for DEBUG_SECTION_MISMATCH
[ Upstream commit bed4593645 ]

If DEBUG_SECTION_MISMATCH is enabled, __calc_tpm2_event_size() will not be
inlined; this causes a section mismatch like this:

WARNING: modpost: vmlinux.o(.text.unlikely+0xe30c): Section mismatch in reference from the variable L0 to the function .init.text:early_ioremap()
The function L0() references
the function __init early_memremap().
This is often because L0 lacks a __init
annotation or the annotation of early_ioremap is wrong.

Fix it by using __always_inline instead of inline for the called-once
function __calc_tpm2_event_size().

Fixes: 44038bc514 ("tpm: Abstract crypto agile event size calculations")
Cc: stable@vger.kernel.org # v5.3
Reported-by: WANG Xuerui <git@xen0n.name>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:28 +02:00
Masami Hiramatsu
3c0a5a0e1c tracing: Avoid -Warray-bounds warning for __rel_loc macro
[ Upstream commit 58c5724ec2 ]

Since -Warray-bounds checks the destination size from the type of the given
pointer, the __assign_rel_str() macro gets warned about because it passes a
pointer to the 'u32' field instead of the 'trace_event_raw_*' data structure.
Pass the data address calculated from the 'trace_event_raw_*' structure
instead of the 'u32' __rel_loc field.

Link: https://lkml.kernel.org/r/20220125233154.dac280ed36944c0c2fe6f3ac@kernel.org

Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
[ This did not fix the warning, but is still a nice clean up ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:26 +02:00
Masami Hiramatsu
994dea8549 tracing: Add '__rel_loc' using trace event macros
[ Upstream commit 55de2c0b56 ]

Add '__rel_loc' using trace event macros. These macros are usually
not used in the kernel, except for testing purposes.
This also adds "rel_" variants of the macros for dynamic_array, string,
and bitmask.

Link: https://lkml.kernel.org/r/163757342119.510314.816029622439099016.stgit@devnote2

Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:26 +02:00
Rafael J. Wysocki
0eeb7037a4 ACPI: CPPC: Do not prevent CPPC from working in the future
[ Upstream commit 4f4179fcf4 ]

There is a problem with the current revision checks in
is_cppc_supported(): they essentially prevent the CPPC support
from working if a new _CPC package format revision is introduced in the
future and used by the platform firmware, even when that revision is a
proper superset of v3 that only causes _CPC to return a package with more
entries (while retaining the types and meaning of the entries defined by
v3).

In that case, as long as the number of entries in the _CPC return
package is at least CPPC_V3_NUM_ENT, it should be perfectly fine to
use the v3 support code and disregard the additional package entries
added by the new package format revision.

For this reason, drop is_cppc_supported() altogether, put the revision
checks directly into acpi_cppc_processor_probe() so they are easier to
follow and rework them to take the case mentioned above into account.

Fixes: 4773e77cdc ("ACPI / CPPC: Add support for CPPC v3")
Cc: 4.18+ <stable@vger.kernel.org> # 4.18+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:25 +02:00
Christoph Hellwig
b055781dd9 block: remove the struct blk_queue_ctx forward declaration
[ Upstream commit 9778ac77c2 ]

This type doesn't exist at all, so no need to forward declare it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:24 +02:00
Jason A. Donenfeld
3dd33a09f5 crypto: blake2s - remove shash module
[ Upstream commit 2d16803c56 ]

BLAKE2s has no currently known use as an shash. Just remove all of this
unnecessary plumbing. Removing this shash was something we talked about
back when we were making BLAKE2s a built-in, but I simply never got
around to doing it. So this completes that project.

Importantly, this fixes a bug in which the lib code depends on
crypto_simd_disabled_for_test, causing linker errors.

Also add more alignment tests to the selftests and compare SIMD and
non-SIMD compression functions, to make up for what we lose from
testmgr.c.

Reported-by: gaochao <gaochao49@huawei.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 6048fdcc5f ("lib/crypto: blake2s: include as built-in")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:19 +02:00
David Collins
1e0ca3d809 spmi: trace: fix stack-out-of-bound access in SPMI tracing functions
commit 2af28b241e upstream.

trace_spmi_write_begin() and trace_spmi_read_end() both call
memcpy() with a length of "len + 1".  This leads to one extra
byte being read beyond the end of the specified buffer.  Fix
this out-of-bound memory access by using a length of "len"
instead.

Here is a KASAN log showing the issue:

BUG: KASAN: stack-out-of-bounds in trace_event_raw_event_spmi_read_end+0x1d0/0x234
Read of size 2 at addr ffffffc0265b7540 by task thermal@2.0-ser/1314
...
Call trace:
 dump_backtrace+0x0/0x3e8
 show_stack+0x2c/0x3c
 dump_stack_lvl+0xdc/0x11c
 print_address_description+0x74/0x384
 kasan_report+0x188/0x268
 kasan_check_range+0x270/0x2b0
 memcpy+0x90/0xe8
 trace_event_raw_event_spmi_read_end+0x1d0/0x234
 spmi_read_cmd+0x294/0x3ac
 spmi_ext_register_readl+0x84/0x9c
 regmap_spmi_ext_read+0x144/0x1b0 [regmap_spmi]
 _regmap_raw_read+0x40c/0x754
 regmap_raw_read+0x3a0/0x514
 regmap_bulk_read+0x418/0x494
 adc5_gen3_poll_wait_hs+0xe8/0x1e0 [qcom_spmi_adc5_gen3]
 ...
 __arm64_sys_read+0x4c/0x60
 invoke_syscall+0x80/0x218
 el0_svc_common+0xec/0x1c8
 ...

addr ffffffc0265b7540 is located in stack of task thermal@2.0-ser/1314 at offset 32 in frame:
 adc5_gen3_poll_wait_hs+0x0/0x1e0 [qcom_spmi_adc5_gen3]

this frame has 1 object:
 [32, 33) 'status'

Memory state around the buggy address:
 ffffffc0265b7400: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
 ffffffc0265b7480: 04 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
>ffffffc0265b7500: 00 00 00 00 f1 f1 f1 f1 01 f3 f3 f3 00 00 00 00
                                           ^
 ffffffc0265b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffffffc0265b7600: f1 f1 f1 f1 01 f2 07 f2 f2 f2 01 f3 00 00 00 00
==================================================================

Fixes: a9fce37481 ("spmi: add command tracepoints for SPMI")
Cc: stable@vger.kernel.org
Reviewed-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: David Collins <quic_collinsd@quicinc.com>
Link: https://lore.kernel.org/r/20220627235512.2272783-1-quic_collinsd@quicinc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17 14:24:19 +02:00
Waiman Long
147f66d22f sched, cpuset: Fix dl_cpu_busy() panic due to empty cs->cpus_allowed
[ Upstream commit b6e8d40d43 ]

With cgroup v2, the cpuset's cpus_allowed mask can be empty, indicating
that the cpuset will just use the effective CPUs of its parent. So
cpuset_can_attach() can call task_can_attach() with an empty mask.
This can lead to cpumask_any_and() returning nr_cpu_ids, causing the call
to dl_bw_of() to crash due to a percpu value access with an out-of-bounds
CPU value. For example:

	[80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
	  :
	[80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
	  :
	[80468.207946] Call Trace:
	[80468.208947]  cpuset_can_attach+0xa0/0x140
	[80468.209953]  cgroup_migrate_execute+0x8c/0x490
	[80468.210931]  cgroup_update_dfl_csses+0x254/0x270
	[80468.211898]  cgroup_subtree_control_write+0x322/0x400
	[80468.212854]  kernfs_fop_write_iter+0x11c/0x1b0
	[80468.213777]  new_sync_write+0x11f/0x1b0
	[80468.214689]  vfs_write+0x1eb/0x280
	[80468.215592]  ksys_write+0x5f/0xe0
	[80468.216463]  do_syscall_64+0x5c/0x80
	[80468.224287]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
to be used by tasks within the cpuset anyway.

Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
reflect the change. In addition, a check is added to task_can_attach()
to guard against the possibility that cpumask_any_and() may return a
value >= nr_cpu_ids.
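
A hedged sketch of that guard inside task_can_attach() (paraphrased, with
'ret' being the function's existing return variable):

    if (dl_task(p) && !cpumask_empty(cs_effective_cpus)) {
            int cpu = cpumask_any_and(cpu_active_mask, cs_effective_cpus);

            if (unlikely(cpu >= nr_cpu_ids))
                    return -EINVAL; /* no usable CPU, don't touch per-CPU data */

            ret = dl_cpu_busy(cpu, p);
    }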

Fixes: 7f51412a41 ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:14 +02:00
Uwe Kleine-König
a8aa2a76ee mfd: t7l66xb: Drop platform disable callback
[ Upstream commit 128ac294e1 ]

None of the in-tree instantiations of struct t7l66xb_platform_data
provides a disable callback, so this function pointer had better not be
dereferenced unconditionally. As there is no user, drop it completely
instead of calling it conditionally.

This is a preparation for making platform remove callbacks return void.

Fixes: 1f192015ca ("mfd: driver for the T7L66XB TMIO SoC")
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Link: https://lore.kernel.org/r/20220530192430.2108217-3-u.kleine-koenig@pengutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:09 +02:00
Dan Carpenter
74824db8b2 kfifo: fix kfifo_to_user() return type
[ Upstream commit 045ed31e23 ]

The kfifo_to_user() macro is supposed to return zero for success or
negative error codes.  Unfortunately, there is a signedness bug so it
returns unsigned int.  This only affects callers which try to save the
result in ssize_t and as far as I can see the only place which does that
is line6_hwdep_read().

TL;DR: s/_uint/_int/.
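
A standalone illustration (hypothetical values, not the kfifo code itself)
of why the signedness matters once the result lands in an ssize_t:

    #include <stdio.h>
    #include <errno.h>
    #include <sys/types.h>

    int main(void)
    {
            unsigned int uret = -EFAULT;    /* what the buggy macro effectively yields */
            int iret = -EFAULT;             /* what it should yield */
            ssize_t a = uret, b = iret;

            /* a becomes a huge positive count, b is the expected negative errno */
            printf("unsigned path: %zd\nsigned path:   %zd\n", a, b);
            return 0;
    }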

Link: https://lkml.kernel.org/r/YrVL3OJVLlNhIMFs@kili
Fixes: 144ecf310e ("kfifo: fix kfifo_alloc() to return a signed int value")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:08 +02:00
Kent Overstreet
273c16d93f 9p: Add client parameter to p9_req_put()
[ Upstream commit 8b11ff098a ]

This is to aid in adding mempools, in the next patch.

Link: https://lkml.kernel.org/r/20220704014243.153050-2-kent.overstreet@gmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:07 +02:00
Kent Overstreet
3ac76cdab9 9p: Drop kref usage
[ Upstream commit 6cda12864c ]

An upcoming patch is going to require passing the client through
p9_req_put() -> p9_req_free(), but that's awkward with the kref
indirection - so this patch switches to using refcount_t directly.

Link: https://lkml.kernel.org/r/20220704014243.153050-1-kent.overstreet@gmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:07 +02:00
Dominique Martinet
eb3eed52a7 9p: fix a bunch of checkpatch warnings
[ Upstream commit 6e195b0f7c ]

Sohaib Mohamed started a series of tiny and incomplete checkpatch fixes but
seemingly stopped halfway -- take over and do most of it.
This is still missing net/9p/trans* and net/9p/protocol.c for a later
time...

Link: http://lkml.kernel.org/r/20211102134608.1588018-3-dominique.martinet@atmark-techno.com
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:24:07 +02:00
Gwendal Grignou
c41664346a iio: cros: Register FIFO callback after sensor is registered
[ Upstream commit 0b4ae3f6d1 ]

Instead of registering the callback to process sensor events right at
initialization time, wait for the sensor to be registered in the iio
subsystem.

Events can come at probe time (in case the kernel rebooted abruptly
without switching the sensor off, for instance), and be sent to the IIO core
before the sensor is fully registered.

Fixes: aa984f1ba4 ("iio: cros_ec: Register to cros_ec_sensorhub when EC supports FIFO")
Reported-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Gwendal Grignou <gwendal@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20220711144716.642617-1-gwendal@chromium.org
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:55 +02:00
Pierre-Louis Bossart
250b465051 soundwire: revisit driver bind/unbind and callbacks
[ Upstream commit bd29c00edd ]

In the SoundWire probe, we store a pointer from the driver ops into
the 'slave' structure. This can lead to kernel oopses when unbinding
codec drivers, e.g. with the following sequence to remove machine
driver and codec driver.

/sbin/modprobe -r snd_soc_sof_sdw
/sbin/modprobe -r snd_soc_rt711

The full details can be found in the BugLink below, for reference the
two following examples show different cases of driver ops/callbacks
being invoked after the driver .remove().

kernel: BUG: kernel NULL pointer dereference, address: 0000000000000150
kernel: Workqueue: events cdns_update_slave_status_work [soundwire_cadence]
kernel: RIP: 0010:mutex_lock+0x19/0x30
kernel: Call Trace:
kernel:  ? sdw_handle_slave_status+0x426/0xe00 [soundwire_bus 94ff184bf398570c3f8ff7efe9e32529f532e4ae]
kernel:  ? newidle_balance+0x26a/0x400
kernel:  ? cdns_update_slave_status_work+0x1e9/0x200 [soundwire_cadence 1bcf98eebe5ba9833cd433323769ac923c9c6f82]

kernel: BUG: unable to handle page fault for address: ffffffffc07654c8
kernel: Workqueue: pm pm_runtime_work
kernel: RIP: 0010:sdw_bus_prep_clk_stop+0x6f/0x160 [soundwire_bus]
kernel: Call Trace:
kernel:  <TASK>
kernel:  sdw_cdns_clock_stop+0xb5/0x1b0 [soundwire_cadence 1bcf98eebe5ba9833cd433323769ac923c9c6f82]
kernel:  intel_suspend_runtime+0x5f/0x120 [soundwire_intel aca858f7c87048d3152a4a41bb68abb9b663a1dd]
kernel:  ? dpm_sysfs_remove+0x60/0x60

This was not detected earlier in Intel tests since the tests first
remove the parent PCI device and shut down the bus. The sequence
above is a corner case which keeps the bus operational but without a
driver bound.

While trying to solve these kernel oopses, it became clear that the
existing SoundWire bus does not deal well with the unbind case.

Commit 528be501b7 ("soundwire: sdw_slave: add probe_complete structure and new fields")
added a 'probed' status variable and a 'probe_complete'
struct completion. This status is however not reset on remove and
likewise the 'probe complete' is not re-initialized, so the
bind/unbind/bind test cases would fail. The timeout used before the
'update_status' callback was also a bad idea in hindsight; there
should really be no timing assumption as to whether and when a driver is
bound to a device.

An initial draft based on device_lock() and device_unlock() was
tested. This proved too complicated, with deadlocks created during the
suspend-resume sequences, which also use the same device_lock/unlock()
as the bind/unbind sequences. On a CometLake device, a bad DSDT/BIOS
caused spurious resumes and the use of device_lock() caused hangs
during suspend. After multiple weeks of testing and painful
reverse-engineering of deadlocks on different devices, we looked for
alternatives that did not interfere with the device core.

A bus notifier was used successfully to keep track of DRIVER_BOUND and
DRIVER_UNBIND events. This solved the bind-unbind-bind case in tests,
but it can still be defeated with a theoretical corner case where the
memory is freed by a .remove while the callback is in use. The
notifier only helps make sure the driver callbacks are valid, but not
that the memory allocated in probe remains valid while the callbacks
are invoked.

This patch suggests the introduction of a new 'sdw_dev_lock' mutex
protecting probe/remove and all driver callbacks. Since this mutex is
'local' to SoundWire only, it does not interfere with existing locks
and does not create deadlocks. In addition, this patch removes the
'probe_complete' completion, instead we directly invoke the
'update_status' from the probe routine. That removes any sort of
timing dependency and a much better support for the device/driver
model, the driver could be bound before the bus started, or eons after
the bus started and the hardware would be properly initialized in all
cases.

BugLink: https://github.com/thesofproject/linux/issues/3531
Fixes: 56d4fe31af ("soundwire: Add MIPI DisCo property helpers")
Fixes: 528be501b7 ("soundwire: sdw_slave: add probe_complete structure and new fields")
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Rander Wang <rander.wang@intel.com>
Reviewed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://lore.kernel.org/r/20220621225641.221170-2-pierre-louis.bossart@linux.intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:48 +02:00
Bryan O'Donoghue
0c787dfc22 clk: qcom: gcc-msm8939: Add missing SYSTEM_MM_NOC_BFDCD_CLK_SRC
[ Upstream commit 07e7fcf171 ]

When adding in the indexes for this clock-controller we missed
SYSTEM_MM_NOC_BFDCD_CLK_SRC.

Add it in now.

Fixes: 4c71d6abc4 ("clk: qcom: Add DT bindings for MSM8939 GCC")
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org>
Cc: devicetree@vger.kernel.org
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue@linaro.org>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Link: https://lore.kernel.org/r/20220504163835.40130-2-bryan.odonoghue@linaro.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:45 +02:00
Mike Christie
3e541526aa scsi: iscsi: Fix session removal on shutdown
[ Upstream commit 31500e9027 ]

When the system is shutting down, iscsid is not running so we will not get
a response to the ISCSI_ERR_INVALID_HOST error event. The system shutdown
will then hang waiting on userspace to remove the session.

This has libiscsi force the destruction of the session from the kernel when
iscsi_host_remove() is called from a driver's shutdown callout.

This fixes a regression in qedi boot added by commit d1f2ce7763 ("scsi:
qedi: Fix host removal with running sessions"), which made qedi use the
common session removal function that waits on userspace instead of rolling
its own kernel-based removal.

Link: https://lore.kernel.org/r/20220616222738.5722-7-michael.christie@oracle.com
Fixes: d1f2ce7763 ("scsi: qedi: Fix host removal with running sessions")
Tested-by: Nilesh Javali <njavali@marvell.com>
Reviewed-by: Lee Duncan <lduncan@suse.com>
Reviewed-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:45 +02:00
Mike Christie
ca89365815 scsi: iscsi: Add helper to remove a session from the kernel
[ Upstream commit bb42856bfd ]

During qedi shutdown we need to stop the iSCSI layer from sending new nops
as pings and from responding to target ones, and make sure there are no
running connection cleanups. Commit d1f2ce7763 ("scsi: qedi: Fix host
removal with running sessions") converted the driver to use the libiscsi
helper to drive session removal, so the above issues could be handled. The
problem is that during system shutdown iscsid will not be running so when
we try to remove the root session we will hang waiting for userspace to
reply.

Add a helper that will drive the destruction of sessions like these during
system shutdown.

Link: https://lore.kernel.org/r/20220616222738.5722-5-michael.christie@oracle.com
Tested-by: Nilesh Javali <njavali@marvell.com>
Reviewed-by: Nilesh Javali <njavali@marvell.com>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:44 +02:00
Christoph Hellwig
d18a908565 memremap: remove support for external pgmap refcounts
[ Upstream commit b80892ca02 ]

No driver is left using the external pgmap refcount, so remove the
code to support it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20211028151017.50234-1-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:43 +02:00
Thomas Gleixner
154bf040ba netfilter: xtables: Bring SPDX identifier back
[ Upstream commit 20646f5b1e ]

Commit e2be04c7f9 ("License cleanup: add SPDX license identifier to
uapi header files with a license") added the correct SPDX identifier to
include/uapi/linux/netfilter/xt_IDLETIMER.h.

A subsequent commit removed it for no reason and reintroduced the UAPI
license incorrectness as the file is now missing the UAPI exception
again.

Add it back and remove the GPLv2 boilerplate while at it.

Fixes: 68983a354a ("netfilter: xtables: Add snapshot of hardidletimer target")
Cc: Manoj Basapathi <manojbm@codeaurora.org>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:42 +02:00
Mike Manning
51dd6d3beb net: allow unbound socket for packets in VRF when tcp_l3mdev_accept set
[ Upstream commit 944fd1aeac ]

The commit 3c82a21f43 ("net: allow binding socket in a VRF when
there's an unbound socket") changed the inet socket lookup to avoid
packets in a VRF from matching an unbound socket. This is to ensure the
necessary isolation between the default and other VRFs for routing and
forwarding. VRF-unaware processes running in the default VRF cannot
access another VRF and have to be run with 'ip vrf exec <vrf>'. This is
to be expected with tcp_l3mdev_accept disabled, but could be reallowed
when this sysctl option is enabled. So instead of directly checking dif
and sdif in inet[6]_match, here call inet_sk_bound_dev_eq(). This
allows a match on unbound socket for non-zero sdif i.e. for packets in
a VRF, if tcp_l3mdev_accept is enabled.

Fixes: 3c82a21f43 ("net: allow binding socket in a VRF when there's an unbound socket")
Signed-off-by: Mike Manning <mvrmanning@gmail.com>
Link: https://lore.kernel.org/netdev/a54c149aed38fded2d3b5fdb1a6c89e36a083b74.camel@lasnet.de/
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:36 +02:00
Eric Dumazet
8fd69d3df9 ipv6: add READ_ONCE(sk->sk_bound_dev_if) in INET6_MATCH()
[ Upstream commit 5d368f0328 ]

INET6_MATCH() runs without holding a lock on the socket.

We probably need to annotate most reads.

This patch makes INET6_MATCH() an inline function
to ease our changes.

v2: inline function only defined if IS_ENABLED(CONFIG_IPV6)
    Change the name to inet6_match(), this is no longer a macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:36 +02:00
Eric Dumazet
68bf74ec95 inet: add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()
[ Upstream commit 4915d50e30 ]

INET_MATCH() runs without holding a lock on the socket.

We probably need to annotate most reads.

This patch makes INET_MATCH() an inline function
to ease our changes.

v2:

We remove the 32bit version of it, as modern compilers
should generate the same code really, no need to
try to be smarter.

Also make 'struct net *net' the first argument.
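
A hedged sketch of what the resulting inline helper might look like,
paraphrased from the old macro's checks plus the READ_ONCE() annotation
described here; not guaranteed to match the in-tree version exactly:

    static inline bool inet_match(struct net *net, const struct sock *sk,
                                  const __addrpair cookie,
                                  const __portpair ports, int dif, int sdif)
    {
            int bound_dev_if;

            if (!net_eq(sock_net(sk), net) ||
                sk->sk_portpair != ports ||
                sk->sk_addrpair != cookie)
                    return false;

            /* lockless callers: sk_bound_dev_if is read exactly once */
            bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
            return !bound_dev_if || bound_dev_if == dif || bound_dev_if == sdif;
    }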

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:35 +02:00
Vincent Mailhol
5793da4db3 can: error: specify the values of data[5..7] of CAN error frames
[ Upstream commit e70a3263a7 ]

Currently, data[5..7] of struct can_frame, when used as a CAN error
frame, are defined as being "controller specific". Device-specific
behaviours are problematic because they prevent someone from writing
code which is portable between devices.

As a matter of fact, data[5] is never used, data[6] is always used to
report TX error counter and data[7] is always used to report RX error
counter. can-utils also relies on this.

This patch updates the comment in the uapi header to specify that
data[5] is reserved (and thus should not be used) and that data[6..7]
are used for error counters.
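
For illustration, a hedged userspace sketch of a consumer of this
convention (assumes a CAN_RAW socket configured to receive error frames):

    #include <linux/can.h>
    #include <stdio.h>

    static void dump_error_counters(const struct can_frame *cf)
    {
            if (!(cf->can_id & CAN_ERR_FLAG))
                    return;                 /* not an error frame */

            /* data[5] is reserved; data[6]/data[7] carry the counters */
            printf("TX error counter: %u\n", (unsigned int)cf->data[6]);
            printf("RX error counter: %u\n", (unsigned int)cf->data[7]);
    }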

Fixes: 0d66548a10 ("[CAN]: Add PF_CAN core module")
Link: https://lore.kernel.org/all/20220719143550.3681-11-mailhol.vincent@wanadoo.fr
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:33 +02:00
Mordechay Goodstein
0fb9387b85 ieee80211: add EHT 1K aggregation definitions
[ Upstream commit 2a2c86f15e ]

Add the fields for parsing the extended ADDBA request/response,
and the new 1K maximum aggregation limit for the ADDBA request/response.

Adjust drivers to use the proper macro, IEEE80211_MAX_AMPDU_BUF ->
IEEE80211_MAX_AMPDU_BUF_HE.

Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Link: https://lore.kernel.org/r/20220214173004.b8b447ce95b7.I0ee2554c94e89abc7a752b0f7cc7fd79c273efea@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:29 +02:00
Maxime Ripard
ed486d1508 drm/bridge: Add a function to abstract away panels
[ Upstream commit 87ea95808d ]

Display drivers so far need to have a lot of boilerplate to first
retrieve either the panel or bridge that they are connected to using
drm_of_find_panel_or_bridge(), and then either deal with each with ad-hoc
functions or create a drm panel bridge through drm_panel_bridge_add.

In order to reduce the boilerplate and hopefully create a path of least
resistance towards using the DRM panel bridge layer, let's create the
function devm_drm_of_get_bridge() to reduce that boilerplate.

Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20210910130941.1740182-2-maxime@cerno.tech
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:24 +02:00
Paul E. McKenney
5a6ff3e32a rcutorture: Warn on individual rcu_torture_init() error conditions
[ Upstream commit efeff6b39b ]

When running rcutorture as a module, any rcu_torture_init() issues will be
reflected in the error code from modprobe or insmod, as the case may be.
However, these error codes are not available when running rcutorture
built-in, for example, when using the kvm.sh script.  This commit
therefore adds WARN_ON_ONCE() to allow distinguishing rcu_torture_init()
errors when running rcutorture built-in.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:22 +02:00
Eric Dumazet
ab5a4e0cbf net: fix sk_wmem_schedule() and sk_rmem_schedule() errors
[ Upstream commit 7c80b038d2 ]

If sk->sk_forward_alloc is 150000, and we need to schedule 150001 bytes,
we want to allocate 1 byte more (rounded up to one page),
instead of 150001 :/
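
In other words, the amount handed to the memory-scheduling path should be
the missing delta, not the full size; a toy page-rounding calculation with
the numbers above (PAGE_SIZE value and rounding are simplifications for
illustration):

    #include <stdio.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
            int forward_alloc = 150000;     /* bytes already reserved */
            int size = 150001;              /* bytes we now need      */
            int delta = size - forward_alloc;

            /* rounded up to whole pages */
            int pages_fixed = (delta + PAGE_SIZE - 1) / PAGE_SIZE;  /*  1 page  */
            int pages_buggy = (size  + PAGE_SIZE - 1) / PAGE_SIZE;  /* 37 pages */

            printf("delta=%d -> %d page(s); full size -> %d pages\n",
                   delta, pages_fixed, pages_buggy);
            return 0;
    }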

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:19 +02:00
Thomas Zimmermann
ff0087df19 drm/shmem-helper: Pass GEM shmem object in public interfaces
[ Upstream commit a193f3b4e0 ]

Change all GEM SHMEM object functions that receive a GEM object
of type struct drm_gem_object to expect an object of type
struct drm_gem_shmem_object instead.

This change reduces the number of upcasts from struct drm_gem_object
by moving them into callers. The C compiler can now verify that the
GEM SHMEM functions are called with the correct type.

For consistency, the patch also renames drm_gem_shmem_free_object to
drm_gem_shmem_free. It further updates documentation for a number of
functions.

v3:
	* fix docs for drm_gem_shmem_object_free()
v2:
	* mention _object_ callbacks in docs (Daniel)

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20211108093149.7226-4-tzimmermann@suse.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:19 +02:00
Thomas Zimmermann
4109ff9e7b drm/shmem-helper: Export dedicated wrappers for GEM object functions
[ Upstream commit c7fbcb7149 ]

Wrap GEM SHMEM functions for struct drm_gem_object_funcs and update
all callers. This will allow for an update of the public interfaces
of the GEM SHMEM helper library.

v2:
	* fix docs for drm_gem_shmem_object_print_info()

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20211108093149.7226-3-tzimmermann@suse.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:18 +02:00
Thomas Zimmermann
a3f82087b7 drm/shmem-helper: Unexport drm_gem_shmem_create_with_handle()
[ Upstream commit 5a363c2067 ]

Turn drm_gem_shmem_create_with_handle() into an internal helper
function. It's not used outside of the compilation unit.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20211108093149.7226-2-tzimmermann@suse.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:18 +02:00
Peter Zijlstra
7bc43ab2b9 x86/extable: Fix ex_handler_msr() print condition
[ Upstream commit a1a5482a2c ]

On Fri, Jun 17, 2022 at 02:08:52PM +0300, Stephane Eranian wrote:
> Some changes to the way invalid MSR accesses are reported by the
> kernel is causing some problems with messages printed on the
> console.
>
> We have seen several cases of ex_handler_msr() printing invalid MSR
> accesses once but the callstack multiple times causing confusion on
> the console.

> The problem here is that another earlier commit (5.13):
>
> a358f40600 ("once: implement DO_ONCE_LITE for non-fast-path "do once" functionality")
>
> Modifies all the pr_*_once() calls to always return true claiming
> that no caller is ever checking the return value of the functions.
>
> This is why we are seeing the callstack printed without the
> associated printk() msg.

Extract the ONCE_IF(cond) part into __ONCE_LTE_IF() and use that to
implement DO_ONCE_LITE_IF() and fix the extable code.

Fixes: a358f40600 ("once: implement DO_ONCE_LITE for non-fast-path "do once" functionality")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Stephane Eranian <eranian@google.com>
Link: https://lkml.kernel.org/r/YqyVFsbviKjVGGZ9@worktop.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:14 +02:00
Eric Auger
c24bbe4596 ACPI: VIOT: Fix ACS setup
[ Upstream commit 3dcb861dbc ]

Currently acpi_viot_init() gets called after the PCI
device has been scanned and pci_enable_acs() has been called.
So pci_request_acs() fails to be taken into account, leading
to wrong single-IOMMU-group topologies when dealing with
multi-function root ports, for instance.

We cannot simply move acpi_viot_init() earlier, as is done for
the IORT init, because the VIOT parsing relies on the PCI
scan. However, we can detect earlier that VIOT is present and, in
such a case, request ACS. Introduce a new acpi_viot_early_init()
routine that allows pci_request_acs() to be called before the scan.

While at it, guard the call to pci_request_acs() with #ifdef
CONFIG_PCI.

Fixes: 3cf485540e ("ACPI: Add driver for the VIOT table")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reported-by: Jin Liu <jinl@redhat.com>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Tested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:11 +02:00
Shuai Xue
335852f7a9 ACPI: APEI: explicit init of HEST and GHES in apci_init()
[ Upstream commit dc4e8c07e9 ]

Since commit e147133a42 ("ACPI / APEI: Make hest.c manage the estatus
memory pool") was merged, ghes_init() relies on acpi_hest_init() to manage
the estatus memory pool. On the other hand, ghes_init() relies on
sdei_init() to detect the SDEI version and (un)register events. The
dependencies are as follows:

    ghes_init() => acpi_hest_init() => acpi_bus_init() => acpi_init()
    ghes_init() => sdei_init()

HEST is not PCI-specific and initcall ordering is implicit and not
well-defined within a level.

Based on above, remove acpi_hest_init() from acpi_pci_root_init() and
convert ghes_init() and sdei_init() from initcalls to explicit calls in the
following order:

    acpi_hest_init()
    ghes_init()
        sdei_init()

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:11 +02:00
Peter Zijlstra
e0067e6d54 locking/lockdep: Fix lockdep_init_map_*() confusion
[ Upstream commit eae6d58d67 ]

Commit dfd5e3f5fe ("locking/lockdep: Mark local_lock_t") added yet
another lockdep_init_map_*() variant, but forgot to update all the
existing users of the most complicated version.

This could lead to a loss of lock_type and hence an incorrect report.
Given the relative rarity of both local_lock and these annotations,
this is unlikely to happen in practice; still, best to fix things.

Fixes: dfd5e3f5fe ("locking/lockdep: Mark local_lock_t")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YqyEDtoan20K0CVD@worktop.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:08 +02:00
Juri Lelli
8df06a2753 wait: Fix __wait_event_hrtimeout for RT/DL tasks
[ Upstream commit cceeeb6a6d ]

Changes to hrtimer mode (potentially made by __hrtimer_init_sleeper on
PREEMPT_RT) are not visible to hrtimer_start_range_ns, thus not
accounted for by hrtimer_start_expires call paths. In particular,
__wait_event_hrtimeout suffers from this problem as we have, for
example:

fs/aio.c::read_events
  wait_event_interruptible_hrtimeout
    __wait_event_hrtimeout
      hrtimer_init_sleeper_on_stack <- this might "mode |= HRTIMER_MODE_HARD"
                                       on RT if task runs at RT/DL priority
        hrtimer_start_range_ns
          WARN_ON_ONCE(!(mode & HRTIMER_MODE_HARD) ^ !timer->is_hard)
          fires since the latter doesn't see the change of mode done by
          init_sleeper

Fix it by making __wait_event_hrtimeout call hrtimer_sleeper_start_expires,
which is aware of the special RT/DL case, instead of hrtimer_start_range_ns.

Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220627095051.42470-1-juri.lelli@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:01 +02:00
John Keeping
f066e01582 sched/core: Always flush pending blk_plug
[ Upstream commit 401e4963bf ]

With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):

	INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
	Workqueue: writeback wb_workfn (flush-7:0)
	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
	[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
	[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
	+(rt_mutex_slowlock.constprop.0+0xac/0x174)
	[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
	[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
	[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
	[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
	[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
	[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
	[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
	[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
	[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
	[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
	Exception stack(0xc10e3fb0 to 0xc10e3ff8)
	3fa0:                                     00000000 00000000 00000000 00000000
	3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
	3fe0: 00000000 00000000 00000000 00000000 00000013 00000000

	INFO: task tar:2083 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:tar             state:D stack:    0 pid: 2083 ppid:  2082 flags:0x00000000
	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
	[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
	[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
	[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
	[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
	[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
	[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
	[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
	[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
	[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
	[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
	[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
	Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
	bfa0:                   01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
	bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
	bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc

Here the kworker is waiting on msdos_sb_info::s_lock which is held by
tar which is in turn waiting for a buffer which is locked waiting to be
flushed, but this operation is plugged in the kworker.

The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.

It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag.  But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.

Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe
("sched/core: Rework the __schedule() preempt argument").

As described in [1]:

   The schedule process must distinguish between blocking on a regular
   sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
   and rwlock):
   - rwsem and mutex must flush block requests (blk_schedule_flush_plug())
     even if blocked on a lock. This can not deadlock because this also
     happens for non-RT.
     There should be a warning if the scheduling point is within a RCU read
     section.

   - spinlock and rwlock must not flush block requests. This will deadlock
     if the callback attempts to acquire a lock which is already acquired.
     Similarly to being preempted, there should be no warning if the
     scheduling point is within a RCU read section.

and with the tsk_is_pi_blocked() check in the scheduler path, we hit the first
issue.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches

Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:01 +02:00
Chen Yu
079651c6cf sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
[ Upstream commit 70fb5ccf2e ]

[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded.

The following histogram is the time spent in select_idle_cpu(),
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:

@usecs:
[0]                  533 |                                                    |
[1]                 5495 |                                                    |
[2, 4)             12008 |                                                    |
[4, 8)            239252 |                                                    |
[8, 16)          4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)        12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[32, 64)        14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)       13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[128, 256)       8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)       4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)        2600472 |@@@@@@@@@                                           |
[1K, 2K)          927912 |@@@                                                 |
[2K, 4K)          218720 |                                                    |
[4K, 8K)           98161 |                                                    |
[8K, 16K)          37722 |                                                    |
[16K, 32K)          6715 |                                                    |
[32K, 64K)           477 |                                                    |
[64K, 128K)            7 |                                                    |

netperf latency usecs:
=======
case            	load    	    Lat_99th	    std%
TCP_RR          	thread-224	      257.39	(  0.21)

The time spent in select_idle_cpu() is visible to netperf and might have a negative
impact.

[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. Copy the indicators here:

SIS Search Efficiency(se_eff%):
        A ratio expressed as a percentage of runqueues scanned versus
        idle CPUs found. A 100% efficiency indicates that the target,
        prev or recent CPU of a task was idle at wakeup. The lower the
        efficiency, the more runqueues were scanned before an idle CPU
        was found.

SIS Domain Search Efficiency(dom_eff%):
        Similar, except only for the slower SIS path.

SIS Fast Success Rate(fast_rate%):
        Percentage of SIS that used target, prev or recent CPUs.

SIS Success rate(success_rate%):
        Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool, including netperf, hackbench,
schbench and tbench.

Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
TCP_RR	   28 threads	     99.978	      18.535	      99.995	     100.000
TCP_RR	   56 threads	     99.397	       5.671	      99.964	     100.000
TCP_RR	   84 threads	     21.721	       6.818	      73.632	     100.000
TCP_RR	  112 threads	     12.500	       5.533	      59.000	     100.000
TCP_RR	  140 threads	      8.524	       4.535	      49.020	     100.000
TCP_RR	  168 threads	      6.438	       3.945	      40.309	      99.999
TCP_RR	  196 threads	      5.397	       3.718	      32.320	      99.982
TCP_RR	  224 threads	      4.874	       3.661	      25.775	      99.767
UDP_RR	   28 threads	     99.988	      17.704	      99.997	     100.000
UDP_RR	   56 threads	     99.528	       5.977	      99.970	     100.000
UDP_RR	   84 threads	     24.219	       6.992	      76.479	     100.000
UDP_RR	  112 threads	     13.907	       5.706	      62.538	     100.000
UDP_RR	  140 threads	      9.408	       4.699	      52.519	     100.000
UDP_RR	  168 threads	      7.095	       4.077	      44.352	     100.000
UDP_RR	  196 threads	      5.757	       3.775	      35.764	      99.991
UDP_RR	  224 threads	      5.124	       3.704	      28.748	      99.860

schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case	        load	    se_eff%	    dom_eff%	  fast_rate%	success_rate%
normal	   1   mthread	     99.152	       6.400	      99.941	     100.000
normal	   2   mthreads	     97.844	       4.003	      99.908	     100.000
normal	   3   mthreads	     96.395	       2.118	      99.917	      99.998
normal	   4   mthreads	     55.288	       1.451	      98.615	      99.804
normal	   5   mthreads	      7.004	       1.870	      45.597	      61.036
normal	   6   mthreads	      3.354	       1.346	      20.777	      34.230
normal	   7   mthreads	      2.183	       1.028	      11.257	      21.055
normal	   8   mthreads	      1.653	       0.825	       7.849	      15.549

schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case			load	        se_eff%	    dom_eff%	  fast_rate%	success_rate%
process-pipe	     1 group	         99.991	       7.692	      99.999	     100.000
process-pipe	    2 groups	         99.934	       4.615	      99.997	     100.000
process-pipe	    3 groups	         99.597	       3.198	      99.987	     100.000
process-pipe	    4 groups	         98.378	       2.464	      99.958	     100.000
process-pipe	    5 groups	         27.474	       3.653	      89.811	      99.800
process-pipe	    6 groups	         20.201	       4.098	      82.763	      99.570
process-pipe	    7 groups	         16.423	       4.156	      77.398	      99.316
process-pipe	    8 groups	         13.165	       3.920	      72.232	      98.828
process-sockets	     1 group	         99.977	       5.882	      99.999	     100.000
process-sockets	    2 groups	         99.927	       5.505	      99.996	     100.000
process-sockets	    3 groups	         99.397	       3.250	      99.980	     100.000
process-sockets	    4 groups	         79.680	       4.258	      98.864	      99.998
process-sockets	    5 groups	          7.673	       2.503	      63.659	      92.115
process-sockets	    6 groups	          4.642	       1.584	      58.946	      88.048
process-sockets	    7 groups	          3.493	       1.379	      49.816	      81.164
process-sockets	    8 groups	          3.015	       1.407	      40.845	      75.500
threads-pipe	     1 group	         99.997	       0.000	     100.000	     100.000
threads-pipe	    2 groups	         99.894	       2.932	      99.997	     100.000
threads-pipe	    3 groups	         99.611	       4.117	      99.983	     100.000
threads-pipe	    4 groups	         97.703	       2.624	      99.937	     100.000
threads-pipe	    5 groups	         22.919	       3.623	      87.150	      99.764
threads-pipe	    6 groups	         18.016	       4.038	      80.491	      99.557
threads-pipe	    7 groups	         14.663	       3.991	      75.239	      99.247
threads-pipe	    8 groups	         12.242	       3.808	      70.651	      98.644
threads-sockets	     1 group	         99.990	       6.667	      99.999	     100.000
threads-sockets	    2 groups	         99.940	       5.114	      99.997	     100.000
threads-sockets	    3 groups	         99.469	       4.115	      99.977	     100.000
threads-sockets	    4 groups	         87.528	       4.038	      99.400	     100.000
threads-sockets	    5 groups	          6.942	       2.398	      59.244	      88.337
threads-sockets	    6 groups	          4.359	       1.954	      49.448	      87.860
threads-sockets	    7 groups	          2.845	       1.345	      41.198	      77.102
threads-sockets	    8 groups	          2.871	       1.404	      38.512	      74.312

schedstat_parse.py -f tbench_vanilla.log
case			load	      se_eff%	    dom_eff%	  fast_rate%	success_rate%
loopback	  28 threads	       99.976	      18.369	      99.995	     100.000
loopback	  56 threads	       99.222	       7.799	      99.934	     100.000
loopback	  84 threads	       19.723	       6.819	      70.215	     100.000
loopback	 112 threads	       11.283	       5.371	      55.371	      99.999
loopback	 140 threads	        0.000	       0.000	       0.000	       0.000
loopback	 168 threads	        0.000	       0.000	       0.000	       0.000
loopback	 196 threads	        0.000	       0.000	       0.000	       0.000
loopback	 224 threads	        0.000	       0.000	       0.000	       0.000

According to the test above, if the system becomes busy, the
SIS Search Efficiency(se_eff%) drops significantly. Although some
benchmarks would eventually find an idle CPU (success_rate% = 100%), it is
doubtful whether searching the whole LLC domain is worth it.

[Proposal]
It would be ideal to have a crystal ball to answer this question:
How many CPUs must a wakeup path walk down, before it can find an idle
CPU? Many potential metrics could be used to predict the number.
One candidate is the sum of util_avg in this LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics
(such as rq->nr_running). Besides, choosing the sum of util_avg
would help predict the load of the LLC domain more precisely, because
SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
time.

In summary, the lower the util_avg is, the more select_idle_cpu()
should scan for idle CPU, and vice versa. When the sum of util_avg
in this LLC domain hits 85% or above, the scan stops. The reason to
choose 85% as the threshold is that this is the imbalance_pct(117)
when an LLC sched group is overloaded.

Introduce the quadratic function:

y = SCHED_CAPACITY_SCALE - p * x^2
and y'= y / SCHED_CAPACITY_SCALE

x is the ratio of sum_util compared to the CPU capacity:
x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
y' is the ratio of CPUs to be scanned in the LLC domain,
and the number of CPUs to scan is calculated by:

nr_scan = llc_weight * y'

A quadratic function was chosen because:
[1] Compared to the linear function, it scans more aggressively when the
    sum_util is low.
[2] Compared to the exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of util_avg
    and the number of CPUs to be scanned. Use heuristic scan for now.

For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr   112  111  108  102  93  81  65   47   25    1    0 ...

For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util%   0    5   15   25  35  45  55   65   75   85   86 ...
scan_nr    16   15   15   14  13  11   9    6    3    0    0 ...
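
For illustration only, here is a small self-contained userspace sketch of
this mapping (not the kernel implementation; it works directly with the
ratio y' = y / SCHED_CAPACITY_SCALE and assumes imbalance_pct = 117 as
stated above, reproducing the scan_nr tables to within an off-by-one from
the kernel's integer arithmetic):

	#include <stdio.h>

	/*
	 * y' = 1 - p * x^2, with p chosen so that y' reaches 0 when x hits
	 * the overload ratio 100/imbalance_pct (~85.5% for pct = 117), and
	 * nr_scan = llc_weight * y'.
	 */
	static int nr_scan(double sum_util_pct, int llc_weight, int imbalance_pct)
	{
		double x = sum_util_pct / 100.0; /* sum_util / (llc_weight * SCHED_CAPACITY_SCALE) */
		double p = (double)imbalance_pct * imbalance_pct / 10000.0;
		double y = 1.0 - p * x * x;

		return y > 0.0 ? (int)(llc_weight * y) : 0;
	}

	int main(void)
	{
		const double util[] = { 0, 5, 15, 25, 35, 45, 55, 65, 75, 85, 86 };
		unsigned int i;

		for (i = 0; i < sizeof(util) / sizeof(util[0]); i++)
			printf("sum_util%%=%3.0f  llc=112 -> %3d  llc=16 -> %2d\n",
			       util[i], nr_scan(util[i], 112, 117),
			       nr_scan(util[i], 16, 117));
		return 0;
	}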

Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112 ms would
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
in reflecting the latest utilization. But it is a trade-off.
Checking the util_avg in newidle load balance would be more frequent,
but it brings overhead - multiple CPUs write/read the per-LLC shared
variable and introduces cache contention. Tim also mentioned that,
it is allowed to be non-optimal in terms of scheduling for the
short-term variations, but if there is a long-term trend in the load
behavior, the scheduler can adjust for that.

When SIS_UTIL is enabled, select_idle_cpu() uses the nr_scan
calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
Mel suggested, SIS_UTIL should be enabled by default.

This patch is based on the util_avg, which is very sensitive to the
CPU frequency invariance. There is an issue: when the max frequency
has been clamped, the util_avg decays insanely fast when
the CPU is idle. Commit addca28512 ("cpufreq: intel_pstate: Handle no_turbo
in frequency invariance") could be used to mitigate this symptom, by adjusting
the arch_max_freq_ratio when turbo is disabled. But this issue is still
not thoroughly fixed, because the current code is unaware of the user-specified
max CPU frequency.

[Test result]

netperf and tbench were launched with 25%, 50%, 75%, 100%, 125%, 150%,
175% and 200% of the CPU number respectively. Hackbench and schbench were
launched with 1, 2, 4 and 8 groups. Each test lasts 100 seconds and repeats 3 times.

The following is the benchmark result comparison between
baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
indicates better performance.

Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
netperf.throughput
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	28 threads	 1.00 (  0.34)	 -0.16 (  0.40)
TCP_RR          	56 threads	 1.00 (  0.19)	 -0.02 (  0.20)
TCP_RR          	84 threads	 1.00 (  0.39)	 -0.47 (  0.40)
TCP_RR          	112 threads	 1.00 (  0.21)	 -0.66 (  0.22)
TCP_RR          	140 threads	 1.00 (  0.19)	 -0.69 (  0.19)
TCP_RR          	168 threads	 1.00 (  0.18)	 -0.48 (  0.18)
TCP_RR          	196 threads	 1.00 (  0.16)	+194.70 ( 16.43)
TCP_RR          	224 threads	 1.00 (  0.16)	+197.30 (  7.85)
UDP_RR          	28 threads	 1.00 (  0.37)	 +0.35 (  0.33)
UDP_RR          	56 threads	 1.00 ( 11.18)	 -0.32 (  0.21)
UDP_RR          	84 threads	 1.00 (  1.46)	 -0.98 (  0.32)
UDP_RR          	112 threads	 1.00 ( 28.85)	 -2.48 ( 19.61)
UDP_RR          	140 threads	 1.00 (  0.70)	 -0.71 ( 14.04)
UDP_RR          	168 threads	 1.00 ( 14.33)	 -0.26 ( 11.16)
UDP_RR          	196 threads	 1.00 ( 12.92)	+186.92 ( 20.93)
UDP_RR          	224 threads	 1.00 ( 11.74)	+196.79 ( 18.62)

Taking 224 threads as an example, the SIS search metric changes are
illustrated below:

    vanilla                    patched
   4544492          +237.5%   15338634        sched_debug.cpu.sis_domain_search.avg
     38539        +39686.8%   15333634        sched_debug.cpu.sis_failed.avg
  128300000          -87.9%   15551326        sched_debug.cpu.sis_scanned.avg
   5842896          +162.7%   15347978        sched_debug.cpu.sis_search.avg

There are 87.9% fewer CPU scans after the patch, which indicates lower overhead.
Besides, with this patch applied, there is 13% less rq lock contention
in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This might help explain the performance improvement, because this patch allows
the waking task to remain on its previous CPU rather than grabbing other CPUs'
locks.

Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
hackbench.throughput
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1 group 	 1.00 (  1.29)	 +0.57 (  0.47)
process-pipe    	2 groups 	 1.00 (  0.27)	 +0.77 (  0.81)
process-pipe    	4 groups 	 1.00 (  0.26)	 +1.17 (  0.02)
process-pipe    	8 groups 	 1.00 (  0.15)	 -4.79 (  0.02)
process-sockets 	1 group 	 1.00 (  0.63)	 -0.92 (  0.13)
process-sockets 	2 groups 	 1.00 (  0.03)	 -0.83 (  0.14)
process-sockets 	4 groups 	 1.00 (  0.40)	 +5.20 (  0.26)
process-sockets 	8 groups 	 1.00 (  0.04)	 +3.52 (  0.03)
threads-pipe    	1 group 	 1.00 (  1.28)	 +0.07 (  0.14)
threads-pipe    	2 groups 	 1.00 (  0.22)	 -0.49 (  0.74)
threads-pipe    	4 groups 	 1.00 (  0.05)	 +1.88 (  0.13)
threads-pipe    	8 groups 	 1.00 (  0.09)	 -4.90 (  0.06)
threads-sockets 	1 group 	 1.00 (  0.25)	 -0.70 (  0.53)
threads-sockets 	2 groups 	 1.00 (  0.10)	 -0.63 (  0.26)
threads-sockets 	4 groups 	 1.00 (  0.19)	+11.92 (  0.24)
threads-sockets 	8 groups 	 1.00 (  0.08)	 +4.31 (  0.11)

Each tbench test is a:
tbench -t 100 $job 127.0.0.1
tbench.throughput
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	28 threads	 1.00 (  0.06)	 -0.14 (  0.09)
loopback        	56 threads	 1.00 (  0.03)	 -0.04 (  0.17)
loopback        	84 threads	 1.00 (  0.05)	 +0.36 (  0.13)
loopback        	112 threads	 1.00 (  0.03)	 +0.51 (  0.03)
loopback        	140 threads	 1.00 (  0.02)	 -1.67 (  0.19)
loopback        	168 threads	 1.00 (  0.38)	 +1.27 (  0.27)
loopback        	196 threads	 1.00 (  0.11)	 +1.34 (  0.17)
loopback        	224 threads	 1.00 (  0.11)	 +1.67 (  0.22)

Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000
schbench.latency_90%_us
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1 mthread	 1.00 ( 31.22)	 -7.36 ( 20.25)*
normal          	2 mthreads	 1.00 (  2.45)	 -0.48 (  1.79)
normal          	4 mthreads	 1.00 (  1.69)	 +0.45 (  0.64)
normal          	8 mthreads	 1.00 (  5.47)	 +9.81 ( 14.28)

*Considering the standard deviation, this -7.36% regression might not be valid.

Also, an OLTP workload with a commercial RDBMS has been tested, and there
is no significant change.

There were concerns that unbalanced tasks among CPUs would cause problems.
For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
bound to CPU0~CPU6, while CPU7 is idle:

          CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
util_avg  1024    1024    1024    1024    1024    1024    1024    0

Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
select_idle_cpu() will not scan, so CPU7 goes undetected during the scan.
But according to Mel, it is unlikely that CPU7 will be idle all the time,
because CPU7 could pull some tasks via CPU_NEWLY_IDLE.

lkp (kernel test robot) has reported a regression on stress-ng.sock on a
very busy system. According to the sched_debug statistics, it might be caused
by SIS_UTIL terminating the scan and choosing a previous CPU earlier, which
might introduce more context switches, especially involuntary preemption,
and that impacts a busy stress-ng. This regression shows that not all
benchmarks in every scenario benefit from the idle CPU scan limit, and it
needs further investigation.

Besides, there is a slight regression in hackbench's 16 groups case when the
LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively
in an LLC domain with 16 CPUs, because the cost of searching for an idle one
among 16 CPUs is negligible. The current patch aims to propose a generic
solution and only considers the util_avg. Something like the below could
be applied on top of the current patch to fulfill the requirement:

	if (llc_weight <= 16)
		nr_scan = nr_scan * 32 / llc_weight;

For an LLC domain with 16 CPUs, nr_scan would be expanded to twice as large.
The fewer CPUs the LLC domain has, the more nr_scan would be expanded.
This needs further investigation.

There is also ongoing work[2] from Abel to filter out the busy CPUs during
wakeup, to further speed up the idle CPU scan. It could be a follow-up
optimization on top of this change.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17 14:23:00 +02:00
Weitao Wang
b76ea430e9 USB: HCD: Fix URB giveback issue in tasklet function
commit 26c6c2f8a9 upstream.

The USB core introduced a mechanism to give back URBs in tasklet context in
order to reduce hardware interrupt handling time. In some test situations
(such as FIO with a 4KB block size), while the tasklet callback is giving
back URBs, the interrupt handler keeps adding URB nodes to the bh->head
list. If bh->head is checked again after all URBs on local_list have been
given back, a "dynamic balance" can arise between URBs being given back and
URBs being added to bh->head. The tasklet callback may then not exit for a
long time, delaying other tasklet functions. Some real-time applications
(such as keyboard and mouse) will see noticeable lag.

To prevent the tasklet function from occupying the CPU for too long in a
single run, new URBs are no longer added to the local_list even when the
bh->head list is not empty. The remaining URBs still need to be given back
in time, so add a high_prio member to struct giveback_urb_bh to prioritize
the tasklet and schedule it again whenever bh->head is not empty, as shown
in the sketch below.

Since the tasklet can now be prioritized through the high_prio structure
member, replace the local high_prio_bh variable with this structure member
in usb_hcd_giveback_urb().
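
A rough sketch of the reworked tasklet body (paraphrased from the
description above, not a verbatim copy of drivers/usb/core/hcd.c; callback
plumbing, the running-flag handling and error paths are simplified):

	static void giveback_urbs_once(struct giveback_urb_bh *bh)
	{
		struct list_head local_list;
		struct urb *urb, *tmp;

		spin_lock_irq(&bh->lock);
		list_replace_init(&bh->head, &local_list);	/* take one snapshot */
		spin_unlock_irq(&bh->lock);

		/* Give back only the URBs captured in this snapshot. */
		list_for_each_entry_safe(urb, tmp, &local_list, urb_list) {
			list_del_init(&urb->urb_list);
			__usb_hcd_giveback_urb(urb);
		}

		/*
		 * Do not re-read bh->head in a loop: URBs queued by the
		 * interrupt handler while we were running are handled by
		 * rescheduling the tasklet, so a single invocation stays
		 * bounded and other tasklets get a chance to run.
		 */
		spin_lock_irq(&bh->lock);
		if (!list_empty(&bh->head)) {
			if (bh->high_prio)	/* the new struct member */
				tasklet_hi_schedule(&bh->bh);
			else
				tasklet_schedule(&bh->bh);
		}
		spin_unlock_irq(&bh->lock);
	}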

Fixes: 94dfd7edfd ("USB: HCD: support giveback of URB in tasklet context")
Cc: stable <stable@kernel.org>
Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Weitao Wang <WeitaoWang-oc@zhaoxin.com>
Link: https://lore.kernel.org/r/20220726074918.5114-1-WeitaoWang-oc@zhaoxin.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17 14:22:58 +02:00
Pali Rohár
fdf7590591 PCI: Add defines for normal and subtractive PCI bridges
commit 904b10fb18 upstream.

Add these PCI class codes to pci_ids.h:

  PCI_CLASS_BRIDGE_PCI_NORMAL
  PCI_CLASS_BRIDGE_PCI_SUBTRACTIVE

Use these defines in all kernel code for describing PCI class codes for
normal and subtractive PCI bridges.
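
For reference, the two values encode base class 0x06 (bridge), subclass
0x04 (PCI-to-PCI) and programming interface 0x00 (normal decode) or 0x01
(subtractive decode). A sketch of the pci_ids.h additions and a typical
use (illustrative; the exact lines in the tree may differ):

	/* include/linux/pci_ids.h */
	#define PCI_CLASS_BRIDGE_PCI			0x0604		/* already present */
	#define PCI_CLASS_BRIDGE_PCI_NORMAL		0x060400
	#define PCI_CLASS_BRIDGE_PCI_SUBTRACTIVE	0x060401

	/* Typical use, instead of open-coding (PCI_CLASS_BRIDGE_PCI << 8) | 0x01: */
	static bool bridge_is_subtractive(const struct pci_dev *pdev)
	{
		return pdev->class == PCI_CLASS_BRIDGE_PCI_SUBTRACTIVE;
	}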

[bhelgaas: similar change in pci-mvebu.c]
Link: https://lore.kernel.org/r/20220214114109.26809-1-pali@kernel.org
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
[ gregkh - take only the pci_ids.h portion for stable backports ]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17 14:22:57 +02:00