Commit Graph

48425 Commits

Author SHA1 Message Date
Linus Torvalds
0f70f5b08a Merge tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull automount updates from Al Viro:
 "Automount wart removal

  A bunch of odd boilerplate gone from instances - the reason for
  those was the need to protect the yet-to-be-attched mount from
  mark_mounts_for_expiry() deciding to take it out.

  But that's easy to detect and take care of in mark_mounts_for_expiry()
  itself; no need to have every instance simulate mount being busy by
  grabbing an extra reference to it, with finish_automount() undoing
  that once it attaches that mount.

  Should've done it that way from the very beginning... This is a
  flagday change, thankfully there are very few instances.

  vfs_submount() is gone - its sole remaining user (trace_automount)
  had been switched to saner primitives"

* tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill vfs_submount()
  saner calling conventions for ->d_automount()
2025-05-30 15:38:29 -07:00
Linus Torvalds
b78f1293f9 Merge tag 'trace-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:

 - Have module addresses get updated in the persistent ring buffer

   The addresses of the modules from the previous boot are saved in the
   persistent ring buffer. If the same modules are loaded and an address
   is in the old buffer points to an address that was both saved in the
   persistent ring buffer and is loaded in memory, shift the address to
   point to the address that is loaded in memory in the trace event.

 - Print function names for irqs off and preempt off callsites

   When ignoring the print fmt of a trace event and just printing the
   fields directly, have the fields for preempt off and irqs off events
   still show the function name (via kallsyms) instead of just showing
   the raw address.

 - Clean ups of the histogram code

   The histogram functions saved over 800 bytes on the stack to process
   events as they come in. Instead, create per-cpu buffers that can hold
   this information and have a separate location for each context level
   (thread, softirq, IRQ and NMI).

   Also add some more comments to the code.

 - Add "common_comm" field for histograms

   Add "common_comm" that uses the current->comm as a field in an event
   histogram and acts like any of the other fields of the event.

 - Show "subops" in the enabled_functions file

   When the function graph infrastructure is used, a subsystem has a
   "subops" that it attaches its callback function to. Instead of the
   enabled_functions just showing a function calling the function that
   calls the subops functions, also show the subops functions that will
   get called for that function too.

 - Add "copy_trace_marker" option to instances

   There are cases where an instance is created for tooling to write
   into, but the old tooling has the top level instance hardcoded into
   the application. New tools want to consume the data from an instance
   and not the top level buffer. By adding a copy_trace_marker option,
   whenever the top instance trace_marker is written into, a copy of it
   is also written into the instance with this option set. This allows
   new tools to read what old tools are writing into the top buffer.

   If this option is cleared by the top instance, then what is written
   into the trace_marker is not written into the top instance. This is a
   way to redirect the trace_marker writes into another instance.

 - Have tracepoints created by DECLARE_TRACE() use trace_<name>_tp()

   If a tracepoint is created by DECLARE_TRACE() instead of
   TRACE_EVENT(), then it will not be exposed via tracefs. Currently
   there's no way to differentiate in the kernel the tracepoint
   functions between those that are exposed via tracefs or not. A
   calling convention has been made manually to append a "_tp" prefix
   for events created by DECLARE_TRACE(). Instead of doing this
   manually, force it so that all DECLARE_TRACE() events have this
   notation.

 - Use __string() for task->comm in some sched events

   Instead of hardcoding the comm to be TASK_COMM_LEN in some of the
   scheduler events use __string() which makes it dynamic. Note, if
   these events are parsed by user space it they may break, and the
   event may have to be converted back to the hardcoded size.

 - Have function graph "depth" be unsigned to the user

   Internally to the kernel, the "depth" field of the function graph
   event is signed due to -1 being used for end of boundary. What
   actually gets recorded in the event itself is zero or positive.
   Reflect this to user space by showing "depth" as unsigned int and be
   consistent across all events.

 - Allow an arbitrary long CPU string to osnoise_cpus_write()

   The filtering of which CPUs to write to can exceed 256 bytes. If a
   machine has 256 CPUs, and the filter is to filter every other CPU,
   the write would take a string larger than 256 bytes. Instead of using
   a fixed size buffer on the stack that is 256 bytes, allocate it to
   handle what is passed in.

 - Stop having ftrace check the per-cpu data "disabled" flag

   The "disabled" flag in the data structure passed to most ftrace
   functions is checked to know if tracing has been disabled or not.
   This flag was added back in 2008 before the ring buffer had its own
   way to disable tracing. The "disable" flag is now not always set when
   needed, and the ring buffer flag should be used in all locations
   where the disabled is needed. Since the "disable" flag is redundant
   and incorrect, stop using it. Fix up some locations that use the
   "disable" flag to use the ring buffer info.

 - Use a new tracer_tracing_disable/enable() instead of data->disable
   flag

   There's a few cases that set the data->disable flag to stop tracing,
   but this flag is not consistently used. It is also an on/off switch
   where if a function set it and calls another function that sets it,
   the called function may incorrectly enable it.

   Use a new trace_tracing_disable() and tracer_tracing_enable() that
   uses a counter and can be nested. These use the ring buffer flags
   which are always checked making the disabling more consistent.

 - Save the trace clock in the persistent ring buffer

   Save what clock was used for tracing in the persistent ring buffer
   and set it back to that clock after a reboot.

 - Remove unused reference to a per CPU data pointer in mmiotrace
   functions

 - Remove unused buffer_page field from trace_array_cpu structure

 - Remove more strncpy() instances

 - Other minor clean ups and fixes

* tag 'trace-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (36 commits)
  tracing: Fix compilation warning on arm32
  tracing: Record trace_clock and recover when reboot
  tracing/sched: Use __string() instead of fixed lengths for task->comm
  tracepoint: Have tracepoints created with DECLARE_TRACE() have _tp suffix
  tracing: Cleanup upper_empty() in pid_list
  tracing: Allow the top level trace_marker to write into another instances
  tracing: Add a helper function to handle the dereference arg in verifier
  tracing: Remove unnecessary "goto out" that simply returns ret is trigger code
  tracing: Fix error handling in event_trigger_parse()
  tracing: Rename event_trigger_alloc() to trigger_data_alloc()
  tracing: Replace deprecated strncpy() with strscpy() for stack_trace_filter_buf
  tracing: Remove unused buffer_page field from trace_array_cpu structure
  tracing: Use atomic_inc_return() for updating "disabled" counter in irqsoff tracer
  tracing: Convert the per CPU "disabled" counter to local from atomic
  tracing: branch: Use trace_tracing_is_on_cpu() instead of "disabled" field
  ring-buffer: Add ring_buffer_record_is_on_cpu()
  tracing: Do not use per CPU array_buffer.data->disabled for cpumask
  ftrace: Do not disabled function graph based on "disabled" field
  tracing: kdb: Use tracer_tracing_on/off() instead of setting per CPU disabled
  tracing: Use tracer_tracing_disable() instead of "disabled" field for ftrace_dump_one()
  ...
2025-05-29 21:04:36 -07:00
Linus Torvalds
43db111107 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "As far as x86 goes this pull request "only" includes TDX host support.

  Quotes are appropriate because (at 6k lines and 100+ commits) it is
  much bigger than the rest, which will come later this week and
  consists mostly of bugfixes and selftests. s390 changes will also come
  in the second batch.

  ARM:

   - Add large stage-2 mapping (THP) support for non-protected guests
     when pKVM is enabled, clawing back some performance.

   - Enable nested virtualisation support on systems that support it,
     though it is disabled by default.

   - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE
     and protected modes.

   - Large rework of the way KVM tracks architecture features and links
     them with the effects of control bits. While this has no functional
     impact, it ensures correctness of emulation (the data is
     automatically extracted from the published JSON files), and helps
     dealing with the evolution of the architecture.

   - Significant changes to the way pKVM tracks ownership of pages,
     avoiding page table walks by storing the state in the hypervisor's
     vmemmap. This in turn enables the THP support described above.

   - New selftest checking the pKVM ownership transition rules

   - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
     even if the host didn't have it.

   - Fixes for the address translation emulation, which happened to be
     rather buggy in some specific contexts.

   - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
     from the number of counters exposed to a guest and addressing a
     number of issues in the process.

   - Add a new selftest for the SVE host state being corrupted by a
     guest.

   - Keep HCR_EL2.xMO set at all times for systems running with the
     kernel at EL2, ensuring that the window for interrupts is slightly
     bigger, and avoiding a pretty bad erratum on the AmpereOne HW.

   - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
     from a pretty bad case of TLB corruption unless accesses to HCR_EL2
     are heavily synchronised.

   - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
     tables in a human-friendly fashion.

   - and the usual random cleanups.

  LoongArch:

   - Don't flush tlb if the host supports hardware page table walks.

   - Add KVM selftests support.

  RISC-V:

   - Add vector registers to get-reg-list selftest

   - VCPU reset related improvements

   - Remove scounteren initialization from VCPU reset

   - Support VCPU reset from userspace using set_mpstate() ioctl

  x86:

   - Initial support for TDX in KVM.

     This finally makes it possible to use the TDX module to run
     confidential guests on Intel processors. This is quite a large
     series, including support for private page tables (managed by the
     TDX module and mirrored in KVM for efficiency), forwarding some
     TDVMCALLs to userspace, and handling several special VM exits from
     the TDX module.

     This has been in the works for literally years and it's not really
     possible to describe everything here, so I'll defer to the various
     merge commits up to and including commit 7bcf7246c4 ('Merge
     branch 'kvm-tdx-finish-initial' into HEAD')"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits)
  x86/tdx: mark tdh_vp_enter() as __flatten
  Documentation: virt/kvm: remove unreferenced footnote
  RISC-V: KVM: lock the correct mp_state during reset
  KVM: arm64: Fix documentation for vgic_its_iter_next()
  KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
  KVM: arm64: Stage-2 huge mappings for np-guests
  KVM: arm64: Add a range to pkvm_mappings
  KVM: arm64: Convert pkvm_mappings to interval tree
  KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
  KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
  KVM: arm64: Add a range to __pkvm_host_unshare_guest()
  KVM: arm64: Add a range to __pkvm_host_share_guest()
  KVM: arm64: Introduce for_each_hyp_page
  KVM: arm64: Handle huge mappings for np-guest CMOs
  KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
  KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
  KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
  RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
  RISC-V: KVM: Remove scounteren initialization
  KVM: RISC-V: remove unnecessary SBI reset state
  ...
2025-05-29 08:10:01 -07:00
Steven Rostedt
99d2328044 ring-buffer: Simplify functions with __free(kfree) to free allocations
The function rb_allocate_pages() allocates cpu_buffer and on error needs
to free it. It has a single return. Use __free(kfree) and return directly
on errors and have the return use return_ptr(cpu_buffer).

The function alloc_buffer() allocates buffer and on error needs to free
it. It has a single return. Use __free(kfree) and return directly on
errors and have the return use return_ptr(buffer).

The function __rb_map_vma() allocates a temporary array "pages". Have it
use __free() and not worry about freeing it when returning.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527143144.6edc4625@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:08 -04:00
Steven Rostedt
60bc720e10 ring-buffer: Make ring_buffer_{un}map() simpler with guard(mutex)
Convert the taking of the buffer->mutex and the cpu_buffer->mapping_lock
over to guard(mutex) and simplify the ring_buffer_map() and
ring_buffer_unmap() functions.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250527122009.267efb72@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:08 -04:00
Steven Rostedt
b2e7c6ed26 ring-buffer: Simplify ring_buffer_read_page() with guard()
The function ring_buffer_read_page() had two gotos. One was simply
returning "ret" and the other was unlocking the reader_lock.

There's no reason to use goto to simply return the "ret" variable. Instead
just return the value.

The jump to the unlocking of the reader_lock can be replaced by
guard(raw_spinlock_irqsave)(&cpu_buffer->reader_lock).

With these two changes the "ret" variable is no longer used and can be
removed. The return value on non-error is what was read and is stored in
the "read" variable.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527145216.0187cf36@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Steven Rostedt
f0d8cbc8cc ring-buffer: Simplify reset_disabled_cpu_buffer() with use of guard()
Use guard(raw_spinlock_irqsave)() in reset_disabled_cpu_buffer() to
simplify the locking.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527144623.77a9cc47@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Steven Rostedt
f115d2b70b ring-buffer: Remove jump to out label in ring_buffer_swap_cpu()
The function ring_buffer_swap_cpu() has a bunch of jumps to the label out
that simply returns "ret". There's no reason to jump to a label that
simply returns a value. Just return directly from there.

This goes back to almost the beginning when commit 8aabee573d
("ring-buffer: remove unneeded get_online_cpus") was introduced. That
commit removed a put_online_cpus() from that label, but never updated all
the jumps to it that now no longer needed to do anything but return a
value.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527145753.6b45d840@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Steven Rostedt
2d22216521 ring-buffer: Removed unnecessary if() goto out where out is the next line
In the function ring_buffer_discard_commit() there's an if statement that
jumps to the next line:

	if (rb_try_to_discard(cpu_buffer, event))
		goto out;
 out:

This was caused by the change that modified the way timestamps were taken
in interrupt context, and removed the code between the if statement and
the goto, but failed to update the conditional logic.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527155116.227f35be@gandalf.local.home
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Masami Hiramatsu (Google)
32dc004252 tracing: Reset last-boot buffers when reading out all cpu buffers
Reset the last-boot ring buffers when read() reads out all cpu
buffers through trace_pipe/trace_pipe_raw. This prevents ftrace to
unwind ring buffer read pointer next boot.

Note that this resets only when all per-cpu buffers are empty, and
read via read(2) syscall. For example, if you read only one of the
per-cpu trace_pipe, it does not reset it. Also, reading buffer by
splice(2) syscall does not reset because some data in the reader
(the last) page.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/174792929202.496143.8184644221859580999.stgit@mhiramat.tok.corp.google.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Steven Rostedt
c2a0831142 ring-buffer: Allow reserve_mem persistent ring buffers to be mmapped
When the persistent ring buffer is created from the memory returned by
reserve_mem there is nothing prohibiting it to be memory mapped to user
space. The memory is the same as the pages allocated by alloc_page().

The way the memory is managed by the ring buffer code is slightly
different though and needs to be addressed.

The persistent memory uses the page->id for its own purpose where as the
user mmap buffer currently uses that for the subbuf array mapped to user
space. If the buffer is a persistent buffer, use the page index into that
buffer as the identifier instead of the page->id.

That is, the page->id for a persistent buffer, represents the order of the
buffer is in the link list. ->id == 0 means it is the reader page.
When a reader page is swapped, the new reader page's ->id gets zero, and
the old reader page gets the ->id of the page that it swapped with.

The user space mapping has the ->id is the index of where it was mapped in
user space and does not change while it is mapped.

Since the persistent buffer is fixed in its location, the index of where
a page is in the memory range can be used as the "id" to put in the meta
page array, and it can be mapped in the same order to user space as it is
in the persistent memory.

A new rb_page_id() helper function is used to get and set the id depending
on if the page is a normal memory allocated buffer or a physical memory
mapped buffer.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/20250401203332.246646011@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:24:07 -04:00
Steven Rostedt
4fc78a7c9c ring-buffer: Do not trigger WARN_ON() due to a commit_overrun
When reading a memory mapped buffer the reader page is just swapped out
with the last page written in the write buffer. If the reader page is the
same as the commit buffer (the buffer that is currently being written to)
it was assumed that it should never have missed events. If it does, it
triggers a WARN_ON_ONCE().

But there just happens to be one scenario where this can legitimately
happen. That is on a commit_overrun. A commit overrun is when an interrupt
preempts an event being written to the buffer and then the interrupt adds
so many new events that it fills and wraps the buffer back to the commit.
Any new events would then be dropped and be reported as "missed_events".

In this case, the next page to read is the commit buffer and after the
swap of the reader page, the reader page will be the commit buffer, but
this time there will be missed events and this triggers the following
warning:

 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1127 at kernel/trace/ring_buffer.c:7357 ring_buffer_map_get_reader+0x49a/0x780
 Modules linked in: kvm_intel kvm irqbypass
 CPU: 2 UID: 0 PID: 1127 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00004-g478bc2824b45-dirty #564 PREEMPT
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:ring_buffer_map_get_reader+0x49a/0x780
 Code: 00 00 00 48 89 fe 48 c1 ee 03 80 3c 2e 00 0f 85 ec 01 00 00 4d 3b a6 a8 00 00 00 0f 85 8a fd ff ff 48 85 c0 0f 84 55 fe ff ff <0f> 0b e9 4e fe ff ff be 08 00 00 00 4c 89 54 24 58 48 89 54 24 50
 RSP: 0018:ffff888121787dc0 EFLAGS: 00010002
 RAX: 00000000000006a2 RBX: ffff888100062800 RCX: ffffffff8190cb49
 RDX: ffff888126934c00 RSI: 1ffff11020200a15 RDI: ffff8881010050a8
 RBP: dffffc0000000000 R08: 0000000000000000 R09: ffffed1024d26982
 R10: ffff888126934c17 R11: ffff8881010050a8 R12: ffff888126934c00
 R13: ffff8881010050b8 R14: ffff888101005000 R15: ffff888126930008
 FS:  00007f95c8cd7540(0000) GS:ffff8882b576e000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f95c8de4dc0 CR3: 0000000128452002 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  ? __pfx_ring_buffer_map_get_reader+0x10/0x10
  tracing_buffers_ioctl+0x283/0x370
  __x64_sys_ioctl+0x134/0x190
  do_syscall_64+0x79/0x1c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7f95c8de48db
 Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
 RSP: 002b:00007ffe037ba110 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 00007ffe037bb2b0 RCX: 00007f95c8de48db
 RDX: 0000000000000000 RSI: 0000000000005220 RDI: 0000000000000006
 RBP: 00007ffe037ba180 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffe037bb6f8 R14: 00007f95c9065000 R15: 00005575c7492c90
  </TASK>
 irq event stamp: 5080
 hardirqs last  enabled at (5079): [<ffffffff83e0adb0>] _raw_spin_unlock_irqrestore+0x50/0x70
 hardirqs last disabled at (5080): [<ffffffff83e0aa83>] _raw_spin_lock_irqsave+0x63/0x70
 softirqs last  enabled at (4182): [<ffffffff81516122>] handle_softirqs+0x552/0x710
 softirqs last disabled at (4159): [<ffffffff815163f7>] __irq_exit_rcu+0x107/0x210
 ---[ end trace 0000000000000000 ]---

The above was triggered by running on a kernel with both lockdep and KASAN
as well as kmemleak enabled and executing the following command:

 # perf record -o perf-test.dat -a -- trace-cmd record --nosplice  -e all -p function hackbench 50

With perf interjecting a lot of interrupts and trace-cmd enabling all
events as well as function tracing, with lockdep, KASAN and kmemleak
enabled, it could cause an interrupt preempting an event being written to
add enough events to wrap the buffer. trace-cmd was modified to have
--nosplice use mmap instead of reading the buffer.

The way to differentiate this case from the normal case of there only
being one page written to where the swap of the reader page received that
one page (which is the commit page), check if the tail page is on the
reader page. The difference between the commit page and the tail page is
that the tail page is where new writes go to, and the commit page holds
the first write that hasn't been committed yet. In the case of an
interrupt preempting the write of an event and filling the buffer, it
would move the tail page but not the commit page.

Have the warning only trigger if the tail page is also on the reader page,
and also print out the number of events dropped by a commit overrun as
that can not yet be safely added to the page so that the reader can see
there were events dropped.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250528121555.2066527e@gandalf.local.home
Fixes: fe832be05a ("ring-buffer: Have mmapped ring buffer keep track of missed events")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29 08:23:48 -04:00
Linus Torvalds
90b83efa67 Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:

 - Fix and improve BTF deduplication of identical BTF types (Alan
   Maguire and Andrii Nakryiko)

 - Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and
   Alexis Lothoré)

 - Support load-acquire and store-release instructions in BPF JIT on
   riscv64 (Andrea Parri)

 - Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton
   Protopopov)

 - Streamline allowed helpers across program types (Feng Yang)

 - Support atomic update for hashtab of BPF maps (Hou Tao)

 - Implement json output for BPF helpers (Ihor Solodrai)

 - Several s390 JIT fixes (Ilya Leoshkevich)

 - Various sockmap fixes (Jiayuan Chen)

 - Support mmap of vmlinux BTF data (Lorenz Bauer)

 - Support BPF rbtree traversal and list peeking (Martin KaFai Lau)

 - Tests for sockmap/sockhash redirection (Michal Luczaj)

 - Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)

 - Add support for dma-buf iterators in BPF (T.J. Mercier)

 - The verifier support for __bpf_trap() (Yonghong Song)

* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
  bpf, arm64: Remove unused-but-set function and variable.
  selftests/bpf: Add tests with stack ptr register in conditional jmp
  bpf: Do not include stack ptr register in precision backtracking bookkeeping
  selftests/bpf: enable many-args tests for arm64
  bpf, arm64: Support up to 12 function arguments
  bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
  bpf: Avoid __bpf_prog_ret0_warn when jit fails
  bpftool: Add support for custom BTF path in prog load/loadall
  selftests/bpf: Add unit tests with __bpf_trap() kfunc
  bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
  bpf: Remove special_kfunc_set from verifier
  selftests/bpf: Add test for open coded dmabuf_iter
  selftests/bpf: Add test for dmabuf_iter
  bpf: Add open coded dmabuf iterator
  bpf: Add dmabuf iterator
  dma-buf: Rename debugfs symbols
  bpf: Fix error return value in bpf_copy_from_user_dynptr
  libbpf: Use mmap to parse vmlinux BTF from sysfs
  selftests: bpf: Add a test for mmapable vmlinux BTF
  btf: Allow mmap of vmlinux btf
  ...
2025-05-28 15:52:42 -07:00
Linus Torvalds
1b98f357da Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
 "Core:

   - Implement the Device Memory TCP transmit path, allowing zero-copy
     data transmission on top of TCP from e.g. GPU memory to the wire.

   - Move all the IPv6 routing tables management outside the RTNL scope,
     under its own lock and RCU. The route control path is now 3x times
     faster.

   - Convert queue related netlink ops to instance lock, reducing again
     the scope of the RTNL lock. This improves the control plane
     scalability.

   - Refactor the software crc32c implementation, removing unneeded
     abstraction layers and improving significantly the related
     micro-benchmarks.

   - Optimize the GRO engine for UDP-tunneled traffic, for a 10%
     performance improvement in related stream tests.

   - Cover more per-CPU storage with local nested BH locking; this is a
     prep work to remove the current per-CPU lock in local_bh_disable()
     on PREMPT_RT.

   - Introduce and use nlmsg_payload helper, combining buffer bounds
     verification with accessing payload carried by netlink messages.

  Netfilter:

   - Rewrite the procfs conntrack table implementation, improving
     considerably the dump performance. A lot of user-space tools still
     use this interface.

   - Implement support for wildcard netdevice in netdev basechain and
     flowtables.

   - Integrate conntrack information into nft trace infrastructure.

   - Export set count and backend name to userspace, for better
     introspection.

  BPF:

   - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
     programs and can be controlled in similar way to traditional qdiscs
     using the "tc qdisc" command.

   - Refactor the UDP socket iterator, addressing long standing issues
     WRT duplicate hits or missed sockets.

  Protocols:

   - Improve TCP receive buffer auto-tuning and increase the default
     upper bound for the receive buffer; overall this improves the
     single flow maximum thoughput on 200Gbs link by over 60%.

   - Add AFS GSSAPI security class to AF_RXRPC; it provides transport
     security for connections to the AFS fileserver and VL server.

   - Improve TCP multipath routing, so that the sources address always
     matches the nexthop device.

   - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
     and thus preventing DoS caused by passing around problematic FDs.

   - Retire DCCP socket. DCCP only receives updates for bugs, and major
     distros disable it by default. Its removal allows for better
     organisation of TCP fields to reduce the number of cache lines hit
     in the fast path.

   - Extend TCP drop-reason support to cover PAWS checks.

  Driver API:

   - Reorganize PTP ioctl flag support to require an explicit opt-in for
     the drivers, avoiding the problem of drivers not rejecting new
     unsupported flags.

   - Converted several device drivers to timestamping APIs.

   - Introduce per-PHY ethtool dump helpers, improving the support for
     dump operations targeting PHYs.

  Tests and tooling:

   - Add support for classic netlink in user space C codegen, so that
     ynl-c can now read, create and modify links, routes addresses and
     qdisc layer configuration.

   - Add ynl sub-types for binary attributes, allowing ynl-c to output
     known struct instead of raw binary data, clarifying the classic
     netlink output.

   - Extend MPTCP selftests to improve the code-coverage.

   - Add tests for XDP tail adjustment in AF_XDP.

  New hardware / drivers:

   - OpenVPN virtual driver: offload OpenVPN data channels processing to
     the kernel-space, increasing the data transfer throughput WRT the
     user-space implementation.

   - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.

   - Broadcom asp-v3.0 ethernet driver.

   - AMD Renoir ethernet device.

   - ReakTek MT9888 2.5G ethernet PHY driver.

   - Aeonsemi 10G C45 PHYs driver.

  Drivers:

   - Ethernet high-speed NICs:
       - nVidia/Mellanox (mlx5):
           - refactor the steering table handling to significantly
             reduce the amount of memory used
           - add support for complex matches in H/W flow steering
           - improve flow streeing error handling
           - convert to netdev instance locking
       - Intel (100G, ice, igb, ixgbe, idpf):
           - ice: add switchdev support for LLDP traffic over VF
           - ixgbe: add firmware manipulation and regions devlink support
           - igb: introduce support for frame transmission premption
           - igb: adds persistent NAPI configuration
           - idpf: introduce RDMA support
           - idpf: add initial PTP support
       - Meta (fbnic):
           - extend hardware stats coverage
           - add devlink dev flash support
       - Broadcom (bnxt):
           - add support for RX-side device memory TCP
       - Wangxun (txgbe):
           - implement support for udp tunnel offload
           - complete PTP and SRIOV support for AML 25G/10G devices

   - Ethernet NICs embedded and virtual:
       - Google (gve):
           - add device memory TCP TX support
       - Amazon (ena):
           - support persistent per-NAPI config
       - Airoha:
           - add H/W support for L2 traffic offload
           - add per flow stats for flow offloading
       - RealTek (rtl8211): add support for WoL magic packet
       - Synopsys (stmmac):
           - dwmac-socfpga 1000BaseX support
           - add Loongson-2K3000 support
           - introduce support for hardware-accelerated VLAN stripping
       - Broadcom (bcmgenet):
           - expose more H/W stats
       - Freescale (enetc, dpaa2-eth):
           - enetc: add MAC filter, VLAN filter RSS and loopback support
           - dpaa2-eth: convert to H/W timestamping APIs
       - vxlan: convert FDB table to rhashtable, for better scalabilty
       - veth: apply qdisc backpressure on full ring to reduce TX drops

   - Ethernet switches:
       - Microchip (kzZ88x3): add ETS scheduler support

   - Ethernet PHYs:
       - RealTek (rtl8211):
           - add support for WoL magic packet
           - add support for PHY LEDs

   - CAN:
       - Adds RZ/G3E CANFD support to the rcar_canfd driver.
       - Preparatory work for CAN-XL support.
       - Add self-tests framework with support for CAN physical interfaces.

   - WiFi:
       - mac80211:
           - scan improvements with multi-link operation (MLO)
       - Qualcomm (ath12k):
           - enable AHB support for IPQ5332
           - add monitor interface support to QCN9274
           - add multi-link operation support to WCN7850
           - add 802.11d scan offload support to WCN7850
           - monitor mode for WCN7850, better 6 GHz regulatory
       - Qualcomm (ath11k):
           - restore hibernation support
       - MediaTek (mt76):
           - WiFi-7 improvements
           - implement support for mt7990
       - Intel (iwlwifi):
           - enhanced multi-link single-radio (EMLSR) support on 5 GHz links
           - rework device configuration
       - RealTek (rtw88):
           - improve throughput for RTL8814AU
       - RealTek (rtw89):
           - add multi-link operation support
           - STA/P2P concurrency improvements
           - support different SAR configs by antenna

   - Bluetooth:
       - introduce HCI Driver protocol
       - btintel_pcie: do not generate coredump for diagnostic events
       - btusb: add HCI Drv commands for configuring altsetting
       - btusb: add RTL8851BE device 0x0bda:0xb850
       - btusb: add new VID/PID 13d3/3584 for MT7922
       - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
       - btnxpuart: implement host-wakeup feature"

* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
  selftests/bpf: Fix bpf selftest build warning
  selftests: netfilter: Fix skip of wildcard interface test
  net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames
  net: openvswitch: Fix the dead loop of MPLS parse
  calipso: Don't call calipso functions for AF_INET sk.
  selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
  net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
  octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
  octeontx2-pf: QOS: Perform cache sync on send queue teardown
  net: mana: Add support for Multi Vports on Bare metal
  net: devmem: ncdevmem: remove unused variable
  net: devmem: ksft: upgrade rx test to send 1K data
  net: devmem: ksft: add 5 tuple FS support
  net: devmem: ksft: add exit_wait to make rx test pass
  net: devmem: ksft: add ipv4 support
  net: devmem: preserve sockc_err
  page_pool: fix ugly page_pool formatting
  net: devmem: move list_add to net_devmem_bind_dmabuf.
  selftests: netfilter: nft_queue.sh: include file transfer duration in log message
  net: phy: mscc: Fix memory leak when using one step timestamping
  ...
2025-05-28 15:24:36 -07:00
Pan Taixi
2fbdb6d8e0 tracing: Fix compilation warning on arm32
On arm32, size_t is defined to be unsigned int, while PAGE_SIZE is
unsigned long. This hence triggers a compilation warning as min()
asserts the type of two operands to be equal. Casting PAGE_SIZE to size_t
solves this issue and works on other target architectures as well.

Compilation warning details:

kernel/trace/trace.c: In function 'tracing_splice_read_pipe':
./include/linux/minmax.h:20:28: warning: comparison of distinct pointer types lacks a cast
  (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                            ^
./include/linux/minmax.h:26:4: note: in expansion of macro '__typecheck'
   (__typecheck(x, y) && __no_side_effects(x, y))
    ^~~~~~~~~~~

...

kernel/trace/trace.c:6771:8: note: in expansion of macro 'min'
        min((size_t)trace_seq_used(&iter->seq),
        ^~~

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250526013731.1198030-1-pantaixi@huaweicloud.com
Fixes: f5178c41bb ("tracing: Fix oob write in trace_seq_to_buffer()")
Reviewed-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Pan Taixi <pantaixi@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-28 16:10:43 -04:00
Linus Torvalds
3d413f0cfd Merge tag 'audit-pr-20250527' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:

 - Always record AUDIT_ANOM events when auditing is enabled.

   Prior to this patch we only recorded AUDIT_ANOM events if auditing
   was enabled and the admin/distro had explicitly configured audit
   beyond the defaults. Considering that AUDIT_ANOM events are anomolous
   events considered to be "security relevant", it seems wise to record
   these events as long as auditing is enabled, even if the system is
   running with a default audit configuration.

 - Mark the audit_log_vformat() function with the __printf() attribute
   to quiet GCC.

* tag 'audit-pr-20250527' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: record AUDIT_ANOM_* events regardless of presence of rules
  audit: mark audit_log_vformat() with __printf() attribute
2025-05-28 08:34:19 -07:00
Linus Torvalds
7af6e3febb Merge tag 'integrity-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
 "Carrying the IMA measurement list across kexec is not a new feature,
  but is updated to address a couple of issues:

   - Carrying the IMA measurement list across kexec required knowing
     apriori all the file measurements between the "kexec load" and
     "kexec execute" in order to measure them before the "kexec load".
     Any delay between the "kexec load" and "kexec exec" exacerbated the
     problem.

   - Any file measurements post "kexec load" were not carried across
     kexec, resulting in the measurement list being out of sync with the
     TPM PCR.

  With these changes, the buffer for the IMA measurement list is still
  allocated at "kexec load", but copying the IMA measurement list is
  deferred to after quiescing the TPM.

  Two new kexec critical data records are defined"

* tag 'integrity-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  ima: do not copy measurement list to kdump kernel
  ima: measure kexec load and exec events as critical data
  ima: make the kexec extra memory configurable
  ima: verify if the segment size has changed
  ima: kexec: move IMA log copy from kexec load to execute
  ima: kexec: define functions to copy IMA log at soft boot
  ima: kexec: skip IMA segment validation after kexec soft reboot
  kexec: define functions to map and unmap segments
  ima: define and call ima_alloc_kexec_file_buf()
  ima: rename variable the seq_file "file" to "ima_kexec_file"
2025-05-28 08:12:33 -07:00
Linus Torvalds
feacb1774b Merge tag 'sched_ext-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:

 - More in-kernel idle CPU selection improvements. Expand topology
   awareness coverage add scx_bpf_select_cpu_and() to allow more
   flexibility. The idle CPU selection kfuncs can now be called from
   unlocked contexts too.

 - A bunch of reorganization changes to lay the foundation for multiple
   hierarchical scheduler support. This isn't ready yet and the included
   changes don't make meaningful behavior differences. One notable
   change is replacing some static_key tests with dynamic tests as the
   test results may differ depending on the scheduler instance. This
   isn't expected to cause meaningful performance difference.

 - Other minor and doc updates.

 - There were multiple patches in for-6.15-fixes which conflicted with
   changes in for-6.16. for-6.15-fixes were pulled three times into
   for-6.16 to resolve the conflicts.

* tag 'sched_ext-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (49 commits)
  sched_ext: Call ops.update_idle() after updating builtin idle bits
  sched_ext, docs: convert mentions of "CFS" to "fair-class scheduler"
  selftests/sched_ext: Update test enq_select_cpu_fails
  sched_ext: idle: Consolidate default idle CPU selection kfuncs
  selftests/sched_ext: Add test for scx_bpf_select_cpu_and() via test_run
  sched_ext: idle: Allow scx_bpf_select_cpu_and() from unlocked context
  sched_ext: idle: Validate locking correctness in scx_bpf_select_cpu_and()
  sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.c
  sched_ext, docs: add label
  sched_ext: Explain the temporary situation around scx_root dereferences
  sched_ext: Add @sch to SCX_CALL_OP*()
  sched_ext: Cleanup [__]scx_exit/error*()
  sched_ext: Add @sch to SCX_CALL_OP*()
  sched_ext: Clean up scx_root usages
  Documentation: scheduler: Changed lowercase acronyms to uppercase
  sched_ext: Avoid NULL scx_root deref in __scx_exit()
  sched_ext: Add RCU protection to scx_root in DSQ iterator
  sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn()
  sched_ext: Move disable machinery into scx_sched
  sched_ext: Move event_stats_cpu into scx_sched
  ...
2025-05-27 21:12:50 -07:00
Linus Torvalds
3b66e6b3c0 Merge tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - cgroup rstat shared the tracking tree across all controllers with the
   rationale being that a cgroup which is using one resource is likely
   to be using other resources at the same time (ie. if something is
   allocating memory, it's probably consuming CPU cycles).

   However, this turned out to not scale very well especially with memcg
   using rstat for internal operations which made memcg stat read and
   flush patterns substantially different from other controllers. JP
   Kobryn split the rstat tree per controller.

 - cgroup BPF support was hooking into cgroup init/exit paths directly.

   Convert them to use a notifier chain instead so that other usages can
   be added easily. The two of the patches which implement this are
   mislabeled as belonging to sched_ext instead of cgroup. Sorry.

 - Relatively minor cpuset updates

 - Documentation updates

* tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits)
  sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifier
  sched_ext: Introduce cgroup_lifetime_notifier
  cgroup: Minor reorganization of cgroup_create()
  cgroup, docs: cpu controller's interaction with various scheduling policies
  cgroup, docs: convert space indentation to tab indentation
  cgroup: avoid per-cpu allocation of size zero rstat cpu locks
  cgroup, docs: be specific about bandwidth control of rt processes
  cgroup: document the rstat per-cpu initialization
  cgroup: helper for checking rstat participation of css
  cgroup: use subsystem-specific rstat locks to avoid contention
  cgroup: use separate rstat trees for each subsystem
  cgroup: compare css to cgroup::self in helper for distingushing css
  cgroup: warn on rstat usage by early init subsystems
  cgroup/cpuset: drop useless cpumask_empty() in compute_effective_exclusive_cpumask()
  cgroup/rstat: Improve cgroup_rstat_push_children() documentation
  cgroup: fix goto ordering in cgroup_init()
  cgroup: fix pointer check in css_rstat_init()
  cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs
  cgroup/cpuset: Fix obsolete comment in cpuset_css_offline()
  cgroup/cpuset: Always use cpu_active_mask
  ...
2025-05-27 20:59:53 -07:00
Linus Torvalds
91ad250cbe Merge tag 'wq-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Fix statistic update race condition and a couple documentation
  updates"

* tag 'wq-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix typo in comment
  workqueue: Fix race condition in wq->stats incrementation
  workqueue: Better document teardown for delayed_work
2025-05-27 20:49:06 -07:00
Linus Torvalds
f1975e4765 Merge tag 'sysctl-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:

 - Move kern_table members out of kernel/sysctl.c

   Moved a subset (tracing, panic, signal, stack_tracer and sparc) out
   of the kern_table array. The goal is for kern_table to only have
   sysctl elements. All this increases modularity by placing the
   ctl_tables closer to where they are used while reducing the chances
   of merge conflicts in kernel/sysctl.c.

 - Fixed sysctl unit test panic by relocating it to selftests

* tag 'sysctl-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: Close test ctl_headers with a for loop
  sysctl: call sysctl tests with a for loop
  sysctl: Add 0012 to test the u8 range check
  sysctl: move u8 register test to lib/test_sysctl.c
  sparc: mv sparc sysctls into their own file under arch/sparc/kernel
  stack_tracer: move sysctl registration to kernel/trace/trace_stack.c
  tracing: Move trace sysctls into trace.c
  signal: Move signal ctl tables into signal.c
  panic: Move panic ctl tables into panic.c
2025-05-27 20:43:35 -07:00
Linus Torvalds
5cf5240991 Merge tag 'for-linus-6.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:

 - A fix for running as a Xen dom0 on the iMX8QXP Arm platform

 - An update of the xen.config adding XEN_UNPOPULATED_ALLOC for better
   support of PVH dom0

 - A fix of the Xen balloon driver when running without
   CONFIG_XEN_UNPOPULATED_ALLOC

 - A fix of the dm_op Xen hypercall on Arm needed to pass user space
   buffers to the hypervisor in certain configurations

* tag 'for-linus-6.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/arm: call uaccess_ttbr0_enable for dm_op hypercall
  xen/x86: fix initial memory balloon target
  xen: enable XEN_UNPOPULATED_ALLOC as part of xen.config
  xen: swiotlb: Wire up map_resource callback
2025-05-27 20:36:30 -07:00
Linus Torvalds
23022f5456 Merge tag 'dma-mapping-6.16-2025-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping updates from Marek Szyprowski:
 "New two step DMA mapping API, which is is a first step to a long path
  to provide alternatives to scatterlist and to remove hacks, abuses and
  design mistakes related to scatterlists.

  This new approach optimizes some calls to DMA-IOMMU layer and cache
  maintenance by batching them, reduces memory usage as it is no need to
  store mapped DMA addresses to unmap them, and reduces some function
  call overhead.  It is a combination effort of many people, lead and
  developed by Christoph Hellwig and Leon Romanovsky"

* tag 'dma-mapping-6.16-2025-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  docs: core-api: document the IOVA-based API
  dma-mapping: add a dma_need_unmap helper
  dma-mapping: Implement link/unlink ranges API
  iommu/dma: Factor out a iommu_dma_map_swiotlb helper
  dma-mapping: Provide an interface to allow allocate IOVA
  iommu: add kernel-doc for iommu_unmap_fast
  iommu: generalize the batched sync after map interface
  dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  PCI/P2PDMA: Refactor the p2pdma mapping helpers
2025-05-27 20:09:06 -07:00
Wang Yaxin
5ef2dccfcc delayacct: remove redundant code and adjust indentation
Remove redundant code and adjust indentation of xxx_delay_max/min.

Link: https://lkml.kernel.org/r/20250521093157668iQrhhcMjA-th5LQf4-A3c@zte.com.cn
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-27 19:40:33 -07:00
Arnd Bergmann
4496e1c135 crash_dump, nvme: select CONFIGFS_FS as built-in
Configfs can be configured as a loadable module, which causes a link-time
failure for dm-crypt crash dump support:

crash_dump_dm_crypt.c:(.text+0x3a4): undefined reference to `config_item_init_type_name'
aarch64-linux-ld: kernel/crash_dump_dm_crypt.o: in function `configfs_dmcrypt_keys_init':
crash_dump_dm_crypt.c:(.init.text+0x90): undefined reference to `config_group_init'
aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xb4): undefined reference to `configfs_register_subsystem'
aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xd8): undefined reference to `configfs_unregister_subsystem'

This could be avoided with a dependency on CONFIGFS_FS=y, but the
dependency has an additional problem of causing Kconfig dependency loops
since most other uses select the symbol.

Using a simple 'select CONFIGFS_FS' here in turn fails with
CONFIG_DM_CRYPT=m, because that still only causes configfs to be a
loadable module.

The only version I found that fixes this reliably uses an additional
Kconfig symbol to ensure the 'select' actually turns on configfs as
builtin, with two additional changes to avoid dependency loops with nvme
and sysfs.

There is no compile-time dependency between configfs and sysfs, so
selecting configfs from a driver with sysfs disabled does not cause link
failures, only the default /sys/kernel/config mount point will not be
created.

Link: https://lkml.kernel.org/r/20250521160359.2132363-1-arnd@kernel.org
Fixes: 6b23858fd63b ("crash_dump: make dm crypt keys persist for the kdump kernel")
Fixes: 1fb4704084 ("nvme-loop: add configfs dependency")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Breno Leitao <leitao@debian.org>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-27 19:40:33 -07:00
Linus Torvalds
c89756bcf4 Merge tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "Once again, the changes are dominated by cpufreq updates, but this
  time the majority of them are cpufreq core changes, mostly related to
  the introduction of policy locking guards and __free() usage, and
  fixes related to boost handling.

  Still, there is also a significant update of the intel_pstate driver
  making it register an energy model when running on a hybrid platform
  which is used for enabling energy-aware scheduling (EAS) if the driver
  operates in the passive mode (and schedutil is used as the cpufreq
  governor for all CPUs which is the passive mode default).

  There are some amd-pstate driver updates too, for a good measure,
  including the "Requested CPU Min frequency" BIOS option support and
  new online/offline callbacks.

  In the cpuidle space, the most significant change is the addition of a
  C1 demotion on/off sysfs knob to intel_idle which should help some
  users to configure their systems more precisely. There is also the
  conversion of the PSCI cpuidle driver to a faux device one and there
  are two small updates of cpuidle governors.

  Device power management is also modified quite a bit, especially the
  handling of devices with asynchronous suspend and resume enabled
  during system transitions. They are now going to be handled more
  asynchronously during suspend transitions and somewhat less
  aggressively during resume transitions.

  Apart from the above, the operating performance points (OPP) library
  is now going to use mutex locking guards and scope-based cleanup
  helpers and there is the usual bunch of assorted fixes and code
  cleanups.

  Specifics:

   - Fix potential division-by-zero error in em_compute_costs() (Yaxiong
     Tian)

   - Fix typos in energy model documentation and example driver code
     (Moon Hee Lee, Atul Kumar Pant)

   - Rearrange the energy model management code and add a new function
     for adjusting a CPU energy model after adjusting the capacity of
     the given CPU to it (Rafael Wysocki)

   - Refactor cpufreq_online(), add and use cpufreq policy locking
     guards, use __free() in policy reference counting, and clean up
     core cpufreq code on top of that (Rafael Wysocki)

   - Fix boost handling on CPU suspend/resume and sysfs updates (Viresh
     Kumar)

   - Fix des_perf clamping with max_perf in amd_pstate_update()
     (Dhananjay Ugwekar)

   - Add offline, online and suspend callbacks to the amd-pstate driver,
     rename and use the existing amd_pstate_epp callbacks in it
     (Dhananjay Ugwekar)

   - Add support for the "Requested CPU Min frequency" BIOS option to
     the amd-pstate driver (Dhananjay Ugwekar)

   - Reset amd-pstate driver mode after running selftests (Swapnil
     Sapkal)

   - Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan
     Chancellor)

   - Add helper for governor checks to the schedutil cpufreq governor
     and move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki)

   - Populate the cpu_capacity sysfs entries from the intel_pstate
     driver after registering asym capacity support (Ricardo Neri)

   - Add support for enabling Energy-aware scheduling (EAS) to the
     intel_pstate driver when operating in the passive mode on a hybrid
     platform (Rafael Wysocki)

   - Drop redundant cpus_read_lock() from store_local_boost() in the
     cpufreq core (Seyediman Seyedarab)

   - Replace sscanf() with kstrtouint() in the cpufreq code and use a
     symbol instead of a raw number in it (Bowen Yu)

   - Add support for autonomous CPU performance state selection to the
     CPPC cpufreq driver (Lifeng Zheng)

   - OPP: Add dev_pm_opp_set_level() (Praveen Talari)

   - Introduce scope-based cleanup headers and mutex locking guards in
     OPP core (Viresh Kumar)

   - Switch OPP to use kmemdup_array() (Zhang Enpei)

   - Optimize bucket assignment when next_timer_ns equals KTIME_MAX in
     the menu cpuidle governor (Zhongqiu Han)

   - Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla)

   - Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem
     Bityutskiy)

   - Fix typos in two comments in the teo cpuidle governor (Atul Kumar
     Pant)

   - Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
     Kalla)

   - Move debug runtime PM attributes to runtime_attrs[] (Rafael
     Wysocki)

   - Add new devm_ functions for enabling runtime PM and runtime PM
     reference counting (Bence Csókás)

   - Remove size arguments from strscpy() calls in the hibernation core
     code (Thorsten Blum)

   - Adjust the handling of devices with asynchronous suspend enabled
     during system suspend and resume to start resuming them immediately
     after resuming their parents and to start suspending such a device
     immediately after suspending its first child (Rafael Wysocki)

   - Adjust messages printed during tasks freezing to avoid using
     pr_cont() (Andrew Sayers, Paul Menzel)

   - Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
     Zhang)

   - Add missing wakeup source attribute relax_count to sysfs and remove
     the space character at the end ofi the string produced by
     pm_show_wakelocks() (Zijun Hu)

   - Add configurable pm_test delay for hibernation (Zihuan Zhang)

   - Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
     cypd4226 device on Tegra boards from suspending prematurely (Jon
     Hunter)

   - Unbreak printing PM debug messages during hibernation and clean up
     some related code (Rafael Wysocki)

   - Add a systemd service to run cpupower and change cpupower binding's
     Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli)"

* tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits)
  cpufreq: CPPC: Add support for autonomous selection
  cpufreq: Update sscanf() to kstrtouint()
  cpufreq: Replace magic number
  OPP: switch to use kmemdup_array()
  PM: freezer: Rewrite restarting tasks log to remove stray *done.*
  PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn()
  cpufreq: drop redundant cpus_read_lock() from store_local_boost()
  cpupower: do not install files to /etc/default/
  cpupower: do not call systemctl at install time
  cpupower: do not write DESTDIR to cpupower.service
  PM: sleep: Introduce pm_sleep_transition_in_progress()
  cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
  cpufreq: intel_pstate: Document hybrid processor support
  cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
  cpufreq: intel_pstate: EAS support for hybrid platforms
  PM: EM: Introduce em_adjust_cpu_capacity()
  PM: EM: Move CPU capacity check to em_adjust_new_capacity()
  PM: EM: Documentation: Fix typos in example driver code
  cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
  PM: sleep: Introduce pm_suspend_in_progress()
  ...
2025-05-27 16:48:47 -07:00
Yonghong Song
5ffb537e41 selftests/bpf: Add tests with stack ptr register in conditional jmp
Add two tests:
  - one test has 'rX <op> r10' where rX is not r10, and
  - another test has 'rX <op> rY' where rX and rY are not r10
    but there is an early insn 'rX = r10'.

Without previous verifier change, both tests will fail.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250524041340.4046304-1-yonghong.song@linux.dev
2025-05-27 14:09:12 -07:00
Yonghong Song
e2d2115e56 bpf: Do not include stack ptr register in precision backtracking bookkeeping
Yi Lai reported an issue ([1]) where the following warning appears
in kernel dmesg:
  [   60.643604] verifier backtracking bug
  [   60.643635] WARNING: CPU: 10 PID: 2315 at kernel/bpf/verifier.c:4302 __mark_chain_precision+0x3a6c/0x3e10
  [   60.648428] Modules linked in: bpf_testmod(OE)
  [   60.650471] CPU: 10 UID: 0 PID: 2315 Comm: test_progs Tainted: G           OE       6.15.0-rc4-gef11287f8289-dirty #327 PREEMPT(full)
  [   60.654385] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  [   60.656682] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [   60.660475] RIP: 0010:__mark_chain_precision+0x3a6c/0x3e10
  [   60.662814] Code: 5a 30 84 89 ea e8 c4 d9 01 00 80 3d 3e 7d d8 04 00 0f 85 60 fa ff ff c6 05 31 7d d8 04
                       01 48 c7 c7 00 58 30 84 e8 c4 06 a5 ff <0f> 0b e9 46 fa ff ff 48 ...
  [   60.668720] RSP: 0018:ffff888116cc7298 EFLAGS: 00010246
  [   60.671075] RAX: 54d70e82dfd31900 RBX: ffff888115b65e20 RCX: 0000000000000000
  [   60.673659] RDX: 0000000000000001 RSI: 0000000000000004 RDI: 00000000ffffffff
  [   60.676241] RBP: 0000000000000400 R08: ffff8881f6f23bd3 R09: 1ffff1103ede477a
  [   60.678787] R10: dffffc0000000000 R11: ffffed103ede477b R12: ffff888115b60ae8
  [   60.681420] R13: 1ffff11022b6cbc4 R14: 00000000fffffff2 R15: 0000000000000001
  [   60.684030] FS:  00007fc2aedd80c0(0000) GS:ffff88826fa8a000(0000) knlGS:0000000000000000
  [   60.686837] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   60.689027] CR2: 000056325369e000 CR3: 000000011088b002 CR4: 0000000000370ef0
  [   60.691623] Call Trace:
  [   60.692821]  <TASK>
  [   60.693960]  ? __pfx_verbose+0x10/0x10
  [   60.695656]  ? __pfx_disasm_kfunc_name+0x10/0x10
  [   60.697495]  check_cond_jmp_op+0x16f7/0x39b0
  [   60.699237]  do_check+0x58fa/0xab10
  ...

Further analysis shows the warning is at line 4302 as below:

  4294                 /* static subprog call instruction, which
  4295                  * means that we are exiting current subprog,
  4296                  * so only r1-r5 could be still requested as
  4297                  * precise, r0 and r6-r10 or any stack slot in
  4298                  * the current frame should be zero by now
  4299                  */
  4300                 if (bt_reg_mask(bt) & ~BPF_REGMASK_ARGS) {
  4301                         verbose(env, "BUG regs %x\n", bt_reg_mask(bt));
  4302                         WARN_ONCE(1, "verifier backtracking bug");
  4303                         return -EFAULT;
  4304                 }

With the below test (also in the next patch):
  __used __naked static void __bpf_jmp_r10(void)
  {
	asm volatile (
	"r2 = 2314885393468386424 ll;"
	"goto +0;"
	"if r2 <= r10 goto +3;"
	"if r1 >= -1835016 goto +0;"
	"if r2 <= 8 goto +0;"
	"if r3 <= 0 goto +0;"
	"exit;"
	::: __clobber_all);
  }

  SEC("?raw_tp")
  __naked void bpf_jmp_r10(void)
  {
	asm volatile (
	"r3 = 0 ll;"
	"call __bpf_jmp_r10;"
	"r0 = 0;"
	"exit;"
	::: __clobber_all);
  }

The following is the verifier failure log:
  0: (18) r3 = 0x0                      ; R3_w=0
  2: (85) call pc+2
  caller:
   R10=fp0
  callee:
   frame1: R1=ctx() R3_w=0 R10=fp0
  5: frame1: R1=ctx() R3_w=0 R10=fp0
  ; asm volatile ("                                 \ @ verifier_precision.c:184
  5: (18) r2 = 0x20202000256c6c78       ; frame1: R2_w=0x20202000256c6c78
  7: (05) goto pc+0
  8: (bd) if r2 <= r10 goto pc+3        ; frame1: R2_w=0x20202000256c6c78 R10=fp0
  9: (35) if r1 >= 0xffe3fff8 goto pc+0         ; frame1: R1=ctx()
  10: (b5) if r2 <= 0x8 goto pc+0
  mark_precise: frame1: last_idx 10 first_idx 0 subseq_idx -1
  mark_precise: frame1: regs=r2 stack= before 9: (35) if r1 >= 0xffe3fff8 goto pc+0
  mark_precise: frame1: regs=r2 stack= before 8: (bd) if r2 <= r10 goto pc+3
  mark_precise: frame1: regs=r2,r10 stack= before 7: (05) goto pc+0
  mark_precise: frame1: regs=r2,r10 stack= before 5: (18) r2 = 0x20202000256c6c78
  mark_precise: frame1: regs=r10 stack= before 2: (85) call pc+2
  BUG regs 400

The main failure reason is due to r10 in precision backtracking bookkeeping.
Actually r10 is always precise and there is no need to add it for the precision
backtracking bookkeeping.

One way to fix the issue is to prevent bt_set_reg() if any src/dst reg is
r10. Andrii suggested to go with push_insn_history() approach to avoid
explicitly checking r10 in backtrack_insn().

This patch added push_insn_history() support for cond_jmp like 'rX <op> rY'
operations. In check_cond_jmp_op(), if any of rX or rY is a stack pointer,
push_insn_history() will record such information, and later backtrack_insn()
will do bt_set_reg() properly for those register(s).

  [1] https://lore.kernel.org/bpf/Z%2F8q3xzpU59CIYQE@ly-workstation/

Reported by: Yi Lai <yi1.lai@linux.intel.com>

Fixes: 407958a0e9 ("bpf: encapsulate precision backtracking bookkeeping")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250524041335.4046126-1-yonghong.song@linux.dev
2025-05-27 14:09:12 -07:00
Steven Rostedt
c98cc9797b ring-buffer: Move cpus_read_lock() outside of buffer->mutex
Running a modified trace-cmd record --nosplice where it does a mmap of the
ring buffer when '--nosplice' is set, caused the following lockdep splat:

 ======================================================
 WARNING: possible circular locking dependency detected
 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 Not tainted
 ------------------------------------------------------
 trace-cmd/1113 is trying to acquire lock:
 ffff888100062888 (&buffer->mutex){+.+.}-{4:4}, at: ring_buffer_map+0x11c/0xe70

 but task is already holding lock:
 ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #5 (&cpu_buffer->mapping_lock){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        ring_buffer_map+0xcf/0xe70
        tracing_buffers_mmap+0x1c4/0x3b0
        __mmap_region+0xd8d/0x1f70
        do_mmap+0x9d7/0x1010
        vm_mmap_pgoff+0x20b/0x390
        ksys_mmap_pgoff+0x2e9/0x440
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #4 (&mm->mmap_lock){++++}-{4:4}:
        __might_fault+0xa5/0x110
        _copy_to_user+0x22/0x80
        _perf_ioctl+0x61b/0x1b70
        perf_ioctl+0x62/0x90
        __x64_sys_ioctl+0x134/0x190
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #3 (&cpuctx_mutex){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        perf_event_init_cpu+0x325/0x7c0
        perf_event_init+0x52a/0x5b0
        start_kernel+0x263/0x3e0
        x86_64_start_reservations+0x24/0x30
        x86_64_start_kernel+0x95/0xa0
        common_startup_64+0x13e/0x141

 -> #2 (pmus_lock){+.+.}-{4:4}:
        __mutex_lock+0x192/0x18c0
        perf_event_init_cpu+0xb7/0x7c0
        cpuhp_invoke_callback+0x2c0/0x1030
        __cpuhp_invoke_callback_range+0xbf/0x1f0
        _cpu_up+0x2e7/0x690
        cpu_up+0x117/0x170
        cpuhp_bringup_mask+0xd5/0x120
        bringup_nonboot_cpus+0x13d/0x170
        smp_init+0x2b/0xf0
        kernel_init_freeable+0x441/0x6d0
        kernel_init+0x1e/0x160
        ret_from_fork+0x34/0x70
        ret_from_fork_asm+0x1a/0x30

 -> #1 (cpu_hotplug_lock){++++}-{0:0}:
        cpus_read_lock+0x2a/0xd0
        ring_buffer_resize+0x610/0x14e0
        __tracing_resize_ring_buffer.part.0+0x42/0x120
        tracing_set_tracer+0x7bd/0xa80
        tracing_set_trace_write+0x132/0x1e0
        vfs_write+0x21c/0xe80
        ksys_write+0xf9/0x1c0
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 -> #0 (&buffer->mutex){+.+.}-{4:4}:
        __lock_acquire+0x1405/0x2210
        lock_acquire+0x174/0x310
        __mutex_lock+0x192/0x18c0
        ring_buffer_map+0x11c/0xe70
        tracing_buffers_mmap+0x1c4/0x3b0
        __mmap_region+0xd8d/0x1f70
        do_mmap+0x9d7/0x1010
        vm_mmap_pgoff+0x20b/0x390
        ksys_mmap_pgoff+0x2e9/0x440
        do_syscall_64+0x79/0x1c0
        entry_SYSCALL_64_after_hwframe+0x76/0x7e

 other info that might help us debug this:

 Chain exists of:
   &buffer->mutex --> &mm->mmap_lock --> &cpu_buffer->mapping_lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&cpu_buffer->mapping_lock);
                                lock(&mm->mmap_lock);
                                lock(&cpu_buffer->mapping_lock);
   lock(&buffer->mutex);

  *** DEADLOCK ***

 2 locks held by trace-cmd/1113:
  #0: ffff888106b847e0 (&mm->mmap_lock){++++}-{4:4}, at: vm_mmap_pgoff+0x192/0x390
  #1: ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

 stack backtrace:
 CPU: 5 UID: 0 PID: 1113 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 PREEMPT
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x6e/0xa0
  print_circular_bug.cold+0x178/0x1be
  check_noncircular+0x146/0x160
  __lock_acquire+0x1405/0x2210
  lock_acquire+0x174/0x310
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? __mutex_lock+0x169/0x18c0
  __mutex_lock+0x192/0x18c0
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? function_trace_call+0x296/0x370
  ? __pfx___mutex_lock+0x10/0x10
  ? __pfx_function_trace_call+0x10/0x10
  ? __pfx___mutex_lock+0x10/0x10
  ? _raw_spin_unlock+0x2d/0x50
  ? ring_buffer_map+0x11c/0xe70
  ? ring_buffer_map+0x11c/0xe70
  ? __mutex_lock+0x5/0x18c0
  ring_buffer_map+0x11c/0xe70
  ? do_raw_spin_lock+0x12d/0x270
  ? find_held_lock+0x2b/0x80
  ? _raw_spin_unlock+0x2d/0x50
  ? rcu_is_watching+0x15/0xb0
  ? _raw_spin_unlock+0x2d/0x50
  ? trace_preempt_on+0xd0/0x110
  tracing_buffers_mmap+0x1c4/0x3b0
  __mmap_region+0xd8d/0x1f70
  ? ring_buffer_lock_reserve+0x99/0xff0
  ? __pfx___mmap_region+0x10/0x10
  ? ring_buffer_lock_reserve+0x99/0xff0
  ? __pfx_ring_buffer_lock_reserve+0x10/0x10
  ? __pfx_ring_buffer_lock_reserve+0x10/0x10
  ? bpf_lsm_mmap_addr+0x4/0x10
  ? security_mmap_addr+0x46/0xd0
  ? lock_is_held_type+0xd9/0x130
  do_mmap+0x9d7/0x1010
  ? 0xffffffffc0370095
  ? __pfx_do_mmap+0x10/0x10
  vm_mmap_pgoff+0x20b/0x390
  ? __pfx_vm_mmap_pgoff+0x10/0x10
  ? 0xffffffffc0370095
  ksys_mmap_pgoff+0x2e9/0x440
  do_syscall_64+0x79/0x1c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7fb0963a7de2
 Code: 00 00 00 0f 1f 44 00 00 41 f7 c1 ff 0f 00 00 75 27 55 89 cd 53 48 89 fb 48 85 ff 74 3b 41 89 ea 48 89 df b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 5b 5d c3 0f 1f 00 48 8b 05 e1 9f 0d 00 64
 RSP: 002b:00007ffdcc8fb878 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb0963a7de2
 RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
 RBP: 0000000000000001 R08: 0000000000000006 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffdcc8fbe68 R14: 00007fb096628000 R15: 00005633e01a5c90
  </TASK>

The issue is that cpus_read_lock() is taken within buffer->mutex. The
memory mapped pages are taken with the mmap_lock held. The buffer->mutex
is taken within the cpu_buffer->mapping_lock. There's quite a chain with
all these locks, where the deadlock can be fixed by moving the
cpus_read_lock() outside the taking of the buffer->mutex.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250527105820.0f45d045@gandalf.local.home
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-27 15:46:39 -04:00
Hou Tao
d496557826 bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
bpf_map_lookup_percpu_elem() helper is also available for sleepable bpf
program. When BPF JIT is disabled or under 32-bit host,
bpf_map_lookup_percpu_elem() will not be inlined. Using it in a
sleepable bpf program will trigger the warning in
bpf_map_lookup_percpu_elem(), because the bpf program only holds
rcu_read_lock_trace lock. Therefore, add the missed check.

Reported-by: syzbot+dce5aae19ae4d6399986@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000176a130617420310@google.com/
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250526062534.1105938-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27 10:45:59 -07:00
KaFai Wan
86bc9c7424 bpf: Avoid __bpf_prog_ret0_warn when jit fails
syzkaller reported an issue:

WARNING: CPU: 3 PID: 217 at kernel/bpf/core.c:2357 __bpf_prog_ret0_warn+0xa/0x20 kernel/bpf/core.c:2357
Modules linked in:
CPU: 3 UID: 0 PID: 217 Comm: kworker/u32:6 Not tainted 6.15.0-rc4-syzkaller-00040-g8bac8898fe39
RIP: 0010:__bpf_prog_ret0_warn+0xa/0x20 kernel/bpf/core.c:2357
Call Trace:
 <TASK>
 bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline]
 __bpf_prog_run include/linux/filter.h:718 [inline]
 bpf_prog_run include/linux/filter.h:725 [inline]
 cls_bpf_classify+0x74a/0x1110 net/sched/cls_bpf.c:105
 ...

When creating bpf program, 'fp->jit_requested' depends on bpf_jit_enable.
This issue is triggered because of CONFIG_BPF_JIT_ALWAYS_ON is not set
and bpf_jit_enable is set to 1, causing the arch to attempt JIT the prog,
but jit failed due to FAULT_INJECTION. As a result, incorrectly
treats the program as valid, when the program runs it calls
`__bpf_prog_ret0_warn` and triggers the WARN_ON_ONCE(1).

Reported-by: syzbot+0903f6d7f285e41cdf10@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/6816e34e.a70a0220.254cdc.002c.GAE@google.com
Fixes: fa9dd599b4 ("bpf: get rid of pure_initcall dependency to enable jits")
Signed-off-by: KaFai Wan <mannkafai@gmail.com>
Link: https://lore.kernel.org/r/20250526133358.2594176-1-mannkafai@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27 10:43:10 -07:00
Yonghong Song
f95695f2c4 bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
Marc Suñé (Isovalent, part of Cisco) reported an issue where an
uninitialized variable caused generating bpf prog binary code not
working as expected. The reproducer is in [1] where the flags
“-Wall -Werror” are enabled, but there is no warning as the compiler
takes advantage of uninitialized variable to do aggressive optimization.
The optimized code looks like below:

      ; {
           0:       bf 16 00 00 00 00 00 00 r6 = r1
      ;       bpf_printk("Start");
           1:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
                    0000000000000008:  R_BPF_64_64  .rodata
           3:       b4 02 00 00 06 00 00 00 w2 = 0x6
           4:       85 00 00 00 06 00 00 00 call 0x6
      ; DEFINE_FUNC_CTX_POINTER(data)
           5:       61 61 4c 00 00 00 00 00 w1 = *(u32 *)(r6 + 0x4c)
      ;       bpf_printk("pre ipv6_hdrlen_offset");
           6:       18 01 00 00 06 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x6 ll
                    0000000000000030:  R_BPF_64_64  .rodata
           8:       b4 02 00 00 17 00 00 00 w2 = 0x17
           9:       85 00 00 00 06 00 00 00 call 0x6
      <END>

The verifier will report the following failure:
  9: (85) call bpf_trace_printk#6
  last insn is not an exit or jmp

The above verifier log does not give a clear hint about how to fix
the problem and user may take quite some time to figure out that
the issue is due to compiler taking advantage of uninitialized variable.

In llvm internals, uninitialized variable usage may generate
'unreachable' IR insn and these 'unreachable' IR insns may indicate
uninitialized variable impact on code optimization. So far, llvm
BPF backend ignores 'unreachable' IR hence the above code is generated.
With clang21 patch [2], those 'unreachable' IR insn are converted
to func __bpf_trap(). In order to maintain proper control flow
graph for bpf progs, [2] also adds an 'exit' insn after bpf_trap()
if __bpf_trap() is the last insn in the function. The new code looks like:

      ; {
           0:       bf 16 00 00 00 00 00 00 r6 = r1
      ;       bpf_printk("Start");
           1:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
                    0000000000000008:  R_BPF_64_64  .rodata
           3:       b4 02 00 00 06 00 00 00 w2 = 0x6
           4:       85 00 00 00 06 00 00 00 call 0x6
      ; DEFINE_FUNC_CTX_POINTER(data)
           5:       61 61 4c 00 00 00 00 00 w1 = *(u32 *)(r6 + 0x4c)
      ;       bpf_printk("pre ipv6_hdrlen_offset");
           6:       18 01 00 00 06 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x6 ll
                    0000000000000030:  R_BPF_64_64  .rodata
           8:       b4 02 00 00 17 00 00 00 w2 = 0x17
           9:       85 00 00 00 06 00 00 00 call 0x6
          10:       85 10 00 00 ff ff ff ff call -0x1
                    0000000000000050:  R_BPF_64_32  __bpf_trap
          11:       95 00 00 00 00 00 00 00 exit
      <END>

In kernel, a new kfunc __bpf_trap() is added. During insn
verification, any hit with __bpf_trap() will result in
verification failure. The kernel is able to provide better
log message for debugging.

With llvm patch [2] and without this patch (no __bpf_trap()
kfunc for existing kernel), e.g., for old kernels, the verifier
outputs
  10: <invalid kfunc call>
  kfunc '__bpf_trap' is referenced but wasn't resolved
Basically, kernel does not support __bpf_trap() kfunc.
This still didn't give clear signals about possible reason.

With llvm patch [2] and with this patch, the verifier outputs
  10: (85) call __bpf_trap#74479
  unexpected __bpf_trap() due to uninitialized variable?
It gives much better hints for verification failure.

  [1] https://github.com/msune/clang_bpf/blob/main/Makefile#L3
  [2] https://github.com/llvm/llvm-project/pull/131731

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250523205326.1291640-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27 09:54:20 -07:00
Yonghong Song
d848bba680 bpf: Remove special_kfunc_set from verifier
Currently, the verifier has both special_kfunc_set and special_kfunc_list.
When adding a new kfunc usage to the verifier, it is often confusing
about whether special_kfunc_set or special_kfunc_list or both should
add that kfunc. For example, some kfuncs, e.g., bpf_dynptr_from_skb,
bpf_dynptr_clone, bpf_wq_set_callback_impl, does not need to be
in special_kfunc_set.

To avoid potential future confusion, special_kfunc_set is deleted
and btf_id_set_contains(&special_kfunc_set, ...) is removed.
The code is refactored with a new func check_special_kfunc(),
which contains all codes covered by original branch
  meta.btf == btf_vmlinux && btf_id_set_contains(&special_kfunc_set, meta.func_id)

There is no functionality change.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250523205321.1291431-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27 09:52:56 -07:00
T.J. Mercier
6eab7ac7c5 bpf: Add open coded dmabuf iterator
This open coded iterator allows for more flexibility when creating BPF
programs. It can support output in formats other than text. With an open
coded iterator, a single BPF program can traverse multiple kernel data
structures (now including dmabufs), allowing for more efficient analysis
of kernel data compared to multiple reads from procfs, sysfs, or
multiple traditional BPF iterator invocations.

Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250522230429.941193-4-tjmercier@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27 09:51:25 -07:00
T.J. Mercier
76ea955349 bpf: Add dmabuf iterator
The dmabuf iterator traverses the list of all DMA buffers.

DMA buffers are refcounted through their associated struct file. A
reference is taken on each buffer as the list is iterated to ensure each
buffer persists for the duration of the bpf program execution without
holding the list mutex.

Signed-off-by: T.J. Mercier <tjmercier@google.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250522230429.941193-3-tjmercier@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-27 09:51:25 -07:00
Maxim Levitsky
fb49f07ba1 locking/mutex: implement mutex_lock_killable_nest_lock
KVM's SEV intra-host migration code needs to lock all vCPUs
of the source and the target VM, before it proceeds with the migration.

The number of vCPUs that belong to each VM is not bounded by anything
except a self-imposed KVM limit of CONFIG_KVM_MAX_NR_VCPUS vCPUs which is
significantly larger than the depth of lockdep's lock stack.

Luckily, the locks in both of the cases mentioned above, are held under
the 'kvm->lock' of each VM, which means that we can use the little
known lockdep feature called a "nest_lock" to support this use case in
a cleaner way, compared to the way it's currently done.

Implement and expose 'mutex_lock_killable_nest_lock' for this
purpose.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27 12:16:41 -04:00
Maxim Levitsky
c5b6ababd2 locking/mutex: implement mutex_trylock_nested
Despite the fact that several lockdep-related checks are skipped when
calling trylock* versions of the locking primitives, for example
mutex_trylock, each time the mutex is acquired, a held_lock is still
placed onto the lockdep stack by __lock_acquire() which is called
regardless of whether the trylock* or regular locking API was used.

This means that if the caller successfully acquires more than
MAX_LOCK_DEPTH locks of the same class, even when using mutex_trylock,
lockdep will still complain that the maximum depth of the held lock stack
has been reached and disable itself.

For example, the following error currently occurs in the ARM version
of KVM, once the code tries to lock all vCPUs of a VM configured with more
than MAX_LOCK_DEPTH vCPUs, a situation that can easily happen on modern
systems, where having more than 48 CPUs is common, and it's also common to
run VMs that have vCPU counts approaching that number:

[  328.171264] BUG: MAX_LOCK_DEPTH too low!
[  328.175227] turning off the locking correctness validator.
[  328.180726] Please attach the output of /proc/lock_stat to the bug report
[  328.187531] depth: 48  max: 48!
[  328.190678] 48 locks held by qemu-kvm/11664:
[  328.194957]  #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[  328.204048]  #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.212521]  #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.220991]  #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.229463]  #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.237934]  #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.246405]  #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0

Luckily, in all instances that require locking all vCPUs, the
'kvm->lock' is taken a priori, and that fact makes it possible to use
the little known feature of lockdep, called a 'nest_lock', to avoid this
warning and subsequent lockdep self-disablement.

The action of 'nested lock' being provided to lockdep's lock_acquire(),
causes the lockdep to detect that the top of the held lock stack contains
a lock of the same class and then increment its reference counter instead
of pushing a new held_lock item onto that stack.

See __lock_acquire for more information.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27 12:16:40 -04:00
Linus Torvalds
b1456f6dc1 Merge tag 'timers-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
 "Updates for the time/timer core code:

   - Rework the initialization of the posix-timer kmem_cache and move
     the cache pointer into the timer_data structure to prevent false
     sharing

   - Switch the alarmtimer code to lock guards

   - Improve the CPU selection criteria in the per CPU validation of the
     clocksource watchdog to avoid arbitrary selections (or omissions)
     on systems with a small number of CPUs

   - The usual cleanups and improvements"

* tag 'timers-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/nohz: Remove unused tick_nohz_full_add_cpus_to()
  clocksource: Fix the CPUs' choice in the watchdog per CPU verification
  alarmtimer: Switch spin_{lock,unlock}_irqsave() to guards
  alarmtimer: Remove dead return value in clock2alarm()
  time/jiffies: Change register_refined_jiffies() to void __init
  timers: Remove unused __round_jiffies(_up)
  posix-timers: Initialize cache early and move pointer into __timer_data
2025-05-27 09:04:15 -07:00
Linus Torvalds
5e8bbb2caa Merge tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
 "Another set of timer API cleanups:

    - Convert init_timer*(), try_to_del_timer_sync() and
      destroy_timer_on_stack() over to the canonical timer_*()
      namespace convention.

  There is another large conversion pending, which has not been included
  because it would have caused a gazillion of merge conflicts in next.
  The conversion scripts will be run towards the end of the merge window
  and a pull request sent once all conflict dependencies have been
  merged"

* tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack()
  treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try()
  timers: Rename init_timers() as timers_init()
  timers: Rename NEXT_TIMER_MAX_DELTA as TIMER_NEXT_MAX_DELTA
  timers: Rename __init_timer_on_stack() as __timer_init_on_stack()
  timers: Rename __init_timer() as __timer_init()
  timers: Rename init_timer_on_stack_key() as timer_init_key_on_stack()
  timers: Rename init_timer_key() as timer_init_key()
2025-05-27 08:31:21 -07:00
Linus Torvalds
44ed0f35df Merge tag 'irq-msi-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
 "Updates for the MSI subsystem (core code and PCI):

   - Switch the MSI descriptor locking to lock guards

   - Replace a broken and naive implementation of PCI/MSI-X control word
     updates in the PCI/TPH driver with a properly serialized variant in
     the PCI/MSI core code.

   - Remove the MSI descriptor abuse in the SCCI/UFS/QCOM driver by
     replacing the direct access to the MSI descriptors with the proper
     API function calls. People will never understand that APIs exist
     for a reason...

   - Provide core infrastructre for the upcoming PCI endpoint library
     extensions. Currently limited to ARM GICv3+, but in theory
     extensible to other architectures.

   - Provide a MSI domain::teardown() callback, which allows drivers to
     undo the effects of the prepare() callback.

   - Move the MSI domain::prepare() callback invocation to domain
     creation time to avoid redundant (and in case of ARM/GIC-V3-ITS
     confusing) invocations on every allocation.

     In combination with the new teardown callback this removes some
     ugly hacks in the GIC-V3-ITS driver, which pretended to work around
     the short comings of the core code so far. With this update the
     code is correct by design and implementation.

   - Make the irqchip MSI library globally available, provide a MSI
     parent domain creation helper and convert a bunch of (PCI/)MSI
     drivers over to the modern MSI parent mechanism. This is the first
     step to get rid of at least one incarnation of the three PCI/MSI
     management schemes.

   - The usual small cleanups and improvements"

* tag 'irq-msi-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  PCI/MSI: Use bool for MSI enable state tracking
  PCI: tegra: Convert to MSI parent infrastructure
  PCI: xgene: Convert to MSI parent infrastructure
  PCI: apple: Convert to MSI parent infrastructure
  irqchip/msi-lib: Honour the MSI_FLAG_NO_AFFINITY flag
  irqchip/mvebu: Convert to msi_create_parent_irq_domain() helper
  irqchip/gic: Convert to msi_create_parent_irq_domain() helper
  genirq/msi: Add helper for creating MSI-parent irq domains
  irqchip: Make irq-msi-lib.h globally available
  irqchip/gic-v3-its: Use allocation size from the prepare call
  genirq/msi: Engage the .msi_teardown() callback on domain removal
  genirq/msi: Move prepare() call to per-device allocation
  irqchip/gic-v3-its: Implement .msi_teardown() callback
  genirq/msi: Add .msi_teardown() callback as the reverse of .msi_prepare()
  irqchip/gic-v3-its: Add support for device tree msi-map and msi-mask
  dt-bindings: PCI: pci-ep: Add support for iommu-map and msi-map
  irqchip/gic-v3-its: Set IRQ_DOMAIN_FLAG_MSI_IMMUTABLE for ITS
  irqdomain: Add IRQ_DOMAIN_FLAG_MSI_IMMUTABLE and irq_domain_is_msi_immutable()
  platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
  genirq/msi: Rename msi_[un]lock_descs()
  ...
2025-05-27 08:15:26 -07:00
Linus Torvalds
2bd1bea5fa Merge tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq cleanups from Thomas Gleixner:
 "A set of cleanups for the generic interrupt subsystem:

   - Consolidate on one set of functions for the interrupt domain code
     to get rid of pointlessly duplicated code with only marginal
     different semantics.

   - Update the documentation accordingly and consolidate the coding
     style of the irqdomain header"

* tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  irqdomain: Consolidate coding style
  irqdomain: Fix kernel-doc and add it to Documentation
  Documentation: irqdomain: Update it
  Documentation: irq-domain.rst: Simple improvements
  Documentation: irq/concepts: Minor improvements
  Documentation: irq/concepts: Add commas and reflow
  irqdomain: Improve kernel-docs of functions
  irqdomain: Make struct irq_domain_info variables const
  irqdomain: Use irq_domain_instantiate()'s return value as initializers
  irqdomain: Drop irq_linear_revmap()
  pinctrl: keembay: Switch to irq_find_mapping()
  irqchip/armada-370-xp: Switch to irq_find_mapping()
  gpu: ipu-v3: Switch to irq_find_mapping()
  gpio: idt3243x: Switch to irq_find_mapping()
  sh: Switch to irq_find_mapping()
  powerpc: Switch to irq_find_mapping()
  irqdomain: Drop irq_domain_add_*() functions
  powerpc: Switch irq_domain_add_nomap() to use fwnode
  thermal: Switch to irq_domain_create_linear()
  soc: Switch to irq_domain_create_*()
  ...
2025-05-27 08:07:32 -07:00
Linus Torvalds
c0f182c979 Merge tag 'irq-drivers-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq controller updates from Thomas Gleixner:
 "Update for interrupt chip drivers:

   - Convert the generic interrupt chip to lock guards to remove copy &
     pasta boilerplate code and gotos.

   - A new driver fot the interrupt controller in the EcoNet EN751221
     MIPS SoC.

   - Extend the SG2042-MSI driver to support the new SG2044 SoC

   - Updates and cleanups for the (ancient) VT8500 driver

   - Improve the scalability of the ARM GICV4.1 ITS driver by utilizing
     node local copies a VM's interrupt translation table when possible.
     This results in a 12% reduction of VM IPI latency in certain
     workloads.

   - The usual cleanups and improvements all over the place"

* tag 'irq-drivers-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  irqchip/irq-pruss-intc: Simplify chained interrupt handler setup
  irqchip/gic-v4.1: Use local 4_1 ITS to generate VSGI
  irqchip/econet-en751221: Switch to of_fwnode_handle()
  irqchip/irq-vt8500: Switch to irq_domain_create_*()
  irqchip/econet-en751221: Switch to irq_domain_create_linear()
  irqchip/irq-vt8500: Use fewer global variables and add error handling
  irqchip/irq-vt8500: Use a dedicated chained handler function
  irqchip/irq-vt8500: Don't require 8 interrupts from a chained controller
  irqchip/irq-vt8500: Drop redundant copy of the device node pointer
  irqchip/irq-vt8500: Split up ack/mask functions
  irqchip/sg2042-msi: Fix wrong type cast in sg2044_msi_irq_ack()
  irqchip/sg2042-msi: Add the Sophgo SG2044 MSI interrupt controller
  irqchip/sg2042-msi: Introduce configurable chipinfo for SG2042
  irqchip/sg2042-msi: Rename functions and data structures to be SG2042 agnostic
  dt-bindings: interrupt-controller: Add Sophgo SG2044 MSI controller
  genirq/generic-chip: Fix incorrect lock guard conversions
  genirq/generic-chip: Remove unused lock wrappers
  irqchip: Convert generic irqchip locking to guards
  gpio: mvebu: Convert generic irqchip locking to guard()
  ARM: orion/gpio:: Convert generic irqchip locking to guard()
  ...
2025-05-27 08:00:46 -07:00
Linus Torvalds
60c1d948f7 Merge tag 'irq-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core updates from Thomas Gleixner:
 "Updates for the generic interrupt subsystem core code:

   - Address a long standing subtle problem in the CPU hotplug code for
     affinity-managed interrupts.

     Affinity-managed interrupts are shut down by the core code when the
     last CPU in the affinity set goes offline and started up again when
     the first CPU in the affinity set becomes online again.

     This unfortunately does not take into account whether an interrupt
     has been disabled before the last CPU goes offline and starts up
     the interrupt unconditionally when the first CPU becomes online
     again.

     That's obviously not what drivers expect.

     Address this by preserving the disabled state for affinity-managed
     interrupts accross these CPU hotplug operations. All non-managed
     interrupts are not affected by this because startup/shutdown is
     coupled to request/free_irq() which obviously has to reset state.

   - Support three-cell scheme interrupts to allow GPIO drivers to
     specify interrupts from an already existing scheme

   - Switch the interrupt subsystem core to lock guards. This gets rid
     of quite some copy & pasta boilerplate code all over the place.

   - The usual small cleanups and improvements all over the place"

* tag 'irq-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
  genirq/irqdesc: Remove double locking in hwirq_show()
  genirq: Retain disable depth for managed interrupts across CPU hotplug
  genirq: Bump the size of the local variable for sprintf()
  genirq/manage: Use the correct lock guard in irq_set_irq_wake()
  genirq: Consistently use '%u' format specifier for unsigned int variables
  genirq: Ensure flags in lock guard is consistently initialized
  genirq: Fix inverted condition in handle_nested_irq()
  genirq/cpuhotplug: Fix up lock guards conversion brainf..t
  genirq: Use scoped_guard() to shut clang up
  genirq: Remove unused remove_percpu_irq()
  genirq: Remove irq_[get|put]_desc*()
  genirq/manage: Rework irq_set_irqchip_state()
  genirq/manage: Rework irq_get_irqchip_state()
  genirq/manage: Rework teardown_percpu_nmi()
  genirq/manage: Rework prepare_percpu_nmi()
  genirq/manage: Rework disable_percpu_irq()
  genirq/manage: Rework irq_percpu_is_enabled()
  genirq/manage: Rework enable_percpu_irq()
  genirq/manage: Rework irq_set_parent()
  genirq/manage: Rework can_request_irq()
  ...
2025-05-27 07:46:58 -07:00
Linus Torvalds
0c1494015f Merge tag 'core-entry-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry code updates from Thomas Gleixner:
 "Updates for the generic and architecture entry code:

   - Move LoongArch and RISC-V ret_from_fork() implementations to C code
     so that syscall_exit_user_mode() can be inlined

   - Split the RISC-V ret_from_fork() implementation into return to user
     and return to kernel, which gives a measurable performance
     improvement

   - Inline syscall_exit_user_mode() which benefits all architectures by
     avoiding a function call and letting the compiler do better
     optimizations"

* tag 'core-entry-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  LoongArch: entry: Fix include order
  entry: Inline syscall_exit_to_user_mode()
  LoongArch: entry: Migrate ret_from_fork() to C
  riscv: entry: Split ret_from_fork() into user and kernel
  riscv: entry: Convert ret_from_fork() to C
2025-05-27 07:44:22 -07:00
Linus Torvalds
ddddf9d64f Merge tag 'perf-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
 "Core & generic-arch updates:

   - Add support for dynamic constraints and propagate it to the Intel
     driver (Kan Liang)

   - Fix & enhance driver-specific throttling support (Kan Liang)

   - Record sample last_period before updating on the x86 and PowerPC
     platforms (Mark Barnett)

   - Make perf_pmu_unregister() usable (Peter Zijlstra)

   - Unify perf_event_free_task() / perf_event_exit_task_context()
     (Peter Zijlstra)

   - Simplify perf_event_release_kernel() and perf_event_free_task()
     (Peter Zijlstra)

   - Allocate non-contiguous AUX pages by default (Yabin Cui)

  Uprobes updates:

   - Add support to emulate NOP instructions (Jiri Olsa)

   - selftests/bpf: Add 5-byte NOP uprobe trigger benchmark (Jiri Olsa)

  x86 Intel PMU enhancements:

   - Support Intel Auto Counter Reload [ACR] (Kan Liang)

   - Add PMU support for Clearwater Forest (Dapeng Mi)

   - Arch-PEBS preparatory changes: (Dapeng Mi)
       - Parse CPUID archPerfmonExt leaves for non-hybrid CPUs
       - Decouple BTS initialization from PEBS initialization
       - Introduce pairs of PEBS static calls

  x86 AMD PMU enhancements:

   - Use hrtimer for handling overflows in the AMD uncore driver
     (Sandipan Das)

   - Prevent UMC counters from saturating (Sandipan Das)

  Fixes and cleanups:

   - Fix put_ctx() ordering (Frederic Weisbecker)

   - Fix irq work dereferencing garbage (Frederic Weisbecker)

   - Misc fixes and cleanups (Changbin Du, Frederic Weisbecker, Ian
     Rogers, Ingo Molnar, Kan Liang, Peter Zijlstra, Qing Wang, Sandipan
     Das, Thorsten Blum)"

* tag 'perf-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  perf/headers: Clean up <linux/perf_event.h> a bit
  perf/uapi: Clean up <uapi/linux/perf_event.h> a bit
  perf/uapi: Fix PERF_RECORD_SAMPLE comments in <uapi/linux/perf_event.h>
  mips/perf: Remove driver-specific throttle support
  xtensa/perf: Remove driver-specific throttle support
  sparc/perf: Remove driver-specific throttle support
  loongarch/perf: Remove driver-specific throttle support
  csky/perf: Remove driver-specific throttle support
  arc/perf: Remove driver-specific throttle support
  alpha/perf: Remove driver-specific throttle support
  perf/apple_m1: Remove driver-specific throttle support
  perf/arm: Remove driver-specific throttle support
  s390/perf: Remove driver-specific throttle support
  powerpc/perf: Remove driver-specific throttle support
  perf/x86/zhaoxin: Remove driver-specific throttle support
  perf/x86/amd: Remove driver-specific throttle support
  perf/x86/intel: Remove driver-specific throttle support
  perf: Only dump the throttle log for the leader
  perf: Fix the throttle logic for a group
  perf/core: Add the is_event_in_freq_mode() helper to simplify the code
  ...
2025-05-26 15:40:23 -07:00
Linus Torvalds
eaed94d1f6 Merge tag 'sched-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "Core & fair scheduler changes:

   - Tweak wait_task_inactive() to force dequeue sched_delayed tasks
     (John Stultz)

   - Adhere to place_entity() constraints (Peter Zijlstra)

   - Allow decaying util_est when util_avg > CPU capacity (Pierre
     Gondois)

   - Fix up wake_up_sync() vs DELAYED_DEQUEUE (Xuewen Yan)

  Energy management:

   - Introduce sched_update_asym_prefer_cpu() (K Prateek Nayak)

   - cpufreq/amd-pstate: Update asym_prefer_cpu when core rankings
     change (K Prateek Nayak)

   - Align uclamp and util_est and call before freq update (Xuewen Yan)

  CPU isolation:

   - Make use of more than one housekeeping CPU (Phil Auld)

  RT scheduler:

   - Fix race in push_rt_task() (Harshit Agarwal)

   - Add kernel cmdline option for rt_group_sched (Michal Koutný)

  Scheduler topology support:

   - Improve topology_span_sane speed (Steve Wahl)

  Scheduler debugging:

   - Move and extend the sched_process_exit() tracepoint (Andrii
     Nakryiko)

   - Add RT_GROUP WARN checks for non-root task_groups (Michal Koutný)

   - Fix trace_sched_switch(.prev_state) (Peter Zijlstra)

   - Untangle cond_resched() and live-patching (Peter Zijlstra)

  Fixes and cleanups:

   - Misc fixes and cleanups (K Prateek Nayak, Michal Koutný, Peter
     Zijlstra, Xuewen Yan)"

* tag 'sched-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  sched/uclamp: Align uclamp and util_est and call before freq update
  sched/util_est: Simplify condition for util_est_{en,de}queue()
  sched/fair: Fixup wake_up_sync() vs DELAYED_DEQUEUE
  sched,livepatch: Untangle cond_resched() and live-patching
  sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks
  sched/fair: Adhere to place_entity() constraints
  sched/debug: Print the local group's asym_prefer_cpu
  cpufreq/amd-pstate: Update asym_prefer_cpu when core rankings change
  sched/topology: Introduce sched_update_asym_prefer_cpu()
  sched/fair: Use READ_ONCE() to read sg->asym_prefer_cpu
  sched/isolation: Make use of more than one housekeeping cpu
  sched/rt: Fix race in push_rt_task
  sched: Add annotations to RT_GROUP_SCHED fields
  sched: Add RT_GROUP WARN checks for non-root task_groups
  sched: Do not construct nor expose RT_GROUP_SCHED structures if disabled
  sched: Bypass bandwitdh checks with runtime disabled RT_GROUP_SCHED
  sched: Skip non-root task_groups with disabled RT_GROUP_SCHED
  sched: Add commadline option for RT_GROUP_SCHED toggling
  sched: Always initialize rt_rq's task_group
  sched: Remove unneeed macro wrap
  ...
2025-05-26 15:19:58 -07:00
Linus Torvalds
b3570b00dc Merge tag 'locking-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "Futexes:

   - Add support for task local hash maps (Sebastian Andrzej Siewior,
     Peter Zijlstra)

   - Implement the FUTEX2_NUMA ABI, which feature extends the futex
     interface to be NUMA-aware. On NUMA-aware futexes a second u32 word
     containing the NUMA node is added to after the u32 futex value word
     (Peter Zijlstra)

   - Implement the FUTEX2_MPOL ABI, which feature extends the futex
     interface to be mempolicy-aware as well, to further refine futex
     node mappings and lookups (Peter Zijlstra)

  Locking primitives:

   - Misc cleanups (Andy Shevchenko, Borislav Petkov, Colin Ian King,
     Ingo Molnar, Nam Cao, Peter Zijlstra)

  Lockdep:

   - Prevent abuse of lockdep subclasses (Waiman Long)

   - Add number of dynamic keys to /proc/lockdep_stats (Waiman Long)

  Plus misc cleanups and fixes"

* tag 'locking-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  selftests/futex: Fix spelling mistake "unitiliazed" -> "uninitialized"
  futex: Correct the kernedoc return value for futex_wait_setup().
  tools headers: Synchronize prctl.h ABI header
  futex: Use RCU_INIT_POINTER() in futex_mm_init().
  selftests/futex: Use TAP output in futex_numa_mpol
  selftests/futex: Use TAP output in futex_priv_hash
  futex: Fix kernel-doc comments
  futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()
  futex: Fix outdated comment in struct restart_block
  locking/lockdep: Add number of dynamic keys to /proc/lockdep_stats
  locking/lockdep: Prevent abuse of lockdep subclass
  locking/lockdep: Move hlock_equal() to the respective #ifdeffery
  futex,selftests: Add another FUTEX2_NUMA selftest
  selftests/futex: Add futex_numa_mpol
  selftests/futex: Add futex_priv_hash
  selftests/futex: Build without headers nonsense
  tools/perf: Allow to select the number of hash buckets
  tools headers: Synchronize prctl.h ABI header
  futex: Implement FUTEX2_MPOL
  futex: Implement FUTEX2_NUMA
  ...
2025-05-26 14:42:07 -07:00
Linus Torvalds
07046958f6 Merge tag 'next.2025.05.17a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Joel Fernandes:
 - Removed swake_up_one_online() workaround
 - Reverted an incorrect rcuog wake-up fix from offline softirq
 - Rust RCU Guard methods marked as inline
 - Updated MAINTAINERS with Joel’s and Zqiang's new email address
 - Replaced magic constant in rcu_seq_done_exact() with named constant
 - Added warning mechanism to validate rcu_seq_done_exact()
 - Switched SRCU polling API to use rcu_seq_done_exact()
 - Commented on redundant delta check in rcu_seq_done_exact()
 - Made ->gpwrap tests in rcutorture more frequent
 - Fixed reuse of ARM64 images in rcutorture
 - rcutorture improved to check Kconfig and reader conflict handling
 - Extracted logic from rcu_torture_one_read() for clarity
 - Updated LWN RCU API documentation links
 - Enabled --do-rt in torture.sh for CONFIG_PREEMPT_RT
 - Added tests for SRCU up/down reader primitives
 - Added comments and delays checks in rcutorture
 - Deprecated srcu_read_lock_lite() and srcu_read_unlock_lite() via checkpatch
 - Added --do-normal and --do-no-normal to torture.sh
 - Added RCU Rust binding tests to torture.sh
 - Reduced CPU overcommit and removed MAXSMP/CPUMASK_OFFSTACK in TREE01
 - Replaced kmalloc() with kcalloc() in rcuscale
 - Refined listRCU example code for stale data elimination
 - Fixed hardirq count bug for x86 in cpu_stall_cputime
 - Added safety checks in rcu/nocb for offloaded rdp access
 - Other miscellaneous changes

* tag 'next.2025.05.17a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (27 commits)
  rcutorture: Fix issue with re-using old images on ARM64
  rcutorture: Remove MAXSMP and CPUMASK_OFFSTACK from TREE01
  rcutorture: Reduce TREE01 CPU overcommit
  torture: Check for "Call trace:" as well as "Call Trace:"
  rcutorture: Perform more frequent testing of ->gpwrap
  torture: Add testing of RCU's Rust bindings to torture.sh
  torture: Add --do-{,no-}normal to torture.sh
  checkpatch: Deprecate srcu_read_lock_lite() and srcu_read_unlock_lite()
  rcutorture: Comment invocations of tick_dep_set_task()
  rcu/nocb: Add Safe checks for access offloaded rdp
  rcuscale: using kcalloc() to relpace kmalloc()
  doc/RCU/listRCU: refine example code for eliminating stale data
  doc: Update LWN RCU API links in whatisRCU.rst
  Revert "rcu/nocb: Fix rcuog wake-up from offline softirq"
  rust: sync: rcu: Mark Guard methods as inline
  rcu/cpu_stall_cputime: fix the hardirq count for x86 architecture
  rcu: Remove swake_up_one_online() bandaid
  MAINTAINERS: Update Zqiang's email address
  rcutorture: Make torture.sh --do-rt use CONFIG_PREEMPT_RT
  srcu: Use rcu_seq_done_exact() for polling API
  ...
2025-05-26 14:20:50 -07:00
Paolo Bonzini
85502b2214 Merge tag 'loongarch-kvm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.16

1. Don't flush tlb if HW PTW supported.
2. Add LoongArch KVM selftests support.
2025-05-26 16:12:13 -04:00
Rafael J. Wysocki
76524ffd10 Merge branches 'pm-runtime' and 'pm-sleep'
Merge updates related to system sleep handling and runtime PM for 6.16-rc1:

 - Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
   Kalla).

 - Move debug runtime PM attributes to runtime_attrs[] (Rafael Wysocki).

 - Add new devm_ functions for enabling runtime PM and runtime PM
   reference counting (Bence Csókás).

 - Remove size arguments from strscpy() calls in the hibernation core
   code (Thorsten Blum).

 - Adjust the handling of devices with asynchronous suspend enabled
   during system suspend and resume to start resuming them immediately
   after resuming their parents and to start suspending such a device
   immediately after suspending its first child (Rafael Wysocki).

 - Adjust messages printed during tasks freezing to avoid using
   pr_cont() (Andrew Sayers, Paul Menzel).

 - Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
   Zhang).

 - Add missing wakeup source attribute relax_count to sysfs and
   remove the space character at the end ofi the string produced by
   pm_show_wakelocks() (Zijun Hu).

 - Add configurable pm_test delay for hibernation (Zihuan Zhang).

 - Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
   cypd4226 device on Tegra boards from suspending prematurely (Jon
   Hunter).

 - Unbreak printing PM debug messages during hibernation and clean up
   some related code (Rafael Wysocki).

* pm-runtime:
  PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn()
  PM: sysfs: Move debug runtime PM attributes to runtime_attrs[]
  PM: runtime: Add new devm functions

* pm-sleep:
  PM: freezer: Rewrite restarting tasks log to remove stray *done.*
  PM: sleep: Introduce pm_sleep_transition_in_progress()
  PM: sleep: Introduce pm_suspend_in_progress()
  PM: sleep: Print PM debug messages during hibernation
  ucsi_ccg: Disable async suspend in ucsi_ccg_probe()
  PM: hibernate: add configurable delay for pm_test
  PM: wakeup: Delete space in the end of string shown by pm_show_wakelocks()
  PM: wakeup: Add missing wakeup source attribute relax_count
  PM: sleep: Remove unnecessary !!
  PM: sleep: Use two lines for "Restarting..." / "done" messages
  PM: sleep: Make suspend of devices more asynchronous
  PM: sleep: Suspend async parents after suspending children
  PM: sleep: Resume children after resuming the parent
  PM: hibernate: Remove size arguments when calling strscpy()
2025-05-26 21:21:58 +02:00