Commit Graph

43265 Commits

Author SHA1 Message Date
Yonghong Song
01cc55af93 bpf: Add bpf_this_cpu_ptr/bpf_per_cpu_ptr support for allocated percpu obj
The bpf helpers bpf_this_cpu_ptr() and bpf_per_cpu_ptr() are re-purposed
for allocated percpu objects. For an allocated percpu obj,
the reg type is 'PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU'.

The return type for these two re-purposed helpera is
'PTR_TO_MEM | MEM_RCU | MEM_ALLOC'.
The MEM_ALLOC allows that the per-cpu data can be read and written.

Since the memory allocator bpf_mem_alloc() returns
a ptr to a percpu ptr for percpu data, the first argument
of bpf_this_cpu_ptr() and bpf_per_cpu_ptr() is patched
with a dereference before passing to the helper func.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152749.1997202-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:17 -07:00
Yonghong Song
36d8bdf75a bpf: Add alloc/xchg/direct_access support for local percpu kptr
Add two new kfunc's, bpf_percpu_obj_new_impl() and
bpf_percpu_obj_drop_impl(), to allocate a percpu obj.
Two functions are very similar to bpf_obj_new_impl()
and bpf_obj_drop_impl(). The major difference is related
to percpu handling.

    bpf_rcu_read_lock()
    struct val_t __percpu_kptr *v = map_val->percpu_data;
    ...
    bpf_rcu_read_unlock()

For a percpu data map_val like above 'v', the reg->type
is set as
	PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU
if inside rcu critical section.

MEM_RCU marking here is similar to NON_OWN_REF as 'v'
is not a owning reference. But NON_OWN_REF is
trusted and typically inside the spinlock while
MEM_RCU is under rcu read lock. RCU is preferred here
since percpu data structures mean potential concurrent
access into its contents.

Also, bpf_percpu_obj_new_impl() is restricted such that
no pointers or special fields are allowed. Therefore,
the bpf_list_head and bpf_rb_root will not be supported
in this patch set to avoid potential memory leak issue
due to racing between bpf_obj_free_fields() and another
bpf_kptr_xchg() moving an allocated object to
bpf_list_head and bpf_rb_root.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152744.1996739-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:17 -07:00
Yonghong Song
55db92f42f bpf: Add BPF_KPTR_PERCPU as a field type
BPF_KPTR_PERCPU represents a percpu field type like below

  struct val_t {
    ... fields ...
  };
  struct t {
    ...
    struct val_t __percpu_kptr *percpu_data_ptr;
    ...
  };

where
  #define __percpu_kptr __attribute__((btf_type_tag("percpu_kptr")))

While BPF_KPTR_REF points to a trusted kernel object or a trusted
local object, BPF_KPTR_PERCPU points to a trusted local
percpu object.

This patch added basic support for BPF_KPTR_PERCPU
related to percpu_kptr field parsing, recording and free operations.
BPF_KPTR_PERCPU also supports the same map types
as BPF_KPTR_REF does.

Note that unlike a local kptr, it is possible that
a BPF_KTPR_PERCPU struct may not contain any
special fields like other kptr, bpf_spin_lock, bpf_list_head, etc.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152739.1996391-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:17 -07:00
Yonghong Song
41a5db8d81 bpf: Add support for non-fix-size percpu mem allocation
This is needed for later percpu mem allocation when the
allocation is done by bpf program. For such cases, a global
bpf_global_percpu_ma is added where a flexible allocation
size is needed.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230827152734.1995725-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08 08:42:17 -07:00
Zhenhua Huang
f875db4f20 Revert "dma-contiguous: check for memory region overlap"
This reverts commit 3fa6456ebe.

The Commit broke the CMA region creation through DT on arm64,
as showed below logs with "memblock=debug":
[    0.000000] memblock_phys_alloc_range: 41943040 bytes align=0x200000
from=0x0000000000000000 max_addr=0x00000000ffffffff
early_init_dt_alloc_reserved_memory_arch+0x34/0xa0
[    0.000000] memblock_reserve: [0x00000000fd600000-0x00000000ffdfffff]
memblock_alloc_range_nid+0xc0/0x19c
[    0.000000] Reserved memory: overlap with other memblock reserved region

>From call flow, region we defined in DT was always reserved before entering
into rmem_cma_setup. Also, rmem_cma_setup has one routine cma_init_reserved_mem
to ensure the region was reserved. Checking the region not reserved here seems
not correct.

early_init_fdt_scan_reserved_mem:
    fdt_scan_reserved_mem
        __reserved_mem_reserve_reg
		early_init_dt_reserve_memory
			memblock_reserve(using “reg” prop case)
        fdt_init_reserved_mem
		__reserved_mem_alloc_size
			*early_init_dt_alloc_reserved_memory_arch*
				memblock_reserve(dynamic alloc case)
        __reserved_mem_init_node
		rmem_cma_setup(region overlap check here should always fail)

Example DT can be used to reproduce issue:

    dump_mem: mem_dump_region {
            compatible = "shared-dma-pool";
            alloc-ranges = <0x0 0x00000000 0x0 0xffffffff>;
            reusable;
            size = <0 0x2800000>;
    };

Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
2023-09-08 05:58:32 -03:00
Linus Torvalds
73be7fb14e Merge tag 'net-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking updates from Jakub Kicinski:
 "Including fixes from netfilter and bpf.

  Current release - regressions:

   - eth: stmmac: fix failure to probe without MAC interface specified

  Current release - new code bugs:

   - docs: netlink: fix missing classic_netlink doc reference

  Previous releases - regressions:

   - deal with integer overflows in kmalloc_reserve()

   - use sk_forward_alloc_get() in sk_get_meminfo()

   - bpf_sk_storage: fix the missing uncharge in sk_omem_alloc

   - fib: avoid warn splat in flow dissector after packet mangling

   - skb_segment: call zero copy functions before using skbuff frags

   - eth: sfc: check for zero length in EF10 RX prefix

  Previous releases - always broken:

   - af_unix: fix msg_controllen test in scm_pidfd_recv() for
     MSG_CMSG_COMPAT

   - xsk: fix xsk_build_skb() dereferencing possible ERR_PTR()

   - netfilter:
      - nft_exthdr: fix non-linear header modification
      - xt_u32, xt_sctp: validate user space input
      - nftables: exthdr: fix 4-byte stack OOB write
      - nfnetlink_osf: avoid OOB read
      - one more fix for the garbage collection work from last release

   - igmp: limit igmpv3_newpack() packet size to IP_MAX_MTU

   - bpf, sockmap: fix preempt_rt splat when using raw_spin_lock_t

   - handshake: fix null-deref in handshake_nl_done_doit()

   - ip: ignore dst hint for multipath routes to ensure packets are
     hashed across the nexthops

   - phy: micrel:
      - correct bit assignments for cable test errata
      - disable EEE according to the KSZ9477 errata

  Misc:

   - docs/bpf: document compile-once-run-everywhere (CO-RE) relocations

   - Revert "net: macsec: preserve ingress frame ordering", it appears
     to have been developed against an older kernel, problem doesn't
     exist upstream"

* tag 'net-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
  net: enetc: distinguish error from valid pointers in enetc_fixup_clear_rss_rfs()
  Revert "net: team: do not use dynamic lockdep key"
  net: hns3: remove GSO partial feature bit
  net: hns3: fix the port information display when sfp is absent
  net: hns3: fix invalid mutex between tc qdisc and dcb ets command issue
  net: hns3: fix debugfs concurrency issue between kfree buffer and read
  net: hns3: fix byte order conversion issue in hclge_dbg_fd_tcam_read()
  net: hns3: Support query tx timeout threshold by debugfs
  net: hns3: fix tx timeout issue
  net: phy: Provide Module 4 KSZ9477 errata (DS80000754C)
  netfilter: nf_tables: Unbreak audit log reset
  netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c
  netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction
  netfilter: nf_tables: uapi: Describe NFTA_RULE_CHAIN_ID
  netfilter: nfnetlink_osf: avoid OOB read
  netfilter: nftables: exthdr: fix 4-byte stack OOB write
  selftests/bpf: Check bpf_sk_storage has uncharged sk_omem_alloc
  bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
  bpf: bpf_sk_storage: Fix invalid wait context lockdep report
  s390/bpf: Pass through tail call counter in trampolines
  ...
2023-09-07 18:33:07 -07:00
Zheng Yejian
f6bd2c9248 ring-buffer: Avoid softlockup in ring_buffer_resize()
When user resize all trace ring buffer through file 'buffer_size_kb',
then in ring_buffer_resize(), kernel allocates buffer pages for each
cpu in a loop.

If the kernel preemption model is PREEMPT_NONE and there are many cpus
and there are many buffer pages to be allocated, it may not give up cpu
for a long time and finally cause a softlockup.

To avoid it, call cond_resched() after each cpu buffer allocation.

Link: https://lore.kernel.org/linux-trace-kernel/20230906081930.3939106-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07 16:38:54 -04:00
Steven Rostedt (Google)
e5c624f027 tracing: Have event inject files inc the trace array ref count
The event inject files add events for a specific trace array. For an
instance, if the file is opened and the instance is deleted, reading or
writing to the file will cause a use after free.

Up the ref count of the trace_array when a event inject file is opened.

Link: https://lkml.kernel.org/r/20230907024804.292337868@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes: 6c3edaf9fd ("tracing: Introduce trace event injection")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07 16:38:54 -04:00
Steven Rostedt (Google)
7e2cfbd2d3 tracing: Have option files inc the trace array ref count
The option files update the options for a given trace array. For an
instance, if the file is opened and the instance is deleted, reading or
writing to the file will cause a use after free.

Up the ref count of the trace_array when an option file is opened.

Link: https://lkml.kernel.org/r/20230907024804.086679464@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes: 8530dec63e ("tracing: Add tracing_check_open_get_tr()")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07 16:38:54 -04:00
Steven Rostedt (Google)
9b37febc57 tracing: Have current_trace inc the trace array ref count
The current_trace updates the trace array tracer. For an instance, if the
file is opened and the instance is deleted, reading or writing to the file
will cause a use after free.

Up the ref count of the trace array when current_trace is opened.

Link: https://lkml.kernel.org/r/20230907024803.877687227@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes: 8530dec63e ("tracing: Add tracing_check_open_get_tr()")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07 16:38:53 -04:00
Steven Rostedt (Google)
7d660c9b2b tracing: Have tracing_max_latency inc the trace array ref count
The tracing_max_latency file points to the trace_array max_latency field.
For an instance, if the file is opened and the instance is deleted,
reading or writing to the file will cause a use after free.

Up the ref count of the trace_array when tracing_max_latency is opened.

Link: https://lkml.kernel.org/r/20230907024803.666889383@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Fixes: 8530dec63e ("tracing: Add tracing_check_open_get_tr()")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07 16:38:53 -04:00
Steven Rostedt (Google)
f5ca233e2e tracing: Increase trace array ref count on enable and filter files
When the trace event enable and filter files are opened, increment the
trace array ref counter, otherwise they can be accessed when the trace
array is being deleted. The ref counter keeps the trace array from being
deleted while those files are opened.

Link: https://lkml.kernel.org/r/20230907024803.456187066@goodmis.org
Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 8530dec63e ("tracing: Add tracing_check_open_get_tr()")
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07 16:05:46 -04:00
Christoph Hellwig
4952801fc6 Revert "printk: export symbols for debug modules"
This reverts commit 3e00123a13.

No, we never export random symbols for out of tree modules.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230905081902.321778-1-hch@lst.de
2023-09-07 14:19:42 +02:00
Puranjay Mohan
20e490adea bpf: make bpf_prog_pack allocator portable
The bpf_prog_pack allocator currently uses module_alloc() and
module_memfree() to allocate and free memory. This is not portable
because different architectures use different methods for allocating
memory for BPF programs. Like ARM64 and riscv use vmalloc()/vfree().

Use bpf_jit_alloc_exec() and bpf_jit_free_exec() for memory management
in bpf_prog_pack allocator. Other architectures can override these with
their implementation and will be able to use bpf_prog_pack directly.

On architectures that don't override bpf_jit_alloc/free_exec() this is
basically a NOP.

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20230831131229.497941-2-puranjay12@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-09-06 06:26:05 -07:00
Martin KaFai Lau
55d49f750b bpf: bpf_sk_storage: Fix the missing uncharge in sk_omem_alloc
The commit c83597fa5d ("bpf: Refactor some inode/task/sk storage functions
for reuse"), refactored the bpf_{sk,task,inode}_storage_free() into
bpf_local_storage_unlink_nolock() which then later renamed to
bpf_local_storage_destroy(). The commit accidentally passed the
"bool uncharge_mem = false" argument to bpf_selem_unlink_storage_nolock()
which then stopped the uncharge from happening to the sk->sk_omem_alloc.

This missing uncharge only happens when the sk is going away (during
__sk_destruct).

This patch fixes it by always passing "uncharge_mem = true". It is a
noop to the task/inode/cgroup storage because they do not have the
map_local_storage_(un)charge enabled in the map_ops. A followup patch
will be done in bpf-next to remove the uncharge_mem argument.

A selftest is added in the next patch.

Fixes: c83597fa5d ("bpf: Refactor some inode/task/sk storage functions for reuse")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230901231129.578493-3-martin.lau@linux.dev
2023-09-06 11:08:14 +02:00
Martin KaFai Lau
a96a44aba5 bpf: bpf_sk_storage: Fix invalid wait context lockdep report
'./test_progs -t test_local_storage' reported a splat:

[   27.137569] =============================
[   27.138122] [ BUG: Invalid wait context ]
[   27.138650] 6.5.0-03980-gd11ae1b16b0a #247 Tainted: G           O
[   27.139542] -----------------------------
[   27.140106] test_progs/1729 is trying to lock:
[   27.140713] ffff8883ef047b88 (stock_lock){-.-.}-{3:3}, at: local_lock_acquire+0x9/0x130
[   27.141834] other info that might help us debug this:
[   27.142437] context-{5:5}
[   27.142856] 2 locks held by test_progs/1729:
[   27.143352]  #0: ffffffff84bcd9c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x40
[   27.144492]  #1: ffff888107deb2c0 (&storage->lock){..-.}-{2:2}, at: bpf_local_storage_update+0x39e/0x8e0
[   27.145855] stack backtrace:
[   27.146274] CPU: 0 PID: 1729 Comm: test_progs Tainted: G           O       6.5.0-03980-gd11ae1b16b0a #247
[   27.147550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[   27.149127] Call Trace:
[   27.149490]  <TASK>
[   27.149867]  dump_stack_lvl+0x130/0x1d0
[   27.152609]  dump_stack+0x14/0x20
[   27.153131]  __lock_acquire+0x1657/0x2220
[   27.153677]  lock_acquire+0x1b8/0x510
[   27.157908]  local_lock_acquire+0x29/0x130
[   27.159048]  obj_cgroup_charge+0xf4/0x3c0
[   27.160794]  slab_pre_alloc_hook+0x28e/0x2b0
[   27.161931]  __kmem_cache_alloc_node+0x51/0x210
[   27.163557]  __kmalloc+0xaa/0x210
[   27.164593]  bpf_map_kzalloc+0xbc/0x170
[   27.165147]  bpf_selem_alloc+0x130/0x510
[   27.166295]  bpf_local_storage_update+0x5aa/0x8e0
[   27.167042]  bpf_fd_sk_storage_update_elem+0xdb/0x1a0
[   27.169199]  bpf_map_update_value+0x415/0x4f0
[   27.169871]  map_update_elem+0x413/0x550
[   27.170330]  __sys_bpf+0x5e9/0x640
[   27.174065]  __x64_sys_bpf+0x80/0x90
[   27.174568]  do_syscall_64+0x48/0xa0
[   27.175201]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[   27.175932] RIP: 0033:0x7effb40e41ad
[   27.176357] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d8
[   27.179028] RSP: 002b:00007ffe64c21fc8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[   27.180088] RAX: ffffffffffffffda RBX: 00007ffe64c22768 RCX: 00007effb40e41ad
[   27.181082] RDX: 0000000000000020 RSI: 00007ffe64c22008 RDI: 0000000000000002
[   27.182030] RBP: 00007ffe64c21ff0 R08: 0000000000000000 R09: 00007ffe64c22788
[   27.183038] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000000
[   27.184006] R13: 00007ffe64c22788 R14: 00007effb42a1000 R15: 0000000000000000
[   27.184958]  </TASK>

It complains about acquiring a local_lock while holding a raw_spin_lock.
It means it should not allocate memory while holding a raw_spin_lock
since it is not safe for RT.

raw_spin_lock is needed because bpf_local_storage supports tracing
context. In particular for task local storage, it is easy to
get a "current" task PTR_TO_BTF_ID in tracing bpf prog.
However, task (and cgroup) local storage has already been moved to
bpf mem allocator which can be used after raw_spin_lock.

The splat is for the sk storage. For sk (and inode) storage,
it has not been moved to bpf mem allocator. Using raw_spin_lock or not,
kzalloc(GFP_ATOMIC) could theoretically be unsafe in tracing context.
However, the local storage helper requires a verifier accepted
sk pointer (PTR_TO_BTF_ID), it is hypothetical if that (mean running
a bpf prog in a kzalloc unsafe context and also able to hold a verifier
accepted sk pointer) could happen.

This patch avoids kzalloc after raw_spin_lock to silent the splat.
There is an existing kzalloc before the raw_spin_lock. At that point,
a kzalloc is very likely required because a lookup has just been done
before. Thus, this patch always does the kzalloc before acquiring
the raw_spin_lock and remove the later kzalloc usage after the
raw_spin_lock. After this change, it will have a charge and then
uncharge during the syscall bpf_map_update_elem() code path.
This patch opts for simplicity and not continue the old
optimization to save one charge and uncharge.

This issue is dated back to the very first commit of bpf_sk_storage
which had been refactored multiple times to create task, inode, and
cgroup storage. This patch uses a Fixes tag with a more recent
commit that should be easier to do backport.

Fixes: b00fa38a9c ("bpf: Enable non-atomic allocations in local storage")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230901231129.578493-2-martin.lau@linux.dev
2023-09-06 11:07:54 +02:00
Sebastian Andrzej Siewior
6764e767f4 bpf: Assign bpf_tramp_run_ctx::saved_run_ctx before recursion check.
__bpf_prog_enter_recur() assigns bpf_tramp_run_ctx::saved_run_ctx before
performing the recursion check which means in case of a recursion
__bpf_prog_exit_recur() uses the previously set bpf_tramp_run_ctx::saved_run_ctx
value.

__bpf_prog_enter_sleepable_recur() assigns bpf_tramp_run_ctx::saved_run_ctx
after the recursion check which means in case of a recursion
__bpf_prog_exit_sleepable_recur() uses an uninitialized value. This does not
look right. If I read the entry trampoline code right, then bpf_tramp_run_ctx
isn't initialized upfront.

Align __bpf_prog_enter_sleepable_recur() with __bpf_prog_enter_recur() and
set bpf_tramp_run_ctx::saved_run_ctx before the recursion check is made.
Remove the assignment of saved_run_ctx in kern_sys_bpf() since it happens
a few cycles later.

Fixes: e384c7b7b4 ("bpf, x86: Create bpf_tramp_run_ctx on the caller thread's stack")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230830080405.251926-3-bigeasy@linutronix.de
2023-09-06 10:44:28 +02:00
Sebastian Andrzej Siewior
7645629f7d bpf: Invoke __bpf_prog_exit_sleepable_recur() on recursion in kern_sys_bpf().
If __bpf_prog_enter_sleepable_recur() detects recursion then it returns
0 without undoing rcu_read_lock_trace(), migrate_disable() or
decrementing the recursion counter. This is fine in the JIT case because
the JIT code will jump in the 0 case to the end and invoke the matching
exit trampoline (__bpf_prog_exit_sleepable_recur()).

This is not the case in kern_sys_bpf() which returns directly to the
caller with an error code.

Add __bpf_prog_exit_sleepable_recur() as clean up in the recursion case.

Fixes: b1d18a7574 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230830080405.251926-2-bigeasy@linutronix.de
2023-09-06 10:39:31 +02:00
Linus Torvalds
61401a8724 Merge tag 'kbuild-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:

 - Enable -Wenum-conversion warning option

 - Refactor the rpm-pkg target

 - Fix scripts/setlocalversion to consider annotated tags for rt-kernel

 - Add a jump key feature for the search menu of 'make nconfig'

 - Support Qt6 for 'make xconfig'

 - Enable -Wformat-overflow, -Wformat-truncation, -Wstringop-overflow,
   and -Wrestrict warnings for W=1 builds

 - Replace <asm/export.h> with <linux/export.h> for alpha, ia64, and
   sparc

 - Support DEB_BUILD_OPTIONS=parallel=N for the debian source package

 - Refactor scripts/Makefile.modinst and fix some modules_sign issues

 - Add a new Kconfig env variable to warn symbols that are not defined
   anywhere

 - Show help messages of config fragments in 'make help'

* tag 'kbuild-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (62 commits)
  kconfig: fix possible buffer overflow
  kbuild: Show marked Kconfig fragments in "help"
  kconfig: add warn-unknown-symbols sanity check
  kbuild: dummy-tools: make MPROFILE_KERNEL checks work on BE
  Documentation/llvm: refresh docs
  modpost: Skip .llvm.call-graph-profile section check
  kbuild: support modules_sign for external modules as well
  kbuild: support 'make modules_sign' with CONFIG_MODULE_SIG_ALL=n
  kbuild: move more module installation code to scripts/Makefile.modinst
  kbuild: reduce the number of mkdir calls during modules_install
  kbuild: remove $(MODLIB)/source symlink
  kbuild: move depmod rule to scripts/Makefile.modinst
  kbuild: add modules_sign to no-{compiler,sync-config}-targets
  kbuild: do not run depmod for 'make modules_sign'
  kbuild: deb-pkg: support DEB_BUILD_OPTIONS=parallel=N in debian/rules
  alpha: remove <asm/export.h>
  alpha: replace #include <asm/export.h> with #include <linux/export.h>
  ia64: remove <asm/export.h>
  ia64: replace #include <asm/export.h> with #include <linux/export.h>
  sparc: remove <asm/export.h>
  ...
2023-09-05 11:01:47 -07:00
Linus Torvalds
3c31041e37 Merge tag 'printk-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:

 - Do not try to get the console lock when it is not need or useful in
   panic()

 - Replace the global console_suspended state by a per-console flag

 - Export symbols needed for dumping the raw printk buffer in panic()

 - Fix documentation of printf formats for integer types

 - Moved Sergey Senozhatsky to the reviewer role

 - Misc cleanups

* tag 'printk-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: export symbols for debug modules
  lib: test_scanf: Add explicit type cast to result initialization in test_number_prefix()
  printk: ringbuffer: Fix truncating buffer size min_t cast
  printk: Rename abandon_console_lock_in_panic() to other_cpu_in_panic()
  printk: Add per-console suspended state
  printk: Consolidate console deferred printing
  printk: Do not take console lock for console_flush_on_panic()
  printk: Keep non-panic-CPUs out of console lock
  printk: Reduce console_unblank() usage in unsafe scenarios
  kdb: Do not assume write() callback available
  docs: printk-formats: Treat char as always unsigned
  docs: printk-formats: Fix hex printing of signed values
  MAINTAINERS: adjust printk/vsprintf entries
2023-09-04 13:20:19 -07:00
Petr Mladek
f0f6923953 Merge branch 'rework/misc-cleanups' into for-linus 2023-09-04 11:37:37 +02:00
Kees Cook
feec5e1f74 kbuild: Show marked Kconfig fragments in "help"
Currently the Kconfig fragments in kernel/configs and arch/*/configs
that aren't used internally aren't discoverable through "make help",
which consists of hard-coded lists of config fragments. Instead, list
all the fragment targets that have a "# Help: " comment prefix so the
targets can be generated dynamically.

Add logic to the Makefile to search for and display the fragment and
comment. Add comments to fragments that are intended to be direct targets.

Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-09-04 02:04:20 +09:00
Linus Torvalds
b70100f2e6 Merge tag 'probes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes updates from Masami Hiramatsu:

 - kprobes: use struct_size() for variable size kretprobe_instance data
   structure.

 - eprobe: Simplify trace_eprobe list iteration.

 - probe events: Data structure field access support on BTF argument.

     - Update BTF argument support on the functions in the kernel
       loadable modules (only loaded modules are supported).

     - Move generic BTF access function (search function prototype and
       get function parameters) to a separated file.

     - Add a function to search a member of data structure in BTF.

     - Support accessing BTF data structure member from probe args by
       C-like arrow('->') and dot('.') operators. e.g.
          't sched_switch next=next->pid vruntime=next->se.vruntime'

     - Support accessing BTF data structure member from $retval. e.g.
          'f getname_flags%return +0($retval->name):string'

     - Add string type checking if BTF type info is available. This will
       reject if user specify ":string" type for non "char pointer"
       type.

     - Automatically assume the fprobe event as a function return event
       if $retval is used.

 - selftests/ftrace: Add BTF data field access test cases.

 - Documentation: Update fprobe event example with BTF data field.

* tag 'probes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  Documentation: tracing: Update fprobe event example with BTF field
  selftests/ftrace: Add BTF fields access testcases
  tracing/fprobe-event: Assume fprobe is a return event by $retval
  tracing/probes: Add string type check with BTF
  tracing/probes: Support BTF field access from $retval
  tracing/probes: Support BTF based data structure field access
  tracing/probes: Add a function to search a member of a struct/union
  tracing/probes: Move finding func-proto API and getting func-param API to trace_btf
  tracing/probes: Support BTF argument on module functions
  tracing/eprobe: Iterate trace_eprobe directly
  kernel: kprobes: Use struct_size()
2023-09-02 11:10:50 -07:00
Linus Torvalds
e021c5f1f6 Merge tag 'trace-v6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull more tracing updates from Steven Rostedt:
 "Tracing fixes and clean ups:

   - Replace strlcpy() with strscpy()

   - Initialize the pipe cpumask to zero on allocation

   - Use within_module() instead of open coding it

   - Remove extra space in hwlat_detectory/mode output

   - Use LIST_HEAD() instead of open coding it

   - A bunch of clean ups and fixes for the cpumask filter

   - Set local da_mon_##name to static

   - Fix race in snapshot buffer between cpu write and swap"

* tag 'trace-v6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/filters: Fix coding style issues
  tracing/filters: Change parse_pred() cpulist ternary into an if block
  tracing/filters: Fix double-free of struct filter_pred.mask
  tracing/filters: Fix error-handling of cpulist parsing buffer
  tracing: Zero the pipe cpumask on alloc to avoid spurious -EBUSY
  ftrace: Use LIST_HEAD to initialize clear_hash
  ftrace: Use within_module to check rec->ip within specified module.
  tracing: Replace strlcpy with strscpy in trace/events/task.h
  tracing: Fix race issue between cpu buffer write and swap
  tracing: Remove extra space at the end of hwlat_detector/mode
  rv: Set variable 'da_mon_##name' to static
2023-09-02 10:50:54 -07:00
Linus Torvalds
a6216978de Merge tag 'timers-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix false positive 'softirq work is pending' messages on -rt kernels,
  caused by a buggy factoring-out of existing code"

* tag 'timers-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/rcu: Fix false positive "softirq work is pending" messages
2023-09-02 09:01:48 -07:00
Linus Torvalds
23dfeae882 Merge tag 'smp-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Ingo Molnar:
 "Fix a CPU hotplug related deadlock between the task which initiates
  and controls a CPU hot-unplug operation vs. the CFS bandwidth timer"

* tag 'smp-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Prevent self deadlock on CPU hot-unplug
2023-09-02 08:58:49 -07:00
Linus Torvalds
c39cbc5b60 Merge tag 'sched-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Miscellaneous scheduler fixes: a reporting fix, a static symbol fix,
  and a kernel-doc fix"

* tag 'sched-urgent-2023-09-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Report correct state for TASK_IDLE | TASK_FREEZABLE
  sched/fair: Make update_entity_lag() static
  sched/core: Add kernel-doc for set_cpus_allowed_ptr()
2023-09-02 08:49:08 -07:00
Linus Torvalds
76be05d4fd cgroup: fix build when CGROUP_SCHED is not enabled
Sudip Mukherjee reports that the mips sb1250_swarm_defconfig build fails
with the current kernel.  It isn't actually MIPS-specific, it's just
that that defconfig does not have CGROUP_SCHED enabled like most configs
do, and as such shows this error:

  kernel/cgroup/cgroup.c: In function 'cgroup_local_stat_show':
  kernel/cgroup/cgroup.c:3699:15: error: implicit declaration of function 'cgroup_tryget_css'; did you mean 'cgroup_tryget'? [-Werror=implicit-function-declaration]
   3699 |         css = cgroup_tryget_css(cgrp, ss);
        |               ^~~~~~~~~~~~~~~~~
        |               cgroup_tryget
  kernel/cgroup/cgroup.c:3699:13: warning: assignment to 'struct cgroup_subsys_state *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
   3699 |         css = cgroup_tryget_css(cgrp, ss);
        |             ^

because cgroup_tryget_css() only exists when CGROUP_SCHED is enabled,
and the cgroup_local_stat_show() function should similarly be guarded by
that config option.

Move things around a bit to fix this all.

Fixes: d1d4ff5d11 ("cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED")
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-09-02 08:27:17 -07:00
Shrikanth Hegde
f8858d9606 sched/fair: Optimize should_we_balance() for large SMT systems
should_we_balance() is called in load_balance() to find out if the CPU that
is trying to do the load balance is the right one or not.

With commit:

  b1bfeab9b002("sched/fair: Consider the idle state of the whole core for load balance")

the code tries to find an idle core to do the load balancing
and falls back on an idle sibling CPU if there is no idle core.

However, on larger SMT systems, it could be needlessly iterating to find a
idle by scanning all the CPUs in an non-idle core. If the core is not idle,
and first SMT sibling which is idle has been found, then its not needed to
check other SMT siblings for idleness

Lets say in SMT4, Core0 has 0,2,4,6 and CPU0 is BUSY and rest are IDLE.
balancing domain is MC/DIE. CPU2 will be set as the first idle_smt and
same process would be repeated for CPU4 and CPU6 but this is unnecessary.
Since calling is_core_idle loops through all CPU's in the SMT mask, effect
is multiplied by weight of smt_mask. For example,when say 1 CPU is busy,
we would skip loop for 2 CPU's and skip iterating over 8CPU's. That
effect would be more in DIE/NUMA domain where there are more cores.

Testing and performance evaluation
==================================

The test has been done on this system which has 12 cores, i.e 24 small
cores with SMT=4:

  lscpu
  Architecture:            ppc64le
    Byte Order:            Little Endian
  CPU(s):                  96
    On-line CPU(s) list:   0-95
  Model name:              POWER10 (architected), altivec supported
    Thread(s) per core:    8

Used funclatency bcc tool to evaluate the time taken by should_we_balance(). For
base tip/sched/core the time taken is collected by making the
should_we_balance() noinline. time is in nanoseconds. The values are
collected by running the funclatency tracer for 60 seconds. values are
average of 3 such runs. This represents the expected reduced time with
patch.

tip/sched/core was at commit:

  2f88c8e802 ("sched/eevdf/doc: Modify the documented knob to base_slice_ns as well")

Results:

	------------------------------------------------------------------------------
	workload			   tip/sched/core	with_patch(%gain)
	------------------------------------------------------------------------------
	idle system				 809.3		 695.0(16.45)
	stress ng – 12 threads -l 100		1013.5		 893.1(13.49)
	stress ng – 24 threads -l 100		1073.5		 980.0(9.54)
	stress ng – 48 threads -l 100		 683.0		 641.0(6.55)
	stress ng – 96 threads -l 100		2421.0		2300(5.26)
	stress ng – 96 threads -l 15		 375.5		 377.5(-0.53)
	stress ng – 96 threads -l 25		 635.5		 637.5(-0.31)
	stress ng – 96 threads -l 35		 934.0		 891.0(4.83)

Ran schbench(old), hackbench and stress_ng  to evaluate the workload
performance between tip/sched/core and with patch.
No modification to tip/sched/core

TL;DR:

Good improvement is seen with schbench. when hackbench and stress_ng
runs for longer good improvement is seen.

	------------------------------------------------------------------------------
	schbench(old)		            tip		+patch(%gain)
	10 iterations			sched/core
	------------------------------------------------------------------------------
	1 Threads
	50.0th:		      		    8.00       9.00(-12.50)
	75.0th:   			    9.60       9.00(6.25)
	90.0th:   			   11.80      10.20(13.56)
	95.0th:   			   12.60      10.40(17.46)
	99.0th:   			   13.60      11.90(12.50)
	99.5th:   			   14.10      12.60(10.64)
	99.9th:   			   15.90      14.60(8.18)
	2 Threads
	50.0th:   			    9.90       9.20(7.07)
	75.0th:   			   12.60      10.10(19.84)
	90.0th:   			   15.50      12.00(22.58)
	95.0th:   			   17.70      14.00(20.90)
	99.0th:   			   21.20      16.90(20.28)
	99.5th:   			   22.60      17.50(22.57)
	99.9th:   			   30.40      19.40(36.18)
	4 Threads
	50.0th:   			   12.50      10.60(15.20)
	75.0th:   			   15.30      12.00(21.57)
	90.0th:   			   18.60      14.10(24.19)
	95.0th:   			   21.30      16.20(23.94)
	99.0th:   			   26.00      20.70(20.38)
	99.5th:   			   27.60      22.50(18.48)
	99.9th:   			   33.90      31.40(7.37)
	8 Threads
	50.0th:   			   16.30      14.30(12.27)
	75.0th:   			   20.20      17.40(13.86)
	90.0th:   			   24.50      21.90(10.61)
	95.0th:   			   27.30      24.70(9.52)
	99.0th:   			   35.00      31.20(10.86)
	99.5th:   			   46.40      33.30(28.23)
	99.9th:   			   89.30      57.50(35.61)
	16 Threads
	50.0th:   			   22.70      20.70(8.81)
	75.0th:   			   30.10      27.40(8.97)
	90.0th:   			   36.00      32.80(8.89)
	95.0th:   			   39.60      36.40(8.08)
	99.0th:   			   49.20      44.10(10.37)
	99.5th:   			   64.90      50.50(22.19)
	99.9th:   			  143.50     100.60(29.90)
	32 Threads
	50.0th:   			   34.60      35.50(-2.60)
	75.0th:   			   48.20      50.50(-4.77)
	90.0th:   			   59.20      62.40(-5.41)
	95.0th:   			   65.20      69.00(-5.83)
	99.0th:   			   80.40      83.80(-4.23)
	99.5th:   			  102.10      98.90(3.13)
	99.9th:   			  727.10     506.80(30.30)

schbench does improve in general. There is some run to run variation with
schbench. Did a validation run to confirm that trend is similar.

	------------------------------------------------------------------------------
	hackbench				tip	   +patch(%gain)
	20 iterations, 50000 loops	     sched/core
	------------------------------------------------------------------------------
	Process 10 groups                :      11.74      11.70(0.34)
	Process 20 groups                :      22.73      22.69(0.18)
	Process 30 groups                :      33.39      33.40(-0.03)
	Process 40 groups                :      43.73      43.61(0.27)
	Process 50 groups                :      53.82      54.35(-0.98)
	Process 60 groups                :      64.16      65.29(-1.76)
	thread 10 Time                   :      12.81      12.79(0.16)
	thread 20 Time                   :      24.63      24.47(0.65)
	Process(Pipe) 10 Time            :       6.40       6.34(0.94)
	Process(Pipe) 20 Time            :      10.62      10.63(-0.09)
	Process(Pipe) 30 Time            :      15.09      14.84(1.66)
	Process(Pipe) 40 Time            :      19.42      19.01(2.11)
	Process(Pipe) 50 Time            :      24.04      23.34(2.91)
	Process(Pipe) 60 Time            :      28.94      27.51(4.94)
	thread(Pipe) 10 Time             :       6.96       6.87(1.29)
	thread(Pipe) 20 Time             :      11.74      11.73(0.09)

hackbench shows slight improvement with pipe. Slight degradation in process.

	------------------------------------------------------------------------------
	stress_ng				tip        +patch(%gain)
	10 iterations 100000 cpu_ops	     sched/core
	------------------------------------------------------------------------------

	--cpu=96 -util=100 Time taken  	 :       5.30,       5.01(5.47)
	--cpu=48 -util=100 Time taken    :       7.94,       6.73(15.24)
	--cpu=24 -util=100 Time taken    :      11.67,       8.75(25.02)
	--cpu=12 -util=100 Time taken    :      15.71,      15.02(4.39)
	--cpu=96 -util=10 Time taken     :      22.71,      22.19(2.29)
	--cpu=96 -util=20 Time taken     :      12.14,      12.37(-1.89)
	--cpu=96 -util=30 Time taken     :       8.76,       8.86(-1.14)
	--cpu=96 -util=40 Time taken     :       7.13,       7.14(-0.14)
	--cpu=96 -util=50 Time taken     :       6.10,       6.13(-0.49)
	--cpu=96 -util=60 Time taken     :       5.42,       5.41(0.18)
	--cpu=96 -util=70 Time taken     :       4.94,       4.94(0.00)
	--cpu=96 -util=80 Time taken     :       4.56,       4.53(0.66)
	--cpu=96 -util=90 Time taken     :       4.27,       4.26(0.23)

Good improvement seen with 24 CPUs. In this case only one CPU is busy,
and no core is idle. Decent improvement with 100% utilization case. no
difference in other utilization.

Fixes: b1bfeab9b0 ("sched/fair: Consider the idle state of the whole core for load balance")
Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230902081204.232218-1-sshegde@linux.vnet.ibm.com
2023-09-02 12:56:04 +02:00
Valentin Schneider
cbb557ba92 tracing/filters: Fix coding style issues
Recent commits have introduced some coding style issues, fix those up.

Link: https://lkml.kernel.org/r/20230901151039.125186-5-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01 21:27:23 -04:00
Valentin Schneider
2900bcbee3 tracing/filters: Change parse_pred() cpulist ternary into an if block
Review comments noted that an if block would be clearer than a ternary, so
swap it out.

No change in behaviour intended

Link: https://lkml.kernel.org/r/20230901151039.125186-4-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01 21:27:22 -04:00
Valentin Schneider
1caf7adb9e tracing/filters: Fix double-free of struct filter_pred.mask
When a cpulist filter is found to contain a single CPU, that CPU is saved
as a scalar and the backing cpumask storage is freed.

Also NULL the mask to avoid a double-free once we get down to
free_predicate().

Link: https://lkml.kernel.org/r/20230901151039.125186-3-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01 21:27:22 -04:00
Valentin Schneider
9af4058493 tracing/filters: Fix error-handling of cpulist parsing buffer
parse_pred() allocates a string buffer to parse the user-provided cpulist,
but doesn't check the allocation result nor does it free the buffer once it
is no longer needed.

Add an allocation check, and free the buffer as soon as it is no longer
needed.

Link: https://lkml.kernel.org/r/20230901151039.125186-2-vschneid@redhat.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01 21:27:22 -04:00
Brian Foster
3d07fa1dd1 tracing: Zero the pipe cpumask on alloc to avoid spurious -EBUSY
The pipe cpumask used to serialize opens between the main and percpu
trace pipes is not zeroed or initialized. This can result in
spurious -EBUSY returns if underlying memory is not fully zeroed.
This has been observed by immediate failure to read the main
trace_pipe file on an otherwise newly booted and idle system:

 # cat /sys/kernel/debug/tracing/trace_pipe
 cat: /sys/kernel/debug/tracing/trace_pipe: Device or resource busy

Zero the allocation of pipe_cpumask to avoid the problem.

Link: https://lore.kernel.org/linux-trace-kernel/20230831125500.986862-1-bfoster@redhat.com

Cc: stable@vger.kernel.org
Fixes: c2489bb7e6 ("tracing: Introduce pipe_cpumask to avoid race on trace_pipes")
Reviewed-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01 21:26:07 -04:00
Ruan Jinjie
2a30dbcbef ftrace: Use LIST_HEAD to initialize clear_hash
Use LIST_HEAD() to initialize clear_hash instead of open-coding it.

Link: https://lore.kernel.org/linux-trace-kernel/20230809071551.913041-1-ruanjinjie@huawei.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01 21:18:38 -04:00
Levi Yun
1351148904 ftrace: Use within_module to check rec->ip within specified module.
within_module_core && within_module_init condition is same to
within module but it's more readable.

Use within_module instead of former condition to check rec->ip
within specified module area or not.

Link: https://lore.kernel.org/linux-trace-kernel/20230803205236.32201-1-ppbuk5246@gmail.com

Signed-off-by: Levi Yun <ppbuk5246@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01 21:17:10 -04:00
Zheng Yejian
3163f635b2 tracing: Fix race issue between cpu buffer write and swap
Warning happened in rb_end_commit() at code:
	if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing)))

  WARNING: CPU: 0 PID: 139 at kernel/trace/ring_buffer.c:3142
	rb_commit+0x402/0x4a0
  Call Trace:
   ring_buffer_unlock_commit+0x42/0x250
   trace_buffer_unlock_commit_regs+0x3b/0x250
   trace_event_buffer_commit+0xe5/0x440
   trace_event_buffer_reserve+0x11c/0x150
   trace_event_raw_event_sched_switch+0x23c/0x2c0
   __traceiter_sched_switch+0x59/0x80
   __schedule+0x72b/0x1580
   schedule+0x92/0x120
   worker_thread+0xa0/0x6f0

It is because the race between writing event into cpu buffer and swapping
cpu buffer through file per_cpu/cpu0/snapshot:

  Write on CPU 0             Swap buffer by per_cpu/cpu0/snapshot on CPU 1
  --------                   --------
                             tracing_snapshot_write()
                               [...]

  ring_buffer_lock_reserve()
    cpu_buffer = buffer->buffers[cpu]; // 1. Suppose find 'cpu_buffer_a';
    [...]
    rb_reserve_next_event()
      [...]

                               ring_buffer_swap_cpu()
                                 if (local_read(&cpu_buffer_a->committing))
                                     goto out_dec;
                                 if (local_read(&cpu_buffer_b->committing))
                                     goto out_dec;
                                 buffer_a->buffers[cpu] = cpu_buffer_b;
                                 buffer_b->buffers[cpu] = cpu_buffer_a;
                                 // 2. cpu_buffer has swapped here.

      rb_start_commit(cpu_buffer);
      if (unlikely(READ_ONCE(cpu_buffer->buffer)
          != buffer)) { // 3. This check passed due to 'cpu_buffer->buffer'
        [...]           //    has not changed here.
        return NULL;
      }
                                 cpu_buffer_b->buffer = buffer_a;
                                 cpu_buffer_a->buffer = buffer_b;
                                 [...]

      // 4. Reserve event from 'cpu_buffer_a'.

  ring_buffer_unlock_commit()
    [...]
    cpu_buffer = buffer->buffers[cpu]; // 5. Now find 'cpu_buffer_b' !!!
    rb_commit(cpu_buffer)
      rb_end_commit()  // 6. WARN for the wrong 'committing' state !!!

Based on above analysis, we can easily reproduce by following testcase:
  ``` bash
  #!/bin/bash

  dmesg -n 7
  sysctl -w kernel.panic_on_warn=1
  TR=/sys/kernel/tracing
  echo 7 > ${TR}/buffer_size_kb
  echo "sched:sched_switch" > ${TR}/set_event
  while [ true ]; do
          echo 1 > ${TR}/per_cpu/cpu0/snapshot
  done &
  while [ true ]; do
          echo 1 > ${TR}/per_cpu/cpu0/snapshot
  done &
  while [ true ]; do
          echo 1 > ${TR}/per_cpu/cpu0/snapshot
  done &
  ```

To fix it, IIUC, we can use smp_call_function_single() to do the swap on
the target cpu where the buffer is located, so that above race would be
avoided.

Link: https://lore.kernel.org/linux-trace-kernel/20230831132739.4070878-1-zhengyejian1@huawei.com

Cc: <mhiramat@kernel.org>
Fixes: f1affcaaa8 ("tracing: Add snapshot in the per_cpu trace directories")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01 21:00:00 -04:00
Mikhail Kobuk
2cf0dee989 tracing: Remove extra space at the end of hwlat_detector/mode
Space is printed after each mode value including the last one:
$ echo \"$(sudo cat /sys/kernel/tracing/hwlat_detector/mode)\"
"none [round-robin] per-cpu "

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Link: https://lore.kernel.org/linux-trace-kernel/20230825103432.7750-1-m.kobuk@ispras.ru

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 8fa826b734 ("trace/hwlat: Implement the mode config option")
Signed-off-by: Mikhail Kobuk <m.kobuk@ispras.ru>
Reviewed-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01 21:00:00 -04:00
Linus Torvalds
34232fcfe9 Merge tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
 "User visible changes:

   - Added a way to easier filter with cpumasks:

       # echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter

   - Show actual size of ring buffer after modifying the ring buffer
     size via buffer_size_kb.

     Currently it just returns what was written, but the actual size
     rounds up to the sub buffer size. Show that real size instead.

  Major changes:

   - Added "eventfs". This is the code that handles the inodes and
     dentries of tracefs/events directory. As there are thousands of
     events, and each event has several inodes and dentries that
     currently exist even when tracing is never used, they take up
     precious memory. Instead, eventfs will allocate the inodes and
     dentries in a JIT way (similar to what procfs does). There is now
     metadata that handles the events and subdirectories, and will
     create the inodes and dentries when they are used.

     Note, I also have patches that remove the subdirectory meta data,
     but will wait till the next merge window before applying them. It's
     a little more complex, and I want to make sure the dynamic code
     works properly before adding more complexity, making it easier to
     revert if need be.

  Minor changes:

   - Optimization to user event list traversal

   - Remove intermediate permission of tracefs files (note the
     intermediate permission removes all access to the files so it is
     not a security concern, but just a clean up)

   - Add the complex fix to FORTIFY_SOURCE to the kernel stack event
     logic

   - Other minor cleanups"

* tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits)
  tracefs: Remove kerneldoc from struct eventfs_file
  tracefs: Avoid changing i_mode to a temp value
  tracing/user_events: Optimize safe list traversals
  ftrace: Remove empty declaration ftrace_enable_daemon() and ftrace_disable_daemon()
  tracing: Remove unused function declarations
  tracing/filters: Document cpumask filtering
  tracing/filters: Further optimise scalar vs cpumask comparison
  tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU
  tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU
  tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU
  tracing/filters: Enable filtering the CPU common field by a cpumask
  tracing/filters: Enable filtering a scalar field by a cpumask
  tracing/filters: Enable filtering a cpumask field by another cpumask
  tracing/filters: Dynamically allocate filter_pred.regex
  test: ftrace: Fix kprobe test for eventfs
  eventfs: Move tracing/events to eventfs
  eventfs: Implement removal of meta data from eventfs
  eventfs: Implement functions to create files and dirs when accessed
  eventfs: Implement eventfs lookup, read, open functions
  eventfs: Implement eventfs file add functions
  ...
2023-09-01 16:34:25 -07:00
Linus Torvalds
bd30fe6a7d Merge tag 'wq-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:

 - Unbound workqueues now support more flexible affinity scopes.

   The default behavior is to soft-affine according to last level cache
   boundaries. A work item queued from a given LLC is executed by a
   worker running on the same LLC but the worker may be moved across
   cache boundaries as the scheduler sees fit. On machines which
   multiple L3 caches, which are becoming more popular along with
   chiplet designs, this improves cache locality while not harming work
   conservation too much.

   Unbound workqueues are now also a lot more flexible in terms of
   execution affinity. Differeing levels of affinity scopes are
   supported and both the default and per-workqueue affinity settings
   can be modified dynamically. This should help working around amny of
   sub-optimal behaviors observed recently with asymmetric ARM CPUs.

   This involved signficant restructuring of workqueue code. Nothing was
   reported yet but there's some risk of subtle regressions. Should keep
   an eye out.

 - Rescuer workers now has more identifiable comms.

 - workqueue.unbound_cpus added so that CPUs which can be used by
   workqueue can be constrained early during boot.

 - Now that all the in-tree users have been flushed out, trigger warning
   if system-wide workqueues are flushed.

* tag 'wq-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (31 commits)
  workqueue: fix data race with the pwq->stats[] increment
  workqueue: Rename rescuer kworker
  workqueue: Make default affinity_scope dynamically updatable
  workqueue: Add "Affinity Scopes and Performance" section to documentation
  workqueue: Implement non-strict affinity scope for unbound workqueues
  workqueue: Add workqueue_attrs->__pod_cpumask
  workqueue: Factor out need_more_worker() check and worker wake-up
  workqueue: Factor out work to worker assignment and collision handling
  workqueue: Add multiple affinity scopes and interface to select them
  workqueue: Modularize wq_pod_type initialization
  workqueue: Add tools/workqueue/wq_dump.py which prints out workqueue configuration
  workqueue: Generalize unbound CPU pods
  workqueue: Factor out clearing of workqueue-only attrs fields
  workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod()
  workqueue: Initialize unbound CPU pods later in the boot
  workqueue: Move wq_pod_init() below workqueue_init()
  workqueue: Rename NUMA related names to use pod instead
  workqueue: Rename workqueue_attrs->no_numa to ->ordered
  workqueue: Make unbound workqueues to use per-cpu pool_workqueues
  workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug
  ...
2023-09-01 16:06:32 -07:00
Linus Torvalds
7716f383a5 Merge tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - Per-cpu cpu usage stats are now tracked

   This currently isn't printed out in the cgroupfs interface and can
   only be accessed through e.g. BPF. Should decide on a not-too-ugly
   way to show per-cpu stats in cgroupfs

 - cpuset received some cleanups and prepatory patches for the pending
   cpus.exclusive patchset which will allow cpuset partitions to be
   created below non-partition parents, which should ease the management
   of partition cpusets

 - A lot of code and documentation cleanup patches

 - tools/testing/selftests/cgroup/test_cpuset.c added

* tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (32 commits)
  cgroup: Avoid -Wstringop-overflow warnings
  cgroup:namespace: Remove unused cgroup_namespaces_init()
  cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants
  cgroup: clean up if condition in cgroup_pidlist_start()
  cgroup: fix obsolete function name in cgroup_destroy_locked()
  Documentation: cgroup-v2.rst: Correct number of stats entries
  cgroup: fix obsolete function name above css_free_rwork_fn()
  cgroup/cpuset: fix kernel-doc
  cgroup: clean up printk()
  cgroup: fix obsolete comment above cgroup_create()
  docs: cgroup-v1: fix typo
  docs: cgroup-v1: correct the term of Page Cache organization in inode
  cgroup/misc: Store atomic64_t reads to u64
  cgroup/misc: Change counters to be explicit 64bit types
  cgroup/misc: update struct members descriptions
  cgroup: remove cgrp->kn check in css_populate_dir()
  cgroup: fix obsolete function name
  cgroup: use cached local variable parent in for loop
  cgroup: remove obsolete comment above struct cgroupstats
  cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED
  ...
2023-09-01 15:58:21 -07:00
Linus Torvalds
e987af4546 Merge tag 'percpu-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
 "One bigger change to percpu_counter's api allowing for init and
  destroy of multiple counters via percpu_counter_init_many() and
  percpu_counter_destroy_many(). This is used to help begin remediating
  a performance regression with percpu rss stats.

  Additionally, it seems larger core count machines are feeling the
  burden of the single threaded allocation of percpu. Mateusz is
  thinking about it and I will spend some time on it too.

  percpu:

   - A couple cleanups by Baoquan He and Bibo Mao. The only behavior
     change is to start printing messages if we're under the warn limit
     for failed atomic allocations.

  percpu_counter:

   - Shakeel introduced percpu counters into mm_struct which caused
     percpu allocations be on the hot path [1]. Originally I spent some
     time trying to improve the percpu allocator, but instead preferred
     what Mateusz Guzik proposed grouping at the allocation site,
     percpu_counter_init_many(). This allows a single percpu allocation
     to be shared by the counters. I like this approach because it
     creates a shared lifetime by the allocations. Additionally, I
     believe many inits have higher level synchronization requirements,
     like percpu_counter does against HOTPLUG_CPU. Therefore we can
     group these optimizations together"

Link: https://lore.kernel.org/linux-mm/20221024052841.3291983-1-shakeelb@google.com/ [1]

* tag 'percpu-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  kernel/fork: group allocation/free of per-cpu counters for mm struct
  pcpcntr: add group allocation/free
  mm/percpu.c: print error message too if atomic alloc failed
  mm/percpu.c: optimize the code in pcpu_setup_first_chunk() a little bit
  mm/percpu.c: remove redundant check
  mm/percpu: Remove some local variables in pcpu_populate_pte
2023-09-01 15:44:45 -07:00
Linus Torvalds
8e1e49550d Merge tag 'tty-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
 "Here is the big set of tty and serial driver changes for 6.6-rc1.

  Lots of cleanups in here this cycle, and some driver updates. Short
  summary is:

   - Jiri's continued work to make the tty code and apis be a bit more
     sane with regards to modern kernel coding style and types

   - cpm_uart driver updates

   - n_gsm updates and fixes

   - meson driver updates

   - sc16is7xx driver updates

   - 8250 driver updates for different hardware types

   - qcom-geni driver fixes

   - tegra serial driver change

   - stm32 driver updates

   - synclink_gt driver cleanups

   - tty structure size reduction

  All of these have been in linux-next this week with no reported
  issues. The last bit of cleanups from Jiri and the tty structure size
  reduction came in last week, a bit late but as they were just style
  changes and size reductions, I figured they should get into this merge
  cycle so that others can work on top of them with no merge conflicts"

* tag 'tty-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (199 commits)
  tty: shrink the size of struct tty_struct by 40 bytes
  tty: n_tty: deduplicate copy code in n_tty_receive_buf_real_raw()
  tty: n_tty: extract ECHO_OP processing to a separate function
  tty: n_tty: unify counts to size_t
  tty: n_tty: use u8 for chars and flags
  tty: n_tty: simplify chars_in_buffer()
  tty: n_tty: remove unsigned char casts from character constants
  tty: n_tty: move newline handling to a separate function
  tty: n_tty: move canon handling to a separate function
  tty: n_tty: use MASK() for masking out size bits
  tty: n_tty: make n_tty_data::num_overrun unsigned
  tty: n_tty: use time_is_before_jiffies() in n_tty_receive_overrun()
  tty: n_tty: use 'num' for writes' counts
  tty: n_tty: use output character directly
  tty: n_tty: make flow of n_tty_receive_buf_common() a bool
  Revert "tty: serial: meson: Add a earlycon for the T7 SoC"
  Documentation: devices.txt: Fix minors for ttyCPM*
  Documentation: devices.txt: Remove ttySIOC*
  Documentation: devices.txt: Remove ttyIOC*
  serial: 8250_bcm7271: improve bcm7271 8250 port
  ...
2023-09-01 09:38:00 -07:00
Linus Torvalds
4ad0a4c234 Merge tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:

 - Add HOTPLUG_SMT support (/sys/devices/system/cpu/smt) and honour the
   configured SMT state when hotplugging CPUs into the system

 - Combine final TLB flush and lazy TLB mm shootdown IPIs when using the
   Radix MMU to avoid a broadcast TLBIE flush on exit

 - Drop the exclusion between ptrace/perf watchpoints, and drop the now
   unused associated arch hooks

 - Add support for the "nohlt" command line option to disable CPU idle

 - Add support for -fpatchable-function-entry for ftrace, with GCC >=
   13.1

 - Rework memory block size determination, and support 256MB size on
   systems with GPUs that have hotpluggable memory

 - Various other small features and fixes

Thanks to Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira
Rajeev, Benjamin Gray, Christophe Leroy, Frederic Barrat, Gautam
Menghani, Geoff Levand, Hari Bathini, Immad Mir, Jialin Zhang, Joel
Stanley, Jordan Niethe, Justin Stitt, Kajol Jain, Kees Cook, Krzysztof
Kozlowski, Laurent Dufour, Liang He, Linus Walleij, Mahesh Salgaonkar,
Masahiro Yamada, Michal Suchanek, Nageswara R Sastry, Nathan Chancellor,
Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nick Desaulniers, Omar
Sandoval, Randy Dunlap, Reza Arbab, Rob Herring, Russell Currey, Sourabh
Jain, Thomas Gleixner, Trevor Woerner, Uwe Kleine-König, Vaibhav Jain,
Xiongfeng Wang, Yuan Tan, Zhang Rui, and Zheng Zengkai.

* tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (135 commits)
  macintosh/ams: linux/platform_device.h is needed
  powerpc/xmon: Reapply "Relax frame size for clang"
  powerpc/mm/book3s64: Use 256M as the upper limit with coherent device memory attached
  powerpc/mm/book3s64: Fix build error with SPARSEMEM disabled
  powerpc/iommu: Fix notifiers being shared by PCI and VIO buses
  powerpc/mpc5xxx: Add missing fwnode_handle_put()
  powerpc/config: Disable SLAB_DEBUG_ON in skiroot
  powerpc/pseries: Remove unused hcall tracing instruction
  powerpc/pseries: Fix hcall tracepoints with JUMP_LABEL=n
  powerpc: dts: add missing space before {
  powerpc/eeh: Use pci_dev_id() to simplify the code
  powerpc/64s: Move CPU -mtune options into Kconfig
  powerpc/powermac: Fix unused function warning
  powerpc/pseries: Rework lppaca_shared_proc() to avoid DEBUG_PREEMPT
  powerpc: Don't include lppaca.h in paca.h
  powerpc/pseries: Move hcall_vphn() prototype into vphn.h
  powerpc/pseries: Move VPHN constants into vphn.h
  cxl: Drop unused detach_spa()
  powerpc: Drop zalloc_maybe_bootmem()
  powerpc/powernv: Use struct opal_prd_msg in more places
  ...
2023-08-31 12:43:10 -07:00
Linus Torvalds
df57721f9a Merge tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 shadow stack support from Dave Hansen:
 "This is the long awaited x86 shadow stack support, part of Intel's
  Control-flow Enforcement Technology (CET).

  CET consists of two related security features: shadow stacks and
  indirect branch tracking. This series implements just the shadow stack
  part of this feature, and just for userspace.

  The main use case for shadow stack is providing protection against
  return oriented programming attacks. It works by maintaining a
  secondary (shadow) stack using a special memory type that has
  protections against modification. When executing a CALL instruction,
  the processor pushes the return address to both the normal stack and
  to the special permission shadow stack. Upon RET, the processor pops
  the shadow stack copy and compares it to the normal stack copy.

  For more information, refer to the links below for the earlier
  versions of this patch set"

Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/

* tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
  x86/shstk: Change order of __user in type
  x86/ibt: Convert IBT selftest to asm
  x86/shstk: Don't retry vm_munmap() on -EINTR
  x86/kbuild: Fix Documentation/ reference
  x86/shstk: Move arch detail comment out of core mm
  x86/shstk: Add ARCH_SHSTK_STATUS
  x86/shstk: Add ARCH_SHSTK_UNLOCK
  x86: Add PTRACE interface for shadow stack
  selftests/x86: Add shadow stack test
  x86/cpufeatures: Enable CET CR4 bit for shadow stack
  x86/shstk: Wire in shadow stack interface
  x86: Expose thread features in /proc/$PID/status
  x86/shstk: Support WRSS for userspace
  x86/shstk: Introduce map_shadow_stack syscall
  x86/shstk: Check that signal frame is shadow stack mem
  x86/shstk: Check that SSP is aligned on sigreturn
  x86/shstk: Handle signals for shadow stack
  x86/shstk: Introduce routines modifying shstk
  x86/shstk: Handle thread shadow stack
  x86/shstk: Add user-mode shadow stack support
  ...
2023-08-31 12:20:12 -07:00
Christoph Hellwig
765aa6b3a4 dma-pool: remove a __maybe_unused label in atomic_pool_expand
Move the #endif a line so that free_page label is only seen by the
compile pass when actually used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chunhui He <hchunhui@mail.ustc.edu.cn>
Reviewed-by: Robin Murphy <roin.murphy@arm.com>
2023-08-31 14:12:37 +02:00
Linus Torvalds
cd99b9eb4b Merge tag 'docs-6.6' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
 "Documentation work keeps chugging along; this includes:

   - Work from Carlos Bilbao to integrate rustdoc output into the
     generated HTML documentation. This took some work to figure out how
     to do it without slowing the docs build and without creating people
     who don't have Rust installed, but Carlos got there

   - Move the loongarch and mips architecture documentation under
     Documentation/arch/

   - Some more maintainer documentation from Jakub

  ... plus the usual assortment of updates, translations, and fixes"

* tag 'docs-6.6' of git://git.lwn.net/linux: (56 commits)
  Docu: genericirq.rst: fix irq-example
  input: docs: pxrc: remove reference to phoenix-sim
  Documentation: serial-console: Fix literal block marker
  docs/mm: remove references to hmm_mirror ops and clean typos
  docs/zh_CN: correct regi_chg(),regi_add() to region_chg(),region_add()
  Documentation: Fix typos
  Documentation/ABI: Fix typos
  scripts: kernel-doc: fix macro handling in enums
  scripts: kernel-doc: parse DEFINE_DMA_UNMAP_[ADDR|LEN]
  Documentation: riscv: Update boot image header since EFI stub is supported
  Documentation: riscv: Add early boot document
  Documentation: arm: Add bootargs to the table of added DT parameters
  docs: kernel-parameters: Refer to the correct bitmap function
  doc: update params of memhp_default_state=
  docs: Add book to process/kernel-docs.rst
  docs: sparse: fix invalid link addresses
  docs: vfs: clean up after the iterate() removal
  docs: Add a section on surveys to the researcher guidelines
  docs: move mips under arch
  docs: move loongarch under arch
  ...
2023-08-30 20:05:42 -07:00
Phil Sutter
ea078ae910 netfilter: nf_tables: Audit log rule reset
Resetting rules' stateful data happens outside of the transaction logic,
so 'get' and 'dump' handlers have to emit audit log entries themselves.

Fixes: 8daa8fde3f ("netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-31 01:29:28 +02:00
Phil Sutter
7e9be1124d netfilter: nf_tables: Audit log setelem reset
Since set element reset is not integrated into nf_tables' transaction
logic, an explicit log call is needed, similar to NFT_MSG_GETOBJ_RESET
handling.

For the sake of simplicity, catchall element reset will always generate
a dedicated log entry. This relieves nf_tables_dump_set() from having to
adjust the logged element count depending on whether a catchall element
was found or not.

Fixes: 079cd63321 ("netfilter: nf_tables: Introduce NFT_MSG_GETSETELEM_RESET")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-31 01:29:27 +02:00
Linus Torvalds
1a35914f73 Merge tag 'integrity-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity subsystem updates from Mimi Zohar:

 - With commit 099f26f22f ("integrity: machine keyring CA
   configuration") certificates may be loaded onto the IMA keyring,
   directly or indirectly signed by keys on either the "builtin" or the
   "machine" keyrings.

   With the ability for the system/machine owner to sign the IMA policy
   itself without needing to recompile the kernel, update the IMA
   architecture specific policy rules to require the IMA policy itself
   be signed.

   [ As commit 099f26f22f was upstreamed in linux-6.4, updating the
     IMA architecture specific policy now to require signed IMA policies
     may break userspace expectations. ]

 - IMA only checked the file data hash was not on the system blacklist
   keyring for files with an appended signature (e.g. kernel modules,
   Power kernel image).

   Check all file data hashes regardless of how it was signed

 - Code cleanup, and a kernel-doc update

* tag 'integrity-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  kexec_lock: Replace kexec_mutex() by kexec_lock() in two comments
  ima: require signed IMA policy when UEFI secure boot is enabled
  integrity: Always reference the blacklist keyring with appraisal
  ima: Remove deprecated IMA_TRUSTED_KEYRING Kconfig
2023-08-30 09:16:56 -07:00