Commit Graph

46597 Commits

Hou Tao
e319cdc895 bpf: Disable migration when destroying inode storage
When destroying inode storage, it invokes bpf_local_storage_destroy() to
remove all storage elements saved in the inode storage. The destroy
procedure will call bpf_selem_free() to free the element, and
bpf_selem_free() calls bpf_obj_free_fields() to free the special fields
in map value (e.g., kptr). Since kptrs may be allocated from bpf memory
allocator, migrate_{disable|enable} pairs are necessary for the freeing
of these kptrs.

To simplify reasoning about when migrate_disable() is needed for the
freeing of these dynamically-allocated kptrs, have the caller guarantee
that migration is disabled before invoking bpf_obj_free_fields().
Therefore, the patch adds migrate_{disable|enable} pair in
bpf_inode_storage_free(). The migrate_{disable|enable} pairs in the
underlying implementation of bpf_obj_free_fields() will be removed by
the following patch.
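
A minimal sketch of the resulting pattern (illustrative only, with a
hypothetical function name; the actual patch touches bpf_inode_storage_free()):

    static void inode_storage_free_sketch(struct bpf_local_storage *local_storage)
    {
        /* kptrs in the map value may come from the bpf memory allocator,
         * so disable migration before the free path reaches
         * bpf_obj_free_fields().
         */
        migrate_disable();
        bpf_local_storage_destroy(local_storage);
        migrate_enable();
    }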

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-7-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 18:06:36 -08:00
Hou Tao
9e6c958b54 bpf: Remove migrate_{disable|enable} from bpf_task_storage_lock helpers
Three callers of bpf_task_storage_lock() are the ->map_lookup_elem,
->map_update_elem and ->map_delete_elem callbacks invoked from the bpf
syscall. The bpf syscall has already disabled migration for these three
task storage operations. The other two callers are the
bpf_task_storage_get() and bpf_task_storage_delete() helpers, which are
used by BPF programs.

Two callers of bpf_task_storage_trylock() are the bpf_task_storage_get()
and bpf_task_storage_delete() helpers. The running contexts of these
helpers have already disabled migration.

Therefore, it is safe to remove migrate_{disable|enable} from the task
storage lock helpers for these call sites. However,
bpf_task_storage_free() also invokes bpf_task_storage_lock(), and its
running context doesn't disable migration; therefore, add the missing
migrate_{disable|enable} pair in bpf_task_storage_free().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 18:06:36 -08:00
Hou Tao
25dc65f75b bpf: Remove migrate_{disable|enable} from bpf_cgrp_storage_lock helpers
Three callers of bpf_cgrp_storage_lock() are the ->map_lookup_elem,
->map_update_elem and ->map_delete_elem callbacks invoked from the bpf
syscall. The bpf syscall has already disabled migration for these three
cgrp storage operations.

The two call sites of bpf_cgrp_storage_trylock() are the
bpf_cgrp_storage_get() and bpf_cgrp_storage_delete() helpers. The running
contexts of these helpers have already disabled migration.

Therefore, it is safe to remove migrate_disable() for these callers.
However, bpf_cgrp_storage_free() also invokes bpf_cgrp_storage_lock()
and its running context doesn't disable migration. Therefore, also add
the missing migrate_{disable|enable} pair in bpf_cgrp_storage_free().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 18:06:36 -08:00
Hou Tao
53f2ba0b1c bpf: Remove migrate_{disable|enable} in htab_elem_free
htab_elem_free() has two call sites: delete_all_elements(), which has
already disabled migration, and free_htab_elem(), which is invoked by
four other functions: __htab_map_lookup_and_delete_elem,
__htab_map_lookup_and_delete_batch, htab_map_update_elem and
htab_map_delete_elem.

BPF syscall has already disabled migration before invoking
->map_update_elem, ->map_delete_elem, and ->map_lookup_and_delete_elem
callbacks for hash map. __htab_map_lookup_and_delete_batch() also
disables migration before invoking free_htab_elem(). ->map_update_elem()
and ->map_delete_elem() of hash map may be invoked by BPF program and
the running context of BPF program has already disabled migration.
Therefore, it is safe to remove the migrate_{disable|enable} pair in
htab_elem_free().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 18:06:36 -08:00
Hou Tao
ea5b229630 bpf: Remove migrate_{disable|enable} in ->map_for_each_callback
A BPF program may call bpf_for_each_map_elem(), which in turn calls
the ->map_for_each_callback callback of the related bpf map. Considering
that the running context of a bpf program has already disabled migration,
remove the unnecessary migrate_{disable|enable} pair in the
implementations of ->map_for_each_callback. To ensure the guarantee will
not be violated later, also add a cant_migrate() check in the
implementations.
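
A rough sketch of the resulting shape of a ->map_for_each_callback
implementation (hypothetical, simplified signature; the real callbacks take
more arguments):

    static long map_for_each_sketch(struct bpf_map *map, bpf_callback_t fn, void *ctx)
    {
        /* Callers (bpf_for_each_map_elem() from a BPF program) already run
         * with migration disabled, so assert that instead of toggling it.
         */
        cant_migrate();

        /* walk the elements and invoke fn() for each of them ... */
        return 0;
    }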

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 18:06:35 -08:00
Hou Tao
1b1a01db17 bpf: Remove migrate_{disable|enable} from LPM trie
Both bpf program and bpf syscall may invoke the ->update or ->delete
operations of the LPM trie. For a bpf program, its running context has
already disabled migration, either explicitly through migrate_disable()
or implicitly through preempt_disable() or irq disabling. For the bpf
syscall, migration is disabled through the use of
bpf_disable_instrumentation() before invoking the corresponding map
operation callback.

Therefore, it is safe to remove the migrate_{disable|enable} pair from
the LPM trie.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 18:06:35 -08:00
Jiri Olsa
2ebadb60cb bpf: Return error for missed kprobe multi bpf program execution
When a kprobe multi bpf program can't be executed due to the recursion
check, we currently return 0 (success) to the fprobe layer, where it's
ignored for standard kprobe multi probes.

For a kprobe session, the success return value makes the fprobe layer
install the return probe and try to execute it as well.

But the session return probe should not get executed, because the entry
part did not run. FWIW the return probe bpf program most likely won't get
executed anyway, because its recursion check will likely fail as well, but
we don't need to run it in the first place, and this makes the behavior
clear and obvious.

It also affects the missed counts for kprobe session program execution,
which are currently doubled (an extra count for the return probe that is
never executed).

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250106175048.1443905-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 09:39:58 -08:00
Pu Lehui
ca3c4f646a bpf: Move out synchronize_rcu_tasks_trace from mutex CS
Commit ef1b808e3b ("bpf: Fix UAF via mismatching bpf_prog/attachment
RCU flavors") resolved a possible UAF issue in uprobes that attach a
non-sleepable bpf prog by explicitly waiting for a tasks-trace-RCU grace
period. But, in the current implementation, synchronize_rcu_tasks_trace
is included within the mutex critical section, which increases the
length of the critical section and may affect performance. So let's move
synchronize_rcu_tasks_trace out of the mutex critical section.
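
Roughly, the change looks like this (hedged before/after sketch around the
bpf_event_mutex used in that path, not the literal diff):

    /* before: grace period waited for inside the critical section */
    mutex_lock(&bpf_event_mutex);
    /* ... detach the bpf prog from the perf event ... */
    synchronize_rcu_tasks_trace();
    mutex_unlock(&bpf_event_mutex);

    /* after: wait for the grace period outside the critical section */
    mutex_lock(&bpf_event_mutex);
    /* ... detach the bpf prog from the perf event ... */
    mutex_unlock(&bpf_event_mutex);
    synchronize_rcu_tasks_trace();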

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20250104013946.1111785-1-pulehui@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 09:38:41 -08:00
Soma Nakata
b8b1e30016 bpf: Fix range_tree_set() error handling
range_tree_set() might fail and return -ENOMEM,
causing subsequent `bpf_arena_alloc_pages` to fail.
Add the error handling.
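
A hedged sketch of the added check (variable and argument names are
placeholders, not the actual arena.c code):

    err = range_tree_set(rt, start, len);
    if (err)
        return err;    /* propagate -ENOMEM instead of silently continuing */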

Signed-off-by: Soma Nakata <soma.nakata@somane.sakura.ne.jp>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250106231536.52856-1-soma.nakata@somane.sakura.ne.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-08 09:35:33 -08:00
Emil Tsalapatis
512816403e bpf: Allow bpf_for/bpf_repeat calls while holding a spinlock
Add the bpf_iter_num_* kfuncs called by bpf_for to special_kfunc_list,
and allow these calls even while holding a spin lock.
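
An illustrative BPF-side sketch of what this enables (hypothetical program;
bpf_for() comes from the selftests' bpf_experimental.h):

    struct val {
        struct bpf_spin_lock lock;
        int counter;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct val);
    } vals SEC(".maps");

    SEC("tc")
    int bpf_for_under_lock(struct __sk_buff *skb)
    {
        int zero = 0, i;
        struct val *v = bpf_map_lookup_elem(&vals, &zero);

        if (!v)
            return 0;
        bpf_spin_lock(&v->lock);
        bpf_for(i, 0, 16)       /* bpf_iter_num_* kfuncs, now allowed here */
            v->counter++;
        bpf_spin_unlock(&v->lock);
        return 0;
    }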

Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250104202528.882482-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-06 10:59:49 -08:00
Martin KaFai Lau
96ea081ed5 bpf: Reject struct_ops registration that uses module ptr and the module btf_id is missing
There is a UAF report in the bpf_struct_ops when CONFIG_MODULES=n.
In particular, the report is on tcp_congestion_ops that has
a "struct module *owner" member.

A struct_ops that has a "struct module *owner" member can be extended
either by a regular kernel module or by bpf_struct_ops.
bpf_try_module_get() is used to do the refcounting, and a different
refcount is taken based on the owner pointer. When CONFIG_MODULES=n,
the btf_id of the "struct module" is missing:

WARN: resolve_btfids: unresolved symbol module

Thus, the bpf_try_module_get() cannot do the correct refcounting.

Not all subsystems' struct_ops require the "struct module *owner" member,
e.g. the recent sched_ext_ops.

This patch disables bpf_struct_ops registration if the struct_ops has
the "struct module *" member and the "struct module" btf_id is missing.
The btf_type_is_fwd() helper is moved to the btf.h header file for this
test.

This issue has existed since the beginning of bpf_struct_ops, which has
gone through many changes. The Fixes tag is set to a recent commit to
which this patch applies cleanly. Considering that CONFIG_MODULES=n is
not common and the age of the issue, this is targeted at bpf-next as
well.

Fixes: 1611603537 ("bpf: Create argument information for nullable arguments.")
Reported-by: Robert Morris <rtm@csail.mit.edu>
Closes: https://lore.kernel.org/bpf/74665.1733669976@localhost/
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241220201818.127152-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-03 10:16:46 -08:00
Pei Xiao
dfa94ce54f bpf: Use refcount_t instead of atomic_t for mmap_count
Use an API that better matches the actual use of mmap_count.

Found by cocci:
kernel/bpf/arena.c:245:6-25: WARNING: atomic_dec_and_test variation before object free at line 249.
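
The shape of the conversion, roughly (hedged sketch, not the literal diff;
free_the_arena() is a placeholder for the actual teardown):

    /* before */
    atomic_set(&arena->mmap_count, 1);
    atomic_inc(&arena->mmap_count);
    if (atomic_dec_and_test(&arena->mmap_count))
        free_the_arena(arena);

    /* after: refcount_t documents the free-on-zero intent and adds
     * saturation/underflow sanity checks
     */
    refcount_set(&arena->mmap_count, 1);
    refcount_inc(&arena->mmap_count);
    if (refcount_dec_and_test(&arena->mmap_count))
        free_the_arena(arena);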

Fixes: b90d77e5fd ("bpf: Fix remap of arena.")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412292037.LXlYSHKl-lkp@intel.com/
Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn>
Link: https://lore.kernel.org/r/6ecce439a6bc81adb85d5080908ea8959b792a50.1735542814.git.xiaopei01@kylinos.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-30 20:12:21 -08:00
Lorenzo Pieralisi
654a3381e3 bpf: Remove unused MT_ENTRY define
The range tree introduction removed the need for maple tree usage
but missed removing the MT_ENTRY defined value that was used to
mark maple tree allocated entries.
Remove the MT_ENTRY define.

Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Link: https://lore.kernel.org/r/20241223115901.14207-1-lpieralisi@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-30 15:18:13 -08:00
Thomas Weißschuh
4a24035964 bpf: Fix holes in special_kfunc_list if !CONFIG_NET
If the function is not available, its entry has to be replaced with
BTF_ID_UNUSED instead of being skipped.
Otherwise the list doesn't work correctly.
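
A hedged illustration of the rule (the concrete entries in special_kfunc_list
differ; bpf_dynptr_from_skb is just one example of a CONFIG_NET-only kfunc):

    BTF_ID_LIST(special_kfunc_list)
    #ifdef CONFIG_NET
    BTF_ID(func, bpf_dynptr_from_skb)
    #else
    BTF_ID_UNUSED    /* keep the slot so later indices still line up */
    #endif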

Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Closes: https://lore.kernel.org/lkml/CAADnVQJQpVziHzrPCCpGE5=8uzw2OkxP8gqe1FkJ6_XVVyVbNw@mail.gmail.com/
Fixes: 00a5acdbf3 ("bpf: Fix configuration-dependent BTF function references")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20241219-bpf-fix-special_kfunc_list-v1-1-d9d50dd61505@weissschuh.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-30 14:52:08 -08:00
Matan Shachnai
9aa0ebde00 bpf, verifier: Improve precision of BPF_MUL
This patch improves (or maintains) the precision of register value tracking
in BPF_MUL across all possible inputs. It also simplifies
scalar32_min_max_mul() and scalar_min_max_mul().

As it stands, BPF_MUL is composed of three functions:

case BPF_MUL:
  tnum_mul();
  scalar32_min_max_mul();
  scalar_min_max_mul();

The current implementation of scalar_min_max_mul() restricts the u64 input
ranges of dst_reg and src_reg to be within [0, U32_MAX]:

    /* Both values are positive, so we can work with unsigned and
     * copy the result to signed (unless it exceeds S64_MAX).
     */
    if (umax_val > U32_MAX || dst_reg->umax_value > U32_MAX) {
        /* Potential overflow, we know nothing */
        __mark_reg64_unbounded(dst_reg);
        return;
    }

This restriction is done to avoid unsigned overflow, which could otherwise
wrap the result around 0, and leave an unsound output where umin > umax. We
also observe that limiting these u64 input ranges to [0, U32_MAX] leads to
a loss of precision. Consider the case where the u64 bounds of dst_reg are
[0, 2^34] and the u64 bounds of src_reg are [0, 2^2]. While the
multiplication of these two bounds doesn't overflow and is sound [0, 2^36],
the current scalar_min_max_mul() would set the entire register state to
unbounded.

Importantly, we update BPF_MUL to allow signed bound multiplication
(i.e. multiplying negative bounds) as well as allow u64 inputs to take on
values from [0, U64_MAX]. We perform signed multiplication on two bounds
[a,b] and [c,d] by multiplying every combination of the bounds
(i.e. a*c, a*d, b*c, and b*d) and checking for overflow of each product. If
there is an overflow, we mark the signed bounds unbounded [S64_MIN, S64_MAX].
In the case of no overflow, we take the minimum of these products to
be the resulting smin, and the maximum to be the resulting smax.

The key idea here is that if there’s no possibility of overflow, either
when multiplying signed bounds or unsigned bounds, we can safely multiply the
respective bounds; otherwise, we set the bounds that exhibit overflow
(during multiplication) to unbounded.

if (check_mul_overflow(*dst_umax, src_reg->umax_value, dst_umax) ||
       (check_mul_overflow(*dst_umin, src_reg->umin_value, dst_umin))) {
        /* Overflow possible, we know nothing */
        *dst_umin = 0;
        *dst_umax = U64_MAX;
    }
  ...
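
As a concrete illustration of the signed-bound rule (numbers chosen for this
example, not taken from the patch): for dst bounds [-2, 3] and src bounds
[4, 5], the four products are -10, -8, 12 and 15; none of them overflows, so
the result becomes smin = -10 and smax = 15.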

Below, we provide an example BPF program that exhibits the imprecision
in the current BPF_MUL, where the outputs are all unbounded. In
contrast, the updated BPF_MUL produces a bounded register state:

BPF_LD_IMM64(BPF_REG_1, 11),
BPF_LD_IMM64(BPF_REG_2, 4503599627370624),
BPF_ALU64_IMM(BPF_NEG, BPF_REG_2, 0),
BPF_ALU64_IMM(BPF_NEG, BPF_REG_2, 0),
BPF_ALU64_REG(BPF_AND, BPF_REG_1, BPF_REG_2),
BPF_LD_IMM64(BPF_REG_3, 809591906117232263),
BPF_ALU64_REG(BPF_MUL, BPF_REG_3, BPF_REG_1),
BPF_MOV64_IMM(BPF_REG_0, 1),
BPF_EXIT_INSN(),

Verifier log using the old BPF_MUL:

func#0 @0
0: R1=ctx() R10=fp0
0: (18) r1 = 0xb                      ; R1_w=11
2: (18) r2 = 0x10000000000080         ; R2_w=0x10000000000080
4: (87) r2 = -r2                      ; R2_w=scalar()
5: (87) r2 = -r2                      ; R2_w=scalar()
6: (5f) r1 &= r2                      ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R2_w=scalar()
7: (18) r3 = 0xb3c3f8c99262687        ; R3_w=0xb3c3f8c99262687
9: (2f) r3 *= r1                      ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R3_w=scalar()
...

Verifier log using the new BPF_MUL (more precise bounds at label 9):

func#0 @0
0: R1=ctx() R10=fp0
0: (18) r1 = 0xb                      ; R1_w=11
2: (18) r2 = 0x10000000000080         ; R2_w=0x10000000000080
4: (87) r2 = -r2                      ; R2_w=scalar()
5: (87) r2 = -r2                      ; R2_w=scalar()
6: (5f) r1 &= r2                      ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R2_w=scalar()
7: (18) r3 = 0xb3c3f8c99262687        ; R3_w=0xb3c3f8c99262687
9: (2f) r3 *= r1                      ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=11,var_off=(0x0; 0xb)) R3_w=scalar(smin=0,smax=umax=0x7b96bb0a94a3a7cd,var_off=(0x0; 0x7fffffffffffffff))
...

Finally, we proved the soundness of the new scalar_min_max_mul() and
scalar32_min_max_mul() functions. Typically, multiplication operations are
expensive to check with bitvector-based solvers. We were able to prove the
soundness of these functions using Non-Linear Integer Arithmetic (NIA)
theory. Additionally, using Agni [2,3], we obtained the encodings for
scalar32_min_max_mul() and scalar_min_max_mul() in bitvector theory, and
were able to prove their soundness using 8-bit bitvectors (instead of
64-bit bitvectors that the functions actually use).

In conclusion, with this patch,

1. We were able to show that we can improve the overall precision of
   BPF_MUL. We proved (using an SMT solver) that this new version of
   BPF_MUL is at least as precise as the current version for all inputs
   and more precise for some inputs.

2. We are able to prove the soundness of the new scalar_min_max_mul() and
   scalar32_min_max_mul(). By leveraging the existing proof of tnum_mul
   [1], we can say that the composition of these three functions within
   BPF_MUL is sound.

[1] https://ieeexplore.ieee.org/abstract/document/9741267
[2] https://link.springer.com/chapter/10.1007/978-3-031-37709-9_12
[3] https://people.cs.rutgers.edu/~sn349/papers/sas24-preprint.pdf

Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Matan Shachnai <m.shachnai@gmail.com>
Link: https://lore.kernel.org/r/20241218032337.12214-2-m.shachnai@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-30 14:49:42 -08:00
Martin KaFai Lau
8eef6ac4d7 bpf: bpf_local_storage: Always use bpf_mem_alloc in PREEMPT_RT
In PREEMPT_RT, kmalloc(GFP_ATOMIC) is still not safe in non-preemptible
context. bpf_mem_alloc must be used in PREEMPT_RT. This patch enforces
the use of bpf_mem_alloc in bpf_local_storage when CONFIG_PREEMPT_RT
is enabled.

[   35.118559] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[   35.118566] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1832, name: test_progs
[   35.118569] preempt_count: 1, expected: 0
[   35.118571] RCU nest depth: 1, expected: 1
[   35.118577] INFO: lockdep is turned off.
    ...
[   35.118647]  __might_resched+0x433/0x5b0
[   35.118677]  rt_spin_lock+0xc3/0x290
[   35.118700]  ___slab_alloc+0x72/0xc40
[   35.118723]  __kmalloc_noprof+0x13f/0x4e0
[   35.118732]  bpf_map_kzalloc+0xe5/0x220
[   35.118740]  bpf_selem_alloc+0x1d2/0x7b0
[   35.118755]  bpf_local_storage_update+0x2fa/0x8b0
[   35.118784]  bpf_sk_storage_get_tracing+0x15a/0x1d0
[   35.118791]  bpf_prog_9a118d86fca78ebb_trace_inet_sock_set_state+0x44/0x66
[   35.118795]  bpf_trace_run3+0x222/0x400
[   35.118820]  __bpf_trace_inet_sock_set_state+0x11/0x20
[   35.118824]  trace_inet_sock_set_state+0x112/0x130
[   35.118830]  inet_sk_state_store+0x41/0x90
[   35.118836]  tcp_set_state+0x3b3/0x640

There is no need to adjust the gfp_flags passed to
bpf_mem_cache_alloc_flags(), which only honors GFP_KERNEL.
The verifier has ensured GFP_KERNEL is passed only in sleepable context.
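
A hedged sketch of the selection logic (hypothetical helper name; the real
code lives in bpf_local_storage.c and handles more cases):

    static bool selem_use_bpf_mem_alloc(bool smap_uses_bma)
    {
        /* kmalloc(GFP_ATOMIC) may sleep on PREEMPT_RT (the slab spinlock
         * becomes a sleeping lock), so always use bpf_mem_alloc there.
         */
        return smap_uses_bma || IS_ENABLED(CONFIG_PREEMPT_RT);
    }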

This has been an issue since the first introduction of bpf_local_storage
~5 years ago, so this patch targets bpf-next.

bpf_mem_alloc is needed to solve it, so the Fixes tag is set
to the commit when bpf_mem_alloc was first used in the bpf_local_storage.

Fixes: 08a7ce384e ("bpf: Use bpf_mem_cache_alloc/free in bpf_local_storage_elem")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241218193000.2084281-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-18 15:36:06 -08:00
Alexei Starovoitov
06103dccbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR.

No conflicts.

Adjacent changes in:
Auto-merging include/linux/bpf.h
Auto-merging include/linux/bpf_verifier.h
Auto-merging kernel/bpf/btf.c
Auto-merging kernel/bpf/verifier.c
Auto-merging kernel/trace/bpf_trace.c
Auto-merging tools/testing/selftests/bpf/progs/test_tp_btf_nullable.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-16 08:53:59 -08:00
Linus Torvalds
acd855a949 Merge tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:

 - Prevent incorrect dequeueing of the deadline dlserver helper task and
   fix its time accounting

 - Properly track the CFS runqueue runnable stats

 - Check the total number of all queued tasks in a sched fair's runqueue
   hierarchy before deciding to stop the tick

 - Fix the scheduling of the task that got woken last (NEXT_BUDDY) by
   preventing those from being delayed

* tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/dlserver: Fix dlserver time accounting
  sched/dlserver: Fix dlserver double enqueue
  sched/eevdf: More PELT vs DELAYED_DEQUEUE
  sched/fair: Fix sched_can_stop_tick() for fair tasks
  sched/fair: Fix NEXT_BUDDY
2024-12-15 09:38:03 -08:00
Linus Torvalds
35f301dd45 Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Daniel Borkmann:

 - Fix a bug in the BPF verifier to track changes to packet data
   property for global functions (Eduard Zingerman)

 - Fix a theoretical BPF prog_array use-after-free in RCU handling of
   __uprobe_perf_func (Jann Horn)

 - Fix BPF tracing to have an explicit list of tracepoints and their
   arguments which need to be annotated as PTR_MAYBE_NULL (Kumar
   Kartikeya Dwivedi)

 - Fix a logic bug in the bpf_remove_insns code where a potential error
   would have been wrongly propagated (Anton Protopopov)

 - Avoid deadlock scenarios caused by nested kprobe and fentry BPF
   programs (Priya Bala Govindasamy)

 - Fix a bug in BPF verifier which was missing a size check for
   BTF-based context access (Kumar Kartikeya Dwivedi)

 - Fix a crash found by syzbot through an invalid BPF prog_array access
   in perf_event_detach_bpf_prog (Jiri Olsa)

 - Fix several BPF sockmap bugs including a race causing a refcount
   imbalance upon element replace (Michal Luczaj)

 - Fix a use-after-free from mismatching BPF program/attachment RCU
   flavors (Jann Horn)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
  bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
  selftests/bpf: Add tests for raw_tp NULL args
  bpf: Augment raw_tp arguments with PTR_MAYBE_NULL
  bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
  selftests/bpf: Add test for narrow ctx load for pointer args
  bpf: Check size for BTF-based ctx access of pointer members
  selftests/bpf: extend changes_pkt_data with cases w/o subprograms
  bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs
  bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()
  bpf: fix potential error return
  selftests/bpf: validate that tail call invalidates packet pointers
  bpf: consider that tail calls invalidate packet pointers
  selftests/bpf: freplace tests for tracking of changes_packet_data
  bpf: check changes_pkt_data property for extension programs
  selftests/bpf: test for changing packet data from global functions
  bpf: track changes_pkt_data property for global functions
  bpf: refactor bpf_helper_changes_pkt_data to use helper number
  bpf: add find_containing_subprog() utility function
  bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog
  bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors
  ...
2024-12-14 12:58:14 -08:00
Priya Bala Govindasamy
c83508da56 bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
BPF program types like kprobe and fentry can cause deadlocks in certain
situations. If a function takes a lock and one of these bpf programs is
hooked to some point in the function's critical section, and if the
bpf program tries to call the same function and take the same lock it will
lead to deadlock. These situations have been reported in the following
bug reports.

In percpu_freelist -
Link: https://lore.kernel.org/bpf/CAADnVQLAHwsa+2C6j9+UC6ScrDaN9Fjqv1WjB1pP9AzJLhKuLQ@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com/T/
In bpf_lru_list -
Link: https://lore.kernel.org/bpf/CAPPBnEajj+DMfiR_WRWU5=6A7KKULdB5Rob_NJopFLWF+i9gCA@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEZQDVN6VqnQXvVqGoB+ukOtHGZ9b9U0OLJJYvRoSsMY_g@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEaCB1rFAYU7Wf8UxqcqOWKmRPU1Nuzk3_oLk6qXR7LBOA@mail.gmail.com/T/

Similar bugs have been reported by syzbot.
In queue_stack_maps -
Link: https://lore.kernel.org/lkml/0000000000004c3fc90615f37756@google.com/
Link: https://lore.kernel.org/all/20240418230932.2689-1-hdanton@sina.com/T/
In lpm_trie -
Link: https://lore.kernel.org/linux-kernel/00000000000035168a061a47fa38@google.com/T/
In ringbuf -
Link: https://lore.kernel.org/bpf/20240313121345.2292-1-hdanton@sina.com/T/

Prevent kprobe and fentry bpf programs from attaching to these critical
sections by removing CC_FLAGS_FTRACE for percpu_freelist.o,
bpf_lru_list.o, queue_stack_maps.o, lpm_trie.o, ringbuf.o files.

The bugs reported by syzbot are due to tracepoint bpf programs being
called in the critical sections. This patch does not aim to fix deadlocks
caused by tracepoint programs. However, it does prevent deadlocks from
occurring in similar situations due to kprobe and fentry programs.

Signed-off-by: Priya Bala Govindasamy <pgovind2@uci.edu>
Link: https://lore.kernel.org/r/CAPPBnEZpjGnsuA26Mf9kYibSaGLm=oF6=12L21X1GEQdqjLnzQ@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-14 09:49:27 -08:00
Kumar Kartikeya Dwivedi
838a10bd2e bpf: Augment raw_tp arguments with PTR_MAYBE_NULL
Arguments to a raw tracepoint are tagged as trusted, which carries the
semantics that the pointer will be non-NULL.  However, in certain cases,
a raw tracepoint argument may end up being NULL. More context about this
issue is available in [0].

Thus, there is a discrepancy between the reality, that raw_tp arguments can
actually be NULL, and the verifier's knowledge, that they are never NULL,
causing explicit NULL-check branches to be eliminated as dead code.

A previous attempt [1], i.e. the second commit in the Fixes tags below,
was made to simulate symbolic execution as if, in most accesses, the
argument is a non-NULL raw_tp, except for conditional jumps. This tried
to suppress branch prediction while preserving compatibility, but surfaced issues
with production programs that were difficult to solve without increasing
verifier complexity. A more complete discussion of issues and fixes is
available at [2].

Fix this by maintaining an explicit list of tracepoints where the
arguments are known to be NULL, and mark the positional arguments as
PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments
are known to be ERR_PTR, and mark these arguments as scalar values to
prevent potential dereference.

Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2),
shifted left by the zero-indexed argument number times 4. This can be
represented as follows:
1st arg: 0x1
2nd arg: 0x10
3rd arg: 0x100
... and so on (likewise for ERR_PTR case).
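
For instance, a hypothetical entry saying "1st argument may be NULL, 3rd
argument may be an ERR_PTR" could be encoded as follows (illustrative field
names, not copied from the raw_tp_null_args table):

    /* 0x1 for arg 0 (NULL), 0x2 << (2 * 4) for arg 2 (ERR_PTR) => 0x201 */
    { .func = "some_tracepoint", .mask = 0x1 | (0x2 << (2 * 4)) },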

In the future, an automated pass will be used to produce such a list, or
insert __nullable annotations automatically for tracepoints. Each
compilation unit will be analyzed and results will be collated to find
whether a tracepoint pointer is definitely not null, maybe null, or an
unknown state where verifier conservatively marks it PTR_MAYBE_NULL.
A proof of concept of this tool from Eduard is available at [3].

Note that in case we don't find a specification in the raw_tp_null_args
array and the tracepoint belongs to a kernel module, we will
conservatively mark the arguments as PTR_MAYBE_NULL. This is because
unlike for in-tree modules, out-of-tree module tracepoints may pass NULL
freely to the tracepoint. We don't protect against such tracepoints
passing ERR_PTR (which is uncommon anyway), lest we mark all such
arguments as SCALAR_VALUE.

While we are at it, let's adjust the test raw_tp_null to not perform a
dereference of skb->mark, as that won't be allowed anymore, and make
it more robust by using inline assembly to test the dead code
elimination behavior, which should still stay the same.

  [0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb
  [1]: https://lore.kernel.org/all/20241104171959.2938862-1-memxor@gmail.com
  [2]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com
  [3]: https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params

Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug
Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix
Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Fixes: cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-13 16:24:53 -08:00
Kumar Kartikeya Dwivedi
c00d738e16 bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
This patch reverts commit
cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The
patch was well-intended and meant as a stop-gap fixing branch
prediction when the pointer may actually be NULL at runtime. Eventually,
it was supposed to be replaced by an automated script or compiler pass
detecting possibly NULL arguments and marking them accordingly.

However, it caused two main issues observed for production programs and
failed to preserve backwards compatibility. First, programs relied on
the verifier not exploring the == NULL branch when the pointer is not
NULL, thus they started failing with a 'dereference of scalar' error.
Next, allowing raw_tp arguments to be modified surfaced the warning in
the verifier that warns against reg->off when PTR_MAYBE_NULL is set.

More information, context, and discussion on both problems is available
in [0]. Overall, this approach had several shortcomings: the fixes
would further complicate the verifier's logic, and the entire masking
scheme would have to be removed eventually anyway.

Hence, revert the patch in preparation for a better fix that avoids
these issues and replaces this commit.

  [0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com

Reported-by: Manu Bretelle <chantra@meta.com>
Fixes: cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-13 16:24:53 -08:00
Thomas Weißschuh
00a5acdbf3 bpf: Fix configuration-dependent BTF function references
These BTF functions are not available unconditionally;
only reference them when they are available.

Avoid the following build warnings:

  BTF     .tmp_vmlinux1.btf.o
btf_encoder__tag_kfunc: failed to find kfunc 'bpf_send_signal_task' in BTF
btf_encoder__tag_kfuncs: failed to tag kfunc 'bpf_send_signal_task'
  NM      .tmp_vmlinux1.syms
  KSYMS   .tmp_vmlinux1.kallsyms.S
  AS      .tmp_vmlinux1.kallsyms.o
  LD      .tmp_vmlinux2
  NM      .tmp_vmlinux2.syms
  KSYMS   .tmp_vmlinux2.kallsyms.S
  AS      .tmp_vmlinux2.kallsyms.o
  LD      vmlinux
  BTFIDS  vmlinux
WARN: resolve_btfids: unresolved symbol prog_test_ref_kfunc
WARN: resolve_btfids: unresolved symbol bpf_crypto_ctx
WARN: resolve_btfids: unresolved symbol bpf_send_signal_task
WARN: resolve_btfids: unresolved symbol bpf_modify_return_test_tp
WARN: resolve_btfids: unresolved symbol bpf_dynptr_from_xdp
WARN: resolve_btfids: unresolved symbol bpf_dynptr_from_skb

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213-bpf-cond-ids-v1-1-881849997219@weissschuh.net
2024-12-13 15:06:51 -08:00
Anton Protopopov
4d3ae294f9 bpf: Add fd_array_cnt attribute for prog_load
The fd_array attribute of the BPF_PROG_LOAD syscall may contain a set
of file descriptors: maps or btfs. This field was introduced as a
sparse array. Introduce a new attribute, fd_array_cnt, which, if
present, indicates that the fd_array is a contiguous array of the
corresponding length.

If fd_array_cnt is non-zero, then every map in the fd_array will be
bound to the program, as if it was used by the program. This
functionality is similar to the BPF_PROG_BIND_MAP syscall, but such
maps can be used by the verifier during the program load.
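
From user space this could be exercised roughly as follows (hedged sketch of
a raw BPF_PROG_LOAD call; insns, insn_cnt and the two map fds are assumed to
exist already):

    union bpf_attr attr = {};
    int map_fds[2] = { map_fd_a, map_fd_b };

    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns = (unsigned long)insns;
    attr.insn_cnt = insn_cnt;
    attr.license = (unsigned long)"GPL";
    attr.fd_array = (unsigned long)map_fds;
    attr.fd_array_cnt = 2;    /* fd_array is a contiguous array of 2 fds */

    prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));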

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213130934.1087929-5-aspsk@isovalent.com
2024-12-13 14:48:36 -08:00
Anton Protopopov
76145f7255 bpf: Refactor check_pseudo_btf_id
Introduce a helper to add btfs to the env->used_maps array. Use it
to simplify the check_pseudo_btf_id() function. This new helper will
also be re-used in a subsequent patch.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213130934.1087929-4-aspsk@isovalent.com
2024-12-13 14:45:58 -08:00
Anton Protopopov
928f3221cb bpf: Move map/prog compatibility checks
Move some inlined map/prog compatibility checks from the
resolve_pseudo_ldimm64() function to the dedicated
check_map_prog_compatibility() function. Call the latter function
from the add_used_map_from_fd() function directly.

This simplifies code and optimizes logic a bit, as before these
changes the check_map_prog_compatibility() function was executed on
every map usage, which doesn't make sense, as it doesn't include any
per-instruction checks, only map type vs. prog type.

(This patch also simplifies a subsequent patch which will call the
add_used_map_from_fd() function from another code path.)

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213130934.1087929-3-aspsk@isovalent.com
2024-12-13 14:45:58 -08:00
Anton Protopopov
4e885fab71 bpf: Add a __btf_get_by_fd helper
Add a new helper to get a pointer to a struct btf from a file
descriptor. This helper doesn't increase the refcnt. Add a comment
explaining this and pointing to a corresponding function which
does take a reference.

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213130934.1087929-2-aspsk@isovalent.com
2024-12-13 14:45:58 -08:00
Vineeth Pillai (Google)
c7f7e9c731 sched/dlserver: Fix dlserver time accounting
dlserver time is accounted when:
 - dlserver is active and the dlserver proxies the cfs task.
 - dlserver is active but deferred and cfs task runs after being picked
   through the normal fair class pick.

dl_server_update is called in two places to make sure that both the
above times are accounted for. But it doesn't check if dlserver is
active or not. Now that we have this dl_server_active flag, we can
consolidate dl_server_update into one place and all we need to check is
whether dlserver is active or not. When dlserver is active there are only
two possible conditions:
 - dlserver is deferred.
 - cfs task is running on behalf of dlserver.

Fixes: a110a81c52 ("sched/deadline: Deferrable dl server")
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lore.kernel.org/r/20241213032244.877029-2-vineeth@bitbyteword.org
2024-12-13 12:57:35 +01:00
Vineeth Pillai (Google)
b53127db1d sched/dlserver: Fix dlserver double enqueue
dlserver can get dequeued during a dlserver pick_task due to the delayed
dequeue feature, and this can lead to issues with the dlserver logic as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused and can lead to a double enqueue of
dlserver.

Double enqueue of dlserver could happen due to a couple of reasons:

Case 1
------

Delayed dequeue feature[1] can cause dlserver being stopped during a
pick initiated by dlserver:
  __pick_next_task
   pick_task_dl -> server_pick_task
    pick_task_fair
     pick_next_entity (if (sched_delayed))
      dequeue_entities
       dl_server_stop

server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.

Case 2
------
A race condition between a task dequeue on one cpu and the same task's
enqueue on this cpu by a remote cpu, while the lock is released, can cause
a dlserver double enqueue.

One cpu would be in the schedule() and releasing RQ-lock:

current->state = TASK_INTERRUPTIBLE();
        schedule();
          deactivate_task()
            dl_stop_server();
          pick_next_task()
            pick_next_task_fair()
              sched_balance_newidle()
                rq_unlock(this_rq)

at which point another CPU can take our RQ-lock and do:

        try_to_wake_up()
          ttwu_queue()
            rq_lock()
            ...
            activate_task()
              dl_server_start() --> first enqueue
            wakeup_preempt() := check_preempt_wakeup_fair()
              update_curr()
                update_curr_task()
                  if (current->dl_server)
                    dl_server_update()
                      enqueue_dl_entity() --> second enqueue

This bug was not apparent as the enqueue in dl_server_start doesn't
usually happen because of the defer logic. But as a side effect of the
first case (dequeue during dlserver pick), dl_throttled and dl_yield will
be set, and this causes the time accounting of dlserver to mess up,
eventually leading to an enqueue in dl_server_start.

Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start and reset in dl_server_stop.

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org
2024-12-13 12:57:34 +01:00
Kumar Kartikeya Dwivedi
659b9ba7cb bpf: Check size for BTF-based ctx access of pointer members
Robert Morris reported the following program type which passes the
verifier in [0]:

SEC("struct_ops/bpf_cubic_init")
void BPF_PROG(bpf_cubic_init, struct sock *sk)
{
	asm volatile("r2 = *(u16*)(r1 + 0)");     // verifier should demand u64
	asm volatile("*(u32 *)(r2 +1504) = 0");   // 1280 in some configs
}

The second line may or may not work, but the first instruction shouldn't
pass, as it's a narrow load into the context structure of the struct ops
callback. The code falls back to btf_ctx_access to ensure correctness
and to obtain the types of pointers. Ensure that the size of the access
is correctly checked to be 8 bytes, otherwise the verifier thinks the
narrow load obtained a trusted BTF pointer and will permit loads/stores
as it sees fit.

Perform the check on size after we've verified that the load is for a
pointer field, as for scalar values narrow loads are fine. Access to
structs passed as arguments to a BPF program are also treated as
scalars, therefore no adjustment is needed in their case.

Existing verifier selftests are broken by this change, but only because
they were incorrect. Verifier tests for d_path were performing a narrow
load into the context to obtain the path pointer; had such a program
actually run, it would have caused a crash. The same holds for the
verifier_btf_ctx_access tests.

  [0]: https://lore.kernel.org/bpf/51338.1732985814@localhost

Fixes: 9e15db6613 ("bpf: Implement accurate raw_tp context access via BTF")
Reported-by: Robert Morris <rtm@mit.edu>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241212092050.3204165-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-12 11:40:18 -08:00
Eduard Zingerman
ac6542ad92 bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs
The bpf_prog_aux->func field might be NULL if the program does not have
subprograms other than the main sub-program. The fixed commit accesses
bpf_prog_aux->func unconditionally, which might lead to a null
pointer dereference.

The bug could be triggered by replacing the following BPF program:

    SEC("tc")
    int main_changes(struct __sk_buff *sk)
    {
        bpf_skb_pull_data(sk, 0);
        return 0;
    }

With the following BPF program:

    SEC("freplace")
    long changes_pkt_data(struct __sk_buff *sk)
    {
        return bpf_skb_pull_data(sk, 0);
    }

bpf_prog_aux instance itself represents the main sub-program,
use this property to fix the bug.

Fixes: 81f6d0530b ("bpf: check changes_pkt_data property for extension programs")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202412111822.qGw6tOyB-lkp@intel.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241212070711.427443-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-12 11:37:19 -08:00
Linus Torvalds
1594c49394 Merge tag 'probes-fixes-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull eprobes fix from Masami Hiramatsu:

 - release eprobe when failing to add dyn_event.

   This unregisters the event call and releases the eprobe when it fails
   to add a dynamic event. Found while cleaning up.

* tag 'probes-fixes-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/eprobe: Fix to release eprobe when failed to add dyn_event
2024-12-10 18:15:25 -08:00
Jann Horn
7d0d673627 bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()
Currently, the pointer stored in call->prog_array is loaded in
__uprobe_perf_func(), with no RCU annotation and no immediately visible
RCU protection, so it looks as if the loaded pointer can immediately be
dangling.
Later, bpf_prog_run_array_uprobe() starts an RCU-trace read-side critical
section, but this is too late. It then uses rcu_dereference_check(), but
this use of rcu_dereference_check() does not actually dereference anything.

Fix it by aligning the semantics to bpf_prog_run_array(): Let the caller
provide rcu_read_lock_trace() protection and then load call->prog_array
with rcu_dereference_check().
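
The resulting calling convention, roughly (hedged sketch mirroring other
bpf_prog_run_array() callers, not the exact uprobe code):

    rcu_read_lock_trace();
    prog_array = rcu_dereference_check(call->prog_array,
                                       rcu_read_lock_trace_held());
    ret = bpf_prog_run_array_uprobe(prog_array, ctx, bpf_prog_run);
    rcu_read_unlock_trace();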

This issue seems to be theoretical: I don't know of any way to reach this
code without having handle_swbp() further up the stack, which is already
holding a rcu_read_lock_trace() lock, so where we take
rcu_read_lock_trace() in __uprobe_perf_func()/bpf_prog_run_array_uprobe()
doesn't actually have any effect.

Fixes: 8c7dcb84e3 ("bpf: implement sleepable uprobes by chaining gps")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241210-bpf-fix-uprobe-uaf-v4-1-5fc8959b2b74@google.com
2024-12-10 13:06:51 -08:00
Anton Protopopov
c4441ca86a bpf: fix potential error return
The bpf_remove_insns() function returns WARN_ON_ONCE(error), where
error is the result of bpf_adj_branches(), and thus should always be 0.
However, if for any reason it is not 0, then it will be converted to a
boolean by WARN_ON_ONCE and returned to user space as 1, not as the actual
error value. Fix this by returning the original err after the WARN check.
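
In other words, something along these lines (hedged sketch of the intent, not
the verbatim hunk):

    int err = bpf_adj_branches(prog, off, off + cnt, off, false);

    WARN_ON_ONCE(err);    /* still warn, ... */
    if (err)
        return err;       /* ... but propagate the real error, not 1 */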

Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241210114245.836164-1-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10 11:17:53 -08:00
Eduard Zingerman
81f6d0530b bpf: check changes_pkt_data property for extension programs
When processing calls to global sub-programs, the verifier decides whether
to invalidate all packet pointers in the current state depending on the
changes_pkt_data property of the global sub-program.

Because of this, an extension program replacing a global sub-program
must be compatible with changes_pkt_data property of the sub-program
being replaced.

This commit:
- adds changes_pkt_data flag to struct bpf_prog_aux:
  - this flag is set in check_cfg() for main sub-program;
  - in jit_subprogs() for other sub-programs;
- modifies bpf_check_attach_btf_id() to check changes_pkt_data flag;
- moves call to check_attach_btf_id() after the call to check_cfg(),
  because it needs changes_pkt_data flag to be set:

    bpf_check:
      ...                             ...
    - check_attach_btf_id             resolve_pseudo_ldimm64
      resolve_pseudo_ldimm64   -->    bpf_prog_is_offloaded
      bpf_prog_is_offloaded           check_cfg
      check_cfg                     + check_attach_btf_id
      ...                             ...

The following fields are set by check_attach_btf_id():
- env->ops
- prog->aux->attach_btf_trace
- prog->aux->attach_func_name
- prog->aux->attach_func_proto
- prog->aux->dst_trampoline
- prog->aux->mod
- prog->aux->saved_dst_attach_type
- prog->aux->saved_dst_prog_type
- prog->expected_attach_type

Neither of these fields are used by resolve_pseudo_ldimm64() or
bpf_prog_offload_verifier_prep() (for netronome and netdevsim
drivers), so the reordering is safe.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10 10:24:57 -08:00
Eduard Zingerman
51081a3f25 bpf: track changes_pkt_data property for global functions
When processing calls to certain helpers, the verifier invalidates all
packet pointers in the current state. For example, consider the
following program:

    __attribute__((__noinline__))
    long skb_pull_data(struct __sk_buff *sk, __u32 len)
    {
        return bpf_skb_pull_data(sk, len);
    }

    SEC("tc")
    int test_invalidate_checks(struct __sk_buff *sk)
    {
        int *p = (void *)(long)sk->data;
        if ((void *)(p + 1) > (void *)(long)sk->data_end) return TCX_DROP;
        skb_pull_data(sk, 0);
        *p = 42;
        return TCX_PASS;
    }

After a call to bpf_skb_pull_data() the pointer 'p' can't be used
safely. See function filter.c:bpf_helper_changes_pkt_data() for a list
of such helpers.

At the moment verifier invalidates packet pointers when processing
helper function calls, and does not traverse global sub-programs when
processing calls to global sub-programs. This means that calls to
helpers done from global sub-programs do not invalidate pointers in
the caller state. E.g. the program above is unsafe, but is not
rejected by verifier.

This commit fixes the omission by computing field
bpf_subprog_info->changes_pkt_data for each sub-program before main
verification pass.
changes_pkt_data should be set if:
- subprogram calls helper for which bpf_helper_changes_pkt_data
  returns true;
- subprogram calls a global function,
  for which bpf_subprog_info->changes_pkt_data should be set.

The verifier.c:check_cfg() pass is modified to compute this
information. The commit relies on depth first instruction traversal
done by check_cfg() and absence of recursive function calls:
- check_cfg() would eventually visit every call to subprogram S in a
  state when S is fully explored;
- when S is fully explored:
  - every direct helper call within S is explored
    (and thus changes_pkt_data is set if needed);
  - every call to subprogram S1 called by S was visited with S1 fully
    explored (and thus S inherits changes_pkt_data from S1).

The downside of such approach is that dead code elimination is not
taken into account: if a helper call inside global function is dead
because of current configuration, verifier would conservatively assume
that the call occurs for the purpose of the changes_pkt_data
computation.

Reported-by: Nick Zavaritsky <mejedi@gmail.com>
Closes: https://lore.kernel.org/bpf/0498CA22-5779-4767-9C0C-A9515CEA711F@gmail.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10 10:24:57 -08:00
Eduard Zingerman
b238e187b4 bpf: refactor bpf_helper_changes_pkt_data to use helper number
Use the BPF helper number instead of the function pointer in
bpf_helper_changes_pkt_data(). This simplifies usage of this
function in verifier.c:check_cfg() (in a follow-up patch),
where only the helper number is easily available and there is no real
need to look up the helper proto.
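
After the refactor the function can be keyed on the helper id, roughly like
this (hedged sketch showing only a couple of the affected helpers):

    bool bpf_helper_changes_pkt_data(enum bpf_func_id func_id)
    {
        switch (func_id) {
        case BPF_FUNC_skb_pull_data:
        case BPF_FUNC_skb_adjust_room:
            /* ... plus the other data-modifying helpers ... */
            return true;
        default:
            return false;
        }
    }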

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10 10:24:57 -08:00
Eduard Zingerman
27e88bc4df bpf: add find_containing_subprog() utility function
Add a utility function that looks for the subprogram containing a given
instruction index, and rewrite find_subprog() to use this function.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-10 10:24:57 -08:00
Jiri Olsa
978c4486cc bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog
Syzbot reported [1] crash that happens for following tracing scenario:

  - create a tracepoint perf event with attr.inherit=1, attach it to the
    process and set a bpf program on it
  - attached process forks -> child creates an inherited event

    the new child event shares the parent's bpf program and tp_event
    (hence prog_array) which is global for tracepoint

  - exit both process and its child -> release both events
  - first perf_event_detach_bpf_prog call will release tp_event->prog_array
    and second perf_event_detach_bpf_prog will crash, because
    tp_event->prog_array is NULL

The fix makes sure that perf_event_detach_bpf_prog checks that prog_array
is valid before it tries to remove the bpf program from it.

[1] https://lore.kernel.org/bpf/Z1MR6dCIKajNS6nU@krava/T/#m91dbf0688221ec7a7fc95e896a7ef9ff93b0b8ad

Fixes: 0ee288e69d ("bpf,perf: Fix perf_event_detach_bpf_prog error handling")
Reported-by: syzbot+2e0d2840414ce817aaac@syzkaller.appspotmail.com
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241208142507.1207698-1-jolsa@kernel.org
2024-12-10 10:16:28 -08:00
Jann Horn
ef1b808e3b bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors
Uprobes always use bpf_prog_run_array_uprobe() under tasks-trace-RCU
protection. But it is possible to attach a non-sleepable BPF program to a
uprobe, and non-sleepable BPF programs are freed via normal RCU (see
__bpf_prog_put_noref()). This leads to UAF of the bpf_prog because a normal
RCU grace period does not imply a tasks-trace-RCU grace period.

Fix it by explicitly waiting for a tasks-trace-RCU grace period after
removing the attachment of a bpf_prog to a perf_event.

Fixes: 8c7dcb84e3 ("bpf: implement sleepable uprobes by chaining gps")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20241210-bpf-fix-actual-uprobe-uaf-v1-1-19439849dd44@google.com
2024-12-10 10:14:02 -08:00
Linus Torvalds
df9e2102de Merge tag 'sched_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:

 - Remove wrong enqueueing of a task for a later wakeup when a task
   blocks on a RT mutex

 - Do not setup a new deadline entity on a boosted task as that has
   happened already

 - Update preempt= kernel command line param

 - Prevent needless softirqd wakeups in the idle task's context

 - Detect the case where the idle load balancer CPU becomes busy and
   avoid unnecessary load balancing invocation

 - Remove an unnecessary load balancing need_resched() call in
   nohz_csd_func()

 - Allow for raising of SCHED_SOFTIRQ softirq type on RT but retain the
   warning to catch any other cases

 - Remove a wrong warning when a cpuset update makes the task affinity
   no longer a subset of the cpuset

* tag 'sched_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking: rtmutex: Fix wake_q logic in task_blocks_on_rt_mutex
  sched/deadline: Fix warning in migrate_enable for boosted tasks
  sched/core: Update kernel boot parameters for LAZY preempt.
  sched/core: Prevent wakeup of ksoftirqd during idle load balance
  sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy
  sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
  softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel
  sched: fix warning in sched_setaffinity
  sched/deadline: Fix replenish_dl_new_period dl_server condition
2024-12-09 10:28:55 -08:00
Linus Torvalds
32913f3482 futex: fix user access on powerpc
The powerpc user access code is special, and unlike other architectures
distinguishes between user access for reading and writing.

And commit 43a43faf53 ("futex: improve user space accesses") messed
that up.  It went undetected elsewhere, but caused ppc32 to fail early
during boot, because the user access had been started with
user_read_access_begin(), but then finished off with just a plain
"user_access_end()".

Note that the address-masking user access helpers don't even have that
read-vs-write distinction, so if powerpc ever wants to do address
masking tricks, we'll have to do some extra work for it.
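
The pairing rule on powerpc, in sketch form (hypothetical function, shown only
to illustrate matching the read-flavored begin/end calls):

    static int read_user_u32(u32 __user *uaddr, u32 *val)
    {
        /* powerpc distinguishes read from write user access, so the end
         * call must match the begin call: user_read_access_end() here,
         * not a plain user_access_end().
         */
        if (!user_read_access_begin(uaddr, sizeof(*uaddr)))
            return -EFAULT;
        unsafe_get_user(*val, uaddr, efault);
        user_read_access_end();
        return 0;
    efault:
        user_read_access_end();
        return -EFAULT;
    }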

[ Make sure to also do it for the EFAULT case, as pointed out by
  Christophe Leroy ]

Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/all/87bjxl6b0i.fsf@igel.home/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-09 10:00:25 -08:00
Peter Zijlstra
76f2f78329 sched/eevdf: More PELT vs DELAYED_DEQUEUE
Vincent and Dietmar noted that while
commit fc1892becd ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") fixes
the entity runnable stats, it does not adjust the cfs_rq runnable stats,
which are based off of h_nr_running.

Track h_nr_delayed such that we can discount those and adjust the
signal.

Fixes: fc1892becd ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE")
Closes: https://lore.kernel.org/lkml/a9a45193-d0c6-4ba2-a822-464ad30b550e@arm.com/
Closes: https://lore.kernel.org/lkml/CAKfTPtCNUvWE_GX5LyvTF-WdxUT=ZgvZZv-4t=eWntg5uOFqiQ@mail.gmail.com/
[ Fixes checkpatch warnings and rebased ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-3-vincent.guittot@linaro.org
2024-12-09 11:48:09 +01:00
Vincent Guittot
c1f43c342e sched/fair: Fix sched_can_stop_tick() for fair tasks
We can't stop the tick of a rq if there are at least 2 tasks enqueued in
the whole hierarchy and not only at the root cfs rq.

rq->cfs.nr_running tracks the number of sched_entity at one level
whereas rq->cfs.h_nr_running tracks all queued tasks in the
hierarchy.

Fixes: 11cc374f46 ("sched_ext: Simplify scx_can_stop_tick() invocation in sched_can_stop_tick()")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20241202174606.4074512-2-vincent.guittot@linaro.org
2024-12-09 11:48:09 +01:00
K Prateek Nayak
493afbd187 sched/fair: Fix NEXT_BUDDY
Adam reports that enabling NEXT_BUDDY insta triggers a WARN in
pick_next_entity().

Moving clear_buddies() up before the delayed dequeue bits ensures
no ->next buddy becomes delayed. Further ensure no new ->next buddy
ever starts as delayed.

Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Adam Li <adamli@os.amperecomputing.com>
Link: https://lkml.kernel.org/r/670a0d54-e398-4b1f-8a6e-90784e2fdf89@amd.com
2024-12-09 11:48:09 +01:00
Alexei Starovoitov
442bc81bd3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Cross-merge bpf fixes after downstream PR.

Trivial conflict:
tools/testing/selftests/bpf/prog_tests/verifier.c

Adjacent changes in:
Auto-merging kernel/bpf/verifier.c
Auto-merging samples/bpf/Makefile
Auto-merging tools/testing/selftests/bpf/.gitignore
Auto-merging tools/testing/selftests/bpf/Makefile
Auto-merging tools/testing/selftests/bpf/prog_tests/verifier.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-08 17:01:51 -08:00
Linus Torvalds
eadaac4dd2 Merge tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:

 - Fix a /proc/interrupts formatting regression

 - Have the BCM2836 interrupt controller enter power management states
   properly

 - Other fixlets

* tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/stm32mp-exti: CONFIG_STM32MP_EXTI should not default to y when compile-testing
  genirq/proc: Add missing space separator back
  irqchip/bcm2836: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
  irqchip/gic-v3: Fix irq_complete_ack() comment
2024-12-08 11:54:04 -08:00
Linus Torvalds
c25ca0c2e4 Merge tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:

 - Handle the case where clocksources with small counter width can,
   in conjunction with overly long idle sleeps, falsely trigger the
   negative motion detection of clocksources

* tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Make negative motion detection more robust
2024-12-08 11:51:29 -08:00
Linus Torvalds
553c89ec31 Merge tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "24 hotfixes.  17 are cc:stable.  15 are MM and 9 are non-MM.

  The usual bunch of singletons - please see the relevant changelogs for
  details"

* tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
  iio: magnetometer: yas530: use signed integer type for clamp limits
  sched/numa: fix memory leak due to the overwritten vma->numab_state
  mm/damon: fix order of arguments in damos_before_apply tracepoint
  lib: stackinit: hide never-taken branch from compiler
  mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()
  scatterlist: fix incorrect func name in kernel-doc
  mm: correct typo in MMAP_STATE() macro
  mm: respect mmap hint address when aligning for THP
  mm: memcg: declare do_memsw_account inline
  mm/codetag: swap tags when migrate pages
  ocfs2: update seq_file index in ocfs2_dlm_seq_next
  stackdepot: fix stack_depot_save_flags() in NMI context
  mm: open-code page_folio() in dump_page()
  mm: open-code PageTail in folio_flags() and const_folio_flags()
  mm: fix vrealloc()'s KASAN poisoning logic
  Revert "readahead: properly shorten readahead when falling back to do_page_cache_ra()"
  selftests/damon: add _damon_sysfs.py to TEST_FILES
  selftest: hugetlb_dio: fix test naming
  ocfs2: free inode when ocfs2_get_init_inode() fails
  nilfs2: fix potential out-of-bounds memory access in nilfs_find_entry()
  ...
2024-12-08 11:26:13 -08:00
Masami Hiramatsu (Google)
494b332064 tracing/eprobe: Fix to release eprobe when failed to add dyn_event
Fix the eprobe event to unregister the event call and release the eprobe
when it fails to add the dynamic event correctly.

Link: https://lore.kernel.org/all/173289886698.73724.1959899350183686006.stgit@devnote2/

Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-08 23:25:09 +09:00