
303 Commits

Author SHA1 Message Date
Saket Kumar Bhaskar
7b6216baae sched_ext: Fix scx_enable() crash on helper kthread creation failure
A crash was observed when the sched_ext selftests runner was
terminated with Ctrl+\ while test 15 was running:

NIP [c00000000028fa58] scx_enable.constprop.0+0x358/0x12b0
LR [c00000000028fa2c] scx_enable.constprop.0+0x32c/0x12b0
Call Trace:
scx_enable.constprop.0+0x32c/0x12b0 (unreliable)
bpf_struct_ops_link_create+0x18c/0x22c
__sys_bpf+0x23f8/0x3044
sys_bpf+0x2c/0x6c
system_call_exception+0x124/0x320
system_call_vectored_common+0x15c/0x2ec

kthread_run_worker() returns an ERR_PTR() on failure rather than NULL,
but the current code in scx_alloc_and_add_sched() only checks for a NULL
helper. In case of failure on SIGQUIT, the error is not handled in
scx_alloc_and_add_sched() and scx_enable() ends up dereferencing an
error pointer.

Fix the error handling in scx_alloc_and_add_sched() to propagate
PTR_ERR() into ret, so that scx_enable() jumps to the existing error
path instead of dereferencing an error pointer.
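A minimal user-space sketch of why a NULL check misses this failure. It re-creates the kernel's error-pointer helpers (ERR_PTR(), IS_ERR() and PTR_ERR() from include/linux/err.h, assuming MAX_ERRNO is 4095); create_helper(), alloc_sched_buggy() and alloc_sched_fixed() are hypothetical stand-ins for kthread_run_worker() and scx_alloc_and_add_sched(), not the actual kernel code, and -EINTR is just an example errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space copy of the kernel's error-pointer idiom: the last
 * MAX_ERRNO addresses encode a negative errno value. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for kthread_run_worker(): on failure it returns
 * an encoded errno, never NULL. */
static void *create_helper(bool fatal_signal_pending)
{
	static int worker;	/* dummy object representing a worker */

	if (fatal_signal_pending)
		return ERR_PTR(-EINTR);
	return &worker;
}

/* Buggy: an ERR_PTR() value is non-NULL, so the failure goes unnoticed
 * and a later dereference of *helper would crash. */
static int alloc_sched_buggy(bool fatal_signal_pending, void **helper)
{
	*helper = create_helper(fatal_signal_pending);
	if (!*helper)
		return -ENOMEM;
	return 0;
}

/* Fixed: decode the error and hand it back to the caller. */
static int alloc_sched_fixed(bool fatal_signal_pending, void **helper)
{
	*helper = create_helper(fatal_signal_pending);
	if (IS_ERR(*helper))
		return (int)PTR_ERR(*helper);
	return 0;
}
```

With a pending signal, alloc_sched_buggy() reports success while holding an encoded error pointer, whereas alloc_sched_fixed() returns -EINTR so the caller can take its error path.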

Fixes: bff3b5aec1 ("sched_ext: Move disable machinery into scx_sched")
Cc: stable@vger.kernel.org # v6.16+
Reported-and-tested-by: Samir Mulani <samir@linux.ibm.com>
Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20 08:45:43 -10:00
Zqiang
36c6f3c03d sched_ext: Use IRQ_WORK_INIT_HARD() to initialize rq->scx.kick_cpus_irq_work
On PREEMPT_RT kernels, kick_cpus_irq_workfn() is invoked in the
per-CPU irq_work/* task context, where it is not inside an RCU read-side
critical section. This commit therefore uses IRQ_WORK_INIT_HARD() to
initialize the per-CPU rq->scx.kick_cpus_irq_work in
init_sched_ext_class().

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-17 05:07:22 -10:00
Zqiang
a257e97421 sched_ext: Fix possible deadlock in the deferred_irq_workfn()
On PREEMPT_RT=y kernels, deferred_irq_workfn() is executed in the
per-CPU irq_work/* task context with interrupts enabled. If the rq
returned by container_of() is the current CPU's rq, the following
scenario may occur:

lock(&rq->__lock);
<Interrupt>
  lock(&rq->__lock);

This commit uses IRQ_WORK_INIT_HARD() instead of init_irq_work() to
initialize rq->scx.deferred_irq_work, so that deferred_irq_workfn() is
always invoked in hard-irq context.

Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-13 08:29:28 -10:00
Emil Tsalapatis
c87488a123 sched/ext: convert scx_tasks_lock to raw spinlock
Update scx_tasks_lock so that it's safe to lock/unlock in a
non-sleepable context in PREEMPT_RT kernels. scx_tasks_lock is a
(non-raw) spinlock used to protect the list of tasks under SCX.
This list is updated from finish_task_switch(), which
cannot sleep. Regular spinlocks can be locked in such a context
in non-RT kernels, but become sleepable when CONFIG_PREEMPT_RT=y.

Convert scx_tasks_lock into a raw spinlock, which is not sleepable
even on RT kernels.

Sample backtrace:

<TASK>
dump_stack_lvl+0x83/0xa0
__might_resched+0x14a/0x200
rt_spin_lock+0x61/0x1c0
? sched_ext_dead+0x2d/0xf0
? lock_release+0xc6/0x280
sched_ext_dead+0x2d/0xf0
? srso_alias_return_thunk+0x5/0xfbef5
finish_task_switch.isra.0+0x254/0x360
__schedule+0x584/0x11d0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? tick_nohz_idle_exit+0x7e/0x120
schedule_idle+0x23/0x40
cpu_startup_entry+0x29/0x30
start_secondary+0xf8/0x100
common_startup_64+0x13e/0x148
</TASK>

Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12 08:42:02 -10:00
Zqiang
5f02151c41 sched_ext: Fix unsafe locking in the scx_dump_state()
On kernels built with CONFIG_PREEMPT_RT=y, dump_lock is converted into a
sleepable spinlock that does not disable interrupts, so the following
scenario can occur:

inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
irq_work/0/27 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&rq->__lock){?...}-{2:2}, at: raw_spin_rq_lock_nested+0x2b/0x40
{IN-HARDIRQ-W} state was registered at:
   lock_acquire+0x1e1/0x510
   _raw_spin_lock_nested+0x42/0x80
   raw_spin_rq_lock_nested+0x2b/0x40
   sched_tick+0xae/0x7b0
   update_process_times+0x14c/0x1b0
   tick_periodic+0x62/0x1f0
   tick_handle_periodic+0x48/0xf0
   timer_interrupt+0x55/0x80
   __handle_irq_event_percpu+0x20a/0x5c0
   handle_irq_event_percpu+0x18/0xc0
   handle_irq_event+0xb5/0x150
   handle_level_irq+0x220/0x460
   __common_interrupt+0xa2/0x1e0
   common_interrupt+0xb0/0xd0
   asm_common_interrupt+0x2b/0x40
   _raw_spin_unlock_irqrestore+0x45/0x80
   __setup_irq+0xc34/0x1a30
   request_threaded_irq+0x214/0x2f0
   hpet_time_init+0x3e/0x60
   x86_late_time_init+0x5b/0xb0
   start_kernel+0x308/0x410
   x86_64_start_reservations+0x1c/0x30
   x86_64_start_kernel+0x96/0xa0
   common_startup_64+0x13e/0x148

 other info that might help us debug this:
 Possible unsafe locking scenario:

        CPU0
        ----
   lock(&rq->__lock);
   <Interrupt>
     lock(&rq->__lock);

  *** DEADLOCK ***

 stack backtrace:
 CPU: 0 UID: 0 PID: 27 Comm: irq_work/0
 Call Trace:
  <TASK>
  dump_stack_lvl+0x8c/0xd0
  dump_stack+0x14/0x20
  print_usage_bug+0x42e/0x690
  mark_lock.part.44+0x867/0xa70
  ? __pfx_mark_lock.part.44+0x10/0x10
  ? string_nocheck+0x19c/0x310
  ? number+0x739/0x9f0
  ? __pfx_string_nocheck+0x10/0x10
  ? __pfx_check_pointer+0x10/0x10
  ? kvm_sched_clock_read+0x15/0x30
  ? sched_clock_noinstr+0xd/0x20
  ? local_clock_noinstr+0x1c/0xe0
  __lock_acquire+0xc4b/0x62b0
  ? __pfx_format_decode+0x10/0x10
  ? __pfx_string+0x10/0x10
  ? __pfx___lock_acquire+0x10/0x10
  ? __pfx_vsnprintf+0x10/0x10
  lock_acquire+0x1e1/0x510
  ? raw_spin_rq_lock_nested+0x2b/0x40
  ? __pfx_lock_acquire+0x10/0x10
  ? dump_line+0x12e/0x270
  ? raw_spin_rq_lock_nested+0x20/0x40
  _raw_spin_lock_nested+0x42/0x80
  ? raw_spin_rq_lock_nested+0x2b/0x40
  raw_spin_rq_lock_nested+0x2b/0x40
  scx_dump_state+0x3b3/0x1270
  ? finish_task_switch+0x27e/0x840
  scx_ops_error_irq_workfn+0x67/0x80
  irq_work_single+0x113/0x260
  irq_work_run_list.part.3+0x44/0x70
  run_irq_workd+0x6b/0x90
  ? __pfx_run_irq_workd+0x10/0x10
  smpboot_thread_fn+0x529/0x870
  ? __pfx_smpboot_thread_fn+0x10/0x10
  kthread+0x305/0x3f0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x40/0x70
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>

This commit therefore replaces rq_lock/unlock() with
rq_lock_irqsave/irqrestore() in scx_dump_state().
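A toy single-CPU model of the scenario lockdep reports above; it is not kernel code, and struct toy_cpu and the toy_* functions are invented for illustration. An interrupt whose handler takes the same lock is raised inside the critical section: with interrupts left enabled the handler finds the lock already held, while the irqsave variant defers the interrupt until the lock has been released.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU state: one lock, one pending-interrupt slot. */
struct toy_cpu {
	bool irqs_enabled;
	bool irq_pending;
	bool lock_held;
	bool deadlock;	/* handler found the lock already held */
	int handled;	/* interrupts serviced successfully */
};

/* Hard-irq handler that takes the same lock, like sched_tick()
 * taking rq->__lock in the report above. */
static void toy_irq_handler(struct toy_cpu *c)
{
	if (c->lock_held) {
		c->deadlock = true;	/* real hardware would spin forever */
		return;
	}
	c->lock_held = true;
	c->lock_held = false;
	c->handled++;
}

/* An interrupt arrives: run it now, or leave it pending if masked. */
static void toy_raise_irq(struct toy_cpu *c)
{
	if (c->irqs_enabled)
		toy_irq_handler(c);
	else
		c->irq_pending = true;
}

/* Plain lock/unlock: interrupts stay enabled inside the section. */
static void toy_dump_plain(struct toy_cpu *c)
{
	c->lock_held = true;
	toy_raise_irq(c);	/* tick lands while the lock is held */
	c->lock_held = false;
}

/* irqsave/irqrestore: the interrupt is deferred until after unlock. */
static void toy_dump_irqsave(struct toy_cpu *c)
{
	bool was_enabled = c->irqs_enabled;

	c->irqs_enabled = false;		/* save and disable */
	c->lock_held = true;
	toy_raise_irq(c);
	c->lock_held = false;
	c->irqs_enabled = was_enabled;		/* restore */
	if (c->irqs_enabled && c->irq_pending) {
		c->irq_pending = false;
		toy_irq_handler(c);		/* now safe to run */
	}
}
```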

Fixes: 07814a9439 ("sched_ext: Print debug dump after an error exit")
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12 06:28:32 -10:00
Andrea Righi
f4fa7c25f6 sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()
scx_bpf_cpuperf_set() has a typo where it dereferences the uninitialized
local variable @sch instead of the global @scx_root pointer. Fix by
dereferencing the correct variable.

Fixes: 956f2b11a8 ("sched_ext: Drop kf_cpu_valid()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29 05:14:39 -10:00
Emil Tsalapatis
a3c4a0a42e sched_ext: fix flag check for deferred callbacks
When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING
instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests
whether there is already a pending request for queue_balance_callback() to
be invoked at the end of .balance().
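A toy sketch of the check-then-set pattern the fix restores. The flag names and values are hypothetical, not the real SCX_RQ_* constants. Deduplication only works when the bit that is tested is the same bit that is set; testing an unrelated bit lets every call through.

```c
#include <assert.h>

/* Hypothetical flag values, not the real SCX_RQ_* constants. */
enum {
	TOY_RQ_BAL_PENDING    = 1u << 0,	/* unrelated state bit */
	TOY_RQ_BAL_CB_PENDING = 1u << 1,	/* deferred callback requested */
};

struct toy_rq {
	unsigned int flags;
	int requests;	/* callback requests that were not deduplicated */
};

/* Record a deferred-callback request at most once per balance pass.
 * @test_bit is the flag consulted to detect an existing request. */
static void toy_schedule_deferred(struct toy_rq *rq, unsigned int test_bit)
{
	if (rq->flags & test_bit)
		return;		/* a request is already pending */
	rq->flags |= TOY_RQ_BAL_CB_PENDING;
	rq->requests++;
}
```

Calling toy_schedule_deferred() twice with TOY_RQ_BAL_PENDING records two requests, while testing TOY_RQ_BAL_CB_PENDING records only one.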

Fixes: a8ad873113 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16 08:34:00 -10:00
Andrea Righi
05e63305c8 sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads
If we load a BPF scheduler while another scheduler is already running,
alloc_kick_pseqs() would be called again, overwriting the previously
allocated arrays.

Fix by moving the alloc_kick_pseqs() call after the scx_enable_state()
check, ensuring that the arrays are only allocated when a scheduler can
actually be loaded.
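A user-space sketch of the corrected ordering under a simplified model: toy_enable(), toy_disable() and the calloc()-based per-CPU arrays are hypothetical, and the RCU-deferred freeing used by the real code is omitted. Because the state check comes first, a second load fails with -EBUSY and leaves the existing arrays untouched.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical enable state; the real code checks scx_enable_state(). */
enum toy_state { TOY_DISABLED, TOY_ENABLED };

struct toy_scx {
	enum toy_state state;
	int nr_cpus;
	unsigned long **kick_pseqs;	/* one array per CPU */
};

static void free_kick_pseqs(struct toy_scx *s)
{
	if (!s->kick_pseqs)
		return;
	for (int cpu = 0; cpu < s->nr_cpus; cpu++)
		free(s->kick_pseqs[cpu]);	/* free(NULL) is a no-op */
	free(s->kick_pseqs);
	s->kick_pseqs = NULL;
}

static int alloc_kick_pseqs(struct toy_scx *s)
{
	s->kick_pseqs = calloc(s->nr_cpus, sizeof(*s->kick_pseqs));
	if (!s->kick_pseqs)
		return -ENOMEM;
	for (int cpu = 0; cpu < s->nr_cpus; cpu++) {
		s->kick_pseqs[cpu] = calloc(s->nr_cpus, sizeof(unsigned long));
		if (!s->kick_pseqs[cpu]) {
			free_kick_pseqs(s);
			return -ENOMEM;
		}
	}
	return 0;
}

/* Fixed ordering: check the state before allocating anything. */
static int toy_enable(struct toy_scx *s)
{
	int ret;

	if (s->state != TOY_DISABLED)
		return -EBUSY;	/* another scheduler is already loaded */
	ret = alloc_kick_pseqs(s);
	if (ret)
		return ret;
	s->state = TOY_ENABLED;
	return 0;
}

static void toy_disable(struct toy_scx *s)
{
	free_kick_pseqs(s);
	s->state = TOY_DISABLED;
}
```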

Fixes: 14c1da3895 ("sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-14 10:29:17 -10:00
Tejun Heo
14c1da3895 sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()
On systems with >4096 CPUs, scx_kick_cpus_pnt_seqs allocation fails during
boot because it exceeds the 32,768 byte percpu allocator limit.

Restructure to use DEFINE_PER_CPU() for the per-CPU pointers, with each CPU
pointing to its own kvzalloc'd array. Move allocation from boot time to
scx_enable() and free in scx_disable(), so the O(nr_cpu_ids^2) memory is only
consumed when sched_ext is active.
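The size limit can be checked with a little arithmetic. This sketch assumes each per-CPU array holds one unsigned long sequence number per possible CPU (the element type is an assumption) and uses the 32,768-byte figure quoted above; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-allocation percpu limit quoted in the commit message. */
#define PCPU_ALLOC_LIMIT 32768u

/* Bytes each CPU needs for one sequence number per possible CPU,
 * assuming an unsigned long element. */
static size_t pseqs_bytes_per_cpu(size_t nr_cpu_ids)
{
	return nr_cpu_ids * sizeof(unsigned long);
}

/* True if a single percpu allocation of this size would be rejected. */
static bool exceeds_percpu_limit(size_t nr_cpu_ids)
{
	return pseqs_bytes_per_cpu(nr_cpu_ids) > PCPU_ALLOC_LIMIT;
}

/* Total across all CPUs: O(nr_cpu_ids^2). */
static size_t pseqs_total_bytes(size_t nr_cpu_ids)
{
	return nr_cpu_ids * pseqs_bytes_per_cpu(nr_cpu_ids);
}
```

On a 64-bit build (8-byte unsigned long), 4,096 CPUs need exactly 32,768 bytes per CPU, so any larger CPU count overflows the limit. The total at 4,096 CPUs is 4,096 arrays of 32 KiB, i.e. 128 MiB, which illustrates why only allocating it while sched_ext is active is worthwhile.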

Use RCU to guard against racing with free. Arrays are freed via call_rcu()
and kick_cpus_irq_workfn() uses rcu_dereference_bh() with a NULL check.

While at it, rename to scx_kick_pseqs for brevity and update comments to
clarify these are pick_task sequence numbers.

v2: RCU protect scx_kick_seqs to manage kick_cpus_irq_workfn() racing
    against disable as per Andrea.

v3: Fix bugs noticed by Andrea.

Reported-by: Phil Auld <pauld@redhat.com>
Link: http://lkml.kernel.org/r/20251007133523.GA93086@pauld.westford.csb
Cc: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:42:19 -10:00
Emil Tsalapatis
a8ad873113 sched_ext: defer queue_balance_callback() until after ops.dispatch
The sched_ext code calls queue_balance_callback() during enqueue_task()
to defer operations that drop multiple locks until we can unpin them.
The call assumes that the rq lock is held until the callbacks are
invoked, and the pending callbacks will not be visible to any other
threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock().

However, balance_one() may actually drop the lock during a BPF dispatch
call. Another thread may win the race to get the rq lock and see the
pending callback. To avoid this, sched_ext must only queue the callback
after the dispatch calls have completed.

CPU 0                   CPU 1           CPU 2

scx_balance()
  rq_unpin_lock()
  scx_balance_one()
    |= IN_BALANCE       scx_enqueue()
    ops.dispatch()
      rq_unlock()
                        rq_lock()
                        queue_balance_callback()
                        rq_unlock()
                                        [WARN] rq_pin_lock()
      rq_lock()
    &= ~IN_BALANCE
rq_repin_lock()
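A single-threaded toy model of the fix, assuming hypothetical flag names and a callback represented by a boolean: while IN_BALANCE is set, the enqueue path only records a pending request, and the balance pass publishes the callback after dispatch has finished. racing_dispatch() plays the role of CPU 1 in the diagram above and records what CPU 2 would observe.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for a toy runqueue. */
enum {
	TOY_IN_BALANCE     = 1u << 0,
	TOY_BAL_CB_PENDING = 1u << 1,
};

struct toy_rq {
	unsigned int flags;
	bool cb_on_list;	/* callback visible to other lock holders */
};

/* Enqueue path, called with the rq lock held. During a balance pass it
 * only records the request. Outside balance the toy publishes the
 * callback immediately; that part is a simplification. */
static void request_balance_cb(struct toy_rq *rq)
{
	if (rq->flags & TOY_IN_BALANCE) {
		rq->flags |= TOY_BAL_CB_PENDING;
		return;
	}
	rq->cb_on_list = true;
}

/* Balance pass: @dispatch may drop and retake the rq lock, during which
 * another CPU can run the enqueue path. */
static void toy_balance(struct toy_rq *rq, void (*dispatch)(struct toy_rq *))
{
	rq->flags |= TOY_IN_BALANCE;
	dispatch(rq);
	rq->flags &= ~TOY_IN_BALANCE;

	/* Dispatch is done: now it is safe to publish the callback. */
	if (rq->flags & TOY_BAL_CB_PENDING) {
		rq->flags &= ~TOY_BAL_CB_PENDING;
		rq->cb_on_list = true;
	}
}

/* Simulated dispatch that drops the lock: another CPU enqueues, and a
 * third CPU checks whether a callback is visible mid-dispatch. */
static bool seen_mid_dispatch;

static void racing_dispatch(struct toy_rq *rq)
{
	request_balance_cb(rq);			/* CPU 1: scx_enqueue() */
	seen_mid_dispatch = rq->cb_on_list;	/* CPU 2: rq_pin_lock() */
}
```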

Changelog

v1 -> v2 (https://lore.kernel.org/sched-ext/aOgOxtHCeyRT_7jn@gpd4)

- Fixed explanation in patch description (Andrea)
- Fixed scx_rq mask state updates (Andrea)
- Added Reviewed-by tag from Andrea

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:36:19 -10:00
Tejun Heo
efeeaac9ae sched_ext: Sync error_irq_work before freeing scx_sched
By the time scx_sched_free_rcu_work() runs, the scx_sched is no longer
reachable. However, a previously queued error_irq_work may still be pending or
running. Ensure it completes before proceeding with teardown.

Fixes: bff3b5aec1 ("sched_ext: Move disable machinery into scx_sched")
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:25:55 -10:00
Tejun Heo
54e96258a6 sched_ext: Mark scx_bpf_dsq_move_set_[slice|vtime]() with KF_RCU
scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() take a DSQ
iterator argument which has to be valid. Mark them with KF_RCU.

Fixes: 4c30f5ce4f ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-13 08:13:38 -10:00
Tejun Heo
df10932ad7 Revert "sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()"
This reverts commit c8191ee8e6 which triggers
the following suspicious RCU usage warning:

[    6.647598] =============================
[    6.647603] WARNING: suspicious RCU usage
[    6.647605] 6.17.0-rc7-virtme #1 Not tainted
[    6.647608] -----------------------------
[    6.647608] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage!
[    6.647610]
[    6.647610] other info that might help us debug this:
[    6.647610]
[    6.647612]
[    6.647612] rcu_scheduler_active = 2, debug_locks = 1
[    6.647613] 1 lock held by swapper/10/0:
[    6.647614]  #0: ffff8b14bbb3cc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0x90
[    6.647630]
[    6.647630] stack backtrace:
[    6.647633] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.17.0-rc7-virtme #1 PREEMPT(full)
[    6.647643] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    6.647646] Sched_ext: beerland_1.0.2_g27d63fc3_x86_64_unknown_linux_gnu (enabled+all)
[    6.647648] Call Trace:
[    6.647652]  <IRQ>
[    6.647655]  dump_stack_lvl+0x78/0xe0
[    6.647665]  lockdep_rcu_suspicious+0x14a/0x1b0
[    6.647672]  __rhashtable_lookup.constprop.0+0x1d5/0x250
[    6.647680]  find_dsq_for_dispatch+0xbc/0x190
[    6.647684]  do_enqueue_task+0x25b/0x550
[    6.647689]  enqueue_task_scx+0x21d/0x360
[    6.647692]  ? trace_lock_acquire+0x22/0xb0
[    6.647695]  enqueue_task+0x2e/0xd0
[    6.647698]  ttwu_do_activate+0xa2/0x290
[    6.647703]  sched_ttwu_pending+0xfd/0x250
[    6.647706]  __flush_smp_call_function_queue+0x1cd/0x610
[    6.647714]  __sysvec_call_function_single+0x34/0x150
[    6.647720]  sysvec_call_function_single+0x6e/0x80
[    6.647726]  </IRQ>
[    6.647726]  <TASK>
[    6.647727]  asm_sysvec_call_function_single+0x1a/0x20

Reported-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 20:38:23 -10:00
Tejun Heo
ebfd5226ec sched_ext: Merge branch 'for-6.17-fixes' into for-6.18
Pull sched_ext/for-6.17-fixes to receive:

 55ed11b181 ("sched_ext: idle: Handle migration-disabled tasks in BPF code")

which conflicts with the following commit in for-6.18:

 2407bae23d ("sched_ext: Add the @sch parameter to ext_idle helpers")

The conflict is a simple context conflict which can be resolved by taking
the updated parts from both commits.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:10:20 -10:00
Tejun Heo
c0008a5632 sched_ext: Misc updates around scx_sched instance pointer
In preparation for multiple scheduler support:

- Add the @sch parameter to find_global_dsq() and refill_task_slice_dfl().

- Restructure scx_allow_ttwu_queue() and make it read scx_root into $sch.

- Make RCU protection in scx_dsq_move() and scx_bpf_dsq_move_to_local()
  explicit.

v2: Add scx_root -> sch conversion in scx_allow_ttwu_queue().

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:03:26 -10:00
Tejun Heo
d4f7d86666 sched_ext: Drop scx_kf_exit() and scx_kf_error()
The intention behind scx_kf_exit/error() was that when called from kfuncs,
scx_kf_exit/error() would be able to implicitly determine the scx_sched
instance being operated on and thus wouldn't need the @sch parameter passed
in explicitly. This turned out to be unnecessarily complicated to implement
and not to have enough practical benefits. Replace scx_kf_exit/error() usages
with scx_exit/error() which take an explicit @sch parameter.

- Add the @sch parameter to scx_kf_allowed(), scx_kf_allowed_on_arg_tasks(),
  mark_direct_dispatch() and other intermediate functions transitively.

- In callers that don't already have @sch available, grab RCU, read
  $scx_root, verify it's not NULL and use it.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:03:26 -10:00
Tejun Heo
4d9553fee3 sched_ext: Add the @sch parameter to scx_dsq_insert_preamble/commit()
In preparation for multiple scheduler support, add the @sch parameter to
scx_dsq_insert_preamble/commit() and update the callers to read $scx_root
and pass it in. The passed in @sch parameter is not used yet.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:03:26 -10:00
Tejun Heo
956f2b11a8 sched_ext: Drop kf_cpu_valid()
The intention behind kf_cpu_valid() was that when called from kfuncs,
kf_cpu_valid() would be able to implicitly determine the scx_sched instance
being operated on and thus wouldn't need @sch passed in explicitly. This
turned out to be unnecessarily complicated to implement and to lack
justifiable practical benefits. Replace kf_cpu_valid() usages with
ops_cpu_valid() which takes explicit @sch.

Callers which don't have $sch available in the context are updated to read
$scx_root under RCU read lock, verify that it's not NULL and pass it in.

scx_bpf_cpu_rq() is restructured to use guard(rcu)() instead of explicit
rcu_read_[un]lock().

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:03:26 -10:00
Tejun Heo
fc6a93aa62 sched_ext: Add the @sch parameter to __bstr_format()
In preparation for multiple scheduler support, add the @sch parameter to
__bstr_format() and update the callers to read $scx_root, verify that it's
not NULL and pass it in. The passed in @sch parameter is not used yet.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:03:26 -10:00
Tejun Heo
9fc687edf2 sched_ext: Separate out scx_kick_cpu() and add @sch to it
In preparation for multiple scheduler support, separate out scx_kick_cpu()
from scx_bpf_kick_cpu() and add the @sch parameter to it. scx_bpf_kick_cpu()
now acquires an RCU read lock, reads $scx_root, and calls scx_kick_cpu()
with it if non-NULL. The passed in @sch parameter is not used yet.

Internal uses of scx_bpf_kick_cpu() are converted to scx_kick_cpu(). Where
$sch is available, it's used. In the pick_task_scx() path where no
associated scheduler can be identified, $scx_root is used directly. Note
that $scx_root cannot be NULL in this case.
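A user-space sketch of the split described above: an internal helper that takes the scheduler instance explicitly, and an exported wrapper that resolves the current root and returns early when it is NULL. The toy_* names are hypothetical, and a C11 acquire load stands in for rcu_dereference() under an RCU read lock; grace periods and freeing are not modeled.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy scheduler instance. */
struct toy_sched {
	int kicks;
};

/* Stand-in for scx_root. The kernel reads it with rcu_dereference()
 * under rcu_read_lock(); an acquire load is used here instead. */
static _Atomic(struct toy_sched *) toy_root;

/* Internal helper: the scheduler instance is an explicit parameter. */
static void toy_kick_cpu(struct toy_sched *sch, int cpu)
{
	(void)cpu;
	sch->kicks++;
}

/* Exported entry point: resolve the instance, bail if none is loaded. */
static void toy_bpf_kick_cpu(int cpu)
{
	struct toy_sched *sch;

	sch = atomic_load_explicit(&toy_root, memory_order_acquire);
	if (sch)
		toy_kick_cpu(sch, cpu);
}
```

Internal callers that already hold an instance call toy_kick_cpu() directly, so only the exported wrapper needs the NULL check.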

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:03:26 -10:00
Tejun Heo
f3aec2adce sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()
ops.exit() may be called even if loading failed before ops.init()
finished successfully. This is because ops.exit() allows rich exit info
communication. Add SCX_EFLAG_INITIALIZED flag to scx_exit_info.flags to
indicate whether ops.init() finished successfully.

This enables BPF schedulers to distinguish between exit scenarios and
handle cleanup appropriately based on initialization state.
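A plain-C sketch of how an exit handler might use such a flag. In a real BPF scheduler this check would live in ops.exit(); the TOY_EFLAG_INITIALIZED value and the toy_* types are assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value; see the kernel's scx_exit_info definitions
 * for the real SCX_EFLAG_INITIALIZED. */
#define TOY_EFLAG_INITIALIZED (1ULL << 0)

struct toy_exit_info {
	uint64_t flags;
};

struct toy_state {
	bool resources_allocated;	/* set by a successful init */
	int cleanups;
};

/* Exit handler: only tear down what init is known to have set up. */
static void toy_exit(struct toy_state *st, const struct toy_exit_info *ei)
{
	if (!(ei->flags & TOY_EFLAG_INITIALIZED))
		return;		/* init never completed, nothing to undo */
	st->resources_allocated = false;
	st->cleanups++;
}
```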

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:03:26 -10:00
Tejun Heo
f75efc8f4c sched_ext: Fix stray scx_root usage in task_can_run_on_remote_rq()
task_can_run_on_remote_rq() takes @sch but it is using scx_root when
incrementing SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE, which is inconsistent and
gets in the way of implementing multiple scheduler support. Use @sch
instead. As currently scx_root is the only possible scheduler instance, this
doesn't cause any behavior changes.

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:03:26 -10:00
Tejun Heo
c8191ee8e6 sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
The find_user_dsq() function is called from contexts that are already
under RCU read lock protection. Switch from rhashtable_lookup_fast() to
rhashtable_lookup() to avoid redundant RCU locking.

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 09:03:25 -10:00
Andrea Righi
340de1f673 sched_ext: Verify RCU protection in scx_bpf_cpu_curr()
scx_bpf_cpu_curr() has been introduced to retrieve the current task of a
given runqueue, allowing schedulers to interact with that task.

The kfunc assumes that it is always called in an RCU context, but this
is not always guaranteed and some BPF schedulers can trigger the
following warning:

  WARNING: suspicious RCU usage
  sched_ext: BPF scheduler "cosmos_1.0.2_gd0e71ca_x86_64_unknown_linux_gnu_debug" enabled
  6.17.0-rc1 #1-NixOS Not tainted
  -----------------------------
  kernel/sched/ext.c:6415 suspicious rcu_dereference_check() usage!
  ...
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x6f/0xb0
  lockdep_rcu_suspicious.cold+0x4e/0x96
  scx_bpf_cpu_curr+0x7e/0x80
  bpf_prog_c68b2b6b6b1b0ff8_sched_timerfn+0xce/0x1dc
  bpf_timer_cb+0x7b/0x130
  __hrtimer_run_queues+0x1ea/0x380
  hrtimer_run_softirq+0x8c/0xd0
  handle_softirqs+0xc9/0x3b0
  __irq_exit_rcu+0x96/0xc0
  irq_exit_rcu+0xe/0x20
  sysvec_apic_timer_interrupt+0x73/0x80
  </IRQ>
  <TASK>

To address this, mark the kfunc with KF_RCU_PROTECTED, so the verifier
can enforce its usage only inside RCU-protected sections.

Note: this also requires commit 1512231b6c ("bpf: Enforce RCU protection
for KF_RCU_PROTECTED"), currently in bpf-next, to properly enforce
KF_RCU_PROTECTED.

Fixes: 20b158094a ("sched_ext: Introduce scx_bpf_cpu_curr()")
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-23 05:09:40 -10:00
Andrea Righi
ac6772e8bc sched_ext: Add migration-disabled counter to error state dump
Include the task's migration-disabled counter when dumping task state
during an error exit.

This can help diagnose cases where tasks get stuck because they're
unable to migrate elsewhere.

tj: s/nomig/no_mig/ for readability and consistency with other keys.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-18 08:54:57 -10:00
Andrea Righi
0b47b6c354 Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"
scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
CPU is taken by a higher scheduling class to give tasks queued to the
CPU's local DSQ a chance to be migrated somewhere else, instead of
waiting indefinitely for that CPU to become available again.

In doing so, we decided to skip migration-disabled tasks, under the
assumption that they cannot be migrated anyway.

However, when a higher scheduling class preempts a CPU, the running task
is always inserted at the head of the local DSQ as a migration-disabled
task. This means it is always skipped by scx_bpf_reenqueue_local(), and
ends up being confined to the same CPU even if that CPU is heavily
contended by other higher scheduling class tasks.

As an example, let's consider the following scenario:

 $ schedtool -a 0,1, -e yes > /dev/null
 $ sudo schedtool -F -p 99 -a 0, -e \
   stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000

The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
(SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
task initially runs on CPU0, it will remain there because it always sees
CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
utilization while CPU1 stays idle:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[                        0.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
 1067 root        RT   0  R   0  99.0  0.2  0:31.16 stress-ng-cpu [run]
  975 arighi      20   0  R   0   1.0  0.0  0:26.32 yes

By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
this case) via ops.enqueue(), leading to better CPU utilization:

    0[||||||||||||||||||||||100.0%]   8[                        0.0%]
    1[||||||||||||||||||||||100.0%]   9[                        0.0%]
    2[                        0.0%]  10[                        0.0%]
    3[                        0.0%]  11[                        0.0%]
    4[                        0.0%]  12[                        0.0%]
    5[                        0.0%]  13[                        0.0%]
    6[                        0.0%]  14[                        0.0%]
    7[                        0.0%]  15[                        0.0%]
  PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
  577 root        RT   0  R   0 100.0  0.2  0:23.17 stress-ng-cpu [run]
  555 arighi      20   0  R   1 100.0  0.0  0:28.67 yes

It's debatable whether per-CPU tasks should be re-enqueued as well, but
doing so is probably safer: the scheduler can recognize re-enqueued
tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
either put them back at the head of the local DSQ or let another task
attempt to take the CPU.

This also prevents giving per-CPU tasks an implicit priority boost,
which would otherwise make them more likely to reclaim CPUs preempted by
higher scheduling classes.

Fixes: 97e13ecb02 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16 10:15:23 -10:00
Andrea Righi
47d9f82128 sched_ext: Fix NULL dereference in scx_bpf_cpu_rq() warning
When printing the deprecation warning for scx_bpf_cpu_rq(), we may hit a
NULL pointer dereference if the kfunc is called before a BPF scheduler
is fully attached, for example, when invoked from a BPF timer or during
ops.init():

 [   50.752775] BUG: kernel NULL pointer dereference, address: 0000000000000331
 ...
 [   50.764205] RIP: 0010:scx_bpf_cpu_rq+0x30/0xa0
 ...
 [   50.787661] Call Trace:
 [   50.788398]  <TASK>
 [   50.789061]  bpf_prog_08f7fd2dcb187aaf_wakeup_timerfn+0x75/0x1a8
 [   50.792477]  bpf_timer_cb+0x7e/0x140
 [   50.796003]  hrtimer_run_softirq+0x91/0xe0
 [   50.796952]  handle_softirqs+0xce/0x3c0
 [   50.799087]  run_ksoftirqd+0x3e/0x70
 [   50.800197]  smpboot_thread_fn+0x133/0x290
 [   50.802320]  kthread+0x115/0x220
 [   50.804984]  ret_from_fork+0x17a/0x1d0
 [   50.806920]  ret_from_fork_asm+0x1a/0x30
 [   50.807799]  </TASK>

Fix this by only printing the warning once the scheduler is fully
registered.

Fixes: 5c48d88fe0 ("sched_ext: deprecation warn for scx_bpf_cpu_rq()")
Cc: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04 10:27:16 -10:00
Christian Loehle
5c48d88fe0 sched_ext: deprecation warn for scx_bpf_cpu_rq()
scx_bpf_cpu_rq() works on an unlocked rq, which generally isn't safe.
For the common use cases, scx_bpf_locked_rq() and scx_bpf_cpu_curr()
suffice, so add a deprecation warning to scx_bpf_cpu_rq() so that it
can eventually be removed.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 11:51:57 -10:00
Christian Loehle
20b158094a sched_ext: Introduce scx_bpf_cpu_curr()
Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr
task of a remote rq without assuming its lock is held.

Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr
(e.g. to see if it should be preempted). This is problematic because
scx_bpf_cpu_rq() provides access to all fields of struct rq, most of
which aren't safe to use without holding the associated rq lock.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 11:50:42 -10:00
Christian Loehle
e0ca169638 sched_ext: Introduce scx_bpf_locked_rq()
Most fields of the rq returned by scx_bpf_cpu_rq() assume that its rq
lock is held; without the lock, they are also meaningless. Make a safer
version of scx_bpf_cpu_rq() that only returns an rq if we hold that rq's
lock.

Also mark the new scx_bpf_locked_rq() as possibly returning NULL, as
scx_bpf_cpu_rq() should have been too.
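A user-space toy model of these semantics: the getter hands out the rq only while the caller holds its lock and returns NULL otherwise, so callers must handle NULL. The toy_* names are hypothetical, and a thread-local variable approximates per-CPU tracking of the locked rq; this is not the kfunc's implementation.

```c
#include <assert.h>
#include <stddef.h>

struct toy_rq {
	int cpu;
};

/* Which rq the current context has locked, if any. _Thread_local
 * approximates per-CPU storage for this sketch. */
static _Thread_local struct toy_rq *toy_locked_rq_state;

static void toy_rq_lock(struct toy_rq *rq)
{
	toy_locked_rq_state = rq;
}

static void toy_rq_unlock(struct toy_rq *rq)
{
	(void)rq;
	toy_locked_rq_state = NULL;
}

/* Return the rq only if its lock is held, otherwise NULL. */
static struct toy_rq *toy_locked_rq(void)
{
	return toy_locked_rq_state;
}

/* Caller pattern: the NULL return must be handled. */
static int toy_read_locked_cpu(void)
{
	struct toy_rq *rq = toy_locked_rq();

	return rq ? rq->cpu : -1;
}
```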

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03 11:50:36 -10:00
Tejun Heo
a5bd6ba30b sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations
SCX hooks into CPU cgroup controller operations and read-locks
scx_cgroup_rwsem to exclude them while enabling and disabling schedulers.
While this works, it's unnecessarily complicated given that
cgroup_[un]lock() are available and thus the cgroup operations can be locked
out that way.

Drop scx_cgroup_rwsem locking from the tg on/offline and cgroup [can_]attach
operations. Instead, grab cgroup_lock() from scx_cgroup_lock(). Drop
scx_cgroup_finish_attach() which is no longer necessary. Drop the now
unnecessary rcu locking and css ref bumping in scx_cgroup_init() and
scx_cgroup_exit().

As scx_cgroup_set_weight/bandwidth() paths aren't protected by
cgroup_lock(), rename scx_cgroup_rwsem to scx_cgroup_ops_rwsem and retain
the locking there.

This is overall simpler and will also allow enable/disable paths to
synchronize against cgroup changes independent of the CPU controller.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:36:07 -10:00
Tejun Heo
bcb7c23056 sched_ext: Put event_stats_cpu in struct scx_sched_pcpu
scx_sched.event_stats_cpu holds the percpu counters that are used to track
stats. Introduce struct scx_sched_pcpu and move the counters inside. This
will ease adding more per-cpu fields. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Tejun Heo
0c2b8356e4 sched_ext: Move internal type and accessor definitions to ext_internal.h
There currently isn't a place for SCX-internal types and accessors that are
shared between ext.c and ext_idle.c. Create kernel/sched/ext_internal.h
and move internal type and accessor definitions there. This trims ext.c a
bit and makes future additions easier. Pure code reorganization. No
functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Tejun Heo
4a1d9d73aa sched_ext: Keep bypass on between enable failure and scx_disable_workfn()
scx_enable() turns on the bypass mode while enable is in progress. If
enabling fails, it turns off the bypass mode and then triggers scx_error().
scx_error() will trigger scx_disable_workfn() which will turn on the bypass
mode again and unload the failed scheduler.

This moves the system out of bypass mode between the enable error path and
the disable path, which is unnecessary and can be brittle - e.g. the thread
running scx_enable() may already be on the failed scheduler and can be
switched out before it triggers scx_error() leading to a stall. The watchdog
would eventually kick in, so the situation isn't critical but is still
suboptimal.

There is nothing to be gained by turning off the bypass mode between
scx_enable() failure and scx_disable_workfn(). Keep bypass on.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Tejun Heo
b7975c4869 sched_ext: Make explicit scx_task_iter_relock() calls unnecessary
During task iteration, the locks can be dropped using
scx_task_iter_unlock() to perform e.g. sleepable allocations. Afterwards,
scx_task_iter_relock() has to be called prior to other iteration operations,
which is error-prone. This can be easily automated by tracking whether
scx_tasks_lock is held in scx_task_iter and re-acquiring when necessary. It
already tracks whether the task's rq is locked after all.

- Add scx_task_iter->list_locked which remembers whether scx_tasks_lock is
  held.

- Rename scx_task_iter->locked to scx_task_iter->locked_task to better
  distinguish it from ->list_locked.

- Replace scx_task_iter_relock() with __scx_task_iter_maybe_relock() which
  is automatically called by scx_task_iter_next() and scx_task_iter_stop().

- Drop explicit scx_task_iter_relock() calls.

The resulting behavior should be equivalent.
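
The lazy relock pattern described above can be sketched in userspace C. This is a simplified model, not the kernel code: a boolean "lock" with asserts stands in for scx_tasks_lock, an int array stands in for the task list, and iter_start(), iter_unlock(), iter_maybe_relock(), iter_next() and iter_stop() are hypothetical names loosely mirroring the scx_task_iter helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for scx_tasks_lock; the asserts catch a double
 * acquire or a release of a lock that is not held. */
struct model_lock { bool held; };
static struct model_lock tasks_lock;

static void model_acquire(struct model_lock *l) { assert(!l->held); l->held = true; }
static void model_release(struct model_lock *l) { assert(l->held); l->held = false; }

struct task_iter {
	const int *tasks;   /* hypothetical array standing in for the task list */
	size_t nr, pos;
	bool list_locked;   /* mirrors scx_task_iter->list_locked */
	int nr_acquires;    /* example-only instrumentation */
};

static void iter_start(struct task_iter *it, const int *tasks, size_t nr)
{
	*it = (struct task_iter){ .tasks = tasks, .nr = nr };
	model_acquire(&tasks_lock);
	it->list_locked = true;
	it->nr_acquires = 1;
}

/* Drop the list lock, e.g. around a sleepable allocation. */
static void iter_unlock(struct task_iter *it)
{
	if (it->list_locked) {
		model_release(&tasks_lock);
		it->list_locked = false;
	}
}

/* Models __scx_task_iter_maybe_relock(): re-acquire only if dropped. */
static void iter_maybe_relock(struct task_iter *it)
{
	if (!it->list_locked) {
		model_acquire(&tasks_lock);
		it->list_locked = true;
		it->nr_acquires++;
	}
}

static const int *iter_next(struct task_iter *it)
{
	iter_maybe_relock(it);
	return it->pos < it->nr ? &it->tasks[it->pos++] : NULL;
}

static void iter_stop(struct task_iter *it)
{
	iter_maybe_relock(it);
	model_release(&tasks_lock);
	it->list_locked = false;
}
```

A caller can call iter_unlock() anywhere in the loop body without a matching relock; the next iter_next() or the final iter_stop() restores the lock, which is what removes the error-prone explicit relock calls.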

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03 11:33:28 -10:00
Andrea Righi
ddf7233fca sched/ext: Fix invalid task state transitions on class switch
When enabling a sched_ext scheduler, we may trigger invalid task state
transitions, resulting in warnings like the following (which can be
easily reproduced by running the hotplug selftest in a loop):

 sched_ext: Invalid task state transition 0 -> 3 for fish[770]
 WARNING: CPU: 18 PID: 787 at kernel/sched/ext.c:3862 scx_set_task_state+0x7c/0xc0
 ...
 RIP: 0010:scx_set_task_state+0x7c/0xc0
 ...
 Call Trace:
  <TASK>
  scx_enable_task+0x11f/0x2e0
  switching_to_scx+0x24/0x110
  scx_enable.isra.0+0xd14/0x13d0
  bpf_struct_ops_link_create+0x136/0x1a0
  __sys_bpf+0x1edd/0x2c30
  __x64_sys_bpf+0x21/0x30
  do_syscall_64+0xbb/0x370
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

This happens because we skip initialization for tasks that are already
dead (with their usage counter set to zero), but we don't exclude them
during the scheduling class transition phase.

Fix this by also skipping dead tasks during class switching, preventing
invalid task state transitions.
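
The two-phase flow and why both phases must skip dead tasks can be sketched with a simplified state machine. This is a hypothetical model, assuming that states only advance one step at a time (NONE, INIT, READY, ENABLED) and that a usage count of zero marks a dead task; set_state(), task_dead() and enable_all() are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

enum task_state { TASK_NONE, TASK_INIT, TASK_READY, TASK_ENABLED };

struct task {
	int usage;              /* 0 means the task is already dead */
	enum task_state state;
};

/* Model rule: only single forward steps are valid. */
static bool set_state(struct task *p, enum task_state to)
{
	if (to != p->state + 1)
		return false;   /* the kernel would warn about this transition */
	p->state = to;
	return true;
}

static bool task_dead(const struct task *p)
{
	return p->usage == 0;
}

/* Returns the number of invalid transitions that were attempted. */
static int enable_all(struct task *tasks, int n)
{
	int bad = 0;

	/* Phase 1: initialize tasks, skipping dead ones. */
	for (int i = 0; i < n; i++) {
		if (task_dead(&tasks[i]))
			continue;
		bad += !set_state(&tasks[i], TASK_INIT);
		bad += !set_state(&tasks[i], TASK_READY);
	}

	/* Phase 2: class switch. Without this skip, a dead task still in
	 * TASK_NONE would attempt the invalid NONE -> ENABLED transition. */
	for (int i = 0; i < n; i++) {
		if (task_dead(&tasks[i]))
			continue;
		bad += !set_state(&tasks[i], TASK_ENABLED);
	}
	return bad;
}
```

Removing the check in the second loop makes enable_all() report one invalid transition per dead task, which corresponds to the "0 -> 3" warning above.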

Fixes: a8532fac7b ("sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-11 06:56:37 -10:00
Linus Torvalds
6a68cec16b Merge tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:

 - Add support for cgroup "cpu.max" interface

 - Code organization cleanup so that ext_idle.c doesn't depend on the
   source-file-inclusion build method of sched/

 - Drop UP paths in accordance with sched core changes

 - Documentation and other misc changes

* tag 'sched_ext-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix scx_bpf_reenqueue_local() reference
  sched_ext: Drop kfuncs marked for removal in 6.15
  sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic
  kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments
  sched_ext: Add support for cgroup bandwidth control interface
  sched_ext, sched/core: Factor out struct scx_task_group
  sched_ext: Return NULL in llc_span
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.h
  sched_ext: Always use SMP versions in kernel/sched/ext_idle.c
  sched_ext: Always use SMP versions in kernel/sched/ext.h
  sched_ext: Always use SMP versions in kernel/sched/ext.c
  sched_ext: Documentation: Clarify time slice handling in task lifecycle
  sched_ext: Make scx_locked_rq() inline
  sched_ext: Make scx_rq_bypassing() inline
  sched_ext: idle: Make local functions static in ext_idle.c
  sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node()
2025-07-31 16:29:46 -07:00
Christian Loehle
ae96bba1ca sched_ext: Fix scx_bpf_reenqueue_local() reference
The comment mentions bpf_scx_reenqueue_local(), but the function is
provided to the BPF program implementing the scheduler, so it follows the
scx_bpf_ naming convention: scx_bpf_reenqueue_local(). Fix the comment.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-07-17 08:17:26 -10:00
Breno Leitao
e14fd98c6d sched/ext: Prevent update_locked_rq() calls with NULL rq
Avoid invoking update_locked_rq() when the runqueue (rq) pointer is NULL
in the SCX_CALL_OP and SCX_CALL_OP_RET macros.

Previously, calling update_locked_rq(NULL) with preemption enabled could
trigger the following warning:

    BUG: using __this_cpu_write() in preemptible [00000000]

This happens because __this_cpu_write() is unsafe to use in preemptible
context.

rq is NULL when an op is invoked from an unlocked context. In such cases, we
don't need to store any rq, since the value should already be NULL
(unlocked). Ensure that update_locked_rq() is only called when rq is
non-NULL, preventing calls to __this_cpu_write() in preemptible context.
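
The guarded macro shape can be sketched in userspace C. This is a simplified model, assuming a single global in place of the per-CPU locked-rq slot and a boolean flag in place of the real preemption check; CALL_OP(), some_op() and the preemptible flag are hypothetical and only approximate the shape of SCX_CALL_OP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rq { int cpu; };

/* Model of the per-CPU "currently locked rq" slot. */
static struct rq *locked_rq;
/* Model of preemption state; true means a __this_cpu_write() is unsafe. */
static bool preemptible = true;

static void update_locked_rq(struct rq *rq)
{
	/* Stand-in for the debug check that fires in preemptible context. */
	assert(!preemptible);
	locked_rq = rq;
}

/* Only touch the per-CPU slot when an rq is actually held. */
#define CALL_OP(op, rq, ...)			\
do {						\
	if (rq)					\
		update_locked_rq(rq);		\
	op(__VA_ARGS__);			\
	if (rq)					\
		update_locked_rq(NULL);		\
} while (0)

static int calls;
static struct rq *seen_rq;

/* Hypothetical op that records which rq was visible while it ran. */
static void some_op(int x)
{
	calls += x;
	seen_rq = locked_rq;
}
```

An unlocked caller passes NULL and never reaches update_locked_rq(), so the preemptible-context check cannot fire; a caller holding an rq lock (preemption off) still publishes the rq for the duration of the op.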

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Fixes: 18853ba782 ("sched_ext: Track currently locked rq")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v6.15
2025-07-16 15:02:12 -10:00
Jake Hillion
4ecf837414 sched_ext: Drop kfuncs marked for removal in 6.15
sched_ext performed a kfunc renaming pass in 6.13 and kept the old names
around for compatibility with old binaries. These were scheduled for
cleanup in 6.15 but were missed. Submitting for cleanup in for-next.

Removed the kfuncs, their flags, and any references I could find to them
in doc comments. Left the entries in include/scx/compat.bpf.h as they're
still useful to make new binaries compatible with old kernels.

Tested by applying to my kernel. It builds and a modern version of
scx_lavd loads fine.

Signed-off-by: Jake Hillion <jake@hillion.co.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-25 13:13:20 -10:00
David Dai
cb444006a6 sched_ext, rcu: Eject BPF scheduler on RCU CPU stall panic
For systems that use a sched_ext scheduler and have panic_on_rcu_stall
enabled, try kicking out the current scheduler before issuing a panic.

While there are numerous reasons for RCU CPU stalls that are not
directly attributable to the scheduler, deferring the panic gives
sched_ext an opportunity to provide additional debug info when ejecting
the current scheduler. Also, handling the event more gracefully allows
us to potentially recover the system instead of incurring additional
downtime.
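
The "eject first, panic later" ordering can be illustrated with a small userspace model. It is a sketch under stated assumptions: the real hook and its call site in the RCU stall code are not shown in this log, so try_eject_bpf_sched(), on_rcu_stall() and the flags below are hypothetical, and a counter replaces the actual panic.

```c
#include <assert.h>
#include <stdbool.h>

static bool bpf_sched_loaded = true; /* hypothetical: is a BPF scheduler attached? */
static bool panic_on_stall = true;   /* models the panic_on_rcu_stall setting */
static int panics;                   /* counts panics instead of halting */

/* Returns true if a BPF scheduler was ejected, deferring the panic. */
static bool try_eject_bpf_sched(void)
{
	if (!bpf_sched_loaded)
		return false;
	bpf_sched_loaded = false;    /* fall back to the default scheduler */
	return true;
}

static void on_rcu_stall(void)
{
	if (try_eject_bpf_sched())
		return;              /* give the system a chance to recover */
	if (panic_on_stall)
		panics++;            /* the kernel would panic here */
}
```

The first stall only ejects the BPF scheduler; if stalls continue with no BPF scheduler left to eject, the model falls through to the panic path.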

Suggested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: David Dai <david.dai@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-24 13:05:26 -10:00
Ke Ma
e2a37c277c kernel/sched/ext.c: fix typo "occured" -> "occurred" in comments
Fix a minor spelling mistake in two comment lines.

Signed-off-by: Ke Ma <makebit1999@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-23 08:11:16 -10:00
Tejun Heo
ddceadce63 sched_ext: Add support for cgroup bandwidth control interface
From 077814f57f8acce13f91dc34bbd2b7e4911fbf25 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 13 Jun 2025 15:06:47 -1000

- Add CONFIG_GROUP_SCHED_BANDWIDTH which is selected by both
  CONFIG_CFS_BANDWIDTH and EXT_GROUP_SCHED.

- Put bandwidth control interface files for both cgroup v1 and v2 under
  CONFIG_GROUP_SCHED_BANDWIDTH.

- Update tg_bandwidth() to fetch configuration parameters from fair if
  CONFIG_CFS_BANDWIDTH, SCX otherwise.

- Update tg_set_bandwidth() to update the parameters for both fair and SCX.

- Add bandwidth control parameters to struct scx_cgroup_init_args.

- Add sched_ext_ops.cgroup_set_bandwidth() which is invoked on bandwidth
  control parameter updates.

- Update scx_qmap and maximal selftest to test the new feature.
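
The read/write split for the bandwidth parameters can be sketched as follows. This is a hypothetical model, assuming a MODEL_CFS_BANDWIDTH macro in place of CONFIG_CFS_BANDWIDTH, plain structs in place of the fair and SCX task_group fields, and a counter in place of invoking sched_ext_ops.cgroup_set_bandwidth(); the kernel's real signatures and units are not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Define MODEL_CFS_BANDWIDTH to model a CONFIG_CFS_BANDWIDTH=y build. */

struct bw_params {
	uint64_t period_us, quota_us, burst_us;
};

struct task_group {
	struct bw_params cfs;   /* stand-in for the fair-class fields */
	struct bw_params scx;   /* stand-in for the SCX fields */
	int scx_notified;       /* counts modeled cgroup_set_bandwidth() calls */
};

/* Reads come from fair when it is built in, from SCX otherwise. */
static void tg_bandwidth(const struct task_group *tg, struct bw_params *out)
{
#ifdef MODEL_CFS_BANDWIDTH
	*out = tg->cfs;
#else
	*out = tg->scx;
#endif
}

/* Writes update both copies and notify the BPF scheduler. */
static void tg_set_bandwidth(struct task_group *tg, struct bw_params p)
{
	tg->cfs = p;
	tg->scx = p;
	tg->scx_notified++;
}
```

Because writes keep both copies in sync, reading from either side returns the same configuration regardless of which build option is modeled.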

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20 17:03:51 -10:00
Tejun Heo
6e6558a6bc sched_ext, sched/core: Factor out struct scx_task_group
More sched_ext fields will be added to struct task_group. In preparation,
factor out sched_ext fields into struct scx_task_group to reduce clutter in
the common header. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20 17:03:27 -10:00
Tejun Heo
e4e149dd2f sched_ext: Merge branch 'for-6.16-fixes' into for-6.17
Pull sched_ext/for-6.16-fixes to receive:

 c50784e99f ("sched_ext: Make scx_group_set_weight() always update tg->scx.weight")
 33796b9187 ("sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()")

which are needed to implement CPU bandwidth control interface support.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20 17:01:21 -10:00
Tejun Heo
33796b9187 sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()
During task_group creation, sched_create_group() calls
scx_group_set_weight() with CGROUP_WEIGHT_DFL to initialize the sched_ext
portion. This is premature and ends up calling ops.cgroup_set_weight() with
an incorrect @cgrp before ops.cgroup_init() is called.

sched_create_group() should just initialize SCX related fields in the new
task_group. Fix it by factoring out scx_tg_init() from sched_init() and
making sched_create_group() call that function instead of
scx_group_set_weight().
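
The difference between initializing fields and invoking the BPF op can be shown with a small model. This sketch assumes simplified structs and a counter that records any ops.cgroup_set_weight() call made before ops.cgroup_init(); the helper names follow the commit message, but their bodies here are hypothetical. CGROUP_WEIGHT_DFL is 100, matching the kernel's default cgroup weight.

```c
#include <assert.h>
#include <stdbool.h>

#define CGROUP_WEIGHT_DFL 100

struct task_group {
	unsigned int weight;
	bool cgroup_inited;     /* set once ops.cgroup_init() has run */
};

static int set_weight_calls_before_init;

/* Models ops.cgroup_set_weight(); a call before init is the bug. */
static void op_cgroup_set_weight(struct task_group *tg, unsigned int w)
{
	if (!tg->cgroup_inited)
		set_weight_calls_before_init++;
	(void)w;
}

/* Fixed path: only initialize SCX fields, never call into BPF. */
static void scx_tg_init(struct task_group *tg)
{
	tg->weight = CGROUP_WEIGHT_DFL;
}

static void scx_group_set_weight(struct task_group *tg, unsigned int w)
{
	tg->weight = w;
	op_cgroup_set_weight(tg, w);
}

static void sched_create_group(struct task_group *tg)
{
	*tg = (struct task_group){ 0 };
	/* Previously: scx_group_set_weight(tg, CGROUP_WEIGHT_DFL) */
	scx_tg_init(tg);
}
```

Swapping the call in sched_create_group() back to scx_group_set_weight() makes set_weight_calls_before_init non-zero, which is the premature invocation the commit removes.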

v2: Retain CONFIG_EXT_GROUP_SCHED ifdef in sched_init() as removing it leads
    to build failures on !CONFIG_GROUP_SCHED configs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 8195136669 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
2025-06-17 08:19:55 -10:00
Tejun Heo
c50784e99f sched_ext: Make scx_group_set_weight() always update tg->scx.weight
Otherwise, tg->scx.weight can go out of sync while scx_cgroup is not enabled
and ops.cgroup_init() may be called with a stale weight value.
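
The ordering that keeps the weight in sync can be sketched as below. This is a simplified model, assuming a global flag for whether SCX cgroup support is enabled and a counter in place of invoking ops.cgroup_set_weight(); locking and the real op dispatch are omitted, and scx_cgroup_init_model() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

struct task_group {
	unsigned int weight;    /* stand-in for tg->scx.weight */
};

static bool scx_cgroup_enabled;
static int op_calls;            /* modeled ops.cgroup_set_weight() calls */

static void scx_group_set_weight(struct task_group *tg, unsigned int weight)
{
	/* Notify the BPF scheduler only while cgroup support is enabled. */
	if (scx_cgroup_enabled && tg->weight != weight)
		op_calls++;
	/* Record the weight unconditionally so it never goes stale. */
	tg->weight = weight;
}

/* Returns the weight that would be passed to ops.cgroup_init(). */
static unsigned int scx_cgroup_init_model(const struct task_group *tg)
{
	scx_cgroup_enabled = true;
	return tg->weight;
}
```

If the assignment were placed inside the enabled check, a weight change made while disabled would be lost and the later init would see the old value.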

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 8195136669 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
2025-06-17 08:19:43 -10:00
Cheng-Yang Chou
165af41516 sched_ext: Always use SMP versions in kernel/sched/ext.c
Simplify the scheduler by making formerly SMP-only primitives and data
structures unconditional.

tj: Updated subject for clarity.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-13 14:29:39 -10:00
Andrea Righi
086ed90a64 sched_ext: Make scx_locked_rq() inline
scx_locked_rq() is used both from ext.c and ext_idle.c, move it to ext.h
as a static inline function.

No functional changes.

v2: Rename locked_rq to scx_locked_rq_state, expose it and make
    scx_locked_rq() inline, as suggested by Tejun.
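
The header layout described in v2 can be sketched in a single file. This is a hypothetical model, assuming one global pointer in place of the per-CPU scx_locked_rq_state; the split between "header" and ".c" sections is marked with comments.

```c
#include <assert.h>
#include <stddef.h>

struct rq { int cpu; };

/* --- what would live in a shared header such as ext.h --- */

/* Exposed state (the kernel keeps one slot per CPU; one global here). */
extern struct rq *scx_locked_rq_state;

/* static inline: every .c file including the header gets its own copy,
 * so both ext.c and ext_idle.c can use it without an exported symbol. */
static inline struct rq *scx_locked_rq(void)
{
	return scx_locked_rq_state;
}

/* --- what would live in exactly one .c file, such as ext.c --- */
struct rq *scx_locked_rq_state;
```

Exposing the variable is what allows the accessor to move into the header; a static variable in ext.c would not be visible to the inline function when included from ext_idle.c.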

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09 06:25:35 -10:00
Andrea Righi
e212743bd7 sched_ext: Make scx_rq_bypassing() inline
scx_rq_bypassing() is used both from ext.c and ext_idle.c, move it to
ext.h as a static inline function.

No functional changes.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09 06:25:24 -10:00