The update of the ->expmaskinitnext and of ->ncpus are unsynchronized,
with the value of ->ncpus being incremented long before the corresponding
->expmaskinitnext mask is updated. If an RCU expedited grace period
sees ->ncpus change, it will update the ->expmaskinit masks from the new
->expmaskinitnext masks. But it is possible that ->ncpus has already
been updated, but the ->expmaskinitnext masks still have their old values.
For the current expedited grace period, no harm done. The CPU could not
have been online before the grace period started, so there is no need to
wait for its non-existent pre-existing readers.
But the next RCU expedited grace period is in a world of hurt. The value
of ->ncpus has already been updated, so this grace period will assume
that the ->expmaskinitnext masks have not changed. But they have, and
they won't be taken into account until the next never-been-online CPU
comes online. This means that RCU will be ignoring some CPUs that it
should be paying attention to.
The solution is to update ->ncpus and ->expmaskinitnext while holding
the ->lock for the rcu_node structure containing the ->expmaskinitnext
mask. Because smp_store_release() is now used to update ->ncpus and
smp_load_acquire() is now used to locklessly read it, if the expedited
grace period sees ->ncpus change, then the updating CPU has to
already be holding the corresponding ->lock. Therefore, when the
expedited grace period later acquires that ->lock, it is guaranteed
to see the new value of ->expmaskinitnext.
On the other hand, if the expedited grace period loads ->ncpus just
before an update, earlier full memory barriers guarantee that
the incoming CPU isn't far enough along to be running any RCU readers.
This commit therefore makes the required change.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU callbacks must be migrated away from an outgoing CPU, and this is
done near the end of the CPU-hotplug operation, after the outgoing CPU is
long gone. Unfortunately, this means that other CPU-hotplug callbacks
can execute while the outgoing CPU's callbacks are still immobilized
on the long-gone CPU's callback lists. If any of these CPU-hotplug
callbacks must wait, either directly or indirectly, for the invocation
of any of the immobilized RCU callbacks, the system will hang.
This commit avoids such hangs by migrating the callbacks away from the
outgoing CPU immediately upon its departure, shortly after the return
from __cpu_die() in takedown_cpu(). Thus, RCU is able to advance these
callbacks and invoke them, which allows all the after-the-fact CPU-hotplug
callbacks to wait on these RCU callbacks without risk of a hang.
While in the neighborhood, this commit also moves rcu_send_cbs_to_orphanage()
and rcu_adopt_orphan_cbs() under a pre-existing #ifdef to avoid including
dead code on the one hand and to avoid define-without-use warnings on the
other hand.
Reported-by: Jeffrey Hugo <jhugo@codeaurora.org>
Link: http://lkml.kernel.org/r/db9c91f6-1b17-6136-84f0-03c3c2581ab4@codeaurora.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Richard Weinberger <richard@nod.at>
5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
ordered") automatically enabled ordered attribute for unbound
workqueues w/ max_active == 1. Because ordered workqueues reject
max_active and some attribute changes, this implicit ordered mode
broke cases where the user creates an unbound workqueue w/ max_active
== 1 and later explicitly changes the related attributes.
This patch distinguishes explicit and implicit ordered setting and
overrides from attribute changes if implict.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
Explain cgroup_enable_threaded() and note that the function can never
be called on the root cgroup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Waiman Long <longman@redhat.com>
cgroup_enable_threaded() checks that the cgroup doesn't have any tasks
or children and fails the operation if so. This test is unnecessary
because the first part is already checked by
cgroup_can_be_thread_root() and the latter is unnecessary. The latter
actually cause a behavioral oddity. Please consider the following
hierarchy. All cgroups are domains.
A
/ \
B C
\
D
If B is made threaded, C and D becomes invalid domains. Due to the no
children restriction, threaded mode can't be enabled on C. For C and
D, the only thing the user can do is removal.
There is no reason for this restriction. Remove it.
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair. This commit therefore replaces the spin_unlock_wait() call in
task_work_run() with a spin_lock_irq() and a spin_unlock_irq() aruond
the cmpxchg() dequeue loop. This should be safe from a performance
perspective because ->pi_lock is local to the task and because calls to
the other side of the race, task_work_cancel(), should be rare.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The handling of RCU's no-CBs CPUs has a maintenance headache, namely
that if call_rcu() is invoked with interrupts disabled, the rcuo kthread
wakeup must be defered to a point where we can be sure that scheduler
locks are not held. Of course, there are a lot of code paths leading
from an interrupts-disabled invocation of call_rcu(), and missing any
one of these can result in excessive callback-invocation latency, and
potentially even system hangs.
This commit therefore uses a timer to guarantee that the wakeup will
eventually occur. If one of the deferred-wakeup points kicks in, then
the timer is simply cancelled.
This commit also fixes up an incomplete removal of commits that were
intended to plug remaining exit paths, which should have the added
benefit of reducing the overhead of RCU's context-switch hooks. In
addition, it simplifies leader-to-follower callback-list handoff by
introducing locking. The call_rcu()-to-leader handoff continues to
use atomic operations in order to maintain good real-time latency for
common-case use of call_rcu().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Dan Carpenter fix for mod_timer() usage bug found by smatch. ]
ddebug_remove_module() use mod->name to find the ddebug_table of the
module and remove it. But dynamic_debug_setup() use the first
_ddebug->modname to create ddebug_table for the module. It's ok when
the _ddebug->modname is the same with the mod->name.
But livepatch module is special, it may contain _ddebugs of other
modules, the modname of which is different from the name of livepatch
module. So ddebug_remove_module() can't use mod->name to find the
right ddebug_table and remove it. It can cause kernel crash when we cat
the file <debugfs>/dynamic_debug/control.
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
We forgot to set the error code on two error paths which means that we
return ERR_PTR(0) which is NULL. The caller, find_and_alloc_map(), is
not expecting that and will have a NULL dereference.
Fixes: 546ac1ffb7 ("bpf: add devmap, a map for storing net device references")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Linux kernel invokes call_rcu() from various interrupt/softirq
handlers, but rcutorture does not. This commit therefore adds this
behavior to rcutorture's repertoire.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit augments the grace-period-kthread starvation debugging
messages by adding the last CPU that ran the kthread.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit removes an unused local variable named ts_rem that is
marked __maybe_unused. Yes, the variable was assigned to, but it
was never used beyond that point, hence not needed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It appears that at least some of the rcutorture writer stall messages
coincide with unusually long CPU-online operations, for example, no
fewer than 205 seconds in a recent test. It is of course possible that
the writer stall is not unrelated to this unusually long CPU-hotplug
operation, and so this commit adds the rcutorture writer task's CPU to
the stall message to gain more information about this possible connection.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Strings used in event tracing need to be specially handled, for example,
being copied to the trace buffer instead of being pointed to by the trace
buffer. Although the TPS() macro can be used to "launder" pointed-to
strings, this might not be all that effective within a loadable module.
This commit therefore copies rcutorture's strings to the trace buffer.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Now that it is legal to invoke srcu_read_lock() and srcu_read_unlock()
for a given srcu_struct from both process context and {soft,}irq
handlers, it is time to test it. This commit therefore enables
testing of SRCU readers from rcutorture's timer handler, using in_task()
to determine whether or not it is safe to sleep in the SRCU read-side
critical sections.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The synchronize_rcu_tasks() and call_rcu_tasks() APIs are now available
regardless of kernel configuration, so this commit removes the
CONFIG_TASKS_RCU ifdef from rcuperf.c.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds printing of SRCU lock/unlock totals, which are just
the sums of the per-CPU counts. Saves a bit of mental arithmetic.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit gets rid of some ugly #ifdefs in rcutorture.c by moving
the SRCU status printing to the SRCU implementations.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The function process_srcu() is not invoked outside of srcutree.c, so
this commit makes it static and drops the EXPORT_SYMBOL_GPL().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Other than lockdep support, Tiny RCU has no need for the
scheduler status. However, Tiny SRCU will need this to control
boot-time behavior independent of lockdep. Therefore, this commit
moves rcu_scheduler_starting() from kernel/rcu/tiny_plugin.h to
kernel/rcu/srcutiny.c. This in turn allows the complete removal of
kernel/rcu/tiny_plugin.h.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Define a common prefix ("PM:") for messages printed by the
code in kernel/power/suspend.c.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Some messages in suspend.c currently print state names from
pm_states[], but that may be confusing if the mem_sleep sysfs
attribute is changed to anything different from "mem", because
in those cases the messages will say either "freeze" or "standby"
after writing "mem" to /sys/power/state.
To avoid the confusion, use mem_sleep_labels[] strings in those
messages instead.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
The pm_test sysfs attribute is under CONFIG_PM_DEBUG, but it doesn't
make sense to provide it if CONFIG_PM_SLEEP is unset, so put it under
CONFIG_PM_SLEEP_DEBUG instead.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Restore the pm_wakeup_pending() check in __device_suspend_noirq()
removed by commit eed4d47efe (ACPI / sleep: Ignore spurious SCI
wakeups from suspend-to-idle) as that allows the function to return
earlier if there's a wakeup event pending already (so that it may
spend less time on carrying out operations that will be reversed
shortly anyway) and rework the main suspend-to-idle loop to take
that optimization into account.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
As a preparation for subsequent changes, rearrange the core
suspend-to-idle code by moving the initial invocation of
dpm_suspend_noirq() into s2idle_loop().
This also causes debug messages from that code to appear in
a less confusing order.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We have to subtract the src max from the dst min, and vice-versa, since
(e.g.) the smallest result comes from the largest subtrahend.
Fixes: 484611357c ("bpf: allow access into map value arrays")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Today sending a signal with rt_sigqueueinfo and receving it on
a signalfd does not work reliably. The issue is that reading
a signalfd instead of returning a siginfo returns a signalfd_siginfo and
the kernel must convert from one to the other.
The kernel does not currently have the code to deduce which union
members of struct siginfo are in use.
In this patchset I fix that by introducing a new function siginfo_layout
that can look at a siginfo and report which union member of struct
siginfo is in use. Before that I clean up how we populate struct
siginfo.
The siginfo structure has two key members si_signo and si_code. Some
si_codes are signal specific and for those it takes si_signo and si_code
to indicate the members of siginfo that are valid. The rest of the
si_code values are signal independent like SI_USER, SI_KERNEL, SI_QUEUE,
and SI_TIMER and only si_code is needed to indicate which members of
siginfo are valid.
At least that is how POSIX documents them, and how common sense would
indicate they should function. In practice we have been rather sloppy
about maintaining the ABI in linux and we have some exceptions. We have
a couple of buggy architectures that make SI_USER mean something
different when combined with SIGFPE or SIGTRAP. Worse we have
fcntl(F_SETSIG) which results in the si_codes POLL_IN, POLL_OUT,
POLL_MSG, POLL_ERR, POLL_PRI, POLL_HUP being sent with any arbitrary
signal, while the values are in a range that overlaps the signal
specific si_codes.
Thankfully the ambiguous cases with the POLL_NNN si_codes are for
things no sane persion would do that so we can rectify the situtation.
AKA no one cares so we won't cause a regression fixing it.
As part of fixing this I stop leaking the __SI_xxxx codes to userspace
and stop storing them in the high 16bits of si_code. Making the kernel
code fundamentally simpler. We have already confirmed that the one
application that would see this difference in kernel behavior CRIU won't
be affected by this change as it copies values verbatim from one kernel
interface to another.
v3:
- Corrected the patches so they bisect properly
v2:
- Benchmarked the code to confirm no performance changes are visible.
- Reworked the first couple of patches so that TRAP_FIXME and
FPE_FIXME are not exported to userspace.
- Rebased on top of the siginfo cleanup that came in v4.13-rc1
- Updated alpha to use both TRAP_FIXME and FPE_FIXME
Eric W. Biederman (7):
signal/alpha: Document a conflict with SI_USER for SIGTRAP
signal/ia64: Document a conflict with SI_USER with SIGFPE
signal/sparc: Document a conflict with SI_USER with SIGFPE
signal/mips: Document a conflict with SI_USER with SIGFPE
signal/testing: Don't look for __SI_FAULT in userspace
fcntl: Don't use ambiguous SIG_POLL si_codes
signal: Remove kernel interal si_code magic
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
struct siginfo is a union and the kernel since 2.4 has been hiding a union
tag in the high 16bits of si_code using the values:
__SI_KILL
__SI_TIMER
__SI_POLL
__SI_FAULT
__SI_CHLD
__SI_RT
__SI_MESGQ
__SI_SYS
While this looks plausible on the surface, in practice this situation has
not worked well.
- Injected positive signals are not copied to user space properly
unless they have these magic high bits set.
- Injected positive signals are not reported properly by signalfd
unless they have these magic high bits set.
- These kernel internal values leaked to userspace via ptrace_peek_siginfo
- It was possible to inject these kernel internal values and cause the
the kernel to misbehave.
- Kernel developers got confused and expected these kernel internal values
in userspace in kernel self tests.
- Kernel developers got confused and set si_code to __SI_FAULT which
is SI_USER in userspace which causes userspace to think an ordinary user
sent the signal and that it was not kernel generated.
- The values make it impossible to reorganize the code to transform
siginfo_copy_to_user into a plain copy_to_user. As si_code must
be massaged before being passed to userspace.
So remove these kernel internal si codes and make the kernel code simpler
and more maintainable.
To replace these kernel internal magic si_codes introduce the helper
function siginfo_layout, that takes a signal number and an si_code and
computes which union member of siginfo is being used. Have
siginfo_layout return an enumeration so that gcc will have enough
information to warn if a switch statement does not handle all of union
members.
A couple of architectures have a messed up ABI that defines signal
specific duplications of SI_USER which causes more special cases in
siginfo_layout than I would like. The good news is only problem
architectures pay the cost.
Update all of the code that used the previous magic __SI_ values to
use the new SIL_ values and to call siginfo_layout to get those
values. Escept where not all of the cases are handled remove the
defaults in the switch statements so that if a new case is missed in
the future the lack will show up at compile time.
Modify the code that copies siginfo si_code to userspace to just copy
the value and not cast si_code to a short first. The high bits are no
longer used to hold a magic union member.
Fixup the siginfo header files to stop including the __SI_ values in
their constants and for the headers that were missing it to properly
update the number of si_codes for each signal type.
The fixes to copy_siginfo_from_user32 implementations has the
interesting property that several of them perviously should never have
worked as the __SI_ values they depended up where kernel internal.
With that dependency gone those implementations should work much
better.
The idea of not passing the __SI_ values out to userspace and then
not reinserting them has been tested with criu and criu worked without
changes.
Ref: 2.4.0-test1
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
While refactoring, f7b2814bb9 ("cgroup: factor out
cgroup_{apply|finalize}_control() from
cgroup_subtree_control_write()") broke error return value from the
function. The return value from the last operation is always
overridden to zero. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
The messages printed by tk_debug_account_sleep_time() are basically
useful for system sleep debugging, so print them only when the other
debug messages from the core suspend/hibernate code are enabled.
While at it, make it clear that the messages from
tk_debug_account_sleep_time() are about timekeeping suspend
duration, because in general timekeeping may be suspeded and
resumed for multiple times during one system suspend-resume cycle.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Regardless of whether or not debug messages from the core system
suspend/hibernation code are enabled, it is useful to know when
system-wide transitions start and finish (or fail), so print "info"
messages at these points.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Mark Salyzyn <salyzyn@android.com>
Debug messages from the system suspend/hibernation infrastructure can
fill up the entire kernel log buffer in some cases and anyway they
are only useful for debugging. They depend on CONFIG_PM_DEBUG, but
that is set as a rule as some generally useful diagnostic facilities
depend on it too.
For this reason, avoid printing those messages by default, but make
it possible to turn them on as needed with the help of a new sysfs
attribute under /sys/power/.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Have the core suspend/resume framework store the system-wide suspend
state (suspend_state_t) we are about to enter, and expose it to drivers
via pm_suspend_target_state in order to retrieve that. The state is
assigned in suspend_devices_and_enter().
This is useful for platform specific drivers that may need to take a
slightly different suspend/resume path based on the system's
suspend/resume state being entered.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The policy->transition_delay_us field is used only by the schedutil
governor currently, and this field describes how fast the driver wants
the cpufreq governor to change CPUs frequency. It should rather be a
common thing across all governors, as it doesn't have any schedutil
dependency here.
Create a new helper cpufreq_policy_transition_delay_us() to get the
transition delay across all governors.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Instead of relying on the struct fwnode_handle type field, define
fwnode_operations structs for all separate types of fwnodes. To find out
the type, compare to the ops field to relevant ops structs.
This change has two benefits:
1. it avoids adding the type field to each and every instance of struct
fwnode_handle, thus saving memory and
2. makes the ops field the single factor that defines both the types of
the fwnode as well as defines the implementation of its operations,
decreasing the possibility of bugs when developing code dealing with
fwnode internals.
Suggested-by: Rob Herring <robh@kernel.org>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull tracing fixes from Steven Rostedt:
"Three minor updates
- Use the new GFP_RETRY_MAYFAIL to be more aggressive in allocating
memory for the ring buffer without causing OOMs
- Fix a memory leak in adding and removing instances
- Add __rcu annotation to be able to debug RCU usage of function
tracing a bit better"
* tag 'trace-v4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
trace: fix the errors caused by incompatible type of RCU variables
tracing: Fix kmemleak in instance_rmdir
tracing/ring_buffer: Try harder to allocate
Pull scheduler fixes from Ingo Molnar:
"A cputime fix and code comments/organization fix to the deadline
scheduler"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix confusing comments about selection of top pi-waiter
sched/cputime: Don't use smp_processor_id() in preemptible context
Pull perf fixes from Ingo Molnar:
"Two hw-enablement patches, two race fixes, three fixes for regressions
of semantics, plus a number of tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Add proper condition to run sched_task callbacks
perf/core: Fix locking for children siblings group read
perf/core: Fix scheduling regression of pinned groups
perf/x86/intel: Fix debug_store reset field for freq events
perf/x86/intel: Add Goldmont Plus CPU PMU support
perf/x86/intel: Enable C-state residency events for Apollo Lake
perf symbols: Accept zero as the kernel base address
Revert "perf/core: Drop kernel samples even though :u is specified"
perf annotate: Fix broken arrow at row 0 connecting jmp instruction to its target
perf evsel: State in the default event name if attr.exclude_kernel is set
perf evsel: Fix attr.exclude_kernel setting for default cycles:p
Pull locking fixlet from Ingo Molnar:
"Remove an unnecessary priority adjustment in the rtmutex code"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Remove unnecessary priority adjustment
Pull irq fixes from Ingo Molnar:
"A resume_irq() fix, plus a number of static declaration fixes"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/digicolor: Drop unnecessary static
irqchip/mips-cpu: Drop unnecessary static
irqchip/gic/realview: Drop unnecessary static
irqchip/mips-gic: Remove population of irq domain names
genirq/PM: Properly pretend disabled state when force resuming interrupts
Update debug controller so that it prints out debug info about thread
mode.
1) The relationship between proc_cset and threaded_csets are displayed.
2) The status of being a thread root or threaded cgroup is displayed.
This patch is extracted from Waiman's larger patch.
v2: - Removed [thread root] / [threaded] from debug.cgroup_css_links
file as the same information is available from cgroup.type.
Suggested by Waiman.
- Threaded marking is moved to the previous patch.
Patch-originally-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the dom_cgrp to which the processes of the subtree conceptually belong
and domain-level resource consumptions not tied to any specific task
are charged. In the subtree, threads won't be subject to process
granularity or no-internal-task constraint and can be distributed
arbitrarily across the subtree.
This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a dom_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets. "cgroup.procs" read path is
updated to use it so that reading the file on a proc_cgrp lists all
processes. This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.
Task iteration is implemented nested in css_set iteration. If
CSS_TASK_ITER_THREADED is specified, after walking tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving onto the next !threaded css_set.
v2: ->cur_pcset renamed to ->cur_dcset. Updated for the new
enable-threaded-per-cgroup behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup v2 is in the process of growing thread granularity support. A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.
The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged. Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.
This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.
* cgroup->dom_cgrp points to self for normal and thread roots. For
proper thread subtree members, points to the dom_cgrp (the thread
root).
* css_set->dom_cset points to self if for normal and thread roots. If
threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
The dom_cgrp serves as the resource domain and keeps the matching
csses available. The dom_cset holds those csses and makes them
easily accessible.
* All threaded csets are linked on their dom_csets to enable iteration
of all threaded tasks.
* cgroup->nr_threaded_children keeps track of the number of threaded
children.
This patch adds the above but doesn't actually use them yet. The
following patches will build on top.
v4: ->nr_threaded_children added.
v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new
enable-threaded-per-cgroup behavior.
v2: Added cgroup_is_threaded() helper.
Signed-off-by: Tejun Heo <tj@kernel.org>
css_task_iter currently always walks all tasks. With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration. As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag
is not specified, it walks all tasks as before. When asserted, the
iterator only walks the group leaders.
For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".
While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr(). As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, writes "cgroup.procs" and "cgroup.tasks" files are all
handled by __cgroup_procs_write() on both v1 and v2. This patch
reoragnizes the write path so that there are common helper functions
that different write paths use.
While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.
v3: - Restructured so that cgroup_procs_write_permission() takes
@src_cgrp and @dst_cgrp.
v2: - Rolled in Waiman's task reference count fix.
- Updated on top of nsdelegate changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
We're missing ctx lock when iterating children siblings
within the perf_read path for group reading. Following
race and crash can happen:
User space doing read syscall on event group leader:
T1:
perf_read
lock event->ctx->mutex
perf_read_group
lock leader->child_mutex
__perf_read_group_add(child)
list_for_each_entry(sub, &leader->sibling_list, group_entry)
----> sub might be invalid at this point, because it could
get removed via perf_event_exit_task_context in T2
Child exiting and cleaning up its events:
T2:
perf_event_exit_task_context
lock ctx->mutex
list_for_each_entry_safe(child_event, next, &child_ctx->event_list,...
perf_event_exit_event(child)
lock ctx->lock
perf_group_detach(child)
unlock ctx->lock
----> child is removed from sibling_list without any sync
with T1 path above
...
free_event(child)
Before the child is removed from the leader's child_list,
(and thus is omitted from perf_read_group processing), we
need to ensure that perf_read_group touches child's
siblings under its ctx->lock.
Peter further notes:
| One additional note; this bug got exposed by commit:
|
| ba5213ae6b ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
|
| which made it possible to actually trigger this code-path.
Tested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: ba5213ae6b ("perf/core: Correct event creation with PERF_FORMAT_GROUP")
Link: http://lkml.kernel.org/r/20170720141455.2106-1-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>