Commit Graph

36889 Commits

Author SHA1 Message Date
Linus Torvalds
65090f30ab Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "191 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
  slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
  mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
  pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm,hwpoison: send SIGBUS with error virutal address
  mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
  mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  docs: remove description of DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  arc: remove support for DISCONTIGMEM
  arc: update comment about HIGHMEM implementation
  alpha: remove DISCONTIGMEM and NUMA
  mm/page_alloc: move free_the_page
  mm/page_alloc: fix counting of managed_pages
  mm/page_alloc: improve memmap_pages dbg msg
  mm: drop SECTION_SHIFT in code comments
  mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
  mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
  mm/page_alloc: scale the number of pages that are batch freed
  ...
2021-06-29 17:29:11 -07:00
Jakub Kicinski
b6df00789e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in net/netfilter/nf_tables_api.c.

Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.

skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-29 15:45:27 -07:00
Daniel Bristot de Oliveira
6a82f42a2e trace/timerlat: Fix indentation on timerlat_main()
Dan Carpenter reported that:

 The patch a955d7eac1: "trace: Add timerlat tracer" from Jun 22,
 2021, leads to the following static checker warning:

	kernel/trace/trace_osnoise.c:1400 timerlat_main()
	warn: inconsistent indenting

here:
  1389          while (!kthread_should_stop()) {
  1390                  now = ktime_to_ns(hrtimer_cb_get_time(&tlat->timer));
  1391                  diff = now - tlat->abs_period;
  1392
  1393                  s.seqnum = tlat->count;
  1394                  s.timer_latency = diff;
  1395                  s.context = THREAD_CONTEXT;
  1396
  1397                  trace_timerlat_sample(&s);
  1398
  1399  #ifdef CONFIG_STACKTRACE
  1400          if (osnoise_data.print_stack)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	    This should be indented another tab?

  1401                  if (osnoise_data.print_stack <= time_to_us(diff))
  1402                          timerlat_dump_stack();
  1403  #endif /* CONFIG_STACKTRACE */
  1404
  1405                  tlat->tracing_thread = false;
  1406                  if (osnoise_data.stop_tracing_total)
  1407                          if (time_to_us(diff) >= osnoise_data.stop_tracing_total)
  1408                                  osnoise_stop_tracing();
  1409
  1410                  wait_next_period(tlat);
  1411          }

And the static checker is right. Fix the indentation.

Link: https://lkml.kernel.org/r/3d5d8c9258fbdcfa9d3c7362941b3d13a2a28d9d.1624986368.git.bristot@redhat.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: linux-kernel@vger.kernel.org
Fixes: a955d7eac1 ("trace: Add timerlat tracer")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-29 16:37:50 -04:00
Daniel Bristot de Oliveira
19c3eaa722 trace/osnoise: Make 'noise' variable s64 in run_osnoise()
Dab Carpenter reported that:

 The patch bce29ac9ce: "trace: Add osnoise tracer" from Jun 22,
 2021, leads to the following static checker warning:

	kernel/trace/trace_osnoise.c:1103 run_osnoise()
	warn: unsigned 'noise' is never less than zero.

In this part of the code:

  1100                  /*
  1101                   * This shouldn't happen.
  1102                   */
  1103                  if (noise < 0) {
                            ^^^^^^^^^
  1104                          osnoise_taint("negative noise!");
  1105                          goto out;
  1106                  }
  1107

And the static checker is right because 'noise' is u64.

Make noise s64 and keep the check. It is important to check if
the time read is behaving correctly - so we can trust the results.

I also re-arranged some variable declarations.

Link: https://lkml.kernel.org/r/acd7cd6e7d56b798a298c3bc8139a390b3c4ab52.1624986368.git.bristot@redhat.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: linux-kernel@vger.kernel.org
Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-29 16:37:50 -04:00
Linus Torvalds
3563f55ce6 Merge tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "These add hybrid processors support to the intel_pstate driver and
  make it work with more processor models when HWP is disabled, make the
  intel_idle driver use special C6 idle state paremeters when package
  C-states are disabled, add cooling support to the tegra30 devfreq
  driver, rework the TEO (timer events oriented) cpuidle governor,
  extend the OPP (operating performance points) framework to use the
  required-opps DT property in more cases, fix some issues and clean up
  a number of assorted pieces of code.

  Specifics:

   - Make intel_pstate support hybrid processors using abstract
     performance units in the HWP interface (Rafael Wysocki).

   - Add Icelake servers and Cometlake support in no-HWP mode to
     intel_pstate (Giovanni Gherdovich).

   - Make cpufreq_online() error path be consistent with the CPU device
     removal path in cpufreq (Rafael Wysocki).

   - Clean up 3 cpufreq drivers and the statistics code (Hailong Liu,
     Randy Dunlap, Shaokun Zhang).

   - Make intel_idle use special idle state parameters for C6 when
     package C-states are disabled (Chen Yu).

   - Rework the TEO (timer events oriented) cpuidle governor to address
     some theoretical shortcomings in it (Rafael Wysocki).

   - Drop unneeded semicolon from the TEO governor (Wan Jiabing).

   - Modify the runtime PM framework to accept unassigned suspend and
     resume callback pointers (Ulf Hansson).

   - Improve pm_runtime_get_sync() documentation (Krzysztof Kozlowski).

   - Improve device performance states support in the generic power
     domains (genpd) framework (Ulf Hansson).

   - Fix some documentation issues in genpd (Yang Yingliang).

   - Make the operating performance points (OPP) framework use the
     required-opps DT property in use cases that are not related to
     genpd (Hsin-Yi Wang).

   - Make lazy_link_required_opp_table() use list_del_init instead of
     list_del/INIT_LIST_HEAD (Yang Yingliang).

   - Simplify wake IRQs handling in the core system-wide sleep support
     code and clean up some coding style inconsistencies in it (Tian
     Tao, Zhen Lei).

   - Add cooling support to the tegra30 devfreq driver and improve its
     DT bindings (Dmitry Osipenko).

   - Fix some assorted issues in the devfreq core and drivers (Chanwoo
     Choi, Dong Aisheng, YueHaibing)"

* tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits)
  PM / devfreq: passive: Fix get_target_freq when not using required-opp
  cpufreq: Make cpufreq_online() call driver->offline() on errors
  opp: Allow required-opps to be used for non genpd use cases
  cpuidle: teo: remove unneeded semicolon in teo_select()
  dt-bindings: devfreq: tegra30-actmon: Add cooling-cells
  dt-bindings: devfreq: tegra30-actmon: Convert to schema
  PM / devfreq: userspace: Use DEVICE_ATTR_RW macro
  PM: runtime: Clarify documentation when callbacks are unassigned
  PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
  PM: runtime: Improve path in rpm_idle() when no callback
  PM: hibernate: remove leading spaces before tabs
  PM: sleep: remove trailing spaces and tabs
  PM: domains: Drop/restore performance state votes for devices at runtime PM
  PM: domains: Return early if perf state is already set for the device
  PM: domains: Split code in dev_pm_genpd_set_performance_state()
  cpuidle: teo: Use kerneldoc documentation in admin-guide
  cpuidle: teo: Rework most recent idle duration values treatment
  cpuidle: teo: Change the main idle state selection logic
  cpuidle: teo: Cosmetic modification of teo_select()
  cpuidle: teo: Cosmetic modifications of teo_update()
  ...
2021-06-29 13:36:06 -07:00
Linus Torvalds
a941a0349c Merge tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "Time and clocksource/clockevent related updates:

  Core changes:

   - Infrastructure to support per CPU "broadcast" devices for per CPU
     clockevent devices which stop in deep idle states. This allows us
     to utilize the more efficient architected timer on certain ARM SoCs
     for normal operation instead of permanentely using the slow to
     access SoC specific clockevent device.

   - Print the name of the broadcast/wakeup device in /proc/timer_list

   - Make the clocksource watchdog more robust against delays between
     reading the current active clocksource and the watchdog
     clocksource. Such delays can be caused by NMIs, SMIs and vCPU
     preemption.

     Handle this by reading the watchdog clocksource twice, i.e. before
     and after reading the current active clocksource. In case that the
     two watchdog reads shows an excessive time delta, the read sequence
     is repeated up to 3 times.

   - Improve the debug output and add a test module for the watchdog
     mechanism.

   - Reimplementation of the venerable time64_to_tm() function with a
     faster and significantly smaller version. Straight from the source,
     i.e. the author of the related research paper contributed this!

  Driver changes:

   - No new drivers, not even new device tree bindings!

   - Fixes, improvements and cleanups and all over the place"

* tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  time/kunit: Add missing MODULE_LICENSE()
  time: Improve performance of time64_to_tm()
  clockevents: Use list_move() instead of list_del()/list_add()
  clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
  clocksource: Provide kernel module to test clocksource watchdog
  clocksource: Reduce clocksource-skew threshold
  clocksource: Limit number of CPUs checked for clock synchronization
  clocksource: Check per-CPU clock synchronization when marked unstable
  clocksource: Retry clock read if long delays detected
  clockevents: Add missing parameter documentation
  clocksource/drivers/timer-ti-dm: Drop unnecessary restore
  clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround
  clocksource/drivers/arm_global_timer: Remove duplicated argument in arm_global_timer
  clocksource/drivers/arm_global_timer: Make symbol 'gt_clk_rate_change_nb' static
  arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore
  clocksource/drivers/arm_global_timer: Implement rate compensation whenever source clock changes
  clocksource/drivers/ingenic: Rename unreasonable array names
  clocksource/drivers/timer-ti-dm: Save and restore timer TIOCP_CFG
  clocksource/drivers/mediatek: Ack and disable interrupts on suspend
  clocksource/drivers/samsung_pwm: Constify source IO memory
  ...
2021-06-29 12:31:16 -07:00
Linus Torvalds
21edf50948 Merge tag 'irq-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "Updates for the interrupt subsystem:

  Core changes:

   - Cleanup and simplification of common code to invoke the low level
     interrupt flow handlers when this invocation requires irqdomain
     resolution. Add the necessary core infrastructure.

   - Provide a proper interface for modular PMU drivers to set the
     interrupt affinity.

   - Add a request flag which allows to exclude interrupts from spurious
     interrupt detection. Useful especially for IPI handlers which
     always return IRQ_HANDLED which turns the spurious interrupt
     detection into a pointless waste of CPU cycles.

  Driver changes:

   - Bulk convert interrupt chip drivers to the new irqdomain low level
     flow handler invocation mechanism.

   - Add device tree bindings for the Renesas R-Car M3-W+ SoC

   - Enable modular build of the Qualcomm PDC driver

   - The usual small fixes and improvements"

* tag 'irq-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  dt-bindings: interrupt-controller: arm,gic-v3: Describe GICv3 optional properties
  irqchip: gic-pm: Remove redundant error log of clock bulk
  irqchip/sun4i: Remove unnecessary oom message
  irqchip/irq-imx-gpcv2: Remove unnecessary oom message
  irqchip/imgpdc: Remove unnecessary oom message
  irqchip/gic-v3-its: Remove unnecessary oom message
  irqchip/gic-v2m: Remove unnecessary oom message
  irqchip/exynos-combiner: Remove unnecessary oom message
  irqchip: Bulk conversion to generic_handle_domain_irq()
  genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ()
  genirq: Add generic_handle_domain_irq() helper
  irqchip/nvic: Convert from handle_IRQ() to handle_domain_irq()
  irqdesc: Fix __handle_domain_irq() comment
  genirq: Use irq_resolve_mapping() to implement __handle_domain_irq() and co
  irqdomain: Introduce irq_resolve_mapping()
  irqdomain: Protect the linear revmap with RCU
  irqdomain: Cache irq_data instead of a virq number in the revmap
  irqdomain: Use struct_size() helper when allocating irqdomain
  irqdomain: Make normal and nomap irqdomains exclusive
  powerpc: Move the use of irq_domain_add_nomap() behind a config option
  ...
2021-06-29 12:25:04 -07:00
Linus Torvalds
62180152e0 Merge tag 'smp-urgent-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Thomas Gleixner:
 "A fix for the CPU hotplug and cpusets interaction:

  cpusets delegate the hotplug work to a workqueue to prevent a lock
  order inversion vs. the CPU hotplug lock. The work is not flushed
  before the hotplug operation returns which creates user visible
  inconsistent state. Prevent this by flushing the work after dropping
  CPU hotplug lock and before releasing the outer mutex which serializes
  the CPU hotplug related sysfs interface operations"

* tag 'smp-urgent-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Cure the cpusets trainwreck
2021-06-29 12:23:02 -07:00
Linus Torvalds
371fb85457 Merge tag 'smp-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug cleanup from Thomas Gleixner:
 "A simple cleanup for the CPU hotplug code to avoid per_cpu_ptr()
  reevaluation"

* tag 'smp-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Simplify access to percpu cpuhp_state
2021-06-29 12:21:21 -07:00
Linus Torvalds
e563592c3e Merge tag 'printk-for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:

 - Add %pt[RT]s modifier to vsprintf(). It overrides ISO 8601 separator
   by using ' ' (space). It produces "YYYY-mm-dd HH:MM:SS" instead of
   "YYYY-mm-ddTHH:MM:SS".

 - Correctly parse long row of numbers by sscanf() when using the field
   width. Add extensive sscanf() selftest.

 - Generalize re-entrant CPU lock that has already been used to
   serialize dump_stack() output. It is part of the ongoing printk
   rework. It will allow to remove the obsoleted printk_safe buffers and
   introduce atomic consoles.

 - Some code clean up and sparse warning fixes.

* tag 'printk-for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: fix cpu lock ordering
  lib/dump_stack: move cpu lock to printk.c
  printk: Remove trailing semicolon in macros
  random32: Fix implicit truncation warning in prandom_seed_state()
  lib: test_scanf: Remove pointless use of type_min() with unsigned types
  selftests: lib: Add wrapper script for test_scanf
  lib: test_scanf: Add tests for sscanf number conversion
  lib: vsprintf: Fix handling of number field widths in vsscanf
  lib: vsprintf: scanf: Negative number must have field width > 1
  usb: host: xhci-tegra: Switch to use %ptTs
  nilfs2: Switch to use %ptTs
  kdb: Switch to use %ptTs
  lib/vsprintf: Allow to override ISO 8601 date and time separator
2021-06-29 12:07:18 -07:00
Mike Rapoport
43b02ba93b mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
configuration option is equivalent to FLATMEM.

Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.

Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Mike Rapoport
a9ee6cf5c6 mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA
configuration options are equivalent.

Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.

Done with

	$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
		$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
	$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
		$(git grep -wl NEED_MULTIPLE_NODES)

with manual tweaks afterwards.

[rppt@linux.ibm.com: fix arm boot crash]
  Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com

Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Mel Gorman
74f4482209 mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
This introduces a new sysctl vm.percpu_pagelist_high_fraction.  It is
similar to the old vm.percpu_pagelist_fraction.  The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention.  However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled.  This
sysctl only adjusts pcp->high so that zone->lock contention is potentially
reduced but allocation latency during a PCP refill remains the same.

  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=8
  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  35071
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=64
              high:  4383
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=0
              high:  649
              batch: 63

[mgorman@techsingularity.net: fix documentation]
  Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net

Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:55 -07:00
Mel Gorman
bbbecb35a4 mm/page_alloc: delete vm.percpu_pagelist_fraction
Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.

The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock but the sizing of batch and high is archaic and neither takes the
zone size into account or the number of CPUs local to a zone.  With larger
zones and more CPUs per node, the contention is getting worse.
Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch
and high values means that the sysctl can reduce zone lock contention but
also increase allocation latencies.

This series disassociates pcp->high from pcp->batch and then scales
pcp->high based on the size of the local zone with limited impact to
reclaim and accounting for active CPUs but leaves pcp->batch static.  It
also adapts the number of pages that can be on the pcp list based on
recent freeing patterns.

The motivation is partially to adjust to larger memory sizes but is also
driven by the fact that large batches of page freeing via release_pages()
often shows zone contention as a major part of the problem.  Another is a
bug report based on an older kernel where a multi-terabyte process can
takes several minutes to exit.  A workaround was to use
vm.percpu_pagelist_fraction to increase the pcp->high value but testing
indicated that a production workload could not use the same values because
of an increase in allocation latencies.  Unfortunately, I cannot reproduce
this test case myself as the multi-terabyte machines are in active use but
it should alleviate the problem.

The series aims to address both and partially acts as a pre-requisite.
pcp only works with order-0 which is useless for SLUB (when using high
orders) and THP (unconditionally).  To store high-order pages on PCP, the
pcp->high values need to be increased first.

This patch (of 6):

The vm.percpu_pagelist_fraction is used to increase the batch and high
limits for the per-cpu page allocator (PCP).  The intent behind the sysctl
is to reduce zone lock acquisition when allocating/freeing pages but it
has a problem.  While it can decrease contention, it can also increase
latency on the allocation side due to unreasonably large batch sizes.
This leads to games where an administrator adjusts
percpu_pagelist_fraction on the fly to work around contention and
allocation latency problems.

This series aims to alleviate the problems with zone lock contention while
avoiding the allocation-side latency problems.  For the purposes of
review, it's easier to remove this sysctl now and reintroduce a similar
sysctl later in the series that deals only with pcp->high.

Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:54 -07:00
Liam Howlett
9016ddeddf kernel/events/uprobes: use vma_lookup() in find_active_uprobe()
Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Link: https://lkml.kernel.org/r/20210521174745.2219620-17-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:51 -07:00
David Hildenbrand
8fa207525f perf: MAP_EXECUTABLE does not indicate VM_MAYEXEC
Patch series "perf/binfmt/mm: remove in-tree usage of MAP_EXECUTABLE".

Stumbling over the history of MAP_EXECUTABLE, I noticed that we still have
some in-tree users that we can get rid of.

This patch (of 3):

Before commit e9714acf8c ("mm: kill vma flag VM_EXECUTABLE and
mm->num_exe_file_vmas"), VM_EXECUTABLE indicated MAP_EXECUTABLE.
MAP_EXECUTABLE is nowadays essentially ignored by the kernel and does not
relate to VM_MAYEXEC.

Link: https://lkml.kernel.org/r/20210421093453.6904-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210421093453.6904-2-david@redhat.com
Fixes: f972eb63b1 ("perf: Pass protection and flags bits through mmap2 interface")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:50 -07:00
Dan Schatzberg
c74d40e8b5 loop: charge i/o to mem and blk cg
The current code only associates with the existing blkcg when aio is used
to access the backing file.  This patch covers all types of i/o to the
backing file and also associates the memcg so if the backing file is on
tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css and int_active_memcg so it can be
used by the loop module.

Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:50 -07:00
Andrea Arcangeli
a458b76a41 mm: gup: pack has_pinned in MMF_HAS_PINNED
has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop
cleanup.

Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
would reintroduce a loss of SMP scalability to pin-fast, so there's no
future potential usefulness to keep an atomic in the mm for this.

set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
(atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
atomic_set after this commit) has to be still issued only once per "mm",
so the difference between the two will be lost in the noise.

will-it-scale "mmap2" shows no change in performance with enterprise
config as expected.

will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
improvement against upstream as expected.

This is a noop as far as overall performance and SMP scalability are
concerned.

[peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
  Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
[akpm@linux-foundation.org: coding style fixes]
[peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]

Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Wang Qing
b124ac45bd kernel: watchdog: modify the explanation related to watchdog thread
The watchdog thread has been replaced by cpu_stop_work, modify the
explanation related.

Link: https://lkml.kernel.org/r/1619687073-24686-2-git-send-email-wangqing@vivo.com
Signed-off-by: Wang Qing <wangqing@vivo.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@canonical.com>
Cc: Qais Yousef <qais.yousef@arm.com>
Cc: Santosh Sivaraj <santosh@fossix.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:46 -07:00
Petr Mladek
d71ba1649f kthread_worker: fix return value when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
kthread_mod_delayed_work() might race with
kthread_cancel_delayed_work_sync() or another kthread_mod_delayed_work()
call.  The function lets the other operation win when it sees
work->canceling counter set.  And it returns @false.

But it should return @true as it is done by the related workqueue API, see
mod_delayed_work_on().

The reason is that the return value might be used for reference counting.
It has to distinguish the case when the number of queued works has changed
or stayed the same.

The change is safe.  kthread_mod_delayed_work() return value is not
checked anywhere at the moment.

Link: https://lore.kernel.org/r/20210521163526.GA17916@redhat.com
Link: https://lkml.kernel.org/r/20210610133051.15337-4-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@google.com>
Cc: <jenhaochen@google.com>
Cc: Martin Liu <liumartin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:45 -07:00
Steven Rostedt (VMware)
9913d5745b tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
All internal use cases for tracepoint_probe_register() is set to not ever
be called with the same function and data. If it is, it is considered a
bug, as that means the accounting of handling tracepoints is corrupted.
If the function and data for a tracepoint is already registered when
tracepoint_probe_register() is called, it will call WARN_ON_ONCE() and
return with EEXISTS.

The BPF system call can end up calling tracepoint_probe_register() with
the same data, which now means that this can trigger the warning because
of a user space process. As WARN_ON_ONCE() should not be called because
user space called a system call with bad data, there needs to be a way to
register a tracepoint without triggering a warning.

Enter tracepoint_probe_register_may_exist(), which can be called, but will
not cause a WARN_ON() if the probe already exists. It will still error out
with EEXIST, which will then be sent to the user space that performed the
BPF system call.

This keeps the previous testing for issues with other users of the
tracepoint code, while letting BPF call it with duplicated data and not
warn about it.

Link: https://lore.kernel.org/lkml/20210626135845.4080-1-penguin-kernel@I-love.SAKURA.ne.jp/
Link: https://syzkaller.appspot.com/bug?id=41f4318cf01762389f4d1c1c459da4f542fe5153

Cc: stable@vger.kernel.org
Fixes: c4f6699dfc ("bpf: introduce BPF_RAW_TRACEPOINT")
Reported-by: syzbot <syzbot+721aa903751db87aa244@syzkaller.appspotmail.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot+721aa903751db87aa244@syzkaller.appspotmail.com
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-29 11:51:25 -04:00
Rafael J. Wysocki
afe94fb82c Merge branches 'pm-core' and 'pm-sleep'
* pm-core:
  PM: runtime: Clarify documentation when callbacks are unassigned
  PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
  PM: runtime: Improve path in rpm_idle() when no callback
  PM: runtime: document common mistake with pm_runtime_get_sync()

* pm-sleep:
  PM: hibernate: remove leading spaces before tabs
  PM: sleep: remove trailing spaces and tabs
  PM: hibernate: fix spelling mistakes
  PM: wakeirq: Set IRQF_NO_AUTOEN when requesting the IRQ
2021-06-29 15:52:53 +02:00
Petr Mladek
94f2be50ba Merge branch 'printk-rework' into for-linus 2021-06-29 09:53:17 +02:00
Linus Torvalds
c54b245d01 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace rlimit handling update from Eric Biederman:
 "This is the work mainly by Alexey Gladkov to limit rlimits to the
  rlimits of the user that created a user namespace, and to allow users
  to have stricter limits on the resources created within a user
  namespace."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  cred: add missing return error code when set_cred_ucounts() failed
  ucounts: Silence warning in dec_rlimit_ucounts
  ucounts: Set ucount_max to the largest positive value the type can hold
  kselftests: Add test to check for rlimit changes in different user namespaces
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_NPROC on top of ucounts
  Use atomic_t for ucounts reference counting
  Add a reference to ucounts for each cred
  Increase size of ucounts to atomic_long_t
2021-06-28 20:39:26 -07:00
Linus Torvalds
616ea5cc4a Merge tag 'seccomp-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:

 - Add "atomic addfd + send reply" mode to SECCOMP_USER_NOTIF to better
   handle EINTR races visible to seccomp monitors. (Rodrigo Campos,
   Sargun Dhillon)

 - Improve seccomp selftests for readability in CI systems. (Kees Cook)

* tag 'seccomp-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests/seccomp: Avoid using "sysctl" for report
  selftests/seccomp: Flush benchmark output
  selftests/seccomp: More closely track fds being assigned
  selftests/seccomp: Add test for atomic addfd+send
  seccomp: Support atomic "addfd + send reply"
2021-06-28 19:49:37 -07:00
Tanner Love
a358f40600 once: implement DO_ONCE_LITE for non-fast-path "do once" functionality
Certain uses of "do once" functionality reside outside of fast path,
and so do not require jump label patching via static keys, making
existing DO_ONCE undesirable in such cases.

Replace uses of __section(".data.once") with DO_ONCE_LITE(_IF)?

This patch changes the return values of xfs_printk_once, printk_once,
and printk_deferred_once. Before, they returned whether the print was
performed, but now, they always return true. This is okay because the
return values of the following macros are entirely ignored throughout
the kernel:
- xfs_printk_once
- xfs_warn_once
- xfs_notice_once
- xfs_info_once
- printk_once
- pr_emerg_once
- pr_alert_once
- pr_crit_once
- pr_err_once
- pr_warn_once
- pr_notice_once
- pr_info_once
- pr_devel_once
- pr_debug_once
- printk_deferred_once
- orc_warn

Changes
v3:
  - Expand commit message to explain why changing return values of
    xfs_printk_once, printk_once, printk_deferred_once is benign
v2:
  - Fix i386 build warnings

Signed-off-by: Tanner Love <tannerlove@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28 15:54:57 -07:00
David S. Miller
e1289cfb63 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-06-28

The following pull-request contains BPF updates for your *net-next* tree.

We've added 37 non-merge commits during the last 12 day(s) which contain
a total of 56 files changed, 394 insertions(+), 380 deletions(-).

The main changes are:

1) XDP driver RCU cleanups, from Toke Høiland-Jørgensen and Paul E. McKenney.

2) Fix bpf_skb_change_proto() IPv4/v6 GSO handling, from Maciej Żenczykowski.

3) Fix false positive kmemleak report for BPF ringbuf alloc, from Rustam Kovhaev.

4) Fix x86 JIT's extable offset calculation for PROBE_LDX NULL, from Ravi Bangoria.

5) Enable libbpf fallback probing with tracing under RHEL7, from Jonathan Edwards.

6) Clean up x86 JIT to remove unused cnt tracking from EMIT macro, from Jiri Olsa.

7) Netlink cleanups for libbpf to please Coverity, from Kumar Kartikeya Dwivedi.

8) Allow to retrieve ancestor cgroup id in tracing programs, from Namhyung Kim.

9) Fix lirc BPF program query to use user-provided prog_cnt, from Sean Young.

10) Add initial libbpf doc including generated kdoc for its API, from Grant Seltzer.

11) Make xdp_rxq_info_unreg_mem_model() more robust, from Jakub Kicinski.

12) Fix up bpfilter startup log-level to info level, from Gary Lin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28 15:28:03 -07:00
Linus Torvalds
9840cfcb97 Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
 "There's a reasonable amount here and the juicy details are all below.

  It's worth noting that the MTE/KASAN changes strayed outside of our
  usual directories due to core mm changes and some associated changes
  to some other architectures; Andrew asked for us to carry these [1]
  rather that take them via the -mm tree.

  Summary:

   - Optimise SVE switching for CPUs with 128-bit implementations.

   - Fix output format from SVE selftest.

   - Add support for versions v1.2 and 1.3 of the SMC calling
     convention.

   - Allow Pointer Authentication to be configured independently for
     kernel and userspace.

   - PMU driver cleanups for managing IRQ affinity and exposing event
     attributes via sysfs.

   - KASAN optimisations for both hardware tagging (MTE) and out-of-line
     software tagging implementations.

   - Relax frame record alignment requirements to facilitate 8-byte
     alignment with KASAN and Clang.

   - Cleanup of page-table definitions and removal of unused memory
     types.

   - Reduction of ARCH_DMA_MINALIGN back to 64 bytes.

   - Refactoring of our instruction decoding routines and addition of
     some missing encodings.

   - Move entry code moved into C and hardened against harmful compiler
     instrumentation.

   - Update booting requirements for the FEAT_HCX feature, added to v8.7
     of the architecture.

   - Fix resume from idle when pNMI is being used.

   - Additional CPU sanity checks for MTE and preparatory changes for
     systems where not all of the CPUs support 32-bit EL0.

   - Update our kernel string routines to the latest Cortex Strings
     implementation.

   - Big cleanup of our cache maintenance routines, which were
     confusingly named and inconsistent in their implementations.

   - Tweak linker flags so that GDB can understand vmlinux when using
     RELR relocations.

   - Boot path cleanups to enable early initialisation of per-cpu
     operations needed by KCSAN.

   - Non-critical fixes and miscellaneous cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (150 commits)
  arm64: tlb: fix the TTL value of tlb_get_level
  arm64: Restrict undef hook for cpufeature registers
  arm64/mm: Rename ARM64_SWAPPER_USES_SECTION_MAPS
  arm64: insn: avoid circular include dependency
  arm64: smp: Bump debugging information print down to KERN_DEBUG
  drivers/perf: fix the missed ida_simple_remove() in ddr_perf_probe()
  perf/arm-cmn: Fix invalid pointer when access dtc object sharing the same IRQ number
  arm64: suspend: Use cpuidle context helpers in cpu_suspend()
  PSCI: Use cpuidle context helpers in psci_cpu_suspend_enter()
  arm64: Convert cpu_do_idle() to using cpuidle context helpers
  arm64: Add cpuidle context save/restore helpers
  arm64: head: fix code comments in set_cpu_boot_mode_flag
  arm64: mm: drop unused __pa(__idmap_text_start)
  arm64: mm: fix the count comments in compute_indices
  arm64/mm: Fix ttbr0 values stored in struct thread_info for software-pan
  arm64: mm: Pass original fault address to handle_mm_fault()
  arm64/mm: Drop SECTION_[SHIFT|SIZE|MASK]
  arm64/mm: Use CONT_PMD_SHIFT for ARM64_MEMSTART_SHIFT
  arm64/mm: Drop SWAPPER_INIT_MAP_SIZE
  arm64: Conditionally configure PTR_AUTH key of the kernel.
  ...
2021-06-28 14:04:24 -07:00
Ingo Molnar
d2343cb8d1 sched/core: Disable CONFIG_SCHED_CORE by default
This option at minimum adds extra code to the scheduler - even if
it's default unused - and most users wouldn't want it.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-28 22:43:05 +02:00
Rodrigo Campos
0ae71c7720 seccomp: Support atomic "addfd + send reply"
Alban Crequy reported a race condition userspace faces when we want to
add some fds and make the syscall return them[1] using seccomp notify.

The problem is that currently two different ioctl() calls are needed by
the process handling the syscalls (agent) for another userspace process
(target): SECCOMP_IOCTL_NOTIF_ADDFD to allocate the fd and
SECCOMP_IOCTL_NOTIF_SEND to return that value. Therefore, it is possible
for the agent to do the first ioctl to add a file descriptor but the
target is interrupted (EINTR) before the agent does the second ioctl()
call.

This patch adds a flag to the ADDFD ioctl() so it adds the fd and
returns that value atomically to the target program, as suggested by
Kees Cook[2]. This is done by simply allowing
seccomp_do_user_notification() to add the fd and return it in this case.
Therefore, in this case the target wakes up from the wait in
seccomp_do_user_notification() either to interrupt the syscall or to add
the fd and return it.

This "allocate an fd and return" functionality is useful for syscalls
that return a file descriptor only, like connect(2). Other syscalls that
return a file descriptor but not as return value (or return more than
one fd), like socketpair(), pipe(), recvmsg with SCM_RIGHTs, will not
work with this flag.

This effectively combines SECCOMP_IOCTL_NOTIF_ADDFD and
SECCOMP_IOCTL_NOTIF_SEND into an atomic opteration. The notification's
return value, nor error can be set by the user. Upon successful invocation
of the SECCOMP_IOCTL_NOTIF_ADDFD ioctl with the SECCOMP_ADDFD_FLAG_SEND
flag, the notifying process's errno will be 0, and the return value will
be the file descriptor number that was installed.

[1]: https://lore.kernel.org/lkml/CADZs7q4sw71iNHmV8EOOXhUKJMORPzF7thraxZYddTZsxta-KQ@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Tycho Andersen <tycho@tycho.pizza>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210517193908.3113-4-sargun@sargun.me
2021-06-28 12:49:52 -07:00
Linus Torvalds
9269d27e51 Merge tag 'timers-nohz-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timers/nohz updates from Ingo Molnar:

 - Micro-optimize tick_nohz_full_cpu()

 - Optimize idle exit tick restarts to be less eager

 - Optimize tick_nohz_dep_set_task() to only wake up a single CPU.
   This reduces IPIs and interruptions on nohz_full CPUs.

 - Optimize tick_nohz_dep_set_signal() in a similar fashion.

 - Skip IPIs in tick_nohz_kick_task() when trying to kick a
   non-running task.

 - Micro-optimize tick_nohz_task_switch() IRQ flags handling to
   reduce context switching costs.

 - Misc cleanups and fixes

* tag 'timers-nohz-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS: Add myself as context tracking maintainer
  tick/nohz: Call tick_nohz_task_switch() with interrupts disabled
  tick/nohz: Kick only _queued_ task whose tick dependency is updated
  tick/nohz: Change signal tick dependency to wake up CPUs of member tasks
  tick/nohz: Only wake up a single target cpu when kicking a task
  tick/nohz: Update nohz_full Kconfig help
  tick/nohz: Update idle_exittime on actual idle exit
  tick/nohz: Remove superflous check for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
  tick/nohz: Conditionally restart tick on idle exit
  tick/nohz: Evaluate the CPU expression after the static key
2021-06-28 12:22:06 -07:00
Linus Torvalds
54a728dc5e Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler udpates from Ingo Molnar:

 - Changes to core scheduling facilities:

    - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
      coordinated scheduling across SMT siblings. This is a much
      requested feature for cloud computing platforms, to allow the
      flexible utilization of SMT siblings, without exposing untrusted
      domains to information leaks & side channels, plus to ensure more
      deterministic computing performance on SMT systems used by
      heterogenous workloads.

      There are new prctls to set core scheduling groups, which allows
      more flexible management of workloads that can share siblings.

    - Fix task->state access anti-patterns that may result in missed
      wakeups and rename it to ->__state in the process to catch new
      abuses.

 - Load-balancing changes:

    - Tweak newidle_balance for fair-sched, to improve 'memcache'-like
      workloads.

    - "Age" (decay) average idle time, to better track & improve
      workloads such as 'tbench'.

    - Fix & improve energy-aware (EAS) balancing logic & metrics.

    - Fix & improve the uclamp metrics.

    - Fix task migration (taskset) corner case on !CONFIG_CPUSET.

    - Fix RT and deadline utilization tracking across policy changes

    - Introduce a "burstable" CFS controller via cgroups, which allows
      bursty CPU-bound workloads to borrow a bit against their future
      quota to improve overall latencies & batching. Can be tweaked via
      /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.

    - Rework assymetric topology/capacity detection & handling.

 - Scheduler statistics & tooling:

    - Disable delayacct by default, but add a sysctl to enable it at
      runtime if tooling needs it. Use static keys and other
      optimizations to make it more palatable.

    - Use sched_clock() in delayacct, instead of ktime_get_ns().

 - Misc cleanups and fixes.

* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/doc: Update the CPU capacity asymmetry bits
  sched/topology: Rework CPU capacity asymmetry detection
  sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
  psi: Fix race between psi_trigger_create/destroy
  sched/fair: Introduce the burstable CFS controller
  sched/uclamp: Fix uclamp_tg_restrict()
  sched/rt: Fix Deadline utilization tracking during policy change
  sched/rt: Fix RT utilization tracking during policy change
  sched: Change task_struct::state
  sched,arch: Remove unused TASK_STATE offsets
  sched,timer: Use __set_current_state()
  sched: Add get_current_state()
  sched,perf,kvm: Fix preemption condition
  sched: Introduce task_is_running()
  sched: Unbreak wakeups
  sched/fair: Age the average idle time
  sched/cpufreq: Consider reduced CPU capacity in energy calculation
  sched/fair: Take thermal pressure into account while estimating energy
  thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
  sched/fair: Return early from update_tg_cfs_load() if delta == 0
  ...
2021-06-28 12:14:19 -07:00
Linus Torvalds
28a27cbd86 Merge tag 'perf-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:

 - Platform PMU driver updates:

     - x86 Intel uncore driver updates for Skylake (SNR) and Icelake (ICX) servers
     - Fix RDPMC support
     - Fix [extended-]PEBS-via-PT support
     - Fix Sapphire Rapids event constraints
     - Fix :ppp support on Sapphire Rapids
     - Fix fixed counter sanity check on Alder Lake & X86_FEATURE_HYBRID_CPU
     - Other heterogenous-PMU fixes

 - Kprobes:

     - Remove the unused and misguided kprobe::fault_handler callbacks.
     - Warn about kprobes taking a page fault.
     - Fix the 'nmissed' stat counter.

 - Misc cleanups and fixes.

* tag 'perf-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix task context PMU for Hetero
  perf/x86/intel: Fix instructions:ppp support in Sapphire Rapids
  perf/x86/intel: Add more events requires FRONTEND MSR on Sapphire Rapids
  perf/x86/intel: Fix fixed counter check warning for some Alder Lake
  perf/x86/intel: Fix PEBS-via-PT reload base value for Extended PEBS
  perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
  kprobes: Do not increment probe miss count in the fault handler
  x86,kprobes: WARN if kprobes tries to handle a fault
  kprobes: Remove kprobe::fault_handler
  uprobes: Update uprobe_write_opcode() kernel-doc comment
  perf/hw_breakpoint: Fix DocBook warnings in perf hw_breakpoint
  perf/core: Fix DocBook warnings
  perf/core: Make local function perf_pmu_snapshot_aux() static
  perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on ICX
  perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on SNR
  perf/x86/intel/uncore: Generalize I/O stacks to PMON mapping procedure
  perf/x86/intel/uncore: Drop unnecessary NULL checks after container_of()
2021-06-28 12:03:20 -07:00
Linus Torvalds
a15286c63d Merge tag 'locking-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:

 - Core locking & atomics:

     - Convert all architectures to ARCH_ATOMIC: move every architecture
       to ARCH_ATOMIC, then get rid of ARCH_ATOMIC and all the
       transitory facilities and #ifdefs.

       Much reduction in complexity from that series:

           63 files changed, 756 insertions(+), 4094 deletions(-)

     - Self-test enhancements

 - Futexes:

     - Add the new FUTEX_LOCK_PI2 ABI, which is a variant that doesn't
       set FLAGS_CLOCKRT (.e. uses CLOCK_MONOTONIC).

       [ The temptation to repurpose FUTEX_LOCK_PI's implicit setting of
         FLAGS_CLOCKRT & invert the flag's meaning to avoid having to
         introduce a new variant was resisted successfully. ]

     - Enhance futex self-tests

 - Lockdep:

     - Fix dependency path printouts

     - Optimize trace saving

     - Broaden & fix wait-context checks

 - Misc cleanups and fixes.

* tag 'locking-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  locking/lockdep: Correct the description error for check_redundant()
  futex: Provide FUTEX_LOCK_PI2 to support clock selection
  futex: Prepare futex_lock_pi() for runtime clock selection
  lockdep/selftest: Remove wait-type RCU_CALLBACK tests
  lockdep/selftests: Fix selftests vs PROVE_RAW_LOCK_NESTING
  lockdep: Fix wait-type for empty stack
  locking/selftests: Add a selftest for check_irq_usage()
  lockding/lockdep: Avoid to find wrong lock dep path in check_irq_usage()
  locking/lockdep: Remove the unnecessary trace saving
  locking/lockdep: Fix the dep path printing for backwards BFS
  selftests: futex: Add futex compare requeue test
  selftests: futex: Add futex wait test
  seqlock: Remove trailing semicolon in macros
  locking/lockdep: Reduce LOCKDEP dependency list
  locking/lockdep,doc: Improve readability of the block matrix
  locking/atomics: atomic-instrumented: simplify ifdeffery
  locking/atomic: delete !ARCH_ATOMIC remnants
  locking/atomic: xtensa: move to ARCH_ATOMIC
  locking/atomic: sparc: move to ARCH_ATOMIC
  locking/atomic: sh: move to ARCH_ATOMIC
  ...
2021-06-28 11:45:29 -07:00
Linus Torvalds
b89c07dea1 Merge tags 'objtool-urgent-2021-06-28' and 'objtool-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fix and updates from Ingo Molnar:
 "An ELF format fix for a section flags mismatch bug that breaks kernel
  tooling such as kpatch-build.

  The biggest change in this cycle is the new code to handle and rewrite
  variable sized jump labels - which results in slightly tighter code
  generation in hot paths, through the use of short(er) NOPs.

  Also a number of cleanups and fixes, and a change to the generic
  include/linux/compiler.h to handle a s390 GCC quirk"

* tag 'objtool-urgent-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Don't make .altinstructions writable

* tag 'objtool-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Improve reloc hash size guestimate
  instrumentation.h: Avoid using inline asm operand modifiers
  compiler.h: Avoid using inline asm operand modifiers
  kbuild: Fix objtool dependency for 'OBJECT_FILES_NON_STANDARD_<obj> := n'
  objtool: Reflow handle_jump_alt()
  jump_label/x86: Remove unused JUMP_LABEL_NOP_SIZE
  jump_label, x86: Allow short NOPs
  objtool: Provide stats for jump_labels
  objtool: Rewrite jump_label instructions
  objtool: Decode jump_entry::key addend
  jump_label, x86: Emit short JMP
  jump_label: Free jump_entry::key bit1 for build use
  jump_label, x86: Add variable length patching support
  jump_label, x86: Introduce jump_entry_size()
  jump_label, x86: Improve error when we fail expected text
  jump_label, x86: Factor out the __jump_table generation
  jump_label, x86: Strip ASM jump_label support
  x86, objtool: Dont exclude arch/x86/realmode/
  objtool: Rewrite hashtable sizing
2021-06-28 11:35:55 -07:00
Daniel Bristot de Oliveira
498627b4ac trace/osnoise: Fix return value on osnoise_init_hotplug_support
kernel test robot reported:

  >> kernel/trace/trace_osnoise.c:1584:2: error: void function
  'osnoise_init_hotplug_support' should not return a
  value [-Wreturn-type]
           return 0;

When !CONFIG_HOTPLUG_CPU.

Fix it problem by removing the return value.

Link: https://lkml.kernel.org/r/c7fc67f1a117cc88bab2e508c898634872795341.1624872608.git.bristot@redhat.com

Fixes: c8895e271f ("trace/osnoise: Support hotplug operations")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-28 14:12:27 -04:00
Daniel Bristot de Oliveira
2a81afa326 trace/osnoise: Make interval u64 on osnoise_main
kernel test robot reported:

  >> kernel/trace/trace_osnoise.c:966:3: warning: comparison of distinct
     pointer types ('typeof ((interval)) *' (aka 'long long *') and
     'uint64_t *' (aka 'unsigned long long *'))
     [-Wcompare-distinct-pointer-types]
                   do_div(interval, USEC_PER_MSEC);
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/asm-generic/div64.h:228:28: note: expanded from macro 'do_div'
           (void)(((typeof((n)) *)0) == ((uint64_t *)0));  \
                  ~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~~

As interval cannot be negative because sample_period >= sample_runtime,
making interval u64 on osnoise_main() is enough to fix this problem.

Link: https://lkml.kernel.org/r/4ae1e7780563598563de079a3ef6d4d10b5f5546.1624872608.git.bristot@redhat.com

Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-28 14:12:26 -04:00
Daniel Bristot de Oliveira
f7d9f6370e trace/osnoise: Fix 'no previous prototype' warnings
kernel test robot reported some osnoise functions with "no previous
prototype."

Fix these warnings by making local functions static, and by adding:

 void osnoise_trace_irq_entry(int id);
 void osnoise_trace_irq_exit(int id, const char *desc);

to include/linux/trace.h.

Link: https://lkml.kernel.org/r/e40d3cb4be8bde921f4b40fa6a095cf85ab807bd.1624872608.git.bristot@redhat.com

Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-28 14:12:26 -04:00
Steven Rostedt (VMware)
b96285e10a tracing: Have osnoise_main() add a quiescent state for task rcu
ftracetest triggered:

 INFO: rcu_tasks detected stalls on tasks:
 00000000b92b832d: .. nvcsw: 1/1 holdout: 1 idle_cpu: -1/7
 task:osnoise/7       state:R  running task     stack:    0 pid: 2133 ppid:     2 flags:0x00004000
 Call Trace:
  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
  ? trace_hardirqs_on+0x2b/0xe0
  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
  ? trace_clock_local+0xc/0x20
  ? osnoise_main+0x10e/0x450
  ? trace_softirq_entry_callback+0x50/0x50
  ? kthread+0x153/0x170
  ? __kthread_bind_mask+0x60/0x60
  ? ret_from_fork+0x22/0x30

While running osnoise tracer with other tracers that rely on
synchronize_rcu_tasks(), where that just hung.

The reason is that osnoise_main() never schedules out if the interval
is less than 1, and this will cause synchronize_rcu_tasks() to never
return.

Link: https://lkml.kernel.org/r/20210628114953.6dc06a91@oasis.local.home

Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-28 14:11:41 -04:00
Linus Torvalds
c10383b3fb Merge tag 'regulator-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator updates from Mark Brown:
 "The main core change this release is generic support for handling of
  hardware errors from Matti Vaittinen, including some small updates to
  the reboot and thermal code so we can share support for powering off
  the system if things are going wrong enough.

  Otherwise this release we've mainly seen the addition of new drivers,
  including MT6359 which has pulled in some small changes from the MFD
  tree for build dependencies.

   - Support for controlling the trigger points for hardware error
     detection, and shared handlers for this.

   - Support for Maxim MAX8993, Mediatek MT6359 and MT6359P, Qualcomm
     PM8226 and SA8115P-ADP, and Sylergy TCS4526"

* tag 'regulator-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (91 commits)
  regulator: bd9576: Fix uninitializes variable may_have_irqs
  regulator: max8893: Select REGMAP_I2C to fix build error
  regulator: da9052: Ensure enough delay time for .set_voltage_time_sel
  regulator: mt6358: Fix vdram2 .vsel_mask
  regulator: hi6421v600: Fix setting wrong driver_data
  MAINTAINERS: Add reviewer for regulator irq_helpers
  regulator: bd9576: Fix the driver name in id table
  regulator: bd9576: Support error reporting
  regulator: bd9576 add FET ON-resistance for OCW
  regulator: add property parsing and callbacks to set protection limits
  regulator: IRQ based event/error notification helpers
  regulator: move rdev_print helpers to internal.h
  regulator: add warning flags
  thermal: Use generic HW-protection shutdown API
  reboot: Add hardware protection power-off
  regulator: Add protection limit properties
  regulator: hi6421v600: Fix setting idle mode
  regulator: Add MAX8893 bindings
  regulator: max8893: add regulator driver
  regulator: hi6421: Use correct variable type for regmap api val argument
  ...
2021-06-28 11:06:10 -07:00
Rustam Kovhaev
ccff81e1d0 bpf: Fix false positive kmemleak report in bpf_ringbuf_area_alloc()
kmemleak scans struct page, but it does not scan the page content. If we
allocate some memory with kmalloc(), then allocate page with alloc_page(),
and if we put kmalloc pointer somewhere inside that page, kmemleak will
report kmalloc pointer as a false positive.

We can instruct kmemleak to scan the memory area by calling kmemleak_alloc()
and kmemleak_free(), but part of struct bpf_ringbuf is mmaped to user space,
and if struct bpf_ringbuf changes we would have to revisit and review size
argument in kmemleak_alloc(), because we do not want kmemleak to scan the
user space memory. Let's simplify things and use kmemleak_not_leak() here.

For posterity, also adding additional prior analysis from Andrii:

  I think either kmemleak or syzbot are misreporting this. I've added a
  bunch of printks around all allocations performed by BPF ringbuf. [...]
  On repro side I get these two warnings:

  [vmuser@archvm bpf]$ sudo ./repro
  BUG: memory leak
  unreferenced object 0xffff88810d538c00 (size 64):
    comm "repro", pid 2140, jiffies 4294692933 (age 14.540s)
    hex dump (first 32 bytes):
      00 af 19 04 00 ea ff ff c0 ae 19 04 00 ea ff ff  ................
      80 ae 19 04 00 ea ff ff c0 29 2e 04 00 ea ff ff  .........)......
    backtrace:
      [<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
      [<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
      [<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
      [<00000000f601d565>] do_syscall_64+0x2d/0x40
      [<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae

  BUG: memory leak
  unreferenced object 0xffff88810d538c80 (size 64):
    comm "repro", pid 2143, jiffies 4294699025 (age 8.448s)
    hex dump (first 32 bytes):
      80 aa 19 04 00 ea ff ff 00 ab 19 04 00 ea ff ff  ................
      c0 ab 19 04 00 ea ff ff 80 44 28 04 00 ea ff ff  .........D(.....
    backtrace:
      [<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
      [<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
      [<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
      [<00000000f601d565>] do_syscall_64+0x2d/0x40
      [<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae

  Note that both reported leaks (ffff88810d538c80 and ffff88810d538c00)
  correspond to pages array bpf_ringbuf is allocating and tracking properly
  internally. Note also that syzbot repro doesn't close FD of created BPF
  ringbufs, and even when ./repro itself exits with error, there are still
  two forked processes hanging around in my system. So clearly ringbuf maps
  are alive at that point. So reporting any memory leak looks weird at that
  point, because that memory is being used by active referenced BPF ringbuf.

  It's also a question why repro doesn't clean up its forks. But if I do a
  `pkill repro`, I do see that all the allocated memory is /properly/ cleaned
  up [and the] "leaks" are deallocated properly.

  BTW, if I add close() right after bpf() syscall in syzbot repro, I see that
  everything is immediately deallocated, like designed. And no memory leak
  is reported. So I don't think the problem is anywhere in bpf_ringbuf code,
  rather in the leak detection and/or repro itself.

Reported-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
[ Daniel: also included analysis from Andrii to the commit log ]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+5d895828587f49e7fe9b@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzYk+dqs+jwu6VKXP-RttcTEGFe+ySTGWT9CRNkagDiJVA@mail.gmail.com
Link: https://lore.kernel.org/lkml/YNTAqiE7CWJhOK2M@nuc10
Link: https://lore.kernel.org/lkml/20210615101515.GC26027@arm.com
Link: https://syzkaller.appspot.com/bug?extid=5d895828587f49e7fe9b
Link: https://lore.kernel.org/bpf/20210626181156.1873604-1-rkovhaev@gmail.com
2021-06-28 15:57:46 +02:00
Namhyung Kim
95b861a793 bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing
Allow the helper to be called from tracing programs. This is needed to
handle cgroup hiererachies in the program.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210627153627.824198-1-namhyung@kernel.org
2021-06-28 15:43:02 +02:00
Odin Ugedal
1c35b07e6d sched/fair: Ensure _sum and _avg values stay consistent
The _sum and _avg values are in general sync together with the PELT
divider. They are however not always completely in perfect sync,
resulting in situations where _sum gets to zero while _avg stays
positive. Such situations are undesirable.

This comes from the fact that PELT will increase period_contrib, also
increasing the PELT divider, without updating _sum and _avg values to
stay in perfect sync where (_sum == _avg * divider). However, such PELT
change will never lower _sum, making it impossible to end up in a
situation where _sum is zero and _avg is not.

Therefore, we need to ensure that when subtracting load outside PELT,
that when _sum is zero, _avg is also set to zero. This occurs when
(_sum < _avg * divider), and the subtracted (_avg * divider) is bigger
or equal to the current _sum, while the subtracted _avg is smaller than
the current _avg.

Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/20210624111815.57937-1-odin@uged.al
2021-06-28 15:42:24 +02:00
Randy Dunlap
0e3c1f30b0 genirq/irqdesc: Drop excess kernel-doc entry @lookup
Fix kernel-doc warning in irqdesc.c:

../kernel/irq/irqdesc.c:692: warning: Excess function parameter 'lookup' description in 'handle_domain_irq'

Fixes: e1c054918c ("genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210628004044.9011-1-rdunlap@infradead.org
2021-06-28 11:33:32 +01:00
Thomas Gleixner
3d2ce675ab Merge tag 'irqchip-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier:

 - Revamped the irqdomain internals to consistently cache irqdata

 - Expose a new API to simplify IRQ handling involving an irqdomain by
   not using the IRQ number

 - Convert all the irqchip drivers to this new API

 - Allow the Qualcomm PDC driver to be compiled as a module

 - Fix HiSi MBIGEN compile warning when CONFIG_ACPI isn't selected

 - Remove a bunch of spurious printks on error paths

 - The obligatory couple of DT updates
2021-06-28 11:55:20 +02:00
Thomas Gleixner
2d0a9eb23c time/kunit: Add missing MODULE_LICENSE()
[ mingo: MODULE_LICENSE() takes a string. ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-28 07:40:23 +02:00
Linus Torvalds
b4b27b9eed Revert "signal: Allow tasks to cache one sigqueue struct"
This reverts commits 4bad58ebc8 (and
399f8dd9a8, which tried to fix it).

I do not believe these are correct, and I'm about to release 5.13, so am
reverting them out of an abundance of caution.

The locking is odd, and appears broken.

On the allocation side (in __sigqueue_alloc()), the locking is somewhat
straightforward: it depends on sighand->siglock.  Since one caller
doesn't hold that lock, it further then tests 'sigqueue_flags' to avoid
the case with no locks held.

On the freeing side (in sigqueue_cache_or_free()), there is no locking
at all, and the logic instead depends on 'current' being a single
thread, and not able to race with itself.

To make things more exciting, there's also the data race between freeing
a signal and allocating one, which is handled by using WRITE_ONCE() and
READ_ONCE(), and being mutually exclusive wrt the initial state (ie
freeing will only free if the old state was NULL, while allocating will
obviously only use the value if it was non-NULL, so only one or the
other will actually act on the value).

However, while the free->alloc paths do seem mutually exclusive thanks
to just the data value dependency, it's not clear what the memory
ordering constraints are on it.  Could writes from the previous
allocation possibly be delayed and seen by the new allocation later,
causing logical inconsistencies?

So it's all very exciting and unusual.

And in particular, it seems that the freeing side is incorrect in
depending on "current" being single-threaded.  Yes, 'current' is a
single thread, but in the presense of asynchronous events even a single
thread can have data races.

And such asynchronous events can and do happen, with interrupts causing
signals to be flushed and thus free'd (for example - sending a
SIGCONT/SIGSTOP can happen from interrupt context, and can flush
previously queued process control signals).

So regardless of all the other questions about the memory ordering and
locking for this new cached allocation, the sigqueue_cache_or_free()
assumptions seem to be fundamentally incorrect.

It may be that people will show me the errors of my ways, and tell me
why this is all safe after all.  We can reinstate it if so.  But my
current belief is that the WRITE_ONCE() that sets the cached entry needs
to be a smp_store_release(), and the READ_ONCE() that finds a cached
entry needs to be a smp_load_acquire() to handle memory ordering
correctly.

And the sequence in sigqueue_cache_or_free() would need to either use a
lock or at least be interrupt-safe some way (perhaps by using something
like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the
percpu operations it needs to be interrupt-safe).

Fixes: 399f8dd9a8 ("signal: Prevent sigqueue caching after task got released")
Fixes: 4bad58ebc8 ("signal: Allow tasks to cache one sigqueue struct")
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-27 13:32:54 -07:00
Daniel Bristot de Oliveira
c8895e271f trace/osnoise: Support hotplug operations
Enable and disable osnoise/timerlat thread during on CPU hotplug online
and offline operations respectivelly.

Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/
Link: https://lkml.kernel.org/r/39f98590b3caeb3c32f09526214058efe0e9272a.1624372313.git.bristot@redhat.com

Cc: Phil Auld <pauld@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kate Carcia <kcarcia@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Willaims <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-25 19:57:24 -04:00
Daniel Bristot de Oliveira
ba998f7d95 trace/hwlat: Support hotplug operations
Enable and disable hwlat thread during cpu hotplug online
and offline operations, respectivelly.

Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/
Link: https://lkml.kernel.org/r/52012d25ea35491a0f8088b947864d8df8e25157.1624372313.git.bristot@redhat.com

Cc: Phil Auld <pauld@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kate Carcia <kcarcia@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Willaims <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-25 19:57:24 -04:00
Daniel Bristot de Oliveira
039a602db3 trace/hwlat: Protect kdata->kthread with get/put_online_cpus
In preparation to the hotplug support, protect kdata->kthread
with get/put_online_cpus() to avoid concurrency with hotplug
operations.

Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/
Link: https://lkml.kernel.org/r/8bdb2a56f46abfd301d6fffbf43448380c09a6f5.1624372313.git.bristot@redhat.com

Cc: Phil Auld <pauld@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kate Carcia <kcarcia@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Willaims <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-06-25 19:57:24 -04:00