linux

mirror of https://github.com/raspberrypi/linux.git synced 2025-12-25 03:22:39 +00:00

Author	SHA1	Message	Date
Christoph Hellwig	3d0ed0e4fa	dma-mapping: check pfn validity in dma_common_{mmap,get_sgtable} [ Upstream commit `66d7780f18` ] Check that the pfn returned from arch_dma_coherent_to_pfn refers to a valid page and reject the mmap / get_sgtable requests otherwise. Based on the arm implementation of the mmap and get_sgtable methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Vignesh Raghavendra <vigneshr@ti.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-08-25 10:10:25 -04:00
Viresh Kumar	4b837b7923	cpufreq: schedutil: Don't skip freq update when limits change commit `600f5badb7` upstream. To avoid reducing the frequency of a CPU prematurely, we skip reducing the frequency if the CPU had been busy recently. This should not be done when the limits of the policy are changed, for example due to thermal throttling. We should always get the frequency within the new limits as soon as possible. Trying to fix this by using only one flag, i.e. need_freq_update, can lead to a race condition where the flag gets cleared without forcing us to change the frequency at least once. And so this patch introduces another flag to avoid that race condition. Fixes: `ecd2884291` ("cpufreq: schedutil: Don't set next_freq to UINT_MAX") Cc: v4.18+ <stable@vger.kernel.org> # v4.18+ Reported-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-08-25 10:10:21 -04:00
Leonard Crestez	5ca37bfa8c	perf/core: Fix creating kernel counters for PMUs that override event->cpu [ Upstream commit `4ce54af8b3` ] Some hardware PMU drivers will override perf_event.cpu inside their event_init callback. This causes a lockdep splat when initialized through the kernel API: WARNING: CPU: 0 PID: 250 at kernel/events/core.c:2917 ctx_sched_out+0x78/0x208 pc : ctx_sched_out+0x78/0x208 Call trace: ctx_sched_out+0x78/0x208 __perf_install_in_context+0x160/0x248 remote_function+0x58/0x68 generic_exec_single+0x100/0x180 smp_call_function_single+0x174/0x1b8 perf_install_in_context+0x178/0x188 perf_event_create_kernel_counter+0x118/0x160 Fix this by calling perf_install_in_context with event->cpu, just like perf_event_open Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Frank Li <Frank.li@nxp.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Link: https://lkml.kernel.org/r/c4ebe0503623066896d7046def4d6b1e06e0eb2e.1563972056.git.leonard.crestez@nxp.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-08-16 10:11:07 +02:00
Ming Lei	bb3db40acb	genirq/affinity: Create affinity mask for single vector commit `491beed3b1` upstream. Since commit `c66d4bd110` ("genirq/affinity: Add new callback for (re)calculating interrupt sets"), irq_create_affinity_masks() returns NULL in case of single vector. This change has caused regression on some drivers, such as lpfc. The problem is that single vector requests can happen in some generic cases: 1) kdump kernel 2) irq vectors resource is close to exhaustion. If in that situation the affinity mask for a single vector is not created, every caller has to handle the special case. There is no reason why the mask cannot be created, so remove the check for a single vector and create the mask. Fixes: `c66d4bd110` ("genirq/affinity: Add new callback for (re)calculating interrupt sets") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190805011906.5020-1-ming.lei@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-08-16 10:10:56 +02:00
Changbin Du	e4287f51fc	fgraph: Remove redundant ftrace_graph_notrace_addr() test commit `6c77221df9` upstream. We already have tested it before. The second one should be removed. With this change, the performance should have little improvement. Link: http://lkml.kernel.org/r/20190730140850.7927-1-changbin.du@gmail.com Cc: stable@vger.kernel.org Fixes: `9cd2992f2d` ("fgraph: Have set_graph_notrace only affect function_graph tracer") Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-08-06 19:08:15 +02:00
Josh Poimboeuf	33260ae248	bpf: Disable GCC -fgcse optimization for ___bpf_prog_run() [ Upstream commit `3193c0836f` ] On x86-64, with CONFIG_RETPOLINE=n, GCC's "global common subexpression elimination" optimization results in ___bpf_prog_run()'s jumptable code changing from this: select_insn: jmp jumptable(, %rax, 8) ... ALU64_ADD_X: ... jmp jumptable(, %rax, 8) ALU_ADD_X: ... jmp jumptable(, %rax, 8) to this: select_insn: mov jumptable, %r12 jmp (%r12, %rax, 8) ... ALU64_ADD_X: ... jmp (%r12, %rax, 8) ALU_ADD_X: ... jmp (%r12, %rax, 8) The jumptable address is placed in a register once, at the beginning of the function. The function execution can then go through multiple indirect jumps which rely on that same register value. This has a few issues: 1) Objtool isn't smart enough to be able to track such a register value across multiple recursive indirect jumps through the jump table. 2) With CONFIG_RETPOLINE enabled, this optimization actually results in a small slowdown. I measured a ~4.7% slowdown in the test_bpf "tcpdump port 22" selftest. This slowdown is actually predicted by the GCC manual: Note: When compiling a program using computed gotos, a GCC extension, you may get better run-time performance if you disable the global common subexpression elimination pass by adding -fno-gcse to the command line. So just disable the optimization for this function. Fixes: `e55a73251d` ("bpf: Fix ORC unwinding in non-JIT BPF code") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/30c3ca29ba037afcbd860a8672eef0021addf9fe.1563413318.git.jpoimboe@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-08-06 19:08:14 +02:00
Peter Zijlstra	1b3ecd64aa	stacktrace: Force USER_DS for stack_trace_save_user() [ Upstream commit `cac9b9a4b0` ] When walking userspace stacks, USER_DS needs to be set, otherwise access_ok() will not function as expected. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Eiichi Tsukata <devel@etsukata.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lkml.kernel.org/r/20190718085754.GM3402@hirez.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-08-06 19:08:13 +02:00
Andrii Nakryiko	a381e158c7	bpf: fix BTF verifier size resolution logic [ Upstream commit `1acc5d5c58` ] BTF verifier has a size resolution bug which in some circumstances leads to invalid size resolution for, e.g., TYPEDEF modifier. This happens if we have [1] PTR -> [2] TYPEDEF -> [3] ARRAY, in which case due to being in pointer context ARRAY size won't be resolved (because for pointer it doesn't matter, so it's a sink in pointer context), but it will be permanently remembered as zero for TYPEDEF and TYPEDEF will be marked as RESOLVED. Eventually ARRAY size will be resolved correctly, but TYPEDEF resolved_size won't be updated anymore. This, subsequently, will lead to erroneous map creation failure, if that TYPEDEF is specified as either key or value, as key_size/value_size won't correspond to resolved size of TYPEDEF (kernel will believe it's zero). Note, that if BTF was ordered as [1] ARRAY <- [2] TYPEDEF <- [3] PTR, this won't be a problem, as by the time we get to TYPEDEF, ARRAY's size is already calculated and stored. This bug manifests itself in rejecting BTF-defined maps that use array typedef as a value type: typedef int array_t[16]; struct { __uint(type, BPF_MAP_TYPE_ARRAY); __type(value, array_t); /* i.e., array_t value; / } test_map SEC(".maps"); The fix consists on not relying on modifier's resolved_size and instead using modifier's resolved_id (type ID for "concrete" type to which modifier eventually resolves) and doing size determination for that resolved type. This allow to preserve existing "early DFS termination" logic for PTR or STRUCT_OR_ARRAY contexts, but still do correct size determination for modifier types. Fixes: `eb3f595dab` ("bpf: btf: Validate type reference") Cc: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-08-06 19:08:10 +02:00
Arnd Bergmann	aaf15ffc6b	swiotlb: fix phys_addr_t overflow warning [ Upstream commit `9c106119f6` ] On architectures that have a larger dma_addr_t than phys_addr_t, the swiotlb_tbl_map_single() function truncates its return code in the failure path, making it impossible to identify the error later, as we compare to the original value: kernel/dma/swiotlb.c:551:9: error: implicit conversion from 'dma_addr_t' (aka 'unsigned long long') to 'phys_addr_t' (aka 'unsigned int') changes value from 18446744073709551615 to 4294967295 [-Werror,-Wconstant-conversion] return DMA_MAPPING_ERROR; Use an explicit typecast here to convert it to the narrower type, and use the same expression in the error handling later. Fixes: `b907e20508` ("swiotlb: remove SWIOTLB_MAP_ERROR") Acked-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-08-06 19:08:05 +02:00
Prarit Bhargava	768cb58039	kernel/module.c: Only return -EEXIST for modules that have finished loading [ Upstream commit `6e6de3dee5` ] Microsoft HyperV disables the X86_FEATURE_SMCA bit on AMD systems, and linux guests boot with repeated errors: amd64_edac_mod: Unknown symbol amd_unregister_ecc_decoder (err -2) amd64_edac_mod: Unknown symbol amd_register_ecc_decoder (err -2) amd64_edac_mod: Unknown symbol amd_report_gart_errors (err -2) amd64_edac_mod: Unknown symbol amd_unregister_ecc_decoder (err -2) amd64_edac_mod: Unknown symbol amd_register_ecc_decoder (err -2) amd64_edac_mod: Unknown symbol amd_report_gart_errors (err -2) The warnings occur because the module code erroneously returns -EEXIST for modules that have failed to load and are in the process of being removed from the module list. module amd64_edac_mod has a dependency on module edac_mce_amd. Using modules.dep, systemd will load edac_mce_amd for every request of amd64_edac_mod. When the edac_mce_amd module loads, the module has state MODULE_STATE_UNFORMED and once the module load fails and the state becomes MODULE_STATE_GOING. Another request for edac_mce_amd module executes and add_unformed_module() will erroneously return -EEXIST even though the previous instance of edac_mce_amd has MODULE_STATE_GOING. Upon receiving -EEXIST, systemd attempts to load amd64_edac_mod, which fails because of unknown symbols from edac_mce_amd. add_unformed_module() must wait to return for any case other than MODULE_STATE_LIVE to prevent a race between multiple loads of dependent modules. Signed-off-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Barret Rhoden <brho@google.com> Cc: David Arcari <darcari@redhat.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-08-06 19:08:04 +02:00
Cheng Jian	004e0c5d77	ftrace: Enable trampoline when rec count returns back to one [ Upstream commit `a124692b69` ] Custom trampolines can only be enabled if there is only a single ops attached to it. If there's only a single callback registered to a function, and the ops has a trampoline registered for it, then we can call the trampoline directly. This is very useful for improving the performance of ftrace and livepatch. If more than one callback is registered to a function, the general trampoline is used, and the custom trampoline is not restored back to the direct call even if all the other callbacks were unregistered and we are back to one callback for the function. To fix this, set FTRACE_FL_TRAMP flag if rec count is decremented to one, and the ops that left has a trampoline. Testing After this patch : insmod livepatch_unshare_files.ko cat /sys/kernel/debug/tracing/enabled_functions unshare_files (1) R I tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0 echo unshare_files > /sys/kernel/debug/tracing/set_ftrace_filter echo function > /sys/kernel/debug/tracing/current_tracer cat /sys/kernel/debug/tracing/enabled_functions unshare_files (2) R I ->ftrace_ops_list_func+0x0/0x150 echo nop > /sys/kernel/debug/tracing/current_tracer cat /sys/kernel/debug/tracing/enabled_functions unshare_files (1) R I tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0 Link: http://lkml.kernel.org/r/1556969979-111047-1-git-send-email-cj.chengjian@huawei.com Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-08-06 19:08:02 +02:00
Jann Horn	5344904c68	sched/fair: Use RCU accessors consistently for ->numa_group commit `cb361d8cde` upstream. The old code used RCU annotations and accessors inconsistently for ->numa_group, which can lead to use-after-frees and NULL dereferences. Let all accesses to ->numa_group use proper RCU helpers to prevent such issues. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: `8c8a743c50` ("sched/numa: Use {cpu, pid} to create task groups for shared faults") Link: https://lkml.kernel.org/r/20190716152047.14424-3-jannh@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-08-04 09:29:39 +02:00
Jann Horn	8004fa12d0	sched/fair: Don't free p->numa_faults with concurrent readers commit `16d51a590a` upstream. When going through execve(), zero out the NUMA fault statistics instead of freeing them. During execve, the task is reachable through procfs and the scheduler. A concurrent /proc/*/sched reader can read data from a freed ->numa_faults allocation (confirmed by KASAN) and write it back to userspace. I believe that it would also be possible for a use-after-free read to occur through a race between a NUMA fault and execve(): task_numa_fault() can lead to task_numa_compare(), which invokes task_weight() on the currently running task of a different CPU. Another way to fix this would be to make ->numa_faults RCU-managed or add extra locking, but it seems easier to wipe the NUMA fault statistics on execve. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: `82727018b0` ("sched/numa: Call task_numa_free() from do_execve()") Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-08-04 09:29:39 +02:00
Stanislav Fomichev	985114b174	bpf: fix NULL deref in btf_type_is_resolve_source_only commit `e4f0712021` upstream. Commit `1dc9285184` ("bpf: kernel side support for BTF Var and DataSec") added invocations of btf_type_is_resolve_source_only before btf_type_nosize_or_null which checks for the NULL pointer. Swap the order of btf_type_nosize_or_null and btf_type_is_resolve_source_only to make sure the do the NULL pointer check first. Fixes: `1dc9285184` ("bpf: kernel side support for BTF Var and DataSec") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-08-04 09:29:36 +02:00
Linus Torvalds	7e37ded00e	access: avoid the RCU grace period for the temporary subjective credentials commit `d7852fbd0f` upstream. It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU work because it installs a temporary credential that gets allocated and freed for each system call. The allocation and freeing overhead is mostly benign, but because credentials can be accessed under the RCU read lock, the freeing involves a RCU grace period. Which is not a huge deal normally, but if you have a lot of access() calls, this causes a fair amount of seconday damage: instead of having a nice alloc/free patterns that hits in hot per-CPU slab caches, you have all those delayed free's, and on big machines with hundreds of cores, the RCU overhead can end up being enormous. But it turns out that all of this is entirely unnecessary. Exactly because access() only installs the credential as the thread-local subjective credential, the temporary cred pointer doesn't actually need to be RCU free'd at all. Once we're done using it, we can just free it synchronously and avoid all the RCU overhead. So add a 'non_rcu' flag to 'struct cred', which can be set by users that know they only use it in non-RCU context (there are other potential users for this). We can make it a union with the rcu freeing list head that we need for the RCU case, so this doesn't need any extra storage. Note that this also makes 'get_current_cred()' clear the new non_rcu flag, in case we have filesystems that take a long-term reference to the cred and then expect the RCU delayed freeing afterwards. It's not entirely clear that this is required, but it makes for clear semantics: the subjective cred remains non-RCU as long as you only access it synchronously using the thread-local accessors, but you _can_ use it as a generic cred if you want to. It is possible that we should just remove the whole RCU markings for ->cred entirely. Only ->real_cred is really supposed to be accessed through RCU, and the long-term cred copies that nfs uses might want to explicitly re-enable RCU freeing if required, rather than have get_current_cred() do it implicitly. But this is a "minimal semantic changes" change for the immediate problem. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jan Glauber <jglauber@marvell.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com> Cc: Greg KH <greg@kroah.com> Cc: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-31 07:25:03 +02:00
Arnd Bergmann	ca40a74b28	locking/lockdep: Hide unused 'class' variable [ Upstream commit `68037aa782` ] The usage is now hidden in an #ifdef, so we need to move the variable itself in there as well to avoid this warning: kernel/locking/lockdep_proc.c:203:21: error: unused variable 'class' [-Werror,-Wunused-variable] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yuyang Du <duyuyang@gmail.com> Cc: frederic@kernel.org Fixes: `68d41d8c94` ("locking/lockdep: Fix lock used or unused stats error") Link: https://lkml.kernel.org/r/20190715092809.736834-1-arnd@arndb.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-31 07:24:55 +02:00
Yuyang Du	f61d5f5006	locking/lockdep: Fix lock used or unused stats error [ Upstream commit `68d41d8c94` ] The stats variable nr_unused_locks is incremented every time a new lock class is register and decremented when the lock is first used in __lock_acquire(). And after all, it is shown and checked in lockdep_stats. However, under configurations that either CONFIG_TRACE_IRQFLAGS or CONFIG_PROVE_LOCKING is not defined: The commit: `0918065151` ("locking/lockdep: Consolidate lock usage bit initialization") missed marking the LOCK_USED flag at IRQ usage initialization because as mark_usage() is not called. And the commit: `886532aee3` ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING") further made mark_lock() not defined such that the LOCK_USED cannot be marked at all when the lock is first acquired. As a result, we fix this by not showing and checking the stats under such configurations for lockdep_stats. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Yuyang Du <duyuyang@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: arnd@arndb.de Cc: frederic@kernel.org Link: https://lkml.kernel.org/r/20190709101522.9117-1-duyuyang@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-31 07:24:55 +02:00
Florian Fainelli	724f88e5ed	dma-remap: Avoid de-referencing NULL atomic_pool [ Upstream commit `4b4b077cbd` ] With architectures allowing the kernel to be placed almost arbitrarily in memory (e.g.: ARM64), it is possible to have the kernel resides at physical addresses above 4GB, resulting in neither the default CMA area, nor the atomic pool from successfully allocating. This does not prevent specific peripherals from working though, one example is XHCI, which still operates correctly. Trouble comes when the XHCI driver gets suspended and resumed, since we can now trigger the following NPD: [ 12.664170] usb usb1: root hub lost power or was reset [ 12.669387] usb usb2: root hub lost power or was reset [ 12.674662] Unable to handle kernel NULL pointer dereference at virtual address 00000008 [ 12.682896] pgd = ffffffc1365a7000 [ 12.686386] [00000008] pgd=0000000136500003, pud=0000000136500003, *pmd=0000000000000000 [ 12.694897] Internal error: Oops: 96000006 [#1] SMP [ 12.699843] Modules linked in: [ 12.702980] CPU: 0 PID: 1499 Comm: pml Not tainted 4.9.135-1.13pre #51 [ 12.709577] Hardware name: BCM97268DV (DT) [ 12.713736] task: ffffffc136bb6540 task.stack: ffffffc1366cc000 [ 12.719740] PC is at addr_in_gen_pool+0x4/0x48 [ 12.724253] LR is at __dma_free+0x64/0xbc [ 12.728325] pc : [<ffffff80083c0df8>] lr : [<ffffff80080979e0>] pstate: 60000145 [ 12.735825] sp : ffffffc1366cf990 [ 12.739196] x29: ffffffc1366cf990 x28: ffffffc1366cc000 [ 12.744608] x27: 0000000000000000 x26: ffffffc13a8568c8 [ 12.750020] x25: 0000000000000000 x24: ffffff80098f9000 [ 12.755433] x23: 000000013a5ff000 x22: ffffff8009c57000 [ 12.760844] x21: ffffffc13a856810 x20: 0000000000000000 [ 12.766255] x19: 0000000000001000 x18: 000000000000000a [ 12.771667] x17: 0000007f917553e0 x16: 0000000000001002 [ 12.777078] x15: 00000000000a36cb x14: ffffff80898feb77 [ 12.782490] x13: ffffffffffffffff x12: 0000000000000030 [ 12.787899] x11: 00000000fffffffe x10: ffffff80098feb7f [ 12.793311] x9 : 0000000005f5e0ff x8 : 65776f702074736f [ 12.798723] x7 : 6c2062756820746f x6 : ffffff80098febb1 [ 12.804134] x5 : ffffff800809797c x4 : 0000000000000000 [ 12.809545] x3 : 000000013a5ff000 x2 : 0000000000000fff [ 12.814955] x1 : ffffff8009c57000 x0 : 0000000000000000 [ 12.820363] [ 12.821907] Process pml (pid: 1499, stack limit = 0xffffffc1366cc020) [ 12.828421] Stack: (0xffffffc1366cf990 to 0xffffffc1366d0000) [ 12.834240] f980: ffffffc1366cf9e0 ffffff80086004d0 [ 12.842186] f9a0: ffffffc13ab08238 0000000000000010 ffffff80097c2218 ffffffc13a856810 [ 12.850131] f9c0: ffffff8009c57000 000000013a5ff000 0000000000000008 000000013a5ff000 [ 12.858076] f9e0: ffffffc1366cfa50 ffffff80085f9250 ffffffc13ab08238 0000000000000004 [ 12.866021] fa00: ffffffc13ab08000 ffffff80097b6000 ffffffc13ab08130 0000000000000001 [ 12.873966] fa20: 0000000000000008 ffffffc13a8568c8 0000000000000000 ffffffc1366cc000 [ 12.881911] fa40: ffffffc13ab08130 0000000000000001 ffffffc1366cfa90 ffffff80085e3de8 [ 12.889856] fa60: ffffffc13ab08238 0000000000000000 ffffffc136b75b00 0000000000000000 [ 12.897801] fa80: 0000000000000010 ffffff80089ccb92 ffffffc1366cfac0 ffffff80084ad040 [ 12.905746] faa0: ffffffc13a856810 0000000000000000 ffffff80084ad004 ffffff80084b91a8 [ 12.913691] fac0: ffffffc1366cfae0 ffffff80084b91b4 ffffffc13a856810 ffffff80080db5cc [ 12.921636] fae0: ffffffc1366cfb20 ffffff80084b96bc ffffffc13a856810 0000000000000010 [ 12.929581] fb00: ffffffc13a856870 0000000000000000 ffffffc13a856810 ffffff800984d2b8 [ 12.937526] fb20: ffffffc1366cfb50 ffffff80084baa70 ffffff8009932ad0 ffffff800984d260 [ 12.945471] fb40: 0000000000000010 00000002eff0a065 ffffffc1366cfbb0 ffffff80084bafbc [ 12.953415] fb60: 0000000000000010 0000000000000003 ffffff80098fe000 0000000000000000 [ 12.961360] fb80: ffffff80097b6000 ffffff80097b6dc8 ffffff80098c12b8 ffffff80098c12f8 [ 12.969306] fba0: ffffff8008842000 ffffff80097b6dc8 ffffffc1366cfbd0 ffffff80080e0d88 [ 12.977251] fbc0: 00000000fffffffb ffffff80080e10bc ffffffc1366cfc60 ffffff80080e16a8 [ 12.985196] fbe0: 0000000000000000 0000000000000003 ffffff80097b6000 ffffff80098fe9f0 [ 12.993140] fc00: ffffff80097d4000 ffffff8008983802 0000000000000123 0000000000000040 [ 13.001085] fc20: ffffff8008842000 ffffffc1366cc000 ffffff80089803c2 00000000ffffffff [ 13.009029] fc40: 0000000000000000 0000000000000000 ffffffc1366cfc60 0000000000040987 [ 13.016974] fc60: ffffffc1366cfcc0 ffffff80080dfd08 0000000000000003 0000000000000004 [ 13.024919] fc80: 0000000000000003 ffffff80098fea08 ffffffc136577ec0 ffffff80089803c2 [ 13.032864] fca0: 0000000000000123 0000000000000001 0000000500000002 0000000000040987 [ 13.040809] fcc0: ffffffc1366cfd00 ffffff80083a89d4 0000000000000004 ffffffc136577ec0 [ 13.048754] fce0: ffffffc136610cc0 ffffffffffffffea ffffffc1366cfeb0 ffffffc136610cd8 [ 13.056700] fd00: ffffffc1366cfd10 ffffff800822a614 ffffffc1366cfd40 ffffff80082295d4 [ 13.064645] fd20: 0000000000000004 ffffffc136577ec0 ffffffc136610cc0 0000000021670570 [ 13.072590] fd40: ffffffc1366cfd80 ffffff80081b5d10 ffffff80097b6000 ffffffc13aae4200 [ 13.080536] fd60: ffffffc1366cfeb0 0000000000000004 0000000021670570 0000000000000004 [ 13.088481] fd80: ffffffc1366cfe30 ffffff80081b6b20 ffffffc13aae4200 0000000000000000 [ 13.096427] fda0: 0000000000000004 0000000021670570 ffffffc1366cfeb0 ffffffc13a838200 [ 13.104371] fdc0: 0000000000000000 000000000000000a ffffff80097b6000 0000000000040987 [ 13.112316] fde0: ffffffc1366cfe20 ffffff80081b3af0 ffffffc13a838200 0000000000000000 [ 13.120261] fe00: ffffffc1366cfe30 ffffff80081b6b0c ffffffc13aae4200 0000000000000000 [ 13.128206] fe20: 0000000000000004 0000000000040987 ffffffc1366cfe70 ffffff80081b7dd8 [ 13.136151] fe40: ffffff80097b6000 ffffffc13aae4200 ffffffc13aae4200 fffffffffffffff7 [ 13.144096] fe60: 0000000021670570 ffffffc13a8c63c0 0000000000000000 ffffff8008083180 [ 13.152042] fe80: ffffffffffffff1d 0000000021670570 ffffffffffffffff 0000007f917ad9b8 [ 13.159986] fea0: 0000000020000000 0000000000000015 0000000000000000 0000000000040987 [ 13.167930] fec0: 0000000000000001 0000000021670570 0000000000000004 0000000000000000 [ 13.175874] fee0: 0000000000000888 0000440110000000 000000000000006d 0000000000000003 [ 13.183819] ff00: 0000000000000040 ffffff80ffffffc8 0000000000000000 0000000000000020 [ 13.191762] ff20: 0000000000000000 0000000000000000 0000000000000001 0000000000000000 [ 13.199707] ff40: 0000000000000000 0000007f917553e0 0000000000000000 0000000000000004 [ 13.207651] ff60: 0000000021670570 0000007f91835480 0000000000000004 0000007f91831638 [ 13.215595] ff80: 0000000000000004 00000000004b0de0 00000000004b0000 0000000000000000 [ 13.223539] ffa0: 0000000000000000 0000007fc92ac8c0 0000007f9175d178 0000007fc92ac8c0 [ 13.231483] ffc0: 0000007f917ad9b8 0000000020000000 0000000000000001 0000000000000040 [ 13.239427] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 13.247360] Call trace: [ 13.249866] Exception stack(0xffffffc1366cf7a0 to 0xffffffc1366cf8d0) [ 13.256386] f7a0: 0000000000001000 0000007fffffffff ffffffc1366cf990 ffffff80083c0df8 [ 13.264331] f7c0: 0000000060000145 ffffff80089b5001 ffffffc13ab08130 0000000000000001 [ 13.272275] f7e0: 0000000000000008 ffffffc13a8568c8 0000000000000000 0000000000000000 [ 13.280220] f800: ffffffc1366cf960 ffffffc1366cf960 ffffffc1366cf930 00000000ffffffd8 [ 13.288165] f820: ffffff8009931ac0 4554535953425553 4544006273753d4d 3831633d45434956 [ 13.296110] f840: ffff003832313a39 ffffff800845926c ffffffc1366cf880 0000000000040987 [ 13.304054] f860: 0000000000000000 ffffff8009c57000 0000000000000fff 000000013a5ff000 [ 13.311999] f880: 0000000000000000 ffffff800809797c ffffff80098febb1 6c2062756820746f [ 13.319944] f8a0: 65776f702074736f 0000000005f5e0ff ffffff80098feb7f 00000000fffffffe [ 13.327884] f8c0: 0000000000000030 ffffffffffffffff [ 13.332835] [<ffffff80083c0df8>] addr_in_gen_pool+0x4/0x48 [ 13.338398] [<ffffff80086004d0>] xhci_mem_cleanup+0xc8/0x51c [ 13.344137] [<ffffff80085f9250>] xhci_resume+0x308/0x65c [ 13.349524] [<ffffff80085e3de8>] xhci_brcm_resume+0x84/0x8c [ 13.355174] [<ffffff80084ad040>] platform_pm_resume+0x3c/0x64 [ 13.360997] [<ffffff80084b91b4>] dpm_run_callback+0x5c/0x15c [ 13.366732] [<ffffff80084b96bc>] device_resume+0xc0/0x190 [ 13.372205] [<ffffff80084baa70>] dpm_resume+0x144/0x2cc [ 13.377504] [<ffffff80084bafbc>] dpm_resume_end+0x20/0x34 [ 13.382980] [<ffffff80080e0d88>] suspend_devices_and_enter+0x104/0x704 [ 13.389585] [<ffffff80080e16a8>] pm_suspend+0x320/0x53c [ 13.394881] [<ffffff80080dfd08>] state_store+0xbc/0xe0 [ 13.400094] [<ffffff80083a89d4>] kobj_attr_store+0x14/0x24 [ 13.405655] [<ffffff800822a614>] sysfs_kf_write+0x60/0x70 [ 13.411128] [<ffffff80082295d4>] kernfs_fop_write+0x130/0x194 [ 13.416954] [<ffffff80081b5d10>] __vfs_write+0x60/0x150 [ 13.422254] [<ffffff80081b6b20>] vfs_write+0xc8/0x164 [ 13.427376] [<ffffff80081b7dd8>] SyS_write+0x70/0xc8 [ 13.432412] [<ffffff8008083180>] el0_svc_naked+0x34/0x38 [ 13.437800] Code: 92800173 97f6fb9e 17fffff5 d1000442 (f8408c03) [ 13.444033] ---[ end trace 2effe12f909ce205 ]--- The call path leading to this problem is xhci_mem_cleanup() -> dma_free_coherent() -> dma_free_from_pool() -> addr_in_gen_pool. If the atomic_pool is NULL, we can't possibly have the address in the atomic pool anyway, so guard against that. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-31 07:24:40 +02:00
Peter Zijlstra	e11aaff1c3	perf/core: Fix race between close() and fork() commit `1cf8dfe8a6` upstream. Syzcaller reported the following Use-after-Free bug: close() clone() copy_process() perf_event_init_task() perf_event_init_context() mutex_lock(parent_ctx->mutex) inherit_task_group() inherit_group() inherit_event() mutex_lock(event->child_mutex) // expose event on child list list_add_tail() mutex_unlock(event->child_mutex) mutex_unlock(parent_ctx->mutex) ... goto bad_fork_* bad_fork_cleanup_perf: perf_event_free_task() perf_release() perf_event_release_kernel() list_for_each_entry() mutex_lock(ctx->mutex) mutex_lock(event->child_mutex) // event is from the failing inherit // on the other CPU perf_remove_from_context() list_move() mutex_unlock(event->child_mutex) mutex_unlock(ctx->mutex) mutex_lock(ctx->mutex) list_for_each_entry_safe() // event already stolen mutex_unlock(ctx->mutex) delayed_free_task() free_task() list_for_each_entry_safe() list_del() free_event() _free_event() // and so event->hw.target // is the already freed failed clone() if (event->hw.target) put_task_struct(event->hw.target) // WHOOPSIE, already quite dead Which puts the lie to the the comment on perf_event_free_task(): 'unexposed, unused context' not so much. Which is a 'fun' confluence of fail; copy_process() doing an unconditional free_task() and not respecting refcounts, and perf having creative locking. In particular: `82d94856fa` ("perf/core: Fix lock inversion between perf,trace,cpuhp") seems to have overlooked this 'fun' parade. Solve it by using the fact that detached events still have a reference count on their (previous) context. With this perf_event_free_task() can detect when events have escaped and wait for their destruction. Debugged-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Reported-by: syzbot+a24c397a29ad22d86c98@syzkaller.appspotmail.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: <stable@vger.kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: `82d94856fa` ("perf/core: Fix lock inversion between perf,trace,cpuhp") Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-28 08:27:22 +02:00
Alexander Shishkin	a36a926748	perf/core: Fix exclusive events' grouping commit `8a58ddae23` upstream. So far, we tried to disallow grouping exclusive events for the fear of complications they would cause with moving between contexts. Specifically, moving a software group to a hardware context would violate the exclusivity rules if both groups contain matching exclusive events. This attempt was, however, unsuccessful: the check that we have in the perf_event_open() syscall is both wrong (looks at wrong PMU) and insufficient (group leader may still be exclusive), as can be illustrated by running: $ perf record -e '{intel_pt//,cycles}' uname $ perf record -e '{cycles,intel_pt//}' uname ultimately successfully. Furthermore, we are completely free to trigger the exclusivity violation by: perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}' even though the helpful perf record will not allow that, the ABI will. The warning later in the perf_event_open() path will also not trigger, because it's also wrong. Fix all this by validating the original group before moving, getting rid of broken safeguards and placing a useful one to perf_install_in_context(). Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: mathieu.poirier@linaro.org Cc: will.deacon@arm.com Fixes: `bed5b25ad9` ("perf: Add a pmu capability for "exclusive" events") Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-28 08:27:22 +02:00
Nadav Amit	1bbf2811db	resource: fix locking in find_next_iomem_res() commit `49f17c26c1` upstream. Since resources can be removed, locking should ensure that the resource is not removed while accessing it. However, find_next_iomem_res() does not hold the lock while copying the data of the resource. Keep holding the lock while the data is copied. While at it, change the return value to a more informative value. It is disregarded by the callers. [akpm@linux-foundation.org: fix find_next_iomem_res() documentation] Link: http://lkml.kernel.org/r/20190613045903.4922-2-namit@vmware.com Fixes: `ff3cc952d3` ("resource: Add remove_resource interface") Signed-off-by: Nadav Amit <namit@vmware.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-26 09:11:07 +02:00
Aneesh Kumar K.V	8a75019214	mm/nvdimm: add is_ioremap_addr and use that to check ioremap address commit `9bd3bb6703` upstream. Architectures like powerpc use different address range to map ioremap and vmalloc range. The memunmap() check used by the nvdimm layer was wrongly using is_vmalloc_addr() to check for ioremap range which fails for ppc64. This result in ppc64 not freeing the ioremap mapping. The side effect of this is an unbind failure during module unload with papr_scm nvdimm driver Link: http://lkml.kernel.org/r/20190701134038.14165-1-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Fixes: `b5beae5e22` ("powerpc/pseries: Add driver for PAPR SCM regions") Cc: Dan Williams <dan.j.williams@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-26 09:11:06 +02:00
Daniel Jordan	90b50060fe	padata: use smp_mb in padata_reorder to avoid orphaned padata jobs commit `cf144f81a9` upstream. Testing padata with the tcrypt module on a 5.2 kernel... # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3 # modprobe tcrypt mode=211 sec=1 ...produces this splat: INFO: task modprobe:10075 blocked for more than 120 seconds. Not tainted 5.2.0-base+ #16 modprobe D 0 10075 10064 0x80004080 Call Trace: ? __schedule+0x4dd/0x610 ? ring_buffer_unlock_commit+0x23/0x100 schedule+0x6c/0x90 schedule_timeout+0x3b/0x320 ? trace_buffer_unlock_commit_regs+0x4f/0x1f0 wait_for_common+0x160/0x1a0 ? wake_up_q+0x80/0x80 { crypto_wait_req } # entries in braces added by hand { do_one_aead_op } { test_aead_jiffies } test_aead_speed.constprop.17+0x681/0xf30 [tcrypt] do_test+0x4053/0x6a2b [tcrypt] ? 0xffffffffa00f4000 tcrypt_mod_init+0x50/0x1000 [tcrypt] ... The second modprobe command never finishes because in padata_reorder, CPU0's load of reorder_objects is executed before the unlocking store in spin_unlock_bh(pd->lock), causing CPU0 to miss CPU1's increment: CPU0 CPU1 padata_reorder padata_do_serial LOAD reorder_objects // 0 INC reorder_objects // 1 padata_reorder TRYLOCK pd->lock // failed UNLOCK pd->lock CPU0 deletes the timer before returning from padata_reorder and since no other job is submitted to padata, modprobe waits indefinitely. Add a pair of full barriers to guarantee proper ordering: CPU0 CPU1 padata_reorder padata_do_serial UNLOCK pd->lock smp_mb() LOAD reorder_objects INC reorder_objects smp_mb__after_atomic() padata_reorder TRYLOCK pd->lock smp_mb__after_atomic is needed so the read part of the trylock operation comes after the INC, as Andrea points out. Thanks also to Andrea for help with writing a litmus test. Fixes: `16295bec63` ("padata: Generic parallelization/serialization interface") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: <stable@vger.kernel.org> Cc: Andrea Parri <andrea.parri@amarulasolutions.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Paul E. McKenney <paulmck@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: linux-arch@vger.kernel.org Cc: linux-crypto@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-26 09:11:02 +02:00
Eric W. Biederman	2f0e9eed11	signal: Correct namespace fixups of si_pid and si_uid commit `7a0cf09494` upstream. The function send_signal was split from __send_signal so that it would be possible to bypass the namespace logic based upon current[1]. As it turns out the si_pid and the si_uid fixup are both inappropriate in the case of kill_pid_usb_asyncio so move that logic into send_signal. It is difficult to arrange but possible for a signal with an si_code of SI_TIMER or SI_SIGIO to be sent across namespace boundaries. In which case tests for when it is ok to change si_pid and si_uid based on SI_FROMUSER are incorrect. Replace the use of SI_FROMUSER with a new test has_si_pid_and_used based on siginfo_layout. Now that the uid fixup is no longer present after expanding SEND_SIG_NOINFO properly calculate the si_uid that the target task needs to read. [1] `7978b567d3` ("signals: add from_ancestor_ns parameter to send_signal()") Cc: stable@vger.kernel.org Fixes: `6588c1e3ff` ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary") Fixes: `6b550f9495` ("user namespace: make signal.c respect user namespaces") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-26 09:11:01 +02:00
Eric W. Biederman	1852befc68	signal/usb: Replace kill_pid_info_as_cred with kill_pid_usb_asyncio commit `70f1b0d34b` upstream. The usb support for asyncio encoded one of it's values in the wrong field. It should have used si_value but instead used si_addr which is not present in the _rt union member of struct siginfo. The practical result of this is that on a 64bit big endian kernel when delivering a signal to a 32bit process the si_addr field is set to NULL, instead of the expected pointer value. This issue can not be fixed in copy_siginfo_to_user32 as the usb usage of the the _sigfault (aka si_addr) member of the siginfo union when SI_ASYNCIO is set is incompatible with the POSIX and glibc usage of the _rt member of the siginfo union. Therefore replace kill_pid_info_as_cred with kill_pid_usb_asyncio a dedicated function for this one specific case. There are no other users of kill_pid_info_as_cred so this specialization should have no impact on the amount of code in the kernel. Have kill_pid_usb_asyncio take instead of a siginfo_t which is difficult and error prone, 3 arguments, a signal number, an errno value, and an address enconded as a sigval_t. The encoding of the address as a sigval_t allows the code that reads the userspace request for a signal to handle this compat issue along with all of the other compat issues. Add BUILD_BUG_ONs in kernel/signal.c to ensure that we can now place the pointer value at the in si_pid (instead of si_addr). That is the code now verifies that si_pid and si_addr always occur at the same location. Further the code veries that for native structures a value placed in si_pid and spilling into si_uid will appear in userspace in si_addr (on a byte by byte copy of siginfo or a field by field copy of siginfo). The code also verifies that for a 64bit kernel and a 32bit userspace the 32bit pointer will fit in si_pid. I have used the usbsig.c program below written by Alan Stern and slightly tweaked by me to run on a big endian machine to verify the issue exists (on sparc64) and to confirm the patch below fixes the issue. /* usbsig.c -- test USB async signal delivery / #define _GNU_SOURCE #include <stdio.h> #include <fcntl.h> #include <signal.h> #include <string.h> #include <sys/ioctl.h> #include <unistd.h> #include <endian.h> #include <linux/usb/ch9.h> #include <linux/usbdevice_fs.h> static struct usbdevfs_urb urb; static struct usbdevfs_disconnectsignal ds; static volatile sig_atomic_t done = 0; void urb_handler(int sig, siginfo_t info , void ucontext) { printf("Got signal %d, signo %d errno %d code %d addr: %p urb: %p\n", sig, info->si_signo, info->si_errno, info->si_code, info->si_addr, &urb); printf("%s\n", (info->si_addr == &urb) ? "Good" : "Bad"); } void ds_handler(int sig, siginfo_t info , void ucontext) { printf("Got signal %d, signo %d errno %d code %d addr: %p ds: %p\n", sig, info->si_signo, info->si_errno, info->si_code, info->si_addr, &ds); printf("%s\n", (info->si_addr == &ds) ? "Good" : "Bad"); done = 1; } int main(int argc, char argv) { char devfilename; int fd; int rc; struct sigaction act; struct usb_ctrlrequest req; void ptr; char buf[80]; if (argc != 2) { fprintf(stderr, "Usage: usbsig device-file-name\n"); return 1; } devfilename = argv[1]; fd = open(devfilename, O_RDWR); if (fd == -1) { perror("Error opening device file"); return 1; } act.sa_sigaction = urb_handler; sigemptyset(&act.sa_mask); act.sa_flags = SA_SIGINFO; rc = sigaction(SIGUSR1, &act, NULL); if (rc == -1) { perror("Error in sigaction"); return 1; } act.sa_sigaction = ds_handler; sigemptyset(&act.sa_mask); act.sa_flags = SA_SIGINFO; rc = sigaction(SIGUSR2, &act, NULL); if (rc == -1) { perror("Error in sigaction"); return 1; } memset(&urb, 0, sizeof(urb)); urb.type = USBDEVFS_URB_TYPE_CONTROL; urb.endpoint = USB_DIR_IN \| 0; urb.buffer = buf; urb.buffer_length = sizeof(buf); urb.signr = SIGUSR1; req = (struct usb_ctrlrequest ) buf; req->bRequestType = USB_DIR_IN \| USB_TYPE_STANDARD \| USB_RECIP_DEVICE; req->bRequest = USB_REQ_GET_DESCRIPTOR; req->wValue = htole16(USB_DT_DEVICE << 8); req->wIndex = htole16(0); req->wLength = htole16(sizeof(buf) - sizeof(req)); rc = ioctl(fd, USBDEVFS_SUBMITURB, &urb); if (rc == -1) { perror("Error in SUBMITURB ioctl"); return 1; } rc = ioctl(fd, USBDEVFS_REAPURB, &ptr); if (rc == -1) { perror("Error in REAPURB ioctl"); return 1; } memset(&ds, 0, sizeof(ds)); ds.signr = SIGUSR2; ds.context = &ds; rc = ioctl(fd, USBDEVFS_DISCSIGNAL, &ds); if (rc == -1) { perror("Error in DISCSIGNAL ioctl"); return 1; } printf("Waiting for usb disconnect\n"); while (!done) { sleep(1); } close(fd); return 0; } Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: linux-usb@vger.kernel.org Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.com> Fixes: v2.3.39 Cc: stable@vger.kernel.org Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-26 09:11:01 +02:00
Eiichi Tsukata	09df5c0ef2	tracing: Fix user stack trace "??" output commit `6d54ceb539` upstream. Commit `c5c27a0a58` ("x86/stacktrace: Remove the pointless ULONG_MAX marker") removes ULONG_MAX marker from user stack trace entries but trace_user_stack_print() still uses the marker and it outputs unnecessary "??". For example: less-1911 [001] d..2 34.758944: <user stack trace> => <00007f16f2295910> => ?? => ?? => ?? => ?? => ?? => ?? => ?? The user stack trace code zeroes the storage before saving the stack, so if the trace is shorter than the maximum number of entries it can terminate the print loop if a zero entry is detected. Link: http://lkml.kernel.org/r/20190630085438.25545-1-devel@etsukata.com Cc: stable@vger.kernel.org Fixes: `4285f2fcef` ("tracing: Remove the ULONG_MAX stack trace hackery") Signed-off-by: Eiichi Tsukata <devel@etsukata.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-26 09:10:55 +02:00
Julien Thierry	afa1d4c43c	arm64: Fix interrupt tracing in the presence of NMIs commit `17ce302f31` upstream. In the presence of any form of instrumentation, nmi_enter() should be done before calling any traceable code and any instrumentation code. Currently, nmi_enter() is done in handle_domain_nmi(), which is much too late as instrumentation code might get called before. Move the nmi_enter/exit() calls to the arch IRQ vector handler. On arm64, it is not possible to know if the IRQ vector handler was called because of an NMI before acknowledging the interrupt. However, It is possible to know whether normal interrupts could be taken in the interrupted context (i.e. if taking an NMI in that context could introduce a potential race condition). When interrupting a context with IRQs disabled, call nmi_enter() as soon as possible. In contexts with IRQs enabled, defer this to the interrupt controller, which is in a better position to know if an interrupt taken is an NMI. Fixes: `bc3c03ccb4` ("arm64: Enable the support of pseudo-NMIs") Cc: <stable@vger.kernel.org> # 5.1.x- Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Julien Thierry <julien.thierry@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-26 09:10:54 +02:00
Shijith Thotton	75b0fe8772	genirq: Update irq stats from NMI handlers [ Upstream commit `c09cb12935` ] The NMI handlers handle_percpu_devid_fasteoi_nmi() and handle_fasteoi_nmi() do not update the interrupt counts. Due to that the NMI interrupt count does not show up correctly in /proc/interrupts. Add the statistics and treat the NMI handlers in the same way as per cpu interrupts and prevent them from updating irq_desc::tot_count as this might be corrupted due to concurrency. [ tglx: Massaged changelog ] Fixes: `2dcf1fbcad` ("genirq: Provide NMI handlers") Signed-off-by: Shijith Thotton <sthotton@marvell.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1562313336-11888-1-git-send-email-sthotton@marvell.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-26 09:10:45 +02:00
Jiong Wang	57d50e607b	bpf: fix BPF_ALU32 \| BPF_ARSH on BE arches [ Upstream commit `75672dda27` ] Yauheni reported the following code do not work correctly on BE arches: ALU_ARSH_X: DST = (u64) (u32) (((s32 ) &DST) >> SRC); CONT; ALU_ARSH_K: DST = (u64) (u32) (((s32 ) &DST) >> IMM); CONT; and are causing failure of test_verifier test 'arsh32 on imm 2' on BE arches. The code is taking address and interpreting memory directly, so is not endianness neutral. We should instead perform standard C type casting on the variable. A u64 to s32 conversion will drop the high 32-bit and reserve the low 32-bit as signed integer, this is all we want. Fixes: `2dc6b100f9` ("bpf: interpreter support BPF_ALU \| BPF_ARSH") Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-26 09:10:37 +02:00
Nathan Huckleberry	c0730f79dc	timer_list: Guard procfs specific code [ Upstream commit `a9314773a9` ] With CONFIG_PROC_FS=n the following warning is emitted: kernel/time/timer_list.c:361:36: warning: unused variable 'timer_list_sops' [-Wunused-const-variable] static const struct seq_operations timer_list_sops = { Add #ifdef guard around procfs specific code. Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Cc: john.stultz@linaro.org Cc: sboyd@kernel.org Cc: clang-built-linux@googlegroups.com Link: https://github.com/ClangBuiltLinux/linux/issues/534 Link: https://lkml.kernel.org/r/20190614181604.112297-1-nhuck@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-26 09:10:33 +02:00
Miroslav Lichvar	b2e0490989	ntp: Limit TAI-UTC offset [ Upstream commit `d897a4ab11` ] Don't allow the TAI-UTC offset of the system clock to be set by adjtimex() to a value larger than 100000 seconds. This prevents an overflow in the conversion to int, prevents the CLOCK_TAI clock from getting too far ahead of the CLOCK_REALTIME clock, and it is still large enough to allow leap seconds to be inserted at the maximum rate currently supported by the kernel (once per day) for the next ~270 years, however unlikely it is that someone can survive a catastrophic event which slowed down the rotation of the Earth so much. Reported-by: Weikang shi <swkhack@gmail.com> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <sboyd@kernel.org> Link: https://lkml.kernel.org/r/20190618154713.20929-1-mlichvar@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-26 09:10:32 +02:00
Alexei Starovoitov	c407f8770e	bpf: fix callees pruning callers [ Upstream commit `eea1c227b9` ] The commit `7640ead939` partially resolved the issue of callees incorrectly pruning the callers. With introduction of bounded loops and jmps_processed heuristic single verifier state may contain multiple branches and calls. It's possible that new verifier state (for future pruning) will be allocated inside callee. Then callee will exit (still within the same verifier state). It will go back to the caller and there R6-R9 registers will be read and will trigger mark_reg_read. But the reg->live for all frames but the top frame is not set to LIVE_NONE. Hence mark_reg_read will fail to propagate liveness into parent and future walking will incorrectly conclude that the states are equivalent because LIVE_READ is not set. In other words the rule for parent/live should be: whenever register parentage chain is set the reg->live should be set to LIVE_NONE. is_state_visited logic already follows this rule for spilled registers. Fixes: `7640ead939` ("bpf: verifier: make sure callees don't prune with caller differences") Fixes: `f4d7e40a5b` ("bpf: introduce function calls (verification)") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-26 09:10:30 +02:00
Qian Cai	41a811e161	sched/fair: Fix "runnable_avg_yN_inv" not used warnings [ Upstream commit `509466b7d4` ] runnable_avg_yN_inv[] is only used in kernel/sched/pelt.c but was included in several other places because they need other macros all came from kernel/sched/sched-pelt.h which was generated by Documentation/scheduler/sched-pelt. As the result, it causes compilation a lot of warnings, kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=] kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=] kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=] ... Silence it by appending the __maybe_unused attribute for it, so all generated variables and macros can still be kept in the same file. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1559596304-31581-1-git-send-email-cai@lca.pw Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-26 09:10:28 +02:00
Gao Xiang	29c36efe71	sched/core: Add __sched tag for io_schedule() [ Upstream commit `e3b929b0a1` ] Non-inline io_schedule() was introduced in: commit `10ab56434f` ("sched/core: Separate out io_schedule_prepare() and io_schedule_finish()") Keep in line with io_schedule_timeout(), otherwise "/proc/<pid>/wchan" will report io_schedule() rather than its callers when waiting for IO. Reported-by: Jilong Kou <koujilong@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Miao Xie <miaoxie@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `10ab56434f` ("sched/core: Separate out io_schedule_prepare() and io_schedule_finish()") Link: https://lkml.kernel.org/r/20190603091338.2695-1-gaoxiang25@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-26 09:10:28 +02:00
Valdis Klētnieks	e6164c5cf3	bpf: silence warning messages in core [ Upstream commit `aee450cbe4` ] Compiling kernel/bpf/core.c with W=1 causes a flood of warnings: kernel/bpf/core.c:1198:65: warning: initialized field overwritten [-Woverride-init] 1198 \| #define BPF_INSN_3_TBL(x, y, z) [BPF_##x \| BPF_##y \| BPF_##z] = true \| ^~~~ kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL' 1087 \| INSN_3(ALU, ADD, X), \ \| ^~~~~~ kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP' 1202 \| BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL), \| ^~~~~~~~~~~~ kernel/bpf/core.c:1198:65: note: (near initialization for 'public_insntable[12]') 1198 \| #define BPF_INSN_3_TBL(x, y, z) [BPF_##x \| BPF_##y \| BPF_##z] = true \| ^~~~ kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL' 1087 \| INSN_3(ALU, ADD, X), \ \| ^~~~~~ kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP' 1202 \| BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL), \| ^~~~~~~~~~~~ 98 copies of the above. The attached patch silences the warnings, because we know we're overwriting the default initializer. That leaves bpf/core.c with only 6 other warnings, which become more visible in comparison. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-26 09:10:25 +02:00
Imre Deak	e26b15def4	locking/lockdep: Fix merging of hlocks with non-zero references [ Upstream commit `d9349850e1` ] The sequence static DEFINE_WW_CLASS(test_ww_class); struct ww_acquire_ctx ww_ctx; struct ww_mutex ww_lock_a; struct ww_mutex ww_lock_b; struct ww_mutex ww_lock_c; struct mutex lock_c; ww_acquire_init(&ww_ctx, &test_ww_class); ww_mutex_init(&ww_lock_a, &test_ww_class); ww_mutex_init(&ww_lock_b, &test_ww_class); ww_mutex_init(&ww_lock_c, &test_ww_class); mutex_init(&lock_c); ww_mutex_lock(&ww_lock_a, &ww_ctx); mutex_lock(&lock_c); ww_mutex_lock(&ww_lock_b, &ww_ctx); ww_mutex_lock(&ww_lock_c, &ww_ctx); mutex_unlock(&lock_c); () ww_mutex_unlock(&ww_lock_c); ww_mutex_unlock(&ww_lock_b); ww_mutex_unlock(&ww_lock_a); ww_acquire_fini(&ww_ctx); () will trigger the following error in __lock_release() when calling mutex_release() at : DEBUG_LOCKS_WARN_ON(depth <= 0) The problem is that the hlock merging happening at updates the references for test_ww_class incorrectly to 3 whereas it should've updated it to 4 (representing all the instances for ww_ctx and ww_lock_[abc]). Fix this by updating the references during merging correctly taking into account that we can have non-zero references (both for the hlock that we merge into another hlock or for the hlock we are merging into). Signed-off-by: Imre Deak <imre.deak@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: =?UTF-8?q?Ville=20Syrj=C3=A4l=C3=A4?= <ville.syrjala@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: https://lkml.kernel.org/r/20190524201509.9199-2-imre.deak@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-26 09:10:20 +02:00
Imre Deak	74e248d9f9	locking/lockdep: Fix OOO unlock when hlocks need merging [ Upstream commit `8c8889d8ea` ] The sequence static DEFINE_WW_CLASS(test_ww_class); struct ww_acquire_ctx ww_ctx; struct ww_mutex ww_lock_a; struct ww_mutex ww_lock_b; struct mutex lock_c; struct mutex lock_d; ww_acquire_init(&ww_ctx, &test_ww_class); ww_mutex_init(&ww_lock_a, &test_ww_class); ww_mutex_init(&ww_lock_b, &test_ww_class); mutex_init(&lock_c); ww_mutex_lock(&ww_lock_a, &ww_ctx); mutex_lock(&lock_c); ww_mutex_lock(&ww_lock_b, &ww_ctx); mutex_unlock(&lock_c); () ww_mutex_unlock(&ww_lock_b); ww_mutex_unlock(&ww_lock_a); ww_acquire_fini(&ww_ctx); triggers the following WARN in __lock_release() when doing the unlock at : DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth - 1); The problem is that the WARN check doesn't take into account the merging of ww_lock_a and ww_lock_b which results in decreasing curr->lockdep_depth by 2 not only 1. Note that the following sequence doesn't trigger the WARN, since there won't be any hlock merging. ww_acquire_init(&ww_ctx, &test_ww_class); ww_mutex_init(&ww_lock_a, &test_ww_class); ww_mutex_init(&ww_lock_b, &test_ww_class); mutex_init(&lock_c); mutex_init(&lock_d); ww_mutex_lock(&ww_lock_a, &ww_ctx); mutex_lock(&lock_c); mutex_lock(&lock_d); ww_mutex_lock(&ww_lock_b, &ww_ctx); mutex_unlock(&lock_d); ww_mutex_unlock(&ww_lock_b); ww_mutex_unlock(&ww_lock_a); mutex_unlock(&lock_c); ww_acquire_fini(&ww_ctx); In general both of the above two sequences are valid and shouldn't trigger any lockdep warning. Fix this by taking the decrement due to the hlock merging into account during lock release and hlock class re-setting. Merging can't happen during lock downgrading since there won't be a new possibility to merge hlocks in that case, so add a WARN if merging still happens then. Signed-off-by: Imre Deak <imre.deak@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: ville.syrjala@linux.intel.com Link: https://lkml.kernel.org/r/20190524201509.9199-1-imre.deak@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-26 09:10:20 +02:00
Eric W. Biederman	4adddecd46	signal/pid_namespace: Fix reboot_pid_ns to use send_sig not force_sig [ Upstream commit `f9070dc945` ] The locking in force_sig_info is not prepared to deal with a task that exits or execs (as sighand may change). The is not a locking problem in force_sig as force_sig is only built to handle synchronous exceptions. Further the function force_sig_info changes the signal state if the signal is ignored, or blocked or if SIGNAL_UNKILLABLE will prevent the delivery of the signal. The signal SIGKILL can not be ignored and can not be blocked and SIGNAL_UNKILLABLE won't prevent it from being delivered. So using force_sig rather than send_sig for SIGKILL is confusing and pointless. Because it won't impact the sending of the signal and and because using force_sig is wrong, replace force_sig with send_sig. Cc: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Serge Hallyn <serge@hallyn.com> Cc: Oleg Nesterov <oleg@redhat.com> Fixes: `cf3f89214e` ("pidns: add reboot_pid_ns() to handle the reboot syscall") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-07-26 09:10:16 +02:00
Thomas Gleixner	f4999a2a3a	genirq: Add optional hardware synchronization for shutdown commit `62e0468650` upstream. free_irq() ensures that no hardware interrupt handler is executing on a different CPU before actually releasing resources and deactivating the interrupt completely in a domain hierarchy. But that does not catch the case where the interrupt is on flight at the hardware level but not yet serviced by the target CPU. That creates an interesing race condition: CPU 0 CPU 1 IRQ CHIP interrupt is raised sent to CPU1 Unable to handle immediately (interrupts off, deep idle delay) mask() ... free() shutdown() synchronize_irq() release_resources() do_IRQ() -> resources are not available That might be harmless and just trigger a spurious interrupt warning, but some interrupt chips might get into a wedged state. Utilize the existing irq_get_irqchip_state() callback for the synchronization in free_irq(). synchronize_hardirq() is not using this mechanism as it might actually deadlock unter certain conditions, e.g. when called with interrupts disabled and the target CPU is the one on which the synchronization is invoked. synchronize_irq() uses it because that function cannot be called from non preemtible contexts as it might sleep. No functional change intended and according to Marc the existing GIC implementations where the driver supports the callback should be able to cope with that core change. Famous last words. Fixes: `464d12309e` ("x86/vector: Switch IOAPIC to global reservation mode") Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Tested-by: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/20190628111440.279463375@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-21 09:00:39 +02:00
Thomas Gleixner	41e95c3449	genirq: Fix misleading synchronize_irq() documentation commit `1d21f2af85` upstream. The function might sleep, so it cannot be called from interrupt context. Not even with care. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/20190628111440.189241552@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-21 09:00:39 +02:00
Thomas Gleixner	5db47f4300	genirq: Delay deactivation in free_irq() commit `4001d8e876` upstream. When interrupts are shutdown, they are immediately deactivated in the irqdomain hierarchy. While this looks obviously correct there is a subtle issue: There might be an interrupt in flight when free_irq() is invoking the shutdown. This is properly handled at the irq descriptor / primary handler level, but the deactivation might completely disable resources which are required to acknowledge the interrupt. Split the shutdown code and deactivate the interrupt after synchronization in free_irq(). Fixup all other usage sites where this is not an issue to invoke the combined shutdown_and_deactivate() function instead. This still might be an issue if the interrupt in flight servicing is delayed on a remote CPU beyond the invocation of synchronize_irq(), but that cannot be handled at that level and needs to be handled in the synchronize_irq() context. Fixes: `f8264e3496` ("irqdomain: Introduce new interfaces to support hierarchy irqdomains") Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/20190628111440.098196390@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-07-21 09:00:39 +02:00
Jann Horn	6994eefb00	ptrace: Fix ->ptracer_cred handling for PTRACE_TRACEME Fix two issues: When called for PTRACE_TRACEME, ptrace_link() would obtain an RCU reference to the parent's objective credentials, then give that pointer to get_cred(). However, the object lifetime rules for things like struct cred do not permit unconditionally turning an RCU reference into a stable reference. PTRACE_TRACEME records the parent's credentials as if the parent was acting as the subject, but that's not the case. If a malicious unprivileged child uses PTRACE_TRACEME and the parent is privileged, and at a later point, the parent process becomes attacker-controlled (because it drops privileges and calls execve()), the attacker ends up with control over two processes with a privileged ptrace relationship, which can be abused to ptrace a suid binary and obtain root privileges. Fix both of these by always recording the credentials of the process that is requesting the creation of the ptrace relationship: current_cred() can't change under us, and current is the proper subject for access control. This change is theoretically userspace-visible, but I am not aware of any code that it will actually break. Fixes: `64b875f7ac` ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-07-05 02:00:41 +09:00
Linus Torvalds	550d1f5bda	Merge tag 'trace-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "This includes three fixes: - Fix a deadlock from a previous fix to keep module loading and function tracing text modifications from stepping on each other (this has a few patches to help document the issue in comments) - Fix a crash when the snapshot buffer gets out of sync with the main ring buffer - Fix a memory leak when reading the memory logs" * tag 'trace-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace/x86: Anotate text_mutex split between ftrace_arch_code_modify_post_process() and ftrace_arch_code_modify_prepare() tracing/snapshot: Resize spare buffer if size changed tracing: Fix memory leak in tracing_err_log_open() ftrace/x86: Add a comment to why we take text_mutex in ftrace_arch_code_modify_prepare() ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()	2019-07-04 10:26:17 +09:00
Christian Brauner	28dd29c06d	fork: return proper negative error code Make sure to return a proper negative error code from copy_process() when anon_inode_getfile() fails with CLONE_PIDFD. Otherwise _do_fork() will not detect an error and get_task_pid() will operator on a nonsensical pointer: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c R13: 00007ffc15fbb0ff R14: 00007ff07e47e9c0 R15: 0000000000000000 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 7990 Comm: syz-executor290 Not tainted 5.2.0-rc6+ #9 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline] RIP: 0010:get_task_pid+0xe1/0x210 kernel/pid.c:372 Code: 89 ff e8 62 27 5f 00 49 8b 07 44 89 f1 4c 8d bc c8 90 01 00 00 eb 0c e8 0d fe 25 00 49 81 c7 38 05 00 00 4c 89 f8 48 c1 e8 03 <80> 3c 18 00 74 08 4c 89 ff e8 31 27 5f 00 4d 8b 37 e8 f9 47 12 00 RSP: 0018:ffff88808a4a7d78 EFLAGS: 00010203 RAX: 00000000000000a7 RBX: dffffc0000000000 RCX: ffff888088180600 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88808a4a7d90 R08: ffffffff814fb3a8 R09: ffffed1015d66bf8 R10: ffffed1015d66bf8 R11: 1ffff11015d66bf7 R12: 0000000000041ffc R13: 1ffff11011494fbc R14: 0000000000000000 R15: 000000000000053d FS: 00007ff07e47e700(0000) GS:ffff8880aeb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004b5100 CR3: 0000000094df2000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: _do_fork+0x1b9/0x5f0 kernel/fork.c:2360 __do_sys_clone kernel/fork.c:2454 [inline] __se_sys_clone kernel/fork.c:2448 [inline] __x64_sys_clone+0xc1/0xd0 kernel/fork.c:2448 do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Link: https://lore.kernel.org/lkml/000000000000e0dc0d058c9e7142@google.com Reported-and-tested-by: syzbot+002e636502bc4b64eb5c@syzkaller.appspotmail.com Fixes: `6fd2fe494b` ("copy_process(): don't use ksys_close() on cleanups") Cc: Jann Horn <jannh@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christian Brauner <christian@brauner.io>	2019-07-01 16:43:30 +02:00
Linus Torvalds	7c15f41e87	Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull SMP fixes from Thomas Gleixner: "Two small changes for the cpu hotplug code: - Prevent out of bounds access which actually might crash the machine caused by a missing bounds check in the fail injection code - Warn about unsupported migitation mode command line arguments to make people aware that they typoed the paramater. Not necessarily a fix but quite some people tripped over that" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Fix out-of-bounds read when setting fail state cpu/speculation: Warn on unsupported mitigations= parameter	2019-06-30 11:19:17 +08:00
Linus Torvalds	57103eb7c6	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Various fixes, most of them related to bugs perf fuzzing found in the x86 code" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/regs: Use PERF_REG_EXTENDED_MASK perf/x86: Remove pmu->pebs_no_xmm_regs perf/x86: Clean up PEBS_XMM_REGS perf/x86/regs: Check reserved bits perf/x86: Disable extended registers for non-supported PMUs perf/ioctl: Add check for the sample_period value perf/core: Fix perf_sample_regs_user() mm check	2019-06-29 19:39:17 +08:00
Linus Torvalds	2407e48606	Merge tag 'pm-5.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Avoid skipping bus-level PCI power management during system resume for PCIe ports left in D0 during the preceding suspend transition on platforms where the power states of those ports can change out of the PCI layer's control" * tag 'pm-5.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PCI: PM: Avoid skipping bus-level PM on platforms without ACPI	2019-06-29 19:29:45 +08:00
Andrea Arcangeli	1bf4580e00	fork,memcg: alloc_thread_stack_node needs to set tsk->stack Commit `5eed6f1dff` ("fork,memcg: fix crash in free_thread_stack on memcg charge fail") corrected two instances, but there was a third instance of this bug. Without setting tsk->stack, if memcg_charge_kernel_stack fails, it'll execute free_thread_stack() on a dangling pointer. Enterprise kernels are compiled with VMAP_STACK=y so this isn't critical, but custom VMAP_STACK=n builds should have some performance advantage, with the drawback of risking to fail fork because compaction didn't succeed. So as long as VMAP_STACK=n is a supported option it's worth fixing it upstream. Link: http://lkml.kernel.org/r/20190619011450.28048-1-aarcange@redhat.com Fixes: `9b6f7e163c` ("mm: rework memcg kernel stack accounting") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-06-29 16:43:45 +08:00
Oleg Nesterov	97abc889ee	signal: remove the wrong signal_pending() check in restore_user_sigmask() This is the minimal fix for stable, I'll send cleanups later. Commit `854a6ed568` ("signal: Add restore_user_sigmask()") introduced the visible change which breaks user-space: a signal temporary unblocked by set_user_sigmask() can be delivered even if the caller returns success or timeout. Change restore_user_sigmask() to accept the additional "interrupted" argument which should be used instead of signal_pending() check, and update the callers. Eric said: : For clarity. I don't think this is required by posix, or fundamentally to : remove the races in select. It is what linux has always done and we have : applications who care so I agree this fix is needed. : : Further in any case where the semantic change that this patch rolls back : (aka where allowing a signal to be delivered and the select like call to : complete) would be advantage we can do as well if not better by using : signalfd. : : Michael is there any chance we can get this guarantee of the linux : implementation of pselect and friends clearly documented. The guarantee : that if the system call completes successfully we are guaranteed that no : signal that is unblocked by using sigmask will be delivered? Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com Fixes: `854a6ed568` ("signal: Add restore_user_sigmask()") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Eric Wong <e@80x24.org> Tested-by: Eric Wong <e@80x24.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jason Baron <jbaron@akamai.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Laight <David.Laight@ACULAB.COM> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-06-29 16:43:45 +08:00
Eiichi Tsukata	46cc0b4442	tracing/snapshot: Resize spare buffer if size changed Current snapshot implementation swaps two ring_buffers even though their sizes are different from each other, that can cause an inconsistency between the contents of buffer_size_kb file and the current buffer size. For example: # cat buffer_size_kb 7 (expanded: 1408) # echo 1 > events/enable # grep bytes per_cpu/cpu0/stats bytes: 1441020 # echo 1 > snapshot // current:1408, spare:1408 # echo 123 > buffer_size_kb // current:123, spare:1408 # echo 1 > snapshot // current:1408, spare:123 # grep bytes per_cpu/cpu0/stats bytes: 1443700 # cat buffer_size_kb 123 // != current:1408 And also, a similar per-cpu case hits the following WARNING: Reproducer: # echo 1 > per_cpu/cpu0/snapshot # echo 123 > buffer_size_kb # echo 1 > per_cpu/cpu0/snapshot WARNING: WARNING: CPU: 0 PID: 1946 at kernel/trace/trace.c:1607 update_max_tr_single.part.0+0x2b8/0x380 Modules linked in: CPU: 0 PID: 1946 Comm: bash Not tainted 5.2.0-rc6 #20 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014 RIP: 0010:update_max_tr_single.part.0+0x2b8/0x380 Code: ff e8 dc da f9 ff 0f 0b e9 88 fe ff ff e8 d0 da f9 ff 44 89 ee bf f5 ff ff ff e8 33 dc f9 ff 41 83 fd f5 74 96 e8 b8 da f9 ff <0f> 0b eb 8d e8 af da f9 ff 0f 0b e9 bf fd ff ff e8 a3 da f9 ff 48 RSP: 0018:ffff888063e4fca0 EFLAGS: 00010093 RAX: ffff888066214380 RBX: ffffffff99850fe0 RCX: ffffffff964298a8 RDX: 0000000000000000 RSI: 00000000fffffff5 RDI: 0000000000000005 RBP: 1ffff1100c7c9f96 R08: ffff888066214380 R09: ffffed100c7c9f9b R10: ffffed100c7c9f9a R11: 0000000000000003 R12: 0000000000000000 R13: 00000000ffffffea R14: ffff888066214380 R15: ffffffff99851060 FS: 00007f9f8173c700(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000714dc0 CR3: 0000000066fa6000 CR4: 00000000000006f0 Call Trace: ? trace_array_printk_buf+0x140/0x140 ? __mutex_lock_slowpath+0x10/0x10 tracing_snapshot_write+0x4c8/0x7f0 ? trace_printk_init_buffers+0x60/0x60 ? selinux_file_permission+0x3b/0x540 ? tracer_preempt_off+0x38/0x506 ? trace_printk_init_buffers+0x60/0x60 __vfs_write+0x81/0x100 vfs_write+0x1e1/0x560 ksys_write+0x126/0x250 ? __ia32_sys_read+0xb0/0xb0 ? do_syscall_64+0x1f/0x390 do_syscall_64+0xc1/0x390 entry_SYSCALL_64_after_hwframe+0x49/0xbe This patch adds resize_buffer_duplicate_size() to check if there is a difference between current/spare buffer sizes and resize a spare buffer if necessary. Link: http://lkml.kernel.org/r/20190625012910.13109-1-devel@etsukata.com Cc: stable@vger.kernel.org Fixes: `ad909e21bb` ("tracing: Add internal tracing_snapshot() functions") Signed-off-by: Eiichi Tsukata <devel@etsukata.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2019-06-28 14:58:52 -04:00

1 2 3 4 5 ...

30506 Commits