Commit Graph

6473 Commits

Author SHA1 Message Date
Radim Krčmář
9cf0eaf88d KVM: x86: fix APIC page invalidation
commit b1394e745b upstream.

Implementation of the unpinned APIC page didn't update the VMCS address
cache when invalidation was done through range mmu notifiers.
This became a problem when the page notifier was removed.

Re-introduce the arch-specific helper and call it from ...range_start.

Reported-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Fixes: 38b9917350 ("kvm: vmx: Implement set_apic_access_page_addr")
Fixes: 369ea8242c ("mm/rmap: update to new mmu_notifier semantic v2")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
Tested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14 09:28:12 +01:00
Dave Hansen
61b7a6f1ce x86/entry: Use SYSCALL_DEFINE() macros for sys_modify_ldt()
[ Upstream commit da20ab3518 ]

We do not have tracepoints for sys_modify_ldt() because we define
it directly instead of using the normal SYSCALL_DEFINEx() macros.

However, there is a reason sys_modify_ldt() does not use the macros:
it has an 'int' return type instead of 'unsigned long'.  This is
a bug, but it's a bug cemented in the ABI.

What does this mean?  If we return -EINVAL from a function that
returns 'int', we have 0x00000000ffffffea in %rax.  But, if we
return -EINVAL from a function returning 'unsigned long', we end
up with 0xffffffffffffffea in %rax, which is wrong.

To work around this and maintain the 'int' behavior while using
the SYSCALL_DEFINEx() macros, so we add a cast to 'unsigned int'
in both implementations of sys_modify_ldt().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171018172107.1A79C532@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-09 22:01:49 +01:00
Peter Zijlstra
2715f6841a x86/uaccess, sched/preempt: Verify access_ok() context
commit 7c4788950b upstream.

I recently encountered wreckage because access_ok() was used where it
should not be, add an explicit WARN when access_ok() is used wrongly.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-15 15:53:17 +01:00
Mathias Krause
fb6da44f96 x86/alternatives: Fix alt_max_short macro to really be a max()
commit 6b32c126d3 upstream.

The alt_max_short() macro in asm/alternative.h does not work as
intended, leading to nasty bugs. E.g. alt_max_short("1", "3")
evaluates to 3, but alt_max_short("3", "1") evaluates to 1 -- not
exactly the maximum of 1 and 3.

In fact, I had to learn it the hard way by crashing my kernel in not
so funny ways by attempting to make use of the ALTENATIVE_2 macro
with alternatives where the first one was larger than the second
one.

According to [1] and commit dbe4058a6a ("x86/alternatives: Fix
ALTERNATIVE_2 padding generation properly") the right handed side
should read "-(-(a < b))" not "-(-(a - b))". Fix that, to make the
macro work as intended.

While at it, fix up the comments regarding the additional "-", too.
It's not about gas' usage of s32 but brain dead logic of having a
"true" value of -1 for the < operator ... *sigh*

Btw., the one in asm/alternative-asm.h is correct. And, apparently,
all current users of ALTERNATIVE_2() pass same sized alternatives,
avoiding to hit the bug.

[1] http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax

Reviewed-and-tested-by: Borislav Petkov <bp@suse.de>
Fixes: dbe4058a6a ("x86/alternatives: Fix ALTERNATIVE_2 padding generation properly")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1507228213-13095-1-git-send-email-minipli@googlemail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-18 09:35:41 +02:00
Paolo Bonzini
922e562b26 KVM: x86: fix singlestepping over syscall
commit c8401dda2f upstream.

TF is handled a bit differently for syscall and sysret, compared
to the other instructions: TF is checked after the instruction completes,
so that the OS can disable #DB at a syscall by adding TF to FMASK.
When the sysret is executed the #DB is taken "as if" the syscall insn
just completed.

KVM emulates syscall so that it can trap 32-bit syscall on Intel processors.
Fix the behavior, otherwise you could get #DB on a user stack which is not
nice.  This does not affect Linux guests, as they use an IST or task gate
for #DB.

This fixes CVE-2017-7518.

Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
[bwh: Backported to 4.9:
 - kvm_vcpu_check_singlestep() sets some flags differently
 - Drop changes to kvm_skip_emulated_instruction()]
Cc: Ben Hutchings <benh@debian.org>
Cc: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12 11:51:26 +02:00
Andy Lutomirski
0caec70692 x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
commit 9584d98bed upstream.

In ELF_COPY_CORE_REGS, we're copying from the current task, so
accessing thread.fsbase and thread.gsbase makes no sense.  Just read
the values from the CPU registers.

In practice, the old code would have been correct most of the time
simply because thread.fsbase and thread.gsbase usually matched the
CPU registers.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang Seok <chang.seok.bae@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-20 08:19:57 +02:00
Arnd Bergmann
43f776dab3 x86/io: Add "memory" clobber to insb/insw/insl/outsb/outsw/outsl
commit 7206f9bf10 upstream.

The x86 version of insb/insw/insl uses an inline assembly that does
not have the target buffer listed as an output. This can confuse
the compiler, leading it to think that a subsequent access of the
buffer is uninitialized:

  drivers/net/wireless/wl3501_cs.c: In function ‘wl3501_mgmt_scan_confirm’:
  drivers/net/wireless/wl3501_cs.c:665:9: error: ‘sig.status’ is used uninitialized in this function [-Werror=uninitialized]
  drivers/net/wireless/wl3501_cs.c:668:12: error: ‘sig.cap_info’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  drivers/net/sb1000.c: In function 'sb1000_rx':
  drivers/net/sb1000.c:775:9: error: 'st[0]' is used uninitialized in this function [-Werror=uninitialized]
  drivers/net/sb1000.c:776:10: error: 'st[1]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  drivers/net/sb1000.c:784:11: error: 'st[1]' may be used uninitialized in this function [-Werror=maybe-uninitialized]

I tried to mark the exact input buffer as an output here, but couldn't
figure it out. As suggested by Linus, marking all memory as clobbered
however is good enough too. For the outs operations, I also add the
memory clobber, to force the input to be written to local variables.
This is probably already guaranteed by the "asm volatile", but it can't
hurt to do this for symmetry.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: http://lkml.kernel.org/r/20170719125310.2487451-5-arnd@arndb.de
Link: https://lkml.org/lkml/2017/7/12/605
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-02 07:07:53 +02:00
Eric Biggers
3559de45c9 x86/mm: Fix use-after-free of ldt_struct
commit ccd5b32351 upstream.

The following commit:

  39a0526fb3 ("x86/mm: Factor out LDT init from context init")

renamed init_new_context() to init_new_context_ldt() and added a new
init_new_context() which calls init_new_context_ldt().  However, the
error code of init_new_context_ldt() was ignored.  Consequently, if a
memory allocation in alloc_ldt_struct() failed during a fork(), the
->context.ldt of the new task remained the same as that of the old task
(due to the memcpy() in dup_mm()).  ldt_struct's are not intended to be
shared, so a use-after-free occurred after one task exited.

Fix the bug by making init_new_context() pass through the error code of
init_new_context_ldt().

This bug was found by syzkaller, which encountered the following splat:

    BUG: KASAN: use-after-free in free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
    Read of size 4 at addr ffff88006d2cb7c8 by task kworker/u9:0/3710

    CPU: 1 PID: 3710 Comm: kworker/u9:0 Not tainted 4.13.0-rc4-next-20170811 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
     __dump_stack lib/dump_stack.c:16 [inline]
     dump_stack+0x194/0x257 lib/dump_stack.c:52
     print_address_description+0x73/0x250 mm/kasan/report.c:252
     kasan_report_error mm/kasan/report.c:351 [inline]
     kasan_report+0x24e/0x340 mm/kasan/report.c:409
     __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
     free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
     free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
     destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
     destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
     __mmdrop+0xe9/0x530 kernel/fork.c:889
     mmdrop include/linux/sched/mm.h:42 [inline]
     exec_mmap fs/exec.c:1061 [inline]
     flush_old_exec+0x173c/0x1ff0 fs/exec.c:1291
     load_elf_binary+0x81f/0x4ba0 fs/binfmt_elf.c:855
     search_binary_handler+0x142/0x6b0 fs/exec.c:1652
     exec_binprm fs/exec.c:1694 [inline]
     do_execveat_common.isra.33+0x1746/0x22e0 fs/exec.c:1816
     do_execve+0x31/0x40 fs/exec.c:1860
     call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100
     ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

    Allocated by task 3700:
     save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
     save_stack+0x43/0xd0 mm/kasan/kasan.c:447
     set_track mm/kasan/kasan.c:459 [inline]
     kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
     kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3627
     kmalloc include/linux/slab.h:493 [inline]
     alloc_ldt_struct+0x52/0x140 arch/x86/kernel/ldt.c:67
     write_ldt+0x7b7/0xab0 arch/x86/kernel/ldt.c:277
     sys_modify_ldt+0x1ef/0x240 arch/x86/kernel/ldt.c:307
     entry_SYSCALL_64_fastpath+0x1f/0xbe

    Freed by task 3700:
     save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
     save_stack+0x43/0xd0 mm/kasan/kasan.c:447
     set_track mm/kasan/kasan.c:459 [inline]
     kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
     __cache_free mm/slab.c:3503 [inline]
     kfree+0xca/0x250 mm/slab.c:3820
     free_ldt_struct.part.2+0xdd/0x150 arch/x86/kernel/ldt.c:121
     free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
     destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
     destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
     __mmdrop+0xe9/0x530 kernel/fork.c:889
     mmdrop include/linux/sched/mm.h:42 [inline]
     __mmput kernel/fork.c:916 [inline]
     mmput+0x541/0x6e0 kernel/fork.c:927
     copy_process.part.36+0x22e1/0x4af0 kernel/fork.c:1931
     copy_process kernel/fork.c:1546 [inline]
     _do_fork+0x1ef/0xfb0 kernel/fork.c:2025
     SYSC_clone kernel/fork.c:2135 [inline]
     SyS_clone+0x37/0x50 kernel/fork.c:2129
     do_syscall_64+0x26c/0x8c0 arch/x86/entry/common.c:287
     return_from_SYSCALL_64+0x0/0x7a

Here is a C reproducer:

    #include <asm/ldt.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *fork_thread(void *_arg)
    {
        fork();
    }

    int main(void)
    {
        struct user_desc desc = { .entry_number = 8191 };

        syscall(__NR_modify_ldt, 1, &desc, sizeof(desc));

        for (;;) {
            if (fork() == 0) {
                pthread_t t;

                srand(getpid());
                pthread_create(&t, NULL, fork_thread, NULL);
                usleep(rand() % 10000);
                syscall(__NR_exit_group, 0);
            }
            wait(NULL);
        }
    }

Note: the reproducer takes advantage of the fact that alloc_ldt_struct()
may use vmalloc() to allocate a large ->entries array, and after
commit:

  5d17a73a2e ("vmalloc: back off when the current task is killed")

it is possible for userspace to fail a task's vmalloc() by
sending a fatal signal, e.g. via exit_group().  It would be more
difficult to reproduce this bug on kernels without that commit.

This bug only affected kernels with CONFIG_MODIFY_LDT_SYSCALL=y.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: 39a0526fb3 ("x86/mm: Factor out LDT init from context init")
Link: http://lkml.kernel.org/r/20170824175029.76040-1-ebiggers3@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-30 10:21:51 +02:00
Kees Cook
1f2347a095 mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes
commit c715b72c1b upstream.

Moving the x86_64 and arm64 PIE base from 0x555555554000 to 0x000100000000
broke AddressSanitizer.  This is a partial revert of:

  eab09532d4 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
  02445990a9 ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")

The AddressSanitizer tool has hard-coded expectations about where
executable mappings are loaded.

The motivation for changing the PIE base in the above commits was to
avoid the Stack-Clash CVEs that allowed executable mappings to get too
close to heap and stack.  This was mainly a problem on 32-bit, but the
64-bit bases were moved too, in an effort to proactively protect those
systems (proofs of concept do exist that show 64-bit collisions, but
other recent changes to fix stack accounting and setuid behaviors will
minimize the impact).

The new 32-bit PIE base is fine for ASan (since it matches the ET_EXEC
base), so only the 64-bit PIE base needs to be reverted to let x86 and
arm64 ASan binaries run again.  Future changes to the 64-bit PIE base on
these architectures can be made optional once a more dynamic method for
dealing with AddressSanitizer is found.  (e.g.  always loading PIE into
the mmap region for marked binaries.)

Link: http://lkml.kernel.org/r/20170807201542.GA21271@beast
Fixes: eab09532d4 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
Fixes: 02445990a9 ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Kostya Serebryany <kcc@google.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-24 17:12:20 -07:00
Marek Marczykowski-Górecki
4d3d3a1690 x86/xen: allow userspace access during hypercalls
commit c54590cac5 upstream.

Userspace application can do a hypercall through /dev/xen/privcmd, and
some for some hypercalls argument is a pointers to user-provided
structure. When SMAP is supported and enabled, hypervisor can't access.
So, lets allow it.

The same applies to HYPERVISOR_dm_op, where additionally privcmd driver
carefully verify buffer addresses.

Cc: stable@vger.kernel.org
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
[HYPERVISOR_dm_op dropped - not present until 4.11]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27 15:08:01 -07:00
Jim Mattson
07592d6225 kvm: vmx: Check value written to IA32_BNDCFGS
commit 4531662d1a upstream.

Bits 11:2 must be zero and the linear addess in bits 63:12 must be
canonical. Otherwise, WRMSR(BNDCFGS) should raise #GP.

Fixes: 0dd376e709 ("KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:42:24 +02:00
Kees Cook
63c2f8f8c4 binfmt_elf: use ELF_ET_DYN_BASE only for PIE
commit eab09532d4 upstream.

The ELF_ET_DYN_BASE position was originally intended to keep loaders
away from ET_EXEC binaries.  (For example, running "/lib/ld-linux.so.2
/bin/cat" might cause the subsequent load of /bin/cat into where the
loader had been loaded.)

With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
ELF_ET_DYN_BASE continued to be used since the kernel was only looking
at ET_DYN.  However, since ELF_ET_DYN_BASE is traditionally set at the
top 1/3rd of the TASK_SIZE, a substantial portion of the address space
is unused.

For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
loaded above the mmap region.  This means they can be made to collide
(CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
pathological stack regions.

Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
region in all cases, and will now additionally avoid programs falling
back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
if it would have collided with the stack, now it will fail to load
instead of falling back to the mmap region).

To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
are loaded into the mmap region, leaving space available for either an
ET_EXEC binary with a fixed location or PIE being loaded into mmap by
the loader.  Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
which means architectures can now safely lower their values without risk
of loaders colliding with their subsequently loaded programs.

For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
the entire 32-bit address space for 32-bit pointers.

Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
suggestions on how to implement this solution.

Fixes: d1fd836dcf ("mm: split ET_DYN ASLR from mmap ASLR")
Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:42:21 +02:00
Mikulas Patocka
81ba752aa4 x86/mm/pat: Don't report PAT on CPUs that don't support it
commit 99c13b8c88 upstream.

The pat_enabled() logic is broken on CPUs which do not support PAT and
where the initialization code fails to call pat_init(). Due to that the
enabled flag stays true and pat_enabled() returns true wrongfully.

As a consequence the mappings, e.g. for Xorg, are set up with the wrong
caching mode and the required MTRR setups are omitted.

To cure this the following changes are required:

  1) Make pat_enabled() return true only if PAT initialization was
     invoked and successful.

  2) Invoke init_cache_modes() unconditionally in setup_arch() and
     remove the extra callsites in pat_disable() and the pat disabled
     code path in pat_init().

Also rename __pat_enabled to pat_disabled to reflect the real purpose of
this variable.

Fixes: 9cd25aac1f ("x86/mm/pat: Emulate PAT when it is disabled")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: "Luis R. Rodriguez" <mcgrof@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1707041749300.3456@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15 12:16:17 +02:00
Ladi Prosek
1eeb794263 KVM: x86: fix emulation of RSM and IRET instructions
commit 6ed071f051 upstream.

On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
on hflags is reverted later on in x86_emulate_instruction where hflags are
overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.

Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
an instruction is emulated, this commit deletes emul_flags altogether and
makes the emulator access vcpu->arch.hflags using two new accessors. This
way all changes, on the emulator side as well as in functions called from
the emulator and accessing vcpu state with emul_to_vcpu, are preserved.

More details on the bug and its manifestation with Windows and OVMF:

  It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
  I believe that the SMM part explains why we started seeing this only with
  OVMF.

  KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
  the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
  later on in x86_emulate_instruction we overwrite arch.hflags with
  ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
  The AMD-specific hflag of interest here is HF_NMI_MASK.

  When rebooting the system, Windows sends an NMI IPI to all but the current
  cpu to shut them down. Only after all of them are parked in HLT will the
  initiating cpu finish the restart. If NMI is masked, other cpus never get
  the memo and the initiating cpu spins forever, waiting for
  hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.

Fixes: a584539b24 ("KVM: x86: pass the whole hflags field to emulator and back")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05 14:40:31 +02:00
Borislav Petkov
68c83a3791 x86/MCE: Export memory_error()
commit 2d1f406139 upstream.

Export the function which checks whether an MCE is a memory error to
other users so that we can reuse the logic. Drop the boot_cpu_data use,
while at it, as mce.cpuvendor already has the CPU vendor in there.

Integrate a piece from a patch from Vishal Verma
<vishal.l.verma@intel.com> to export it for modules (nfit).

The main reason we're exporting it is that the nfit handler
nfit_handle_mce() needs to detect a memory error properly before doing
its recovery actions.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Link: http://lkml.kernel.org/r/20170519093915.15413-2-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07 12:07:47 +02:00
Linus Torvalds
ae382caa96 x86: fix 32-bit case of __get_user_asm_u64()
commit 33c9e97290 upstream.

The code to fetch a 64-bit value from user space was entirely buggered,
and has been since the code was merged in early 2016 in commit
b2f680380d ("x86/mm/32: Add support for 64-bit __get_user() on 32-bit
kernels").

Happily the buggered routine is almost certainly entirely unused, since
the normal way to access user space memory is just with the non-inlined
"get_user()", and the inlined version didn't even historically exist.

The normal "get_user()" case is handled by external hand-written asm in
arch/x86/lib/getuser.S that doesn't have either of these issues.

There were two independent bugs in __get_user_asm_u64():

 - it still did the STAC/CLAC user space access marking, even though
   that is now done by the wrapper macros, see commit 11f1a4b975
   ("x86: reorganize SMAP handling in user space accesses").

   This didn't result in a semantic error, it just means that the
   inlined optimized version was hugely less efficient than the
   allegedly slower standard version, since the CLAC/STAC overhead is
   quite high on modern Intel CPU's.

 - the double register %eax/%edx was marked as an output, but the %eax
   part of it was touched early in the asm, and could thus clobber other
   inputs to the asm that gcc didn't expect it to touch.

   In particular, that meant that the generated code could look like
   this:

        mov    (%eax),%eax
        mov    0x4(%eax),%edx

   where the load of %edx obviously was _supposed_ to be from the 32-bit
   word that followed the source of %eax, but because %eax was
   overwritten by the first instruction, the source of %edx was
   basically random garbage.

The fixes are trivial: remove the extraneous STAC/CLAC entries, and mark
the 64-bit output as early-clobber to let gcc know that no inputs should
alias with the output register.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-25 15:44:34 +02:00
Ben Hutchings
e0c871792c x86, pmem: Fix cache flushing for iovec write < 8 bytes
commit 8376efd31d upstream.

Commit 11e63f6d92 added cache flushing for unaligned writes from an
iovec, covering the first and last cache line of a >= 8 byte write and
the first cache line of a < 8 byte write.  But an unaligned write of
2-7 bytes can still cover two cache lines, so make sure we flush both
in that case.

Fixes: 11e63f6d92 ("x86, pmem: fix broken __copy_user_nocache ...")
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:28:36 +02:00
Boris Ostrovsky
5d7ab8339a xen: Revert commits da72ff5bfc and 72a9b18629
commit 84d582d236 upstream.

Recent discussion (http://marc.info/?l=xen-devel&m=149192184523741)
established that commit 72a9b18629 ("xen: Remove event channel
notification through Xen PCI platform device") (and thus commit
da72ff5bfc ("partially revert "xen: Remove event channel
notification through Xen PCI platform device"")) are unnecessary and,
in fact, prevent HVM guests from booting on Xen releases prior to 4.0

Therefore we revert both of those commits.

The summary of that discussion is below:

  Here is the brief summary of the current situation:

  Before the offending commit (72a9b18629):

  1) INTx does not work because of the reset_watches path.
  2) The reset_watches path is only taken if you have Xen > 4.0
  3) The Linux Kernel by default will use vector inject if the hypervisor
     support. So even INTx does not work no body running the kernel with
     Xen > 4.0 would notice. Unless he explicitly disabled this feature
     either in the kernel or in Xen (and this can only be disabled by
     modifying the code, not user-supported way to do it).

  After the offending commit (+ partial revert):

  1) INTx is no longer support for HVM (only for PV guests).
  2) Any HVM guest The kernel will not boot on Xen < 4.0 which does
     not have vector injection support. Since the only other mode
     supported is INTx which.

  So based on this summary, I think before commit (72a9b18629) we were
  in much better position from a user point of view.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Ross Lagerwall <ross.lagerwall@citrix.com>
Cc: xen-devel@lists.xenproject.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: Anthony Liguori <aliguori@amazon.com>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-14 14:00:22 +02:00
Thomas Gleixner
5e29a45f1e x86/vdso: Plug race between mapping and ELF header setup
commit 6fdc6dd902 upstream.

The vsyscall32 sysctl can racy against a concurrent fork when it switches
from disabled to enabled:

    arch_setup_additional_pages()
	if (vdso32_enabled)
           --> No mapping
                                        sysctl.vsysscall32()
                                          --> vdso32_enabled = true
    create_elf_tables()
      ARCH_DLINFO_IA32
        if (vdso32_enabled) {
           --> Add VDSO entry with NULL pointer

Make ARCH_DLINFO_IA32 check whether the VDSO mapping has been set up for
the newly forked process or not.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathias Krause <minipli@googlemail.com>
Link: http://lkml.kernel.org/r/20170410151723.602367196@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21 09:31:20 +02:00
Dan Williams
a9826aa486 x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
commit 11e63f6d92 upstream.

Before we rework the "pmem api" to stop abusing __copy_user_nocache()
for memcpy_to_pmem() we need to fix cases where we may strand dirty data
in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
for arbitrary data transfers from userspace. There is no guarantee that
these transfers, performed by dax_iomap_actor(), will have aligned
destinations or aligned transfer lengths. Backstop the usage
__copy_user_nocache() with explicit cache management in these unaligned
cases.

Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
that is saved for a later patch that moves the entirety of the "pmem
api" into the pmem driver directly.

Fixes: 5de490daec ("pmem: add copy_from_iter_pmem() and clear_pmem()")
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21 09:31:20 +02:00
Paolo Bonzini
3e8c6bd9db KVM: x86: cleanup the page tracking SRCU instance
commit 2beb6dad2e upstream.

SRCU uses a delayed work item.  Skip cleaning it up, and
the result is use-after-free in the work item callbacks.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 0eb05bf290
Reviewed-by: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-31 10:31:45 +02:00
Daniel Borkmann
2f18b39499 x86/tlb: Fix tlb flushing when lguest clears PGE
commit 2c4ea6e28d upstream.

Fengguang reported random corruptions from various locations on x86-32
after commits d2852a2240 ("arch: add ARCH_HAS_SET_MEMORY config") and
9d876e79df ("bpf: fix unlocking of jited image when module ronx not set")
that uses the former. While x86-32 doesn't have a JIT like x86_64, the
bpf_prog_lock_ro() and bpf_prog_unlock_ro() got enabled due to
ARCH_HAS_SET_MEMORY, whereas Fengguang's test kernel doesn't have module
support built in and therefore never had the DEBUG_SET_MODULE_RONX setting
enabled.

After investigating the crashes further, it turned out that using
set_memory_ro() and set_memory_rw() didn't have the desired effect, for
example, setting the pages as read-only on x86-32 would still let
probe_kernel_write() succeed without error. This behavior would manifest
itself in situations where the vmalloc'ed buffer was accessed prior to
set_memory_*() such as in case of bpf_prog_alloc(). In cases where it
wasn't, the page attribute changes seemed to have taken effect, leading to
the conclusion that a TLB invalidate didn't happen. Moreover, it turned out
that this issue reproduced with qemu in "-cpu kvm64" mode, but not for
"-cpu host". When the issue occurs, change_page_attr_set_clr() did trigger
a TLB flush as expected via __flush_tlb_all() through cpa_flush_range(),
though.

There are 3 variants for issuing a TLB flush: invpcid_flush_all() (depends
on CPU feature bits X86_FEATURE_INVPCID, X86_FEATURE_PGE), cr4 based flush
(depends on X86_FEATURE_PGE), and cr3 based flush.  For "-cpu host" case in
my setup, the flush used invpcid_flush_all() variant, whereas for "-cpu
kvm64", the flush was cr4 based. Switching the kvm64 case to cr3 manually
worked fine, and further investigating the cr4 one turned out that
X86_CR4_PGE bit was not set in cr4 register, meaning the
__native_flush_tlb_global_irq_disabled() wrote cr4 twice with the same
value instead of clearing X86_CR4_PGE in the first write to trigger the
flush.

It turned out that X86_CR4_PGE was cleared from cr4 during init from
lguest_arch_host_init() via adjust_pge(). The X86_FEATURE_PGE bit is also
cleared from there due to concerns of using PGE in guest kernel that can
lead to hard to trace bugs (see bff672e630 ("lguest: documentation V:
Host") in init()). The CPU feature bits are cleared in dynamic
boot_cpu_data, but they never propagated to __flush_tlb_all() as it uses
static_cpu_has() instead of boot_cpu_has() for testing which variant of TLB
flushing to use, meaning they still used the old setting of the host
kernel.

Clearing via setup_clear_cpu_cap(X86_FEATURE_PGE) so this would propagate
to static_cpu_has() checks is too late at this point as sections have been
patched already, so for now, it seems reasonable to switch back to
boot_cpu_has(X86_FEATURE_PGE) as it was prior to commit c109bf9599
("x86/cpufeature: Remove cpu_has_pge"). This lets the TLB flush trigger via
cr3 as originally intended, properly makes the new page attributes visible
and thus fixes the crashes seen by Fengguang.

Fixes: c109bf9599 ("x86/cpufeature: Remove cpu_has_pge")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: bp@suse.de
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: lkp@01.org
Cc: Laura Abbott <labbott@redhat.com>
Link: http://lkml.kernrl.org/r/20170301125426.l4nf65rx4wahohyl@wfg-t540p.sh.intel.com
Link: http://lkml.kernel.org/r/25c41ad9eca164be4db9ad84f768965b7eb19d9e.1489191673.git.daniel@iogearbox.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-15 10:02:51 +08:00
Dave Hansen
bfb55d4087 x86/pkeys: Check against max pkey to avoid overflows
commit 58ab9a088d upstream.

Kirill reported a warning from UBSAN about undefined behavior when using
protection keys.  He is running on hardware that actually has support for
it, which is not widely available.

The warning triggers because of very large shifts of integers when doing a
pkey_free() of a large, invalid value. This happens because we never check
that the pkey "fits" into the mm_pkey_allocation_map().

I do not believe there is any danger here of anything bad happening
other than some aliasing issues where somebody could do:

	pkey_free(35);

and the kernel would effectively execute:

	pkey_free(8);

While this might be confusing to an app that was doing something stupid, it
has to do something stupid and the effects are limited to the app shooting
itself in the foot.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-kselftest@vger.kernel.org
Cc: shuah@kernel.org
Cc: kirill.shutemov@linux.intel.com
Link: http://lkml.kernel.org/r/20170223222603.A022ED65@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12 06:41:47 +01:00
Borislav Petkov
6e306c5907 x86/CPU/AMD: Bring back Compute Unit ID
commit 79a8b9aa38 upstream.

Commit:

  a33d331761 ("x86/CPU/AMD: Fix Bulldozer topology")

restored the initial approach we had with the Fam15h topology of
enumerating CU (Compute Unit) threads as cores. And this is still
correct - they're beefier than HT threads but still have some
shared functionality.

Our current approach has a problem with the Mad Max Steam game, for
example. Yves Dionne reported a certain "choppiness" while playing on
v4.9.5.

That problem stems most likely from the fact that the CU threads share
resources within one CU and when we schedule to a thread of a different
compute unit, this incurs latency due to migrating the working set to a
different CU through the caches.

When the thread siblings mask mirrors that aspect of the CUs and
threads, the scheduler pays attention to it and tries to schedule within
one CU first. Which takes care of the latency, of course.

Reported-by: Yves Dionne <yves.dionne@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20170205105022.8705-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-14 15:25:41 -08:00
Thomas Gleixner
bd7e769457 x86/bugs: Separate AMD E400 erratum and C1E bug
commit 3344ed3079 upstream.

The workaround for the AMD Erratum E400 (Local APIC timer stops in C1E
state) is a two step process:

 - Selection of the E400 aware idle routine

 - Detection whether the platform is affected

The idle routine selection happens for possibly affected CPUs depending on
family/model/stepping information. These range of CPUs is not necessarily
affected as the decision whether to enable the C1E feature is made by the
firmware. Unfortunately there is no way to query this at early boot.

The current implementation polls a MSR in the E400 aware idle routine to
detect whether the CPU is affected. This is inefficient on non affected
CPUs because every idle entry has to do the MSR read.

There is a better way to detect this before going idle for the first time
which requires to seperate the bug flags:

  X86_BUG_AMD_E400 	- Selects the E400 aware idle routine and
  			  enables the detection

  X86_BUG_AMD_APIC_C1E  - Set when the platform is affected by E400

Replace the current X86_BUG_AMD_APIC_C1E usage by the new X86_BUG_AMD_E400
bug bit to select the idle routine which currently does an unconditional
detection poll. X86_BUG_AMD_APIC_C1E is going to be used in later patches
to remove the MSR polling and simplify the handling of this misfeature.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20161209182912.2726-3-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-19 20:18:01 +01:00
Adam Borowski
705df55bd0 x86/kbuild: enable modversions for symbols exported from asm
commit 334bb77387 upstream.

Commit 4efca4ed ("kbuild: modversions for EXPORT_SYMBOL() for asm") adds
modversion support for symbols exported from asm files. Architectures
must include C-style declarations for those symbols in asm/asm-prototypes.h
in order for them to be versioned.

Add these declarations for x86, and an architecture-independent file that
can be used for common symbols.

With f27c2f6 reverting 8ab2ae6 ("default exported asm symbols to zero") we
produce a scary warning on x86, this commit fixes that.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Tested-by: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Peter Wu <peter@lekensteyn.nl>
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Michal Marek <mmarek@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 10:40:17 +01:00
Dmitry Safonov
7b2dd36828 x86/coredump: Always use user_regs_struct for compat_elf_gregset_t
Commit:

  90954e7b94 ("x86/coredump: Use pr_reg size, rather that TIF_IA32 flag")

changed the coredumping code to construct the elf coredump file according
to register set size - and that's good: if binary crashes with 32-bit code
selector, generate 32-bit ELF core, otherwise - 64-bit core.

That was made for restoring 32-bit applications on x86_64: we want
32-bit application after restore to generate 32-bit ELF dump on crash.

All was quite good and recently I started reworking 32-bit applications
dumping part of CRIU: now it has two parasites (32 and 64) for seizing
compat/native tasks, after rework it'll have one parasite, working in
64-bit mode, to which 32-bit prologue long-jumps during infection.

And while it has worked for my work machine, in VM with
!CONFIG_X86_X32_ABI during reworking I faced that segfault in 32-bit
binary, that has long-jumped to 64-bit mode results in dereference
of garbage:

 32-victim[19266]: segfault at f775ef65 ip 00000000f775ef65 sp 00000000f776aa50 error 14
 BUG: unable to handle kernel paging request at ffffffffffffffff
 IP: [<ffffffff81332ce0>] strlen+0x0/0x20
 [...]
 Call Trace:
  [] elf_core_dump+0x11a9/0x1480
  [] do_coredump+0xa6b/0xe60
  [] get_signal+0x1a8/0x5c0
  [] do_signal+0x23/0x660
  [] exit_to_usermode_loop+0x34/0x65
  [] prepare_exit_to_usermode+0x2f/0x40
  [] retint_user+0x8/0x10

That's because we have 64-bit registers set (with according total size)
and we're writing it to elf_thread_core_info which has smaller size
on !CONFIG_X86_X32_ABI. That lead to overwriting ELF notes part.

Tested on 32-, 64-bit ELF crashes and on 32-bit binaries that have
jumped with 64-bit code selector - all is readable with gdb.

Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: 90954e7b94 ("x86/coredump: Use pr_reg size, rather that TIF_IA32 flag")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-24 06:01:05 +01:00
Lukas Wunner
e8a6123e9e x86/platform/intel-mid: Retrofit pci_platform_pm_ops ->get_state hook
Commit cc7cc02bad ("PCI: Query platform firmware for device power
state") augmented struct pci_platform_pm_ops with a ->get_state hook and
implemented it for acpi_pci_platform_pm, the only pci_platform_pm_ops
existing till v4.7.

However v4.8 introduced another pci_platform_pm_ops for Intel Mobile
Internet Devices with commit 5823d0893e ("x86/platform/intel-mid: Add
Power Management Unit driver").  It is missing the ->get_state hook,
which is fatal since pci_set_platform_pm() enforces its presence.  Andy
Shevchenko reports that without the present commit, such a device
"crashes without even a character printed out on serial console and
reboots (since watchdog)".

Retrofit mid_pci_platform_pm with the missing callback to fix the
breakage.

Acked-and-tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Fixes: cc7cc02bad ("PCI: Query platform firmware for device power state")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: http://lkml.kernel.org/r/7c1567d4c49303a4aada94ba16275cbf56b8976b.1477221514.git.lukas@wunner.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-07 13:06:59 +01:00
Linus Torvalds
66cecb6789 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
 "One NULL pointer dereference, and two fixes for regressions introduced
  during the merge window.

  The rest are fixes for MIPS, s390 and nested VMX"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: x86: Check memopp before dereference (CVE-2016-8630)
  kvm: nVMX: VMCLEAR an active shadow VMCS after last use
  KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK
  KVM: x86: fix wbinvd_dirty_mask use-after-free
  kvm/x86: Show WRMSR data is in hex
  kvm: nVMX: Fix kernel panics induced by illegal INVEPT/INVVPID types
  KVM: document lock orders
  KVM: fix OOPS on flush_work
  KVM: s390: Fix STHYI buffer alignment for diag224
  KVM: MIPS: Precalculate MMIO load resume PC
  KVM: MIPS: Make ERET handle ERL before EXL
  KVM: MIPS: Fix lazy user ASID regenerate for SMP
2016-11-04 13:08:05 -07:00
Paolo Bonzini
ea26e4ec08 KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK
Since commit a545ab6a00 ("kvm: x86: add tsc_offset field to struct
kvm_vcpu_arch", 2016-09-07) the offset between host and L1 TSC is
cached and need not be fished out of the VMCS or VMCB.  This means
that we can implement adjust_tsc_offset_guest and read_l1_tsc
entirely in generic code.  The simplification is particularly
significant for VMX code, where vmx->nested.vmcs01_tsc_offset
was duplicating what is now in vcpu->arch.tsc_offset.  Therefore
the vmcs01_tsc_offset can be dropped completely.

More importantly, this fixes KVM_GET_CLOCK/KVM_SET_CLOCK
which, after commit 108b249c45 ("KVM: x86: introduce get_kvmclock_ns",
2016-09-01) called read_l1_tsc while the VMCS was not loaded.
It thus returned bogus values on Intel CPUs.

Fixes: 108b249c45
Reported-by: Roman Kagan <rkagan@virtuozzo.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 20:03:07 +01:00
Dave Airlie
8ef4227615 x86/io: add interface to reserve io memtype for a resource range. (v1.1)
A recent change to the mm code in:
87744ab383 mm: fix cache mode tracking in vm_insert_mixed()

started enforcing checking the memory type against the registered list for
amixed pfn insertion mappings. It happens that the drm drivers for a number
of gpus relied on this being broken. Currently the driver only inserted
VRAM mappings into the tracking table when they came from the kernel,
and userspace mappings never landed in the table. This led to a regression
where all the mapping end up as UC instead of WC now.

I've considered a number of solutions but since this needs to be fixed
in fixes and not next, and some of the solutions were going to introduce
overhead that hadn't been there before I didn't consider them viable at
this stage. These mainly concerned hooking into the TTM io reserve APIs,
but these API have a bunch of fast paths I didn't want to unwind to add
this to.

The solution I've decided on is to add a new API like the arch_phys_wc
APIs (these would have worked but wc_del didn't take a range), and
use them from the drivers to add a WC compatible mapping to the table
for all VRAM on those GPUs. This means we can then create userspace
mapping that won't get degraded to UC.

v1.1: use CONFIG_X86_PAT + add some comments in io.h

Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: x86@kernel.org
Cc: mcgrof@suse.com
Cc: Dan Williams <dan.j.williams@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2016-10-26 15:45:38 +10:00
Heiko Carstens
c8061485a0 sched/core, x86: Make struct thread_info arch specific again
The following commit:

  c65eacbe29 ("sched/core: Allow putting thread_info into task_struct")

... made 'struct thread_info' a generic struct with only a
single ::flags member, if CONFIG_THREAD_INFO_IN_TASK_STRUCT=y is
selected.

This change however seems to be quite x86 centric, since at least the
generic preemption code (asm-generic/preempt.h) assumes that struct
thread_info also has a preempt_count member, which apparently was not
true for x86.

We could add a bit more #ifdefs to solve this problem too, but it seems
to be much simpler to make struct thread_info arch specific
again. This also makes the conversion to THREAD_INFO_IN_TASK_STRUCT a
bit easier for architectures that have a couple of arch specific stuff
in their thread_info definition.

The arch specific stuff _could_ be moved to thread_struct. However
keeping them in thread_info makes it easier: accessing thread_info
members is simple, since it is at the beginning of the task_struct,
while the thread_struct is at the end. At least on s390 the offsets
needed to access members of the thread_struct (with task_struct as
base) are too large for various asm instructions.  This is not a
problem when keeping these members within thread_info.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: keescook@chromium.org
Cc: linux-arch@vger.kernel.org
Link: http://lkml.kernel.org/r/1476901693-8492-2-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-20 13:27:47 +02:00
Piotr Luc
8214899342 x86/cpufeature: Add AVX512_4VNNIW and AVX512_4FMAPS features
AVX512_4VNNIW  - Vector instructions for deep learning enhanced word
variable precision.
AVX512_4FMAPS - Vector instructions for deep learning floating-point
single precision.

These new instructions are to be used in future Intel Xeon & Xeon Phi
processors. The bits 2&3 of CPUID[level:0x07, EDX] inform that new
instructions are supported by a processor.

The spec can be found in the Intel Software Developer Manual (SDM) or in
the Instruction Set Extensions Programming Reference (ISE).

Define new feature flags to enumerate the new instructions in /proc/cpuinfo
accordingly to CPUID bits and add the required xsave extensions which are
required for proper operation.

Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20161018150111.29926-1-piotr.luc@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-19 17:37:13 +02:00
Linus Torvalds
0832881425 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes, plus hw-enablement changes:

   - fix persistent RAM handling
   - remove pkeys warning
   - remove duplicate macro
   - fix debug warning in irq handler
   - add new 'Knights Mill' CPU related constants and enable the perf bits"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Add Knights Mill CPUID
  perf/x86/intel/rapl: Add Knights Mill CPUID
  perf/x86/intel: Add Knights Mill CPUID
  x86/cpu/intel: Add Knights Mill to Intel family
  x86/e820: Don't merge consecutive E820_PRAM ranges
  pkeys: Remove easily triggered WARN
  x86: Remove duplicate rtit status MSR macro
  x86/smp: Add irq_enter/exit() in smp_reschedule_interrupt()
2016-10-18 09:59:04 -07:00
Josh Poimboeuf
55a76b59b5 locking/rwsem/x86: Add stack frame dependency for ____down_write()
Arnd reported the following objtool warning:

  kernel/locking/rwsem.o: warning: objtool: down_write_killable()+0x16: call without frame pointer save/setup

The warning means gcc placed the ____down_write() inline asm (and its
call instruction) before the frame pointer setup in
down_write_killable(), which breaks frame pointer convention and can
result in incorrect stack traces.

Force the stack frame to be created before the call instruction by
listing the stack pointer as an output operand in the inline asm
statement.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1188b7015f04baf361e59de499ee2d7272c59dce.1476393828.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-18 12:21:16 +02:00
Piotr Luc
0047f59834 x86/cpu/intel: Add Knights Mill to Intel family
Add CPUID of Knights Mill (KNM) processor to Intel family list.

Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161012180520.30976-1-piotr.luc@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-17 10:45:08 +02:00
Ingo Molnar
1d33369db2 Merge tag 'v4.9-rc1' into x86/urgent, to pick up updates
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-16 11:31:39 +02:00
Linus Torvalds
84d69848c9 Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild updates from Michal Marek:

 - EXPORT_SYMBOL for asm source by Al Viro.

   This does bring a regression, because genksyms no longer generates
   checksums for these symbols (CONFIG_MODVERSIONS). Nick Piggin is
   working on a patch to fix this.

   Plus, we are talking about functions like strcpy(), which rarely
   change prototypes.

 - Fixes for PPC fallout of the above by Stephen Rothwell and Nick
   Piggin

 - fixdep speedup by Alexey Dobriyan.

 - preparatory work by Nick Piggin to allow architectures to build with
   -ffunction-sections, -fdata-sections and --gc-sections

 - CONFIG_THIN_ARCHIVES support by Stephen Rothwell

 - fix for filenames with colons in the initramfs source by me.

* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (22 commits)
  initramfs: Escape colons in depfile
  ppc: there is no clear_pages to export
  powerpc/64: whitelist unresolved modversions CRCs
  kbuild: -ffunction-sections fix for archs with conflicting sections
  kbuild: add arch specific post-link Makefile
  kbuild: allow archs to select link dead code/data elimination
  kbuild: allow architectures to use thin archives instead of ld -r
  kbuild: Regenerate genksyms lexer
  kbuild: genksyms fix for typeof handling
  fixdep: faster CONFIG_ search
  ia64: move exports to definitions
  sparc32: debride memcpy.S a bit
  [sparc] unify 32bit and 64bit string.h
  sparc: move exports to definitions
  ppc: move exports to definitions
  arm: move exports to definitions
  s390: move exports to definitions
  m68k: move exports to definitions
  alpha: move exports to actual definitions
  x86: move exports to actual definitions
  ...
2016-10-14 14:26:58 -07:00
Linus Torvalds
b6daa51b9a Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:

 - Nick improved generic implementations of percpu operations which
   modify the variable and return so that they calculate the physical
   address only once.

 - percpu_ref percpu <-> atomic mode switching improvements. The
   patchset was originally posted about a year ago but fell through the
   crack.

 - misc non-critical fixes.

* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
  mm/percpu.c: correct max_distance calculation for pcpu_embed_first_chunk()
  percpu: eliminate two sparse warnings
  percpu: improve generic percpu modify-return implementation
  percpu-refcount: init ->confirm_switch member properly
  percpu_ref: allow operation mode switching operations to be called concurrently
  percpu_ref: restructure operation mode switching
  percpu_ref: unify staggered atomic switching wait behavior
  percpu_ref: reorganize __percpu_ref_switch_to_atomic() and relocate percpu_ref_switch_to_atomic()
  percpu_ref: remove unnecessary RCU grace period for staggered atomic switching confirmation
2016-10-14 11:46:25 -07:00
Longpeng(Mike)
c836eeda3e x86: Remove duplicate rtit status MSR macro
The MSR_IA32_RTIT_STATUS is defined twice, so remove one.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: len.brown@intel.com
Cc: peterz@infradead.org
Cc: rafael.j.wysocki@intel.com
Cc: alexander.shishkin@linux.intel.com
Cc: ray.huang@amd.com
Cc: Aravind.Gopalakrishnan@amd.com
Cc: wu.wubin@huawei.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: zhaoshenglong@huawei.com
Cc: vladimir_zapolskiy@mentor.com
Link: http://lkml.kernel.org/r/1476405740-80816-1-git-send-email-longpeng2@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-14 14:14:20 +02:00
Linus Torvalds
4cdf8dbe2d Merge branch 'work.uaccess2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess.h prepwork from Al Viro:
 "Preparations to tree-wide switch to use of linux/uaccess.h (which,
  obviously, will allow to start unifying stuff for real). The last step
  there, ie

    PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
    sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
            `git grep -l "$PATT"|grep -v ^include/linux/uaccess.h`

  is not taken here - I would prefer to do it once just before or just
  after -rc1.  However, everything should be ready for it"

* 'work.uaccess2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  remove a stray reference to asm/uaccess.h in docs
  sparc64: separate extable_64.h, switch elf_64.h to it
  score: separate extable.h, switch module.h to it
  mips: separate extable.h, switch module.h to it
  x86: separate extable.h, switch sections.h to it
  remove stray include of asm/uaccess.h from cacheflush.h
  mn10300: remove a bogus processor.h->uaccess.h include
  xtensa: split uaccess.h into C and asm sides
  bonding: quit messing with IOCTL
  kill __kernel_ds_p off
  mn10300: finish verify_area() off
  frv: move HAVE_ARCH_UNMAPPED_AREA to pgtable.h
  exceptions: detritus removal
2016-10-11 23:38:39 -07:00
Hidehiro Kawai
0ee59413c9 x86/panic: replace smp_send_stop() with kdump friendly version in panic path
Daniel Walker reported problems which happens when
crash_kexec_post_notifiers kernel option is enabled
(https://lkml.org/lkml/2015/6/24/44).

In that case, smp_send_stop() is called before entering kdump routines
which assume other CPUs are still online.  As the result, for x86, kdump
routines fail to save other CPUs' registers and disable virtualization
extensions.

To fix this problem, call a new kdump friendly function,
crash_smp_send_stop(), instead of the smp_send_stop() when
crash_kexec_post_notifiers is enabled.  crash_smp_send_stop() is a weak
function, and it just call smp_send_stop().  Architecture codes should
override it so that kdump can work appropriately.  This patch only
provides x86-specific version.

For Xen's PV kernel, just keep the current behavior.

NOTES:

- Right solution would be to place crash_smp_send_stop() before
  __crash_kexec() invocation in all cases and remove smp_send_stop(), but
  we can't do that until all architectures implement own
  crash_smp_send_stop()

- crash_smp_send_stop()-like work is still needed by
  machine_crash_shutdown() because crash_kexec() can be called without
  entering panic()

Fixes: f06e5153f4 (kernel/panic.c: add "crash_kexec_post_notifiers" option)
Link: http://lkml.kernel.org/r/20160810080948.11028.15344.stgit@sysi4-13.yrl.intra.hitachi.co.jp
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Reported-by: Daniel Walker <dwalker@fifo99.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Daniel Walker <dwalker@fifo99.com>
Cc: Xunlei Pang <xpang@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Daney <david.daney@cavium.com>
Cc: Aaro Koskinen <aaro.koskinen@iki.fi>
Cc: "Steven J. Hill" <steven.hill@cavium.com>
Cc: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:32 -07:00
Linus Torvalds
93c26d7dc0 Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull protection keys syscall interface from Thomas Gleixner:
 "This is the final step of Protection Keys support which adds the
  syscalls so user space can actually allocate keys and protect memory
  areas with them. Details and usage examples can be found in the
  documentation.

  The mm side of this has been acked by Mel"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pkeys: Update documentation
  x86/mm/pkeys: Do not skip PKRU register if debug registers are not used
  x86/pkeys: Fix pkeys build breakage for some non-x86 arches
  x86/pkeys: Add self-tests
  x86/pkeys: Allow configuration of init_pkru
  x86/pkeys: Default to a restrictive init PKRU
  pkeys: Add details of system call use to Documentation/
  generic syscalls: Wire up memory protection keys syscalls
  x86: Wire up protection keys system calls
  x86/pkeys: Allocation/free syscalls
  x86/pkeys: Make mprotect_key() mask off additional vm_flags
  mm: Implement new pkey_mprotect() system call
  x86/pkeys: Add fault handling for PF_PK page fault bit
2016-10-10 11:01:51 -07:00
Linus Torvalds
5fa0eb0b4d Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 updates from Thomas Gleixner:
 "A pile of regression fixes and updates:

   - address the fallout of the patches which made the cpuid - nodeid
     relation permanent: Handling of invalid APIC ids and preventing
     pointless warning messages.

   - force eager FPU when protection keys are enabled. Protection keys
     are not generating FPU exceptions so they cannot work with the lazy
     FPU mechanism.

   - prevent force migration of interrupts which are not part of the CPU
     vector domain.

   - handle the fact that APIC ids are not updated in the ACPI/MADT
     tables on physical CPU hotplug

   - remove bash-isms from syscall table generator script

   - use the hypervisor supplied APIC frequency when running on VMware"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pkeys: Make protection keys an "eager" feature
  x86/apic: Prevent pointless warning messages
  x86/acpi: Prevent LAPIC id 0xff from being accounted
  arch/x86: Handle non enumerated CPU after physical hotplug
  x86/unwind: Fix oprofile module link error
  x86/vmware: Skip lapic calibration on VMware
  x86/syscalls: Remove bash-isms in syscall table generator
  x86/irq: Prevent force migration of irqs which are not in the vector domain
2016-10-10 10:59:07 -07:00
Dave Hansen
d4b05923f5 x86/pkeys: Make protection keys an "eager" feature
Our XSAVE features are divided into two categories: those that
generate FPU exceptions, and those that do not.  MPX and pkeys do
not generate FPU exceptions and thus can not be used lazily.  We
disable them when lazy mode is forced on.

We have a pair of masks to collect these two sets of features, but
XFEATURE_MASK_PKRU was added to the wrong mask: XFEATURE_MASK_LAZY.
Fix it by moving the feature to XFEATURE_MASK_EAGER.

Note: this only causes problem if you boot with lazy FPU mode
(eagerfpu=off) which is *not* the default.  It also only affects
hardware which is not currently publicly available.  It looks like
eager mode is going away, but we still need this patch applied
to any kernel that has protection keys and lazy mode, which is 4.6
through 4.8 at this point, and 4.9 if the lazy removal isn't sent
to Linus for 4.9.

Fixes: c8df400984 ("x86/fpu, x86/mm/pkeys: Add PKRU xsave fields and data structures")
Signed-off-by: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20161007162342.28A49813@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-08 12:26:44 +02:00
Chris Metcalf
6727ad9e20 nmi_backtrace: generate one-line reports for idle cpus
When doing an nmi backtrace of many cores, most of which are idle, the
output is a little overwhelming and very uninformative.  Suppress
messages for cpus that are idling when they are interrupted and just
emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".

We do this by grouping all the cpuidle code together into a new
.cpuidle.text section, and then checking the address of the interrupted
PC to see if it lies within that section.

This commit suitably tags x86 and tile idle routines, and only adds in
the minimal framework for other architectures.

Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm]
Tested-by: Petr Mladek <pmladek@suse.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Chris Metcalf
9a01c3ed5c nmi_backtrace: add more trigger_*_cpu_backtrace() methods
Patch series "improvements to the nmi_backtrace code" v9.

This patch series modifies the trigger_xxx_backtrace() NMI-based remote
backtracing code to make it more flexible, and makes a few small
improvements along the way.

The motivation comes from the task isolation code, where there are
scenarios where we want to be able to diagnose a case where some cpu is
about to interrupt a task-isolated cpu.  It can be helpful to see both
where the interrupting cpu is, and also an approximation of where the
cpu that is being interrupted is.  The nmi_backtrace framework allows us
to discover the stack of the interrupted cpu.

I've tested that the change works as desired on tile, and build-tested
x86, arm, mips, and sparc64.  For x86 I confirmed that the generic
cpuidle stuff as well as the architecture-specific routines are in the
new cpuidle section.  For arm, mips, and sparc I just build-tested it
and made sure the generic cpuidle routines were in the new cpuidle
section, but I didn't attempt to figure out which the platform-specific
idle routines might be.  That might be more usefully done by someone
with platform experience in follow-up patches.

This patch (of 4):

Currently you can only request a backtrace of either all cpus, or all
cpus but yourself.  It can also be helpful to request a remote backtrace
of a single cpu, and since we want that, the logical extension is to
support a cpumask as the underlying primitive.

This change modifies the existing lib/nmi_backtrace.c code to take a
cpumask as its basic primitive, and modifies the linux/nmi.h code to use
the new "cpumask" method instead.

The existing clients of nmi_backtrace (arm and x86) are converted to
using the new cpumask approach in this change.

The other users of the backtracing API (sparc64 and mips) are converted
to use the cpumask approach rather than the all/allbutself approach.
The mips code ignored the "include_self" boolean but with this change it
will now also dump a local backtrace if requested.

Link: http://lkml.kernel.org/r/1472487169-14923-2-git-send-email-cmetcalf@mellanox.com
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm]
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:30 -07:00
Baoyou Xie
08ea8c07fb mm: move phys_mem_access_prot_allowed() declaration to pgtable.h
We get 1 warning when building kernel with W=1:

  drivers/char/mem.c:220:12: warning: no previous prototype for 'phys_mem_access_prot_allowed' [-Wmissing-prototypes]
   int __weak phys_mem_access_prot_allowed(struct file *file,

In fact, its declaration is spreading to several header files in
different architecture, but need to be declare in common header file.

So this patch moves phys_mem_access_prot_allowed() to pgtable.h.

Link: http://lkml.kernel.org/r/1473751597-12139-1-git-send-email-baoyou.xie@linaro.org
Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Linus Torvalds
e6e3d8f8f4 Merge tag 'pci-v4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
 "Summary of PCI changes for the v4.9 merge window:

  Enumeration:
   - microblaze: Add multidomain support for procfs (Bharat Kumar Gogada)

  Resource management:
   - Ignore requested alignment for PROBE_ONLY and fixed resources (Yongji Xie)
   - Ignore requested alignment for VF BARs (Yongji Xie)

  PCI device hotplug:
   - Make core explicitly non-modular (Paul Gortmaker)

  PCIe native device hotplug:
   - Rename pcie_isr() locals for clarity (Bjorn Helgaas)
   - Return IRQ_NONE when we can't read interrupt status (Bjorn Helgaas)
   - Remove unnecessary guard (Bjorn Helgaas)
   - Clean up dmesg "Slot(%s)" messages (Bjorn Helgaas)
   - Remove useless pciehp_get_latch_status() calls (Bjorn Helgaas)
   - Clear attention LED on device add (Keith Busch)
   - Allow exclusive userspace control of indicators (Keith Busch)
   - Process all hotplug events before looking for new ones (Mayurkumar Patel)
   - Don't re-read Slot Status when queuing hotplug event (Mayurkumar Patel)
   - Don't re-read Slot Status when handling surprise event (Mayurkumar Patel)
   - Make explicitly non-modular (Paul Gortmaker)

  Power management:
   - Afford direct-complete to devices with non-standard PM (Lukas Wunner)
   - Query platform firmware for device power state (Lukas Wunner)
   - Recognize D3cold in pci_update_current_state() (Lukas Wunner)
   - Avoid unnecessary resume after direct-complete (Lukas Wunner)
   - Make explicitly non-modular (Paul Gortmaker)

  Virtualization:
   - Mark Atheros AR9580 to avoid bus reset (Maik Broemme)
   - Check for pci_setup_device() failure in pci_iov_add_virtfn() (Po Liu)

  MSI:
   - Enable PCI_MSI_IRQ_DOMAIN support for ARC (Joao Pinto)

  AER:
   - Remove aerdriver.nosourceid kernel parameter (Bjorn Helgaas)
   - Remove aerdriver.forceload kernel parameter (Bjorn Helgaas)
   - Fix aer_probe() kernel-doc comment (Cao jin)
   - Add bus flag to skip source ID matching (Jon Derrick)
   - Avoid memory allocation in interrupt handling path (Jon Derrick)
   - Cache capability position (Keith Busch)
   - Make explicitly non-modular (Paul Gortmaker)
   - Remove duplicate AER severity translation (Tyler Baicar)
   - Send correct severity to calculate AER severity (Tyler Baicar)

  Precision Time Measurement:
   - Add Precision Time Measurement (PTM) support (Jonathan Yong)
   - Add PTM clock granularity information (Bjorn Helgaas)
   - Add pci_enable_ptm() for drivers to enable PTM on endpoints (Bjorn Helgaas)

  Generic host bridge driver:
   - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
   - Make explicitly non-modular (Paul Gortmaker)

  Altera host bridge driver:
   - Remove redundant platform_get_resource() return value check (Bjorn Helgaas)
   - Poll for link training status after retraining the link (Ley Foon Tan)
   - Rework config accessors for use without a struct pci_bus (Ley Foon Tan)
   - Move retrain from fixup to altera_pcie_host_init() (Ley Foon Tan)
   - Make MSI explicitly non-modular (Paul Gortmaker)
   - Make explicitly non-modular (Paul Gortmaker)
   - Relax device number checking to allow SR-IOV (Po Liu)

  ARM Versatile host bridge driver:
   - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)

  Axis ARTPEC-6 host bridge driver:
   - Drop __init from artpec6_add_pcie_port() (Niklas Cassel)

  Freescale i.MX6 host bridge driver:
   - Make explicitly non-modular (Paul Gortmaker)

  Intel VMD host bridge driver:
   - Add quirk for AER to ignore source ID (Jon Derrick)
   - Allocate IRQ lists with correct MSI-X count (Jon Derrick)
   - Convert to use pci_alloc_irq_vectors() API (Jon Derrick)
   - Eliminate vmd_vector member from list type (Jon Derrick)
   - Eliminate index member from IRQ list (Jon Derrick)
   - Synchronize with RCU freeing MSI IRQ descs (Keith Busch)
   - Request userspace control of PCIe hotplug indicators (Keith Busch)
   - Move VMD driver to drivers/pci/host (Keith Busch)

  Marvell Aardvark host bridge driver:
   - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
   - Remove redundant dev_err call in advk_pcie_probe() (Wei Yongjun)

  Microsoft Hyper-V host bridge driver:
   - Use zero-length array in struct pci_packet (Dexuan Cui)
   - Use pci_function_description[0] in struct definitions (Dexuan Cui)
   - Remove the unused 'wrk' in struct hv_pcibus_device (Dexuan Cui)
   - Handle vmbus_sendpacket() failure in hv_compose_msi_msg() (Dexuan Cui)
   - Handle hv_pci_generic_compl() error case (Dexuan Cui)
   - Use list_move_tail() instead of list_del() + list_add_tail() (Wei Yongjun)

  NVIDIA Tegra host bridge driver:
   - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
   - Remove redundant _data suffix (Thierry Reding)
   - Use of_device_get_match_data() (Thierry Reding)

  Qualcomm host bridge driver:
   - Make explicitly non-modular (Paul Gortmaker)

  Renesas R-Car host bridge driver:
   - Consolidate register space lookup and ioremap (Bjorn Helgaas)
   - Don't disable/unprepare clocks on prepare/enable failure (Geert Uytterhoeven)
   - Add multi-MSI support (Grigory Kletsko)
   - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
   - Fix some checkpatch warnings (Sergei Shtylyov)
   - Try increasing PCIe link speed to 5 GT/s at boot (Sergei Shtylyov)

  Rockchip host bridge driver:
   - Add DT bindings for Rockchip PCIe controller (Shawn Lin)
   - Add Rockchip PCIe controller support (Shawn Lin)
   - Improve the deassert sequence of four reset pins (Shawn Lin)
   - Fix wrong transmitted FTS count (Shawn Lin)
   - Increase the Max Credit update interval (Rajat Jain)

  Samsung Exynos host bridge driver:
   - Make explicitly non-modular (Paul Gortmaker)

  ST Microelectronics SPEAr13xx host bridge driver:
   - Make explicitly non-modular (Paul Gortmaker)

  Synopsys DesignWare host bridge driver:
   - Return data directly from dw_pcie_readl_rc() (Bjorn Helgaas)
   - Exchange viewport of `MEMORYs' and `CFGs/IOs' (Dong Bo)
   - Check LTSSM training bit before deciding link is up (Jisheng Zhang)
   - Move link wait definitions to .c file (Joao Pinto)
   - Wait for iATU enable (Joao Pinto)
   - Add iATU Unroll feature (Joao Pinto)
   - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
   - Make explicitly non-modular (Paul Gortmaker)
   - Relax device number checking to allow SR-IOV (Po Liu)
   - Keep viewport fixed for IO transaction if num_viewport > 2 (Pratyush Anand)
   - Remove redundant platform_get_resource() return value check (Wei Yongjun)

  TI DRA7xx host bridge driver:
   - Make explicitly non-modular (Paul Gortmaker)

  TI Keystone host bridge driver:
   - Propagate request_irq() failure (Wei Yongjun)

  Xilinx AXI host bridge driver:
   - Keep both legacy and MSI interrupt domain references (Bharat Kumar Gogada)
   - Clear interrupt register for invalid interrupt (Bharat Kumar Gogada)
   - Clear correct MSI set bit (Bharat Kumar Gogada)
   - Dispose of MSI virtual IRQ (Bharat Kumar Gogada)
   - Make explicitly non-modular (Paul Gortmaker)
   - Relax device number checking to allow SR-IOV (Po Liu)

  Xilinx NWL host bridge driver:
   - Expand error logging (Bharat Kumar Gogada)
   - Enable all MSI interrupts using MSI mask (Bharat Kumar Gogada)
   - Make explicitly non-modular (Paul Gortmaker)

  Miscellaneous:
   - Drop CONFIG_KEXEC_CORE ifdeffery (Lukas Wunner)
   - portdrv: Make explicitly non-modular (Paul Gortmaker)
   - Make DPC explicitly non-modular (Paul Gortmaker)"

* tag 'pci-v4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (105 commits)
  x86/PCI: VMD: Move VMD driver to drivers/pci/host
  PCI: rockchip: Fix wrong transmitted FTS count
  PCI: rockchip: Improve the deassert sequence of four reset pins
  PCI: rockchip: Increase the Max Credit update interval
  PCI: rcar: Try increasing PCIe link speed to 5 GT/s at boot
  PCI/AER: Fix aer_probe() kernel-doc comment
  PCI: Ignore requested alignment for VF BARs
  PCI: Ignore requested alignment for PROBE_ONLY and fixed resources
  PCI: Avoid unnecessary resume after direct-complete
  PCI: Recognize D3cold in pci_update_current_state()
  PCI: Query platform firmware for device power state
  PCI: Afford direct-complete to devices with non-standard PM
  PCI/AER: Cache capability position
  PCI/AER: Avoid memory allocation in interrupt handling path
  x86/PCI: VMD: Request userspace control of PCIe hotplug indicators
  PCI: pciehp: Allow exclusive userspace control of indicators
  ACPI / APEI: Send correct severity to calculate AER severity
  PCI/AER: Remove duplicate AER severity translation
  x86/PCI: VMD: Synchronize with RCU freeing MSI IRQ descs
  x86/PCI: VMD: Eliminate index member from IRQ list
  ...
2016-10-07 11:46:37 -07:00
Linus Torvalds
541efb7632 Merge tag 'for-linus-4.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from David Vrabel:
 "xen features and fixes for 4.9:

   - switch to new CPU hotplug mechanism

   - support driver_override in pciback

   - require vector callback for HVM guests (the alternate mechanism via
     the platform device has been broken for ages)"

* tag 'for-linus-4.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/x86: Update topology map for PV VCPUs
  xen/x86: Initialize per_cpu(xen_vcpu, 0) a little earlier
  xen/pciback: support driver_override
  xen/pciback: avoid multiple entries in slot list
  xen/pciback: simplify pcistub device handling
  xen: Remove event channel notification through Xen PCI platform device
  xen/events: Convert to hotplug state machine
  xen/x86: Convert to hotplug state machine
  x86/xen: add missing \n at end of printk warning message
  xen/grant-table: Use kmalloc_array() in arch_gnttab_valloc()
  xen: Make VPMU init message look less scary
  xen: rename xen_pmu_init() in sys-hypervisor.c
  hotplug: Prevent alloc/free of irq descriptors during cpu up/down (again)
  xen/x86: Move irq allocation from Xen smp_op.cpu_up()
2016-10-06 11:19:10 -07:00