Commit Graph

664652 Commits

Author SHA1 Message Date
Jan Kara
9b3be66cb4 f2fs: Make flush bios explicitely sync
commit 3adc5fcb7e upstream.

Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions.

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

Fixes: b685d3d65a
CC: Jaegeuk Kim <jaegeuk@kernel.org>
CC: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:49 +02:00
Jaegeuk Kim
bd3dfe5049 f2fs: check entire encrypted bigname when finding a dentry
commit 6332cd32c8 upstream.

If user has no key under an encrypted dir, fscrypt gives digested dentries.
Previously, when looking up a dentry, f2fs only checks its hash value with
first 4 bytes of the digested dentry, which didn't handle hash collisions fully.
This patch enhances to check entire dentry bytes likewise ext4.

Eric reported how to reproduce this issue by:

 # seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
 # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
100000
 # sync
 # echo 3 > /proc/sys/vm/drop_caches
 # keyctl new_session
 # find edir -type f | xargs stat -c %i | sort | uniq | wc -l
99999

Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(fixed f2fs_dentry_hash() to work even when the hash is 0)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:49 +02:00
Sheng Yong
3a625468bd f2fs: fix multiple f2fs_add_link() having same name for inline dentry
commit d3bb910c15 upstream.

Commit 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having
same name) does not cover the scenario where inline dentry is enabled.
In that case, F2FS_I(dir)->task will be NULL, and __f2fs_add_link will
lookup dentries one more time.

This patch fixes it by moving the assigment of current task to a upper
level to cover both normal and inline dentry.

Fixes: 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having same name)
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:49 +02:00
Jaegeuk Kim
e71f099677 f2fs: fix fs corruption due to zero inode page
commit 9bb02c3627 upstream.

This patch fixes the following scenario.

- f2fs_create/f2fs_mkdir             - write_checkpoint
 - f2fs_mark_inode_dirty_sync         - block_operations
                                       - f2fs_lock_all
                                       - f2fs_sync_inode_meta
                                        - f2fs_unlock_all
                                        - sync_inode_metadata
 - f2fs_lock_op
                                         - f2fs_write_inode
                                          - update_inode_page
                                           - get_node_page
                                             return -ENOENT
 - new_inode_page
  - fill_node_footer
 - f2fs_mark_inode_dirty_sync
 - ...
 - f2fs_unlock_op
                                          - f2fs_inode_synced
                                       - f2fs_lock_all
                                       - do_checkpoint

In this checkpoint, we can get an inode page which contains zeros having valid
node footer only.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:49 +02:00
Jaegeuk Kim
ed4d26a1e4 Revert "f2fs: put allocate_segment after refresh_sit_entry"
commit c6f82fe90d upstream.

This reverts commit 3436c4bdb3.

This makes a leak to register dirty segments. I reproduced the issue by
modified postmark which injects a lot of file create/delete/update and
finally triggers huge number of SSR allocations.

[Jaegeuk Kim: Change missing incorrect comment]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:48 +02:00
Jaegeuk Kim
33cbcc2556 f2fs: fix wrong max cost initialization
commit c541a51b8c upstream.

This patch fixes missing increased max cost caused by a patch that we increased
cose of data segments in greedy algorithm.

Fixes: b9cd20619 "f2fs: node segment is prior to data segment selected victim"
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:48 +02:00
Ross Zwisler
03606c8ecc dax: fix PMD data corruption when fault races with write
commit 876f29460c upstream.

This is based on a patch from Jan Kara that fixed the equivalent race in
the DAX PTE fault path.

Currently DAX PMD read fault can race with write(2) in the following
way:

CPU1 - write(2)                 CPU2 - read fault
                                dax_iomap_pmd_fault()
                                  ->iomap_begin() - sees hole

dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate

                                  grab_mapping_entry()
				  - we add huge zero page to the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault.  That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6 ("dax: Call ->iomap_begin without entry lock during dax fault")
Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:48 +02:00
Jan Kara
5a3651b4a9 ext4: return to starting transaction in ext4_dax_huge_fault()
commit fb26a1cbed upstream.

DAX will return to locking exceptional entry before mapping blocks for a
page fault to fix possible races with concurrent writes.  To avoid lock
inversion between exceptional entry lock and transaction start, start
the transaction already in ext4_dax_huge_fault().

Fixes: 9f141d6ef6
Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:48 +02:00
Jan Kara
85f353dbd0 mm: fix data corruption due to stale mmap reads
commit cd656375f9 upstream.

Currently, we didn't invalidate page tables during invalidate_inode_pages2()
for DAX.  That could result in e.g. 2MiB zero page being mapped into
page tables while there were already underlying blocks allocated and
thus data seen through mmap were different from data seen by read(2).
The following sequence reproduces the problem:

 - open an mmap over a 2MiB hole

 - read from a 2MiB hole, faulting in a 2MiB zero page

 - write to the hole with write(3p). The write succeeds but we
   incorrectly leave the 2MiB zero page mapping intact.

 - via the mmap, read the data that was just written. Since the zero
   page mapping is still intact we read back zeroes instead of the new
   data.

Fix the problem by unconditionally calling invalidate_inode_pages2_range()
in dax_iomap_actor() for new block allocations and by properly
invalidating page tables in invalidate_inode_pages2_range() for DAX
mappings.

Fixes: c6dcf52c23
Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:48 +02:00
Ross Zwisler
06305f51ef dax: prevent invalidation of mapped DAX entries
commit 4636e70bb0 upstream.

Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
v4.

This series fixes data corruption that can happen for DAX mounts when
page faults race with write(2) and as a result page tables get out of
sync with block mappings in the filesystem and thus data seen through
mmap is different from data seen through read(2).

The series passes testing with t_mmap_stale test program from Ross and
also other mmap related tests on DAX filesystem.

This patch (of 4):

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked.  This is done via:

  invalidate_mapping_pages()
    invalidate_exceptional_entry()
      dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages()
there is an additional criteria which is that the page must not be
mapped.  This is noted in the comments above invalidate_mapping_pages()
and is checked in invalidate_inode_page().

For DAX entries this means that we can can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry.  This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to
its comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().

Fixes: c6dcf52c23 ("mm: Invalidate DAX radix tree entries only if appropriate")
Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:48 +02:00
Dan Williams
55dbfd7dcd device-dax: fix sysfs attribute deadlock
commit 565851c972 upstream.

Usage of device_lock() for dax_region attributes is unnecessary and
deadlock prone. It's unnecessary because the order of registration /
un-registration guarantees that drvdata is always valid. It's deadlock
prone because it sets up this situation:

 ndctl           D    0  2170   2082 0x00000000
 Call Trace:
  __schedule+0x31f/0x980
  schedule+0x3d/0x90
  schedule_preempt_disabled+0x15/0x20
  __mutex_lock+0x402/0x980
  ? __mutex_lock+0x158/0x980
  ? align_show+0x2b/0x80 [dax]
  ? kernfs_seq_start+0x2f/0x90
  mutex_lock_nested+0x1b/0x20
  align_show+0x2b/0x80 [dax]
  dev_attr_show+0x20/0x50

 ndctl           D    0  2186   2079 0x00000000
 Call Trace:
  __schedule+0x31f/0x980
  schedule+0x3d/0x90
  __kernfs_remove+0x1f6/0x340
  ? kernfs_remove_by_name_ns+0x45/0xa0
  ? remove_wait_queue+0x70/0x70
  kernfs_remove_by_name_ns+0x45/0xa0
  remove_files.isra.1+0x35/0x70
  sysfs_remove_group+0x44/0x90
  sysfs_remove_groups+0x2e/0x50
  dax_region_unregister+0x25/0x40 [dax]
  devm_action_release+0xf/0x20
  release_nodes+0x16d/0x2b0
  devres_release_all+0x3c/0x60
  device_release_driver_internal+0x17d/0x220
  device_release_driver+0x12/0x20
  unbind_store+0x112/0x160

ndctl/2170 is trying to acquire the device_lock() to read an attribute,
and ndctl/2186 is holding the device_lock() while trying to drain all
active attribute readers.

Thanks to Yi Zhang for the reproduction script.

Fixes: d7fe1a67f6 ("dax: add region 'id', 'size', and 'align' attributes")
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:48 +02:00
Dan Williams
8931477790 device-dax: fix cdev leak
commit ed01e50acd upstream.

If device_add() fails, cleanup the cdev. Otherwise, we leak a kobj_map()
with a stale device number.

As Jason points out, there is a small possibility that userspace has
opened and mapped the device in the time between cdev_add() and the
device_add() failure. We need a new kill_dax_dev() helper to invalidate
any established mappings.

Fixes: ba09c01d2f ("dax: convert to the cdev api")
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:48 +02:00
NeilBrown
9bcb9cc96d md/raid1: avoid reusing a resync bio after error handling.
commit 0c9d5b127f upstream.

fix_sync_read_error() modifies a bio on a newly faulty
device by setting bi_end_io to end_sync_write.
This ensure that put_buf() will still call rdev_dec_pending()
as required, but makes sure that subsequent code in
fix_sync_read_error() doesn't try to read from the device.

Unfortunately this interacts badly with sync_request_write()
which assumes that any bio with bi_end_io set to non-NULL
other than end_sync_read is safe to write to.

As the device is now faulty it doesn't make sense to write.
As the bio was recently used for a read, it is "dirty"
and not suitable for immediate submission.
In particular, ->bi_next might be non-NULL, which will cause
generic_make_request() to complain.

Break this interaction by refusing to write to devices
which are marked as Faulty.

Reported-and-tested-by: Michael Wang <yun.wang@profitbricks.com>
Fixes: 2e52d449bc ("md/raid1: add failfast handling for reads.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:48 +02:00
Jason A. Donenfeld
7d70808547 padata: free correct variable
commit 07a77929ba upstream.

The author meant to free the variable that was just allocated, instead
of the one that failed to be allocated, but made a simple typo. This
patch rectifies that.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:47 +02:00
Amir Goldstein
f33f17b28d ovl: do not set overlay.opaque on non-dir create
commit 4a99f3c83d upstream.

The optimization for opaque dir create was wrongly being applied
also to non-dir create.

Fixes: 97c684cc91 ("ovl: create directories inside merged parent opaque")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:47 +02:00
Björn Jacke
6a1baa3a16 CIFS: add misssing SFM mapping for doublequote
commit 85435d7a15 upstream.

SFM is mapping doublequote to 0xF020

Without this patch creating files with doublequote fails to Windows/Mac

Signed-off-by: Bjoern Jacke <bjacke@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:47 +02:00
David Disseldorp
c4f35cc8dc cifs: fix CIFS_IOC_GET_MNT_INFO oops
commit d8a6e505d6 upstream.

An open directory may have a NULL private_data pointer prior to readdir.

Fixes: 0de1f4c6f6 ("Add way to query server fs info for smb3")
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:47 +02:00
Rabin Vincent
33fa011644 CIFS: fix oplock break deadlocks
commit 3998e6b87d upstream.

When the final cifsFileInfo_put() is called from cifsiod and an oplock
break work is queued, lockdep complains loudly:

 =============================================
 [ INFO: possible recursive locking detected ]
 4.11.0+ #21 Not tainted
 ---------------------------------------------
 kworker/0:2/78 is trying to acquire lock:
  ("cifsiod"){++++.+}, at: flush_work+0x215/0x350

 but task is already holding lock:
  ("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock("cifsiod");
   lock("cifsiod");

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 2 locks held by kworker/0:2/78:
  #0:  ("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0
  #1:  ((&wdata->work)){+.+...}, at: process_one_work+0x255/0x8e0

 stack backtrace:
 CPU: 0 PID: 78 Comm: kworker/0:2 Not tainted 4.11.0+ #21
 Workqueue: cifsiod cifs_writev_complete
 Call Trace:
  dump_stack+0x85/0xc2
  __lock_acquire+0x17dd/0x2260
  ? match_held_lock+0x20/0x2b0
  ? trace_hardirqs_off_caller+0x86/0x130
  ? mark_lock+0xa6/0x920
  lock_acquire+0xcc/0x260
  ? lock_acquire+0xcc/0x260
  ? flush_work+0x215/0x350
  flush_work+0x236/0x350
  ? flush_work+0x215/0x350
  ? destroy_worker+0x170/0x170
  __cancel_work_timer+0x17d/0x210
  ? ___preempt_schedule+0x16/0x18
  cancel_work_sync+0x10/0x20
  cifsFileInfo_put+0x338/0x7f0
  cifs_writedata_release+0x2a/0x40
  ? cifs_writedata_release+0x2a/0x40
  cifs_writev_complete+0x29d/0x850
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40

This is a real warning.  Since the oplock is queued on the same
workqueue this can deadlock if there is only one worker thread active
for the workqueue (which will be the case during memory pressure when
the rescuer thread is handling it).

Furthermore, there is at least one other kind of hang possible due to
the oplock break handling if there is only worker.  (This can be
reproduced without introducing memory pressure by having passing 1 for
the max_active parameter of cifsiod.) cifs_oplock_break() can wait
indefintely in the filemap_fdatawait() while the cifs_writev_complete()
work is blocked:

 sysrq: SysRq : Show Blocked State
   task                        PC stack   pid father
 kworker/0:1     D    0    16      2 0x00000000
 Workqueue: cifsiod cifs_oplock_break
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x4a/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  cifs_oplock_break+0x651/0x710
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40
 dd              D    0   683    171 0x00000000
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x29/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  filemap_write_and_wait+0x4e/0x70
  cifs_flush+0x6a/0xb0
  filp_close+0x52/0xa0
  __close_fd+0xdc/0x150
  SyS_close+0x33/0x60
  entry_SYSCALL_64_fastpath+0x1f/0xbe

 Showing all locks held in the system:
 2 locks held by kworker/0:1/16:
  #0:  ("cifsiod"){.+.+.+}, at: process_one_work+0x255/0x8e0
  #1:  ((&cfile->oplock_break)){+.+.+.}, at: process_one_work+0x255/0x8e0

 Showing busy workqueues and worker pools:
 workqueue cifsiod: flags=0xc
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
     in-flight: 16:cifs_oplock_break
     delayed: cifs_writev_complete, cifs_echo_request
 pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 750 3

Fix these problems by creating a a new workqueue (with a rescuer) for
the oplock break work.

Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:47 +02:00
David Disseldorp
076c404c12 cifs: fix CIFS_ENUMERATE_SNAPSHOTS oops
commit 6026685de3 upstream.

As with 618763958b, an open directory may have a NULL private_data
pointer prior to readdir. CIFS_ENUMERATE_SNAPSHOTS must check for this
before dereference.

Fixes: 834170c859 ("Enable previous version support")
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:47 +02:00
David Disseldorp
63d86cd970 cifs: fix leak in FSCTL_ENUM_SNAPS response handling
commit 0e5c795592 upstream.

The server may respond with success, and an output buffer less than
sizeof(struct smb_snapshot_array) in length. Do not leak the output
buffer in this case.

Fixes: 834170c859 ("Enable previous version support")
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:47 +02:00
Björn Jacke
160239dfbf CIFS: fix mapping of SFM_SPACE and SFM_PERIOD
commit b704e70b7c upstream.

- trailing space maps to 0xF028
- trailing period maps to 0xF029

This fix corrects the mapping of file names which have a trailing character
that would otherwise be illegal (period or space) but is allowed by POSIX.

Signed-off-by: Bjoern Jacke <bjacke@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:47 +02:00
Steve French
6f95f1925a SMB3: Work around mount failure when using SMB3 dialect to Macs
commit 7db0a6efdc upstream.

Macs send the maximum buffer size in response on ioctl to validate
negotiate security information, which causes us to fail the mount
as the response buffer is larger than the expected response.

Changed ioctl response processing to allow for padding of validate
negotiate ioctl response and limit the maximum response size to
maximum buffer size.

Signed-off-by: Steve French <steve.french@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:47 +02:00
Steve French
49f70d15b5 Set unicode flag on cifs echo request to avoid Mac error
commit 26c9cb668c upstream.

Mac requires the unicode flag to be set for cifs, even for the smb
echo request (which doesn't have strings).

Without this Mac rejects the periodic echo requests (when mounting
with cifs) that we use to check if server is down

Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:47 +02:00
Sachin Prabhu
fdf150ee55 Do not return number of bytes written for ioctl CIFS_IOC_COPYCHUNK_FILE
commit 7d0c234fd2 upstream.

commit 620d8745b3 ("Introduce cifs_copy_file_range()") changes the
behaviour of the cifs ioctl call CIFS_IOC_COPYCHUNK_FILE. In case of
successful writes, it now returns the number of bytes written. This
return value is treated as an error by the xfstest cifs/001. Depending
on the errno set at that time, this may or may not result in the test
failing.

The patch fixes this by setting the return value to 0 in case of
successful writes.

Fixes: commit 620d8745b3 ("Introduce cifs_copy_file_range()")
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:46 +02:00
Sachin Prabhu
ecf9bb0fd7 Fix match_prepath()
commit cd8c42968e upstream.

Incorrect return value for shares not using the prefix path means that
we will never match superblocks for these shares.

Fixes: commit c1d8b24d18 ("Compare prepaths when comparing superblocks")
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:46 +02:00
Vlastimil Babka
d322fd67ca mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
commit 62be1511b1 upstream.

Patch series "more robust PF_MEMALLOC handling"

This series aims to unify the setting and clearing of PF_MEMALLOC, which
prevents recursive reclaim.  There are some places that clear the flag
unconditionally from current->flags, which may result in clearing a
pre-existing flag.  This already resulted in a bug report that Patch 1
fixes (without the new helpers, to make backporting easier).  Patch 2
introduces the new helpers, modelled after existing memalloc_noio_* and
memalloc_nofs_* helpers, and converts mm core to use them.  Patches 3
and 4 convert non-mm code.

This patch (of 4):

__alloc_pages_direct_compact() sets PF_MEMALLOC to prevent deadlock
during page migration by lock_page() (see the comment in
__unmap_and_move()).  Then it unconditionally clears the flag, which can
clear a pre-existing PF_MEMALLOC flag and result in recursive reclaim.
This was not a problem until commit a8161d1ed6 ("mm, page_alloc:
restructure direct compaction handling in slowpath"), because direct
compation was called only after direct reclaim, which was skipped when
PF_MEMALLOC flag was set.

Even now it's only a theoretical issue, as the new callsite of
__alloc_pages_direct_compact() is reached only for costly orders and
when gfp_pfmemalloc_allowed() is true, which means either
__GFP_NOMEMALLOC is in gfp_flags or in_interrupt() is true.  There is no
such known context, but let's play it safe and make
__alloc_pages_direct_compact() robust for cases where PF_MEMALLOC is
already set.

Fixes: a8161d1ed6 ("mm, page_alloc: restructure direct compaction handling in slowpath")
Link: http://lkml.kernel.org/r/20170405074700.29871-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Chris Leech <cleech@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Lee Duncan <lduncan@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:46 +02:00
Johannes Weiner
41a68ab851 mm: vmscan: fix IO/refault regression in cache workingset transition
commit 2a2e48854d upstream.

Since commit 59dc76b0d4 ("mm: vmscan: reduce size of inactive file
list") we noticed bigger IO spikes during changes in cache access
patterns.

The patch in question shrunk the inactive list size to leave more room
for the current workingset in the presence of streaming IO.  However,
workingset transitions that previously happened on the inactive list are
now pushed out of memory and incur more refaults to complete.

This patch disables active list protection when refaults are being
observed.  This accelerates workingset transitions, and allows more of
the new set to establish itself from memory, without eating into the
ability to protect the established workingset during stable periods.

The workloads that were measurably affected for us were hit pretty bad
by it, with refault/majfault rates doubling and tripling during cache
transitions, and the machines sustaining half-hour periods of 100% IO
utilization, where they'd previously have sub-minute peaks at 60-90%.

Stateful services that handle user data tend to be more conservative
with kernel upgrades.  As a result we hit most page cache issues with
some delay, as was the case here.

The severity seemed to warrant a stable tag.

Fixes: 59dc76b0d4 ("mm: vmscan: reduce size of inactive file list")
Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:46 +02:00
Andrey Ryabinin
f7bccf3f16 fs/block_dev: always invalidate cleancache in invalidate_bdev()
commit a5f6a6a9c7 upstream.

invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
which doen't make any sense.

Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
regardless of mapping->nrpages value.

Fixes: c515e1fd36 ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:46 +02:00
Andrey Ryabinin
601462864d fs: fix data invalidation in the cleancache during direct IO
commit 55635ba76e upstream.

Patch series "Properly invalidate data in the cleancache", v2.

We've noticed that after direct IO write, buffered read sometimes gets
stale data which is coming from the cleancache.  The reason for this is
that some direct write hooks call call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero, so we may not invalidate
data in the cleancache.

Another odd thing is that we check only for ->nrpages and don't check
for ->nrexceptional, but invalidate_inode_pages2[_range] also
invalidates exceptional entries as well.  So we invalidate exceptional
entries only if ->nrpages != 0? This doesn't feel right.

 - Patch 1 fixes direct IO writes by removing ->nrpages check.
 - Patch 2 fixes similar case in invalidate_bdev().
     Note: I only fixed conditional cleancache_invalidate_inode() here.
       Do we also need to add ->nrexceptional check in into invalidate_bdev()?

 - Patches 3-4: some optimizations.

This patch (of 4):

Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero.  This can't be right,
because invalidate_inode_pages2[_range]() also invalidate data in the
cleancache via cleancache_invalidate_inode() call.  So if page cache is
empty but there is some data in the cleancache, buffered read after
direct IO write would get stale data from the cleancache.

Also it doesn't feel right to check only for ->nrpages because
invalidate_inode_pages2[_range] invalidates exceptional entries as well.

Fix this by calling invalidate_inode_pages2[_range]() regardless of
nrpages state.

Note: nfs,cifs,9p doesn't need similar fix because the never call
cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so
they are not affected by this bug.

Fixes: c515e1fd36 ("mm/fs: add hooks to support cleancache")
Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:46 +02:00
Luis Henriques
a4c34c20fb ceph: fix memory leak in __ceph_setxattr()
commit eeca958dce upstream.

The ceph_inode_xattr needs to be released when removing an xattr.  Easily
reproducible running the 'generic/020' test from xfstests or simply by
doing:

  attr -s attr0 -V 0 /mnt/test && attr -r attr0 /mnt/test

While there, also fix the error path.

Here's the kmemleak splat:

unreferenced object 0xffff88001f86fbc0 (size 64):
  comm "attr", pid 244, jiffies 4294904246 (age 98.464s)
  hex dump (first 32 bytes):
    40 fa 86 1f 00 88 ff ff 80 32 38 1f 00 88 ff ff  @........28.....
    00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
  backtrace:
    [<ffffffff81560199>] kmemleak_alloc+0x49/0xa0
    [<ffffffff810f3e5b>] kmem_cache_alloc+0x9b/0xf0
    [<ffffffff812b157e>] __ceph_setxattr+0x17e/0x820
    [<ffffffff812b1c57>] ceph_set_xattr_handler+0x37/0x40
    [<ffffffff8111fb4b>] __vfs_removexattr+0x4b/0x60
    [<ffffffff8111fd37>] vfs_removexattr+0x77/0xd0
    [<ffffffff8111fdd1>] removexattr+0x41/0x60
    [<ffffffff8111fe65>] path_removexattr+0x75/0xa0
    [<ffffffff81120aeb>] SyS_lremovexattr+0xb/0x10
    [<ffffffff81564b20>] entry_SYSCALL_64_fastpath+0x13/0x94
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:46 +02:00
Michal Hocko
2cfc744069 fs/xattr.c: zero out memory copied to userspace in getxattr
commit 81be3dee96 upstream.

getxattr uses vmalloc to allocate memory if kzalloc fails.  This is
filled by vfs_getxattr and then copied to the userspace.  vmalloc,
however, doesn't zero out the memory so if the specific implementation
of the xattr handler is sloppy we can theoretically expose a kernel
memory.  There is no real sign this is really the case but let's make
sure this will not happen and use vzalloc instead.

Fixes: 779302e678 ("fs/xattr.c:getxattr(): improve handling of allocation failures")
Link: http://lkml.kernel.org/r/20170306103327.2766-1-mhocko@kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:46 +02:00
Martin Brandenburg
8bd8322308 orangefs: do not check possibly stale size on truncate
commit 53950ef541 upstream.

Let the server figure this out because our size might be out of date or
not present.

The bug was that

	xfs_io -f -t -c "pread -v 0 100" /mnt/foo
	echo "Test" > /mnt/foo
	xfs_io -f -t -c "pread -v 0 100" /mnt/foo

fails because the second truncate did not happen if nothing had
requested the size after the write in echo.  Thus i_size was zero (not
present) and the orangefs_setattr though i_size was zero and there was
nothing to do.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:46 +02:00
Martin Brandenburg
66f1885bea orangefs: do not set getattr_time on orangefs_lookup
commit 17930b252c upstream.

Since orangefs_lookup calls orangefs_iget which calls
orangefs_inode_getattr, getattr_time will get set.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:45 +02:00
Martin Brandenburg
a6d3c33255 orangefs: clean up oversize xattr validation
commit e675c5ec51 upstream.

Also don't check flags as this has been validated by the VFS already.

Fix an off-by-one error in the max size checking.

Stop logging just because userspace wants to write attributes which do
not fit.

This and the previous commit fix xfstests generic/020.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:45 +02:00
Martin Brandenburg
7adb15036a orangefs: fix bounds check for listxattr
commit a956af337b upstream.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:45 +02:00
Eric Biggers
1567c17063 ext4: evict inline data when writing to memory map
commit 7b4cc9787f upstream.

Currently the case of writing via mmap to a file with inline data is not
handled.  This is maybe a rare case since it requires a writable memory
map of a very small file, but it is trivial to trigger with on
inline_data filesystem, and it causes the
'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
ext4_writepages() to be hit:

    mkfs.ext4 -O inline_data /dev/vdb
    mount /dev/vdb /mnt
    xfs_io -f /mnt/file \
	-c 'pwrite 0 1' \
	-c 'mmap -w 0 1m' \
	-c 'mwrite 0 1' \
	-c 'fsync'

	kernel BUG at fs/ext4/inode.c:2723!
	invalid opcode: 0000 [#1] SMP
	CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
	task: ffff88003d3a8040 task.stack: ffffc90000300000
	RIP: 0010:ext4_writepages+0xc89/0xf8a
	RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
	RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
	RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
	RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
	R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
	R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
	FS:  00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
	Call Trace:
	 ? _raw_spin_unlock+0x27/0x2a
	 ? kvm_clock_read+0x1e/0x20
	 do_writepages+0x23/0x2c
	 ? do_writepages+0x23/0x2c
	 __filemap_fdatawrite_range+0x80/0x87
	 filemap_write_and_wait_range+0x67/0x8c
	 ext4_sync_file+0x20e/0x472
	 vfs_fsync_range+0x8e/0x9f
	 ? syscall_trace_enter+0x25b/0x2d0
	 vfs_fsync+0x1c/0x1e
	 do_fsync+0x31/0x4a
	 SyS_fsync+0x10/0x14
	 do_syscall_64+0x69/0x131
	 entry_SYSCALL64_slow_path+0x25/0x25

We could try to be smart and keep the inline data in this case, or at
least support delayed allocation when allocating the block, but these
solutions would be more complicated and don't seem worthwhile given how
rare this case seems to be.  So just fix the bug by calling
ext4_convert_inline_data() when we're asked to make a page writable, so
that any inline data gets evicted, with the block allocated immediately.

Reported-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:45 +02:00
Jan Kara
7621cb7966 jbd2: fix dbench4 performance regression for 'nobarrier' mounts
commit 5052b069ac upstream.

Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. Since
JBD2 strips REQ_FUA and REQ_FLUSH flags from submitted IO when the
filesystem is mounted with nobarrier mount option, journal superblock
writes ended up being async writes after this patch and that caused
heavy performance regression for dbench4 benchmark with high number of
processes. In my test setup with HP RAID array with non-volatile write
cache and 32 GB ram, dbench4 runs with 8 processes regressed by ~25%.

Fix the problem by making sure journal superblock writes are always
treated as synchronous since they generally block progress of the
journalling machinery and thus the whole filesystem.

Fixes: b685d3d65a
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:45 +02:00
Christian Borntraeger
7a17cbb4b8 perf annotate s390: Implement jump types for perf annotate
commit d9f8dfa9ba upstream.

Implement simple detection for all kind of jumps and branches.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Andreas Krebbel <krebbel@linux.vnet.ibm.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-s390 <linux-s390@vger.kernel.org>
Link: http://lkml.kernel.org/r/1491465112-45819-3-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:45 +02:00
Christian Borntraeger
b09eace76e perf annotate s390: Fix perf annotate error -95 (4.10 regression)
commit e77852b32d upstream.

since 4.10 perf annotate exits on s390 with an "unknown error -95".
Turns out that commit 786c1b5184 ("perf annotate: Start supporting
cross arch annotation") added a hard requirement for architecture
support when objdump is used but only provided x86 and arm support.
Meanwhile power was added so lets add s390 as well.

While at it make sure to implement the branch and jump types.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Andreas Krebbel <krebbel@linux.vnet.ibm.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-s390 <linux-s390@vger.kernel.org>
Fixes: 786c1b5184 "perf annotate: Start supporting cross arch annotation"
Link: http://lkml.kernel.org/r/1491465112-45819-2-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:45 +02:00
Adrian Hunter
08b0b5bd54 perf auxtrace: Fix no_size logic in addr_filter__resolve_kernel_syms()
commit c3a0bbc7ad upstream.

Address filtering with kernel symbols incorrectly resulted in the error
"Cannot determine size of symbol" because the no_size logic was the wrong
way around.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1490357752-27942-1-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:45 +02:00
Mike Marciniszyn
2d58452241 IB/hfi1: Prevent kernel QP post send hard lockups
commit b6eac931b9 upstream.

The driver progress routines can call cond_resched() when
a timeslice is exhausted and irqs are enabled.

If the ULP had been holding a spin lock without disabling irqs and
the post send directly called the progress routine, the cond_resched()
could yield allowing another thread from the same ULP to deadlock
on that same lock.

Correct by replacing the current hfi1_do_send() calldown with a unique
one for post send and adding an argument to hfi1_do_send() to indicate
that the send engine is running in a thread.   If the routine is not
running in a thread, avoid calling cond_resched().

Fixes: Commit 831464ce4b ("IB/hfi1: Don't call cond_resched in atomic mode when sending packets")
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:45 +02:00
Jack Morgenstein
7a48c89a3c IB/mlx4: Reduce SRIOV multicast cleanup warning message to debug level
commit fb7a91746a upstream.

A warning message during SRIOV multicast cleanup should have actually been
a debug level message. The condition generating the warning does no harm
and can fill the message log.

In some cases, during testing, some tests were so intense as to swamp the
message log with these warning messages, causing a stall in the console
message log output task. This stall caused an NMI to be sent to all CPUs
(so that they all dumped their stacks into the message log).
Aside from the message flood causing an NMI, the tests all passed.

Once the message flood which caused the NMI is removed (by reducing the
warning message to debug level), the NMI no longer occurs.

Sample message log (console log) output illustrating the flood and
resultant NMI (snippets with comments and modified with ... instead
of hex digits, to satisfy checkpatch.pl):

 <mlx4_ib> _mlx4_ib_mcg_port_cleanup: ... WARNING: group refcount 1!!!...
 *** About 4000 almost identical lines in less than one second ***
 <mlx4_ib> _mlx4_ib_mcg_port_cleanup: ... WARNING: group refcount 1!!!...
 INFO: rcu_sched detected stalls on CPUs/tasks: { 17} (...)
 *** { 17} above indicates that CPU 17 was the one that stalled ***
 sending NMI to all CPUs:
 ...
 NMI backtrace for cpu 17
 CPU: 17 PID: 45909 Comm: kworker/17:2
 Hardware name: HP ProLiant DL360p Gen8, BIOS P71 09/08/2013
 Workqueue: events fb_flashcursor
 task: ffff880478...... ti: ffff88064e...... task.ti: ffff88064e......
 RIP: 0010:[ffffffff81......]  [ffffffff81......] io_serial_in+0x15/0x20
 RSP: 0018:ffff88064e257cb0  EFLAGS: 00000002
 RAX: 0000000000...... RBX: ffffffff81...... RCX: 0000000000......
 RDX: 0000000000...... RSI: 0000000000...... RDI: ffffffff81......
 RBP: ffff88064e...... R08: ffffffff81...... R09: 0000000000......
 R10: 0000000000...... R11: ffff88064e...... R12: 0000000000......
 R13: 0000000000...... R14: ffffffff81...... R15: 0000000000......
 FS:  0000000000......(0000) GS:ffff8804af......(0000) knlGS:000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080......
 CR2: 00007f2a2f...... CR3: 0000000001...... CR4: 0000000000......
 DR0: 0000000000...... DR1: 0000000000...... DR2: 0000000000......
 DR3: 0000000000...... DR6: 00000000ff...... DR7: 0000000000......
 Stack:
 ffff88064e...... ffffffff81...... ffffffff81...... 0000000000......
 ffffffff81...... ffff88064e...... ffffffff81...... ffffffff81......
 ffffffff81...... ffff88064e...... ffffffff81...... 0000000000......
 Call Trace:
[<ffffffff813d099b>] wait_for_xmitr+0x3b/0xa0
[<ffffffff813d0b5c>] serial8250_console_putchar+0x1c/0x30
[<ffffffff813d0b40>] ? serial8250_console_write+0x140/0x140
[<ffffffff813cb5fa>] uart_console_write+0x3a/0x80
[<ffffffff813d0aae>] serial8250_console_write+0xae/0x140
[<ffffffff8107c4d1>] call_console_drivers.constprop.15+0x91/0xf0
[<ffffffff8107d6cf>] console_unlock+0x3bf/0x400
[<ffffffff813503cd>] fb_flashcursor+0x5d/0x140
[<ffffffff81355c30>] ? bit_clear+0x120/0x120
[<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[<ffffffff810a5aef>] kthread+0xcf/0xe0
[<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[<ffffffff81645858>] ret_from_fork+0x58/0x90
[<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Code: 48 89 e5 d3 e6 48 63 f6 48 03 77 10 8b 06 5d c3 66 0f 1f 44 00 00 66 66 66 6

As indicated in the stack trace above, the console output task got swamped.

Fixes: b9c5d6a643 ("IB/mlx4: Add multicast group (MCG) paravirtualization for SR-IOV")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:44 +02:00
Jack Morgenstein
295a7aab35 IB/mlx4: Fix ib device initialization error flow
commit 99e68909d5 upstream.

In mlx4_ib_add, procedure mlx4_ib_alloc_eqs is called to allocate EQs.

However, in the mlx4_ib_add error flow, procedure mlx4_ib_free_eqs is not
called to free the allocated EQs.

Fixes: e605b743f3 ("IB/mlx4: Increase the number of vectors (EQs) available for ULPs")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:44 +02:00
Shamir Rabinovitch
f98af463f8 IB/IPoIB: ibX: failed to create mcg debug file
commit 771a525840 upstream.

When udev renames the netdev devices, ipoib debugfs entries does not
get renamed. As a result, if subsequent probe of ipoib device reuse the
name then creating a debugfs entry for the new device would fail.

Also, moved ipoib_create_debug_files and ipoib_delete_debug_files as part
of ipoib event handling in order to avoid any race condition between these.

Fixes: 1732b0ef3b ([IPoIB] add path record information in debugfs)
Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:44 +02:00
Michael J. Ruhl
88ded1f18b IB/core: For multicast functions, verify that LIDs are multicast LIDs
commit 8561eae60f upstream.

The Infiniband spec defines "A multicast address is defined by a
MGID and a MLID" (section 10.5).  Currently the MLID value is not
validated.

Add check to verify that the MLID value is in the correct address
range.

Fixes: 0c33aeedb2 ("[IB] Add checks to multicast attach and detach")
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:44 +02:00
Parav Pandit
82f0feccee IB/core: Fix kernel crash during fail to initialize device
commit 4be3a4fa51 upstream.

This patch fixes the kernel crash that occurs during ib_dealloc_device()
called due to provider driver fails with an error after
ib_alloc_device() and before it can register using ib_register_device().

This crashed seen in tha lab as below which can occur with any IB device
which fails to perform its device initialization before invoking
ib_register_device().

This patch avoids touching cache and port immutable structures if device
is not yet initialized.
It also releases related memory when cache and port immutable data
structure initialization fails during register_device() state.

[81416.561946] BUG: unable to handle kernel NULL pointer dereference at (null)
[81416.570340] IP: ib_cache_release_one+0x29/0x80 [ib_core]
[81416.576222] PGD 78da66067
[81416.576223] PUD 7f2d7c067
[81416.579484] PMD 0
[81416.582720]
[81416.587242] Oops: 0000 [#1] SMP
[81416.722395] task: ffff8807887515c0 task.stack: ffffc900062c0000
[81416.729148] RIP: 0010:ib_cache_release_one+0x29/0x80 [ib_core]
[81416.735793] RSP: 0018:ffffc900062c3a90 EFLAGS: 00010202
[81416.741823] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[81416.749785] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffff880859fec000
[81416.757757] RBP: ffffc900062c3aa0 R08: ffff8808536e5ac0 R09: ffff880859fec5b0
[81416.765708] R10: 00000000536e5c01 R11: ffff8808536e5ac0 R12: ffff880859fec000
[81416.773672] R13: 0000000000000000 R14: ffff8808536e5ac0 R15: ffff88084ebc0060
[81416.781621] FS:  00007fd879fab740(0000) GS:ffff88085fac0000(0000) knlGS:0000000000000000
[81416.790522] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[81416.797094] CR2: 0000000000000000 CR3: 00000007eb215000 CR4: 00000000003406e0
[81416.805051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[81416.812997] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[81416.820950] Call Trace:
[81416.824226]  ib_device_release+0x1e/0x40 [ib_core]
[81416.829858]  device_release+0x32/0xa0
[81416.834370]  kobject_cleanup+0x63/0x170
[81416.839058]  kobject_put+0x25/0x50
[81416.843319]  ib_dealloc_device+0x25/0x40 [ib_core]
[81416.848986]  mlx5_ib_add+0x163/0x1990 [mlx5_ib]
[81416.854414]  mlx5_add_device+0x5a/0x160 [mlx5_core]
[81416.860191]  mlx5_register_interface+0x8d/0xc0 [mlx5_core]
[81416.866587]  ? 0xffffffffa09e9000
[81416.870816]  mlx5_ib_init+0x15/0x17 [mlx5_ib]
[81416.876094]  do_one_initcall+0x51/0x1b0
[81416.880861]  ? __vunmap+0x85/0xd0
[81416.885113]  ? kmem_cache_alloc_trace+0x14b/0x1b0
[81416.890768]  ? vfree+0x2e/0x70
[81416.894762]  do_init_module+0x60/0x1fa
[81416.899441]  load_module+0x15f6/0x1af0
[81416.904114]  ? __symbol_put+0x60/0x60
[81416.908709]  ? ima_post_read_file+0x3d/0x80
[81416.913828]  ? security_kernel_post_read_file+0x6b/0x80
[81416.920006]  SYSC_finit_module+0xa6/0xf0
[81416.924888]  SyS_finit_module+0xe/0x10
[81416.929568]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[81416.935089] RIP: 0033:0x7fd879494949
[81416.939543] RSP: 002b:00007ffdbc1b4e58 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
[81416.947982] RAX: ffffffffffffffda RBX: 0000000001b66f00 RCX: 00007fd879494949
[81416.955965] RDX: 0000000000000000 RSI: 000000000041a13c RDI: 0000000000000003
[81416.963926] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000001b652a0
[81416.971861] R10: 0000000000000003 R11: 0000000000000202 R12: 00007ffdbc1b3e70
[81416.979763] R13: 00007ffdbc1b3e50 R14: 0000000000000005 R15: 0000000000000000
[81417.008005] RIP: ib_cache_release_one+0x29/0x80 [ib_core] RSP: ffffc900062c3a90
[81417.016045] CR2: 0000000000000000

Fixes: 55aeed0654 ("IB/core: Make ib_alloc_device init the kobject")
Fixes: 7738613e7c ("IB/core: Add per port immutable struct to ib_device")
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:44 +02:00
Jack Morgenstein
5c7f1dfa7b IB/core: Fix sysfs registration error flow
commit b312be3d87 upstream.

The kernel commit cited below restructured ib device management
so that the device kobject is initialized in ib_alloc_device.

As part of the restructuring, the kobject is now initialized in
procedure ib_alloc_device, and is later added to the device hierarchy
in the ib_register_device call stack, in procedure
ib_device_register_sysfs (which calls device_add).

However, in the ib_device_register_sysfs error flow, if an error
occurs following the call to device_add, the cleanup procedure
device_unregister is called. This call results in the device object
being deleted -- which results in various use-after-free crashes.

The correct cleanup call is device_del -- which undoes device_add
without deleting the device object.

The device object will then (correctly) be deleted in the
ib_register_device caller's error cleanup flow, when the caller invokes
ib_dealloc_device.

Fixes: 55aeed0654 ("IB/core: Make ib_alloc_device init the kobject")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:44 +02:00
Ding Tianhong
09944c2660 iov_iter: don't revert iov buffer if csum error
commit a6a5993243 upstream.

The patch 3278682123 (make skb_copy_datagram_msg() et.al. preserve
->msg_iter on error) will revert the iov buffer if copy to iter
failed, but it didn't copy any datagram if the skb_checksum_complete
error, so no need to revert any data at this place.

v2: Sabrina notice that return -EFAULT when checksum error is not correct
    here, it would confuse the caller about the return value, so fix it.

Fixes: 3278682123 ("make skb_copy_datagram_msg() et.al. preserve->msg_iter on error")
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:44 +02:00
Alex Williamson
1e0b0a9ef0 vfio/type1: Remove locked page accounting workqueue
commit 0cfef2b741 upstream.

If the mmap_sem is contented then the vfio type1 IOMMU backend will
defer locked page accounting updates to a workqueue task.  This has a
few problems and depending on which side the user tries to play, they
might be over-penalized for unmaps that haven't yet been accounted or
race the workqueue to enter more mappings than they're allowed.  The
original intent of this workqueue mechanism seems to be focused on
reducing latency through the ioctl, but we cannot do so at the cost
of correctness.  Remove this workqueue mechanism and update the
callers to allow for failure.  We can also now recheck the limit under
write lock to make sure we don't exceed it.

vfio_pin_pages_remote() also now necessarily includes an unwind path
which we can jump to directly if the consecutive page pinning finds
that we're exceeding the user's memory limits.  This avoids the
current lazy approach which does accounting and mapping up to the
fault, only to return an error on the next iteration to unwind the
entire vfio_dma.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:44 +02:00
Dennis Yang
34224e0e1c dm thin: fix a memory leak when passing discard bio down
commit 948f581a53 upstream.

dm-thin does not free the discard_parent bio after all chained sub
bios finished. The following kmemleak report could be observed after
pool with discard_passdown option processes discard bios in
linux v4.11-rc7. To fix this, we drop the discard_parent bio reference
when its endio (passdown_endio) called.

unreferenced object 0xffff8803d6b29700 (size 256):
  comm "kworker/u8:0", pid 30349, jiffies 4379504020 (age 143002.776s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    01 00 00 00 00 00 00 f0 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81a5efd9>] kmemleak_alloc+0x49/0xa0
    [<ffffffff8114ec34>] kmem_cache_alloc+0xb4/0x100
    [<ffffffff8110eec0>] mempool_alloc_slab+0x10/0x20
    [<ffffffff8110efa5>] mempool_alloc+0x55/0x150
    [<ffffffff81374939>] bio_alloc_bioset+0xb9/0x260
    [<ffffffffa018fd20>] process_prepared_discard_passdown_pt1+0x40/0x1c0 [dm_thin_pool]
    [<ffffffffa018b409>] break_up_discard_bio+0x1a9/0x200 [dm_thin_pool]
    [<ffffffffa018b484>] process_discard_cell_passdown+0x24/0x40 [dm_thin_pool]
    [<ffffffffa018b24d>] process_discard_bio+0xdd/0xf0 [dm_thin_pool]
    [<ffffffffa018ecf6>] do_worker+0xa76/0xd50 [dm_thin_pool]
    [<ffffffff81086239>] process_one_work+0x139/0x370
    [<ffffffff810867b1>] worker_thread+0x61/0x450
    [<ffffffff8108b316>] kthread+0xd6/0xf0
    [<ffffffff81a6cd1f>] ret_from_fork+0x3f/0x70
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Dennis Yang <dennisyang@qnap.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-20 14:49:44 +02:00