Commit Graph

811 Commits

Author SHA1 Message Date
Vasily Averin
181224f970 ext4: fix possible leak of s_journal_flag_rwsem in error path
commit af18e35bfd upstream.

Fixes: c8585c6fca ("ext4: fix races between changing inode journal ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.7
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21 09:24:12 +01:00
Theodore Ts'o
1124e5a89b ext4: fix possible leak of sbi->s_group_desc_leak in error path
commit 9e463084cd upstream.

Fixes: bfe0a5f47a ("ext4: add more mount time checks of the superblock")
Reported-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.18
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-21 09:24:12 +01:00
Theodore Ts'o
44dcd01ee1 ext4: fix use-after-free race in ext4_remount()'s error path
commit 33458eaba4 upstream.

It's possible for ext4_show_quota_options() to try reading
s_qf_names[i] while it is being modified by ext4_remount() --- most
notably, in ext4_remount's error path when the original values of the
quota file name gets restored.

Reported-by: syzbot+a2872d6feea6918008a9@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.2+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13 11:15:06 -08:00
Eric Biggers
0d09307bc2 ext4: show test_dummy_encryption mount option in /proc/mounts
commit 338affb548 upstream.

When in effect, add "test_dummy_encryption" to _ext4_show_options() so
that it is shown in /proc/mounts and other relevant procfs files.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-29 03:06:06 -07:00
Theodore Ts'o
22654a3b4a ext4: recalucate superblock checksum after updating free blocks/inodes
commit 4274f516d4 upstream.

When mounting the superblock, ext4_fill_super() calculates the free
blocks and free inodes and stores them in the superblock.  It's not
strictly necessary, since we don't use them any more, but it's nice to
keep them roughly aligned to reality.

Since it's not critical for file system correctness, the code doesn't
call ext4_commit_super().  The problem is that it's in
ext4_commit_super() that we recalculate the superblock checksum.  So
if we're not going to call ext4_commit_super(), we need to call
ext4_superblock_csum_set() to make sure the superblock checksum is
consistent.

Most of the time, this doesn't matter, since we end up calling
ext4_commit_super() very soon thereafter, and definitely by the time
the file system is unmounted.  However, it doesn't work in this
sequence:

mke2fs -Fq -t ext4 /dev/vdc 128M
mount /dev/vdc /vdc
cp xfstests/git-versions /vdc
godown /vdc
umount /vdc
mount /dev/vdc
tune2fs -l /dev/vdc

With this commit, the "tune2fs -l" no longer fails.

Reported-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-29 03:06:06 -07:00
Theodore Ts'o
147b89c421 ext4: clear mmp sequence number when remounting read-only
[ Upstream commit 2dca60d98e ]

Previously, when an MMP-protected file system is remounted read-only,
the kmmpd thread would exit the next time it woke up (a few seconds
later), without resetting the MMP sequence number back to
EXT4_MMP_SEQ_CLEAN.

Fix this by explicitly killing the MMP thread when the file system is
remounted read-only.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05 09:26:24 +02:00
Theodore Ts'o
dd69abaccb ext4: fix false negatives *and* false positives in ext4_check_descriptors()
commit 44de022c43 upstream.

Ext4_check_descriptors() was getting called before s_gdb_count was
initialized.  So for file systems w/o the meta_bg feature, allocation
bitmaps could overlap the block group descriptors and ext4 wouldn't
notice.

For file systems with the meta_bg feature enabled, there was a
fencepost error which would cause the ext4_check_descriptors() to
incorrectly believe that the block allocation bitmap overlaps with the
block group descriptor blocks, and it would reject the mount.

Fix both of these problems.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Gilbert <bgilbert@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-09 12:16:38 +02:00
Theodore Ts'o
f547aa20b4 ext4: fix check to prevent initializing reserved inodes
commit 5012284700 upstream.

Commit 8844618d8a: "ext4: only look at the bg_flags field if it is
valid" will complain if block group zero does not have the
EXT4_BG_INODE_ZEROED flag set.  Unfortunately, this is not correct,
since a freshly created file system has this flag cleared.  It gets
almost immediately after the file system is mounted read-write --- but
the following somewhat unlikely sequence will end up triggering a
false positive report of a corrupted file system:

   mkfs.ext4 /dev/vdc
   mount -o ro /dev/vdc /vdc
   mount -o remount,rw /dev/vdc

Instead, when initializing the inode table for block group zero, test
to make sure that itable_unused count is not too large, since that is
the case that will result in some or all of the reserved inodes
getting cleared.

This fixes the failures reported by Eric Whiteney when running
generic/230 and generic/231 in the the nojournal test case.

Fixes: 8844618d8a ("ext4: only look at the bg_flags field if it is valid")
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-03 07:50:43 +02:00
Dave Jiang
8214347c26 dax: change bdev_dax_supported() to support boolean returns
commit 80660f2025 upstream.

The function return values are confusing with the way the function is
named. We expect a true or false return value but it actually returns
0/-errno.  This makes the code very confusing. Changing the return values
to return a bool where if DAX is supported then return true and no DAX
support returns false.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11 16:29:22 +02:00
Darrick J. Wong
a19385766b fs: allow per-device dax status checking for filesystems
commit ba23cba9b3 upstream.

Change bdev_dax_supported so it takes a bdev parameter.  This enables
multi-device filesystems like xfs to check that a dax device can work for
the particular filesystem.  Once that's in place, actually fix all the
parts of XFS where we need to be able to distinguish between datadev and
rtdev.

This patch fixes the problem where we screw up the dax support checking
in xfs if the datadev and rtdev have different dax capabilities.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11 16:29:22 +02:00
Jon Derrick
fba3230595 ext4: check superblock mapped prior to committing
commit a17712c8e4 upstream.

This patch attempts to close a hole leading to a BUG seen with hot
removals during writes [1].

A block device (NVME namespace in this test case) is formatted to EXT4
without partitions. It's mounted and write I/O is run to a file, then
the device is hot removed from the slot. The superblock attempts to be
written to the drive which is no longer present.

The typical chain of events leading to the BUG:
ext4_commit_super()
  __sync_dirty_buffer()
    submit_bh()
      submit_bh_wbc()
        BUG_ON(!buffer_mapped(bh));

This fix checks for the superblock's buffer head being mapped prior to
syncing.

[1] https://www.spinics.net/lists/linux-ext4/msg56527.html

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11 16:29:19 +02:00
Theodore Ts'o
54bf664ae4 ext4: add more mount time checks of the superblock
commit bfe0a5f47a upstream.

The kernel's ext4 mount-time checks were more permissive than
e2fsprogs's libext2fs checks when opening a file system.  The
superblock is considered too insane for debugfs or e2fsck to operate
on it, the kernel has no business trying to mount it.

This will make file system fuzzing tools work harder, but the failure
cases that they find will be more useful and be easier to evaluate.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11 16:29:19 +02:00
Theodore Ts'o
c24aab6d86 ext4: add more inode number paranoia checks
commit c37e9e0134 upstream.

If there is a directory entry pointing to a system inode (such as a
journal inode), complain and declare the file system to be corrupted.

Also, if the superblock's first inode number field is too small,
refuse to mount the file system.

This addresses CVE-2018-10882.

https://bugzilla.kernel.org/show_bug.cgi?id=200069

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11 16:29:18 +02:00
Theodore Ts'o
44a4bc970b ext4: only look at the bg_flags field if it is valid
commit 8844618d8a upstream.

The bg_flags field in the block group descripts is only valid if the
uninit_bg or metadata_csum feature is enabled.  We were not
consistently looking at this field; fix this.

Also block group #0 must never have uninitialized allocation bitmaps,
or need to be zeroed, since that's where the root inode, and other
special inodes are set up.  Check for these conditions and mark the
file system as corrupted if they are detected.

This addresses CVE-2018-10876.

https://bugzilla.kernel.org/show_bug.cgi?id=199403

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11 16:29:17 +02:00
Theodore Ts'o
ac93c71836 ext4: make sure bitmaps and the inode table don't overlap with bg descriptors
commit 77260807d1 upstream.

It's really bad when the allocation bitmaps and the inode table
overlap with the block group descriptors, since it causes random
corruption of the bg descriptors.  So we really want to head those off
at the pass.

https://bugzilla.kernel.org/show_bug.cgi?id=199865

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11 16:29:17 +02:00
Eric Sandeen
791a1ef7df ext4: don't complain about incorrect features when probing
[ Upstream commit 0d9366d67b ]

If mount is auto-probing for filesystem type, it will try various
filesystems in order, with the MS_SILENT flag set.  We get
that flag as the silent arg to ext4_fill_super.

If we're probing (silent==1) then don't complain about feature
incompatibilities that are found if it looks like it's actually
a different valid extN type - failed probes should be silent
in this case.

If the on-disk features are unknown even to ext4, then complain.

Reported-by: Joakim Tjernlund <Joakim.Tjernlund@infinera.com>
Tested-by: Joakim Tjernlund <Joakim.Tjernlund@infinera.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:52:30 +02:00
Theodore Ts'o
55cc3bb0a6 ext4: add MODULE_SOFTDEP to ensure crc32c is included in the initramfs
commit 7ef79ad521 upstream.

Fixes: a45403b515 ("ext4: always initialize the crc32c checksum driver")
Reported-by: François Valenduc <francoisvalenduc@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-01 12:58:07 -07:00
Theodore Ts'o
b2623d8166 ext4: don't allow r/w mounts if metadata blocks overlap the superblock
commit 18db4b4e6f upstream.

If some metadata block, such as an allocation bitmap, overlaps the
superblock, it's very likely that if the file system is mounted
read/write, the results will not be pretty.  So disallow r/w mounts
for file systems corrupted in this particular way.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24 09:36:30 +02:00
Theodore Ts'o
26dbb30c58 ext4: always initialize the crc32c checksum driver
commit a45403b515 upstream.

The extended attribute code now uses the crc32c checksum for hashing
purposes, so we should just always always initialize it.  We also want
to prevent NULL pointer dereferences if one of the metadata checksum
features is enabled after the file sytsem is originally mounted.

This issue has been assigned CVE-2018-1094.

https://bugzilla.kernel.org/show_bug.cgi?id=199183
https://bugzilla.redhat.com/show_bug.cgi?id=1560788

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24 09:36:30 +02:00
Zhouyi Zhou
2e38988253 ext4: save error to disk in __ext4_grp_locked_error()
commit 06f29cc81f upstream.

In the function __ext4_grp_locked_error(), __save_error_info()
is called to save error info in super block block, but does not sync
that information to disk to info the subsequence fsck after reboot.

This patch writes the error information to disk.  After this patch,
I think there is no obvious EXT4 error handle branches which leads to
"Remounting filesystem read-only" will leave the disk partition miss
the subsequence fsck.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22 15:42:26 +01:00
Ross Zwisler
713cb65003 ext4: prevent data corruption with inline data + DAX
commit 559db4c6d7 upstream.

If an inode has inline data it is currently prevented from using DAX by a
check in ext4_set_inode_flags().  When the inode grows inline data via
ext4_create_inline_data() or removes its inline data via
ext4_destroy_inline_data_nolock(), the value of S_DAX can change.

Currently these changes are unsafe because we don't hold off page faults
and I/O, write back dirty radix tree entries and invalidate all mappings.
There are also issues with mm-level races when changing the value of S_DAX,
as well as issues with the VM_MIXEDMAP flag:

https://www.spinics.net/lists/linux-xfs/msg09859.html

The unsafe transition of S_DAX can reliably cause data corruption, as shown
by the following fstest:

https://patchwork.kernel.org/patch/9948381/

Fix this issue by preventing the DAX mount option from being used on
filesystems that were created to support inline data.  Inline data is an
option given to mkfs.ext4.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:48 +00:00
Matthew Garrett
357fdad075 Convert fs/*/* to SB_I_VERSION
[AV: in addition to the fix in previous commit]

Signed-off-by: Matthew Garrett <mjg59@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-18 18:51:27 -04:00
Linus Torvalds
0f0d12728e Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mount flag updates from Al Viro:
 "Another chunk of fmount preparations from dhowells; only trivial
  conflicts for that part. It separates MS_... bits (very grotty
  mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
  only a small subset of MS_... stuff).

  This does *not* convert the filesystems to new constants; only the
  infrastructure is done here. The next step in that series is where the
  conflicts would be; that's the conversion of filesystems. It's purely
  mechanical and it's better done after the merge, so if you could run
  something like

	list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')

	sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
	        -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
	        -e 's/\<MS_NODEV\>/SB_NODEV/g' \
	        -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
	        -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
	        -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
	        -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
	        -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
	        -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
	        -e 's/\<MS_SILENT\>/SB_SILENT/g' \
	        -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
	        -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
	        -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
	        -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
	        $list

  and commit it with something along the lines of 'convert filesystems
  away from use of MS_... constants' as commit message, it would save a
  quite a bit of headache next cycle"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Differentiate mount flags (MS_*) from internal superblock flags
  VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
  vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-14 18:54:01 -07:00
Linus Torvalds
89fd915c40 Merge tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm from Dan Williams:
 "A rework of media error handling in the BTT driver and other updates.
  It has appeared in a few -next releases and collected some late-
  breaking build-error and warning fixups as a result.

  Summary:

   - Media error handling support in the Block Translation Table (BTT)
     driver is reworked to address sleeping-while-atomic locking and
     memory-allocation-context conflicts.

   - The dax_device lookup overhead for xfs and ext4 is moved out of the
     iomap hot-path to a mount-time lookup.

   - A new 'ecc_unit_size' sysfs attribute is added to advertise the
     read-modify-write boundary property of a persistent memory range.

   - Preparatory fix-ups for arm and powerpc pmem support are included
     along with other miscellaneous fixes"

* tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
  libnvdimm, btt: fix format string warnings
  libnvdimm, btt: clean up warning and error messages
  ext4: fix null pointer dereference on sbi
  libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
  dax: fix FS_DAX=n BLOCK=y compilation
  libnvdimm: fix integer overflow static analysis warning
  libnvdimm, nd_blk: remove mmio_flush_range()
  libnvdimm, btt: rework error clearing
  libnvdimm: fix potential deadlock while clearing errors
  libnvdimm, btt: cache sector_size in arena_info
  libnvdimm, btt: ensure that flags were also unchanged during a map_read
  libnvdimm, btt: refactor map entry operations with macros
  libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
  libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
  ext4: perform dax_device lookup at mount
  ext2: perform dax_device lookup at mount
  xfs: perform dax_device lookup at mount
  dax: introduce a fs_dax_get_by_bdev() helper
  libnvdimm, btt: check memory allocation failure
  libnvdimm, label: fix index block size calculation
  ...
2017-09-11 13:10:57 -07:00
Linus Torvalds
ae8ac6b7db Merge branch 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota scaling updates from Jan Kara:
 "This contains changes to make the quota subsystem more scalable.

  Reportedly it improves number of files created per second on ext4
  filesystem on fast storage by about a factor of 2x"

* 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
  quota: Add lock annotations to struct members
  quota: Reduce contention on dq_data_lock
  fs: Provide __inode_get_bytes()
  quota: Inline dquot_[re]claim_reserved_space() into callsite
  quota: Inline inode_{incr,decr}_space() into callsites
  quota: Inline functions into their callsites
  ext4: Disable dirty list tracking of dquots when journalling quotas
  quota: Allow disabling tracking of dirty dquots in a list
  quota: Remove dq_wait_unused from dquot
  quota: Move locking into clear_dquot_dirty()
  quota: Do not dirty bad dquots
  quota: Fix possible corruption of dqi_flags
  quota: Propagate ->quota_read errors from v2_read_file_info()
  quota: Fix error codes in v2_read_file_info()
  quota: Push dqio_sem down to ->read_file_info()
  quota: Push dqio_sem down to ->write_file_info()
  quota: Push dqio_sem down to ->get_next_id()
  quota: Push dqio_sem down to ->release_dqblk()
  quota: Remove locking for writing to the old quota format
  quota: Do not acquire dqio_sem for dquot overwrites in v2 format
  ...
2017-09-07 15:19:35 -07:00
Colin Ian King
aed9eb1b21 ext4: fix null pointer dereference on sbi
In the case of a kzalloc failure when allocating sbi we end up
with a null pointer dereference on sbi when assigning sbi->s_daxdev.
Fix this by moving the assignment of sbi->s_daxdev to after the
null pointer check of sbi.

Detected by CoverityScan CID#1455379 ("Dereference before null check")

Fixes: 5e405595e5 ("ext4: perform dax_device lookup at mount")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-09-05 10:02:08 -07:00
Dan Williams
5e405595e5 ext4: perform dax_device lookup at mount
The ->iomap_begin() operation is a hot path, so cache the
fs_dax_get_by_host() result at mount time to avoid the incurring the
hash lookup overhead on a per-i/o basis.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31 11:12:13 -07:00
zhangyi (F)
95f1fda47c ext4: fix quota inconsistency during orphan cleanup for read-only mounts
Quota does not get enabled for read-only mounts if filesystem
has quota feature, so that quotas cannot updated during orphan
cleanup, which will lead to quota inconsistency.

This patch turn on quotas during orphan cleanup for this case,
make sure quotas can be updated correctly.

Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org # 3.18+
2017-08-24 15:21:50 -04:00
zhangyi (F)
b0a5a9589d ext4: fix incorrect quotaoff if the quota feature is enabled
Current ext4 quota should always "usage enabled" if the
quota feautre is enabled. But in ext4_orphan_cleanup(), it
turn quotas off directly (used for the older journaled
quota), so we cannot turn it on again via "quotaon" unless
umount and remount ext4.

Simple reproduce:

  mkfs.ext4 -O project,quota /dev/vdb1
  mount -o prjquota /dev/vdb1 /mnt
  chattr -p 123 /mnt
  chattr +P /mnt
  touch /mnt/aa /mnt/bb
  exec 100<>/mnt/aa
  rm -f /mnt/aa
  sync
  echo c > /proc/sysrq-trigger

  #reboot and mount
  mount -o prjquota /dev/vdb1 /mnt
  #query status
  quotaon -Ppv /dev/vdb1
  #output
  quotaon: Cannot find mountpoint for device /dev/vdb1
  quotaon: No correct mountpoint specified.

This patch add check for journaled quotas to avoid incorrect
quotaoff when ext4 has quota feautre.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org # 3.18
2017-08-24 15:19:39 -04:00
Jan Kara
7b9ca4c61b quota: Reduce contention on dq_data_lock
dq_data_lock is currently used to protect all modifications of quota
accounting information, consistency of quota accounting on the inode,
and dquot pointers from inode. As a result contention on the lock can be
pretty heavy.

Reduce the contention on the lock by protecting quota accounting
information by a new dquot->dq_dqb_lock and consistency of quota
accounting with inode usage by inode->i_lock.

This change reduces time to create 500000 files on ext4 on ramdisk by 50
different processes in separate directories by 6% when user quota is
turned on. When those 50 processes belong to 50 different users, the
improvement is about 9%.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:07:59 +02:00
Jan Kara
91389240a2 ext4: Disable dirty list tracking of dquots when journalling quotas
When journalling quotas, we writeback all dquots immediately after
changing them as part of current transation. Thus there's no need to
write anything in dquot_writeback_dquots() and so we can avoid updating
list of dirty dquots to reduce dq_list_lock contention.

This change reduces time to create 500000 files on ext4 on ramdisk by 50
different processes in separate directories by 15% when user quota is
turned on.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:00:54 +02:00
Jan Kara
bc8230ee8e quota: Convert dqio_mutex to rwsem
Convert dqio_mutex to rwsem and call it dqio_sem. No functional changes
yet.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 18:52:48 +02:00
Eric Whitney
a627b0a7c1 ext4: remove unused metadata accounting variables
Two variables in ext4_inode_info, i_reserved_meta_blocks and
i_allocated_meta_blocks, are unused.  Removing them saves a little
memory per in-memory inode and cleans up clutter in several tracepoints.
Adjust tracepoint output from ext4_alloc_da_blocks() for consistency
and fix a typo and whitespace near these changes.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-30 22:30:11 -04:00
David Howells
bc98a42c1f VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
Firstly by applying the following with coccinelle's spatch:

	@@ expression SB; @@
	-SB->s_flags & MS_RDONLY
	+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

	@@ expression A, SB; @@
	(
	-(!sb_rdonly(SB)) && A
	+!sb_rdonly(SB) && A
	|
	-A != (sb_rdonly(SB))
	+A != sb_rdonly(SB)
	|
	-A == (sb_rdonly(SB))
	+A == sb_rdonly(SB)
	|
	-!(sb_rdonly(SB))
	+!sb_rdonly(SB)
	|
	-A && (sb_rdonly(SB))
	+A && sb_rdonly(SB)
	|
	-A || (sb_rdonly(SB))
	+A || sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) != A
	+sb_rdonly(SB) != A
	|
	-(sb_rdonly(SB)) == A
	+sb_rdonly(SB) == A
	|
	-(sb_rdonly(SB)) && A
	+sb_rdonly(SB) && A
	|
	-(sb_rdonly(SB)) || A
	+sb_rdonly(SB) || A
	)

	@@ expression A, B, SB; @@
	(
	-(sb_rdonly(SB)) ? 1 : 0
	+sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) ? A : B
	+sb_rdonly(SB) ? A : B
	)

to remove left over excess bracketage and finally by applying:

	@@ expression A, SB; @@
	(
	-(A & MS_RDONLY) != sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
	|
	-(A & MS_RDONLY) == sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
	)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:34 +01:00
Linus Torvalds
bc2c6421cb Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
 "The first major feature for ext4 this merge window is the largedir
  feature, which allows ext4 directories to support over 2 billion
  directory entries (assuming ~64 byte file names; in practice, users
  will run into practical performance limits first.) This feature was
  originally written by the Lustre team, and credit goes to Artem
  Blagodarenko from Seagate for getting this feature upstream.

  The second major major feature allows ext4 to support extended
  attribute values up to 64k. This feature was also originally from
  Lustre, and has been enhanced by Tahsin Erdogan from Google with a
  deduplication feature so that if multiple files have the same xattr
  value (for example, Windows ACL's stored by Samba), only one copy will
  be stored on disk for encoding and caching efficiency.

  We also have the usual set of bug fixes, cleanups, and optimizations"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
  ext4: fix spelling mistake: "prellocated" -> "preallocated"
  ext4: fix __ext4_new_inode() journal credits calculation
  ext4: skip ext4_init_security() and encryption on ea_inodes
  fs: generic_block_bmap(): initialize all of the fields in the temp bh
  ext4: change fast symlink test to not rely on i_blocks
  ext4: require key for truncate(2) of encrypted file
  ext4: don't bother checking for encryption key in ->mmap()
  ext4: check return value of kstrtoull correctly in reserved_clusters_store
  ext4: fix off-by-one fsmap error on 1k block filesystems
  ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()
  ext4: return EIO on read error in ext4_find_entry
  ext4: forbid encrypting root directory
  ext4: send parallel discards on commit completions
  ext4: avoid unnecessary stalls in ext4_evict_inode()
  ext4: add nombcache mount option
  ext4: strong binding of xattr inode references
  ext4: eliminate xattr entry e_hash recalculation for removes
  ext4: reserve space for xattr entries/names
  quota: add get_inode_usage callback to transfer multi-inode charges
  ext4: xattr inode deduplication
  ...
2017-07-09 09:31:22 -07:00
Linus Torvalds
58f587cb0b Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypt updates from Ted Ts'o:
 "Add support for 128-bit AES and some cleanups to fscrypt"

* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: make ->dummy_context() return bool
  fscrypt: add support for AES-128-CBC
  fscrypt: inline fscrypt_free_filename()
2017-07-09 09:03:31 -07:00
Tahsin Erdogan
af65207c76 ext4: fix __ext4_new_inode() journal credits calculation
ea_inode feature allows creating extended attributes that are up to
64k in size. Update __ext4_new_inode() to pick increased credit limits.

To avoid overallocating too many journal credits, update
__ext4_xattr_set_credits() to make a distinction between xattr create
vs update. This helps __ext4_new_inode() because all attributes are
known to be new, so we can save credits that are normally needed to
delete old values.

Also, have fscrypt specify its maximum context size so that we don't
end up allocating credits for 64k size.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-07-06 00:01:59 -04:00
Linus Torvalds
81e3e04489 Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid
Pull uuid subsystem from Christoph Hellwig:
 "This is the new uuid subsystem, in which Amir, Andy and I have started
  consolidating our uuid/guid helpers and improving the types used for
  them. Note that various other subsystems have pulled in this tree, so
  I'd like it to go in early.

  UUID/GUID summary:

   - introduce the new uuid_t/guid_t types that are going to replace the
     somewhat confusing uuid_be/uuid_le types and make the terminology
     fit the various specs, as well as the userspace libuuid library.
     (me, based on a previous version from Amir)

   - consolidated generic uuid/guid helper functions lifted from XFS and
     libnvdimm (Amir and me)

   - conversions to the new types and helpers (Amir, Andy and me)"

* tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
  ACPI: hns_dsaf_acpi_dsm_guid can be static
  mmc: sdhci-pci: make guid intel_dsm_guid static
  uuid: Take const on input of uuid_is_null() and guid_is_null()
  thermal: int340x_thermal: fix compile after the UUID API switch
  thermal: int340x_thermal: Switch to use new generic UUID API
  acpi: always include uuid.h
  ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
  ACPI / extlog: Switch to use new generic UUID API
  ACPI / bus: Switch to use new generic UUID API
  ACPI / APEI: Switch to use new generic UUID API
  acpi, nfit: Switch to use new generic UUID API
  MAINTAINERS: add uuid entry
  tmpfs: generate random sb->s_uuid
  scsi_debug: switch to uuid_t
  nvme: switch to uuid_t
  sysctl: switch to use uuid_t
  partitions/ldm: switch to use uuid_t
  overlayfs: use uuid_t instead of uuid_be
  fs: switch ->s_uuid to uuid_t
  ima/policy: switch to use uuid_t
  ...
2017-07-03 09:55:26 -07:00
Eric Biggers
c250b7dd8e fscrypt: make ->dummy_context() return bool
This makes it consistent with ->is_encrypted(), ->empty_dir(), and
fscrypt_dummy_context_enabled().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-23 20:11:50 -04:00
Eric Biggers
9ce0151a47 ext4: forbid encrypting root directory
Currently it's possible to encrypt all files and directories on an ext4
filesystem by deleting everything, including lost+found, then setting an
encryption policy on the root directory.  However, this is incompatible
with e2fsck because e2fsck expects to find, create, and/or write to
lost+found and does not have access to any encryption keys.  Especially
problematic is that if e2fsck can't find lost+found, it will create it
without regard for whether the root directory is encrypted.  This is
wrong for obvious reasons, and it causes a later run of e2fsck to
consider the lost+found directory entry to be corrupted.

Encrypting the root directory may also be of limited use because it is
the "all-or-nothing" use case, for which dm-crypt can be used instead.
(By design, encryption policies are inherited and cannot be overridden;
so the root directory having an encryption policy implies that all files
and directories on the filesystem have that same encryption policy.)

In any case, encrypting the root directory is broken currently and must
not be allowed; so start returning an error if userspace requests it.
For now only do this in ext4, because f2fs and ubifs do not appear to
have the lost+found requirement.  We could move it into
fscrypt_ioctl_set_policy() later if desired, though.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2017-06-23 00:10:36 -04:00
Daeho Jeong
a015434480 ext4: send parallel discards on commit completions
Now, when we mount ext4 filesystem with '-o discard' option, we have to
issue all the discard commands for the blocks to be deallocated and
wait for the completion of the commands on the commit complete phase.
Because this procedure might involve a lot of sequential combinations of
issuing discard commands and waiting for that, the delay of this
procedure might be too much long, even to 17.0s in our test,
and it results in long commit delay and fsync() performance degradation.

To reduce this kind of delay, instead of adding callback for each
extent and handling all of them in a sequential manner on commit phase,
we instead add a separate list of extents to free to the superblock and
then process this list at once after transaction commits so that
we can issue all the discard commands in a parallel manner like XFS
filesystem.

Finally, we could enhance the discard command handling performance.
The result was such that 17.0s delay of a single commit in the worst
case has been enhanced to 4.8s.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Tested-by: Kitae Lee <kitae87.lee@samsung.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-06-22 23:54:33 -04:00
Tahsin Erdogan
cdb7ee4c63 ext4: add nombcache mount option
The main purpose of mb cache is to achieve deduplication in
extended attributes. In use cases where opportunity for deduplication
is unlikely, it only adds overhead.

Add a mount option to explicitly turn off mb cache.

Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:55:14 -04:00
Tahsin Erdogan
7a9ca53aea quota: add get_inode_usage callback to transfer multi-inode charges
Ext4 ea_inode feature allows storing xattr values in external inodes to
be able to store values that are bigger than a block in size. Ext4 also
has deduplication support for these type of inodes. With deduplication,
the actual storage waste is eliminated but the users of such inodes are
still charged full quota for the inodes as if there was no sharing
happening in the background.

This design requires ext4 to manually charge the users because the
inodes are shared.

An implication of this is that, if someone calls chown on a file that
has such references we need to transfer the quota for the file and xattr
inodes. Current dquot_transfer() function implicitly transfers one inode
charge. With ea_inode feature, we would like to transfer multiple inode
charges.

Add get_inode_usage callback which can interrogate the total number of
inodes that were charged for a given inode.

[ Applied fix from Colin King to make sure the 'ret' variable is
  initialized on the successful return path.  Detected by
  CoverityScan, CID#1446616 ("Uninitialized scalar variable") --tytso]

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jan Kara <jack@suse.cz>
2017-06-22 11:46:48 -04:00
Tahsin Erdogan
dec214d00e ext4: xattr inode deduplication
Ext4 now supports xattr values that are up to 64k in size (vfs limit).
Large xattr values are stored in external inodes each one holding a
single value. Once written the data blocks of these inodes are immutable.

The real world use cases are expected to have a lot of value duplication
such as inherited acls etc. To reduce data duplication on disk, this patch
implements a deduplicator that allows sharing of xattr inodes.

The deduplication is based on an in-memory hash lookup that is a best
effort sharing scheme. When a xattr inode is read from disk (i.e.
getxattr() call), its crc32c hash is added to a hash table. Before
creating a new xattr inode for a value being set, the hash table is
checked to see if an existing inode holds an identical value. If such an
inode is found, the ref count on that inode is incremented. On value
removal the ref count is decremented and if it reaches zero the inode is
deleted.

The quota charging for such inodes is manually managed. Every reference
holder is charged the full size as if there was no sharing happening.
This is consistent with how xattr blocks are also charged.

[ Fixed up journal credits calculation to handle inline data and the
  rare case where an shared xattr block can get freed when two thread
  race on breaking the xattr block sharing. --tytso ]

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:44:55 -04:00
Tahsin Erdogan
47387409ee ext2, ext4: make mb block cache names more explicit
There will be a second mb_cache instance that tracks ea_inodes. Make
existing names more explicit so that it is clear that they refer to
xattr block cache.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:28:55 -04:00
Tahsin Erdogan
c1a5d5f6ab ext4: improve journal credit handling in set xattr paths
Both ext4_set_acl() and ext4_set_context() need to be made aware of
ea_inode feature when it comes to credits calculation.

Also add a sufficient credits check in ext4_xattr_set_handle() right
after xattr write lock is grabbed. Original credits calculation is done
outside the lock so there is a possiblity that the initially calculated
credits are not sufficient anymore.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-21 22:28:40 -04:00
Christoph Hellwig
85787090a2 fs: switch ->s_uuid to uuid_t
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers.  More to come..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:12 +02:00
Tahsin Erdogan
b8cb5a545c ext4: fix quota charging for shared xattr blocks
ext4_xattr_block_set() calls dquot_alloc_block() to charge for an xattr
block when new references are made. However if dquot_initialize() hasn't
been called on an inode, request for charging is effectively ignored
because ext4_inode_info->i_dquot is not initialized yet.

Add dquot_initialize() to call paths that lead to ext4_xattr_block_set().

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:24:07 -04:00
Jan Kara
964edf66bf ext4: clear lockdep subtype for quota files on quota off
Quota files have special ranking of i_data_sem lock. We inform lockdep
about it when turning on quotas however when turning quotas off, we
don't clear the lockdep subclass from i_data_sem lock and thus when the
inode gets later reused for a normal file or directory, lockdep gets
confused and complains about possible deadlocks. Fix the problem by
resetting lockdep subclass of i_data_sem on quota off.

Cc: stable@vger.kernel.org
Fixes: daf647d2dd
Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:31:23 -04:00
Linus Torvalds
0fcc3ab23d Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "Incremental fixes and a small feature addition on top of the main
  libnvdimm 4.12 pull request:

   - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
     The size regression is fixed by moving all dax helpers into the
     dax-core and only specifying "select DAX" for FS_DAX and
     dax-capable drivers. He also asked for clarification of the
     NR_DEV_DAX config option which, on closer look, does not need to be
     a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
     for good measure.

   - Ben's attention to detail on -stable patch submissions caught a
     case where the recent fixes to arch_copy_from_iter_pmem() missed a
     condition where we strand dirty data in the cache. This is tagged
     for -stable and will also be included in the rework of the pmem api
     to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

   - Vishal adds a feature that missed the initial pull due to pending
     review feedback. It allows the kernel to clear media errors when
     initializing a BTT (atomic sector update driver) instance on a pmem
     namespace.

   - Ross noticed that the dax_device + dax_operations conversion broke
     __dax_zero_page_range(). The nvdimm unit tests fail to check this
     path, but xfstests immediately trips over it. No excuse for missing
     this before submitting the 4.12 pull request.

  These all pass the nvdimm unit tests and an xfstests spot check. The
  set has received a build success notification from the kbuild robot"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  filesystem-dax: fix broken __dax_zero_page_range() conversion
  libnvdimm, btt: ensure that initializing metadata clears poison
  libnvdimm: add an atomic vs process context flag to rw_bytes
  x86, pmem: Fix cache flushing for iovec write < 8 bytes
  device-dax: kill NR_DEV_DAX
  block, dax: move "select DAX" from BLOCK to FS_DAX
  device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
2017-05-12 15:43:10 -07:00