commit 421f0922a2 upstream.
At inode.c:evict_inode_truncate_pages(), when we iterate over the
inode's extent states, we access an extent state record's "state" field
after we unlocked the inode's io tree lock. This can lead to a
use-after-free issue because after we unlock the io tree that extent
state record might have been freed due to being merged into another
adjacent extent state record (a previous inflight bio for a read
operation finished in the meanwhile which unlocked a range in the io
tree and cause a merge of extent state records, as explained in the
comment before the while loop added in commit 6ca0709756 ("Btrfs: fix
hang during inode eviction due to concurrent readahead")).
Fix this by keeping a copy of the extent state's flags in a local
variable and using it after unlocking the io tree.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201189
Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 30928e9baa upstream.
This could result in a really bad case where we do something like
evict
evict_refill_and_join
btrfs_commit_transaction
btrfs_run_delayed_iputs
evict
evict_refill_and_join
btrfs_commit_transaction
... forever
We have plenty of other places where we run delayed iputs that are much
safer, let those do the work.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 49940bdd57 upstream.
When we insert the file extent once the ordered extent completes we free
the reserved extent reservation as it'll have been migrated to the
bytes_used counter. However if we error out after this step we'll still
clear the reserved extent reservation, resulting in a negative
accounting of the reserved bytes for the block group and space info.
Fix this by only doing the free if we didn't successfully insert a file
extent for this extent.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fb5c39d7a8 upstream.
max_extent_size is supposed to be the largest contiguous range for the
space info, and ctl->free_space is the total free space in the block
group. We need to keep track of these separately and _only_ use the
max_free_space if we don't have a max_extent_size, as that means our
original request was too large to search any of the block groups for and
therefore wouldn't have a max_extent_size set.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ad22cf6ea4 upstream.
We can't use entry->bytes if our entry is a bitmap entry, we need to use
entry->max_extent_size in that case. Fix up all the logic to make this
consistent.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7ed586d0a8 upstream.
When using the NO_HOLES feature and logging a regular file, we were
expecting that if we find an inline extent, that either its size in RAM
(uncompressed and unenconded) matches the size of the file or if it does
not, that it matches the sector size and it represents compressed data.
This assertion does not cover a case where the length of the inline extent
is smaller than the sector size and also smaller the file's size, such
case is possible through fallocate. Example:
$ mkfs.btrfs -f -O no-holes /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xb60 0 21" /mnt/foobar
$ xfs_io -c "falloc 40 40" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
In the above example we trigger the assertion because the inline extent's
length is 21 bytes while the file size is 80 bytes. The fallocate() call
merely updated the file's size and did not touch the existing inline
extent, as expected.
So fix this by adjusting the assertion so that an inline extent length
smaller than the file size is valid if the file size is smaller than the
filesystem's sector size.
A test case for fstests follows soon.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Fixes: a89ca6f24f ("Btrfs: fix fsync after truncate when no_holes feature is enabled")
CC: stable@vger.kernel.org # 4.14+
Link: https://lore.kernel.org/linux-btrfs/CAE5jQCfRSBC7n4pUTFJcmHh109=gwyT9mFkCOL+NKfzswmR=_Q@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3527a018c0 upstream.
At inode.c:compress_file_range(), under the "free_pages_out" label, we can
end up dereferencing the "pages" pointer when it has a NULL value. This
case happens when "start" has a value of 0 and we fail to allocate memory
for the "pages" pointer. When that happens we jump to the "cont" label and
then enter the "if (start == 0)" branch where we immediately call the
cow_file_range_inline() function. If that function returns 0 (success
creating an inline extent) or an error (like -ENOMEM for example) we jump
to the "free_pages_out" label and then access "pages[i]" leading to a NULL
pointer dereference, since "nr_pages" has a value greater than zero at
that point.
Fix this by setting "nr_pages" to 0 when we fail to allocate memory for
the "pages" pointer.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119
Fixes: 771ed689d2 ("Btrfs: Optimize compressed writeback and reads")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9c7b0c2e8d upstream.
[BUG]
In the following case, rescan won't zero out the number of qgroup 1/0:
$ mkfs.btrfs -fq $DEV
$ mount $DEV /mnt
$ btrfs quota enable /mnt
$ btrfs qgroup create 1/0 /mnt
$ btrfs sub create /mnt/sub
$ btrfs qgroup assign 0/257 1/0 /mnt
$ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000
$ btrfs sub snap /mnt/sub /mnt/snap
$ btrfs quota rescan -w /mnt
$ btrfs qgroup show -pcre /mnt
qgroupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16.00KiB 16.00KiB none none --- ---
0/257 1016.00KiB 16.00KiB none none 1/0 ---
0/258 1016.00KiB 16.00KiB none none --- ---
1/0 1016.00KiB 16.00KiB none none --- 0/257
So far so good, but:
$ btrfs qgroup remove 0/257 1/0 /mnt
WARNING: quotas may be inconsistent, rescan needed
$ btrfs quota rescan -w /mnt
$ btrfs qgroup show -pcre /mnt
qgoupid rfer excl max_rfer max_excl parent child
-------- ---- ---- -------- -------- ------ -----
0/5 16.00KiB 16.00KiB none none --- ---
0/257 1016.00KiB 16.00KiB none none --- ---
0/258 1016.00KiB 16.00KiB none none --- ---
1/0 1016.00KiB 16.00KiB none none --- ---
^^^^^^^^^^ ^^^^^^^^ not cleared
[CAUSE]
Before rescan we call qgroup_rescan_zero_tracking() to zero out all
qgroups' accounting numbers.
However we don't mark all qgroups dirty, but rely on rescan to do so.
If we have any high level qgroup without children, it won't be marked
dirty during rescan, since we cannot reach that qgroup.
This will cause QGROUP_INFO items of childless qgroups never get updated
in the quota tree, thus their numbers will stay the same in "btrfs
qgroup show" output.
[FIX]
Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even if
we have childless qgroups, their QGROUP_INFO items will still get
updated during rescan.
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Tested-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0f375eed92 upstream.
In a scenario like the following:
mkdir /mnt/A # inode 258
mkdir /mnt/B # inode 259
touch /mnt/B/bar # inode 260
sync
mv /mnt/B/bar /mnt/A/bar
mv -T /mnt/A /mnt/B
fsync /mnt/B/bar
<power fail>
After replaying the log we end up with file bar having 2 hard links, both
with the name 'bar' and one in the directory with inode number 258 and the
other in the directory with inode number 259. Also, we end up with the
directory inode 259 still existing and with the directory inode 258 still
named as 'A', instead of 'B'. In this scenario, file 'bar' should only
have one hard link, located at directory inode 258, the directory inode
259 should not exist anymore and the name for directory inode 258 should
be 'B'.
This incorrect behaviour happens because when attempting to log the old
parents of an inode, we skip any parents that no longer exist. Fix this
by forcing a full commit if an old parent no longer exists.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f2d72f42d5 upstream.
When replaying a log which contains a tmpfile (which necessarily has a
link count of 0) we end up calling inc_nlink(), at
fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like
the following:
[195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40
[195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1
[195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[195191.943726] RIP: 0010:inc_nlink+0x33/0x40
[195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246
[195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006
[195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0
[195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000
[195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60
[195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8
[195191.943735] FS: 00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000
[195191.943736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0
[195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[195191.943741] Call Trace:
[195191.943778] replay_one_buffer+0x797/0x7d0 [btrfs]
[195191.943802] walk_up_log_tree+0x1c1/0x250 [btrfs]
[195191.943809] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943825] walk_log_tree+0xae/0x1d0 [btrfs]
[195191.943840] btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs]
[195191.943856] ? replay_dir_deletes+0x280/0x280 [btrfs]
[195191.943870] open_ctree+0x1c3b/0x22a0 [btrfs]
[195191.943887] btrfs_mount_root+0x6b4/0x800 [btrfs]
[195191.943894] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943899] ? pcpu_alloc+0x55b/0x7c0
[195191.943906] ? mount_fs+0x3b/0x140
[195191.943908] mount_fs+0x3b/0x140
[195191.943912] ? __init_waitqueue_head+0x36/0x50
[195191.943916] vfs_kern_mount+0x62/0x160
[195191.943927] btrfs_mount+0x134/0x890 [btrfs]
[195191.943936] ? rcu_read_lock_sched_held+0x3f/0x70
[195191.943938] ? pcpu_alloc+0x55b/0x7c0
[195191.943943] ? mount_fs+0x3b/0x140
[195191.943952] ? btrfs_remount+0x570/0x570 [btrfs]
[195191.943954] mount_fs+0x3b/0x140
[195191.943956] ? __init_waitqueue_head+0x36/0x50
[195191.943960] vfs_kern_mount+0x62/0x160
[195191.943963] do_mount+0x1f9/0xd40
[195191.943967] ? memdup_user+0x4b/0x70
[195191.943971] ksys_mount+0x7e/0xd0
[195191.943974] __x64_sys_mount+0x21/0x30
[195191.943977] do_syscall_64+0x60/0x1b0
[195191.943980] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[195191.943983] RIP: 0033:0x7f570e4e524a
[195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a
[195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260
[195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260
[195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff
[195191.944002] irq event stamp: 8688
[195191.944010] hardirqs last enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640
[195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
[195191.944018] softirqs last enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150
[195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60
[195191.944022] ---[ end trace 5d6e873a9a0b811a ]---
This happens because the inode does not have the flag I_LINKABLE set,
which is a runtime only flag, not meant to be persisted, set when the
inode is created through open(2) if the flag O_EXCL is not passed to it.
Except for the warning, there are no other consequences (like corruptions
or metadata inconsistencies).
Since it's pointless to replay a tmpfile as it would be deleted in a
later phase of the log replay procedure (it has a link count of 0), fix
this by not logging tmpfiles and if a tmpfile is found in a log (created
by a kernel without this change), skip the replay of the inode.
A test case for fstests follows soon.
Fixes: 471d557afe ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
CC: stable@vger.kernel.org # 4.18+
Reported-by: Martin Steigerwald <martin@lichtvoll.de>
Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 545e3366db upstream.
Allocating new chunks modifies both the extent and chunk tree, which can
trigger new chunk allocations. So instead of doing list_for_each_safe,
just do while (!list_empty()) so we make sure we don't exit with other
pending bg's still on our list.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 84de76a2fb upstream.
If we're allocating a new space cache inode it's likely going to be
under a transaction handle, so we need to use memalloc_nofs_save() in
order to avoid deadlocks, and more importantly lockdep messages that
make xfstests fail.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3aa7c7a31c upstream.
While testing my backport I noticed there was a panic if I ran
generic/416 generic/417 generic/418 all in a row. This just happened to
uncover a race where we had outstanding IO after we destroy all of our
workqueues, and then we'd go to queue the endio work on those free'd
workqueues.
This is because we aren't waiting for the caching threads to be done
before freeing everything up, so to fix this make sure we wait on any
outstanding caching that's being done before we free up the block group,
so we're sure to be done with all IO by the time we get to
btrfs_stop_all_workers(). This fixes the panic I was seeing
consistently in testing.
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:6112!
SMP PTI
Modules linked in:
CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Workqueue: btrfs-cache btrfs_cache_helper
RIP: 0010:btrfs_map_bio+0x346/0x370
RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x8a/0xd0
submit_one_bio+0x5d/0x80
read_extent_buffer_pages+0x18a/0x320
btree_read_extent_buffer_pages+0xbc/0x200
? alloc_extent_buffer+0x359/0x3e0
read_tree_block+0x3d/0x60
read_block_for_search.isra.30+0x1a5/0x360
btrfs_search_slot+0x41b/0xa10
btrfs_next_old_leaf+0x212/0x470
caching_thread+0x323/0x490
normal_work_helper+0xc5/0x310
process_one_work+0x141/0x340
worker_thread+0x44/0x3c0
kthread+0xf8/0x130
? process_one_work+0x340/0x340
? kthread_bind+0x10/0x10
ret_from_fork+0x35/0x40
RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
---[ end trace 827eb13e50846033 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0be88e367f upstream.
We check whether any device the file system is using supports discard in
the ioctl call, but then we attempt to trim free extents on every device
regardless of whether discard is supported. Due to the way we mask off
EOPNOTSUPP, we can end up issuing the trim operations on each free range
on devices that don't support it, just wasting time.
Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d4e329de5e upstream.
btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
device_list_mutex. The problem is that ->alloc_list is protected by the
chunk mutex. We don't want to hold the chunk mutex over the trim of the
entire file system. Fortunately, the ->dev_list list is protected by
the dev_list mutex and while it will give us all devices, including
read-only devices, we already just skip the read-only devices. Then we
can continue to take and release the chunk mutex while scanning each
device.
Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6ba9fc8e62 upstream.
[BUG]
fstrim on some btrfs only trims the unallocated space, not trimming any
space in existing block groups.
[CAUSE]
Before fstrim_range passed to btrfs_trim_fs(), it gets truncated to
range [0, super->total_bytes). So later btrfs_trim_fs() will only be
able to trim block groups in range [0, super->total_bytes).
While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs
uses its logical address space, there is nothing limiting the location
where we put block groups.
For filesystem with frequent balance, it's quite easy to relocate all
block groups and bytenr of block groups will start beyond
super->total_bytes.
In that case, btrfs will not trim existing block groups.
[FIX]
Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
can get the unmodified range, which is normally set to [0, U64_MAX].
Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: f4c697e640 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
CC: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 93bba24d4b upstream.
Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
error happens when trimming existing block groups, it will skip the
remaining blocks and continue to trim unallocated space for each device.
The return value will only reflect the final error from device trimming.
This patch will fix such behavior by:
1) Recording the last error from block group or device trimming
The return value will also reflect the last error during trimming.
Make developer more aware of the problem.
2) Continuing trimming if possible
If we failed to trim one block group or device, we could still try
the next block group or device.
3) Report number of failures during block group and device trimming
It would be less noisy, but still gives user a brief summary of
what's going wrong.
Such behavior can avoid confusion for cases like failure to trim the
first block group and then only unallocated space is trimmed.
Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add bg_ret and dev_ret to the messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b72c3aba09 upstream.
[BUG]
For certain crafted image, whose csum root leaf has missing backref, if
we try to trigger write with data csum, it could cause deadlock with the
following kernel WARN_ON():
WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
Workqueue: btrfs-endio-write btrfs_endio_write_helper
RIP: 0010:btrfs_tree_lock+0x3e2/0x400
Call Trace:
btrfs_alloc_tree_block+0x39f/0x770
__btrfs_cow_block+0x285/0x9e0
btrfs_cow_block+0x191/0x2e0
btrfs_search_slot+0x492/0x1160
btrfs_lookup_csum+0xec/0x280
btrfs_csum_file_blocks+0x2be/0xa60
add_pending_csums+0xaf/0xf0
btrfs_finish_ordered_io+0x74b/0xc90
finish_ordered_fn+0x15/0x20
normal_work_helper+0xf6/0x500
btrfs_endio_write_helper+0x12/0x20
process_one_work+0x302/0x770
worker_thread+0x81/0x6d0
kthread+0x180/0x1d0
ret_from_fork+0x35/0x40
[CAUSE]
That crafted image has missing backref for csum tree root leaf. And
when we try to allocate new tree block, since there is no
EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
and use it.
The extent tree of the image looks like:
Normal image | This fuzzed image
----------------------------------+--------------------------------
BG 29360128 | BG 29360128
One empty slot | One empty slot
29364224: backref to UUID tree | 29364224: backref to UUID tree
Two empty slots | Two empty slots
29376512: backref to CSUM tree | One empty slot (bad type) <<<
29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
... | ...
Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
alloc tree block, it's an valid slot for btrfs.
And for finish_ordered_write, when we need to insert csum, we try to CoW
csum tree root.
By accident, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
BG_OFFSET + 12K is already used by tree block COW for other trees, the
next empty slot is BG_OFFSET + 16K, which should be the backref for CSUM
tree.
But due to the bad type, btrfs can recognize it and still consider it as
an empty slot, and will try to use it for csum tree CoW.
Then in the following call trace, we will try to lock the new tree
block, which turns out to be the old csum tree root which is already
locked:
btrfs_search_slot() called on csum tree root, which is at 29376512
|- btrfs_cow_block()
|- btrfs_set_lock_block()
| |- Now locks tree block 29376512 (old csum tree root)
|- __btrfs_cow_block()
|- btrfs_alloc_tree_block()
|- btrfs_reserve_extent()
| Now it returns tree block 29376512, which extent tree
| shows its empty slot, but it's already hold by csum tree
|- btrfs_init_new_buffer()
|- btrfs_tree_lock()
| Triggers WARN_ON(eb->lock_owner == current->pid)
|- wait_event()
Wait lock owner to release the lock, but it's
locked by ourself, so it will deadlock
[FIX]
This patch will do the lock_owner and current->pid check at
btrfs_init_new_buffer().
So above deadlock can be avoided.
Since such problem can only happen in crafted image, we will still
trigger kernel warning for later aborted transaction, but with a little
more meaningful warning message.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 65c6e82bec upstream.
[BUG]
When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
when trying to recover balance:
kernel BUG at fs/btrfs/extent-tree.c:8956!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
Call Trace:
walk_up_tree+0x172/0x1f0 [btrfs]
btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
merge_reloc_roots+0xe1/0x1d0 [btrfs]
btrfs_recover_relocation+0x3ea/0x420 [btrfs]
open_ctree+0x1af3/0x1dd0 [btrfs]
btrfs_mount_root+0x66b/0x740 [btrfs]
mount_fs+0x3b/0x16a
vfs_kern_mount.part.9+0x54/0x140
btrfs_mount+0x16d/0x890 [btrfs]
mount_fs+0x3b/0x16a
vfs_kern_mount.part.9+0x54/0x140
do_mount+0x1fd/0xda0
ksys_mount+0xba/0xd0
__x64_sys_mount+0x21/0x30
do_syscall_64+0x60/0x210
entry_SYSCALL_64_after_hwframe+0x49/0xbe
[CAUSE]
Extent tree corruption. In this particular case, reloc tree root's
owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
corrupted and we failed the owner check in walk_up_tree().
[FIX]
It's pretty hard to take care of every extent tree corruption, but at
least we can remove such BUG_ON() and exit more gracefully.
And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
the same root (which is obviously invalid), we needs to make
__del_reloc_root() more robust to detect such invalid sharing to avoid
possible NULL dereference as root->node can be NULL in this case.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3628b4ca64 upstream.
Some qgroup trace events like btrfs_qgroup_release_data() and
btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
not enabled.
This is caused by the lack of qgroup status check before calling some
qgroup functions. Thankfully the functions can handle quota disabled
case well and just do nothing for qgroup disabled case.
This patch will do earlier check before triggering related trace events.
And for enabled <-> disabled race case:
1) For enabled->disabled case
Disable will wipe out all qgroups data including reservation and
excl/rfer. Even if we leak some reservation or numbers, it will
still be cleared, so nothing will go wrong.
2) For disabled -> enabled case
Current btrfs_qgroup_release_data() will use extent_io tree to ensure
we won't underflow reservation. And for delayed_ref we use
head->qgroup_reserved to record the reserved space, so in that case
head->qgroup_reserved should be 0 and we won't underflow.
CC: stable@vger.kernel.org # 4.14+
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fdbd1a2e4a upstream.
We must check pg_error and call error_cleanup after any call to pg_doio.
Currently, we are skipping the unlock of a page if we encounter an error in
nfs_pageio_complete() before handing off the work to the RPC layer.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 943cff67b8 upstream.
The intention of nfs4_session_set_rwsize() was to cap the r/wsize to the
buffer sizes negotiated by the CREATE_SESSION. The initial code had a
bug whereby we would not check the values negotiated by nfs_probe_fsinfo()
(the assumption being that CREATE_SESSION will always negotiate buffer values
that are sane w.r.t. the server's preferred r/wsizes) but would only check
values set by the user in the 'mount' command.
The code was changed in 4.11 to _always_ set the r/wsize, meaning that we
now never use the server preferred r/wsizes. This is the regression that
this patch fixes.
Also rename the function to nfs4_session_limit_rwsize() in order to avoid
future confusion.
Fixes: 033853325f (NFSv4.1 respect server's max size in CREATE_SESSION")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 926674de67 upstream.
Some servers (e.g. Azure) do not include a spnego blob in the SMB3
negotiate protocol response, so on kerberos mounts ("sec=krb5")
we can fail, as we expected the server to list its supported
auth types (OIDs in the spnego blob in the negprot response).
Change this so that on krb5 mounts we default to trying krb5 if the
server doesn't list its supported protocol mechanisms.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1e77a8c204 upstream.
If backupuid mount option is sent, we can incorrectly retry
(on access denied on query info) with a cifs (FindFirst) operation
on an smb3 mount which causes the server to force the session close.
We set backup intent on open so no need for this fallback.
See kernel bugzilla 201435
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 182a79e0c1 upstream.
We return most failure of dquota_initialize() except
inode evict, this could make a bit sense, for example
we allow file removal even quota files are broken?
But it dosen't make sense to allow setting project
if quota files etc are broken.
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dc7ac6c4ca upstream.
Currently, project quota could be changed by fssetxattr
ioctl, and existed permission check inode_owner_or_capable()
is obviously not enough, just think that common users could
change project id of file, that could make users to
break project quota easily.
This patch try to follow same regular of xfs project
quota:
"Project Quota ID state is only allowed to change from
within the init namespace. Enforce that restriction only
if we are trying to change the quota ID state.
Everything else is allowed in user namespaces."
Besides that, check and set project id'state should
be an atomic operation, protect whole operation with
inode lock, ext4_ioctl_setproject() is only used for
ioctl EXT4_IOC_FSSETXATTR, we have held mnt_want_write_file()
before ext4_ioctl_setflags(), and ext4_ioctl_setproject()
is called after ext4_ioctl_setflags(), we could share
codes, so remove it inside ext4_ioctl_setproject().
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 625ef8a3ac upstream.
Variable retries is not initialized in ext4_da_write_inline_data_begin()
which can lead to nondeterministic number of retries in case we hit
ENOSPC. Initialize retries to zero as we do everywhere else.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: bc0ca9df3b ("ext4: retry allocation when inline->extent conversion failed")
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ccd3c4373e upstream.
The code cleaning transaction's lists of checkpoint buffers has a bug
where it increases bh refcount only after releasing
journal->j_list_lock. Thus the following race is possible:
CPU0 CPU1
jbd2_log_do_checkpoint()
jbd2_journal_try_to_free_buffers()
__journal_try_to_free_buffer(bh)
...
while (transaction->t_checkpoint_io_list)
...
if (buffer_locked(bh)) {
<-- IO completes now, buffer gets unlocked -->
spin_unlock(&journal->j_list_lock);
spin_lock(&journal->j_list_lock);
__jbd2_journal_remove_checkpoint(jh);
spin_unlock(&journal->j_list_lock);
try_to_free_buffers(page);
get_bh(bh) <-- accesses freed bh
Fix the problem by grabbing bh reference before unlocking
journal->j_list_lock.
Fixes: dc6e8d669c ("jbd2: don't call get_bh() before calling __jbd2_journal_remove_checkpoint()")
Fixes: be1158cc61 ("jbd2: fold __process_buffer() into jbd2_log_do_checkpoint()")
Reported-by: syzbot+7f4a27091759e2fe7453@syzkaller.appspotmail.com
CC: stable@vger.kernel.org
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 164a63fa6b upstream.
This reverts commit 66110abc4c.
If we clear the cold data flag out of the writeback flow, we can miscount
-1 by end_io, which incurs a deadlock caused by all I/Os being blocked during
heavy GC.
Balancing F2FS Async:
- IO (CP: 1, Data: -1, Flush: ( 0 0 1), Discard: ( ...
GC thread: IRQ
- move_data_page()
- set_page_dirty()
- clear_cold_data()
- f2fs_write_end_io()
- type = WB_DATA_TYPE(page);
here, we get wrong type
- dec_page_count(sbi, type);
- f2fs_wait_on_page_writeback()
Cc: <stable@vger.kernel.org>
Reported-and-Tested-by: Park Ju Hyung <qkrwngud825@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f18b2b83a7 ]
If the starting block number of either the source or destination file
exceeds the EOF, EXT4_IOC_MOVE_EXT should return EINVAL.
Also fixed the helper function mext_check_coverage() so that if the
logical block is beyond EOF, make it return immediately, instead of
looping until the block number wraps all the away around. This takes
long enough that if there are multiple threads trying to do pound on
an the same inode doing non-sensical things, it can end up triggering
the kernel's soft lockup detector.
Reported-by: syzbot+c61979f6f2cba5cb3c06@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 92e2921f7e upstream.
When an invalid mount option is passed to jffs2, jffs2_parse_options()
will fail and jffs2_sb_info will be freed, but then jffs2_sb_info will
be used (use-after-free) and freeed (double-free) in jffs2_kill_sb().
Fix it by removing the buggy invocation of kfree() when getting invalid
mount options.
Fixes: 92abc475d8 ("jffs2: implement mount option parsing and compression overriding")
Cc: stable@kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 169b803397 upstream.
the victim might've been rmdir'ed just before the lock_rename();
unlike the normal callers, we do not look the source up after the
parents are locked - we know it beforehand and just recheck that it's
still the child of what used to be its parent. Unfortunately,
the check is too weak - we don't spot a dead directory since its
->d_parent is unchanged, dentry is positive, etc. So we sail all
the way to ->rename(), with hosting filesystems _not_ expecting
to be asked renaming an rmdir'ed subdirectory.
The fix is easy, fortunately - the lock on parent is sufficient for
making IS_DEADDIR() on child safe.
Cc: stable@vger.kernel.org
Fixes: 9ae326a690 (CacheFiles: A cache that backs onto a mounted filesystem)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a606ebdb85 ]
The truncate transaction does not ever modify the inode btree, but
includes an associated log reservation. Update
xfs_calc_itruncate_reservation() to remove the reservation
associated with inobt updates.
[Amir: This commit was merged for kernel v4.16 and a twin commit was
merged for xfsprogs v4.16. As a result, a small xfs filesystem
formatted with features -m rmapbt=1,reflink=1 using mkfs.xfs
version >= v4.16 cannot be mounted with kernel < v4.16.
For example, xfstests generic/17{1,2,3} format a small fs and
when trying to mount it, they fail with an assert on this very
demonic line:
XFS (vdc): Log size 3075 blocks too small, minimum size is 3717 blocks
XFS (vdc): AAIEEE! Log failed size checks. Abort!
XFS: Assertion failed: 0, file: src/linux/fs/xfs/xfs_log.c, line: 666
The simple solution for stable kernels is to apply this patch,
because mkfs.xfs v4.16 is already in the wild, so we have to
assume that xfs filesystems with a "too small" log exist.
Regardless, xfsprogs maintainers should also consider reverting
the twin patch to stop creating those filesystems for the sake
of users with unpatched kernels.]
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: <stable@vger.kernel.org> # v4.9+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J . Wong <darrick.wong@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9bdda4e9cf ]
Commit 92183a4289 ("fsnotify: fix ignore mask logic in
send_to_group()") acknoledges the use case of ignoring an event on
an inode mark, because of an ignore mask on a mount mark of the same
group (i.e. I want to get all events on this file, except for the events
that came from that mount).
This change depends on correctly merging the inode marks and mount marks
group lists, so that the mount mark ignore mask would be tested in
send_to_group(). Alas, the merging of the lists did not take into
account the case where event in question is not in the mask of any of
the mount marks.
To fix this, completely remove the tests for inode and mount event masks
from the lists merging code.
Fixes: 92183a4289 ("fsnotify: fix ignore mask logic in send_to_group")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
[amir: backport to v4.14.y]
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit a725356b66 upstream.
Commit 031a072a0b ("vfs: call vfs_clone_file_range() under freeze
protection") created a wrapper do_clone_file_range() around
vfs_clone_file_range() moving the freeze protection to former, so
overlayfs could call the latter.
The more common vfs practice is to call do_xxx helpers from vfs_xxx
helpers, where freeze protecction is taken in the vfs_xxx helper, so
this anomality could be a source of confusion.
It seems that commit 8ede205541 ("ovl: add reflink/copyfile/dedup
support") may have fallen a victim to this confusion -
ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
hope of getting freeze protection on upper fs, but in fact results in
overlayfs allowing to bypass upper fs freeze protection.
Swap the names of the two helpers to conform to common vfs practice
and call the correct helpers from overlayfs and nfsd.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 031a072a0b ("vfs: call vfs_clone_file_range() under freeze...")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6f7de19ed3 ]
Commit ff3d27a048 ("btrfs: qgroup: Finish rescan when hit the last leaf
of extent tree") added a new exit for rescan finish.
However after finishing quota rescan, we set
fs_info->qgroup_rescan_progress to (u64)-1 before we exit through the
original exit path.
While we missed that assignment of (u64)-1 in the new exit path.
The end result is, the quota status item doesn't have the same value.
(-1 vs the last bytenr + 1)
Although it doesn't affect quota accounting, it's still better to keep
the original behavior.
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Fixes: ff3d27a048 ("btrfs: qgroup: Finish rescan when hit the last leaf of extent tree")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>