The following code paths may result in high latency or even task hangs:
1. fastcommit io is throttled by wbt.
2. jbd2_fc_wait_bufs() might wait for a long time while
JBD2_FAST_COMMIT_ONGOING is set in journal->flags, and then
jbd2_journal_commit_transaction() waits for the
JBD2_FAST_COMMIT_ONGOING bit for a long time while holding the write
lock of j_state_lock.
3. start_this_handle() waits for read lock of j_state_lock which
results in high latency or task hang.
Given the fact that ext4_fc_commit() already modifies the current
process' IO priority to match that of the jbd2 thread, it should be
reasonable to match jbd2's IO submission flags as well.
Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20250827121812.1477634-1-sunjunchao@bytedance.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, in the I/O writeback path, ext4_map_blocks() may attempt to
cache additional unrelated extents in the extent status tree without
holding the inode's i_rwsem and the mapping's invalidate_lock. This can
lead to stale extent status entries remaining in certain scenarios,
potentially causing data corruption.
For example, when performing a collapse range in ext4_collapse_range(),
it clears the extent cache and dirty pages before removing blocks and
shifting extents. It also holds the i_data_sem during these two
operations. However, both ext4_ext_remove_space() and
ext4_ext_shift_extents() may briefly release the i_data_sem if journal
credits are insufficient (ext4_datasem_ensure_credits()). If another
writeback process writes dirty pages from other regions during this
interval, it may cache extents that are about to be modified. Unless
ext4_collapse_range() explicitly clears the extent cache again, these
cached entries can become stale and inconsistent with the actual
extents.
0 a n b c m
| | | | | |
[www][wwwwww][wwwwwwww]...[wwwww][wwww]...
| |
N M
Assume that block a is dirty. The collapse range operation is removing
data from n to m and drops i_data_sem immediately after removing the
extent from b to c. At the same time, a concurrent writeback begins to
write back block a; it will reloads the extent from [n, b) into the
extent status tree since it does not hold the i_rwsem or the
invalidate_lock. After the collapse range operation, it left the stale
extent [n, b), which points logical block n to N, but the actual
physical block of n should be M.
Similarly, both ext4_insert_range() and ext4_truncate() have the same
problem. ext4_punch_hole() survived since it re-add a hole extent entry
after removing space since commit 9f1118223a ("ext4: add a hole extent
entry in cache after punch").
In most cases, during dirty page writeback, the block mapping
information is likely to be found in the extent cache, making it less
necessary to search for physical extents. Consequently, loading
unrelated extent caches during writeback appears to be ineffective.
Therefore, fix this by adds EXT4_EX_NOCACHE in the writeback path to
prevent caching of unrelated extents, eliminating this potential source
of corruption.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, the EXT4_GET_BLOCKS_IO_SUBMIT flag is only used during data
writeback to indicate that in ordered mode, the journal commit thread
should skip re-submitting data and simply wait for I/O completion.
To prepare for later patches that need to detect I/O submission context
in ext4_map_blocks(), generalizes the meaning of
EXT4_GET_BLOCKS_IO_SUBMIT. This flag will be set during:
1) data I/O writeback,
2) I/O completion extents conversion,
3) journal performing commit in fast_commit.
This change doesn't affect current usage of this flag and provides a
clear way to identify I/O submission context.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch reworks fast commit's commit path to remove locking the
journal for the entire duration of a fast commit. Instead, we only lock
the journal while marking all the eligible inodes as "committing". This
allows handles to make progress in parallel with the fast commit.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://patch.msgid.link/20250508175908.1004880-5-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the inode that's being requested to track using ext4_fc_track_inode
is being committed, then wait until the inode finishes the
commit. Also, add calls to ext4_fc_track_inode at the right places.
With this patch, now calling ext4_reserve_inode_write() results in
inode being tracked for next fast commit. This ensures that by the
time ext4_reserve_inode_write() returns, it is ready to be modified
and won't be committed until the corresponding handle is open.
A subtle lock ordering requirement with i_data_sem (which is
documented in the code) requires that ext4_fc_track_inode() be called
before grabbing i_data_sem. So, this patch also adds explicit
ext4_fc_track_inode() calls in places where i_data_sem grabbed.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://patch.msgid.link/20250508175908.1004880-3-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
... rather than open-coding them. As a bonus, that avoids the pointless
work with extra allocations, etc. for long names.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The ext4_find_extent() can update the extent path so that it does not have
to allocate and free the path repeatedly, thus reducing the consumption of
memory allocation and freeing in the following functions:
ext4_ext_clear_bb
ext4_ext_replay_set_iblocks
ext4_fc_replay_add_range
ext4_fc_set_bitmaps_and_counters
No functional changes. Note that ext4_find_extent() does not support error
pointers, so in this case set path to NULL first.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20240822023545.1994557-25-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The use of path and ppath is now very confusing, so to make the code more
readable, pass path between functions uniformly, and get rid of ppath.
To get rid of the ppath in ext4_ext_insert_extent(), the following is done
here:
* Free the extents path when an error is encountered.
* Its caller needs to update ppath if it uses ppath.
* Free path when npath is used, free npath when it is not used.
* The got_allocated_blocks label in ext4_ext_map_blocks() does not
update err now, so err is updated to 0 if the err returned by
ext4_ext_search_right() is greater than 0 and is about to enter
got_allocated_blocks.
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20240822023545.1994557-15-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Function jbd2_journal_shrink_checkpoint_list() assumes that '0' is not a
valid value for transaction IDs, which is incorrect.
Furthermore, the sbi->s_fc_ineligible_tid handling also makes the same
assumption by being initialised to '0'. Fortunately, the sb flag
EXT4_MF_FC_INELIGIBLE can be used to check whether sbi->s_fc_ineligible_tid
has been previously set instead of comparing it with '0'.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240724161119.13448-5-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
When a full journal commit is on-going, any fast commit has to be enqueued
into a different queue: FC_Q_STAGING instead of FC_Q_MAIN. This enqueueing
is done only once, i.e. if an inode is already queued in a previous fast
commit entry it won't be enqueued again. However, if a full commit starts
_after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
be done into FC_Q_STAGING. And this is not being done in function
ext4_fc_track_template().
This patch fixes the issue by re-enqueuing an inode into the STAGING queue
during the fast commit clean-up callback when doing a full commit. However,
to prevent a race with a fast-commit, the clean-up callback has to be called
with the journal locked.
This bug was found using fstest generic/047. This test creates several 32k
bytes files, sync'ing each of them after it's creation, and then shutting
down the filesystem. Some data may be loss in this operation; for example a
file may have it's size truncated to zero.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240717172220.14201-1-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
When fast-commit needs to track ranges, it has to handle inodes that have
inlined data in a different way because ext4_fc_write_inode_data(), in the
actual commit path, will attempt to map the required blocks for the range.
However, inodes that have inlined data will have it's data stored in
inode->i_block and, eventually, in the extended attribute space.
Unfortunately, because fast commit doesn't currently support extended
attributes, the solution is to mark this commit as ineligible.
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1039883
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Tested-by: Ben Hutchings <benh@debian.org>
Fixes: 9725958bb7 ("ext4: fast commit may miss tracking unwritten range during ftruncate")
Link: https://patch.msgid.link/20240618144312.17786-1-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
jbd2_submit_inode_data() hardcoded use of
jbd2_journal_submit_inode_data_buffers() for submission of data pages.
Make it use j_submit_inode_data_buffers hook instead. This effectively
switches ext4 fastcommits to use ext4_writepages() for data writeout
instead of generic_writepages().
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221207112722.22220-9-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Due to several different off-by-one errors, or perhaps due to a late
change in design that wasn't fully reflected in the code that was
actually merged, there are several very strange constraints on how
fast-commit blocks are filled with tlv entries:
- tlvs must start at least 10 bytes before the end of the block, even
though the minimum tlv length is 8. Otherwise, the replay code will
ignore them. (BUG: ext4_fc_reserve_space() could violate this
requirement if called with a len of blocksize - 9 or blocksize - 8.
Fortunately, this doesn't seem to happen currently.)
- tlvs must end at least 1 byte before the end of the block. Otherwise
the replay code will consider them to be invalid. This quirk
contributed to a bug (fixed by an earlier commit) where uninitialized
memory was being leaked to disk in the last byte of blocks.
Also, strangely these constraints don't apply to the replay code in
e2fsprogs, which will accept any tlvs in the blocks (with no bounds
checks at all, but that is a separate issue...).
Given that this all seems to be a bug, let's fix it by just filling
blocks with tlv entries in the natural way.
Note that old kernels will be unable to replay fast-commit journals
created by kernels that have this commit.
Fixes: aa75f4d3da ("ext4: main fast-commit commit path")
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221106224841.279231-7-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
fast-commit of create, link, and unlink operations in encrypted
directories is completely broken because the unencrypted filenames are
being written to the fast-commit journal instead of the encrypted
filenames. These operations can't be replayed, as encryption keys
aren't present at journal replay time. It is also an information leak.
Until if/when we can get this working properly, make encrypted directory
operations ineligible for fast-commit.
Note that fast-commit operations on encrypted regular files continue to
be allowed, as they seem to work.
Fixes: aa75f4d3da ("ext4: main fast-commit commit path")
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221106224841.279231-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
With the new fortify string system, rework the memcpy to avoid this
warning:
memcpy: detected field-spanning write (size 60) of single field "&raw_inode->i_generation" at fs/ext4/fast_commit.c:1551 (size 4)
Cc: stable@kernel.org
Fixes: 54d9469bc5 ("fortify: Add run-time WARN for cross-field memcpy()")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For scan loop must ensure that at least EXT4_FC_TAG_BASE_LEN space. If remain
space less than EXT4_FC_TAG_BASE_LEN which will lead to out of bound read
when mounting corrupt file system image.
ADD_RANGE/HEAD/TAIL is needed to add extra check when do journal scan, as this
three tags will read data during scan, tag length couldn't less than data length
which will read.
Cc: stable@kernel.org
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20220924075233.2315259-4-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull ext4 updates from Ted Ts'o:
"Add new ioctls to set and get the file system UUID in the ext4
superblock and improved the performance of the online resizing of file
systems with bigalloc enabled.
Fixed a lot of bugs, in particular for the inline data feature,
potential races when creating and deleting inodes with shared extended
attribute blocks, and the handling of directory blocks which are
corrupted"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (37 commits)
ext4: add ioctls to get/set the ext4 superblock uuid
ext4: avoid resizing to a partial cluster size
ext4: reduce computation of overhead during resize
jbd2: fix assertion 'jh->b_frozen_data == NULL' failure when journal aborted
ext4: block range must be validated before use in ext4_mb_clear_bb()
mbcache: automatically delete entries from cache on freeing
mbcache: Remove mb_cache_entry_delete()
ext2: avoid deleting xattr block that is being reused
ext2: unindent codeblock in ext2_xattr_set()
ext2: factor our freeing of xattr block reference
ext4: fix race when reusing xattr blocks
ext4: unindent codeblock in ext4_xattr_block_set()
ext4: remove EA inode entry from mbcache on inode eviction
mbcache: add functions to delete entry if unused
mbcache: don't reclaim used entries
ext4: make sure ext4_append() always allocates new block
ext4: check if directory block is within i_size
ext4: reflect mb_optimize_scan value in options file
ext4: avoid remove directory when directory is corrupted
ext4: aligned '*' in comments
...
We use jbd_debug() in some places in ext4. It seems a bit strange to use
jbd2 debugging output function for ext4 code. Also these days
ext4_debug() uses dynamic printk so each debug message can be enabled /
disabled on its own so the time when it made some sense to have these
combined (to allow easier common selecting of messages to report) has
passed. Just convert all jbd_debug() uses in ext4 to ext4_debug().
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220608112355.4397-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>