linux

mirror of https://github.com/raspberrypi/linux.git synced 2025-12-27 04:22:58 +00:00

Author	SHA1	Message	Date
Julian Sun	a809967214	ext4: increase IO priority of fastcommit [ Upstream commit `46e75c56df` ] The following code paths may result in high latency or even task hangs: 1. fastcommit io is throttled by wbt. 2. jbd2_fc_wait_bufs() might wait for a long time while JBD2_FAST_COMMIT_ONGOING is set in journal->flags, and then jbd2_journal_commit_transaction() waits for the JBD2_FAST_COMMIT_ONGOING bit for a long time while holding the write lock of j_state_lock. 3. start_this_handle() waits for read lock of j_state_lock which results in high latency or task hang. Given the fact that ext4_fc_commit() already modifies the current process' IO priority to match that of the jbd2 thread, it should be reasonable to match jbd2's IO submission flags as well. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Julian Sun <sunjunchao@bytedance.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20250827121812.1477634-1-sunjunchao@bytedance.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-11-13 15:37:31 -05:00
chuguangqing	009127b0fc	fs: ext4: change GFP_KERNEL to GFP_NOFS to avoid deadlock [ Upstream commit `1534f72dc2` ] The parent function ext4_xattr_inode_lookup_create already uses GFP_NOFS for memory alloction, so the function ext4_xattr_inode_cache_find should use same gfp_flag. Signed-off-by: chuguangqing <chuguangqing@inspur.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-11-13 15:37:31 -05:00
Deepanshu Kartikey	1f5ccd22ff	ext4: detect invalid INLINE_DATA + EXTENTS flag combination commit `1d3ad18394` upstream. syzbot reported a BUG_ON in ext4_es_cache_extent() when opening a verity file on a corrupted ext4 filesystem mounted without a journal. The issue is that the filesystem has an inode with both the INLINE_DATA and EXTENTS flags set: EXT4-fs error (device loop0): ext4_cache_extents:545: inode #15: comm syz.0.17: corrupted extent tree: lblk 0 < prev 66 Investigation revealed that the inode has both flags set: DEBUG: inode 15 - flag=1, i_inline_off=164, has_inline=1, extents_flag=1 This is an invalid combination since an inode should have either: - INLINE_DATA: data stored directly in the inode - EXTENTS: data stored in extent-mapped blocks Having both flags causes ext4_has_inline_data() to return true, skipping extent tree validation in __ext4_iget(). The unvalidated out-of-order extents then trigger a BUG_ON in ext4_es_cache_extent() due to integer underflow when calculating hole sizes. Fix this by detecting this invalid flag combination early in ext4_iget() and rejecting the corrupted inode. Cc: stable@kernel.org Reported-and-tested-by: syzbot+038b7bf43423e132b308@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=038b7bf43423e132b308 Suggested-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Message-ID: <20250930112810.315095-1-kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-23 16:24:23 +02:00
Zhang Yi	77e5cbdbb4	ext4: wait for ongoing I/O to complete before freeing blocks commit `328a782cb1` upstream. When freeing metadata blocks in nojournal mode, ext4_forget() calls bforget() to clear the dirty flag on the buffer_head and remvoe associated mappings. This is acceptable if the metadata has not yet begun to be written back. However, if the write-back has already started but is not yet completed, ext4_forget() will have no effect. Subsequently, ext4_mb_clear_bb() will immediately return the block to the mb allocator. This block can then be reallocated immediately, potentially causing an data corruption issue. Fix this by clearing the buffer's dirty flag and waiting for the ongoing I/O to complete, ensuring that no further writes to stale data will occur. Fixes: `16e08b14a4` ("ext4: cleanup clean_bdev_aliases() calls") Cc: stable@kernel.org Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com> Closes: https://lore.kernel.org/linux-ext4/a9417096-9549-4441-9878-b1955b899b4e@huaweicloud.com/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20250916093337.3161016-3-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-23 16:24:23 +02:00
Jan Kara	c6d254748b	ext4: free orphan info with kvfree commit `971843c511` upstream. Orphan info is now getting allocated with kvmalloc_array(). Free it with kvfree() instead of kfree() to avoid complaints from mm. Reported-by: Chris Mason <clm@meta.com> Fixes: `0a6ce20c15` ("ext4: verify orphan file size is not too big") Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Message-ID: <20251007134936.7291-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-19 16:37:42 +02:00
Deepanshu Kartikey	a3e039869e	ext4: validate ea_ino and size in check_xattrs commit `44d2a72f4d` upstream. During xattr block validation, check_xattrs() processes xattr entries without validating that entries claiming to use EA inodes have non-zero sizes. Corrupted filesystems may contain xattr entries where e_value_size is zero but e_value_inum is non-zero, indicating invalid xattr data. Add validation in check_xattrs() to detect this corruption pattern early and return -EFSCORRUPTED, preventing invalid xattr entries from causing issues throughout the ext4 codebase. Cc: stable@kernel.org Suggested-by: Theodore Ts'o <tytso@mit.edu> Reported-by: syzbot+4c9d23743a2409b80293@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=4c9d23743a2409b80293 Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Message-ID: <20250923133245.1091761-1-kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-19 16:37:41 +02:00
Ahmet Eray Karadag	440b003f44	ext4: guard against EA inode refcount underflow in xattr update commit `57295e8354` upstream. syzkaller found a path where ext4_xattr_inode_update_ref() reads an EA inode refcount that is already <= 0 and then applies ref_change (often -1). That lets the refcount underflow and we proceed with a bogus value, triggering errors like: EXT4-fs error: EA inode <n> ref underflow: ref_count=-1 ref_change=-1 EXT4-fs warning: ea_inode dec ref err=-117 Make the invariant explicit: if the current refcount is non-positive, treat this as on-disk corruption, emit ext4_error_inode(), and fail the operation with -EFSCORRUPTED instead of updating the refcount. Delete the WARN_ONCE() as negative refcounts are now impossible; keep error reporting in ext4_error_inode(). This prevents the underflow and the follow-on orphan/cleanup churn. Reported-by: syzbot+0be4f339a8218d2a5bb1@syzkaller.appspotmail.com Fixes: https://syzbot.org/bug?extid=0be4f339a8218d2a5bb1 Cc: stable@kernel.org Co-developed-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Ahmet Eray Karadag <eraykrdg1@gmail.com> Message-ID: <20250920021342.45575-1-eraykrdg1@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-19 16:37:41 +02:00
Zhang Yi	3fb2b7550d	ext4: fix an off-by-one issue during moving extents commit `12e803c882` upstream. During the movement of a written extent, mext_page_mkuptodate() is called to read data in the range [from, to) into the page cache and to update the corresponding buffers. Therefore, we should not wait on any buffer whose start offset is >= 'to'. Otherwise, it will return -EIO and fail the extents movement. $ for i in `seq 3 -1 0`; \ do xfs_io -fs -c "pwrite -b 1024 $((i * 1024)) 1024" /mnt/foo; \ done $ umount /mnt && mount /dev/pmem1s /mnt # drop cache $ e4defrag /mnt/foo e4defrag 1.47.0 (5-Feb-2023) ext4 defragmentation for /mnt/foo [1/1]/mnt/foo: 0% [ NG ] Success: [0/1] Cc: stable@kernel.org Fixes: `a40759fb16` ("ext4: remove array of buffer_heads from mext_page_mkuptodate()") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20250912105841.1886799-1-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-19 16:37:41 +02:00
Theodore Ts'o	a6e94557cd	ext4: avoid potential buffer over-read in parse_apply_sb_mount_options() commit `8ecb790ea8` upstream. Unlike other strings in the ext4 superblock, we rely on tune2fs to make sure s_mount_opts is NUL terminated. Harden parse_apply_sb_mount_options() by treating s_mount_opts as a potential __nonstring. Cc: stable@vger.kernel.org Fixes: `8b67f04ab9` ("ext4: Add mount options in superblock") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Message-ID: <20250916-tune2fs-v2-1-d594dc7486f0@mit.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-19 16:37:41 +02:00
Ojaswin Mujoo	3ae197d844	ext4: correctly handle queries for metadata mappings commit `46c22a8bb4` upstream. Currently, our handling of metadata is _ambiguous_ in some scenarios, that is, we end up returning unknown if the range only covers the mapping partially. For example, in the following case: $ xfs_io -c fsmap -d 0: 254:16 [0..7]: static fs metadata 8 1: 254:16 [8..15]: special 102:1 8 2: 254:16 [16..5127]: special 102:2 5112 3: 254:16 [5128..5255]: special 102:3 128 4: 254:16 [5256..5383]: special 102:4 128 5: 254:16 [5384..70919]: inodes 65536 6: 254:16 [70920..70967]: unknown 48 ... $ xfs_io -c fsmap -d 24 33 0: 254:16 [24..39]: unknown 16 <--- incomplete reporting $ xfs_io -c fsmap -d 24 33 (With patch) 0: 254:16 [16..5127]: special 102:2 5112 This is because earlier in ext4_getfsmap_meta_helper, we end up ignoring any extent that starts before our queried range, but overlaps it. While the man page [1] is a bit ambiguous on this, this fix makes the output make more sense since we are anyways returning an "unknown" extent. This is also consistent to how XFS does it: $ xfs_io -c fsmap -d ... 6: 254:16 [104..127]: free space 24 7: 254:16 [128..191]: inodes 64 ... $ xfs_io -c fsmap -d 137 150 0: 254:16 [128..191]: inodes 64 <-- full extent returned [1] https://man7.org/linux/man-pages/man2/ioctl_getfsmap.2.html Reported-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: stable@kernel.org Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <023f37e35ee280cd9baac0296cbadcbe10995cab.1757058211.git.ojaswin@linux.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-19 16:37:41 +02:00
Yongjian Sun	d33beb49b1	ext4: increase i_disksize to offset + len in ext4_update_disksize_before_punch() commit `9d80eaa1a1` upstream. After running a stress test combined with fault injection, we performed fsck -a followed by fsck -fn on the filesystem image. During the second pass, fsck -fn reported: Inode 131512, end of extent exceeds allowed value (logical block 405, physical block 1180540, len 2) This inode was not in the orphan list. Analysis revealed the following call chain that leads to the inconsistency: ext4_da_write_end() //does not update i_disksize ext4_punch_hole() //truncate folio, keep size ext4_page_mkwrite() ext4_block_page_mkwrite() ext4_block_write_begin() ext4_get_block() //insert written extent without update i_disksize journal commit echo 1 > /sys/block/xxx/device/delete da-write path updates i_size but does not update i_disksize. Then ext4_punch_hole truncates the da-folio yet still leaves i_disksize unchanged(in the ext4_update_disksize_before_punch function, the condition offset + len < size is met). Then ext4_page_mkwrite sees ext4_nonda_switch return 1 and takes the nodioread_nolock path, the folio about to be written has just been punched out, and it’s offset sits beyond the current i_disksize. This may result in a written extent being inserted, but again does not update i_disksize. If the journal gets committed and then the block device is yanked, we might run into this. It should be noted that replacing ext4_punch_hole with ext4_zero_range in the call sequence may also trigger this issue, as neither will update i_disksize under these circumstances. To fix this, we can modify ext4_update_disksize_before_punch to increase i_disksize to min(i_size, offset + len) when both i_size and (offset + len) are greater than i_disksize. Cc: stable@kernel.org Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20250911133024.1841027-1-sunyongjian@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-19 16:37:41 +02:00
Jan Kara	2b9da798ff	ext4: verify orphan file size is not too big commit `0a6ce20c15` upstream. In principle orphan file can be arbitrarily large. However orphan replay needs to traverse it all and we also pin all its buffers in memory. Thus filesystems with absurdly large orphan files can lead to big amounts of memory consumed. Limit orphan file size to a sane value and also use kvmalloc() for allocating array of block descriptor structures to avoid large order allocations for sane but large orphan files. Reported-by: syzbot+0b92850d68d9b12934f5@syzkaller.appspotmail.com Fixes: `02f310fcf4` ("ext4: Speedup ext4 orphan inode handling") Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Message-ID: <20250909112206.10459-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-19 16:37:41 +02:00
Jan Kara	e9663669a5	ext4: fail unaligned direct IO write with EINVAL commit `963845748f` upstream. Commit `bc264fea0f` ("iomap: support incremental iomap_iter advances") changed the error handling logic in iomap_iter(). Previously any error from iomap_dio_bio_iter() got propagated to userspace, after this commit if ->iomap_end returns error, it gets propagated to userspace instead of an error from iomap_dio_bio_iter(). This results in unaligned writes to ext4 to silently fallback to buffered IO instead of erroring out. Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems unnecessary these days. It is enough to return ENOTBLK from ext4_iomap_begin() when we don't support DIO write for that particular file offset (due to hole). Fixes: `bc264fea0f` ("iomap: support incremental iomap_iter advances") Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Message-ID: <20250901112739.32484-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-19 16:37:41 +02:00
Baokun Li	d9774bbe39	ext4: add ext4_sb_bread_nofail() helper function for ext4_free_branches() commit `d8b90e6387` upstream. The implicit __GFP_NOFAIL flag in ext4_sb_bread() was removed in commit `8a83ac5494` ("ext4: call bdev_getblk() from sb_getblk_gfp()"), meaning the function can now fail under memory pressure. Most callers of ext4_sb_bread() propagate the error to userspace and do not remount the filesystem read-only. However, ext4_free_branches() handles ext4_sb_bread() failure by remounting the filesystem read-only. This implies that an ext3 filesystem (mounted via the ext4 driver) could be forcibly remounted read-only due to a transient page allocation failure, which is unacceptable. To mitigate this, introduce a new helper function, ext4_sb_bread_nofail(), which explicitly uses __GFP_NOFAIL, and use it in ext4_free_branches(). Fixes: `8a83ac5494` ("ext4: call bdev_getblk() from sb_getblk_gfp()") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-19 16:37:40 +02:00
Jan Kara	1d73b52d20	ext4: fix checks for orphan inodes commit `acf943e976` upstream. When orphan file feature is enabled, inode can be tracked as orphan either in the standard orphan list or in the orphan file. The first can be tested by checking ei->i_orphan list head, the second is recorded by EXT4_STATE_ORPHAN_FILE inode state flag. There are several places where we want to check whether inode is tracked as orphan and only some of them properly check for both possibilities. Luckily the consequences are mostly minor, the worst that can happen is that we track an inode as orphan although we don't need to and e2fsck then complains (resulting in occasional ext4/307 xfstest failures). Fix the problem by introducing a helper for checking whether an inode is tracked as orphan and use it in appropriate places. Fixes: `4a79a98c7b` ("ext4: Improve scalability of ext4 orphan file handling") Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Message-ID: <20250925123038.20264-2-jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-15 12:04:20 +02:00
Baokun Li	00110f3cfc	ext4: fix potential null deref in ext4_mb_init() commit `3c3fac6bc0` upstream. In ext4_mb_init(), ext4_mb_avg_fragment_size_destroy() may be called when sbi->s_mb_avg_fragment_size remains uninitialized (e.g., if groupinfo slab cache allocation fails). Since ext4_mb_avg_fragment_size_destroy() lacks null pointer checking, this leads to a null pointer dereference. ================================================================== EXT4-fs: no memory for groupinfo slab cache BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: Oops: 0002 [#1] SMP PTI CPU:2 UID: 0 PID: 87 Comm:mount Not tainted 6.17.0-rc2 #1134 PREEMPT(none) RIP: 0010:_raw_spin_lock_irqsave+0x1b/0x40 Call Trace: <TASK> xa_destroy+0x61/0x130 ext4_mb_init+0x483/0x540 __ext4_fill_super+0x116d/0x17b0 ext4_fill_super+0xd3/0x280 get_tree_bdev_flags+0x132/0x1d0 vfs_get_tree+0x29/0xd0 do_new_mount+0x197/0x300 __x64_sys_mount+0x116/0x150 do_syscall_64+0x50/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ================================================================== Therefore, add necessary null check to ext4_mb_avg_fragment_size_destroy() to prevent this issue. The same fix is also applied to ext4_mb_largest_free_orders_destroy(). Reported-by: syzbot+1713b1aa266195b916c2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1713b1aa266195b916c2 Cc: stable@kernel.org Fixes: `f7eaacbb4e` ("ext4: convert free groups order lists to xarrays") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-10-15 12:04:20 +02:00
Linus Torvalds	074e461d9e	Merge tag 'ext4_for_linus-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: - Fix fast commit checks for file systems with ea_inode enabled - Don't drop the i_version mount option on a remount - Fix FIEMAP reporting when there are holes in a bigalloc file system - Don't fail when mounting read-only when there are inodes in the orphan file - Fix hole length overflow for indirect mapped files on file systems with an 8k or 16k block file system * tag 'ext4_for_linus-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: prevent softlockup in jbd2_log_do_checkpoint() ext4: fix incorrect function name in comment ext4: use kmalloc_array() for array space allocation ext4: fix hole length calculation overflow in non-extent inodes ext4: don't try to clear the orphan_present feature block device is r/o ext4: fix reserved gdt blocks handling in fsmap ext4: fix fsmap end of range reporting with bigalloc ext4: remove redundant __GFP_NOWARN ext4: fix unused variable warning in ext4_init_new_dir ext4: remove useless if check ext4: check fast symlink for ea_inode correctly ext4: preserve SB_I_VERSION on remount ext4: show the default enabled i_version option	2025-08-18 09:01:00 -07:00
Baolin Liu	757fc66da9	ext4: fix incorrect function name in comment Since commit `6b730a4050` “ext4: hoist ext4_block_write_begin and replace the __block_write_begin”, the comment should be updated accordingly from "__block_write_begin" to "ext4_block_write_begin". Fixes: `6b730a4050` (“ext4: hoist ext4_block_write_begin and replace...") Signed-off-by: Baolin Liu <liubaolin@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/20250812021709.1120716-1-liubaolin12138@163.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-13 14:18:50 -04:00
Liao Yuanhong	76dba1fe27	ext4: use kmalloc_array() for array space allocation Replace kmalloc(size * sizeof) with kmalloc_array() for safer memory allocation and overflow prevention. Cc: stable@kernel.org Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Link: https://patch.msgid.link/20250811125816.570142-1-liaoyuanhong@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Zhang Yi	02c7f7219a	ext4: fix hole length calculation overflow in non-extent inodes In a filesystem with a block size larger than 4KB, the hole length calculation for a non-extent inode in ext4_ind_map_blocks() can easily exceed INT_MAX. Then it could return a zero length hole and trigger the following waring and infinite in the iomap infrastructure. ------------[ cut here ]------------ WARNING: CPU: 3 PID: 434101 at fs/iomap/iter.c:34 iomap_iter_done+0x148/0x190 CPU: 3 UID: 0 PID: 434101 Comm: fsstress Not tainted 6.16.0-rc7+ #128 PREEMPT(voluntary) Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : iomap_iter_done+0x148/0x190 lr : iomap_iter+0x174/0x230 sp : ffff8000880af740 x29: ffff8000880af740 x28: ffff0000db8e6840 x27: 0000000000000000 x26: 0000000000000000 x25: ffff8000880af830 x24: 0000004000000000 x23: 0000000000000002 x22: 000001bfdbfa8000 x21: ffffa6a41c002e48 x20: 0000000000000001 x19: ffff8000880af808 x18: 0000000000000000 x17: 0000000000000000 x16: ffffa6a495ee6cd0 x15: 0000000000000000 x14: 00000000000003d4 x13: 00000000fa83b2da x12: 0000b236fc95f18c x11: ffffa6a4978b9c08 x10: 0000000000001da0 x9 : ffffa6a41c1a2a44 x8 : ffff8000880af5c8 x7 : 0000000001000000 x6 : 0000000000000000 x5 : 0000000000000004 x4 : 000001bfdbfa8000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : 0000004004030000 x0 : 0000000000000000 Call trace: iomap_iter_done+0x148/0x190 (P) iomap_iter+0x174/0x230 iomap_fiemap+0x154/0x1d8 ext4_fiemap+0x110/0x140 [ext4] do_vfs_ioctl+0x4b8/0xbc0 __arm64_sys_ioctl+0x8c/0x120 invoke_syscall+0x6c/0x100 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x38/0x120 el0t_64_sync_handler+0x10c/0x138 el0t_64_sync+0x198/0x1a0 ---[ end trace 0000000000000000 ]--- Cc: stable@kernel.org Fixes: `facab4d971` ("ext4: return hole from ext4_map_blocks()") Reported-by: Qu Wenruo <wqu@suse.com> Closes: https://lore.kernel.org/linux-ext4/9b650a52-9672-4604-a765-bb6be55d1e4a@gmx.com/ Tested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250811064532.1788289-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Theodore Ts'o	c5e104a91e	ext4: don't try to clear the orphan_present feature block device is r/o When the file system is frozen in preparation for taking an LVM snapshot, the journal is checkpointed and if the orphan_file feature is enabled, and the orphan file is empty, we clear the orphan_present feature flag. But if there are pending inodes that need to be removed the orphan_present feature flag can't be cleared. The problem comes if the block device is read-only. In that case, we can't process the orphan inode list, so it is skipped in ext4_orphan_cleanup(). But then in ext4_mark_recovery_complete(), this results in the ext4 error "Orphan file not empty on read-only fs" firing and the file system mount is aborted. Fix this by clearing the needs_recovery flag in the block device is read-only. We do this after the call to ext4_load_and_init-journal() since there are some error checks need to be done in case the journal needs to be replayed and the block device is read-only, or if the block device containing the externa journal is read-only, etc. Cc: stable@kernel.org Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1108271 Cc: stable@vger.kernel.org Fixes: `02f310fcf4` ("ext4: Speedup ext4 orphan inode handling") Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Ojaswin Mujoo	3ffbdd1f11	ext4: fix reserved gdt blocks handling in fsmap In some cases like small FSes with no meta_bg and where the resize doesn't need extra gdt blocks as it can fit in the current one, s_reserved_gdt_blocks is set as 0, which causes fsmap to emit a 0 length entry, which is incorrect. $ mkfs.ext4 -b 65536 -O bigalloc /dev/sda 5G $ mount /dev/sda /mnt/scratch $ xfs_io -c "fsmap -d" /mnt/scartch 0: 253:48 [0..127]: static fs metadata 128 1: 253:48 [128..255]: special 102:1 128 2: 253:48 [256..255]: special 102:2 0 <---- 0 len entry 3: 253:48 [256..383]: special 102:3 128 Fix this by adding a check for this case. Cc: stable@kernel.org Fixes: `0c9ec4beec` ("ext4: support GETFSMAP ioctls") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/08781b796453a5770112aa96ad14c864fbf31935.1754377641.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Ojaswin Mujoo	bae76c035b	ext4: fix fsmap end of range reporting with bigalloc With bigalloc enabled, the logic to report last extent has a bug since we try to use cluster units instead of block units. This can cause an issue where extra incorrect entries might be returned back to the user. This was flagged by generic/365 with 64k bs and -O bigalloc. Details of issue The issue was noticed on 5G 64k blocksize FS with -O bigalloc which has only 1 bg. $ xfs_io -c "fsmap -d" /mnt/scratch 0: 253:48 [0..127]: static fs metadata 128 /* sb / 1: 253:48 [128..255]: special 102:1 128 / gdt / 3: 253:48 [256..383]: special 102:3 128 / block bitmap / 4: 253:48 [384..2303]: unknown 1920 / flex bg empty space / 5: 253:48 [2304..2431]: special 102:4 128 / inode bitmap / 6: 253:48 [2432..4351]: unknown 1920 / flex bg empty space / 7: 253:48 [4352..6911]: inodes 2560 8: 253:48 [6912..538623]: unknown 531712 9: 253:48 [538624..10485759]: free space `9947136` The issue can be seen with: $ xfs_io -c "fsmap -d 0 3" /mnt/scratch 0: 253:48 [0..127]: static fs metadata 128 1: 253:48 [384..2047]: unknown 1664 Only the first entry was expected to be returned but we get 2. This is because: ext4_getfsmap_datadev() first_cluster, last_cluster = 0 ... info->gfi_last = true; ext4_getfsmap_datadev_helper(sb, end_ag, last_cluster + 1, 0, info); fsb = C2B(1) = 16 fslen = 0 ... / Merge in any relevant extents from the meta_list / list_for_each_entry_safe(p, tmp, &info->gfi_meta_list, fmr_list) { ... // since fsb = 16, considers all metadata which starts before 16 blockno iter 1: error = ext4_getfsmap_helper(sb, info, p); // p = sb (0,1), nop info->gfi_next_fsblk = 1 iter 2: error = ext4_getfsmap_helper(sb, info, p); // p = gdt (1,2), nop info->gfi_next_fsblk = 2 iter 3: error = ext4_getfsmap_helper(sb, info, p); // p = blk bitmap (2,3), nop info->gfi_next_fsblk = 3 iter 4: error = ext4_getfsmap_helper(sb, info, p); // p = ino bitmap (18,19) if (rec_blk > info->gfi_next_fsblk) { // (18 > 3) // emits an extra entry * BUG ** } } Fix this by directly calling ext4_getfsmap_datadev() with a dummy record that has fmr_physical set to (end_fsb + 1) instead of last_cluster + 1. By using the block instead of cluster we get the correct behavior. Replacing ext4_getfsmap_datadev_helper() with ext4_getfsmap_helper() is okay since the gfi_lastfree and metadata checks in ext4_getfsmap_datadev_helper() are anyways redundant when we only want to emit the last allocated block of the range, as we have already taken care of emitting metadata and any last free blocks. Cc: stable@kernel.org Reported-by: Disha Goel <disgoel@linux.ibm.com> Fixes: `4a622e4d47` ("ext4: fix FS_IOC_GETFSMAP handling") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/e7472c8535c9c5ec10f425f495366864ea12c9da.1754377641.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Qianfeng Rong	4ba97589ed	ext4: remove redundant __GFP_NOWARN GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant __GFP_NOWARN. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Link: https://patch.msgid.link/20250803102243.623705-4-rongqianfeng@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Theodore Ts'o	59d8731c88	ext4: fix unused variable warning in ext4_init_new_dir Cc: stable@kernel.org Fixes: `90f097b140` ("ext4: refactor the inline directory conversion and...") Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Antonio Quartulli	f3fbaa74d9	ext4: remove useless if check This if branch is only jumping to 'out' which is defined just after the branch itself. Hence this is if-check is a no-op and can be removed. Address-Coverity-ID: 1647981 ("Incorrect expression (IDENTICAL_BRANCHES)") Signed-off-by: Antonio Quartulli <antonio@mandelbit.com> Link: https://patch.msgid.link/20250721200902.1071-1-antonio@mandelbit.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Andreas Dilger	b4cc4a4077	ext4: check fast symlink for ea_inode correctly The check for a fast symlink in the presence of only an external xattr inode is incorrect. If a fast symlink does not have an xattr block (i_file_acl == 0), but does have an external xattr inode that increases inode i_blocks, then the check for a fast symlink will incorrectly fail and __ext4_iget()->ext4_ind_check_inode() will report the inode is corrupt when it "validates" i_data[] on the next read: # ln -s foo /mnt/tmp/bar # setfattr -h -n trusted.test \ -v "$(yes \| head -n 4000)" /mnt/tmp/bar # umount /mnt/tmp # mount /mnt/tmp # ls -l /mnt/tmp ls: cannot access '/mnt/tmp/bar': Structure needs cleaning total 4 ? l?????????? ? ? ? ? ? bar # dmesg \| tail -1 EXT4-fs error (device dm-8): __ext4_iget:5098: inode #24578: block 7303014: comm ls: invalid block (note that "block 7303014" = 0x6f6f66 = "foo" in LE order). ext4_inode_is_fast_symlink() should check the superblock EXT4_FEATURE_INCOMPAT_EA_INODE feature flag, not the inode EXT4_EA_INODE_FL, since the latter is only set on the xattr inode itself, and not on the inode that uses this xattr. Cc: stable@vger.kernel.org Fixes: `fc82228a5e` ("ext4: support fast symlinks from ext3 file systems") Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Reviewed-by: Li Dongyang <dongyangli@ddn.com> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Reviewed-by: Oleg Drokin <green@whamcloud.com> Reviewed-on: https://review.whamcloud.com/59879 Lustre-bug-id: https://jira.whamcloud.com/browse/LU-19121 Link: https://patch.msgid.link/20250717063709.757077-1-adilger@dilger.ca Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:15:05 -04:00
Baokun Li	f2326fd14a	ext4: preserve SB_I_VERSION on remount IMA testing revealed that after an ext4 remount, file accesses triggered full measurements even without modifications, instead of skipping as expected when i_version is unchanged. Debugging showed `SB_I_VERSION` was cleared in reconfigure_super() during remount due to commit `1ff2030739` ("ext4: unconditionally enable the i_version counter") removing the fix from commit `960e0ab63b` ("ext4: fix i_version handling on remount"). To rectify this, `SB_I_VERSION` is always set for `fc->sb_flags` in ext4_init_fs_context(), instead of `sb->s_flags` in __ext4_fill_super(), ensuring it persists across all mounts. Cc: stable@kernel.org Fixes: `1ff2030739` ("ext4: unconditionally enable the i_version counter") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250703073903.6952-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 23:14:58 -04:00
Baokun Li	6a912e8aa2	ext4: show the default enabled i_version option Display `i_version` in `/proc/fs/ext4/sdx/options`, even though it's default enabled. This aids users managing multi-version scenarios and simplifies debugging. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250703073903.6952-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-08-12 21:28:07 -04:00
Linus Torvalds	beace86e61	Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "As usual, many cleanups. The below blurbiage describes 42 patchsets. 21 of those are partially or fully cleanup work. "cleans up", "cleanup", "maintainability", "rationalizes", etc. I never knew the MM code was so dirty. "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes) addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for merging with existing adjacent VMAs. "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park) adds a new kernel module which simplifies the setup and usage of DAMON in production environments. "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig) is a cleanup to the writeback code which removes a couple of pointers from struct writeback_control. "drivers/base/node.c: optimization and cleanups" (Donet Tom) contains largely uncorrelated cleanups to the NUMA node setup and management code. "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman) does some maintenance work on the userfaultfd code. "Readahead tweaks for larger folios" (Ryan Roberts) implements some tuneups for pagecache readahead when it is reading into order>0 folios. "selftests/mm: Tweaks to the cow test" (Mark Brown) provides some cleanups and consistency improvements to the selftests code. "Optimize mremap() for large folios" (Dev Jain) does that. A 37% reduction in execution time was measured in a memset+mremap+munmap microbenchmark. "Remove zero_user()" (Matthew Wilcox) expunges zero_user() in favor of the more modern memzero_page(). "mm/huge_memory: vmf_insert_folio_() and vmf_insert_pfn_pud() fixes" (David Hildenbrand) addresses some warts which David noticed in the huge page code. These were not known to be causing any issues at this time. "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park) provides some cleanup and consolidation work in DAMON. "use vm_flags_t consistently" (Lorenzo Stoakes) uses vm_flags_t in places where we were inappropriately using other types. "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy) increases the reliability of large page allocation in the memfd code. "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple) removes several now-unneeded PFN_ flags. "mm/damon: decouple sysfs from core" (SeongJae Park) implememnts some cleanup and maintainability work in the DAMON sysfs layer. "madvise cleanup" (Lorenzo Stoakes) does quite a lot of cleanup/maintenance work in the madvise() code. "madvise anon_name cleanups" (Vlastimil Babka) provides additional cleanups on top or Lorenzo's effort. "Implement numa node notifier" (Oscar Salvador) creates a standalone notifier for NUMA node memory state changes. Previously these were lumped under the more general memory on/offline notifier. "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan) cleans up the pageblock isolation code and fixes a potential issue which doesn't seem to cause any problems in practice. "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park) adds additional drgn- and python-based DAMON selftests which are more comprehensive than the existing selftest suite. "Misc rework on hugetlb faulting path" (Oscar Salvador) fixes a rather obscure deadlock in the hugetlb fault code and follows that fix with a series of cleanups. "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport) rationalizes and cleans up the highmem-specific code in the CMA allocator. "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand) provides cleanups and future-preparedness to the migration code. "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park) adds some tracepoints to some DAMON auto-tuning code. "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park) does that. "mm/damon: misc cleanups" (SeongJae Park) also does what it claims. "mm: folio_pte_batch() improvements" (David Hildenbrand) cleans up the large folio PTE batching code. "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park) facilitates dynamic alteration of DAMON's inter-node allocation policy. "Remove unmap_and_put_page()" (Vishal Moola) provides a couple of page->folio conversions. "mm: per-node proactive reclaim" (Davidlohr Bueso) implements a per-node control of proactive reclaim - beyond the current memcg-based implementation. "mm/damon: remove damon_callback" (SeongJae Park) replaces the damon_callback interface with a more general and powerful damon_call()+damos_walk() interface. "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes) implements a number of mremap cleanups (of course) in preparation for adding new mremap() functionality: newly permit the remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It still excludes some specialized situations where this cannot be performed reliably. "drop hugetlb_free_pgd_range()" (Anthony Yznaga) switches some sparc hugetlb code over to the generic version and removes the thus-unneeded hugetlb_free_pgd_range(). "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park) augments the present userspace-requested update of DAMON sysfs monitoring files. Automatic update is now provided, along with a tunable to control the update interval. "Some randome fixes and cleanups to swapfile" (Kemeng Shi) does what is claims. "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand) provides (and uses) a means by which debug-style functions can grab a copy of a pageframe and inspect it locklessly without tripping over the races inherent in operating on the live pageframe directly. "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan) addresses the large contention issues which can be triggered by reads from that procfs file. Latencies are reduced by more than half in some situations. The series also introduces several new selftests for the /proc/pid/maps interface. "__folio_split() clean up" (Zi Yan) cleans up __folio_split()! "Optimize mprotect() for large folios" (Dev Jain) provides some quite large (>3x) speedups to mprotect() when dealing with large folios. "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian) does some cleanup work in the selftests code. "tools/testing: expand mremap testing" (Lorenzo Stoakes) extends the mremap() selftest in several ways, including adding more checking of Lorenzo's recently added "permit mremap() move of multiple VMAs" feature. "selftests/damon/sysfs.py: test all parameters" (SeongJae Park) extends the DAMON sysfs interface selftest so that it tests all possible user-requested parameters. Rather than the present minimal subset" * tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits) MAINTAINERS: add missing headers to mempory policy & migration section MAINTAINERS: add missing file to cgroup section MAINTAINERS: add MM MISC section, add missing files to MISC and CORE MAINTAINERS: add missing zsmalloc file MAINTAINERS: add missing files to page alloc section MAINTAINERS: add missing shrinker files MAINTAINERS: move memremap.[ch] to hotplug section MAINTAINERS: add missing mm_slot.h file THP section MAINTAINERS: add missing interval_tree.c to memory mapping section MAINTAINERS: add missing percpu-internal.h file to per-cpu section mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info() selftests/damon: introduce _common.sh to host shared function selftests/damon/sysfs.py: test runtime reduction of DAMON parameters selftests/damon/sysfs.py: test non-default parameters runtime commit selftests/damon/sysfs.py: generalize DAMON context commit assertion selftests/damon/sysfs.py: generalize monitoring attributes commit assertion selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion selftests/damon/sysfs.py: test DAMOS filters commitment selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion selftests/damon/sysfs.py: test DAMOS destinations commitment ...	2025-07-31 14:57:54 -07:00
Linus Torvalds	ff7dcfedf9	Merge tag 'ext4_for_linus_6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Major ext4 changes for 6.17: - Better scalability for ext4 block allocation - Fix insufficient credits when writing back large folios Miscellaneous bug fixes, especially when handling exteded attriutes, inline data, and fast commit" * tag 'ext4_for_linus_6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (39 commits) ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr ext4: implement linear-like traversal across order xarrays ext4: refactor choose group to scan group ext4: convert free groups order lists to xarrays ext4: factor out ext4_mb_scan_group() ext4: factor out ext4_mb_might_prefetch() ext4: factor out __ext4_mb_scan_group() ext4: fix largest free orders lists corruption on mb_optimize_scan switch ext4: fix zombie groups in average fragment size lists ext4: merge freed extent with existing extents before insertion ext4: convert sbi->s_mb_free_pending to atomic_t ext4: fix typo in CR_GOAL_LEN_SLOW comment ext4: get rid of some obsolete EXT4_MB_HINT flags ext4: utilize multiple global goals to reduce contention ext4: remove unnecessary s_md_lock on update s_mb_last_group ext4: remove unnecessary s_mb_last_start ext4: separate stream goal hits from s_bal_goals for better tracking ext4: add ext4_try_lock_group() to skip busy groups ext4: initialize superblock fields in the kballoc-test.c kunit tests ext4: refactor the inline directory conversion and new directory codepaths ...	2025-07-31 10:02:44 -07:00
Linus Torvalds	57fcb7d930	Merge tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fileattr updates from Christian Brauner: "This introduces the new file_getattr() and file_setattr() system calls after lengthy discussions. Both system calls serve as successors and extensible companions to the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have started to show their age in addition to being named in a way that makes it easy to conflate them with extended attribute related operations. These syscalls allow userspace to set filesystem inode attributes on special files. One of the usage examples is the XFS quota projects. XFS has project quotas which could be attached to a directory. All new inodes in these directories inherit project ID set on parent directory. The project is created from userspace by opening and calling FS_IOC_FSSETXATTR on each inode. This is not possible for special files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left with empty project ID. Those inodes then are not shown in the quota accounting but still exist in the directory. This is not critical but in the case when special files are created in the directory with already existing project quota, these new inodes inherit extended attributes. This creates a mix of special files with and without attributes. Moreover, special files with attributes don't have a possibility to become clear or change the attributes. This, in turn, prevents userspace from re-creating quota project on these existing files. In addition, these new system calls allow the implementation of additional attributes that we couldn't or didn't want to fit into the legacy ioctls anymore" * tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: tighten a sanity check in file_attr_to_fileattr() tree-wide: s/struct fileattr/struct file_kattr/g fs: introduce file_getattr and file_setattr syscalls fs: prepare for extending file_get/setattr() fs: make vfs_fileattr_[get\|set] return -EOPNOTSUPP selinux: implement inode_file_[g\|s]etattr hooks lsm: introduce new hooks for setting/getting inode fsxattr fs: split fileattr related helpers into separate file	2025-07-28 15:24:14 -07:00
Linus Torvalds	7031769e10	Merge tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull mmap_prepare updates from Christian Brauner: "Last cycle we introduce f_op->mmap_prepare() in `c84bf6dd2b` ("mm: introduce new .mmap_prepare() file callback"). This is preferred to the existing f_op->mmap() hook as it does require a VMA to be established yet, thus allowing the mmap logic to invoke this hook far, far earlier, prior to inserting a VMA into the virtual address space, or performing any other heavy handed operations. This allows for much simpler unwinding on error, and for there to be a single attempt at merging a VMA rather than having to possibly reattempt a merge based on potentially altered VMA state. Far more importantly, it prevents inappropriate manipulation of incompletely initialised VMA state, which is something that has been the cause of bugs and complexity in the past. The intent is to gradually deprecate f_op->mmap, and in that vein this series coverts the majority of file systems to using f_op->mmap_prepare. Prerequisite steps are taken - firstly ensuring all checks for mmap capabilities use the file_has_valid_mmap_hooks() helper rather than directly checking for f_op->mmap (which is now not a valid check) and secondly updating daxdev_mapping_supported() to not require a VMA parameter to allow ext4 and xfs to be converted. Commit `bb666b7c27` ("mm: add mmap_prepare() compatibility layer for nested file systems") handles the nasty edge-case of nested file systems like overlayfs, which introduces a compatibility shim to allow f_op->mmap_prepare() to be invoked from an f_op->mmap() callback. This allows for nested filesystems to continue to function correctly with all file systems regardless of which callback is used. Once we finally convert all file systems, this shim can be removed. As a result, ecryptfs, fuse, and overlayfs remain unaltered so they can nest all other file systems. We additionally do not update resctl - as this requires an update to remap_pfn_range() (or an alternative to it) which we defer to a later series, equally we do not update cramfs which needs a mixed mapping insertion with the same issue, nor do we update procfs, hugetlbfs, syfs or kernfs all of which require VMAs for internal state and hooks. We shall return to all of these later" * tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: doc: update porting, vfs documentation to describe mmap_prepare() fs: replace mmap hook with .mmap_prepare for simple mappings fs: convert most other generic_file_mmap() users to .mmap_prepare() fs: convert simple use of generic_file__mmap() to .mmap_prepare() mm/filemap: introduce generic_file_*_mmap_prepare() helpers fs/xfs: transition from deprecated .mmap hook to .mmap_prepare fs/ext4: transition from deprecated .mmap hook to .mmap_prepare fs/dax: make it possible to check dev dax support without a VMA fs: consistently use can_mmap_file() helper mm/nommu: use file_has_valid_mmap_hooks() helper mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare	2025-07-28 13:43:25 -07:00
Linus Torvalds	278c7d9b5e	Merge tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fallocate updates from Christian Brauner: "fallocate() currently supports creating preallocated files efficiently. However, on most filesystems fallocate() will preallocate blocks in an unwriten state even if FALLOC_FL_ZERO_RANGE is specified. The extent state must later be converted to a written state when the user writes data into this range, which can trigger numerous metadata changes and journal I/O. This may leads to significant write amplification and performance degradation in synchronous write mode. At the moment, the only method to avoid this is to create an empty file and write zero data into it (for example, using 'dd' with a large block size). However, this method is slow and consumes a considerable amount of disk bandwidth. Now that more and more flash-based storage devices are available it is possible to efficiently write zeros to SSDs using the unmap write zeroes command if the devices do not write physical zeroes to the media. For example, if SCSI SSDs support the UMMAP bit or NVMe SSDs support the DEAC bit[1], the write zeroes command does not write actual data to the device, instead, NVMe converts the zeroed range to a deallocated state, which works fast and consumes almost no disk write bandwidth. This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and device-mapper drivers, and add the FALLOC_FL_WRITE_ZEROES and STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices. fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a way that subsequent writes to that range do not require further changes to the file mapping metadata. This flag is beneficial for subsequent pure overwriting within this range, as it can save on block allocation and, consequently, significant metadata changes" * tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ext4: add FALLOC_FL_WRITE_ZEROES support block: add FALLOC_FL_WRITE_ZEROES support block: factor out common part in blkdev_fallocate() fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate dm: clear unmap write zeroes limits when disabling write zeroes scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP nvmet: set WZDS and DRB if device enables unmap write zeroes operation nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit block: introduce max_{hw\|user}_wzeroes_unmap_sectors to queue limits	2025-07-28 13:36:49 -07:00
Theodore Ts'o	099b847ccc	ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr A syzbot fuzzed image triggered a BUG_ON in ext4_update_inline_data() when an inode had the INLINE_DATA_FL flag set but was missing the system.data extended attribute. Since this can happen due to a maiciouly fuzzed file system, we shouldn't BUG, but rather, report it as a corrupted file system. Add similar replacements of BUG_ON with EXT4_ERROR_INODE() ii ext4_create_inline_data() and ext4_inline_data_truncate(). Reported-by: syzbot+544248a761451c0df72f@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	a3ce570a5d	ext4: implement linear-like traversal across order xarrays Although we now perform ordered traversal within an xarray, this is currently limited to a single xarray. However, we have multiple such xarrays, which prevents us from guaranteeing a linear-like traversal where all groups on the right are visited before all groups on the left. For example, suppose we have 128 block groups, with a target group of 64, a target length corresponding to an order of 1, and available free groups of 16 (order 1) and group 65 (order 8): For linear traversal, when no suitable free block is found in group 64, it will search in the next block group until group 127, then start searching from 0 up to block group 63. It ensures continuous forward traversal, which is consistent with the unidirectional rotation behavior of HDD platters. Additionally, the block group lock contention during freeing block is unavoidable. The goal increasing from 0 to 64 indicates that previously scanned groups (which had no suitable free space and are likely to free blocks later) and skipped groups (which are currently in use) have newly freed some used blocks. If we allocate blocks in these groups, the probability of competing with other processes increases. For non-linear traversal, we first traverse all groups in order_1. If only group 16 has free space in this list, we first traverse [63, 128), then traverse [0, 64) to find the available group 16, and then allocate blocks in group 16. Therefore, it cannot guarantee continuous traversal in one direction, thus increasing the probability of contention. So refactor ext4_mb_scan_groups_xarray() to ext4_mb_scan_groups_xa_range() to only traverse a fixed range of groups, and move the logic for handling wrap around to the caller. The caller first iterates through all xarrays in the range [start, ngroups) and then through the range [0, start). This approach simulates a linear scan, which reduces contention between freeing blocks and allocating blocks. Assume we have the following groups, where "\|" denotes the xarray traversal start position: order_1_groups: AB \| CD order_2_groups: EF \| GH Traversal order: Before: C > D > A > B > G > H > E > F After: C > D > G > H > A > B > E > F Performance test data follows: \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 19555 \| 20049 (+2.5%) \| 315636 \| 316724 (-0.3%) \| \|mb_optimize_scan=1 \| 15496 \| 19342 (+24.8%) \| 323569 \| 328324 (+1.4%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 53192 \| 52125 (-2.0%) \| 212678 \| 215136 (+1.1%) \| \|mb_optimize_scan=1 \| 37636 \| 50331 (+33.7%) \| 214189 \| 209431 (-2.2%) \| Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-18-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	6347558764	ext4: refactor choose group to scan group This commit converts the `choose group` logic to `scan group` using previously prepared helper functions. This allows us to leverage xarrays for ordered non-linear traversal, thereby mitigating the "bouncing" issue inherent in the `choose group` mechanism. This also decouples linear and non-linear traversals, leading to cleaner and more readable code. Key changes: * ext4_mb_choose_next_group() is refactored to ext4_mb_scan_groups(). * Replaced ext4_mb_good_group() with ext4_mb_scan_group() in non-linear traversals, and related functions now return error codes instead of group info. * Added ext4_mb_scan_groups_linear() for performing linear scans starting from a specific group for a set number of times. * Linear scans now execute up to sbi->s_mb_max_linear_groups times, so ac_groups_linear_remaining is removed as it's no longer used. * ac->ac_criteria is now used directly instead of passing cr around. Also, ac->ac_criteria is incremented directly after groups scan fails for the corresponding criteria. * Since we're now directly scanning groups instead of finding a good group then scanning, the following variables and flags are no longer needed, s_bal_cX_groups_considered is sufficient. s_bal_p2_aligned_bad_suggestions s_bal_goal_fast_bad_suggestions s_bal_best_avail_bad_suggestions EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-17-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	f7eaacbb4e	ext4: convert free groups order lists to xarrays While traversing the list, holding a spin_lock prevents load_buddy, making direct use of ext4_try_lock_group impossible. This can lead to a bouncing scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group() fails, forcing the list traversal to repeatedly restart from grp_A. In contrast, linear traversal directly uses ext4_try_lock_group(), avoiding this bouncing. Therefore, we need a lockless, ordered traversal to achieve linear-like efficiency. Therefore, this commit converts both average fragment size lists and largest free order lists into ordered xarrays. In an xarray, the index represents the block group number and the value holds the block group information; a non-empty value indicates the block group's presence. While insertion and deletion complexity remain O(1), lookup complexity changes from O(1) to O(nlogn), which may slightly reduce single-threaded performance. Additionally, xarray insertions might fail, potentially due to memory allocation issues. However, since we have linear traversal as a fallback, this isn't a major problem. Therefore, we've only added a warning message for insertion failures here. A helper function ext4_mb_find_good_group_xarray() is added to find good groups in the specified xarray starting at the specified position start, and when it reaches ngroups-1, it wraps around to 0 and then to start-1. This ensures an ordered traversal within the xarray. Performance test results are as follows: Single-process operations on an empty disk show negligible impact, while multi-process workloads demonstrate a noticeable performance gain. \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 20097 \| 19555 (-2.6%) \| 316141 \| 315636 (-0.2%) \| \|mb_optimize_scan=1 \| 13318 \| 15496 (+16.3%) \| 325273 \| 323569 (-0.5%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 53603 \| 53192 (-0.7%) \| 214243 \| 212678 (-0.7%) \| \|mb_optimize_scan=1 \| 20887 \| 37636 (+80.1%) \| 213632 \| 214189 (+0.2%) \| [ Applied spelling fixes per discussion on the ext4-list see thread referened in the Link tag. --tytso] Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-16-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	9c08e42db9	ext4: factor out ext4_mb_scan_group() Extract ext4_mb_scan_group() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-15-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	5abd85f667	ext4: factor out ext4_mb_might_prefetch() Extract ext4_mb_might_prefetch() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-14-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	45704f92e5	ext4: factor out __ext4_mb_scan_group() Extract __ext4_mb_scan_group() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-13-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	7d345aa1fa	ext4: fix largest free orders lists corruption on mb_optimize_scan switch The grp->bb_largest_free_order is updated regardless of whether mb_optimize_scan is enabled. This can lead to inconsistencies between grp->bb_largest_free_order and the actual s_mb_largest_free_orders list index when mb_optimize_scan is repeatedly enabled and disabled via remount. For example, if mb_optimize_scan is initially enabled, largest free order is 3, and the group is in s_mb_largest_free_orders[3]. Then, mb_optimize_scan is disabled via remount, block allocations occur, updating largest free order to 2. Finally, mb_optimize_scan is re-enabled via remount, more block allocations update largest free order to 1. At this point, the group would be removed from s_mb_largest_free_orders[3] under the protection of s_mb_largest_free_orders_locks[2]. This lock mismatch can lead to list corruption. To fix this, whenever grp->bb_largest_free_order changes, we now always attempt to remove the group from its old order list. However, we only insert the group into the new order list if `mb_optimize_scan` is enabled. This approach helps prevent lock inconsistencies and ensures the data in the order lists remains reliable. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	1c320d8e92	ext4: fix zombie groups in average fragment size lists Groups with no free blocks shouldn't be in any average fragment size list. However, when all blocks in a group are allocated(i.e., bb_fragments or bb_free is 0), we currently skip updating the average fragment size, which means the group isn't removed from its previous s_mb_avg_fragment_size[old] list. This created "zombie" groups that were always skipped during traversal as they couldn't satisfy any block allocation requests, negatively impacting traversal efficiency. Therefore, when a group becomes completely full, bb_avg_fragment_size_order is now set to -1. If the old order was not -1, a removal operation is performed; if the new order is not -1, an insertion is performed. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	e7f101a808	ext4: merge freed extent with existing extents before insertion Attempt to merge ext4_free_data with already inserted free extents prior to adding new ones. This strategy drastically cuts down the number of times locks are held. For example, if prev, new, and next extents are all mergeable, the existing code (before this patch) requires acquiring the s_md_lock three times: prev merge into new and free prev // hold lock next merge into new and free next // hold lock insert new // hold lock After the patch, it only needs to be acquired once: new merge into next and free new // no lock next merge into prev and free next // hold lock Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 20043 \| 20097 (+0.2%) \| 314331 \| 316141 (+0.5%) \| \|mb_optimize_scan=1 \| 7290 \| 13318 (+87.4%) \| 324226 \| 325273 (+0.3%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 54999 \| 53603 (-2.5%) \| 214380 \| 214243 (-0.06%)\| \|mb_optimize_scan=1 \| 13497 \| 20887 (+54.6%) \| 216276 \| 213632 (-1.2%) \| Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-10-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	0a2326f6ae	ext4: convert sbi->s_mb_free_pending to atomic_t Previously, s_md_lock was used to protect s_mb_free_pending during modifications, while smp_mb() ensured fresh reads, so s_md_lock just guarantees the atomicity of s_mb_free_pending. Thus we optimized it by converting s_mb_free_pending into an atomic variable, thereby eliminating s_md_lock and minimizing lock contention. This also prepares for future lockless merging of free extents. Following this modification, s_md_lock is exclusively responsible for managing insertions and deletions within s_freed_data_list, along with operations involving list_splice. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 19628 \| 20043 (+2.1%) \| 320885 \| 314331 (-2.0%) \| \|mb_optimize_scan=1 \| 7129 \| 7290 (+2.2%) \| 321275 \| 324226 (+0.9%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 53760 \| 54999 (+2.3%) \| 213145 \| 214380 (+0.5%) \| \|mb_optimize_scan=1 \| 12716 \| 13497 (+6.1%) \| 215262 \| 216276 (+0.4%) \| Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-9-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	9a0ed16981	ext4: fix typo in CR_GOAL_LEN_SLOW comment Remove the superfluous "find_". Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-8-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:17 -04:00
Baokun Li	4d18a0b982	ext4: get rid of some obsolete EXT4_MB_HINT flags Since nobody has used these EXT4_MB_HINT flags for ages, let's remove them. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-7-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:16 -04:00
Baokun Li	8f2c3b7486	ext4: utilize multiple global goals to reduce contention When allocating data blocks, if the first try (goal allocation) fails and stream allocation is on, it tries a global goal starting from the last group we used (s_mb_last_group). This helps cluster large files together to reduce free space fragmentation, and the data block contiguity also accelerates write-back to disk. However, when multiple processes allocate blocks, having just one global goal means they all fight over the same group. This drastically lowers the chances of extents merging and leads to much worse file fragmentation. To mitigate this multi-process contention, we now employ multiple global goals, with the number of goals being the minimum between the number of possible CPUs and one-quarter of the filesystem's total block group count. To ensure a consistent goal for each inode, we select the corresponding goal by taking the inode number modulo the total number of goals. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 9636 \| 19628 (+103%) \| 337597 \| 320885 (-4.9%) \| \|mb_optimize_scan=1 \| 4834 \| 7129 (+47.4%) \| 341440 \| 321275 (-5.9%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 22341 \| 53760 (+140%) \| 219707 \| 213145 (-2.9%) \| \|mb_optimize_scan=1 \| 9177 \| 12716 (+38.5%) \| 215732 \| 215262 (+0.2%) \| Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-6-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:16 -04:00
Baokun Li	4b41deb896	ext4: remove unnecessary s_md_lock on update s_mb_last_group After we optimized the block group lock, we found another lock contention issue when running will-it-scale/fallocate2 with multiple processes. The fallocate's block allocation and the truncate's block release were fighting over the s_md_lock. The problem is, this lock protects totally different things in those two processes: the list of freed data blocks (s_freed_data_list) when releasing, and where to start looking for new blocks (mb_last_group) when allocating. Now we only need to track s_mb_last_group and no longer need to track s_mb_last_start, so we don't need the s_md_lock lock to ensure that the two are consistent. Since s_mb_last_group is merely a hint and doesn't require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient. Besides, the s_mb_last_group data type only requires ext4_group_t (i.e., unsigned int), rendering unsigned long superfluous. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. \|CPU: Kunpeng 920 \| P80 \| P1 \| \|Memory: 512GB \|------------------------\|-------------------------\| \|960GB SSD (0.5GB/s)\| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 4821 \| 9636 (+99.8%) \| 314065 \| 337597 (+7.4%) \| \|mb_optimize_scan=1 \| 4784 \| 4834 (+1.04%) \| 316344 \| 341440 (+7.9%) \| \|CPU: AMD 9654 * 2 \| P96 \| P1 \| \|Memory: 1536GB \|------------------------\|-------------------------\| \|960GB SSD (1GB/s) \| base \| patched \| base \| patched \| \|-------------------\|-------\|----------------\|--------\|----------------\| \|mb_optimize_scan=0 \| 15371 \| 22341 (+45.3%) \| 205851 \| 219707 (+6.7%) \| \|mb_optimize_scan=1 \| 6101 \| 9177 (+50.4%) \| 207373 \| 215732 (+4.0%) \| Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-5-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:16 -04:00
Baokun Li	f0374d8071	ext4: remove unnecessary s_mb_last_start Since stream allocation does not use ac->ac_f_ex.fe_start, it is set to -1 by default, so the no longer needed sbi->s_mb_last_start is removed. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250714130327.1830534-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2025-07-25 09:14:16 -04:00

1 2 3 4 5 ...

5435 Commits