We often filter out EROFS errors to avoid log spew after an emergency
shutdown - journal_shutdown is just another emergency shutdown error.
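Roughly the shape of the filter, as a sketch - the helper name is
hypothetical, and the errcode spelling for journal_shutdown is assumed
from the description above:

  /* sketch: helper and BCH_ERR_journal_shutdown are illustrative */
  static bool err_should_print(int err)
  {
          /*
           * After an emergency shutdown every subsequent operation
           * fails with EROFS (or journal_shutdown); logging each one
           * is pure spew.
           */
          return err != -EROFS &&
                 err != -BCH_ERR_journal_shutdown;
  }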
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This reverts commit 1fdbe0b184 ("bcachefs: Make sure c->vfs_sb is set
before starting fs"), which switched up bch2_fs_get_tree() so that we
got a superblock before calling bch2_fs_start(), ensuring c->vfs_sb
was always initialized while the filesystem was active.
This turned out not to be necessary, because blk_holder_ops were
implemented using our own locking, not vfs locking.
And this had the side effect of creating a super_block and doing our
full recovery (including potentially fsck) before setting SB_BORN, which
causes things like sync calls to hang until our recovery is finished.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
wake_up() doesn't require a barrier - but wake_up_bit() does.
This only affected non-x86 architectures, and primarily led to lost
wakeups after btree node reads.
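The waker-side pattern the fix needs looks roughly like this (flag
name from the btree node read path, used illustratively):

  clear_bit(BTREE_NODE_read_in_flight, &b->flags);
  /*
   * wake_up_bit() checks the waitqueue without taking its lock, so
   * the bit clear must be ordered before that check; wake_up() takes
   * the waitqueue lock and needs no extra barrier.
   */
  smp_mb__after_atomic();
  wake_up_bit(&b->flags, BTREE_NODE_read_in_flight);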
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was a buggy version of bcachefs-tools which picked misaligned
bucket sizes when formatting, and we're also about to do dynamic block
sizes - which will allow picking the device's logical or physical block
size per-write, allowing for better compression ratios at the cost of
slightly worse write performance (i.e. forcing the device to do RMW or
extra buffering).
To account for this, tweak bch2_alloc_sectors_start() to properly align
open_buckets to the blocksize of the write we're about to do.
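The alignment itself is a one-liner; a sketch with illustrative names:

  /* skip ahead so the write starts on a block boundary;
   * block_sectors is a power of two, so round_up() is a mask op */
  static u64 align_to_write_blocksize(u64 sector, unsigned block_sectors)
  {
          return round_up(sector, block_sectors);
  }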
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the promote target isn't set, rebalance should still leave a cached
copy on the faster device.
Fall back to foreground_target if it's set, or allow a cached copy on
any device if neither is set.
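As a sketch, the fallback order (promote_target and foreground_target
are the existing opts fields; the helper itself is hypothetical):

  static unsigned cached_copy_target(struct bch_io_opts *opts)
  {
          if (opts->promote_target)
                  return opts->promote_target;
          if (opts->foreground_target)
                  return opts->foreground_target;
          return 0;       /* 0: any device may hold the cached copy */
  }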
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_stdio_redirect_vprintf() was missing a check for stdio->done,
i.e. the thread exiting.
This caused the thread attempting to print to spin, and since it was
being called from the kthread run by thread_with_stdio, the userspace
side hung as well.
Change it to return -EPIPE - i.e. writing to a pipe that's been closed.
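The shape of the fix, in sketch form - inside the wait loop of
bch2_stdio_redirect_vprintf():

  /* don't spin waiting for buffer space if the other side is gone */
  if (stdio->done)
          return -EPIPE;  /* behave like writing to a closed pipe */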
Reported-by: Jan Solanti <jhs@psonet.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_sb_disk_groups_to_cpu() goes off of the superblock member info, so
we need to set that first.
Reported-by: Stijn Tintel <stijn@linux-ipv6.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Don't call bch2_trans_relock() after dir_emit(); taking a transaction
restart here will cause us to emit the same dirent to userspace twice
- Fix incorrect checking of the return value of dir_emit(): "true" means
success, keep going - but bch2_dir_emit() needs to return true when
we're finished iterating; see the sketch below.
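A sketch of the corrected semantics (not the actual helper; dir_emit()
is the VFS function):

  static bool emit_dirent(struct dir_context *ctx, const char *name,
                          int namelen, u64 ino, unsigned type)
  {
          /* dir_emit(): true == "entry accepted, keep going" */
          bool keep_going = dir_emit(ctx, name, namelen, ino, type);

          /* our convention: true == "finished iterating" */
          return !keep_going;
  }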
https://github.com/koverstreet/bcachefs/issues/867
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can hit this limit fairly easily when we have to reconstruct large
amounts of alloc info on large filesystems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If losing a btree won't cause data loss - i.e. it's an alloc btree, or
we can easily reconstruct it - we shouldn't require user action to
continue repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pull vfs fixes from Christian Brauner:
- For some reason we went from zero to three maintainers for HFS/HFS+
in a matter of days. The lesson to learn from this might just be that
we need to threaten code removal more often!?
- Fix a regression introduced by enabling large folios for large logical
block sizes. This has caused issues for noref migration with large
folios due to sleeping while in an atomic context.
New sleeping variants of pagecache lookup helpers are introduced.
These helpers take the folio lock instead of the mapping's private
spinlock. The problematic users are converted to the sleeping
variants and serialize against noref migration. Atomic users will
bail on seeing the new BH_Migrate flag.
This also shrinks the critical region of the mapping's private lock
and the new blocking callers reduce contention on the spinlock for
bdev mappings.
- Fix two bugs in do_move_mount() when used with MOVE_MOUNT_BENEATH. The
first bug is using a mountpoint that is located on a mount we're not
holding a reference to. The second bug is putting the mountpoint
after we've called namespace_unlock(), as it's no longer guaranteed
that it stays a mountpoint.
- Remove a pointless call to vfs_getattr_nosec() in the devtmpfs code
just to query i_mode instead of simply querying the inode directly.
This also avoids lifetime issues for the dm code caused by an earlier
bugfix this cycle that moved bdev_statx() handling into
vfs_getattr_nosec().
- Fix AT_FDCWD handling with getname_maybe_null() in the xattr code.
- Fix a performance regression for files when multiple callers issue a
close when it's not the last reference.
- Remove a duplicate noinline annotation from pipe_clear_nowait().
* tag 'vfs-6.15-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs/xattr: Fix handling of AT_FDCWD in setxattrat(2) and getxattrat(2)
MAINTAINERS: hfs/hfsplus: add myself as maintainer
splice: remove duplicate noinline from pipe_clear_nowait
devtmpfs: don't use vfs_getattr_nosec to query i_mode
fix a couple of races in MNT_TREE_BENEATH handling by do_move_mount()
fs: fall back to file_ref_put() for non-last reference
mm/migrate: fix sleep in atomic for large folios and buffer heads
fs/ext4: use sleeping version of sb_find_get_block()
fs/jbd2: use sleeping version of __find_get_block()
fs/ocfs2: use sleeping version of __find_get_block()
fs/buffer: use sleeping version of __find_get_block()
fs/buffer: introduce sleeping flavors for pagecache lookups
MAINTAINERS: add HFS/HFS+ maintainers
fs/buffer: split locking for pagecache lookups
Pull ceph fixes from Ilya Dryomov:
"A small CephFS encryption-related fix and a dead code cleanup"
* tag 'ceph-for-6.15-rc4' of https://github.com/ceph/ceph-client:
ceph: Fix incorrect flush end position calculation
ceph: Remove osd_client deadcode
Pull xfs fixes from Carlos Maiolino:
"This contains a fix for a build failure on some 32-bit architectures
and a warning generating docs"
* tag 'xfs-fixes-6.15-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: remove duplicate Zoned Filesystems sections in admin-guide
XFS: fix zoned gc threshold math for 32-bit arches
Pull bcachefs fixes from Kent Overstreet:
- Case insensitive directories now work
- Fiemap now correctly reports on unwritten pagecache data
- bcachefs tools 1.25.1 was incorrectly picking unaligned bucket sizes;
fix journal and write path bugs this uncovered
And assorted smaller fixes...
* tag 'bcachefs-2025-04-24' of git://evilpiepirate.org/bcachefs: (24 commits)
bcachefs: Rework fiemap transaction restart handling
bcachefs: add fiemap delalloc extent detection
bcachefs: refactor fiemap processing into extent helper and struct
bcachefs: track current fiemap offset in start variable
bcachefs: drop duplicate fiemap sync flag
bcachefs: Fix btree_iter_peek_prev() at end of inode
bcachefs: Make btree_iter_peek_prev() assert more precise
bcachefs: Unit test fixes
bcachefs: Print mount opts earlier
bcachefs: unlink: casefold d_invalidate
bcachefs: Fix casefold lookups
bcachefs: Casefold is now a regular opts.h option
bcachefs: Implement fileattr_(get|set)
bcachefs: Allocator now copes with unaligned buckets
bcachefs: Start copygc, rebalance threads earlier
bcachefs: Refactor bch2_run_recovery_passes()
bcachefs: bch2_copygc_wakeup()
bcachefs: Fix ref leak in write_super()
bcachefs: Change __journal_entry_close() assert to ERO
bcachefs: Ensure journal space is block size aligned
...
Currently, setxattrat(2) and getxattrat(2) wrongly handle calls of the
form setxattrat(AT_FDCWD, NULL, AT_EMPTY_PATH, ...) and fail with
-EBADF instead of operating on the CWD. Fix it.
Fixes: 6140be90ec ("fs/xattr: add *at family syscalls")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/20250424132246.16822-2-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
pipe_clear_nowait() has two noinline annotations, but we only need one.
I checked the whole tree, and this is the only occurrence:
$ grep -r "noinline .* noinline"
fs/splice.c:static noinline void noinline pipe_clear_nowait(struct file *file)
$
Fixes: 0f99fc513d ("splice: clear FMODE_NOWAIT on file if splice/vmsplice is used")
Signed-off-by: "T.J. Mercier" <tjmercier@google.com>
Link: https://lore.kernel.org/20250423180025.2627670-1-tjmercier@google.com
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Restart handling in the previous patch was incorrect, so move the
btree operations into a separate helper and run it with
lockrestart_do().
Additionally, clarify whether pagecache or the btree takes precedence.
Right now, the btree takes precedence: this is incorrect, but it's
needed to pass fstests. Add a giant comment explaining why.
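In sketch form (arguments elided; lockrestart_do() is the existing
retry macro):

  /*
   * Retry the btree operations on transaction restart before anything
   * is reported, so fiemap_fill_next_extent() never sees a duplicate.
   */
  ret = lockrestart_do(trans,
                       bch2_fiemap_extent(trans, &iter, ..., &cur));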
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs currently populates fiemap data from the extents btree.
This works correctly when the fiemap sync flag is provided, but if
not, it skips all delalloc extents that have not yet been flushed.
This is because delalloc extents from buffered writes are first
stored as reservation in the pagecache, and only become resident in
the extents btree after writeback completes.
Update the fiemap implementation to process holes between extents by
scanning pagecache for data, via seek data/hole. If a valid data
range is found over a hole in the extent btree, fake up an extent
key and flag the extent as delalloc for reporting to userspace.
Note that this does not necessarily change behavior for the case
where there is dirty pagecache over already written extents, where,
in COW mode, writeback will allocate new blocks for the underlying
ranges. The existing behavior is consistent with btrfs, and it is
recommended to use the sync flag for the most up-to-date extent state
from fiemap.
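A sketch of the pagecache probe over a btree hole (byte offsets and
names are illustrative; mapping_seek_hole_data() is the core helper):

  struct address_space *mapping = inode->v.i_mapping;

  /* [start, end) is a hole in the extents btree, in bytes here */
  loff_t probe = mapping_seek_hole_data(mapping, start, end, SEEK_DATA);

  if (probe >= 0 && probe < end) {
          /*
           * Dirty pagecache over the hole: synthesize an extent key
           * and report it with FIEMAP_EXTENT_DELALLOC set.
           */
  }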
Signed-off-by: Brian Foster <bfoster@redhat.com>
The bulk of the loop in bch2_fiemap() involves processing the
current extent key from the iter, including following indirections
and trimming the extent size and such. This patch makes a few
changes to reduce the size of the loop and facilitate future changes
to support delalloc extents.
Define a new bch_fiemap_extent structure to wrap the bkey buffer
that holds the extent key to report to userspace along with
associated fiemap flags. Update bch2_fill_extent() to take the
bch_fiemap_extent as a param instead of the individual fields.
Finally, lift the bulk of the extent processing into a
bch2_fiemap_extent() helper that takes the current key and formats
the bch_fiemap_extent appropriately for the fill function.
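Roughly the shape of the new wrapper, per the description above (not
verbatim from the patch):

  struct bch_fiemap_extent {
          struct bkey_buf kbuf;   /* extent key to report to userspace */
          unsigned        flags;  /* associated FIEMAP_EXTENT_* flags */
  };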
No functional changes intended by this patch.
Signed-off-by: Brian Foster <bfoster@redhat.com>
FIEMAP_FLAG_SYNC handling was deliberately moved into core code in
commit 45dd052e67 ("fs: handle FIEMAP_FLAG_SYNC in fiemap_prep"),
released in kernel v5.8. Update bcachefs accordingly.
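Since fiemap_prep() handles FIEMAP_FLAG_SYNC itself, a ->fiemap
implementation only needs the usual prologue; sketch:

  static int fs_fiemap(struct inode *inode, struct fiemap_extent_info *info,
                       u64 start, u64 len)
  {
          /* core validates flags and syncs if FIEMAP_FLAG_SYNC was set */
          int ret = fiemap_prep(inode, info, start, &len, 0);
          if (ret)
                  return ret;

          /* ... walk extents and fill ... */
          return 0;
  }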
Signed-off-by: Brian Foster <bfoster@redhat.com>
At the end of the inode, on an extents iterator, peek_slot() has to
advance to the next position to avoid returning a zero-size extent, which
is not allowed.
Changing iter->pos confuses peek_prev(), but we don't need to call
peek_slot() in this case.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The issue this assert is guarding against is that in
BTREE_ITER_filter_snapshots mode we only want to be iterating within a
single inode number - if we iterate into another inode number with keys
for a different snapshot tree, we'll loop arbitrarily long before
finding a key we can return.
This comes up in the unit tests, where we're using inode 0 for our test
keys.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we aren't mounting with the correct degraded option, it's helpful to
know that before we fail to mount degraded.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Casefolding results in additional aliases on lookup for the
non-casefolded names - these need invalidating on unlink.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add casefolding to bch2_lookup_trans:
During the delay between when casefolding was written and when it was
merged, the main filesystem lookup path grew self-healing - which meant
it was no longer using bch2_dirent_lookup_trans(), where casefolding on
lookups happens.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Normally do_lock_mount(path, _) is locking a mountpoint pinned by
*path and at the time when matching unlock_mount() unlocks that
location it is still pinned by the same thing.
Unfortunately, for the 'beneath' case it's no longer that simple -
the object being locked is not the one *path points to. It's the
mountpoint of path->mnt. The thing is, without sufficient locking
->mnt_parent may change under us and none of the locks are held
at that point. The rules are
* mount_lock stabilizes m->mnt_parent for any mount m.
* namespace_sem stabilizes m->mnt_parent, provided that
m is mounted.
* if either of the above holds and refcount of m is positive,
we are guaranteed the same for refcount of m->mnt_parent.
namespace_sem nests inside inode_lock(), so do_lock_mount() has
to take inode_lock() before grabbing namespace_sem. It does
recheck that path->mnt is still mounted in the same place after
getting namespace_sem, and it does take care to pin the dentry.
It is needed, since otherwise we might end up with racing mount --move
(or umount) happening while we were getting locks; in that case
dentry would no longer be a mountpoint and could've been evicted
on memory pressure along with its inode - not something you want
when grabbing lock on that inode.
However, pinning a dentry is not enough - the matching mount is
also pinned only by the fact that path->mnt is mounted on top of it
and at that point we are not holding any locks whatsoever, so
the same kind of races could end up with all references to
that mount gone just as we are about to enter inode_lock().
If that happens, we are left with filesystem being shut down while
we are holding a dentry reference on it; results are not pretty.
What we need to do is grab both dentry and mount at the same time;
that makes inode_lock() safe *and* avoids the problem with fs getting
shut down under us. After taking namespace_sem we verify that
path->mnt is still mounted (which stabilizes its ->mnt_parent) and
check that it's still mounted at the same place. From that point
on to the matching namespace_unlock() we are guaranteed that
mount/dentry pair we'd grabbed are also pinned by being the mountpoint
of path->mnt, so we can quietly drop both the dentry reference (as
the current code does) and mnt one - it's OK to do under namespace_sem,
since we are not dropping the final refs.
That solves the problem on do_lock_mount() side; unlock_mount()
also has one, since dentry is guaranteed to stay pinned only until
the namespace_unlock(). That's easy to fix - just have inode_unlock()
done earlier, while it's still pinned by mp->m_dentry.
Fixes: 6ac3928156 ("fs: allow to mount beneath top mount") # v6.5+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull btrfs fixes from David Sterba:
- subpage mode fixes:
- access correct object (folio) when looking up bit offset
- fix assertion condition for number of blocks per folio
- fix upper boundary of locking range in hole punch
- zoned fixes:
- fix potential deadlock caught by lockdep when zone reporting and
device freeze run in parallel
- fix zone write pointer mismatch and NULL pointer dereference when
metadata are converted from DUP to RAID1
- fix error handling when reloc inode creation fails
- in tree-checker, unify error code for header level check
- block layer: add helpers to read zone capacity
* tag 'for-6.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: skip reporting zone for new block group
block: introduce zone capacity helper
btrfs: tree-checker: adjust error code for header level check
btrfs: fix invalid inode pointer after failure to create reloc inode
btrfs: zoned: return EIO on RAID1 block group write pointer mismatch
btrfs: fix the ASSERT() inside GET_SUBPAGE_BITMAP()
btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range()
btrfs: subpage: access correct object when reading bitmap start in subpage_calc_start_bit()