linux

mirror of https://github.com/raspberrypi/linux.git synced 2026-01-05 02:37:41 +00:00

Author	SHA1	Message	Date
Demi Marie Obenour	81ca2dbefa	dm ioctl: Refuse to create device named "." or ".." Using either of these is going to greatly confuse userspace, as they are not valid symlink names and so creating the usual /dev/mapper/NAME symlink will not be possible. As creating a device with either of these names is almost certainly a userspace bug, just error out. Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-23 10:31:52 -04:00
Demi Marie Obenour	a85f1a9de9	dm ioctl: Refuse to create device named "control" Typical userspace setups create a symlink under /dev/mapper with the name of the device, but /dev/mapper/control is reserved for DM's control device. Therefore, trying to create such a device is almost certain to be a userspace bug. Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-23 10:31:51 -04:00
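The two name checks above (refusing "control", and refusing "." and "..") amount to rejecting a small set of reserved names before a device is created. A minimal userspace sketch follows, assuming a hypothetical helper `dm_name_is_reserved()`; the kernel patches themselves are not shown on this page and may be structured differently.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not the kernel's code: returns true for names that
 * cannot be used for the usual /dev/mapper/NAME symlink.
 */
static bool dm_name_is_reserved(const char *name)
{
	return strcmp(name, ".") == 0 ||
	       strcmp(name, "..") == 0 ||
	       strcmp(name, "control") == 0;
}
```

A caller would reject the create request with an error when the helper returns true, rather than letting userspace discover the symlink problem later.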
Demi Marie Obenour	249bed821b	dm ioctl: Avoid double-fetch of version The version is fetched once in check_version(), which then does some validation and then overwrites the version in userspace with the API version supported by the kernel. copy_params() then fetches the version from userspace again, and this time no validation is done. The result is that the kernel's version number is completely controllable by userspace, provided that userspace can win a race condition. Fix this flaw by not copying the version back to the kernel the second time. This is not exploitable as the version is not further used in the kernel. However, it could become a problem if future patches start relying on the version field. Cc: stable@vger.kernel.org Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-23 10:31:51 -04:00
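The double-fetch fix described above can be illustrated in userspace: once the version has been validated, the second copy skips the version field so that a racing writer cannot replace it. This is only a sketch under stated assumptions: `struct ioctl_hdr_sketch` and `copy_params_skip_version()` are hypothetical names, and `memcpy()` stands in for the kernel's copy from userspace.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified header: a version triple followed by other fields. */
struct ioctl_hdr_sketch {
	uint32_t version[3];
	uint32_t data_size;
};

/*
 * Copy everything after the version field from the "user" buffer.
 * dst->version keeps the value that was validated by the first fetch.
 */
static void copy_params_skip_version(struct ioctl_hdr_sketch *dst,
				     const void *user_buf)
{
	const size_t skip = sizeof(dst->version);

	memcpy((char *)dst + skip, (const char *)user_buf + skip,
	       sizeof(*dst) - skip);
}
```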
Demi Marie Obenour	10655c7a48	dm ioctl: structs and parameter strings must not overlap The NUL terminator for each target parameter string must precede the following 'struct dm_target_spec'. Otherwise, dm_split_args() might corrupt this struct. Furthermore, the first 'struct dm_target_spec' must come after the 'struct dm_ioctl', as if it overlaps too much dm_split_args() could corrupt the 'struct dm_ioctl'. Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-23 10:31:51 -04:00
Demi Marie Obenour	13f4a697f8	dm ioctl: Avoid pointer arithmetic overflow Especially on 32-bit systems, it is possible for the pointer arithmetic to overflow and cause a userspace pointer to be dereferenced in the kernel. Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-23 10:31:51 -04:00
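One common way to avoid the kind of overflow described above is to compare offsets against the remaining buffer length instead of forming `base + offset` and comparing pointers. The sketch below assumes a hypothetical helper `next_in_bounds()` and is not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: is it safe to advance `next` bytes from offset
 * `cur_off` inside a buffer of `total` bytes? The subtraction cannot wrap
 * because cur_off <= total is checked first, so no sum is ever formed.
 */
static bool next_in_bounds(size_t cur_off, size_t next, size_t total)
{
	return cur_off <= total && next <= total - cur_off;
}
```

The naive form `cur_off + next <= total` would wrap for very large `next` and wrongly accept the request, which is the failure mode the commit message warns about.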
Demi Marie Obenour	b60528d9e6	dm ioctl: Check dm_target_spec is sufficiently aligned Otherwise subsequent code, if given malformed input, could dereference a misaligned 'struct dm_target_spec *'. Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> # use %zu Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-23 10:31:49 -04:00
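An alignment check of the kind described above can be sketched as a modulo test on the offset at which the next structure would start. The struct layout and the name `spec_offset_is_aligned()` below are assumptions for illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a target spec structure. */
struct target_spec_sketch {
	uint64_t sector_start;
	uint64_t length;
	int32_t status;
	uint32_t next;
};

/* True if a struct placed at `offset` (from an aligned base) is aligned. */
static bool spec_offset_is_aligned(size_t offset)
{
	return offset % _Alignof(struct target_spec_sketch) == 0;
}
```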
Russell Harmon	2971c05874	Documentation: dm-integrity: Document an example of how the tunables relate. Signed-off-by: Russell Harmon <eatnumber1@gmail.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-22 18:39:51 -04:00
Russell Harmon	52145f284c	Documentation: dm-integrity: Document default values. Signed-off-by: Russell Harmon <eatnumber1@gmail.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-22 18:39:50 -04:00
Russell Harmon	3b671459e6	Documentation: dm-integrity: Document the meaning of "buffer". "Buffers" are buffers of the metadata/checksum area of dm-integrity. They are always at most as large as a single metadata area on-disk, but may be smaller. Signed-off-by: Russell Harmon <eatnumber1@gmail.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-22 18:39:50 -04:00
Russell Harmon	c3ba5aa6f7	Documentation: dm-integrity: Fix minor grammatical error. "where dm-integrity uses bitmap" becomes "where dm-integrity uses a bitmap" Signed-off-by: Russell Harmon <eatnumber1@gmail.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-22 18:39:50 -04:00
Andy Shevchenko	25c9a4ab4d	dm integrity: Use %*ph for printing hexdump of a small buffer The kernel already has a helper to print a hexdump of a small buffer via pointer extension. Use that instead of open coded variant. In long term it helps to kill pr_cont() or at least narrow down its use. Note, the format is slightly changed, i.e. the trailing space is always printed. Also the IV dump is limited by 64 bytes which seems fine. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-22 18:39:50 -04:00
Mike Snitzer	fa37564624	dm thin: disable discards for thin-pool if no_discard_passdown Also rename disable_passdown_if_not_supported to disable_discard_passdown_if_not_supported. And fold passdown_enabled() into only caller. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:14 -04:00
Mike Snitzer	862c6663c1	dm: remove stale/redundant dm_internal_{suspend,resume} prototypes in dm.h dm_internal_suspend() no longer exists. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:14 -04:00
Mike Snitzer	c4f512d255	dm: skip dm-stats work in alloc_io() unless needed Don't dm_stats_record_start() if dm_stats_used() is false. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:13 -04:00
Mike Snitzer	06eed768ea	dm: avoid needless dm_io access if all IO accounting is disabled Update dm_io_acct() to eliminate most dm_io struct accesses if both block core's IO stats and dm-stats are disabled. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:13 -04:00
Li Nan	526d10061b	dm: support turning off block-core's io stats accounting Commit `bc58ba9468` ("block: add sysfs file for controlling io stats accounting") allowed users to turn off disk stat accounting completely by checking if queue flag QUEUE_FLAG_IO_STAT is set. In dm, this flag is neither set nor checked: so block-core's io stats are continuously counted and cannot be turned off. Add support for turning off block-core's io stats accounting for dm. Set QUEUE_FLAG_IO_STAT for dm's request_queue. If QUEUE_FLAG_IO_STAT is set when an io starts, record the need for block core's io stats by setting the DM_IO_BLK_STAT dm_io flag to avoid io stats being disabled in the middle of the io. DM statistics (dm-stats) is independent of block-core's io stats and remains unchanged. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:13 -04:00
Christophe JAILLET	e118029cb7	dm zone: Use the bitmap API to allocate bitmaps Use bitmap_zalloc()/bitmap_free() instead of hand-writing them. It is less verbose and it improves the semantic. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:13 -04:00
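For readers unfamiliar with what "hand-writing" a bitmap allocation looks like, here is a userspace approximation of the size computation and zeroed allocation that a bitmap allocator wraps. The `_sketch` names are hypothetical and `calloc()` stands in for the kernel allocator; the kernel's bitmap_zalloc() additionally takes allocation flags.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define BITS_PER_LONG_SKETCH (CHAR_BIT * sizeof(unsigned long))

/* Number of unsigned longs needed to hold nbits bits (rounded up). */
static size_t bits_to_longs_sketch(size_t nbits)
{
	return (nbits + BITS_PER_LONG_SKETCH - 1) / BITS_PER_LONG_SKETCH;
}

/* Allocate a zero-filled bitmap of nbits bits; release it with free(). */
static unsigned long *bitmap_zalloc_sketch(size_t nbits)
{
	return calloc(bits_to_longs_sketch(nbits), sizeof(unsigned long));
}
```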
Li Lingfeng	d483001206	dm thin metadata: Fix ABBA deadlock by resetting dm_bufio_client As described in commit `8111964f1b` ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata"), ABBA deadlocks will be triggered because shrinker_rwsem currently needs to held by dm_pool_abort_metadata() as a side-effect of thin-pool metadata operation failure. The following three problem scenarios have been noticed: 1) Described by commit `8111964f1b` ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata") 2) shrinker_rwsem and throttle->lock P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab super_cache_scan prune_icache_sb dispose_list evict ext4_evict_inode ext4_clear_inode ext4_discard_preallocations ext4_mb_load_buddy_gfp ext4_mb_init_cache ext4_wait_block_bitmap __ext4_error ext4_handle_error ext4_commit_super ... dm_submit_bio do_worker throttle_work_update down_write(&t->lock) -- LOCK B process_deferred_bios commit metadata_operation_failed dm_pool_abort_metadata dm_block_manager_create dm_bufio_client_create register_shrinker down_write(&shrinker_rwsem) -- LOCK A thin_map thin_bio_map thin_defer_bio_with_throttle throttle_lock down_read(&t->lock) - LOCK B 3) shrinker_rwsem and wait_on_buffer P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab ... 
ext4_wait_block_bitmap __ext4_error ext4_handle_error jbd2_journal_abort jbd2_journal_update_sb_errno jbd2_write_superblock submit_bh // LOCK B // RELEASE B do_worker throttle_work_update down_write(&t->lock) - LOCK B process_deferred_bios process_bio commit metadata_operation_failed dm_pool_abort_metadata dm_block_manager_create dm_bufio_client_create register_shrinker register_shrinker_prepared down_write(&shrinker_rwsem) - LOCK A bio_endio wait_on_buffer __wait_on_buffer Fix these by resetting dm_bufio_client without holding shrinker_rwsem. Fixes: `8111964f1b` ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata") Cc: stable@vger.kernel.org Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:13 -04:00
Mikulas Patocka	2a32897c84	dm crypt: fix crypt_ctr_cipher_new return value on invalid AEAD cipher If the user specifies invalid AEAD cipher, dm-crypt should return the error returned from crypt_ctr_auth_spec, not -ENOMEM. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:13 -04:00
Mike Snitzer	ef6953fb68	dm thin: update .io_hints methods to not require handling discards last Removes assumptions about what might follow the discard setup code (previously the code would return early if discards not enabled). Makes it possible to add more capabilities to the end of each .io_hints method (which is the natural thing to do when adding new features). Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:13 -04:00
Mike Snitzer	c0a7a0ac07	dm thin: remove return code variable in pool_map Always returns DM_MAPIO_REMAPPED so no need for variable. Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:13 -04:00
Mikulas Patocka	4c2c845bdc	dm flakey: introduce random_read_corrupt and random_write_corrupt options The random_read_corrupt and random_write_corrupt options corrupt a random byte in a bio with the provided probability. The corruption only happens in the "down" interval. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:13 -04:00
Mikulas Patocka	1d9a943898	dm flakey: clone pages on write bio before corrupting them dm-flakey has an option to corrupt write bios. It corrupts the memory that is being written. This can cause system crashes or security bugs - for example, if the user writes a shared library code with O_DIRECT flag to a dm-flakey device, it corrupts the memory for all users that have the shared library mapped. Fix this bug by cloning the bio and corrupting the clone rather than the original. Also drop the test for ZERO_PAGE(0) - it can't happen because we write the cloned pages. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:13 -04:00
Mikulas Patocka	5054e778fc	dm crypt: allocate compound pages if possible It was reported that allocating pages for the write buffer in dm-crypt causes measurable overhead [1]. Change dm-crypt to allocate compound pages if they are available. If not, fall back to the mempool. [1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2023-06-16 18:24:13 -04:00
Ming Lei	245165658e	blk-mq: fix NULL dereference on q->elevator in blk_mq_elv_switch_none After grabbing q->sysfs_lock, q->elevator may become NULL because of elevator switch. Fix the NULL dereference on q->elevator by checking it with lock. Reported-by: Guangwu Zhang <guazhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-16 10:12:25 -06:00
Christoph Hellwig	84bd06c632	iov_iter: remove iov_iter_get_pages and iov_iter_get_pages_alloc Now that the direct I/O helpers have switched to use iov_iter_extract_pages, these helpers are unused. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-16 10:08:09 -06:00
Christoph Hellwig	e4cc64657b	block: remove BIO_PAGE_REFFED Now that all block direct I/O helpers use page pinning, this flag is unused. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-16 10:08:09 -06:00
Christoph Hellwig	2e82f6c3bf	splice: simplify a conditional in copy_splice_read Check for -EFAULT instead of wrapping the check in an ret < 0 block. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-16 10:08:08 -06:00
Christoph Hellwig	0b24be4691	splice: don't call file_accessed in copy_splice_read copy_splice_read calls into ->read_iter to read the data, which already calls file_accessed. Fixes: `33b3b04154` ("splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-16 10:08:08 -06:00
Jens Axboe	236f255296	Merge tag 'nvme-6.5-2023-06-16' of git://git.infradead.org/nvme into for-6.5/block Pull NVMe updates from Keith: "nvme updates for Linux 6.5 - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner)" * tag 'nvme-6.5-2023-06-16' of git://git.infradead.org/nvme: (27 commits) nvme: forward port sysfs delete fix nvme: skip optional id ctrl csi if it failed nvme-core: use nvme_ns_head_multipath instead of ns->head->disk nvmet-fcloop: Do not wait on completion when unregister fails nvme-fabrics: open code __nvmf_host_find() nvme-fabrics: error out to unlock the mutex nvme: Increase block size variable size to 32-bit nvme-fcloop: no need to return from void function nvmet-auth: remove unnecessary break after goto nvmet-auth: remove some dead code nvme-core: remove redundant check from nvme_init_ns_head nvme: move sysfs code to a dedicated sysfs.c file nvme-fabrics: prevent overriding of existing host nvme-fabrics: check hostid using uuid_equal nvme-fabrics: unify common code in admin and io queue connect nvmet: reorder fields in 'struct nvmefc_fcp_req' nvmet: reorder fields in 'struct nvme_dhchap_queue_context' nvmet: reorder fields in 'struct nvmf_ctrl_options' nvme: reorder fields in 'struct nvme_ctrl' nvmet: reorder fields in 'struct nvmet_sq' ...	2023-06-16 09:57:40 -06:00
Keith Busch	1c606f7f05	nvme: forward port sysfs delete fix We had a late fix that modified nvme_sysfs_delete() after the staging branch for the next merge window relocated the function to a new file. Port commit `2eb94dd56a` ("nvme: do not let the user delete a ctrl before a complete") to the latest to avoid a potentially confusing merge conflict. Cc: Maurizio Lombardi <mlombard@redhat.com> Cc: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>	2023-06-16 08:15:57 -07:00
Mingzhe Zou	f0854489fc	bcache: fixup btree_cache_wait list damage We get a kernel crash about "list_add corruption. next->prev should be prev (ffff9c801bc01210), but was ffff9c77b688237c. (next=ffffae586d8afe68)." crash> struct list_head 0xffff9c801bc01210 struct list_head { next = 0xffffae586d8afe68, prev = 0xffffae586d8afe68 } crash> struct list_head 0xffff9c77b688237c struct list_head { next = 0x0, prev = 0x0 } crash> struct list_head 0xffffae586d8afe68 struct list_head struct: invalid kernel virtual address: ffffae586d8afe68 type: "gdb_readmem_callback" Cannot access memory at address 0xffffae586d8afe68 [230469.019492] Call Trace: [230469.032041] prepare_to_wait+0x8a/0xb0 [230469.044363] ? bch_btree_keys_free+0x6c/0xc0 [escache] [230469.056533] mca_cannibalize_lock+0x72/0x90 [escache] [230469.068788] mca_alloc+0x2ae/0x450 [escache] [230469.080790] bch_btree_node_get+0x136/0x2d0 [escache] [230469.092681] bch_btree_check_thread+0x1e1/0x260 [escache] [230469.104382] ? finish_wait+0x80/0x80 [230469.115884] ? bch_btree_check_recurse+0x1a0/0x1a0 [escache] [230469.127259] kthread+0x112/0x130 [230469.138448] ? kthread_flush_work_fn+0x10/0x10 [230469.149477] ret_from_fork+0x35/0x40 bch_btree_check_thread() and bch_dirty_init_thread() may call mca_cannibalize() to cannibalize other cached btree nodes. Only one thread can do it at a time, so the op of other threads will be added to the btree_cache_wait list. We must call finish_wait() to remove op from btree_cache_wait before free it's memory address. Otherwise, the list will be damaged. Also should call bch_cannibalize_unlock() to release the btree_cache_alloc_lock and wake_up other waiters. 
Fixes: `8e7102273f` ("bcache: make bch_btree_check() to be multithreaded") Fixes: `b144e45fc5` ("bcache: make bch_sectors_dirty_init() to be multithreaded") Cc: stable@vger.kernel.org Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-7-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-15 07:32:55 -06:00
Zheng Wang	80fca8a10b	bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent In some specific situations, the return value of __bch_btree_node_alloc may be NULL. This may lead to a potential NULL pointer dereference in caller function like a calling chain : btree_split->bch_btree_node_alloc->__bch_btree_node_alloc. Fix it by initializing the return value in __bch_btree_node_alloc. Fixes: `cafe563591` ("bcache: A block layer cache") Cc: stable@vger.kernel.org Signed-off-by: Zheng Wang <zyytlz.wz@163.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-6-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-15 07:32:00 -06:00
Zheng Wang	028ddcac47	bcache: Remove unnecessary NULL point check in node allocations Due to the previous fix of __bch_btree_node_alloc, the return value will never be a NULL pointer. So IS_ERR is enough to handle the failure situation. Fix it by replacing IS_ERR_OR_NULL check by an IS_ERR check. Fixes: `cafe563591` ("bcache: A block layer cache") Cc: stable@vger.kernel.org Signed-off-by: Zheng Wang <zyytlz.wz@163.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-15 07:30:43 -06:00
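The distinction between the two checks in the bcache commits above is easier to see with a userspace re-creation of the error-pointer convention. The kernel's real macros live in include/linux/err.h; the `_sketch` functions below are simplified stand-ins written for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno value as a pointer in the top 4095 addresses. */
static void *err_ptr_sketch(long err)
{
	return (void *)(intptr_t)err;
}

/* True only for encoded errors; NULL is not an error pointer. */
static bool is_err_sketch(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

/* True for encoded errors and for NULL. */
static bool is_err_or_null_sketch(const void *p)
{
	return p == NULL || is_err_sketch(p);
}
```

Once an allocator is guaranteed to return either a valid pointer or an encoded error, the NULL half of the combined check is dead code, which is the simplification the commit describes.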
Andrea Tomassetti	ccb8c3bd6d	bcache: Remove dead references to cache_readaheads The cache_readaheads stat counter is not used anymore and should be removed. Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-15 07:30:11 -06:00
Thomas Weißschuh	b98dd0b0a5	bcache: make kobj_type structures constant Since commit `ee6d3dd4ed` ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-15 07:30:11 -06:00
ye xingchen	a301b2deb6	bcache: Convert to use sysfs_emit()/sysfs_emit_at() APIs Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-15 07:30:11 -06:00
Yu Kuai	dd7de3704a	block: fix blktrace debugfs entries leakage Commit `99d055b4fd` ("block: remove per-disk debugfs files in blk_unregister_queue") moves blk_trace_shutdown() from blk_release_queue() to blk_unregister_queue(), this is safe if blktrace is created through sysfs, however, there is a regression in corner case. blktrace can still be enabled after del_gendisk() through ioctl if the disk is opened before del_gendisk(), and if blktrace is not shutdown through ioctl before closing the disk, debugfs entries will be leaked. Fix this problem by shutdown blktrace in disk_release(), this is safe because blk_trace_remove() is reentrant. Fixes: `99d055b4fd` ("block: remove per-disk debugfs files in blk_unregister_queue") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230610022003.2557284-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-14 20:24:03 -06:00
Yu Kuai	db59133e92	scsi: sg: fix blktrace debugfs entries leakage sg_ioctl() support to enable blktrace, which will create debugfs entries "/sys/kernel/debug/block/sgx/", however, there is no guarantee that user will remove these entries through ioctl, and deleting sg device doesn't cleanup these blktrace entries. This problem can be fixed by cleanup blktrace while releasing request_queue, however, it's not a good idea to do this special handling in common layer just for sg device. Fix this problem by shutdown blktrace in sg_device_destroy(), where the device is deleted and all the users close the device, also grab a scsi_device reference from sg_add_device() to prevent scsi_device to be freed before sg_device_destroy(); Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230610022003.2557284-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-14 20:24:03 -06:00
Yu Kuai	cbe7cff4a7	blktrace: use inline function for blk_trace_remove() while blktrace is disabled If config is disabled, call blk_trace_remove() directly will trigger build warning, hence use inline function instead, prepare to fix blktrace debugfs entries leakage. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230610022003.2557284-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-14 20:24:03 -06:00
Pankaj Raghav	6dd4423f3f	brd: use cond_resched instead of cond_resched_rcu The body of the loop is run without RCU lock held. Use the regular cond_resched() instead of cond_resched_rcu(). Fixes: `786bb02458` ("brd: use XArray instead of radix-tree to index backing pages") Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230614133538.1279369-1-p.raghav@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-14 11:13:07 -06:00
Ed Tsai	30654614f3	blk-mq: check on cpu id when there is only one ctx mapping commit `f168420c62` ("blk-mq: don't redirect completion for hctx withs only one ctx mapping") When nvme applies a 1:1 mapping of hctx and ctx, there will be no remote request. But for ufs, the submission and completion queues could be asymmetric. (e.g. Multiple SQs share one CQ) Therefore, 1:1 mapping of hctx and ctx won't complete request on the submission cpu. In this situation, this nr_ctx check could violate the QUEUE_FLAG_SAME_FORCE, as a result, check on cpu id when there is only one ctx mapping. Signed-off-by: Ed Tsai <ed.tsai@mediatek.com> Signed-off-by: Po-Wen Kao <powen.kao@mediatek.com> Suggested-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230614002529.6636-1-ed.tsai@mediatek.com [axboe: fixed up indentation] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-14 11:11:25 -06:00
Jens Axboe	6070131176	Merge tag 'md-next-20230613' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.5/block Pull MD updates from Song: "The major changes are: 1. Protect md_thread with rcu, by Yu Kuai; 2. Various non-urgent raid5 and raid1/10 fixes, by Yu Kuai; 3. Non-urgent raid10 fixes, by Li Nan." * tag 'md-next-20230613' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (29 commits) md/raid1-10: limit the number of plugged bio md/raid1-10: don't handle pluged bio by daemon thread md/md-bitmap: add a new helper to unplug bitmap asynchrously md/raid1-10: submit write io directly if bitmap is not enabled md/raid1-10: factor out a helper to submit normal write md/raid1-10: factor out a helper to add bio to plug md/raid10: prevent soft lockup while flush writes md/raid10: fix io loss while replacement replace rdev md/raid10: Do not add spare disk when recovery fails md/raid10: clean up md_add_new_disk() md/raid10: prioritize adding disk to 'removed' mirror md/raid10: improve code of mrdev in raid10_sync_request md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request md/raid5: don't start reshape when recovery or replace is in progress md: protect md_thread with rcu md/bitmap: factor out a helper to set timeout md/bitmap: always wake up md_thread in timeout_store dm-raid: remove useless checking in raid_message() md: factor out a helper to wake up md_thread directly md: fix duplicate filename for rdev ...	2023-06-14 06:58:43 -06:00
David Howells	d44c404207	block: Fix dio_cleanup() to advance the head index Fix dio_bio_cleanup() to advance the head index into the list of pages past the pages it has released, as __blockdev_direct_IO() will call it twice if do_direct_IO() fails. The issue was causing: WARNING: CPU: 6 PID: 2220 at mm/gup.c:76 try_get_folio This can be triggered by setting up a clean pair of UDF filesystems on loopback devices and running the generic/451 xfstest with them as the scratch and test partitions. Something like the following: fallocate /mnt2/udf_scratch -l 1G fallocate /mnt2/udf_test -l 1G mknod /dev/lo0 b 7 0 mknod /dev/lo1 b 7 1 losetup lo0 /mnt2/udf_scratch losetup lo1 /mnt2/udf_test mkfs -t udf /dev/lo0 mkfs -t udf /dev/lo1 cd xfstests ./check generic/451 with xfstests configured by putting the following into local.config: export FSTYP=udf export DISABLE_UDF_TEST=1 export TEST_DEV=/dev/lo1 export TEST_DIR=/xfstest.test export SCRATCH_DEV=/dev/lo0 export SCRATCH_MNT=/xfstest.scratch Fixes: `1ccf164ec8` ("block: Use iov_iter_extract_pages() and page pinning in direct-io.c") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202306120931.a9606b88-oliver.sang@intel.com Signed-off-by: David Howells <dhowells@redhat.com> cc: Christoph Hellwig <hch@infradead.org> cc: David Hildenbrand <david@redhat.com> cc: Andrew Morton <akpm@linux-foundation.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Matthew Wilcox <willy@infradead.org> cc: Jan Kara <jack@suse.cz> cc: Jeff Layton <jlayton@kernel.org> cc: Jason Gunthorpe <jgg@nvidia.com> cc: Logan Gunthorpe <logang@deltatee.com> cc: Hillf Danton <hdanton@sina.com> cc: Christian Brauner <brauner@kernel.org> cc: Linus Torvalds <torvalds@linux-foundation.org> cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-kernel@vger.kernel.org cc: linux-mm@kvack.org Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christoph Hellwig 
<hch@lst.de> Link: https://lore.kernel.org/r/1193485.1686693279@warthog.procyon.org.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-14 06:58:18 -06:00
Yu Kuai	460af1f9d9	md/raid1-10: limit the number of plugged bio bio can be added to plug infinitely, and following writeback test can trigger huge amount of plugged bio: Test script: modprobe brd rd_nr=4 rd_size=10485760 mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean --bitmap=internal echo 0 > /proc/sys/vm/dirty_background_ratio fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test Test result: Monitor /sys/block/md0/inflight will found that inflight keep increasing until fio finish writing, after running for about 2 minutes: [root@fedora ~]# cat /sys/block/md0/inflight 0 4474191 Fix the problem by limiting the number of plugged bio based on the number of copies for original bio. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-8-yukuai1@huaweicloud.com	2023-06-13 15:25:44 -07:00
Yu Kuai	9efcc2c3df	md/raid1-10: don't handle pluged bio by daemon thread current->bio_list will be set under submit_bio() context, in this case bitmap io will be added to the list and wait for current io submission to finish, while current io submission must wait for bitmap io to be done. commit `874807a831` ("md/raid1{,0}: fix deadlock in bitmap_unplug.") fix the deadlock by handling plugged bio by daemon thread. On the one hand, the deadlock won't exist after commit `a214b949d8` ("blk-mq: only flush requests from the plug in blk_mq_submit_bio"). On the other hand, current solution makes it impossible to flush plugged bio in raid1/10_make_request(), because this will cause that all the writes will goto daemon thread. In order to limit the number of plugged bio, commit `874807a831` ("md/raid1{,0}: fix deadlock in bitmap_unplug.") is reverted, and the deadlock is fixed by handling bitmap io asynchronously. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-7-yukuai1@huaweicloud.com	2023-06-13 15:25:44 -07:00
Yu Kuai	a022325ab9	md/md-bitmap: add a new helper to unplug bitmap asynchrously If bitmap is enabled, bitmap must update before submitting write io, this is why unplug callback must move these io to 'conf->pending_io_list' if 'current->bio_list' is not empty, which will suffer performance degradation. A new helper md_bitmap_unplug_async() is introduced to submit bitmap io in a kworker, so that submit bitmap io in raid10_unplug() doesn't require that 'current->bio_list' is empty. This patch prepare to limit the number of plugged bio. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-6-yukuai1@huaweicloud.com	2023-06-13 15:25:44 -07:00
Yu Kuai	7db922bae3	md/raid1-10: submit write io directly if bitmap is not enabled Commit `6cce3b23f6` ("[PATCH] md: write intent bitmap support for raid10") add bitmap support, and it changed that write io is submitted through daemon thread because bitmap need to be updated before write io. And later, plug is used to fix performance regression because all the write io will go to daemon thread, which means io can't be issued concurrently. However, if bitmap is not enabled, the write io should not go to daemon thread in the first place, and plug is not needed as well. Fixes: `6cce3b23f6` ("[PATCH] md: write intent bitmap support for raid10") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-5-yukuai1@huaweicloud.com	2023-06-13 15:25:44 -07:00
Yu Kuai	8295efbe68	md/raid1-10: factor out a helper to submit normal write There are multiple places to do the same thing, factor out a helper to prevent redundant code, and the helper will be used in following patch as well. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-4-yukuai1@huaweicloud.com	2023-06-13 15:25:43 -07:00
Yu Kuai	5ec6ca140a	md/raid1-10: factor out a helper to add bio to plug The code in raid1 and raid10 is identical, prepare to limit the number of plugged bios. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-3-yukuai1@huaweicloud.com	2023-06-13 15:25:43 -07:00

1185645 Commits