2019-04-30 14:42:43 -04:00
// SPDX-License-Identifier: GPL-2.0
2005-04-16 15:20:36 -07:00
/*
* gendisk handling
2020-12-10 08:55:44 +01:00
*
* Portions Copyright (C) 2020 Christoph Hellwig
2005-04-16 15:20:36 -07:00
*/
#include <linux/module.h>
2020-03-24 08:25:13 +01:00
#include <linux/ctype.h>
2005-04-16 15:20:36 -07:00
#include <linux/fs.h>
2007-02-20 13:57:48 -08:00
#include <linux/kdev_t.h>
2005-04-16 15:20:36 -07:00
#include <linux/kernel.h>
#include <linux/blkdev.h>
2015-05-22 17:13:32 -04:00
#include <linux/backing-dev.h>
2005-04-16 15:20:36 -07:00
#include <linux/init.h>
#include <linux/spinlock.h>
2008-10-04 23:53:21 +04:00
#include <linux/proc_fs.h>
2005-04-16 15:20:36 -07:00
#include <linux/seq_file.h>
#include <linux/slab.h>
#include <linux/kmod.h>
2021-09-20 14:33:25 +02:00
#include <linux/major.h>
2006-02-06 14:12:43 -08:00
#include <linux/mutex.h>
2008-08-25 19:47:22 +09:00
#include <linux/idr.h>
implement in-kernel gendisk events handling
Currently, media presence polling for removable block devices is done
from userland. There are several issues with this.
* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows, which issues only a
single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried with such command
sequences.
* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.
* Userland polling is unnecessarily heavy; an in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).
This patch implements a framework for in-kernel disk event handling,
which includes media presence polling.
* bdops->check_events() is added, which supersedes ->media_changed().
It should check whether there's any pending event and, if so, return it.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called concurrently.
* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.
* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs control polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.
* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.
* There are event 'clearing' events. For example, both of the currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.
* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.
* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_changed() will be dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
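The interval rules in the commit message above (per-device setting, system default, and the "all events are asynchronous" exception) can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel implementation: `effective_poll_msecs()`, its parameters and the `EV_*` bit values are hypothetical names invented for this example.

```c
/* Placeholder event bits for illustration; not the kernel's definitions. */
#define EV_MEDIA_CHANGE   (1u << 0)
#define EV_EJECT_REQUEST  (1u << 1)

/*
 * Hypothetical helper: pick the polling interval for one device.
 *   events       - mask of all events the device supports
 *   async_events - mask of events the device reports without polling
 *   dev_msecs    - per-device setting, -1 means "use the system setting"
 *   sys_msecs    - system-wide default, 0 means polling is disabled
 * Returns the interval in milliseconds, or 0 for "do not poll".
 */
long effective_poll_msecs(unsigned int events, unsigned int async_events,
			  long dev_msecs, long sys_msecs)
{
	/* An explicitly set per-device interval always wins. */
	if (dev_msecs >= 0)
		return dev_msecs;

	/* Use the system default only if some event actually needs polling. */
	if (events & ~async_events)
		return sys_msecs;

	/* Every supported event is reported asynchronously: no polling. */
	return 0;
}
```

With a system default of 2000 ms, a device whose only event is also in its asynchronous mask is not polled unless its own interval is set explicitly.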
2010-12-08 20:57:37 +01:00
#include <linux/log2.h>
2013-02-22 16:34:13 -08:00
#include <linux/pm_runtime.h>
2016-01-09 08:36:51 -08:00
#include <linux/badblocks.h>
2021-11-23 19:53:12 +01:00
#include <linux/part_stat.h>
2022-03-18 21:01:44 +08:00
#include "blk-throttle.h"
2005-04-16 15:20:36 -07:00
2008-03-04 11:23:45 +01:00
#include "blk.h"
2021-11-23 19:53:08 +01:00
#include "blk-mq-sched.h"
2021-09-29 09:12:40 +02:00
#include "blk-rq-qos.h"
2022-03-08 06:51:55 +01:00
#include "blk-cgroup.h"
2008-03-04 11:23:45 +01:00
2020-03-25 16:48:35 +01:00
static struct kobject *block_depr;
2005-04-16 15:20:36 -07:00
block: add disk sequence number
Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.
Block devices do not have exclusive owners in userspace, any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events cannot thus reliably tell
whether an event relates to the device it just set up or another earlier
instance with the same name.
Being able to set a UUID on a loop device would solve the race conditions,
but it would not allow deriving an ordering from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.
Associating a unique, monotonically increasing sequential number with the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, solves the race conditions with
uevents, and also lets userspace processes know whether they should
wait for the uevent they need or whether it was dropped and they should
move on.
Additionally, increment the disk sequence number when the media changes,
i.e. on a DISK_EVENT_MEDIA_CHANGE event.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-2-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-13 01:05:25 +02:00
/*
 * Unique, monotonically increasing sequential number associated with block
 * device instances (i.e. incremented each time a device is attached).
 * Associating uevents with block devices in userspace is difficult and racy:
 * the uevent netlink socket is lossy, and on slow and overloaded systems has
 * a very high latency.
 * Block devices do not have exclusive owners in userspace; any process can set
 * one up (e.g. loop devices). Moreover, device names can be reused (e.g. loop0
 * can be reused again and again).
 * A userspace process setting up a block device and watching for its events
 * cannot thus reliably tell whether an event relates to the device it just set
 * up or another earlier instance with the same name.
 * This sequential number allows userspace processes to solve this problem, and
 * uniquely associate a uevent with the lifetime of a device.
 */
static atomic64_t diskseq;
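As a rough userspace analogue of the counter described above, the following sketch hands out sequence numbers with C11 atomics. It assumes nothing about the kernel beyond what the comment states; `diskseq_counter`, `next_diskseq()` and `event_is_ours()` are hypothetical names used only here.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Userspace stand-in for a global sequence counter; starts at zero. */
static _Atomic uint64_t diskseq_counter;

/*
 * Hypothetical helper: return the next sequence number.
 * atomic_fetch_add() yields the previous value, so adding one gives the
 * number owned by this caller. Numbers are never handed out twice, even
 * when a device name such as loop0 is reused.
 */
uint64_t next_diskseq(void)
{
	return atomic_fetch_add(&diskseq_counter, 1) + 1;
}

/*
 * Hypothetical consumer check: an event refers to "our" device instance
 * only if it carries the sequence number we obtained at setup time.
 */
int event_is_ours(uint64_t our_seq, uint64_t event_seq)
{
	return our_seq == event_seq;
}
```

Because the value only grows, a consumer can match on the number instead of the reusable device name.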
2008-08-25 19:47:22 +09:00
/* for extended dynamic devt allocation, currently only one major is used */
2013-02-27 17:03:56 -08:00
#define NR_EXT_DEVT		(1 << MINORBITS)
2020-11-26 09:23:26 +01:00
static DEFINE_IDA(ext_devt_ida);
2008-08-25 19:47:22 +09:00
2020-11-26 18:43:37 +01:00
void set_capacity(struct gendisk *disk, sector_t sectors)
{
2020-11-26 18:47:17 +01:00
	struct block_device *bdev = disk->part0;
2020-11-26 18:43:37 +01:00
2021-03-01 12:04:02 +09:00
	spin_lock(&bdev->bd_size_lock);
2020-11-26 18:43:37 +01:00
	i_size_write(bdev->bd_inode, (loff_t)sectors << SECTOR_SHIFT);
2021-10-18 11:39:45 -06:00
	bdev->bd_nr_sectors = sectors;
2021-03-01 12:04:02 +09:00
	spin_unlock(&bdev->bd_size_lock);
2020-11-26 18:43:37 +01:00
}
EXPORT_SYMBOL(set_capacity);
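The size written to the inode above is the sector count shifted left by SECTOR_SHIFT. A small userspace sketch of that conversion follows; it assumes 512-byte sectors (SECTOR_SHIFT of 9), and `sectors_to_bytes()` is a hypothetical helper name, not a kernel function.

```c
#include <stdint.h>

/* 512-byte sectors, i.e. a shift of 9 bits. */
#define SECTOR_SHIFT 9

/*
 * Hypothetical helper: convert a sector count to a byte size.
 * The value is converted to the signed 64-bit result type before the
 * shift, mirroring the (loff_t) cast in set_capacity(), so the shift is
 * performed in the type that will hold the byte size.
 */
int64_t sectors_to_bytes(uint64_t sectors)
{
	return (int64_t)sectors << SECTOR_SHIFT;
}
```

For example, a capacity of 2048 sectors corresponds to 1 MiB.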
2020-03-13 05:30:05 +00:00
/*
2020-11-16 15:56:56 +01:00
 * Set disk capacity and notify if the size is not currently zero and will not
 * be set to zero.  Returns true if a uevent was sent, otherwise false.
2020-03-13 05:30:05 +00:00
 */
2020-11-16 15:56:56 +01:00
bool set_capacity_and_notify(struct gendisk *disk, sector_t size)
2020-03-13 05:30:05 +00:00
{
	sector_t capacity = get_capacity(disk);
2020-11-26 18:43:37 +01:00
	char *envp[] = { "RESIZE=1", NULL };
2020-03-13 05:30:05 +00:00
	set_capacity(disk, size);
2020-11-26 18:43:37 +01:00
	/*
	 * Only print a message and send a uevent if the gendisk is user visible
	 * and alive.  This avoids spamming the log and udev when setting the
	 * initial capacity during probing.
	 */
	if (size == capacity ||
2021-08-09 08:40:28 +02:00
	    !disk_live(disk) ||
	    (disk->flags & GENHD_FL_HIDDEN))
2020-11-26 18:43:37 +01:00
		return false;
2020-03-13 05:30:05 +00:00
2020-11-26 18:43:37 +01:00
	pr_info("%s: detected capacity change from %lld to %lld\n",
2021-02-23 16:50:15 +08:00
		disk->disk_name, capacity, size);
2020-11-12 17:50:04 +01:00
2020-11-26 18:43:37 +01:00
	/*
	 * Historically we did not send a uevent for changes to/from an empty
	 * device.
	 */
	if (!capacity || !size)
		return false;
	kobject_uevent_env(&disk_to_dev(disk)->kobj, KOBJ_CHANGE, envp);
	return true;
2020-03-13 05:30:05 +00:00
}
2020-11-16 15:56:56 +01:00
EXPORT_SYMBOL_GPL(set_capacity_and_notify);
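The three outcomes of set_capacity_and_notify() (nothing, log message only, log message plus uevent) can be restated as a pure decision function. This is a userspace sketch for illustration; `enum resize_action` and `classify_resize()` are hypothetical names, and the `live` and `hidden` flags stand in for disk_live() and GENHD_FL_HIDDEN.

```c
#include <stdbool.h>
#include <stdint.h>

enum resize_action {
	RESIZE_SILENT,		/* no log message, no uevent */
	RESIZE_LOG_ONLY,	/* log message, but no uevent */
	RESIZE_NOTIFY,		/* log message and RESIZE=1 uevent */
};

/* Hypothetical helper: decide what a capacity update should trigger. */
enum resize_action classify_resize(uint64_t old_sectors, uint64_t new_sectors,
				   bool live, bool hidden)
{
	/* Unchanged size, or a disk that is not visible and alive: stay quiet. */
	if (new_sectors == old_sectors || !live || hidden)
		return RESIZE_SILENT;

	/* Changes to or from an empty device are logged but send no uevent. */
	if (old_sectors == 0 || new_sectors == 0)
		return RESIZE_LOG_ONLY;

	return RESIZE_NOTIFY;
}
```

Only the last case corresponds to the function returning true.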
2020-03-13 05:30:05 +00:00
2020-11-27 16:43:51 +01:00
static void part_stat_read_all(struct block_device *part,
		struct disk_stats *stat)
2020-03-25 16:07:06 +03:00
{
	int cpu;
	memset(stat, 0, sizeof(struct disk_stats));
	for_each_possible_cpu(cpu) {
2020-11-27 16:43:51 +01:00
		struct disk_stats *ptr = per_cpu_ptr(part->bd_stats, cpu);
2020-03-25 16:07:06 +03:00
		int group;
		for (group = 0; group < NR_STAT_GROUPS; group++) {
			stat->nsecs[group] += ptr->nsecs[group];
			stat->sectors[group] += ptr->sectors[group];
			stat->ios[group] += ptr->ios[group];
			stat->merges[group] += ptr->merges[group];
		}
		stat->io_ticks += ptr->io_ticks;
	}
}
2020-11-24 09:36:54 +01:00
static unsigned int part_in_flight(struct block_device *part)
2017-08-08 17:51:45 -06:00
{
2020-05-13 12:49:33 +02:00
	unsigned int inflight = 0;
2018-12-06 11:41:20 -05:00
	int cpu;
2017-08-08 17:51:45 -06:00
2018-12-06 11:41:20 -05:00
	for_each_possible_cpu(cpu) {
2018-12-06 11:41:21 -05:00
		inflight += part_stat_local_read_cpu(part, in_flight[0], cpu) +
			    part_stat_local_read_cpu(part, in_flight[1], cpu);
2018-12-06 11:41:20 -05:00
	}
2018-12-06 11:41:21 -05:00
	if ((int)inflight < 0)
		inflight = 0;
2018-12-06 11:41:20 -05:00
2018-12-06 11:41:21 -05:00
	return inflight;
2017-08-08 17:51:45 -06:00
}
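The clamp at the end of part_in_flight() makes sense once the per-CPU slots are viewed as signed deltas: a request may be counted as started on one CPU and as completed on another, so a single slot, or a sum read while updates are in progress, can be negative. The userspace sketch below illustrates that idea under this assumption; `sum_in_flight()` and the plain `int` array are hypothetical stand-ins for the per-CPU counters.

```c
/*
 * Hypothetical helper: sum signed per-CPU deltas and clamp at zero.
 *   per_cpu - one signed counter per CPU (increments minus decrements)
 *   ncpus   - number of entries in per_cpu
 */
unsigned int sum_in_flight(const int *per_cpu, int ncpus)
{
	long long total = 0;
	int i;

	for (i = 0; i < ncpus; i++)
		total += per_cpu[i];

	/* A transiently negative sum is reported as "nothing in flight". */
	if (total < 0)
		total = 0;

	return (unsigned int)total;
}
```

A wider accumulator is used here so the sign check does not rely on converting an unsigned value back to `int`.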
2020-11-24 09:36:54 +01:00
static void part_in_flight_rw(struct block_device *part,
		unsigned int inflight[2])
2018-04-26 00:21:59 -07:00
{
2018-12-06 11:41:20 -05:00
	int cpu;
	inflight[0] = 0;
	inflight[1] = 0;
	for_each_possible_cpu(cpu) {
		inflight[0] += part_stat_local_read_cpu(part, in_flight[0], cpu);
		inflight[1] += part_stat_local_read_cpu(part, in_flight[1], cpu);
	}
	if ((int)inflight[0] < 0)
		inflight[0] = 0;
	if ((int)inflight[1] < 0)
		inflight[1] = 0;
2018-04-26 00:21:59 -07:00
}
2005-04-16 15:20:36 -07:00
/*
 * Can be deleted altogether. Later.
 *
 */
2017-06-16 17:48:21 -06:00
#define BLKDEV_MAJOR_HASH_SIZE 255
2005-04-16 15:20:36 -07:00
static struct blk_major_name {
	struct blk_major_name *next;
	int major;
	char name[16];
2022-01-04 08:16:47 +01:00
#ifdef CONFIG_BLOCK_LEGACY_AUTOLOAD
2020-10-29 15:58:28 +01:00
	void (*probe)(dev_t devt);
2022-01-04 08:16:47 +01:00
#endif
2006-03-31 02:30:32 -08:00
} *major_names[BLKDEV_MAJOR_HASH_SIZE];
2020-10-29 15:58:26 +01:00
static DEFINE_MUTEX(major_names_lock);
block: genhd: don't call blkdev_show() with major_names_lock held
If CONFIG_BLK_DEV_LOOP && CONFIG_MTD (at least; there might be other
combinations), lockdep reports a circular locking dependency at
__loop_clr_fd(), because major_names_lock serves as a hub that
aggregates locking dependencies across multiple block modules.
======================================================
WARNING: possible circular locking dependency detected
5.14.0+ #757 Tainted: G E
------------------------------------------------------
systemd-udevd/7568 is trying to acquire lock:
ffff88800f334d48 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x70/0x560
but task is already holding lock:
ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (&lo->lo_mutex){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_killable_nested+0x17/0x20
lo_open+0x23/0x50 [loop]
blkdev_get_by_dev+0x199/0x540
blkdev_open+0x58/0x90
do_dentry_open+0x144/0x3a0
path_openat+0xa57/0xda0
do_filp_open+0x9f/0x140
do_sys_openat2+0x71/0x150
__x64_sys_openat+0x78/0xa0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #5 (&disk->open_mutex){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
bd_register_pending_holders+0x20/0x100
device_add_disk+0x1ae/0x390
loop_add+0x29c/0x2d0 [loop]
blk_request_module+0x5a/0xb0
blkdev_get_no_open+0x27/0xa0
blkdev_get_by_dev+0x5f/0x540
blkdev_open+0x58/0x90
do_dentry_open+0x144/0x3a0
path_openat+0xa57/0xda0
do_filp_open+0x9f/0x140
do_sys_openat2+0x71/0x150
__x64_sys_openat+0x78/0xa0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #4 (major_names_lock){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
blkdev_show+0x19/0x80
devinfo_show+0x52/0x60
seq_read_iter+0x2d5/0x3e0
proc_reg_read_iter+0x41/0x80
vfs_read+0x2ac/0x330
ksys_read+0x6b/0xd0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&p->lock){+.+.}-{3:3}:
lock_acquire+0xbe/0x1f0
__mutex_lock_common+0xb6/0xe10
mutex_lock_nested+0x17/0x20
seq_read_iter+0x37/0x3e0
generic_file_splice_read+0xf3/0x170
splice_direct_to_actor+0x14e/0x350
do_splice_direct+0x84/0xd0
do_sendfile+0x263/0x430
__se_sys_sendfile64+0x96/0xc0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#3){.+.+}-{0:0}:
lock_acquire+0xbe/0x1f0
lo_write_bvec+0x96/0x280 [loop]
loop_process_work+0xa68/0xc10 [loop]
process_one_work+0x293/0x480
worker_thread+0x23d/0x4b0
kthread+0x163/0x180
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
lock_acquire+0xbe/0x1f0
process_one_work+0x280/0x480
worker_thread+0x23d/0x4b0
kthread+0x163/0x180
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
validate_chain+0x1f0d/0x33e0
__lock_acquire+0x92d/0x1030
lock_acquire+0xbe/0x1f0
flush_workqueue+0x8c/0x560
drain_workqueue+0x80/0x140
destroy_workqueue+0x47/0x4f0
__loop_clr_fd+0xb4/0x400 [loop]
blkdev_put+0x14a/0x1d0
blkdev_close+0x1c/0x20
__fput+0xfd/0x220
task_work_run+0x69/0xc0
exit_to_user_mode_prepare+0x1ce/0x1f0
syscall_exit_to_user_mode+0x26/0x60
do_syscall_64+0x4c/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
2 locks held by systemd-udevd/7568:
#0: ffff888012554128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x4c/0x1d0
#1: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]
stack backtrace:
CPU: 0 PID: 7568 Comm: systemd-udevd Tainted: G E 5.14.0+ #757
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
Call Trace:
dump_stack_lvl+0x79/0xbf
print_circular_bug+0x5d6/0x5e0
? stack_trace_save+0x42/0x60
? save_trace+0x3d/0x2d0
check_noncircular+0x10b/0x120
validate_chain+0x1f0d/0x33e0
? __lock_acquire+0x953/0x1030
? __lock_acquire+0x953/0x1030
__lock_acquire+0x92d/0x1030
? flush_workqueue+0x70/0x560
lock_acquire+0xbe/0x1f0
? flush_workqueue+0x70/0x560
flush_workqueue+0x8c/0x560
? flush_workqueue+0x70/0x560
? sched_clock_cpu+0xe/0x1a0
? drain_workqueue+0x41/0x140
drain_workqueue+0x80/0x140
destroy_workqueue+0x47/0x4f0
? blk_mq_freeze_queue_wait+0xac/0xd0
__loop_clr_fd+0xb4/0x400 [loop]
? __mutex_unlock_slowpath+0x35/0x230
blkdev_put+0x14a/0x1d0
blkdev_close+0x1c/0x20
__fput+0xfd/0x220
task_work_run+0x69/0xc0
exit_to_user_mode_prepare+0x1ce/0x1f0
syscall_exit_to_user_mode+0x26/0x60
do_syscall_64+0x4c/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f0fd4c661f7
Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 13 fc ff ff
RSP: 002b:00007ffd1c9e9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 00007f0fd46be6c8 RCX: 00007f0fd4c661f7
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006
RBP: 0000000000000006 R08: 000055fff1eaf400 R09: 0000000000000000
R10: 00007f0fd46be6c8 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000002f08 R15: 00007ffd1c9ea050
Commit 1c500ad706383f1a ("loop: reduce the loop_ctl_mutex scope") was meant
to break the "loop_ctl_mutex => &lo->lo_mutex" dependency chain. But enabling
a different block module still forms a circular locking dependency
because of the shared major_names_lock mutex.
The simplest fix is to call the probe function without holding
major_names_lock [1], but Christoph Hellwig does not like that idea.
Therefore, instead of holding major_names_lock in blkdev_show(),
introduce a different lock for blkdev_show() in order to break the
"sb_writers#$N => &p->lock => major_names_lock" dependency chain.
Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 20:52:13 +09:00
static DEFINE_SPINLOCK(major_names_spinlock);
2005-04-16 15:20:36 -07:00
/* index in the above - for now: assume no multimajor ranges */
2010-12-17 09:00:18 +01:00
static inline int major_to_index(unsigned major)
2005-04-16 15:20:36 -07:00
{
2006-03-31 02:30:32 -08:00
	return major % BLKDEV_MAJOR_HASH_SIZE;
2006-01-14 13:20:38 -08:00
}
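Because major_to_index() reduces the major number modulo the table size, two different majors can share a bucket, and lookups have to compare the full major while walking the chain. The following userspace sketch shows that pattern; it is not the kernel's registration code, and `struct major_entry`, `major_add()`, `major_find()` and the fixed-size pool are hypothetical constructs for this example.

```c
#include <stddef.h>

#define MAJOR_HASH_SIZE 255	/* bucket count, matching the table above */
#define POOL_SIZE 8		/* arbitrary capacity for this sketch */

struct major_entry {
	struct major_entry *next;
	unsigned int major;
	const char *name;
};

static struct major_entry *buckets[MAJOR_HASH_SIZE];
static struct major_entry pool[POOL_SIZE];
static size_t pool_used;

unsigned int major_index(unsigned int major)
{
	return major % MAJOR_HASH_SIZE;
}

/* Add an entry at the head of its bucket; returns 0, or -1 if the pool is full. */
int major_add(unsigned int major, const char *name)
{
	struct major_entry *e;

	if (pool_used == POOL_SIZE)
		return -1;
	e = &pool[pool_used++];
	e->major = major;
	e->name = name;
	e->next = buckets[major_index(major)];
	buckets[major_index(major)] = e;
	return 0;
}

/* Walk the bucket and compare the full major, since buckets are shared. */
const char *major_find(unsigned int major)
{
	const struct major_entry *e;

	for (e = buckets[major_index(major)]; e; e = e->next)
		if (e->major == major)
			return e->name;
	return NULL;
}
```

Majors 8 and 263 land in the same bucket here (263 % 255 == 8), which is why the equality check inside the loop is needed.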
2006-03-31 02:30:32 -08:00
#ifdef CONFIG_PROC_FS
2008-09-03 09:01:09 +02:00
void blkdev_show(struct seq_file *seqf, off_t offset)
2006-01-14 13:20:38 -08:00
{
2006-03-31 02:30:32 -08:00
	struct blk_major_name *dp;
2006-01-14 13:20:38 -08:00
2021-09-07 20:52:13 +09:00
	spin_lock(&major_names_spinlock);
2017-06-16 17:48:21 -06:00
	for (dp = major_names[major_to_index(offset)]; dp; dp = dp->next)
		if (dp->major == offset)
2008-09-03 09:01:09 +02:00
			seq_printf(seqf, "%3d %s\n", dp->major, dp->name);
2021-09-07 20:52:13 +09:00
spin_unlock ( & major_names_spinlock ) ;
2005-04-16 15:20:36 -07:00
}
2006-03-31 02:30:32 -08:00
# endif /* CONFIG_PROC_FS */
2005-04-16 15:20:36 -07:00
2009-02-20 08:12:51 +01:00
/**
2020-11-14 18:08:21 +01:00
* __register_blkdev - register a new block device
2009-02-20 08:12:51 +01:00
*
2018-02-05 18:25:27 -08:00
* @ major : the requested major device number [ 1. . BLKDEV_MAJOR_MAX - 1 ] . If
* @ major = 0 , try to allocate any unused major number .
2009-02-20 08:12:51 +01:00
* @ name : the name of the new block device as a zero terminated string
2021-11-03 16:04:34 -07:00
* @ probe : pre - devtmpfs / pre - udev callback used to create disks when their
* pre - created device node is accessed . When a probe call uses
* add_disk ( ) and it fails , the driver must clean up resources . This
* interface may soon be removed .
2009-02-20 08:12:51 +01:00
*
* The @ name must be unique within the system .
*
2017-03-30 17:11:36 -03:00
* The return value depends on the @ major input parameter :
*
2018-02-05 18:25:27 -08:00
* - if a major device number was requested in range [ 1. . BLKDEV_MAJOR_MAX - 1 ]
* then the function returns zero on success , or a negative error code
2017-03-30 17:11:36 -03:00
* - if any unused major number was requested with @ major = 0 parameter
2009-02-20 08:12:51 +01:00
* then the return value is the allocated major number in range
2018-02-05 18:25:27 -08:00
* [ 1. . BLKDEV_MAJOR_MAX - 1 ] or a negative error code otherwise
*
* See Documentation / admin - guide / devices . txt for the list of allocated
* major numbers .
2020-11-14 18:08:21 +01:00
*
* Use register_blkdev instead for any new code .
2009-02-20 08:12:51 +01:00
*/
2020-10-29 15:58:28 +01:00
int __register_blkdev ( unsigned int major , const char * name ,
void ( * probe ) ( dev_t devt ) )
2005-04-16 15:20:36 -07:00
{
struct blk_major_name * * n , * p ;
int index , ret = 0 ;
2020-10-29 15:58:26 +01:00
mutex_lock ( & major_names_lock ) ;
2005-04-16 15:20:36 -07:00
/* temporary */
if ( major = = 0 ) {
for ( index = ARRAY_SIZE ( major_names ) - 1 ; index > 0 ; index - - ) {
if ( major_names [ index ] = = NULL )
break ;
}
if ( index = = 0 ) {
2019-02-17 10:21:56 -05:00
printk ( " %s: failed to get major for %s \n " ,
__func__ , name ) ;
2005-04-16 15:20:36 -07:00
ret = - EBUSY ;
goto out ;
}
major = index ;
ret = major ;
}
2017-06-16 17:48:21 -06:00
if ( major > = BLKDEV_MAJOR_MAX ) {
2019-02-17 10:21:56 -05:00
pr_err ( " %s: major requested (%u) is greater than the maximum (%u) for %s \n " ,
__func__ , major , BLKDEV_MAJOR_MAX - 1 , name ) ;
2017-06-16 17:48:21 -06:00
ret = - EINVAL ;
goto out ;
}
2005-04-16 15:20:36 -07:00
p = kmalloc ( sizeof ( struct blk_major_name ) , GFP_KERNEL ) ;
if ( p = = NULL ) {
ret = - ENOMEM ;
goto out ;
}
p - > major = major ;
2022-01-04 08:16:47 +01:00
# ifdef CONFIG_BLOCK_LEGACY_AUTOLOAD
2020-10-29 15:58:28 +01:00
p - > probe = probe ;
2022-01-04 08:16:47 +01:00
# endif
2005-04-16 15:20:36 -07:00
strlcpy ( p - > name , name , sizeof ( p - > name ) ) ;
p - > next = NULL ;
index = major_to_index ( major ) ;
2021-09-07 20:52:13 +09:00
spin_lock ( & major_names_spinlock ) ;
2005-04-16 15:20:36 -07:00
for ( n = & major_names [ index ] ; * n ; n = & ( * n ) - > next ) {
if ( ( * n ) - > major = = major )
break ;
}
if ( ! * n )
* n = p ;
else
ret = - EBUSY ;
2021-09-07 20:52:13 +09:00
spin_unlock ( & major_names_spinlock ) ;
2005-04-16 15:20:36 -07:00
if ( ret < 0 ) {
2018-02-05 18:25:27 -08:00
printk ( " register_blkdev: cannot get major %u for %s \n " ,
2005-04-16 15:20:36 -07:00
major , name ) ;
kfree ( p ) ;
}
out :
2020-10-29 15:58:26 +01:00
mutex_unlock ( & major_names_lock ) ;
2005-04-16 15:20:36 -07:00
return ret ;
}
2020-10-29 15:58:28 +01:00
EXPORT_SYMBOL ( __register_blkdev ) ;
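The registration logic above can be sketched as a small userspace model. This is only an illustration, not kernel code: the `model_*` names are hypothetical, locking is omitted, and the bucket count of 255 and the exclusive limit of 512 are assumptions standing in for the size of major_names and for BLKDEV_MAJOR_MAX, neither of which is shown in this section.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_HASH_SIZE 255	/* assumed number of buckets in the table */
#define MODEL_MAJOR_MAX 512	/* assumed exclusive upper bound on majors */

/* Hypothetical stand-in for struct blk_major_name (no probe callback). */
struct model_major_name {
	struct model_major_name *next;
	unsigned int major;
	char name[16];
};

static struct model_major_name *model_names[MODEL_HASH_SIZE];

/* Assumed hash: the major number modulo the bucket count. */
static int model_index(unsigned int major)
{
	return major % MODEL_HASH_SIZE;
}

/*
 * Returns 0 for an explicitly requested major, the allocated major when
 * major == 0, or a negative errno value on failure.
 */
static int model_register(unsigned int major, const char *name)
{
	struct model_major_name **n, *p;
	int index, ret = 0;

	assert(name != NULL);

	if (major == 0) {
		/* Dynamic allocation: scan from the top for an empty bucket. */
		for (index = MODEL_HASH_SIZE - 1; index > 0; index--)
			if (model_names[index] == NULL)
				break;
		if (index == 0)
			return -EBUSY;
		major = index;
		ret = major;
	}
	if (major >= MODEL_MAJOR_MAX)
		return -EINVAL;

	p = malloc(sizeof(*p));
	if (p == NULL)
		return -ENOMEM;
	p->major = major;
	snprintf(p->name, sizeof(p->name), "%s", name);
	p->next = NULL;

	/* Walk the bucket chain; stop early if the major is already taken. */
	for (n = &model_names[model_index(major)]; *n; n = &(*n)->next)
		if ((*n)->major == major)
			break;
	if (*n) {
		free(p);
		return -EBUSY;
	}
	*n = p;
	return ret;
}

/* Returns 0 on success, -EINVAL if the major/name pair is not registered. */
static int model_unregister(unsigned int major, const char *name)
{
	struct model_major_name **n, *p;

	for (n = &model_names[model_index(major)]; *n; n = &(*n)->next)
		if ((*n)->major == major)
			break;
	if (!*n || strcmp((*n)->name, name))
		return -EINVAL;
	p = *n;
	*n = p->next;
	free(p);
	return 0;
}
```

In this model two majors that differ by a multiple of the bucket count share one chain, which is why both functions compare the stored major rather than trusting the bucket index alone. Unlike the kernel function, the model reports a failed unregister with a return value instead of WARN_ON().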
2005-04-16 15:20:36 -07:00
2007-07-17 04:03:47 -07:00
void unregister_blkdev ( unsigned int major , const char * name )
2005-04-16 15:20:36 -07:00
{
struct blk_major_name * * n ;
struct blk_major_name * p = NULL ;
int index = major_to_index ( major ) ;
2020-10-29 15:58:26 +01:00
mutex_lock ( & major_names_lock ) ;
2021-09-07 20:52:13 +09:00
spin_lock ( & major_names_spinlock ) ;
2005-04-16 15:20:36 -07:00
for ( n = & major_names [ index ] ; * n ; n = & ( * n ) - > next )
if ( ( * n ) - > major = = major )
break ;
2007-07-17 04:03:45 -07:00
if ( ! * n | | strcmp ( ( * n ) - > name , name ) ) {
WARN_ON ( 1 ) ;
} else {
2005-04-16 15:20:36 -07:00
p = * n ;
* n = p - > next ;
}
2021-09-07 20:52:13 +09:00
spin_unlock ( & major_names_spinlock ) ;
2020-10-29 15:58:26 +01:00
mutex_unlock ( & major_names_lock ) ;
2005-04-16 15:20:36 -07:00
kfree ( p ) ;
}
EXPORT_SYMBOL ( unregister_blkdev ) ;
2021-05-21 07:50:51 +02:00
int blk_alloc_ext_minor ( void )
2008-08-25 19:47:22 +09:00
{
2013-02-27 17:03:57 -08:00
int idx ;
2008-08-25 19:47:22 +09:00
2022-03-26 15:50:46 +01:00
idx = ida_alloc_range ( & ext_devt_ida , 0 , NR_EXT_DEVT - 1 , GFP_KERNEL ) ;
2021-08-24 09:52:16 +02:00
if ( idx = = - ENOSPC )
return - EBUSY ;
return idx ;
2008-08-25 19:47:22 +09:00
}
2021-05-21 07:50:51 +02:00
void blk_free_ext_minor ( unsigned int minor )
2008-08-25 19:47:22 +09:00
{
2021-08-24 09:52:16 +02:00
ida_free ( & ext_devt_ida , minor ) ;
2019-04-02 20:06:34 +08:00
}
2008-08-25 19:47:23 +09:00
static char * bdevt_str ( dev_t devt , char * buf )
{
if ( MAJOR ( devt ) < = 0xff & & MINOR ( devt ) < = 0xff ) {
char tbuf [ BDEVT_SIZE ] ;
snprintf ( tbuf , BDEVT_SIZE , " %02x%02x " , MAJOR ( devt ) , MINOR ( devt ) ) ;
snprintf ( buf , BDEVT_SIZE , " %-9s " , tbuf ) ;
} else
snprintf ( buf , BDEVT_SIZE , " %03x:%05x " , MAJOR ( devt ) , MINOR ( devt ) ) ;
return buf ;
}
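The two output formats of bdevt_str() are easier to see with concrete values. The following is a userspace sketch under stated assumptions: `model_bdevt_str` is a hypothetical name, it takes the major and minor as separate integers rather than a dev_t, and the buffer size of 10 is an assumed value for BDEVT_SIZE, which is defined outside this section.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_BDEVT_SIZE 10	/* assumed value of BDEVT_SIZE */

/*
 * Formats a device number the way bdevt_str() does: a compact
 * "MMmm" hex form padded to nine columns when both numbers fit in
 * one byte, otherwise a "MMM:mmmmm" form.
 */
static char *model_bdevt_str(unsigned int major, unsigned int minor, char *buf)
{
	assert(buf != NULL);

	if (major <= 0xff && minor <= 0xff) {
		char tbuf[MODEL_BDEVT_SIZE];

		snprintf(tbuf, sizeof(tbuf), "%02x%02x", major, minor);
		/* Left-justify and pad so both forms are nine characters wide. */
		snprintf(buf, MODEL_BDEVT_SIZE, "%-9s", tbuf);
	} else {
		snprintf(buf, MODEL_BDEVT_SIZE, "%03x:%05x", major, minor);
	}
	return buf;
}
```

With these assumptions, major 8 and minor 1 produce "0801" followed by five spaces, while major 259 and minor 0 produce "103:00000". Both strings are nine characters long, so columns line up regardless of which branch was taken.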
2021-01-24 11:02:39 +01:00
void disk_uevent ( struct gendisk * disk , enum kobject_action action )
{
struct block_device * part ;
2021-04-06 08:23:02 +02:00
unsigned long idx ;
2021-01-24 11:02:39 +01:00
2021-04-06 08:23:02 +02:00
rcu_read_lock ( ) ;
xa_for_each ( & disk - > part_tbl , idx , part ) {
if ( bdev_is_partition ( part ) & & ! bdev_nr_sectors ( part ) )
continue ;
2021-07-01 10:16:37 +02:00
if ( ! kobject_get_unless_zero ( & part - > bd_device . kobj ) )
2021-04-06 08:23:02 +02:00
continue ;
rcu_read_unlock ( ) ;
2021-01-24 11:02:39 +01:00
kobject_uevent ( bdev_kobj ( part ) , action ) ;
2021-07-01 10:16:37 +02:00
put_device ( & part - > bd_device ) ;
2021-04-06 08:23:02 +02:00
rcu_read_lock ( ) ;
}
rcu_read_unlock ( ) ;
2021-01-24 11:02:39 +01:00
}
EXPORT_SYMBOL_GPL ( disk_uevent ) ;
2023-02-17 10:21:59 +08:00
int disk_scan_partitions ( struct gendisk * disk , fmode_t mode )
2020-09-21 09:19:46 +02:00
{
struct block_device * bdev ;
2023-02-17 10:22:00 +08:00
int ret = 0 ;
2020-09-21 09:19:46 +02:00
2021-11-22 14:06:23 +01:00
if ( disk - > flags & ( GENHD_FL_NO_PART | GENHD_FL_HIDDEN ) )
2021-11-22 14:06:16 +01:00
return - EINVAL ;
2022-05-27 07:58:06 +02:00
if ( test_bit ( GD_SUPPRESS_PART_SCAN , & disk - > state ) )
return - EINVAL ;
2021-11-22 14:06:16 +01:00
if ( disk - > open_partitions )
return - EBUSY ;
2020-09-21 09:19:46 +02:00
set_bit ( GD_NEED_PART_SCAN , & disk - > state ) ;
2023-02-17 10:22:00 +08:00
/*
* If the device is already opened exclusively by the current thread , it ' s
* safe to scan partitions ; otherwise , use bd_prepare_to_claim ( ) to
* synchronize with other exclusive openers and other partition
* scanners .
*/
if ( ! ( mode & FMODE_EXCL ) ) {
ret = bd_prepare_to_claim ( disk - > part0 , disk_scan_partitions ) ;
if ( ret )
return ret ;
}
bdev = blkdev_get_by_dev ( disk_devt ( disk ) , mode & ~ FMODE_EXCL , NULL ) ;
2021-11-22 14:06:16 +01:00
if ( IS_ERR ( bdev ) )
2023-02-17 10:22:00 +08:00
ret = PTR_ERR ( bdev ) ;
else
2023-03-07 18:55:52 +08:00
blkdev_put ( bdev , mode & ~ FMODE_EXCL ) ;
2023-02-17 10:22:00 +08:00
if ( ! ( mode & FMODE_EXCL ) )
bd_abort_claiming ( disk - > part0 , disk_scan_partitions ) ;
return ret ;
2020-09-21 09:19:46 +02:00
}
2005-04-16 15:20:36 -07:00
/**
2021-08-04 11:41:47 +02:00
* device_add_disk - add disk information to kernel list
2016-06-15 18:17:27 -07:00
* @ parent : parent device for the disk
2005-04-16 15:20:36 -07:00
* @ disk : per - device partitioning information
2018-09-28 08:17:19 +02:00
* @ groups : Additional per - device sysfs groups
2005-04-16 15:20:36 -07:00
*
* This function registers the partitioning information in @ disk
* with the kernel .
*/
2021-11-09 16:29:49 -08:00
int __must_check device_add_disk ( struct device * parent , struct gendisk * disk ,
const struct attribute_group * * groups )
2021-08-04 11:41:47 +02:00
2005-04-16 15:20:36 -07:00
{
2021-08-18 16:45:33 +02:00
struct device * ddev = disk_to_dev ( disk ) ;
2021-05-21 07:50:51 +02:00
int ret ;
2008-04-30 00:54:32 -07:00
2022-03-04 21:08:03 -05:00
/* Only makes sense for bio-based to set ->poll_bio */
if ( queue_is_mq ( disk - > queue ) & & disk - > fops - > poll_bio )
return - EINVAL ;
2019-09-05 18:51:33 +09:00
/*
* The disk queue should now be all set with enough information about
* the device for the elevator code to pick an adequate default
* elevator if one is needed , that is , for devices requesting queue
* registration .
*/
2021-08-04 11:41:47 +02:00
elevator_init_mq ( disk - > queue ) ;
2019-09-05 18:51:33 +09:00
2021-05-21 07:50:51 +02:00
/*
* If the driver provides an explicit major number it also must provide
* the number of minor numbers supported , and those will be used to
* set up the gendisk .
* Otherwise just allocate the device numbers for both the whole device
* and all partitions from the extended dev_t space .
2008-08-25 19:56:17 +09:00
*/
2022-10-22 10:16:15 +08:00
ret = - EINVAL ;
2021-05-21 07:50:51 +02:00
if ( disk - > major ) {
2021-08-18 16:45:40 +02:00
if ( WARN_ON ( ! disk - > minors ) )
2022-10-22 10:16:15 +08:00
goto out_exit_elevator ;
2021-05-21 07:50:52 +02:00
if ( disk - > minors > DISK_MAX_PARTS ) {
pr_err ( " block: can't allocate more than %d partitions \n " ,
DISK_MAX_PARTS ) ;
disk - > minors = DISK_MAX_PARTS ;
}
2021-12-17 23:51:25 +09:00
if ( disk - > first_minor + disk - > minors > MINORMASK + 1 )
2022-10-22 10:16:15 +08:00
goto out_exit_elevator ;
2021-05-21 07:50:51 +02:00
} else {
2021-08-18 16:45:40 +02:00
if ( WARN_ON ( disk - > minors ) )
2022-10-22 10:16:15 +08:00
goto out_exit_elevator ;
2008-08-25 19:56:17 +09:00
2021-05-21 07:50:51 +02:00
ret = blk_alloc_ext_minor ( ) ;
2021-08-18 16:45:40 +02:00
if ( ret < 0 )
2022-10-22 10:16:15 +08:00
goto out_exit_elevator ;
2021-05-21 07:50:51 +02:00
disk - > major = BLOCK_EXT_MAJOR ;
2021-08-24 09:52:15 +02:00
disk - > first_minor = ret ;
2008-08-25 19:56:17 +09:00
}
2021-05-21 07:50:51 +02:00
2021-08-18 16:45:33 +02:00
/* delay uevents until we have scanned the partition table */
dev_set_uevent_suppress ( ddev , 1 ) ;
ddev - > parent = parent ;
ddev - > groups = groups ;
dev_set_name ( ddev , " %s " , disk - > disk_name ) ;
2021-08-18 16:45:34 +02:00
if ( ! ( disk - > flags & GENHD_FL_HIDDEN ) )
ddev - > devt = MKDEV ( disk - > major , disk - > first_minor ) ;
2021-08-18 16:45:40 +02:00
ret = device_add ( ddev ) ;
if ( ret )
2021-12-21 17:18:51 +01:00
goto out_free_ext_minor ;
ret = disk_alloc_events ( disk ) ;
if ( ret )
goto out_device_del ;
2021-08-18 16:45:33 +02:00
if ( ! sysfs_deprecated ) {
ret = sysfs_create_link ( block_depr , & ddev - > kobj ,
kobject_name ( & ddev - > kobj ) ) ;
2021-08-18 16:45:40 +02:00
if ( ret )
goto out_device_del ;
2021-08-18 16:45:33 +02:00
}
/*
* Avoid a possible deadlock caused by allocating memory with
* GFP_KERNEL in the runtime_resume callback of any of this device ' s
* ancestor devices .
*/
pm_runtime_set_memalloc_noio ( ddev , true ) ;
2023-02-03 16:03:43 +01:00
ret = blk_integrity_add ( disk ) ;
if ( ret )
2023-02-14 19:33:06 +01:00
goto out_del_block_link ;
2023-02-03 16:03:43 +01:00
2021-08-18 16:45:33 +02:00
disk - > part0 - > bd_holder_dir =
kobject_create_and_add ( " holders " , & ddev - > kobj ) ;
2021-11-03 09:40:23 -07:00
if ( ! disk - > part0 - > bd_holder_dir ) {
ret = - ENOMEM ;
2021-08-18 16:45:40 +02:00
goto out_del_integrity ;
2021-11-03 09:40:23 -07:00
}
2021-08-18 16:45:33 +02:00
disk - > slave_dir = kobject_create_and_add ( " slaves " , & ddev - > kobj ) ;
2021-11-03 09:40:23 -07:00
if ( ! disk - > slave_dir ) {
ret = - ENOMEM ;
2021-08-18 16:45:40 +02:00
goto out_put_holder_dir ;
2021-11-03 09:40:23 -07:00
}
2021-08-18 16:45:33 +02:00
2021-08-18 16:45:40 +02:00
ret = blk_register_queue ( disk ) ;
if ( ret )
goto out_put_slave_dir ;
2021-08-18 16:45:37 +02:00
2021-11-22 14:06:23 +01:00
if ( ! ( disk - > flags & GENHD_FL_HIDDEN ) ) {
2021-08-18 16:45:34 +02:00
ret = bdi_register ( disk - > bdi , " %u:%u " ,
disk - > major , disk - > first_minor ) ;
2021-08-18 16:45:40 +02:00
if ( ret )
goto out_unregister_queue ;
2021-08-18 16:45:34 +02:00
bdi_set_owner ( disk - > bdi , ddev ) ;
2021-08-18 16:45:40 +02:00
ret = sysfs_create_link ( & ddev - > kobj ,
& disk - > bdi - > dev - > kobj , " bdi " ) ;
if ( ret )
goto out_unregister_bdi ;
2021-08-18 16:45:34 +02:00
2023-02-17 10:22:00 +08:00
/* Make sure the first partition scan will proceed */
if ( get_capacity ( disk ) & & ! ( disk - > flags & GENHD_FL_NO_PART ) & &
! test_bit ( GD_SUPPRESS_PART_SCAN , & disk - > state ) )
set_bit ( GD_NEED_PART_SCAN , & disk - > state ) ;
2021-08-18 16:45:35 +02:00
bdev_add ( disk - > part0 , ddev - > devt ) ;
2021-11-22 14:06:16 +01:00
if ( get_capacity ( disk ) )
2023-02-17 10:21:59 +08:00
disk_scan_partitions ( disk , FMODE_READ ) ;
2021-08-18 16:45:33 +02:00
/*
* Announce the disk and partitions after all partitions are
2021-08-18 16:45:34 +02:00
* created . ( for hidden disks uevents remain suppressed forever )
2021-08-18 16:45:33 +02:00
*/
dev_set_uevent_suppress ( ddev , 0 ) ;
disk_uevent ( disk , KOBJ_ADD ) ;
2022-10-10 15:18:57 +02:00
} else {
/*
* Even if the block_device for a hidden gendisk is not
* registered , it needs to have a valid bd_dev so that the
* freeing of the dynamic major works .
*/
disk - > part0 - > bd_dev = MKDEV ( disk - > major , disk - > first_minor ) ;
2021-08-18 16:45:33 +02:00
}
2021-08-18 16:45:37 +02:00
disk_update_readahead ( disk ) ;
implement in-kernel gendisk events handling
Currently, media presence polling for removable block devices is done
from userland. There are several issues with this.
* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows, which only issues
a single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried with such
command sequences.
* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an ongoing burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.
* Userland polling is unnecessarily heavy and an in-kernel
implementation is lighter and better coordinated (workqueue, timer
slack).
This patch implements a framework for in-kernel disk event handling,
which includes media presence polling.
* bdops->check_events() is added, which supersedes ->media_changed().
It should check whether there's any pending event and return it if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called in parallel.
* gendisk->events and ->async_events are added. These should be
initialized by the block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.
* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0, which means disabled) and
/sys/block/*/events_poll_msecs controls polling intervals for
individual devices (default is -1, meaning use the system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.
* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.
* There are event 'clearing' events. For example, both of the currently
defined events are cleared after the device has been successfully
opened. This information is passed to the ->check_events() callback
using the @clearing argument as a hint.
* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.
* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_changed() will be dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-08 20:57:37 +01:00
disk_add_events ( disk ) ;
2022-02-15 10:45:10 +01:00
set_bit ( GD_ADDED , & disk - > state ) ;
2021-08-18 16:45:40 +02:00
return 0 ;
out_unregister_bdi :
if ( ! ( disk - > flags & GENHD_FL_HIDDEN ) )
bdi_unregister ( disk - > bdi ) ;
out_unregister_queue :
blk_unregister_queue ( disk ) ;
2022-10-29 15:13:55 +08:00
rq_qos_exit ( disk - > queue ) ;
2021-08-18 16:45:40 +02:00
out_put_slave_dir :
kobject_put ( disk - > slave_dir ) ;
2022-11-15 22:10:45 +08:00
disk - > slave_dir = NULL ;
2021-08-18 16:45:40 +02:00
out_put_holder_dir :
kobject_put ( disk - > part0 - > bd_holder_dir ) ;
out_del_integrity :
blk_integrity_del ( disk ) ;
out_del_block_link :
if ( ! sysfs_deprecated )
sysfs_remove_link ( block_depr , dev_name ( ddev ) ) ;
out_device_del :
device_del ( ddev ) ;
out_free_ext_minor :
if ( disk - > major = = BLOCK_EXT_MAJOR )
blk_free_ext_minor ( disk - > first_minor ) ;
2022-10-22 10:16:15 +08:00
out_exit_elevator :
if ( disk - > queue - > elevator )
elevator_exit ( disk - > queue ) ;
2021-11-09 16:29:49 -08:00
return ret ;
2005-04-16 15:20:36 -07:00
}
2016-06-15 18:17:27 -07:00
EXPORT_SYMBOL ( device_add_disk ) ;
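The commit message quoted inside device_add_disk ( ) describes how the polling interval is chosen: an explicit per-device value wins, otherwise the system default applies, but only if the device has at least one supported event that it cannot report asynchronously. The sketch below restates that rule in user-space C. It is derived from the commit text only and is not the kernel's implementation; `sk_effective_poll_msecs` and the `SK_EV_*` bit values are hypothetical.

```c
#include <assert.h>

/* Assumed bit values for the two events named in the commit message. */
#define SK_EV_MEDIA_CHANGE  (1u << 0)
#define SK_EV_EJECT_REQUEST (1u << 1)

/*
 * Hypothetical helper: returns the polling interval in milliseconds,
 * where 0 means "do not poll".
 *
 * dev_poll_msecs: per-device setting, -1 means "use the system setting".
 * dfl_poll_msecs: system-wide default, 0 means polling is disabled.
 */
static long sk_effective_poll_msecs(unsigned int events,
				    unsigned int async_events,
				    long dev_poll_msecs,
				    unsigned long dfl_poll_msecs)
{
	/* In this sketch, async events must be a subset of all events. */
	assert((async_events & ~events) == 0);

	/* An explicitly set per-device interval always wins. */
	if (dev_poll_msecs >= 0)
		return dev_poll_msecs;

	/* Some event needs polling: fall back to the system default. */
	if (events & ~async_events)
		return (long)dfl_poll_msecs;

	/* Everything is reported asynchronously: never poll. */
	return 0;
}
```

With this rule, a device that reports media change asynchronously is left alone even when the system default is nonzero, unless a per-device interval is written explicitly.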
2005-04-16 15:20:36 -07:00
2022-02-17 08:52:31 +01:00
/**
* blk_mark_disk_dead - mark a disk as dead
* @ disk : disk to mark as dead
*
* Mark a disk as dead ( e . g . surprise removed ) and don ' t accept any new I / O
* to this disk .
*/
void blk_mark_disk_dead ( struct gendisk * disk )
{
set_bit ( GD_DEAD , & disk - > state ) ;
blk_queue_start_drain ( disk - > queue ) ;
2022-11-01 16:00:37 +01:00
/*
* Stop buffered writers from dirtying pages that can ' t be written out .
*/
set_capacity_and_notify ( disk , 0 ) ;
2022-02-17 08:52:31 +01:00
}
EXPORT_SYMBOL_GPL ( blk_mark_disk_dead ) ;
2020-06-19 20:47:23 +00:00
/**
* del_gendisk - remove the gendisk
* @ disk : the struct gendisk to remove
*
* Removes the gendisk and all its associated resources . This deletes the
* partitions associated with the gendisk , and unregisters the associated
* request_queue .
*
* This is the counter to the respective device_add_disk ( ) call .
*
* The final removal of the struct gendisk happens when its refcount reaches 0
* with put_disk ( ) , which should be called after del_gendisk ( ) , if
* device_add_disk ( ) was used .
2020-06-19 20:47:25 +00:00
*
* Some drivers depend on the release of the gendisk being synchronous ,
* so it should not be deferred .
*
* Context : can sleep
2020-06-19 20:47:23 +00:00
*/
2010-12-08 20:57:36 +01:00
void del_gendisk ( struct gendisk * disk )
2005-04-16 15:20:36 -07:00
{
2021-09-29 09:12:40 +02:00
struct request_queue * q = disk - > queue ;
2020-06-19 20:47:25 +00:00
might_sleep ( ) ;
2021-08-24 16:43:10 +02:00
if ( WARN_ON_ONCE ( ! disk_live ( disk ) & & ! ( disk - > flags & GENHD_FL_HIDDEN ) ) )
2020-10-29 15:58:24 +01:00
return ;
2015-10-21 13:19:49 -04:00
blk_integrity_del ( disk ) ;
2010-12-08 20:57:37 +01:00
disk_del_events ( disk ) ;
2021-05-25 08:12:56 +02:00
mutex_lock ( & disk - > open_mutex ) ;
2021-07-22 09:53:56 +02:00
remove_inode_hash ( disk - > part0 - > bd_inode ) ;
2021-04-06 08:22:55 +02:00
blk_drop_partitions ( disk ) ;
2021-05-25 08:12:56 +02:00
mutex_unlock ( & disk - > open_mutex ) ;
2021-04-06 08:22:56 +02:00
2021-04-06 08:22:53 +02:00
fsync_bdev ( disk - > part0 ) ;
__invalidate_device ( disk - > part0 , true ) ;
2021-09-29 09:12:40 +02:00
/*
* Fail any new I / O .
*/
set_bit ( GD_DEAD , & disk - > state ) ;
2022-06-19 08:05:51 +02:00
if ( test_bit ( GD_OWNS_QUEUE , & disk - > state ) )
blk_queue_flag_set ( QUEUE_FLAG_DYING , q ) ;
2010-12-08 20:57:36 +01:00
set_capacity ( disk , 0 ) ;
2021-09-29 09:12:40 +02:00
/*
* Prevent new I / O from crossing bio_queue_enter ( ) .
*/
blk_queue_start_drain ( q ) ;
2020-10-29 15:58:24 +01:00
if ( ! ( disk - > flags & GENHD_FL_HIDDEN ) ) {
2017-11-02 21:29:53 +03:00
sysfs_remove_link ( & disk_to_dev ( disk ) - > kobj , " bdi " ) ;
2020-10-29 15:58:24 +01:00
2017-03-08 17:48:33 +01:00
/*
* Unregister bdi before releasing device numbers ( as they can
* get reused and we ' d get clashes in sysfs ) .
*/
2021-08-09 16:17:43 +02:00
bdi_unregister ( disk - > bdi ) ;
2017-03-08 17:48:33 +01:00
}
2010-12-08 20:57:36 +01:00
2020-10-29 15:58:24 +01:00
blk_unregister_queue ( disk ) ;
2010-12-08 20:57:36 +01:00
2020-11-26 18:47:17 +01:00
kobject_put ( disk - > part0 - > bd_holder_dir ) ;
2010-12-08 20:57:36 +01:00
kobject_put ( disk - > slave_dir ) ;
2022-11-15 22:10:45 +08:00
disk - > slave_dir = NULL ;
2010-12-08 20:57:36 +01:00
2020-11-24 09:36:54 +01:00
part_stat_set_all ( disk - > part0 , 0 ) ;
2020-11-26 18:47:17 +01:00
disk - > part0 - > bd_stamp = 0 ;
2010-12-08 20:57:36 +01:00
if ( ! sysfs_deprecated )
sysfs_remove_link ( block_depr , dev_name ( disk_to_dev ( disk ) ) ) ;
2013-02-22 16:34:13 -08:00
pm_runtime_set_memalloc_noio ( disk_to_dev ( disk ) , false ) ;
2010-12-08 20:57:36 +01:00
device_del ( disk_to_dev ( disk ) ) ;
2021-10-26 18:12:04 +08:00
2022-09-19 16:40:49 +02:00
blk_mq_freeze_queue_wait ( q ) ;
2022-09-21 20:04:58 +02:00
blk_throtl_cancel_bios ( disk ) ;
2022-03-18 21:01:44 +08:00
2021-10-26 18:12:04 +08:00
blk_sync_queue ( q ) ;
blk_flush_integrity ( ) ;
2022-10-30 17:47:30 +08:00
if ( queue_is_mq ( q ) )
blk_mq_cancel_work_sync ( q ) ;
2022-06-14 09:48:24 +02:00
blk_mq_quiesce_queue ( q ) ;
if ( q - > elevator ) {
mutex_lock ( & q - > sysfs_lock ) ;
elevator_exit ( q ) ;
mutex_unlock ( & q - > sysfs_lock ) ;
}
rq_qos_exit ( q ) ;
blk_mq_unquiesce_queue ( q ) ;
2021-10-26 18:12:04 +08:00
/*
2022-06-19 08:05:51 +02:00
* If the disk does not own the queue , allow using passthrough requests
* again . Else leave the queue frozen to fail all I / O .
2021-10-26 18:12:04 +08:00
*/
2022-06-19 08:05:51 +02:00
if ( ! test_bit ( GD_OWNS_QUEUE , & disk - > state ) ) {
blk_queue_flag_clear ( QUEUE_FLAG_INIT_DONE , q ) ;
__blk_mq_unfreeze_queue ( q , true ) ;
} else {
if ( queue_is_mq ( q ) )
blk_mq_exit_queue ( q ) ;
}
2005-04-16 15:20:36 -07:00
}
2010-12-08 20:57:36 +01:00
EXPORT_SYMBOL ( del_gendisk ) ;
2005-04-16 15:20:36 -07:00
2021-09-22 20:37:08 +08:00
/**
* invalidate_disk - invalidate the disk
* @ disk : the struct gendisk to invalidate
*
* A helper to invalidate the disk . It will clean the disk ' s associated
* buffer / page caches and reset its internal states so that the disk
* can be reused by the drivers .
*
* Context : can sleep
*/
void invalidate_disk ( struct gendisk * disk )
{
struct block_device * bdev = disk - > part0 ;
invalidate_bdev ( bdev ) ;
bdev - > bd_inode - > i_mapping - > wb_err = 0 ;
set_capacity ( disk , 0 ) ;
}
EXPORT_SYMBOL ( invalidate_disk ) ;
2016-01-09 08:36:51 -08:00
/* sysfs access to bad-blocks list. */
static ssize_t disk_badblocks_show ( struct device * dev ,
struct device_attribute * attr ,
char * page )
{
struct gendisk * disk = dev_to_disk ( dev ) ;
if ( ! disk - > bb )
return sprintf ( page , " \n " ) ;
return badblocks_show ( disk - > bb , page , 0 ) ;
}
static ssize_t disk_badblocks_store ( struct device * dev ,
struct device_attribute * attr ,
const char * page , size_t len )
{
struct gendisk * disk = dev_to_disk ( dev ) ;
if ( ! disk - > bb )
return - ENXIO ;
return badblocks_store ( disk - > bb , page , len , 0 ) ;
}
2022-01-04 08:16:47 +01:00
# ifdef CONFIG_BLOCK_LEGACY_AUTOLOAD
2020-11-26 09:23:26 +01:00
void blk_request_module ( dev_t devt )
2020-10-29 15:58:27 +01:00
{
2020-10-29 15:58:28 +01:00
unsigned int major = MAJOR ( devt ) ;
struct blk_major_name * * n ;
mutex_lock ( & major_names_lock ) ;
for ( n = & major_names [ major_to_index ( major ) ] ; * n ; n = & ( * n ) - > next ) {
if ( ( * n ) - > major = = major & & ( * n ) - > probe ) {
( * n ) - > probe ( devt ) ;
mutex_unlock ( & major_names_lock ) ;
return ;
}
}
mutex_unlock ( & major_names_lock ) ;
2020-10-29 15:58:27 +01:00
if ( request_module ( " block-major-%d-%d " , MAJOR ( devt ) , MINOR ( devt ) ) > 0 )
/* Make old-style 2.4 aliases work */
request_module ( " block-major-%d " , MAJOR ( devt ) ) ;
}
2022-01-04 08:16:47 +01:00
# endif /* CONFIG_BLOCK_LEGACY_AUTOLOAD */
2020-10-29 15:58:27 +01:00
2008-05-22 17:21:08 -04:00
/*
* print a full list of all partitions - intended for places where the root
* filesystem can ' t be mounted and thus to give the victim some idea of what
* went wrong
*/
void __init printk_all_partitions ( void )
{
2008-09-03 08:57:12 +02:00
struct class_dev_iter iter ;
struct device * dev ;
class_dev_iter_init ( & iter , & block_class , NULL , & disk_type ) ;
while ( ( dev = class_dev_iter_next ( & iter ) ) ) {
struct gendisk * disk = dev_to_disk ( dev ) ;
2020-11-24 09:52:59 +01:00
struct block_device * part ;
2008-08-25 19:47:23 +09:00
char devt_buf [ BDEVT_SIZE ] ;
2021-04-06 08:22:59 +02:00
unsigned long idx ;
2008-09-03 08:57:12 +02:00
/*
* Don ' t show empty devices or things that have been
2011-03-30 22:57:33 -03:00
* suppressed
2008-09-03 08:57:12 +02:00
*/
2021-11-22 14:06:21 +01:00
if ( get_capacity ( disk ) = = 0 | | ( disk - > flags & GENHD_FL_HIDDEN ) )
2008-09-03 08:57:12 +02:00
continue ;
/*
2021-04-06 08:22:59 +02:00
* Note , unlike / proc / partitions , I am showing the numbers in
* hex - the same format as the root = option takes .
2008-09-03 08:57:12 +02:00
*/
2021-04-06 08:22:59 +02:00
rcu_read_lock ( ) ;
xa_for_each ( & disk - > part_tbl , idx , part ) {
if ( ! bdev_nr_sectors ( part ) )
continue ;
2021-07-27 08:25:14 +02:00
printk ( " %s%s %10llu %pg %s " ,
2021-04-06 08:22:59 +02:00
bdev_is_partition ( part ) ? " " : " " ,
2020-11-24 09:52:59 +01:00
bdevt_str ( part - > bd_dev , devt_buf ) ,
2021-07-27 08:25:14 +02:00
bdev_nr_sectors ( part ) > > 1 , part ,
2020-11-24 09:52:59 +01:00
part - > bd_meta_info ?
part - > bd_meta_info - > uuid : " " ) ;
2021-04-06 08:22:59 +02:00
if ( bdev_is_partition ( part ) )
2008-08-25 19:56:14 +09:00
printk ( " \n " ) ;
2021-04-06 08:22:59 +02:00
else if ( dev - > parent & & dev - > parent - > driver )
printk ( " driver: %s \n " ,
dev - > parent - > driver - > name ) ;
else
printk ( " (driver?) \n " ) ;
2008-08-25 19:56:14 +09:00
}
2021-04-06 08:22:59 +02:00
rcu_read_unlock ( ) ;
2008-09-03 08:57:12 +02:00
}
class_dev_iter_exit ( & iter ) ;
2007-05-09 02:33:24 -07:00
}
2005-04-16 15:20:36 -07:00
# ifdef CONFIG_PROC_FS
/* iterator */
2008-09-03 08:57:12 +02:00
static void * disk_seqf_start ( struct seq_file * seqf , loff_t * pos )
2008-05-22 17:21:08 -04:00
{
2008-09-03 08:57:12 +02:00
loff_t skip = * pos ;
struct class_dev_iter * iter ;
struct device * dev ;
2008-05-22 17:21:08 -04:00
2008-08-28 09:27:42 +02:00
iter = kmalloc ( sizeof ( * iter ) , GFP_KERNEL ) ;
2008-09-03 08:57:12 +02:00
if ( ! iter )
return ERR_PTR ( - ENOMEM ) ;
seqf - > private = iter ;
class_dev_iter_init ( iter , & block_class , NULL , & disk_type ) ;
do {
dev = class_dev_iter_next ( iter ) ;
if ( ! dev )
return NULL ;
} while ( skip - - ) ;
return dev_to_disk ( dev ) ;
2008-05-22 17:21:08 -04:00
}
2008-09-03 08:57:12 +02:00
static void * disk_seqf_next ( struct seq_file * seqf , void * v , loff_t * pos )
2005-04-16 15:20:36 -07:00
{
2007-05-21 22:08:01 +02:00
struct device * dev ;
2005-04-16 15:20:36 -07:00
2008-09-03 08:57:12 +02:00
( * pos ) + + ;
dev = class_dev_iter_next ( seqf - > private ) ;
2008-09-03 08:53:37 +02:00
if ( dev )
2008-05-22 17:21:08 -04:00
return dev_to_disk ( dev ) ;
2008-09-03 08:53:37 +02:00
2005-04-16 15:20:36 -07:00
return NULL ;
}
2008-09-03 08:57:12 +02:00
static void disk_seqf_stop ( struct seq_file * seqf , void * v )
2008-05-22 17:21:08 -04:00
{
2008-09-03 08:57:12 +02:00
struct class_dev_iter * iter = seqf - > private ;
2008-05-22 17:21:08 -04:00
2008-09-03 08:57:12 +02:00
/* stop is called even after start failed :-( */
if ( iter ) {
class_dev_iter_exit ( iter ) ;
kfree ( iter ) ;
2016-07-29 10:40:31 +02:00
seqf - > private = NULL ;
2008-08-16 14:30:30 +02:00
}
2005-04-16 15:20:36 -07:00
}
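The three iterator callbacks above follow the usual start / next / stop shape: start allocates a cursor and skips `*pos` entries, next advances by one, and stop must tolerate a cursor that was never set up. A minimal user-space model of that shape is sketched below, assuming an array of names in place of the block class device list. All `sk_` names are hypothetical, and allocation failure is reported as NULL here instead of an error pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cursor over a fixed array of names. */
struct sk_iter {
	const char **items;
	size_t count;
	size_t idx;
};

/* Hypothetical sequence state; priv must start out as NULL. */
struct sk_seq {
	struct sk_iter *priv;
};

static const char *sk_iter_next(struct sk_iter *it)
{
	return it->idx < it->count ? it->items[it->idx++] : NULL;
}

/* Allocate the cursor, then skip *pos entries before returning one. */
static const char *sk_start(struct sk_seq *seq, const char **items,
			    size_t count, long *pos)
{
	long skip = *pos;
	const char *item;
	struct sk_iter *it = malloc(sizeof(*it));

	if (!it)
		return NULL;
	it->items = items;
	it->count = count;
	it->idx = 0;
	seq->priv = it;
	do {
		item = sk_iter_next(it);
		if (!item)
			return NULL;
	} while (skip--);
	return item;
}

/* Advance the position and return the following entry, if any. */
static const char *sk_next(struct sk_seq *seq, long *pos)
{
	(*pos)++;
	return sk_iter_next(seq->priv);
}

/* Safe to call even if start failed or stop already ran. */
static void sk_stop(struct sk_seq *seq)
{
	if (seq->priv) {
		free(seq->priv);
		seq->priv = NULL;
	}
}
```

Clearing `priv` in the stop step is what makes a second stop call harmless, which matches the NULL reset in disk_seqf_stop ( ) above.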
2008-09-03 08:57:12 +02:00
static void * show_partition_start ( struct seq_file * seqf , loff_t * pos )
2005-04-16 15:20:36 -07:00
{
block: Don't use static to define "void *p" in show_partition_start()
I met an odd problem: a read of /proc/partitions may return zero.
I wrote a file test.c:
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main()
    {
        char buff[4096];
        int ret;
        int fd;
        printf("pid=%d\n", getpid());
        while (1) {
            fd = open("/proc/partitions", O_RDONLY);
            if (fd < 0) {
                printf("open error %s\n", strerror(errno));
                return 0;
            }
            ret = read(fd, buff, 4096);
            if (ret <= 0)
                printf("ret=%d, %s, %ld\n", ret,
                       strerror(errno), (long)lseek(fd, 0, SEEK_CUR));
            close(fd);
        }
        exit(0);
    }
You can reproduce it by running both of these at the same time:
1: while true; do cat /proc/partitions > /dev/null; done
2: ./test
I reviewed the code and found:
>> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
>> {
>>     static void *p;
>>
>>     p = disk_seqf_start(seqf, pos);
>>     if (!IS_ERR_OR_NULL(p) && !*pos)
>>         seq_puts(seqf, "major minor #blocks name\n\n");
>>     return p;
>> }
Because p is static, it is shared by all readers, and two of them can
interleave like this:
test                                cat /proc/partitions
p = disk_seqf_start()(Not NULL)
                                    p = disk_seqf_start()(NULL because pos)
if (!IS_ERR_OR_NULL(p) && !*pos)
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-03 10:42:00 +02:00
void * p ;
2008-09-03 08:57:12 +02:00
p = disk_seqf_start ( seqf , pos ) ;
2010-12-17 08:58:36 +01:00
if ( ! IS_ERR_OR_NULL ( p ) & & ! * pos )
2008-09-03 08:57:12 +02:00
seq_puts ( seqf , " major minor #blocks name \n \n " ) ;
return p ;
2005-04-16 15:20:36 -07:00
}
2008-09-03 09:01:09 +02:00
static int show_partition ( struct seq_file * seqf , void * v )
2005-04-16 15:20:36 -07:00
{
struct gendisk * sgp = v ;
2020-11-24 09:52:59 +01:00
struct block_device * part ;
2021-04-06 08:23:00 +02:00
unsigned long idx ;
2005-04-16 15:20:36 -07:00
2021-11-22 14:06:21 +01:00
if ( ! get_capacity ( sgp ) | | ( sgp - > flags & GENHD_FL_HIDDEN ) )
2005-04-16 15:20:36 -07:00
return 0 ;
2021-04-06 08:23:00 +02:00
rcu_read_lock ( ) ;
xa_for_each ( & sgp - > part_tbl , idx , part ) {
if ( ! bdev_nr_sectors ( part ) )
continue ;
2021-07-27 08:25:15 +02:00
seq_printf ( seqf , " %4d %7d %10llu %pg \n " ,
2020-11-24 09:52:59 +01:00
MAJOR ( part - > bd_dev ) , MINOR ( part - > bd_dev ) ,
2021-07-27 08:25:15 +02:00
bdev_nr_sectors ( part ) > > 1 , part ) ;
2021-04-06 08:23:00 +02:00
}
rcu_read_unlock ( ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
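show_partition ( ) above prints one line per partition and converts the size from sectors to 1 KiB blocks with a right shift by one. The sketch below shows that conversion and line layout in user-space C, assuming 512-byte sectors and a plain name string in place of the kernel-only `%pg` format. `sk_sectors_to_kib` and `sk_format_partition` are hypothetical helpers, and the exact spacing between columns is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Two 512-byte sectors make one 1 KiB block (assumed sector size). */
static unsigned long long sk_sectors_to_kib(unsigned long long sectors)
{
	return sectors >> 1;
}

/* Format one /proc/partitions-style line into buf; returns its length. */
static int sk_format_partition(char *buf, size_t len, int major, int minor,
			       unsigned long long sectors, const char *name)
{
	assert(buf != NULL && name != NULL);
	return snprintf(buf, len, "%4d %7d %10llu %s\n",
			major, minor, sk_sectors_to_kib(sectors), name);
}
```

For example, a 2048-sector partition is reported as 1024 blocks in this sketch.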
2008-10-04 23:53:21 +04:00
static const struct seq_operations partitions_op = {
2008-09-03 08:57:12 +02:00
. start = show_partition_start ,
. next = disk_seqf_next ,
. stop = disk_seqf_stop ,
2007-05-21 22:08:01 +02:00
. show = show_partition
2005-04-16 15:20:36 -07:00
} ;
# endif
static int __init genhd_device_init ( void )
{
2008-04-21 10:51:07 -07:00
int error ;
block_class . dev_kobj = sysfs_dev_block_kobj ;
error = class_register ( & block_class ) ;
2008-03-11 17:13:15 -07:00
if ( unlikely ( error ) )
return error ;
2005-04-16 15:20:36 -07:00
blk_dev_init ( ) ;
2007-05-21 22:08:01 +02:00
block: fix boot failure with CONFIG_DEBUG_BLOCK_EXT_DEVT=y and nash
We ran into a system boot failure with kernel 2.6.28-rc. We found it on a
couple of machines, including a T61 notebook, a Nehalem machine, and an
HPC NX6325 notebook. All the machines use FedoraCore 8 or FedoraCore 9.
With kernels prior to 2.6.28-rc, system boot doesn't fail.
I debugged it and located the root cause. Please see
http://bugzilla.kernel.org/show_bug.cgi?id=11899
https://bugzilla.redhat.com/show_bug.cgi?id=471517
As a matter of fact, there are 2 bugs.
1) With root=/dev/sda1, system boot randomly fails, mostly once in every
5 boots. nash has a bug. Some of its functions misuse the return
value 0. Sometimes, 0 means timeout and no uevent available. Sometimes,
0 means nash got a uevent, but the uevent isn't block-related (for
example, usb). If, by coincidence, the kernel tells nash that uevents are
available but also sets a timeout, nash might stop collecting
other uevents in the queue if the current uevent isn't block-related. I
worked out a patch for nash to fix it.
http://bugzilla.kernel.org/attachment.cgi?id=18858
2) With root=LABEL=/, the system can never boot. initrd init reports that
switchroot fails. Here is the execution path of nash when booting:
(1) nash reads /sys/block/sda/dev; assume the major is 8 (on my desktop);
(2) nash queries /proc/devices with the major number; it finds the line
"8 sd";
(3) nash uses 'sd' to search its own probe table to find the device (DISK)
type for the device and adds it to its own list;
(4) Later on, it probes all devices in its list to get filesystem
labels; scsi always registers "8 sd".
When the major is 259, nash fails to find the device (DISK) type. I enabled
CONFIG_DEBUG_BLOCK_EXT_DEVT=y when compiling the kernel, so 259 is picked
for device /dev/sda1, which causes nash to fail to find the device (DISK)
type.
To fix issue 2), I created a patch for nash and another patch for the
kernel.
http://bugzilla.kernel.org/attachment.cgi?id=18859
http://bugzilla.kernel.org/attachment.cgi?id=18837
Below is the patch for kernel 2.6.28-rc4. It registers blkext, a new
block device, in /proc/devices.
With the 2 patches on nash and 1 patch on the kernel, I booted my machines
dozens of times without failure.
Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-14 08:26:30 +01:00
register_blkdev ( BLOCK_EXT_MAJOR , " blkext " ) ;
2007-05-21 22:08:01 +02:00
/* create top-level block dir */
2010-09-08 16:54:17 +02:00
if ( ! sysfs_deprecated )
block_depr = kobject_create_and_add ( " block " , NULL ) ;
2007-11-06 10:36:58 -08:00
return 0 ;
2005-04-16 15:20:36 -07:00
}
subsys_initcall ( genhd_device_init ) ;
2007-05-21 22:08:01 +02:00
static ssize_t disk_range_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
2005-04-16 15:20:36 -07:00
{
2007-05-21 22:08:01 +02:00
struct gendisk * disk = dev_to_disk ( dev ) ;
2005-04-16 15:20:36 -07:00
2007-05-21 22:08:01 +02:00
return sprintf ( buf , " %d \n " , disk - > minors ) ;
2005-04-16 15:20:36 -07:00
}
2008-08-25 19:47:23 +09:00
static ssize_t disk_ext_range_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct gendisk * disk = dev_to_disk ( dev ) ;
2021-11-22 14:06:22 +01:00
return sprintf ( buf , " %d \n " ,
( disk - > flags & GENHD_FL_NO_PART ) ? 1 : DISK_MAX_PARTS ) ;
2008-08-25 19:47:23 +09:00
}
2007-05-21 22:08:01 +02:00
static ssize_t disk_removable_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
2005-10-01 14:49:43 +02:00
{
2007-05-21 22:08:01 +02:00
struct gendisk * disk = dev_to_disk ( dev ) ;
2005-10-01 14:49:43 +02:00
2007-05-21 22:08:01 +02:00
return sprintf ( buf , " %d \n " ,
( disk - > flags & GENHD_FL_REMOVABLE ? 1 : 0 ) ) ;
2005-10-01 14:49:43 +02:00
}
2017-11-02 21:29:53 +03:00
static ssize_t disk_hidden_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct gendisk * disk = dev_to_disk ( dev ) ;
return sprintf ( buf , " %d \n " ,
( disk - > flags & GENHD_FL_HIDDEN ? 1 : 0 ) ) ;
}
2008-06-13 09:41:00 +02:00
static ssize_t disk_ro_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct gendisk * disk = dev_to_disk ( dev ) ;
2008-08-25 19:56:10 +09:00
return sprintf ( buf , " %d \n " , get_disk_ro ( disk ) ? 1 : 0 ) ;
2008-06-13 09:41:00 +02:00
}
2020-03-24 08:25:13 +01:00
ssize_t part_size_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
2020-11-27 16:43:51 +01:00
return sprintf ( buf , " %llu \n " , bdev_nr_sectors ( dev_to_bdev ( dev ) ) ) ;
2020-03-24 08:25:13 +01:00
}
ssize_t part_stat_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
2020-11-27 16:43:51 +01:00
struct block_device * bdev = dev_to_bdev ( dev ) ;
2021-10-14 15:03:30 +01:00
struct request_queue * q = bdev_get_queue ( bdev ) ;
2020-03-25 16:07:06 +03:00
struct disk_stats stat ;
2020-03-24 08:25:13 +01:00
unsigned int inflight ;
2020-05-13 12:49:33 +02:00
if ( queue_is_mq ( q ) )
2020-11-27 16:43:51 +01:00
inflight = blk_mq_in_flight ( q , bdev ) ;
2020-05-13 12:49:33 +02:00
else
2020-11-27 16:43:51 +01:00
inflight = part_in_flight ( bdev ) ;
2020-03-25 16:07:06 +03:00
2022-02-17 14:42:47 +08:00
if ( inflight ) {
part_stat_lock ( ) ;
update_io_ticks ( bdev , jiffies , true ) ;
part_stat_unlock ( ) ;
}
part_stat_read_all ( bdev , & stat ) ;
2020-03-24 08:25:13 +01:00
return sprintf ( buf ,
" %8lu %8lu %8llu %8u "
" %8lu %8lu %8llu %8u "
" %8u %8u %8u "
" %8lu %8lu %8llu %8u "
" %8lu %8u "
" \n " ,
2020-03-25 16:07:06 +03:00
stat . ios [ STAT_READ ] ,
stat . merges [ STAT_READ ] ,
( unsigned long long ) stat . sectors [ STAT_READ ] ,
( unsigned int ) div_u64 ( stat . nsecs [ STAT_READ ] , NSEC_PER_MSEC ) ,
stat . ios [ STAT_WRITE ] ,
stat . merges [ STAT_WRITE ] ,
( unsigned long long ) stat . sectors [ STAT_WRITE ] ,
( unsigned int ) div_u64 ( stat . nsecs [ STAT_WRITE ] , NSEC_PER_MSEC ) ,
2020-03-24 08:25:13 +01:00
inflight ,
2020-03-25 16:07:06 +03:00
jiffies_to_msecs ( stat . io_ticks ) ,
2020-03-25 16:07:08 +03:00
( unsigned int ) div_u64 ( stat . nsecs [ STAT_READ ] +
stat . nsecs [ STAT_WRITE ] +
stat . nsecs [ STAT_DISCARD ] +
stat . nsecs [ STAT_FLUSH ] ,
NSEC_PER_MSEC ) ,
2020-03-25 16:07:06 +03:00
stat . ios [ STAT_DISCARD ] ,
stat . merges [ STAT_DISCARD ] ,
( unsigned long long ) stat . sectors [ STAT_DISCARD ] ,
( unsigned int ) div_u64 ( stat . nsecs [ STAT_DISCARD ] , NSEC_PER_MSEC ) ,
stat . ios [ STAT_FLUSH ] ,
( unsigned int ) div_u64 ( stat . nsecs [ STAT_FLUSH ] , NSEC_PER_MSEC ) ) ;
2020-03-24 08:25:13 +01:00
}
ssize_t part_inflight_show ( struct device * dev , struct device_attribute * attr ,
char * buf )
{
2020-11-27 16:43:51 +01:00
struct block_device * bdev = dev_to_bdev ( dev ) ;
2021-10-14 15:03:30 +01:00
struct request_queue * q = bdev_get_queue ( bdev ) ;
2020-03-24 08:25:13 +01:00
unsigned int inflight [ 2 ] ;
2020-05-13 12:49:33 +02:00
if ( queue_is_mq ( q ) )
2020-11-27 16:43:51 +01:00
blk_mq_in_flight_rw ( q , bdev , inflight ) ;
2020-05-13 12:49:33 +02:00
else
2020-11-27 16:43:51 +01:00
part_in_flight_rw ( bdev , inflight ) ;
2020-05-13 12:49:33 +02:00
2020-03-24 08:25:13 +01:00
return sprintf ( buf , " %8u %8u \n " , inflight [ 0 ] , inflight [ 1 ] ) ;
}
2007-05-21 22:08:01 +02:00
static ssize_t disk_capability_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
2007-05-23 13:57:38 -07:00
{
2023-02-03 16:02:09 +01:00
dev_warn_once ( dev , " the capability attribute has been deprecated. \n " ) ;
return sprintf ( buf , " 0 \n " ) ;
2007-05-23 13:57:38 -07:00
}
2007-05-21 22:08:01 +02:00
2009-05-22 17:17:53 -04:00
static ssize_t disk_alignment_offset_show ( struct device * dev ,
struct device_attribute * attr ,
char * buf )
{
struct gendisk * disk = dev_to_disk ( dev ) ;
2022-04-15 06:52:48 +02:00
return sprintf ( buf , " %d \n " , bdev_alignment_offset ( disk - > part0 ) ) ;
2009-05-22 17:17:53 -04:00
}
2009-11-10 11:50:21 +01:00
static ssize_t disk_discard_alignment_show ( struct device * dev ,
struct device_attribute * attr ,
char * buf )
{
struct gendisk * disk = dev_to_disk ( dev ) ;
2022-04-15 06:52:50 +02:00
return sprintf ( buf , " %d \n " , bdev_discard_alignment ( disk - > part0 ) ) ;
2009-11-10 11:50:21 +01:00
}
2021-07-13 01:05:28 +02:00
static ssize_t diskseq_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
struct gendisk * disk = dev_to_disk ( dev ) ;
return sprintf ( buf , " %llu \n " , disk - > diskseq ) ;
}
2018-05-24 13:38:59 -06:00
static DEVICE_ATTR ( range , 0444 , disk_range_show , NULL ) ;
static DEVICE_ATTR ( ext_range , 0444 , disk_ext_range_show , NULL ) ;
static DEVICE_ATTR ( removable , 0444 , disk_removable_show , NULL ) ;
static DEVICE_ATTR ( hidden , 0444 , disk_hidden_show , NULL ) ;
static DEVICE_ATTR ( ro , 0444 , disk_ro_show , NULL ) ;
static DEVICE_ATTR ( size , 0444 , part_size_show , NULL ) ;
static DEVICE_ATTR ( alignment_offset , 0444 , disk_alignment_offset_show , NULL ) ;
static DEVICE_ATTR ( discard_alignment , 0444 , disk_discard_alignment_show , NULL ) ;
static DEVICE_ATTR ( capability , 0444 , disk_capability_show , NULL ) ;
static DEVICE_ATTR ( stat , 0444 , part_stat_show , NULL ) ;
static DEVICE_ATTR ( inflight , 0444 , part_inflight_show , NULL ) ;
static DEVICE_ATTR ( badblocks , 0644 , disk_badblocks_show , disk_badblocks_store ) ;
2021-07-13 01:05:28 +02:00
static DEVICE_ATTR ( diskseq , 0444 , diskseq_show , NULL ) ;
2020-03-24 08:25:13 +01:00
2006-12-08 02:39:46 -08:00
# ifdef CONFIG_FAIL_MAKE_REQUEST
2020-03-24 08:25:13 +01:00
ssize_t part_fail_show ( struct device * dev ,
struct device_attribute * attr , char * buf )
{
2020-11-27 16:43:51 +01:00
return sprintf ( buf , " %d \n " , dev_to_bdev ( dev ) - > bd_make_it_fail ) ;
2020-03-24 08:25:13 +01:00
}
ssize_t part_fail_store ( struct device * dev ,
struct device_attribute * attr ,
const char * buf , size_t count )
{
int i ;
if ( count > 0 & & sscanf ( buf , " %d " , & i ) > 0 )
2020-11-27 16:43:51 +01:00
dev_to_bdev ( dev ) - > bd_make_it_fail = i ;
2020-03-24 08:25:13 +01:00
return count ;
}
2007-05-21 22:08:01 +02:00
static struct device_attribute dev_attr_fail =
2018-05-24 13:38:59 -06:00
__ATTR ( make - it - fail , 0644 , part_fail_show , part_fail_store ) ;
2020-03-24 08:25:13 +01:00
# endif /* CONFIG_FAIL_MAKE_REQUEST */
2008-09-14 05:56:33 -07:00
# ifdef CONFIG_FAIL_IO_TIMEOUT
static struct device_attribute dev_attr_fail_timeout =
2018-05-24 13:38:59 -06:00
__ATTR ( io - timeout - fail , 0644 , part_timeout_show , part_timeout_store ) ;
2008-09-14 05:56:33 -07:00
# endif
2007-05-21 22:08:01 +02:00
static struct attribute * disk_attrs [ ] = {
& dev_attr_range . attr ,
2008-08-25 19:47:23 +09:00
& dev_attr_ext_range . attr ,
2007-05-21 22:08:01 +02:00
& dev_attr_removable . attr ,
2017-11-02 21:29:53 +03:00
& dev_attr_hidden . attr ,
2008-06-13 09:41:00 +02:00
& dev_attr_ro . attr ,
2007-05-21 22:08:01 +02:00
& dev_attr_size . attr ,
2009-05-22 17:17:53 -04:00
& dev_attr_alignment_offset . attr ,
2009-11-10 11:50:21 +01:00
& dev_attr_discard_alignment . attr ,
2007-05-21 22:08:01 +02:00
& dev_attr_capability . attr ,
& dev_attr_stat . attr ,
block: Separate read and write statistics of in_flight requests v2
Commit a9327cac440be4d8333bba975cbbf76045096275 added separate read
and write statistics of in_flight requests and exported the number
of read and write requests in progress separately through sysfs.
But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time was higher than normal.
So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b.
The problem was in part_round_stats_single(); I missed the following:
if (now == part->stamp)
return;
- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
With this chunk included, the reported regression gets fixed.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-06 20:16:55 +02:00
& dev_attr_inflight . attr ,
2016-01-09 08:36:51 -08:00
& dev_attr_badblocks . attr ,
2021-06-24 09:38:43 +02:00
& dev_attr_events . attr ,
& dev_attr_events_async . attr ,
& dev_attr_events_poll_msecs . attr ,
2021-07-13 01:05:28 +02:00
& dev_attr_diskseq . attr ,
2007-05-21 22:08:01 +02:00
# ifdef CONFIG_FAIL_MAKE_REQUEST
& dev_attr_fail . attr ,
2008-09-14 05:56:33 -07:00
# endif
# ifdef CONFIG_FAIL_IO_TIMEOUT
& dev_attr_fail_timeout . attr ,
2007-05-21 22:08:01 +02:00
# endif
NULL
} ;
2017-04-27 14:46:26 -07:00
static umode_t disk_visible ( struct kobject * kobj , struct attribute * a , int n )
{
struct device * dev = container_of ( kobj , typeof ( * dev ) , kobj ) ;
struct gendisk * disk = dev_to_disk ( dev ) ;
if ( a = = & dev_attr_badblocks . attr & & ! disk - > bb )
return 0 ;
return a - > mode ;
}
2007-05-21 22:08:01 +02:00
static struct attribute_group disk_attr_group = {
. attrs = disk_attrs ,
2017-04-27 14:46:26 -07:00
. is_visible = disk_visible ,
2007-05-21 22:08:01 +02:00
} ;
2009-06-24 10:06:31 -07:00
static const struct attribute_group * disk_attr_groups [ ] = {
2007-05-21 22:08:01 +02:00
& disk_attr_group ,
2022-06-28 19:18:45 +02:00
# ifdef CONFIG_BLK_DEV_IO_TRACE
& blk_trace_attr_group ,
# endif
2007-05-21 22:08:01 +02:00
NULL
2005-04-16 15:20:36 -07:00
} ;
2020-06-19 20:47:23 +00:00
/**
* disk_release - releases all allocated resources of the gendisk
* @ dev : the device representing this disk
*
* This function releases all allocated resources of the gendisk .
*
* Drivers which used __device_add_disk ( ) have a gendisk with a request_queue
* assigned . Since the request_queue sits on top of the gendisk for these
* drivers we also call blk_put_queue ( ) for them , and we expect the
* request_queue refcount to reach 0 at this point , and so the request_queue
* will also be freed prior to the disk .
2020-06-19 20:47:25 +00:00
*
* Context : can sleep
2020-06-19 20:47:23 +00:00
*/
2007-05-21 22:08:01 +02:00
static void disk_release ( struct device * dev )
2005-04-16 15:20:36 -07:00
{
2007-05-21 22:08:01 +02:00
struct gendisk * disk = dev_to_disk ( dev ) ;
2020-06-19 20:47:25 +00:00
might_sleep ( ) ;
2021-10-14 15:02:31 +02:00
WARN_ON_ONCE ( disk_live ( disk ) ) ;
2020-06-19 20:47:25 +00:00
2022-07-20 15:05:41 +02:00
/*
* To undo all initialization from blk_mq_init_allocated_queue in
* case of a probe failure where add_disk is never called we have to
* call blk_mq_exit_queue here . We can ' t do this for the more common
* teardown case ( yet ) as the tagset can be gone by the time the disk
* is released once it was added .
*/
if ( queue_is_mq ( disk - > queue ) & &
test_bit ( GD_OWNS_QUEUE , & disk - > state ) & &
! test_bit ( GD_ADDED , & disk - > state ) )
blk_mq_exit_queue ( disk - > queue ) ;
2023-02-14 19:33:06 +01:00
blkcg_exit_disk ( disk ) ;
2022-07-27 12:22:57 -04:00
bioset_exit ( & disk - > bio_split ) ;
blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
To avoid slowing down queue destruction, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(); instead, canceling the
dispatch work is delayed until blk_release_queue().
However, this approach has caused a kernel oops [1], reported by Changhui. The log
shows that scsi_device can be freed before running blk_release_queue(),
which is also expected, since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.
Fix the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():
1) when disk_release() is run, the disk has been closed, and any sync
dispatch activities have been done, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.
2) in blk_cleanup_queue(), we only focus on passthrough requests, and
a passthrough request is always explicitly allocated & freed by
its caller, so once the queue is frozen, all sync dispatch activity
for passthrough requests has been done, and it is enough to just cancel
dispatch work to avoid any further dispatch activity.
[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055] <TASK>
[12622.918394] scsi_mq_get_budget+0x1a/0x110
[12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404] ? pick_next_task_fair+0x39/0x390
[12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829] __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593] process_one_work+0x1e8/0x3c0
[12622.954059] worker_thread+0x50/0x3b0
[12622.958144] ? rescuer_thread+0x370/0x370
[12622.962616] kthread+0x158/0x180
[12622.966218] ? set_kthread_struct+0x40/0x40
[12622.970884] ret_from_fork+0x22/0x30
[12622.974875] </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-16 09:43:43 +08:00
implement in-kernel gendisk events handling
Currently, media presence polling for removable block devices is done
from userland. There are several issues with this.
* Polling is done by periodically opening the device. For SCSI
devices, the command sequence generated by such action involves a
few different commands including TEST_UNIT_READY. This behavior,
while perfectly legal, is different from Windows which only issues
a single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
ATAPI devices lock up after being periodically queried with such command
sequences.
* There is no reliable and unintrusive way for a userland program to
tell whether the target device is safe for media presence polling.
For example, polling for media presence during an on-going burning
session can make it fail. The polling program can avoid this by
opening the device with O_EXCL but then it risks making a valid
exclusive user of the device fail w/ -EBUSY.
* Userland polling is unnecessarily heavy and in-kernel implementation
is lighter and better coordinated (workqueue, timer slack).
This patch implements a framework for in-kernel disk event handling,
which includes media presence polling.
* bdops->check_events() is added, which supersedes ->media_changed().
It should check whether there are any pending events and return them if so.
Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
called concurrently.
* gendisk->events and ->async_events are added. These should be
initialized by block driver before passing the device to add_disk().
The former contains the mask of all supported events and the latter
the mask of all events which the device can report without polling.
/sys/block/*/events[_async] export these to userland.
* Kernel parameter block.events_dfl_poll_msecs controls the system
polling interval (default is 0 which means disable) and
/sys/block/*/events_poll_msecs controls polling intervals for
individual devices (default is -1 meaning use system setting). Note
that if a device can report all supported events asynchronously and
its polling interval isn't explicitly set, the device won't be
polled regardless of the system polling interval.
* If a device is opened exclusively with write access, event checking
is automatically disabled until all write exclusive accesses are
released.
* There are event 'clearing' events. For example, both of currently
defined events are cleared after the device has been successfully
opened. This information is passed to ->check_events() callback
using @clearing argument as a hint.
* Event checking is always performed from system_nrt_wq and timer
slack is set to 25% for polling.
* Nothing changes for drivers which implement ->media_changed() but
not ->check_events(). Going forward, all drivers will be converted
to ->check_events() and ->media_changed() will be dropped.
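As a rough illustration of the polling-interval rule described in the list above, here is a small userspace model. It is a sketch under stated assumptions, not the kernel implementation: the struct `disk_events_model` and the function `effective_poll_msecs()` are hypothetical names, and the model only encodes what this message says (an explicit per-device value wins; otherwise the system default applies only if some supported event cannot be reported asynchronously).

```c
/* Hypothetical stand-alone model; names are illustrative, not the kernel's. */
struct disk_events_model {
	unsigned int events;       /* mask of all supported events */
	unsigned int async_events; /* events the device reports without polling */
	long poll_msecs;           /* per-device interval; -1 = use system default */
};

/* Returns the effective polling interval in ms; 0 means "do not poll". */
static unsigned long effective_poll_msecs(const struct disk_events_model *ev,
					  unsigned long dfl_poll_msecs)
{
	/* An explicitly set per-device interval always wins. */
	if (ev->poll_msecs >= 0)
		return (unsigned long)ev->poll_msecs;

	/*
	 * No explicit setting: poll at the system interval only if at
	 * least one supported event cannot be reported asynchronously.
	 */
	if (ev->events & ~ev->async_events)
		return dfl_poll_msecs;

	return 0;
}
```

With a system default of 2000 ms, a device whose `async_events` covers all of `events` is not polled unless its own `poll_msecs` is set explicitly.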
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-12-08 20:57:37 +01:00
disk_release_events ( disk ) ;
2005-04-16 15:20:36 -07:00
kfree ( disk - > random ) ;
2022-07-06 09:03:42 +02:00
disk_free_zone_bitmaps ( disk ) ;
2021-01-24 11:02:41 +01:00
xa_destroy ( & disk - > part_tbl ) ;
2022-03-08 06:51:55 +01:00
2021-08-16 15:46:24 +02:00
disk - > queue - > disk = NULL ;
2021-08-16 15:19:09 +02:00
blk_put_queue ( disk - > queue ) ;
2022-02-15 10:45:10 +01:00
if ( test_bit ( GD_ADDED , & disk - > state ) & & disk - > fops - > free_disk )
disk - > fops - > free_disk ( disk ) ;
2021-07-22 09:54:02 +02:00
iput ( disk - > part0 - > bd_inode ) ; /* frees the disk */
2005-04-16 15:20:36 -07:00
}
2021-07-13 01:05:26 +02:00
2022-11-23 13:25:19 +01:00
static int block_uevent ( const struct device * dev , struct kobj_uevent_env * env )
2021-07-13 01:05:26 +02:00
{
2022-11-23 13:25:19 +01:00
const struct gendisk * disk = dev_to_disk ( dev ) ;
2021-07-13 01:05:26 +02:00
return add_uevent_var ( env , " DISKSEQ=%llu " , disk - > diskseq ) ;
}
2007-05-21 22:08:01 +02:00
struct class block_class = {
. name = " block " ,
2021-07-13 01:05:26 +02:00
. dev_uevent = block_uevent ,
2005-04-16 15:20:36 -07:00
} ;
2023-01-11 12:30:08 +01:00
static char * block_devnode ( const struct device * dev , umode_t * mode ,
2023-01-04 14:44:02 -07:00
kuid_t * uid , kgid_t * gid )
{
struct gendisk * disk = dev_to_disk ( dev ) ;
if ( disk - > fops - > devnode )
return disk - > fops - > devnode ( disk , mode ) ;
return NULL ;
}
2020-06-01 13:12:05 -07:00
const struct device_type disk_type = {
2007-05-21 22:08:01 +02:00
. name = " disk " ,
. groups = disk_attr_groups ,
. release = disk_release ,
2023-01-04 14:44:02 -07:00
. devnode = block_devnode ,
2005-04-16 15:20:36 -07:00
} ;
2008-05-23 09:44:11 -07:00
# ifdef CONFIG_PROC_FS
2008-09-03 09:01:09 +02:00
/*
* aggregate disk stat collector . Uses the same stats that the sysfs
* entries do , above , but makes them available through one seq_file .
*
* The output looks suspiciously like / proc / partitions with a bunch of
* extra fields .
*/
static int diskstats_show ( struct seq_file * seqf , void * v )
2005-04-16 15:20:36 -07:00
{
struct gendisk * gp = v ;
2020-11-24 09:52:59 +01:00
struct block_device * hd ;
2018-12-06 11:41:21 -05:00
unsigned int inflight ;
2020-03-25 16:07:06 +03:00
struct disk_stats stat ;
2021-04-06 08:23:01 +02:00
unsigned long idx ;
2005-04-16 15:20:36 -07:00
/*
2008-08-25 19:56:05 +09:00
if ( & disk_to_dev ( gp ) - > kobj . entry = = block_class . devices . next )
2008-09-03 09:01:09 +02:00
seq_puts ( seqf , " major minor name "
2005-04-16 15:20:36 -07:00
" rio rmerge rsect ruse wio wmerge "
" wsect wuse running use aveq "
" \n \n " ) ;
*/
2011-06-13 10:45:43 +02:00
2021-04-06 08:23:01 +02:00
rcu_read_lock ( ) ;
xa_for_each ( & gp - > part_tbl , idx , hd ) {
if ( bdev_is_partition ( hd ) & & ! bdev_nr_sectors ( hd ) )
continue ;
2020-05-13 12:49:33 +02:00
if ( queue_is_mq ( gp - > queue ) )
2020-11-24 09:52:59 +01:00
inflight = blk_mq_in_flight ( gp - > queue , hd ) ;
2020-05-13 12:49:33 +02:00
else
2020-11-24 09:52:59 +01:00
inflight = part_in_flight ( hd ) ;
2020-03-25 16:07:06 +03:00
2022-02-17 14:42:47 +08:00
if ( inflight ) {
part_stat_lock ( ) ;
update_io_ticks ( hd , jiffies , true ) ;
part_stat_unlock ( ) ;
}
part_stat_read_all ( hd , & stat ) ;
2021-07-27 08:25:13 +02:00
seq_printf ( seqf , " %4d %7d %pg "
2018-07-18 04:47:40 -07:00
" %lu %lu %lu %u "
" %lu %lu %lu %u "
" %u %u %u "
2019-11-21 13:40:26 +03:00
" %lu %lu %lu %u "
" %lu %u "
" \n " ,
2021-07-27 08:25:13 +02:00
MAJOR ( hd - > bd_dev ) , MINOR ( hd - > bd_dev ) , hd ,
2020-03-25 16:07:06 +03:00
stat . ios [ STAT_READ ] ,
stat . merges [ STAT_READ ] ,
stat . sectors [ STAT_READ ] ,
( unsigned int ) div_u64 ( stat . nsecs [ STAT_READ ] ,
NSEC_PER_MSEC ) ,
stat . ios [ STAT_WRITE ] ,
stat . merges [ STAT_WRITE ] ,
stat . sectors [ STAT_WRITE ] ,
( unsigned int ) div_u64 ( stat . nsecs [ STAT_WRITE ] ,
NSEC_PER_MSEC ) ,
2018-12-06 11:41:21 -05:00
inflight ,
2020-03-25 16:07:06 +03:00
jiffies_to_msecs ( stat . io_ticks ) ,
2020-03-25 16:07:08 +03:00
( unsigned int ) div_u64 ( stat . nsecs [ STAT_READ ] +
stat . nsecs [ STAT_WRITE ] +
stat . nsecs [ STAT_DISCARD ] +
stat . nsecs [ STAT_FLUSH ] ,
NSEC_PER_MSEC ) ,
2020-03-25 16:07:06 +03:00
stat . ios [ STAT_DISCARD ] ,
stat . merges [ STAT_DISCARD ] ,
stat . sectors [ STAT_DISCARD ] ,
( unsigned int ) div_u64 ( stat . nsecs [ STAT_DISCARD ] ,
NSEC_PER_MSEC ) ,
stat . ios [ STAT_FLUSH ] ,
( unsigned int ) div_u64 ( stat . nsecs [ STAT_FLUSH ] ,
NSEC_PER_MSEC )
2008-02-08 11:04:56 +01:00
) ;
2005-04-16 15:20:36 -07:00
}
2021-04-06 08:23:01 +02:00
rcu_read_unlock ( ) ;
2011-06-13 10:45:43 +02:00
2005-04-16 15:20:36 -07:00
return 0 ;
}
2008-10-06 12:55:38 +04:00
static const struct seq_operations diskstats_op = {
2008-09-03 08:57:12 +02:00
. start = disk_seqf_start ,
. next = disk_seqf_next ,
. stop = disk_seqf_stop ,
2005-04-16 15:20:36 -07:00
. show = diskstats_show
} ;
2008-10-04 23:53:21 +04:00
static int __init proc_genhd_init ( void )
{
2018-04-13 19:44:18 +02:00
proc_create_seq ( " diskstats " , 0 , NULL , & diskstats_op ) ;
proc_create_seq ( " partitions " , 0 , NULL , & partitions_op ) ;
2008-10-04 23:53:21 +04:00
return 0 ;
}
module_init ( proc_genhd_init ) ;
2008-05-23 09:44:11 -07:00
# endif /* CONFIG_PROC_FS */
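For readers consuming this output from userspace, the line layout printed by the seq_printf ( ) call in diskstats_show ( ) can be parsed with a single sscanf ( ). The following is a minimal sketch, assuming the field order shown in the format string above (major, minor, name, four read counters, four write counters, in-flight count, io_ticks, time in queue). The struct `diskstats_row`, its member names, and `parse_diskstats_line()` are hypothetical labels chosen here; the trailing discard and flush fields are ignored.

```c
#include <stdio.h>

/* Hypothetical container for the first 14 fields of one /proc/diskstats line. */
struct diskstats_row {
	unsigned int major, minor;
	char name[32];
	unsigned long rd_ios, rd_merges, rd_sectors;
	unsigned int rd_ms;
	unsigned long wr_ios, wr_merges, wr_sectors;
	unsigned int wr_ms;
	unsigned int inflight, io_ticks_ms, time_in_queue_ms;
};

/* Returns 0 on success, -1 if the line does not carry all 14 fields. */
static int parse_diskstats_line(const char *line, struct diskstats_row *r)
{
	int n = sscanf(line,
		       "%u %u %31s %lu %lu %lu %u %lu %lu %lu %u %u %u %u",
		       &r->major, &r->minor, r->name,
		       &r->rd_ios, &r->rd_merges, &r->rd_sectors, &r->rd_ms,
		       &r->wr_ios, &r->wr_merges, &r->wr_sectors, &r->wr_ms,
		       &r->inflight, &r->io_ticks_ms, &r->time_in_queue_ms);

	return n == 14 ? 0 : -1;
}
```

Any fields after the fourteenth are left unread, so the same parser accepts lines with or without the discard and flush columns.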
2005-04-16 15:20:36 -07:00
2021-05-25 08:13:00 +02:00
dev_t part_devt ( struct gendisk * disk , u8 partno )
{
2021-05-25 08:13:01 +02:00
struct block_device * part ;
2021-05-25 08:13:00 +02:00
dev_t devt = 0 ;
2021-05-25 08:13:01 +02:00
rcu_read_lock ( ) ;
part = xa_load ( & disk - > part_tbl , partno ) ;
if ( part )
2021-05-25 08:13:00 +02:00
devt = part - > bd_dev ;
2021-05-25 08:13:01 +02:00
rcu_read_unlock ( ) ;
2021-05-25 08:13:00 +02:00
return devt ;
}
2008-09-03 09:01:09 +02:00
dev_t blk_lookup_devt ( const char * name , int partno )
2008-05-22 17:21:08 -04:00
{
2008-09-03 08:57:12 +02:00
dev_t devt = MKDEV ( 0 , 0 ) ;
struct class_dev_iter iter ;
struct device * dev ;
2008-05-22 17:21:08 -04:00
2008-09-03 08:57:12 +02:00
class_dev_iter_init ( & iter , & block_class , NULL , & disk_type ) ;
while ( ( dev = class_dev_iter_next ( & iter ) ) ) {
2008-05-22 17:21:08 -04:00
struct gendisk * disk = dev_to_disk ( dev ) ;
2009-01-06 10:44:43 -08:00
if ( strcmp ( dev_name ( dev ) , name ) )
2008-09-03 09:01:48 +02:00
continue ;
2009-02-18 10:33:59 +01:00
if ( partno < disk - > minors ) {
/* We need to return the right devno, even
* if the partition doesn ' t exist yet .
*/
devt = MKDEV ( MAJOR ( dev - > devt ) ,
MINOR ( dev - > devt ) + partno ) ;
2021-05-25 08:13:00 +02:00
} else {
devt = part_devt ( disk , partno ) ;
if ( devt )
break ;
2008-09-03 08:57:12 +02:00
}
2008-08-16 14:30:30 +02:00
}
2008-09-03 08:57:12 +02:00
class_dev_iter_exit ( & iter ) ;
2007-05-21 22:08:01 +02:00
return devt ;
}
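The arithmetic in the partno < disk - > minors branch of blk_lookup_devt ( ) can be illustrated in userspace. This is a sketch only: the `model_` macros and functions are hypothetical stand-ins that mirror the kernel's 20-bit minor encoding (MINORBITS is 20 in include/linux/kdev_t.h), and `model_part_devt()` reproduces just the "whole-disk minor plus partition number" computation, not the part_devt ( ) table lookup.

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel's dev_t helpers (20-bit minor). */
#define MODEL_MINORBITS 20
#define MODEL_MINORMASK ((1U << MODEL_MINORBITS) - 1)

static uint32_t model_mkdev(uint32_t major, uint32_t minor)
{
	return (major << MODEL_MINORBITS) | minor;
}

static uint32_t model_major(uint32_t devt)
{
	return devt >> MODEL_MINORBITS;
}

static uint32_t model_minor(uint32_t devt)
{
	return devt & MODEL_MINORMASK;
}

/*
 * Device number of partition @partno, derived from the whole-disk
 * device number. Valid only while partno is below the disk's minor
 * range, which is the case the first branch handles.
 */
static uint32_t model_part_devt(uint32_t disk_devt, unsigned int partno)
{
	return model_mkdev(model_major(disk_devt),
			   model_minor(disk_devt) + partno);
}
```

For a disk at 8:16, partition 3 maps to 8:19 under this scheme, whether or not the partition exists yet.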
2021-08-16 15:19:08 +02:00
struct gendisk * __alloc_disk_node ( struct request_queue * q , int node_id ,
struct lock_class_key * lkclass )
2005-06-23 00:08:19 -07:00
{
struct gendisk * disk ;
2013-08-29 15:21:42 -07:00
disk = kzalloc_node ( sizeof ( struct gendisk ) , GFP_KERNEL , node_id ) ;
2020-08-31 20:02:37 +02:00
if ( ! disk )
2022-08-11 20:23:37 -03:00
return NULL ;
2011-01-07 08:43:37 +01:00
2022-07-27 12:22:57 -04:00
if ( bioset_init ( & disk - > bio_split , BIO_POOL_SIZE , 0 , 0 ) )
goto out_free_disk ;
2021-08-09 16:17:43 +02:00
disk - > bdi = bdi_alloc ( node_id ) ;
if ( ! disk - > bdi )
2022-07-27 12:22:57 -04:00
goto out_free_bioset ;
2021-08-09 16:17:43 +02:00
2021-10-14 15:03:26 +01:00
/* bdev_alloc() might need the queue, set before the first call */
disk - > queue = q ;
2020-11-26 18:47:17 +01:00
disk - > part0 = bdev_alloc ( disk , 0 ) ;
if ( ! disk - > part0 )
2021-08-09 16:17:43 +02:00
goto out_free_bdi ;
2020-11-26 09:23:26 +01:00
2020-08-31 20:02:37 +02:00
disk - > node_id = node_id ;
2021-05-25 08:12:56 +02:00
mutex_init ( & disk - > open_mutex ) ;
2021-01-24 11:02:41 +01:00
xa_init ( & disk - > part_tbl ) ;
if ( xa_insert ( & disk - > part_tbl , 0 , disk - > part0 , GFP_KERNEL ) )
goto out_destroy_part_tbl ;
2020-08-31 20:02:37 +02:00
2023-02-14 19:33:06 +01:00
if ( blkcg_init_disk ( disk ) )
goto out_erase_part0 ;
2020-08-31 20:02:37 +02:00
rand_initialize_disk ( disk ) ;
disk_to_dev ( disk ) - > class = & block_class ;
disk_to_dev ( disk ) - > type = & disk_type ;
device_initialize ( disk_to_dev ( disk ) ) ;
block: add disk sequence number
Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.
Block devices do not have exclusive owners in userspace, any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events cannot thus reliably tell
whether an event relates to the device it just set up or another earlier
instance with the same name.
Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow deriving orderings from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.
Associating a unique, monotonically increasing sequential number with the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, makes it possible to solve the race conditions with
uevents, and also allows userspace processes to know whether they should
wait for the uevent they need or if it was dropped and thus they should
move on.
Additionally, increment the disk sequence number when the media changes,
i.e. on DISK_EVENT_MEDIA_CHANGE event.
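The decision this enables in userspace can be sketched as a three-way comparison. This is an illustrative model, not code from this patch: `classify_uevent()` and the enum names are hypothetical, and it assumes the caller has already filtered uevents down to the device name it cares about and obtained its own sequence number right after setting the device up.

```c
#include <stdint.h>

/* Hypothetical verdicts for one uevent seen on the watched device name. */
enum diskseq_verdict {
	DISKSEQ_STALE,  /* from an earlier instance of the name: keep waiting */
	DISKSEQ_MATCH,  /* the event for the device we set up */
	DISKSEQ_MISSED, /* from a later instance: our event is not coming */
};

/*
 * Compare the sequence number of the device we set up with the
 * DISKSEQ value carried by a received uevent.
 */
static enum diskseq_verdict classify_uevent(uint64_t my_diskseq,
					    uint64_t uevent_diskseq)
{
	if (uevent_diskseq < my_diskseq)
		return DISKSEQ_STALE;
	if (uevent_diskseq == my_diskseq)
		return DISKSEQ_MATCH;
	return DISKSEQ_MISSED;
}
```

Because the numbers only increase, a larger value tells the waiter that its own event was already sent or dropped, which a UUID comparison alone cannot do.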
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-2-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-13 01:05:25 +02:00
inc_diskseq ( disk ) ;
2021-08-16 15:46:24 +02:00
q - > disk = disk ;
2021-08-16 15:19:05 +02:00
lockdep_init_map ( & disk - > lockdep_map , " (bio completion) " , lkclass , 0 ) ;
2021-08-04 11:41:42 +02:00
# ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
INIT_LIST_HEAD ( & disk - > slave_bdevs ) ;
# endif
2005-04-16 15:20:36 -07:00
return disk ;
2020-08-31 20:02:37 +02:00
2023-02-14 19:33:06 +01:00
out_erase_part0 :
xa_erase ( & disk - > part_tbl , 0 ) ;
2021-01-24 11:02:41 +01:00
out_destroy_part_tbl :
xa_destroy ( & disk - > part_tbl ) ;
2021-10-02 18:23:02 +09:00
disk - > part0 - > bd_disk = NULL ;
2021-07-22 09:54:02 +02:00
iput ( disk - > part0 - > bd_inode ) ;
2021-08-09 16:17:43 +02:00
out_free_bdi :
bdi_put ( disk - > bdi ) ;
2022-07-27 12:22:57 -04:00
out_free_bioset :
bioset_exit ( & disk - > bio_split ) ;
2020-08-31 20:02:37 +02:00
out_free_disk :
kfree ( disk ) ;
return NULL ;
2005-04-16 15:20:36 -07:00
}
2021-08-16 15:19:05 +02:00
struct gendisk * __blk_alloc_disk ( int node , struct lock_class_key * lkclass )
2021-05-21 07:50:55 +02:00
{
struct request_queue * q ;
struct gendisk * disk ;
2022-11-01 16:00:47 +01:00
q = blk_alloc_queue ( node ) ;
2021-05-21 07:50:55 +02:00
if ( ! q )
return NULL ;
2021-08-16 15:19:08 +02:00
disk = __alloc_disk_node ( q , node , lkclass ) ;
2021-05-21 07:50:55 +02:00
if ( ! disk ) {
2022-06-19 08:05:51 +02:00
blk_put_queue ( q ) ;
2021-05-21 07:50:55 +02:00
return NULL ;
}
2022-06-19 08:05:51 +02:00
set_bit ( GD_OWNS_QUEUE , & disk - > state ) ;
2021-05-21 07:50:55 +02:00
return disk ;
}
EXPORT_SYMBOL ( __blk_alloc_disk ) ;
2020-06-19 20:47:23 +00:00
/**
* put_disk - decrements the gendisk refcount
2020-07-30 18:42:30 -07:00
* @ disk : the struct gendisk to decrement the refcount for
2020-06-19 20:47:23 +00:00
*
* This decrements the refcount for the struct gendisk . When this reaches 0
* we ' ll have disk_release ( ) called .
2020-06-19 20:47:25 +00:00
*
2022-07-20 15:05:41 +02:00
* Note : for blk - mq disk put_disk must be called before freeing the tag_set
* when handling probe errors ( that is before add_disk ( ) is called ) .
*
2020-06-19 20:47:25 +00:00
* Context : Any context , but the last reference must not be dropped from
* atomic context .
2020-06-19 20:47:23 +00:00
*/
2005-04-16 15:20:36 -07:00
void put_disk ( struct gendisk * disk )
{
if ( disk )
2020-11-10 07:25:37 +01:00
put_device ( disk_to_dev ( disk ) ) ;
2005-04-16 15:20:36 -07:00
}
EXPORT_SYMBOL ( put_disk ) ;
2009-07-28 09:13:13 +02:00
static void set_disk_ro_uevent ( struct gendisk * gd , int ro )
{
char event [ ] = " DISK_RO=1 " ;
char * envp [ ] = { event , NULL } ;
if ( ! ro )
event [ 8 ] = ' 0 ' ;
kobject_uevent_env ( & disk_to_dev ( gd ) - > kobj , KOBJ_CHANGE , envp ) ;
}
2021-01-09 11:42:51 +01:00
/**
* set_disk_ro - set a gendisk read - only
* @ disk : gendisk to operate on
2021-01-29 05:55:05 +01:00
* @ read_only : % true to set the disk read - only , % false to set the disk read / write
2021-01-09 11:42:51 +01:00
*
* This function is used to indicate whether a given disk device should have its
* read - only flag set . set_disk_ro ( ) is typically used by device drivers to
* indicate whether the underlying physical device is write - protected .
*/
void set_disk_ro(struct gendisk *disk, bool read_only)
2005-04-16 15:20:36 -07:00
{
2021-01-09 11:42:51 +01:00
	if (read_only) {
		if (test_and_set_bit(GD_READ_ONLY, &disk->state))
			return;
	} else {
		if (!test_and_clear_bit(GD_READ_ONLY, &disk->state))
			return;
2009-07-28 09:13:13 +02:00
	}
2021-01-09 11:42:51 +01:00
	set_disk_ro_uevent(disk, read_only);
2005-04-16 15:20:36 -07:00
}
EXPORT_SYMBOL(set_disk_ro);
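set_disk_ro() only emits a uevent when the read-only state actually changes, and set_disk_ro_uevent() builds the payload by patching one character of a template string. A minimal user-space sketch of both ideas follows. It assumes a C11 atomic_bool in place of the GD_READ_ONLY bit and a counter in place of kobject_uevent_env(); struct fake_disk, fake_ro_uevent() and fake_set_disk_ro() are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the parts of struct gendisk used here. */
struct fake_disk {
	atomic_bool read_only;	/* stands in for the GD_READ_ONLY state bit */
	int events_sent;	/* number of simulated uevents */
	char last_event[16];	/* payload of the most recent simulated uevent */
};

/* Builds "DISK_RO=1" or "DISK_RO=0" by patching the template in place. */
static void fake_ro_uevent(struct fake_disk *d, bool ro)
{
	char event[] = "DISK_RO=1";

	/* The '=' sits at index 7, so the digit to patch is at index 8. */
	assert(event[7] == '=');
	if (!ro)
		event[8] = '0';
	strcpy(d->last_event, event);
	d->events_sent++;
}

/* Notifies only on a real state change, as set_disk_ro() does. */
static void fake_set_disk_ro(struct fake_disk *d, bool read_only)
{
	/*
	 * atomic_exchange() stores the new value and returns the old one,
	 * which plays the role of test_and_set_bit()/test_and_clear_bit().
	 */
	bool old = atomic_exchange(&d->read_only, read_only);

	if (old == read_only)
		return;
	fake_ro_uevent(d, read_only);
}
```

Calling fake_set_disk_ro() twice with the same value produces a single event. Doing the test and the update in one atomic step means two concurrent callers cannot both observe a change and both send a notification.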
block: add disk sequence number
Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.
Block devices do not have exclusive owners in userspace: any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events cannot thus reliably tell
whether an event relates to the device it just set up or another earlier
instance with the same name.
Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow orderings to be derived from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.
Associating a unique, monotonically increasing sequence number with the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, makes it possible to solve the race conditions with
uevents, and also allows userspace processes to know whether they should
wait for the uevent they need or if it was dropped and thus they should
move on.
Additionally, increment the disk sequence number when the media changes,
i.e. on DISK_EVENT_MEDIA_CHANGE event.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-2-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-07-13 01:05:25 +02:00
void inc_diskseq(struct gendisk *disk)
{
	disk->diskseq = atomic64_inc_return(&diskseq);
}
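The commit message above argues that a global, monotonically increasing sequence number lets userspace decide whether to keep waiting for a uevent or to give up. A minimal user-space sketch of that reasoning follows, assuming C11 atomics in place of the kernel's atomic64_t. fake_diskseq, struct fake_disk_seq, fake_inc_diskseq() and classify_uevent() are hypothetical names, and the three-way classification is one reading of the commit message rather than an existing kernel or udev interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical global counter shared by all disks (starts at 0). */
static _Atomic uint64_t fake_diskseq;

/* Hypothetical stand-in for the diskseq field of struct gendisk. */
struct fake_disk_seq {
	uint64_t diskseq;
};

/* Mirrors inc_diskseq(): take the next value of the global counter. */
static void fake_inc_diskseq(struct fake_disk_seq *d)
{
	/* fetch_add returns the old value; +1 gives "increment and return". */
	d->diskseq = atomic_fetch_add(&fake_diskseq, 1) + 1;
}

enum uevent_match {
	UEVENT_STALE,	/* belongs to an earlier device instance: keep waiting */
	UEVENT_MATCH,	/* belongs to the instance we set up */
	UEVENT_NEWER,	/* a later instance exists: our event was missed */
};

/* Compares the number we were given at setup with the one in a uevent. */
static enum uevent_match classify_uevent(uint64_t mine, uint64_t seen)
{
	assert(mine != 0);	/* this sketch never hands out 0 */
	if (seen < mine)
		return UEVENT_STALE;
	if (seen == mine)
		return UEVENT_MATCH;
	return UEVENT_NEWER;
}
```

A process that set up a device and received number N can discard events carrying a smaller number, act on the event carrying N, and stop waiting once it sees a larger one. A UUID cannot provide this, because UUIDs carry no ordering.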