License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file-by-file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to each file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If the file was under a */uapi/* path, it was tagged "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results were:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                             # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                         270
GPL-2.0+ WITH Linux-syscall-note                        169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
LGPL-2.1+ WITH Linux-syscall-note                        15
GPL-1.0+ WITH Linux-syscall-note                         14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
LGPL-2.0+ WITH Linux-syscall-note                         4
LGPL-2.1 WITH Linux-syscall-note                          3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
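The tagging script itself is not part of this commit message. As a rough illustration of the header/source distinction described above, here is a minimal sketch that assumes the usual kernel convention of a `//` comment for .c files and a `/* */` comment for headers. The helper name `spdx_line_for` is hypothetical and is not taken from the original script.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not the script used for the patches: format the
 * SPDX line for a file, picking the comment style from its extension.
 * Returns the snprintf() result, or -1 for file types this sketch skips.
 */
static int spdx_line_for(const char *path, const char *license,
			 char *out, size_t outlen)
{
	const char *dot = strrchr(path, '.');

	/* .c source files take a C++ style comment */
	if (dot && strcmp(dot, ".c") == 0)
		return snprintf(out, outlen,
				"// SPDX-License-Identifier: %s", license);
	/* headers take a C style comment */
	if (dot && strcmp(dot, ".h") == 0)
		return snprintf(out, outlen,
				"/* SPDX-License-Identifier: %s */", license);
	return -1;
}
```

Called with "block/blk.h" and "GPL-2.0", this yields the same line that appears at the top of this header. Makefiles, scripts and assembly need other comment styles and are deliberately left out.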
2017-11-01 15:07:57 +01:00
/* SPDX-License-Identifier: GPL-2.0 */
2008-01-29 14:51:59 +01:00
#ifndef BLK_INTERNAL_H
#define BLK_INTERNAL_H
block: Inline encryption support for blk-mq
We must have some way of letting a storage device driver know what
encryption context it should use for en/decrypting a request. However,
it's the upper layers (like the filesystem/fscrypt) that know about and
manage encryption contexts. As such, when the upper layer submits a bio
to the block layer, and this bio eventually reaches a device driver with
support for inline encryption, the device driver will need to have been
told the encryption context for that bio.
We want to communicate the encryption context from the upper layer to the
storage device along with the bio, when the bio is submitted to the block
layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
represent an encryption context (note that we can't use the bi_private
field in struct bio to do this because that field does not function to pass
information across layers in the storage stack). We also introduce various
functions to manipulate the bio_crypt_ctx and make the bio/request merging
logic aware of the bio_crypt_ctx.
We also make changes to blk-mq to make it handle bios with encryption
contexts. blk-mq can merge many bios into the same request. These bios need
to have contiguous data unit numbers (the necessary changes to blk-merge
are also made to ensure this) - as such, it suffices to keep the data unit
number of just the first bio, since that's all a storage driver needs to
infer the data unit number to use for each data block in each bio in a
request. blk-mq keeps track of the encryption context to be used for all
the bios in a request with the request's rq_crypt_ctx. When the first bio
is added to an empty request, blk-mq will program the encryption context
of that bio into the request_queue's keyslot manager, and store the
returned keyslot in the request's rq_crypt_ctx. All the functions to
operate on encryption contexts are in blk-crypto.c.
Upper layers only need to call bio_crypt_set_ctx with the encryption key,
algorithm and data_unit_num; they don't have to worry about getting a
keyslot for each encryption context, as blk-mq/blk-crypto handles that.
Blk-crypto also makes it possible for request-based layered devices like
dm-rq to make use of inline encryption hardware by cloning the
rq_crypt_ctx and programming a keyslot in the new request_queue when
necessary.
Note that any user of the block layer can submit bios with an
encryption context, such as filesystems, device-mapper targets, etc.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
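The contiguity requirement on data unit numbers can be illustrated with a small user-space sketch. The struct and function below are hypothetical simplifications, not the kernel's bio_crypt_ctx API: the data unit number is a single 64-bit value, the key is compared by pointer, and the byte count already merged is assumed to be a multiple of the data unit size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a bio's encryption context. */
struct demo_crypt_ctx {
	const void *key;		/* identity of the encryption key */
	uint64_t first_dun;		/* data unit number of the first unit */
	uint32_t data_unit_size;	/* bytes per data unit, nonzero */
};

/*
 * Return true if data described by 'next' may be appended after
 * 'prev_bytes' bytes covered by 'prev'. A NULL context means the data
 * is not encrypted.
 */
static bool demo_crypt_mergeable(const struct demo_crypt_ctx *prev,
				 uint64_t prev_bytes,
				 const struct demo_crypt_ctx *next)
{
	if (!prev && !next)
		return true;	/* both plain: nothing to check */
	if (!prev || !next)
		return false;	/* never mix encrypted and plain data */
	if (prev->key != next->key ||
	    prev->data_unit_size != next->data_unit_size)
		return false;
	/* next must start exactly where prev's data units end */
	return next->first_dun ==
	       prev->first_dun + prev_bytes / prev->data_unit_size;
}
```

Because merged data is contiguous in this sense, keeping only the first data unit number per request is enough to derive the number for every later data unit.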
2020-05-14 00:37:18 +00:00
#include <linux/blk-crypto.h>
2021-03-31 09:30:00 +02:00
#include <linux/memblock.h>	/* for max_pfn/max_low_pfn */
2018-09-25 13:30:08 -07:00
#include <xen/xen.h>
2020-05-14 00:37:18 +00:00
#include "blk-crypto-internal.h"
2011-12-14 00:33:37 +01:00
2021-09-20 14:33:23 +02:00
struct elevator_type;
2014-05-13 15:10:52 -06:00
/* Max future timer expiry for timeouts */
#define BLK_MAX_TIMEOUT		(5 * HZ)
2017-01-31 14:53:20 -08:00
extern struct dentry *blk_debugfs_root;
2014-09-25 23:23:43 +08:00
struct blk_flush_queue {
	unsigned int		flush_pending_idx:1;
	unsigned int		flush_running_idx:1;
block: fix null pointer dereference in blk_mq_rq_timed_out()
We got a NULL pointer dereference BUG in blk_mq_rq_timed_out()
as follows:
[ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 108.827059] PGD 0 P4D 0
[ 108.827313] Oops: 0000 [#1] SMP PTI
[ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[ 108.829503] Workqueue: kblockd blk_mq_timeout_work
[ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[ 108.838191] Call Trace:
[ 108.838406] bt_iter+0x74/0x80
[ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450
[ 108.839074] ? __switch_to_asm+0x34/0x70
[ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40
[ 108.840273] ? syscall_return_via_sysret+0xf/0x7f
[ 108.840732] blk_mq_timeout_work+0x74/0x200
[ 108.841151] process_one_work+0x297/0x680
[ 108.841550] worker_thread+0x29c/0x6f0
[ 108.841926] ? rescuer_thread+0x580/0x580
[ 108.842344] kthread+0x16a/0x1a0
[ 108.842666] ? kthread_flush_work+0x170/0x170
[ 108.843100] ret_from_fork+0x35/0x40
The bug is caused by a race between the timeout handler and completion of
the flush request.
When the timeout handler blk_mq_rq_timed_out() tries to read
'req->q->mq_ops', 'req' may already have completed and been reinitialized
by the next flush request, which calls blk_rq_init() to zero 'req'.
After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
the lifetime of normal requests is protected by a refcount. A request
can only be freed once 'rq->ref' drops to zero, so these requests
cannot be reused before the timeout handler finishes.
However, the flush request defines .end_io, and rq->end_io() is still
called even if 'rq->ref' has not dropped to zero. After that, 'flush_rq'
can be reused by the next flush request, resulting in the NULL
pointer dereference BUG.
We fix this problem by covering the flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() returns and waits for the
last holder to call it again. To record the request status, we add a new
entry 'rq_status', which is used in flush_end_io().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
-------
v2:
- move rq_status from struct request to struct blk_flush_queue
v3:
- remove unnecessary '{}' pair.
v4:
- let the spinlock protect 'fq->rq_status'
v5:
- move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe <axboe@kernel.dk>
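The refcount rule described in this commit can be sketched in single-threaded user-space C. This is only an illustration under stated assumptions: the struct and function names are hypothetical, a plain int stands in for the kernel's atomic refcount, and the spinlock that protects 'fq->rq_status' in the real code is omitted.

```c
/* Hypothetical stand-in for the parts of the flush queue used here. */
struct demo_flush_queue {
	int ref;		/* holders of the flush request */
	int rq_status;		/* status saved for a deferred completion */
	int completions;	/* times the request was really completed */
	int final_status;	/* status reported at real completion */
};

/*
 * Sketch of the end_io rule: only the last holder completes the flush
 * request. Earlier callers record the status and return.
 */
static void demo_flush_end_io(struct demo_flush_queue *fq, int error)
{
	if (--fq->ref != 0) {
		/* someone (e.g. the timeout path) still holds a reference */
		fq->rq_status = error;
		return;
	}

	/* last holder: prefer a status recorded by an earlier call */
	if (fq->rq_status != 0)
		error = fq->rq_status;

	fq->final_status = error;
	fq->completions++;
}
```

With two holders, the first call only records the error, and the second call performs the single real completion while reporting the recorded status. Only after that point may the request be reused.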
2019-09-27 16:19:55 +08:00
	blk_status_t		rq_status;
2014-09-25 23:23:43 +08:00
	unsigned long		flush_pending_since;
	struct list_head	flush_queue[2];
	struct list_head	flush_data_in_flight;
	struct request		*flush_rq;
2015-08-09 03:41:51 -04:00
2014-09-25 23:23:43 +08:00
	spinlock_t		mq_flush_lock;
};
2008-01-29 14:51:59 +01:00
extern struct kmem_cache *blk_requestq_cachep;
2021-12-03 21:15:32 +08:00
extern struct kmem_cache *blk_requestq_srcu_cachep;
2008-01-29 14:51:59 +01:00
extern struct kobj_type blk_queue_ktype;
2011-12-14 00:33:37 +01:00
extern struct ida blk_queue_ida;
2008-01-29 14:51:59 +01:00
2011-12-14 00:33:38 +01:00
static inline void __blk_get_queue(struct request_queue *q)
{
	kobject_get(&q->kobj);
}
2021-08-18 09:09:25 +08:00
bool is_flush_rq(struct request *req);
2019-09-27 16:19:55 +08:00
2020-03-09 22:41:37 +01:00
struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
					      gfp_t flags);
2014-09-25 23:23:47 +08:00
void blk_free_flush_queue(struct blk_flush_queue *q);
2014-09-25 23:23:40 +08:00
2015-10-21 13:20:12 -04:00
void blk_freeze_queue(struct request_queue *q);
2021-09-29 09:12:41 +02:00
void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
2021-09-29 09:12:40 +02:00
void blk_queue_start_drain(struct request_queue *q);
2021-11-04 12:45:51 -06:00
int __bio_queue_enter(struct request_queue *q, struct bio *bio);
2022-02-16 12:45:10 +08:00
void submit_bio_noacct_nocheck(struct bio *bio);
2021-11-04 12:45:51 -06:00
static inline bool blk_try_enter_queue(struct request_queue *q, bool pm)
{
	rcu_read_lock();
	if (!percpu_ref_tryget_live_rcu(&q->q_usage_counter))
		goto fail;

	/*
	 * The code that increments the pm_only counter must ensure that the
	 * counter is globally visible before the queue is unfrozen.
	 */
	if (blk_queue_pm_only(q) &&
	    (!pm || queue_rpm_status(q) == RPM_SUSPENDED))
		goto fail_put;

	rcu_read_unlock();
	return true;

fail_put:
	blk_queue_exit(q);
fail:
	rcu_read_unlock();
	return false;
}

static inline int bio_queue_enter(struct bio *bio)
{
	struct request_queue *q = bdev_get_queue(bio->bi_bdev);

	if (blk_try_enter_queue(q, false))
		return 0;
	return __bio_queue_enter(q, bio);
}
2015-10-21 13:20:12 -04:00
2021-02-02 18:19:19 +01:00
#define BIO_INLINE_VECS 4
2021-02-02 18:19:29 +01:00
struct bio_vec *bvec_alloc(mempool_t *pool, unsigned short *nr_vecs,
		gfp_t gfp_mask);
void bvec_free(mempool_t *pool, struct bio_vec *bv, unsigned short nr_vecs);
2021-01-11 11:05:56 +08:00
2018-09-24 09:43:52 +02:00
static inline bool biovec_phys_mergeable(struct request_queue *q,
		struct bio_vec *vec1, struct bio_vec *vec2)
2018-09-24 09:43:50 +02:00
{
2018-09-24 09:43:52 +02:00
	unsigned long mask = queue_segment_boundary(q);
2018-09-24 09:43:53 +02:00
	phys_addr_t addr1 = page_to_phys(vec1->bv_page) + vec1->bv_offset;
	phys_addr_t addr2 = page_to_phys(vec2->bv_page) + vec2->bv_offset;
2018-09-24 09:43:52 +02:00
	if (addr1 + vec1->bv_len != addr2)
2018-09-24 09:43:50 +02:00
		return false;
2019-03-29 15:07:54 +08:00
	if (xen_domain() && !xen_biovec_phys_mergeable(vec1, vec2->bv_page))
2018-09-24 09:43:50 +02:00
		return false;
2018-09-24 09:43:52 +02:00
	if ((addr1 | mask) != ((addr2 + vec2->bv_len - 1) | mask))
		return false;
2018-09-24 09:43:50 +02:00
	return true;
}
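The boundary arithmetic in biovec_phys_mergeable() can be tried out in user space. The sketch below is a hypothetical re-statement with plain integers in place of bio_vec and request_queue; it omits the Xen check and assumes the mask has the form (boundary size - 1), such as 0xFFFF for a 64 KiB segment boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space version of the merge test: two physical
 * ranges can share a segment if the second starts where the first ends
 * and the combined range does not cross a segment boundary.
 */
static bool demo_phys_mergeable(uint64_t addr1, uint32_t len1,
				uint64_t addr2, uint32_t len2,
				uint64_t boundary_mask)
{
	/* the ranges must be physically contiguous */
	if (addr1 + len1 != addr2)
		return false;
	/*
	 * OR-ing with the mask maps an address to the last byte of its
	 * boundary window; first and last byte must share a window.
	 */
	if ((addr1 | boundary_mask) != ((addr2 + len2 - 1) | boundary_mask))
		return false;
	return true;
}
```

For example, with a mask of 0xFFFF, the ranges 0x1000-0x1FFF and 0x2000-0x2FFF merge, while 0xF000-0xFFFF and 0x10000-0x10FFF do not, because the merged segment would cross the 64 KiB boundary.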
2018-09-24 09:43:49 +02:00
static inline bool __bvec_gap_to_prev(struct request_queue *q,
		struct bio_vec *bprv, unsigned int offset)
{
2018-11-07 14:58:14 +01:00
	return (offset & queue_virt_boundary(q)) ||
2018-09-24 09:43:49 +02:00
		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
}

/*
 * Check if adding a bio_vec after bprv with offset would create a gap in
 * the SG list. Most drivers don't care about this, but some do.
 */
static inline bool bvec_gap_to_prev(struct request_queue *q,
		struct bio_vec *bprv, unsigned int offset)
{
	if (!queue_virt_boundary(q))
		return false;
	return __bvec_gap_to_prev(q, bprv, offset);
}
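The gap test above comes down to two mask checks, which the following user-space sketch reproduces with plain integers. The function name and parameters are hypothetical. The mask is assumed to be (boundary size - 1), for example 0xFFF for a 4 KiB virtual boundary, and zero means the queue has no such restriction.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space version of the gap check: a gap exists if
 * the new vector does not start on the boundary, or if the previous
 * vector does not end on it.
 */
static bool demo_gap_to_prev(uint32_t prev_offset, uint32_t prev_len,
			     uint32_t next_offset, uint64_t virt_boundary_mask)
{
	if (!virt_boundary_mask)
		return false;	/* no virtual boundary configured */
	return (next_offset & virt_boundary_mask) ||
	       ((prev_offset + prev_len) & virt_boundary_mask);
}
```

With a 4 KiB boundary, a full page followed by a vector at offset 0 leaves no gap, whereas a 512-byte vector followed by anything does, since the previous vector ends mid-page.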
2021-09-20 14:33:26 +02:00
static inline bool rq_mergeable(struct request *rq)
{
	if (blk_rq_is_passthrough(rq))
		return false;

	if (req_op(rq) == REQ_OP_FLUSH)
		return false;

	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
		return false;

	if (req_op(rq) == REQ_OP_ZONE_APPEND)
		return false;

	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
		return false;
	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
		return false;

	return true;
}

/*
 * There are two different ways to handle DISCARD merges:
 *  1) If max_discard_segments > 1, the driver treats every bio as a range and
 *     sends the bios to the controller together. The ranges don't need to be
 *     contiguous.
 *  2) Otherwise, the request will be a normal read/write request. The ranges
 *     need to be contiguous.
 */
static inline bool blk_discard_mergable(struct request *req)
{
	if (req_op(req) == REQ_OP_DISCARD &&
	    queue_max_discard_segments(req->q) > 1)
		return true;
	return false;
}
2015-10-21 13:20:23 -04:00
#ifdef CONFIG_BLK_DEV_INTEGRITY
void blk_flush_integrity(void);
2017-07-03 16:58:43 -06:00
bool __bio_integrity_endio(struct bio *);
2019-12-05 10:09:01 +08:00
void bio_integrity_free(struct bio *bio);
2017-07-03 16:58:43 -06:00
static inline bool bio_integrity_endio(struct bio *bio)
{
	if (bio_integrity(bio))
		return __bio_integrity_endio(bio);
	return true;
}
2018-09-24 09:43:47 +02:00
2020-10-06 09:07:17 +02:00
bool blk_integrity_merge_rq(struct request_queue *, struct request *,
		struct request *);
2020-10-06 09:07:18 +02:00
bool blk_integrity_merge_bio(struct request_queue *, struct request *,
		struct bio *);
2020-10-06 09:07:17 +02:00
2018-09-24 09:43:47 +02:00
static inline bool integrity_req_gap_back_merge(struct request *req,
		struct bio *next)
{
	struct bio_integrity_payload *bip = bio_integrity(req->bio);
	struct bio_integrity_payload *bip_next = bio_integrity(next);

	return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
				bip_next->bip_vec[0].bv_offset);
}

static inline bool integrity_req_gap_front_merge(struct request *req,
		struct bio *bio)
{
	struct bio_integrity_payload *bip = bio_integrity(bio);
	struct bio_integrity_payload *bip_next = bio_integrity(req->bio);

	return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
				bip_next->bip_vec[0].bv_offset);
}
2020-03-25 16:48:41 +01:00
2021-08-18 16:45:38 +02:00
int blk_integrity_add(struct gendisk *disk);
2020-03-25 16:48:41 +01:00
void blk_integrity_del(struct gendisk *);
#else /* CONFIG_BLK_DEV_INTEGRITY */
2020-10-06 09:07:17 +02:00
static inline bool blk_integrity_merge_rq(struct request_queue *rq,
		struct request *r1, struct request *r2)
{
	return true;
}
2020-10-06 09:07:18 +02:00
static inline bool blk_integrity_merge_bio(struct request_queue *rq,
		struct request *r, struct bio *b)
{
	return true;
}
2018-09-24 09:43:47 +02:00
static inline bool integrity_req_gap_back_merge(struct request *req,
		struct bio *next)
{
	return false;
}
static inline bool integrity_req_gap_front_merge(struct request *req,
		struct bio *bio)
{
	return false;
}
2015-10-21 13:20:23 -04:00
static inline void blk_flush_integrity(void)
{
}
2017-07-03 16:58:43 -06:00
static inline bool bio_integrity_endio(struct bio *bio)
{
	return true;
}
2019-12-05 10:09:01 +08:00
static inline void bio_integrity_free(struct bio *bio)
{
}
2021-08-18 16:45:38 +02:00
static inline int blk_integrity_add(struct gendisk *disk)
2020-03-25 16:48:41 +01:00
{
2021-08-18 16:45:38 +02:00
	return 0;
2020-03-25 16:48:41 +01:00
}
static inline void blk_integrity_del(struct gendisk *disk)
{
}
2018-09-24 09:43:47 +02:00
#endif /* CONFIG_BLK_DEV_INTEGRITY */
2008-01-29 14:51:59 +01:00
2014-05-13 15:10:52 -06:00
unsigned long blk_rq_timeout(unsigned long timeout);
2014-04-24 08:51:47 -06:00
void blk_add_timer(struct request *req);
2021-11-17 07:14:03 +01:00
const char *blk_status_to_str(blk_status_t status);
blk-mq: new multi-queue block IO queueing mechanism
Linux currently has two models for block devices:
- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.
- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.
With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.
The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.
This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
blk-mq provides various helper functions, which include:
- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.
- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.
- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.
- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.
- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).
Contributions in this patch from the following people:
Shaohua Li <shli@fusionio.com>
Alexander Gordeev <agordeev@redhat.com>
Christoph Hellwig <hch@infradead.org>
Mike Christie <michaelc@cs.wisc.edu>
Matias Bjorling <m@bjorling.me>
Jeff Moyer <jmoyer@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
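The funnelling of per-cpu queues into hardware submission queues can be pictured with a deliberately naive mapping. The sketch below assumes a simple modulo assignment and a hypothetical function name. It is not how blk-mq actually builds its CPU to hardware queue map; it only shows how 1:1 and N:M layouts fall out of the queue counts.

```c
#include <assert.h>

/*
 * Hypothetical mapping from a CPU's software queue to a hardware
 * submission queue. With as many hardware queues as CPUs this is 1:1;
 * with fewer, several CPUs share one hardware queue.
 */
static unsigned int demo_map_cpu_to_hwq(unsigned int cpu,
					unsigned int nr_hw_queues)
{
	assert(nr_hw_queues > 0);
	return cpu % nr_hw_queues;
}
```

With 8 CPUs and 8 hardware queues, CPU 5 submits to queue 5. With only 2 hardware queues, CPUs 1, 3, 5 and 7 all funnel into queue 1.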
2013-10-24 09:20:05 +01:00
bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
2021-11-23 17:04:41 +01:00
		unsigned int nr_segs);
2020-08-28 10:52:55 +08:00
bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
			struct bio *bio, unsigned int nr_segs);
2013-10-24 09:20:05 +01:00
2021-10-06 12:01:07 -06:00
/*
* Plug flush limits
*/
# define BLK_MAX_REQUEST_COUNT 32
# define BLK_PLUG_FLUSH_SIZE (128 * 1024)
2009-04-23 11:05:18 +09:00
/*
* Internal elevator interface
*/
2016-10-20 15:12:13 +02:00
# define ELV_ON_HASH(rq) ((rq)->rq_flags & RQF_HASHED)
2009-04-23 11:05:18 +09:00
2021-11-18 23:30:41 +08:00
void blk_insert_flush ( struct request * rq ) ;
2010-09-03 11:56:16 +02:00
2018-08-21 15:15:03 +08:00
int elevator_switch_mq ( struct request_queue * q ,
struct elevator_type * new_e ) ;
2021-11-23 19:53:07 +01:00
void elevator_exit ( struct request_queue * q ) ;
block: split .sysfs_lock into two locks
The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store
path. Meantime, inside block's .show/.store callback, q->sysfs_lock is
required.
However, when mq & iosched kobjects are removed via
blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
too. This causes an AB-BA deadlock because the kernfs built-in lock of
'kn->count' is required inside kobject_del() too; see the lockdep warning[1].
On the other hand, it isn't necessary to acquire q->sysfs_lock for
both blk_mq_unregister_dev() & elv_unregister_queue() because
clearing the REGISTERED flag prevents stores to 'queue/scheduler'
from happening. Also, sysfs write (store) is exclusive, so it is not
necessary to hold the lock for elv_unregister_queue() when it is
called in the elevator switching path.
So split .sysfs_lock into two: one is still named as .sysfs_lock for
covering sync .store, the other one is named as .sysfs_dir_lock
for covering kobjects and related status change.
sysfs itself can handle the race between add/remove kobjects and
showing/storing attributes under kobjects. For switching scheduler
via storing to 'queue/scheduler', we use the queue flag
QUEUE_FLAG_REGISTERED with .sysfs_lock to avoid the race, so
we can avoid holding .sysfs_lock while removing/adding kobjects.
[1] lockdep warning
======================================================
WARNING: possible circular locking dependency detected
5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
------------------------------------------------------
rmmod/777 is trying to acquire lock:
00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
but task is already holding lock:
00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&q->sysfs_lock){+.+.}:
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__mutex_lock+0x14a/0xa9b
blk_mq_hw_sysfs_show+0x63/0xb6
sysfs_kf_seq_show+0x11f/0x196
seq_read+0x2cd/0x5f2
vfs_read+0xc7/0x18c
ksys_read+0xc4/0x13e
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (kn->count#202){++++}:
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
__kernfs_remove+0x237/0x40b
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->sysfs_lock);
lock(kn->count#202);
lock(&q->sysfs_lock);
lock(kn->count#202);
*** DEADLOCK ***
2 locks held by rmmod/777:
#0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
#1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
stack backtrace:
CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
Call Trace:
dump_stack+0x9a/0xe6
check_noncircular+0x207/0x251
? print_circular_bug+0x32a/0x32a
? find_usage_backwards+0x84/0xb0
check_prev_add+0x5d2/0xc45
validate_chain+0xed3/0xf94
? check_prev_add+0xc45/0xc45
? mark_lock+0x11b/0x804
? check_usage_forwards+0x1ca/0x1ca
__lock_acquire+0x95f/0xa2f
lock_acquire+0x1b4/0x1e8
? kernfs_remove_by_name_ns+0x59/0x72
__kernfs_remove+0x237/0x40b
? kernfs_remove_by_name_ns+0x59/0x72
? kernfs_next_descendant_post+0x7d/0x7d
? strlen+0x10/0x23
? strcmp+0x22/0x44
kernfs_remove_by_name_ns+0x59/0x72
remove_files+0x61/0x96
sysfs_remove_group+0x81/0xa4
sysfs_remove_groups+0x3b/0x44
kobject_del+0x44/0x94
blk_mq_unregister_dev+0x83/0xdd
blk_unregister_queue+0xa0/0x10b
del_gendisk+0x259/0x3fa
? disk_events_poll_msecs_store+0x12b/0x12b
? check_flags+0x1ea/0x204
? mark_held_locks+0x1f/0x7a
null_del_dev+0x8b/0x1c3 [null_blk]
null_exit+0x5c/0x95 [null_blk]
__se_sys_delete_module+0x204/0x337
? free_module+0x39f/0x39f
? blkcg_maybe_throttle_current+0x8a/0x718
? rwlock_bug+0x62/0x62
? __blkcg_punt_bio_submit+0xd0/0xd0
? trace_hardirqs_on_thunk+0x1a/0x20
? mark_held_locks+0x1f/0x7a
? do_syscall_64+0x4c/0x295
do_syscall_64+0xa7/0x295
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb696cdbe6b
Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 19:01:48 +08:00
int elv_register_queue ( struct request_queue * q , bool uevent ) ;
2018-01-17 11:48:08 -08:00
void elv_unregister_queue ( struct request_queue * q ) ;
2020-03-24 08:25:13 +01:00
ssize_t part_size_show ( struct device * dev , struct device_attribute * attr ,
char * buf ) ;
ssize_t part_stat_show ( struct device * dev , struct device_attribute * attr ,
char * buf ) ;
ssize_t part_inflight_show ( struct device * dev , struct device_attribute * attr ,
char * buf ) ;
ssize_t part_fail_show ( struct device * dev , struct device_attribute * attr ,
char * buf ) ;
ssize_t part_fail_store ( struct device * dev , struct device_attribute * attr ,
const char * buf , size_t count ) ;
2008-09-14 05:56:33 -07:00
ssize_t part_timeout_show ( struct device * , struct device_attribute * , char * ) ;
ssize_t part_timeout_store ( struct device * , struct device_attribute * ,
const char * , size_t ) ;
2021-10-13 12:43:41 -06:00
static inline bool blk_may_split(struct request_queue *q, struct bio *bio)
{
	switch (bio_op(bio)) {
	case REQ_OP_DISCARD:
	case REQ_OP_SECURE_ERASE:
	case REQ_OP_WRITE_ZEROES:
		return true; /* non-trivial splitting decisions */
	default:
		break;
	}

	/*
	 * All drivers must accept single-segments bios that are <= PAGE_SIZE.
	 * This is a quick and dirty check that relies on the fact that
	 * bi_io_vec[0] is always valid if a bio has data.  The check might
	 * lead to occasional false negatives when bios are cloned, but compared
	 * to the performance impact of cloned bios themselves the loop below
	 * doesn't matter anyway.
	 */
	return q->limits.chunk_sectors || bio->bi_vcnt != 1 ||
		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
}
void __blk_queue_split(struct request_queue *q, struct bio **bio,
		unsigned int *nr_segs);
2019-06-06 12:29:01 +02:00
int ll_back_merge_fn ( struct request * req , struct bio * bio ,
unsigned int nr_segs ) ;
2021-06-23 11:36:34 +02:00
bool blk_attempt_req_merge ( struct request_queue * q , struct request * rq ,
2011-03-21 10:14:27 +01:00
struct request * next ) ;
2019-06-06 12:29:02 +02:00
unsigned int blk_recalc_rq_segments ( struct request * rq ) ;
2009-07-03 17:48:17 +09:00
void blk_rq_set_mixed_merge ( struct request * rq ) ;
2012-02-08 09:19:38 +01:00
bool blk_rq_merge_ok ( struct request * rq , struct bio * bio ) ;
2017-02-08 14:46:48 +01:00
enum elv_merge blk_try_merge ( struct request * rq , struct bio * bio ) ;
2008-01-29 14:04:06 +01:00
2008-03-04 11:23:45 +01:00
int blk_dev_init ( void ) ;
2009-04-24 08:10:11 +02:00
/*
* Contribute to IO statistics IFF:
*
* a) it's attached to a gendisk, and
2019-10-10 17:36:26 -06:00
* b ) the queue had IO stats enabled when this request was started
2009-04-24 08:10:11 +02:00
*/
2018-08-16 22:51:40 +08:00
static inline bool blk_do_io_stat ( struct request * rq )
2009-02-02 08:42:32 +01:00
{
2022-03-08 06:51:47 +01:00
	return (rq->rq_flags & RQF_IO_STAT) && !blk_rq_is_passthrough(rq);
2021-10-09 13:25:41 +01:00
}
2021-11-17 07:14:01 +01:00
void update_io_ticks ( struct block_device * part , unsigned long now , bool end ) ;
2009-02-02 08:42:32 +01:00
2017-02-08 14:46:47 +01:00
static inline void req_set_nomerge(struct request_queue *q, struct request *req)
{
	req->cmd_flags |= REQ_NOMERGE;
	if (req == q->last_merge)
		q->last_merge = NULL;
}
2018-10-29 20:57:17 +08:00
/*
 * The max size one bio can handle is UINT_MAX because bvec_iter.bi_size
 * is defined as 'unsigned int'; meanwhile it has to be aligned to the logical
 * block size, which is the minimum unit accepted by hardware.
 */
static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
{
	return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
}
block: improve discard bio alignment in __blkdev_issue_discard()
This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help the underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragmentation.
The current discard bio split algorithm in __blkdev_issue_discard() may leave
non-discarded fragments on the device even when the discard bio LBA and size
are both aligned to the device's discard granularity size.
Here are example steps to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 51GB virtual disk shows up as
/dev/sdb, fill data into the first 50GB by,
# dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
# blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
# sg_get_lba_status /dev/sdb -m 1048 --lba=0
descriptor LBA: 0x0000000000000000 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000000000800 blocks: 16773120 deallocated
descriptor LBA: 0x0000000000fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000017ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000001fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000027ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000002fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000037ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000003fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000047ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000004fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000057ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000005fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000006000000 blocks: 6291456 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated
Although the discard bio starts at LBA 0 and has a size of 50<<30 bytes,
both of which are perfectly aligned to the discard granularity, the above
list shows that many 1MB (2048 sectors) internal fragments exist unexpectedly.
The problem is in __blkdev_issue_discard(): an improper algorithm causes
an improper bio size which is not aligned.
25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
26 sector_t nr_sects, gfp_t gfp_mask, int flags,
27 struct bio **biop)
28 {
29 struct request_queue *q = bdev_get_queue(bdev);
[snipped]
56
57 while (nr_sects) {
58 sector_t req_sects = min_t(sector_t, nr_sects,
59 bio_allowed_max_sectors(q));
60
61 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
62
63 bio = blk_next_bio(bio, 0, gfp_mask);
64 bio->bi_iter.bi_sector = sector;
65 bio_set_dev(bio, bdev);
66 bio_set_op_attrs(bio, op, 0);
67
68 bio->bi_iter.bi_size = req_sects << 9;
69 sector += req_sects;
70 nr_sects -= req_sects;
[snipped]
79 }
80
81 *biop = bio;
82 return 0;
83 }
84 EXPORT_SYMBOL(__blkdev_issue_discard);
At line 58-59, to discard a 50GB range, req_sects is set as return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case, the discard granularity is 2048 sectors, although the start LBA
and discard length are aligned to discard granularity, req_sects never
has chance to be aligned to discard granularity. This is why there are
some still-mapped 2048 sectors fragment in every 4 or 8 GB range.
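The arithmetic above can be checked with a small userspace sketch. It assumes a 512-byte logical block size (which is what yields 8388607) and re-implements the power-of-two round-down locally; the sketch_* names are hypothetical and exist only for this illustration.

```c
#include <limits.h>

#define SKETCH_SECTOR_SHIFT 9

/* Round x down to a multiple of align; align must be a power of two. */
unsigned int sketch_round_down(unsigned int x, unsigned int align)
{
	return x & ~(align - 1);
}

/*
 * Same computation as bio_allowed_max_sectors(), but taking the
 * logical block size (in bytes) directly instead of a request queue.
 */
unsigned int sketch_allowed_max_sectors(unsigned int logical_block_size)
{
	return sketch_round_down(UINT_MAX, logical_block_size) >>
		SKETCH_SECTOR_SHIFT;
}
```

With a 512-byte logical block size the result is 8388607 sectors, and 8388607 % 2048 is 2047, so every bio cut at this limit ends one sector short of a granularity boundary.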
If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UINT_MAX, then all subsequent split bios inside the device
driver are (almost) aligned to the discard_granularity of the device
queue. The 2048-sector still-mapped fragments will disappear.
This patch introduces bio_aligned_discard_max_sectors() to return
the value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.
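For the example device, the aligned limit can be worked out in userspace as follows. This is a sketch that assumes a discard granularity of 1MB (2048 sectors, 1048576 bytes) and a power-of-two granularity; the function name is hypothetical.

```c
#include <limits.h>

/*
 * Same computation as bio_aligned_discard_max_sectors(), but taking
 * the discard granularity in bytes directly. The granularity is
 * assumed to be a non-zero power of two.
 */
unsigned int sketch_aligned_discard_max_sectors(unsigned int granularity)
{
	return (UINT_MAX & ~(granularity - 1)) >> 9;
}
```

For a 1048576-byte granularity this gives 8386560 sectors, which is a multiple of 2048 and matches the 8386560-block deallocated extents in the sg_get_lba_status output above.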
But we still need to handle the situation when discard start LBA is not
aligned to q->limits.discard_granularity, otherwise even the length is
aligned, current code may still leave 2048 fragment around every 4GB
range. Therefore, to calculate req_sects, firstly the start LBA of
discard range is checked (including partition offset), if it is not
aligned to discard granularity, the first split location should make
sure following bio has bi_sector aligned to discard granularity. Then
there won't be still-mapped fragment in the middle of the discard range.
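The unaligned-start handling described above might be sketched like this. It is an illustration under stated assumptions, not the code of this patch: all names are hypothetical, start is the absolute start sector with the partition offset already added, and gran (granularity in sectors) must be non-zero.

```c
#include <stdint.h>

/*
 * Pick the length, in sectors, of the next discard bio.
 * If start is not aligned to gran, the bio only runs up to the next
 * aligned sector so that the following bio starts aligned. Otherwise
 * the granularity-aligned per-bio limit max_aligned is used.
 */
uint64_t sketch_next_discard_len(uint64_t start, uint64_t nr_sects,
				 uint64_t gran, uint64_t max_aligned)
{
	uint64_t aligned_start = (start + gran - 1) / gran * gran;
	uint64_t len;

	if (aligned_start == start)
		len = max_aligned;
	else
		len = aligned_start - start;

	/* Never issue more than what is left of the discard range. */
	return len < nr_sects ? len : nr_sects;
}
```

For a range starting at sector 100 with a 2048-sector granularity, the first bio covers 1948 sectors and every later bio starts on a granularity boundary.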
The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after discard with the same
command line mentioned previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000 blocks: 106954752 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated
We can see there is no 2048 sectors segment anymore, everything is clean.
Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 10:42:30 +08:00
/*
 * The max bio size which is aligned to q->limits.discard_granularity. This
 * is a hint to split large discard bio in generic block layer, then if device
 * driver needs to split the discard bio into smaller ones, their bi_size can
 * be very probably and easily aligned to discard_granularity of the device's
 * queue.
 */
static inline unsigned int bio_aligned_discard_max_sectors(
					struct request_queue *q)
{
	return round_down(UINT_MAX, q->limits.discard_granularity) >>
			SECTOR_SHIFT;
}
2011-12-14 00:33:40 +01:00
/*
* Internal io_context interface
*/
2021-11-26 12:58:10 +01:00
struct io_cq * ioc_find_get_icq ( struct request_queue * q ) ;
2021-11-26 12:58:17 +01:00
struct io_cq * ioc_lookup_icq ( struct request_queue * q ) ;
2021-12-09 07:31:31 +01:00
# ifdef CONFIG_BLK_ICQ
2011-12-14 00:33:42 +01:00
void ioc_clear_queue ( struct request_queue * q ) ;
2021-12-09 07:31:31 +01:00
# else
static inline void ioc_clear_queue ( struct request_queue * q )
{
}
# endif /* CONFIG_BLK_ICQ */
2011-12-14 00:33:40 +01:00
2017-03-27 10:51:37 -07:00
# ifdef CONFIG_BLK_DEV_THROTTLING_LOW
extern ssize_t blk_throtl_sample_time_show ( struct request_queue * q , char * page ) ;
extern ssize_t blk_throtl_sample_time_store ( struct request_queue * q ,
const char * page , size_t count ) ;
blk-throttle: add a simple idle detection
A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to low limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to lower state.
We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade soon as the
cgroup is below its low limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
when queue gets upgraded to lower state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determining whether we should downgrade is
to detect if the cgroup is truly idle.
Unfortunately it's very hard to determine if a cgroup is real idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, eg,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.
We treat a cgroup idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.
The patch (and the later patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, which should not be a big deal.
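To get a feel for the precision loss, the shift can be compared with an exact division in userspace. This is a minimal sketch; both helper names are hypothetical.

```c
#include <stdint.h>

/* Approximate ns -> us conversion: divides by 1024 instead of 1000. */
uint64_t sketch_ns_to_us_shift(uint64_t ns)
{
	return ns >> 10;
}

/* Exact ns -> us conversion, for comparison. */
uint64_t sketch_ns_to_us_exact(uint64_t ns)
{
	return ns / 1000;
}
```

For 1000000 ns (1ms) the shift yields 976 where the exact value is 1000, an underestimate of roughly 2.4%.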
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-27 10:51:41 -07:00
extern void blk_throtl_bio_endio ( struct bio * bio ) ;
blk-throttle: add a mechanism to estimate IO latency
User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.
To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.
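The bucketing and the threshold computation could look roughly like the sketch below. The constants and names are assumptions for illustration (9 buckets for 4k through 1M, with 4k being 8 sectors), not necessarily what the patch uses.

```c
#define SKETCH_BUCKET_MIN	3	/* order of 8 sectors, i.e. 4k */
#define SKETCH_NR_BUCKETS	9	/* 4k, 8k, 16k, ..., 1M */

/* ceil(log2(n)) for n >= 1; returns 0 for n == 0 as well. */
int sketch_order_base_2(unsigned int n)
{
	int order = 0;

	while ((1ULL << order) < n)
		order++;
	return order;
}

/*
 * Map an IO size in sectors to a latency bucket. Sizes are rounded up
 * to the next power of two; anything above 1M is accounted as 1M and
 * anything below 4k as 4k.
 */
int sketch_bucket_index(unsigned int sectors)
{
	int idx = sketch_order_base_2(sectors) - SKETCH_BUCKET_MIN;

	if (idx < 0)
		idx = 0;
	if (idx > SKETCH_NR_BUCKETS - 1)
		idx = SKETCH_NR_BUCKETS - 1;
	return idx;
}

/* Latency threshold is the sampled base latency plus the user target. */
unsigned long sketch_latency_threshold_us(unsigned long base_us,
					  unsigned long target_us)
{
	return base_us + target_us;
}
```

A 9-sector IO is rounded up to 16 sectors and lands in the 8k bucket, and the 4k example from the text gives sketch_latency_threshold_us(80, 60) == 140.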
But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.
Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-27 15:19:42 -07:00
extern void blk_throtl_stat_add ( struct request * rq , u64 time ) ;
2017-03-27 10:51:41 -07:00
# else
static inline void blk_throtl_bio_endio ( struct bio * bio ) { }
2017-03-27 15:19:42 -07:00
static inline void blk_throtl_stat_add ( struct request * rq , u64 time ) { }
2017-03-27 10:51:37 -07:00
# endif
2011-10-19 14:31:18 +02:00
2021-03-31 09:30:00 +02:00
void __blk_queue_bounce ( struct request_queue * q , struct bio * * bio ) ;
static inline bool blk_queue_may_bounce(struct request_queue *q)
{
	return IS_ENABLED(CONFIG_BOUNCE) &&
		q->limits.bounce == BLK_BOUNCE_HIGH &&
		max_low_pfn >= max_pfn;
}
2017-06-19 09:26:21 +02:00
static inline void blk_queue_bounce ( struct request_queue * q , struct bio * * bio )
{
2021-03-31 09:30:00 +02:00
	if (unlikely(blk_queue_may_bounce(q) && bio_has_data(*bio)))
		__blk_queue_bounce(q, bio);
2017-06-19 09:26:21 +02:00
}
2018-07-03 11:15:01 -04:00
# ifdef CONFIG_BLK_CGROUP_IOLATENCY
extern int blk_iolatency_init ( struct request_queue * q ) ;
# else
static inline int blk_iolatency_init ( struct request_queue * q ) { return 0 ; }
# endif
2018-10-12 19:08:50 +09:00
# ifdef CONFIG_BLK_DEV_ZONED
void blk_queue_free_zone_bitmaps ( struct request_queue * q ) ;
2021-01-28 13:47:32 +09:00
void blk_queue_clear_zone_settings ( struct request_queue * q ) ;
2018-10-12 19:08:50 +09:00
# else
static inline void blk_queue_free_zone_bitmaps ( struct request_queue * q ) { }
2021-01-28 13:47:32 +09:00
static inline void blk_queue_clear_zone_settings ( struct request_queue * q ) { }
2018-10-12 19:08:50 +09:00
# endif
2021-05-21 07:50:51 +02:00
int blk_alloc_ext_minor ( void ) ;
void blk_free_ext_minor ( unsigned int minor ) ;
2020-03-25 16:48:41 +01:00
# define ADDPART_FLAG_NONE 0
# define ADDPART_FLAG_RAID 1
# define ADDPART_FLAG_WHOLEDISK 2
2021-08-10 17:45:10 +02:00
int bdev_add_partition ( struct gendisk * disk , int partno , sector_t start ,
sector_t length ) ;
2021-08-10 17:45:11 +02:00
int bdev_del_partition ( struct gendisk * disk , int partno ) ;
2021-08-10 17:45:12 +02:00
int bdev_resize_partition ( struct gendisk * disk , int partno , sector_t start ,
sector_t length ) ;
2022-01-24 10:39:12 +01:00
void blk_drop_partitions ( struct gendisk * disk ) ;
2020-03-25 16:48:41 +01:00
2020-05-12 17:55:46 +09:00
int bio_add_hw_page ( struct request_queue * q , struct bio * bio ,
2020-03-27 18:48:37 +01:00
struct page * page , unsigned int len , unsigned int offset ,
2020-05-12 17:55:46 +09:00
unsigned int max_sectors , bool * same_page ) ;
2020-03-27 18:48:37 +01:00
2021-12-03 21:15:32 +08:00
static inline struct kmem_cache *blk_get_queue_kmem_cache(bool srcu)
{
	if (srcu)
		return blk_requestq_srcu_cachep;
	return blk_requestq_cachep;
}
struct request_queue * blk_alloc_queue ( int node_id , bool alloc_srcu ) ;
2021-11-22 14:06:16 +01:00
int disk_scan_partitions ( struct gendisk * disk , fmode_t mode ) ;
2021-05-21 07:51:16 +02:00
2021-08-18 16:45:39 +02:00
int disk_alloc_events ( struct gendisk * disk ) ;
2021-06-24 09:38:42 +02:00
void disk_add_events ( struct gendisk * disk ) ;
void disk_del_events ( struct gendisk * disk ) ;
void disk_release_events ( struct gendisk * disk ) ;
2022-01-24 10:39:11 +01:00
void disk_block_events ( struct gendisk * disk ) ;
void disk_unblock_events ( struct gendisk * disk ) ;
void disk_flush_events ( struct gendisk * disk , unsigned int mask ) ;
2021-06-24 09:38:43 +02:00
extern struct device_attribute dev_attr_events ;
extern struct device_attribute dev_attr_events_async ;
extern struct device_attribute dev_attr_events_poll_msecs ;
2021-06-24 09:38:42 +02:00
2021-10-12 13:12:21 +02:00
static inline void bio_clear_polled ( struct bio * bio )
2021-08-12 11:42:53 -06:00
{
/* can't support alloc cache if we turn off polling */
2022-03-24 16:35:24 -04:00
	bio->bi_opf &= ~(REQ_POLLED | REQ_ALLOC_CACHE);
2021-08-12 11:42:53 -06:00
}
2021-10-12 12:44:50 +02:00
long blkdev_ioctl ( struct file * file , unsigned cmd , unsigned long arg ) ;
2021-10-12 12:44:49 +02:00
long compat_blkdev_ioctl ( struct file * file , unsigned cmd , unsigned long arg ) ;
2021-09-07 16:13:02 +02:00
extern const struct address_space_operations def_blk_aops ;
block: Add independent access ranges support
The Concurrent Positioning Ranges VPD page (for SCSI) and data log page
(for ATA) contain parameters describing the set of contiguous LBAs that
can be served independently by a single LUN multi-actuator hard-disk.
Similarly, a logically defined block device composed of multiple disks
can in some cases execute requests directed at different sector ranges
in parallel. A dm-linear device aggregating 2 block devices together is
an example.
This patch implements support for exposing a block device independent
access ranges to the user through sysfs to allow optimizing device
accesses to increase performance.
To describe the set of independent sector ranges of a device (actuators
of a multi-actuator HDD or table entries of a dm-linear device),
the type struct blk_independent_access_ranges is introduced. This
structure describes the sector ranges using an array of
struct blk_independent_access_range structures. This range structure
defines the start sector and number of sectors of the access range.
The ranges in the array cannot overlap and must contain all sectors
within the device capacity.
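The two constraints on the range array (no overlap, full coverage of the device capacity) can be illustrated with a small validity check. This is a userspace sketch, not the kernel's check: the struct and function names are hypothetical, the ranges are assumed to be sorted by start sector, and sector arithmetic is assumed not to overflow.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one access range: start and length. */
struct sketch_range {
	uint64_t sector;
	uint64_t nr_sectors;
};

/*
 * Return true if the sorted ranges tile [0, capacity) exactly: each
 * range is non-empty and starts where the previous one ended, so there
 * are no gaps and no overlaps, and the last range ends at capacity.
 */
bool sketch_ranges_valid(const struct sketch_range *r, size_t nr,
			 uint64_t capacity)
{
	uint64_t next = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (r[i].nr_sectors == 0 || r[i].sector != next)
			return false;
		next += r[i].nr_sectors;
	}
	return nr > 0 && next == capacity;
}
```

A dual actuator disk of 150 sectors described as { 0, 100 } and { 100, 50 } passes the check, while a second range starting at sector 120 would be rejected because of the gap.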
The function disk_set_independent_access_ranges() allows a device
driver to signal to the block layer that a device has multiple
independent access ranges. In this case, a struct
blk_independent_access_ranges is attached to the device request queue
by the function disk_set_independent_access_ranges(). The function
disk_alloc_independent_access_ranges() is provided for drivers to
allocate this structure.
struct blk_independent_access_ranges contains kobjects (struct kobject)
to expose to the user through sysfs the set of independent access ranges
supported by a device. When the device is initialized, sysfs
registration of the ranges information is done from blk_register_queue()
using the block layer internal function
disk_register_independent_access_ranges(). If a driver calls
disk_set_independent_access_ranges() for a registered queue, e.g. when a
device is revalidated, disk_set_independent_access_ranges() will execute
disk_register_independent_access_ranges() to update the sysfs attribute
files. The sysfs file structure created starts from the
independent_access_ranges sub-directory and contains the start sector
and number of sectors of each range, with the information for each range
grouped in numbered sub-directories.
E.g. for a dual actuator HDD, the user sees:
$ tree /sys/block/sdk/queue/independent_access_ranges/
/sys/block/sdk/queue/independent_access_ranges/
|-- 0
|   |-- nr_sectors
|   `-- sector
`-- 1
    |-- nr_sectors
    `-- sector
For a regular device with a single access range, the
independent_access_ranges sysfs directory does not exist.
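A user-space tool could read these attributes roughly as follows. This is a minimal sketch: the helper names ia_range_attr_path() and ia_range_read_attr() are hypothetical, and it assumes each attribute file holds a single decimal number.

```c
#include <stdio.h>

/*
 * Build the sysfs path of one attribute ("sector" or "nr_sectors") of an
 * independent access range. Returns 0 on success, -1 if buf is too small.
 */
static int ia_range_attr_path(char *buf, size_t len, const char *disk,
			      unsigned int idx, const char *attr)
{
	int n = snprintf(buf, len,
			 "/sys/block/%s/queue/independent_access_ranges/%u/%s",
			 disk, idx, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read one attribute as an unsigned value (assumed to be decimal text).
 * Returns 0 on success, -1 on any failure.
 */
static int ia_range_read_attr(const char *disk, unsigned int idx,
			      const char *attr, unsigned long long *val)
{
	char path[256];
	FILE *f;
	int ok;

	if (ia_range_attr_path(path, sizeof(path), disk, idx, attr))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1; /* e.g. a single-range device: no such directory */

	ok = fscanf(f, "%llu", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A failed open is treated as "no independent access ranges", which matches the single-range case where the directory is absent.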
Device revalidation may lead to changes to this structure and to the
attribute values. While they are manipulated, the queue sysfs_lock and
sysfs_dir_lock mutexes are held for atomicity, similar to how the
blk-mq and elevator sysfs queue sub-directories are protected.
The code related to the management of independent access ranges is
added in the new file block/blk-ia-ranges.c.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211027022223.183838-2-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 11:22:19 +09:00
int disk_register_independent_access_ranges(struct gendisk *disk,
				struct blk_independent_access_ranges *new_iars);
void disk_unregister_independent_access_ranges(struct gendisk *disk);
2021-11-17 07:13:58 +01:00
#ifdef CONFIG_FAIL_MAKE_REQUEST
bool should_fail_request(struct block_device *part, unsigned int bytes);
#else /* CONFIG_FAIL_MAKE_REQUEST */
static inline bool should_fail_request(struct block_device *part,
					unsigned int bytes)
{
	return false;
}
#endif /* CONFIG_FAIL_MAKE_REQUEST */
2021-10-14 14:39:59 -06:00
/*
 * Optimized request reference counting. Ideally we'd make timeouts be more
 * clever, as that's the only reason we need references at all... But until
 * this happens, this is faster than using refcount_t. Also see:
 *
 * abc54d634334 ("io_uring: switch to atomic_t for io_kiocb reference count")
 */
#define req_ref_zero_or_close_to_overflow(req)	\
	((unsigned int) atomic_read(&(req->ref)) + 127u <= 127u)

static inline bool req_ref_inc_not_zero(struct request *req)
{
	return atomic_inc_not_zero(&req->ref);
}

static inline bool req_ref_put_and_test(struct request *req)
{
	WARN_ON_ONCE(req_ref_zero_or_close_to_overflow(req));
	return atomic_dec_and_test(&req->ref);
}

static inline void req_ref_set(struct request *req, int value)
{
	atomic_set(&req->ref, value);
}

static inline int req_ref_read(struct request *req)
{
	return atomic_read(&req->ref);
}
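The unsigned comparison in req_ref_zero_or_close_to_overflow() can be illustrated outside the kernel. The sketch below applies the same arithmetic to a plain int; ref_zero_or_close_to_overflow() is a hypothetical stand-in, not a kernel function. Read this way, the check appears to flag a count that is zero or has already dropped to a small negative value (-127 to -1), which is what an unbalanced put would produce.

```c
#include <stdbool.h>

/*
 * Same arithmetic as the macro, applied to a plain int instead of an
 * atomic_t. Adding 127 in unsigned arithmetic maps the values
 * -127..0 onto 0..127 (the negative ones wrap around), so a single
 * unsigned comparison covers the whole window.
 */
static bool ref_zero_or_close_to_overflow(int ref)
{
	return (unsigned int)ref + 127u <= 127u;
}
```

A healthy count such as 1 or 2 maps to 128 or 129 and is not flagged; only zero and values just below it are.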
2011-10-19 14:31:18 +02:00
#endif /* BLK_INTERNAL_H */