6015 Commits

Author SHA1 Message Date
Mikulas Patocka
0c8e9c2d66 dm zoned: fix invalid memory access
Commit 75d66ffb48efb30f2dd42f041ba8b39c5b2bd115 ("dm zoned: properly
handle backing device failure") triggers a coverity warning:

*** CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
/drivers/md/dm-zoned-target.c: 137 in dmz_submit_bio()
131             clone->bi_private = bioctx;
132
133             bio_advance(bio, clone->bi_iter.bi_size);
134
135             refcount_inc(&bioctx->ref);
136             generic_make_request(clone);
>>>     CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
>>>     Dereferencing freed pointer "clone".
137             if (clone->bi_status == BLK_STS_IOERR)
138                     return -EIO;
139
140             if (bio_op(bio) == REQ_OP_WRITE && dmz_is_seq(zone))
141                     zone->wp_block += nr_blocks;
142

The "clone" bio may be processed and freed before the check
"clone->bi_status == BLK_STS_IOERR" - so this check can access invalid
memory.

Fixes: 75d66ffb48efb3 ("dm zoned: properly handle backing device failure")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-26 10:33:58 -04:00
Jaskaran Khurana
88cd3e6cfa dm verity: add root hash pkcs#7 signature verification
The verification is to support cases where the root hash is not secured
by Trusted Boot, UEFI Secureboot or similar technologies.

One of the use cases for this is for dm-verity volumes mounted after
boot, the root hash provided during the creation of the dm-verity volume
has to be secure and thus in-kernel validation implemented here will be
used before we trust the root hash and allow the block device to be
created.

The signature being provided for verification must verify the root hash
and must be trusted by the builtin keyring for verification to succeed.

The hash is added as a key of type "user" and the description is passed
to the kernel so it can look it up and use it for verification.

Adds CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG which can be turned on if root
hash verification is needed.

Kernel commandline dm_verity module parameter 'require_signatures' will
indicate whether to force root hash signature verification (for all dm
verity volumes).

Signed-off-by: Jaskaran Khurana <jaskarankhurana@linux.microsoft.com>
Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:13:14 -04:00
Ard Biesheuvel
39d13a1ac4 dm crypt: reuse eboiv skcipher for IV generation
Instead of instantiating a separate cipher to perform the encryption
needed to produce the IV, reuse the skcipher used for the block data
and invoke it one additional time for each block to encrypt a zero
vector and use the output as the IV.

For CBC mode, this is equivalent to using the bare block cipher, but
without the risk of ending up with a non-time invariant implementation
of AES when the skcipher itself is time variant (e.g., arm64 without
Crypto Extensions has a NEON based time invariant implementation of
cbc(aes) but no time invariant implementation of the core cipher other
than aes-ti, which is not enabled by default).

This approach is a compromise between dm-crypt API flexibility and
reducing dependence on parts of the crypto API that should not usually
be exposed to other subsystems, such as the bare cipher API.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:13:13 -04:00
Mikulas Patocka
123d87d553 dm: make dm_table_find_target return NULL
Currently, if we pass too high sector number to dm_table_find_target, it
returns zeroed dm_target structure and callers test if the structure is
zeroed with the macro dm_target_is_valid.

However, returning NULL is common practice to indicate errors.

This patch refactors the dm code, so that dm_table_find_target returns
NULL and its callers test the returned value for NULL. The macro
dm_target_is_valid is deleted. In alloc_targets, we no longer allocate an
extra zeroed target.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:13:12 -04:00
Mikulas Patocka
1cfd5d3399 dm table: fix invalid memory accesses with too high sector number
If the sector number is too high, dm_table_find_target() should return a
pointer to a zeroed dm_target structure (the caller should test it with
dm_target_is_valid).

However, for some table sizes, the code in dm_table_find_target() that
performs btree lookup will access out of bound memory structures.

Fix this bug by testing the sector number at the beginning of
dm_table_find_target(). Also, add an "inline" keyword to the function
dm_table_get_size() because this is a hot path.

Fixes: 512875bd9661 ("dm: table detect io beyond device")
Cc: stable@vger.kernel.org
Reported-by: Zhang Tao <kontais@zoho.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-23 10:11:42 -04:00
ZhangXiaoxu
ae148243d3 dm space map metadata: fix missing store of apply_bops() return value
In commit 6096d91af0b6 ("dm space map metadata: fix occasional leak
of a metadata block on resize"), we refactor the commit logic to a new
function 'apply_bops'.  But when that logic was replaced in out() the
return value was not stored.  This may lead out() returning a wrong
value to the caller.

Fixes: 6096d91af0b6 ("dm space map metadata: fix occasional leak of a metadata block on resize")
Cc: stable@vger.kernel.org
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-22 16:11:24 -04:00
ZhangXiaoxu
e4f9d60138 dm btree: fix order of block initialization in btree_split_beneath
When btree_split_beneath() splits a node to two new children, it will
allocate two blocks: left and right.  If right block's allocation
failed, the left block will be unlocked and marked dirty.  If this
happened, the left block'ss content is zero, because it wasn't
initialized with the btree struct before the attempot to allocate the
right block.  Upon return, when flushing the left block to disk, the
validator will fail when check this block.  Then a BUG_ON is raised.

Fix this by completely initializing the left block before allocating and
initializing the right block.

Fixes: 4dcb8b57df359 ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path")
Cc: stable@vger.kernel.org
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-22 16:11:23 -04:00
Wenwen Wang
dc1a3e8e0c dm raid: add missing cleanup in raid_ctr()
If rs_prepare_reshape() fails, no cleanup is executed, leading to
leak of the raid_set structure allocated at the beginning of
raid_ctr(). To fix this issue, go to the label 'bad' if the error
occurs.

Fixes: 11e4723206683 ("dm raid: stop keeping raid set frozen altogether")
Cc: stable@vger.kernel.org
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-21 11:47:05 -04:00
Dan Carpenter
e0702d90b7 dm zoned: fix potential NULL dereference in dmz_do_reclaim()
This function is supposed to return error pointers so it matches the
dmz_get_rnd_zone_for_reclaim() function.  The current code could lead to
a NULL dereference in dmz_do_reclaim()

Fixes: b234c6d7a703 ("dm zoned: improve error handling in reclaim")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-21 11:29:30 -04:00
Bryan Gurney
08c04c84a5 dm dust: use dust block size for badblocklist index
Change the "frontend" dust_remove_block, dust_add_block, and
dust_query_block functions to store the "dust block number", instead
of the sector number corresponding to the "dust block number".

For the "backend" functions dust_map_read and dust_map_write,
right-shift by sect_per_block_shift.  This fixes the inability to
emulate failure beyond the first sector of each "dust block" (for
devices with a "dust block size" larger than 512 bytes).

Fixes: e4f3fabd67480bf ("dm: add dust target")
Cc: stable@vger.kernel.org
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-21 11:27:17 -04:00
Mikulas Patocka
5729b6e5a1 dm integrity: fix a crash due to BUG_ON in __journal_read_write()
Fix a crash that was introduced by the commit 724376a04d1a. The crash is
reported here: https://gitlab.com/cryptsetup/cryptsetup/issues/468

When reading from the integrity device, the function
dm_integrity_map_continue calls find_journal_node to find out if the
location to read is present in the journal. Then, it calculates how many
sectors are consecutively stored in the journal. Then, it locks the range
with add_new_range and wait_and_add_new_range.

The problem is that during wait_and_add_new_range, we hold no locks (we
don't hold ic->endio_wait.lock and we don't hold a range lock), so the
journal may change arbitrarily while wait_and_add_new_range sleeps.

The code then goes to __journal_read_write and hits
BUG_ON(journal_entry_get_sector(je) != logical_sector); because the
journal has changed.

In order to fix this bug, we need to re-check the journal location after
wait_and_add_new_range. We restrict the length to one block in order to
not complicate the code too much.

Fixes: 724376a04d1a ("dm integrity: implement fair range locks")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 16:01:57 -04:00
Dmitry Fomichev
ad1bd578bd dm zoned: fix a few typos
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:43 -04:00
Dmitry Fomichev
bae9a0aa33 dm zoned: add SPDX license identifiers
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:42 -04:00
Dmitry Fomichev
75d66ffb48 dm zoned: properly handle backing device failure
dm-zoned is observed to lock up or livelock in case of hardware
failure or some misconfiguration of the backing zoned device.

This patch adds a new dm-zoned target function that checks the status of
the backing device. If the request queue of the backing device is found
to be in dying state or the SCSI backing device enters offline state,
the health check code sets a dm-zoned target flag prompting all further
incoming I/O to be rejected. In order to detect backing device failures
timely, this new function is called in the request mapping path, at the
beginning of every reclaim run and before performing any metadata I/O.

The proper way out of this situation is to do

dmsetup remove <dm-zoned target>

and recreate the target when the problem with the backing device
is resolved.

Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:42 -04:00
Dmitry Fomichev
d7428c5011 dm zoned: improve error handling in i/o map code
Some errors are ignored in the I/O path during queueing chunks
for processing by chunk works. Since at least these errors are
transient in nature, it should be possible to retry the failed
incoming commands.

The fix -

Errors that can happen while queueing chunks are carried upwards
to the main mapping function and it now returns DM_MAPIO_REQUEUE
for any incoming requests that can not be properly queued.

Error logging/debug messages are added where needed.

Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:41 -04:00
Dmitry Fomichev
b234c6d7a7 dm zoned: improve error handling in reclaim
There are several places in reclaim code where errors are not
propagated to the main function, dmz_reclaim(). This function
is responsible for unlocking zones that might be still locked
at the end of any failed reclaim iterations. As the result,
some device zones may be left permanently locked for reclaim,
degrading target's capability to reclaim zones.

This patch fixes these issues as follows -

Make sure that dmz_reclaim_buf(), dmz_reclaim_seq_data() and
dmz_reclaim_rnd_data() return error codes to the caller.

dmz_reclaim() function is renamed to dmz_do_reclaim() to avoid
clashing with "struct dmz_reclaim" and is modified to return the
error to the caller.

dmz_get_zone_for_reclaim() now returns an error instead of NULL
pointer and reclaim code checks for that error.

Error logging/debug messages are added where necessary.

Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:40 -04:00
Dmitry Fomichev
d1fef41465 dm kcopyd: always complete failed jobs
This patch fixes a problem in dm-kcopyd that may leave jobs in
complete queue indefinitely in the event of backing storage failure.

This behavior has been observed while running 100% write file fio
workload against an XFS volume created on top of a dm-zoned target
device. If the underlying storage of dm-zoned goes to offline state
under I/O, kcopyd sometimes never issues the end copy callback and
dm-zoned reclaim work hangs indefinitely waiting for that completion.

This behavior was traced down to the error handling code in
process_jobs() function that places the failed job to complete_jobs
queue, but doesn't wake up the job handler. In case of backing device
failure, all outstanding jobs may end up going to complete_jobs queue
via this code path and then stay there forever because there are no
more successful I/O jobs to wake up the job handler.

This patch adds a wake() call to always wake up kcopyd job wait queue
for all I/O jobs that fail before dm_io() gets called for that job.

The patch also sets the write error status in all sub jobs that are
failed because their master job has failed.

Fixes: b73c67c2cbb00 ("dm kcopyd: add sequential write feature")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:39 -04:00
Mikulas Patocka
cf3591ef83 Revert "dm bufio: fix deadlock with loop device"
Revert the commit bd293d071ffe65e645b4d8104f9d8fe15ea13862. The proper
fix has been made available with commit d0a255e795ab ("loop: set
PF_MEMALLOC_NOIO for the worker thread").

Note that the fix offered by commit bd293d071ffe doesn't really prevent
the deadlock from occuring - if we look at the stacktrace reported by
Junxiao Bi, we see that it hangs in bit_wait_io and not on the mutex -
i.e. it has already successfully taken the mutex. Changing the mutex
from mutex_lock to mutex_trylock won't help with deadlocks that happen
afterwards.

PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
   #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
   #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
   #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
   #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
   #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
   #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
   #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
   #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
   #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
   #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
  #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
  #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
  #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
  #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
  #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Fixes: bd293d071ffe ("dm bufio: fix deadlock with loop device")
Depends-on: d0a255e795ab ("loop: set PF_MEMALLOC_NOIO for the worker thread")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-08-15 15:57:39 -04:00
Linus Torvalds
50e73a4a41 for-linus-20190809
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1NmNIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpouEEADA1e2HH9AO8QGWBVSeFi5ZdRnEXsJUfNGD
 d576M9empGI/hkbui0yY2tuodnMwmhUj641GTRC2dzl7RRyGcTMmqtGYPcyczlpU
 +6Yil3+XVNYTXRpUsKWs32H9aubNZl/L3rCcJImGvMgLW5YtEjAZJFIFIzXWWwDJ
 aZpTtnOH1+D3HH6HT35xk+aytSYZ7LsZ7X3LI9ZumKOUd2HJZGUkfWLXxgSuuUh/
 /WaBEQ9xzDNARmfx9Qb/2wSAE7XOInupPr86fI9dnmXHZ8rwhsvHRIZEvNIIqF5Z
 KzbF+rGJZ2bizKFpVFdlrIIyfBleFlQGFzYnGrs/+47zAl/CGicqmuAhGcOdGQXd
 wyj6G66iBcBIQC9hsYhPtglyQSk95tzsPZLZa7/1TlhEKl3mpor4w30NrLz/P5oy
 gdIivDhKP7aRFzBWw/2O4TOn3HhGxnZWVni6icOHE/pBQLBW12Ulc1nT0SiJ8RAt
 PYhMCFwz0Xc7pYuEfGZKwNSCIrx4i7f+spWEAt0Fget/KaB993zm+K4OzGbfBv6r
 FrZ5g/2MTVBJITmJVN2LDysF4EVEOzTULX+bCQOOWC7wmE/poCyNWj8gCpunIZsY
 V5DAB+cbEw/OLfpdiogXvCcxrbukKHV8AVdMdLYB5yEAuPN0JMqcunMjECxpNAha
 1wsv86ljvw==
 =Au30
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190809' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Revert of a bcache patch that caused an oops for some (Coly)

 - ata rb532 unused warning fix (Gustavo)

 - AoE kernel crash fix (He)

 - Error handling fixup for blkdev_get() (Jan)

 - libata read/write translation and SFF PIO fix (me)

 - Use after free and error handling fix for O_DIRECT fragments. There's
   still a nowait + sync oddity in there, we'll nail that start next
   week. If all else fails, I'll queue a revert of the NOWAIT change.
   (me)

 - Loop GFP_KERNEL -> GFP_NOIO deadlock fix (Mikulas)

 - Two BFQ regression fixes that caused crashes (Paolo)

* tag 'for-linus-20190809' of git://git.kernel.dk/linux-block:
  bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()"
  loop: set PF_MEMALLOC_NOIO for the worker thread
  bdev: Fixup error handling in blkdev_get()
  block, bfq: handle NULL return value by bfq_init_rq()
  block, bfq: move update of waker and woken list to queue freeing
  block, bfq: reset last_completed_rq_bfqq if the pointed queue is freed
  block: aoe: Fix kernel crash due to atomic sleep when exiting
  libata: add SG safety checks in SFF pio transfers
  libata: have ata_scsi_rw_xlat() fail invalid passthrough requests
  block: fix O_DIRECT error handling for bio fragments
  ata: rb532_cf: Fix unused variable warning in rb532_pata_driver_probe
2019-08-09 09:28:18 -07:00
Coly Li
20621fedb2 bcache: Revert "bcache: use sysfs_match_string() instead of __sysfs_match_string()"
This reverts commit 89e0341af082dbc170019f908846f4a424efc86b.

In drivers/md/bcache/sysfs.c:bch_snprint_string_list(), NULL pointer at
the end of list is necessary. Remove the NULL from last element of each
lists will cause the following panic,

[ 4340.455652] bcache: register_cache() registered cache device nvme0n1
[ 4340.464603] bcache: register_bdev() registered backing device sdk
[ 4421.587335] bcache: bch_cached_dev_run() cached dev sdk is running already
[ 4421.587348] bcache: bch_cached_dev_attach() Caching sdk as bcache0 on set 354e1d46-d99f-4d8b-870b-078b80dc88a6
[ 5139.247950] general protection fault: 0000 [#1] SMP NOPTI
[ 5139.247970] CPU: 9 PID: 5896 Comm: cat Not tainted 4.12.14-95.29-default #1 SLE12-SP4
[ 5139.247988] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 04/18/2019
[ 5139.248006] task: ffff888fb25c0b00 task.stack: ffff9bbacc704000
[ 5139.248021] RIP: 0010:string+0x21/0x70
[ 5139.248030] RSP: 0018:ffff9bbacc707bf0 EFLAGS: 00010286
[ 5139.248043] RAX: ffffffffa7e432e3 RBX: ffff8881c20da02a RCX: ffff0a00ffffff04
[ 5139.248058] RDX: 3f00656863616362 RSI: ffff8881c20db000 RDI: ffffffffffffffff
[ 5139.248075] RBP: ffff8881c20db000 R08: 0000000000000000 R09: ffff8881c20da02a
[ 5139.248090] R10: 0000000000000004 R11: 0000000000000000 R12: ffff9bbacc707c48
[ 5139.248104] R13: 0000000000000fd6 R14: ffffffffc0665855 R15: ffffffffc0665855
[ 5139.248119] FS:  00007faf253b8700(0000) GS:ffff88903f840000(0000) knlGS:0000000000000000
[ 5139.248137] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5139.248149] CR2: 00007faf25395008 CR3: 0000000f72150006 CR4: 00000000007606e0
[ 5139.248164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5139.248179] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5139.248193] PKRU: 55555554
[ 5139.248200] Call Trace:
[ 5139.248210]  vsnprintf+0x1fb/0x510
[ 5139.248221]  snprintf+0x39/0x40
[ 5139.248238]  bch_snprint_string_list.constprop.15+0x5b/0x90 [bcache]
[ 5139.248256]  __bch_cached_dev_show+0x44d/0x5f0 [bcache]
[ 5139.248270]  ? __alloc_pages_nodemask+0xb2/0x210
[ 5139.248284]  bch_cached_dev_show+0x2c/0x50 [bcache]
[ 5139.248297]  sysfs_kf_seq_show+0xbb/0x190
[ 5139.248308]  seq_read+0xfc/0x3c0
[ 5139.248317]  __vfs_read+0x26/0x140
[ 5139.248327]  vfs_read+0x87/0x130
[ 5139.248336]  SyS_read+0x42/0x90
[ 5139.248346]  do_syscall_64+0x74/0x160
[ 5139.248358]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 5139.248370] RIP: 0033:0x7faf24eea370
[ 5139.248379] RSP: 002b:00007fff82d03f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 5139.248395] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007faf24eea370
[ 5139.248411] RDX: 0000000000020000 RSI: 00007faf25396000 RDI: 0000000000000003
[ 5139.248426] RBP: 00007faf25396000 R08: 00000000ffffffff R09: 0000000000000000
[ 5139.248441] R10: 000000007c9d4d41 R11: 0000000000000246 R12: 00007faf25396000
[ 5139.248456] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[ 5139.248892] Code: ff ff ff 0f 1f 80 00 00 00 00 49 89 f9 48 89 cf 48 c7 c0 e3 32 e4 a7 48 c1 ff 30 48 81 fa ff 0f 00 00 48 0f 46 d0 48 85 ff 74 45 <44> 0f b6 02 48 8d 42 01 45 84 c0 74 38 48 01 fa 4c 89 cf eb 0e

The simplest way to fix is to revert commit 89e0341af082 ("bcache: use
sysfs_match_string() instead of __sysfs_match_string()").

This bug was introduced in Linux v5.2, so this fix only applies to
Linux v5.2 is enough for stable tree maintainer.

Fixes: 89e0341af082 ("bcache: use sysfs_match_string() instead of __sysfs_match_string()")
Cc: stable@vger.kernel.org
Cc: Alexandru Ardelean <alexandru.ardelean@analog.com>
Reported-by: Peifeng Lin <pflin@suse.com>
Acked-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-09 07:37:33 -06:00
Hou Tao
449808a254 raid1: factor out a common routine to handle the completion of sync write
It's just code clean-up.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Guoqing Jiang
0d8ed0e9bf md: don't call spare_active in md_reap_sync_thread if all member devices can't work
When add one disk to array, the md_reap_sync_thread is responsible
to activate the spare and set In_sync flag for the new member in
spare_active().

But if raid1 has one member disk A, and disk B is added to the array.
Then we offline A before all the datas are synchronized from A to B,
obviously B doesn't have the latest data as A, but B is still marked
with In_sync flag.

So let's not call spare_active under the condition, otherwise B is
still showed with 'U' state which is not correct.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Guoqing Jiang
062f5b2ae1 md: don't set In_sync if array is frozen
When a disk is added to array, the following path is called in mdadm.

Manage_subdevs -> sysfs_freeze_array
               -> Manage_add
               -> sysfs_set_str(&info, NULL, "sync_action","idle")

Then from kernel side, Manage_add invokes the path (add_new_disk ->
validate_super = super_1_validate) to set In_sync flag.

Since In_sync means "device is in_sync with rest of array", and the new
added disk need to resync thread to help the synchronization of data.
And md_reap_sync_thread would call spare_active to set In_sync for the
new added disk finally. So don't set In_sync if array is in frozen.

Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Guoqing Jiang
9a567843f7 md: allow last device to be forcibly removed from RAID1/RAID10.
When the 'last' device in a RAID1 or RAID10 reports an error,
we do not mark it as failed.  This would serve little purpose
as there is no risk of losing data beyond that which is obviously
lost (as there is with RAID5), and there could be other sectors
on the device which are readable, and only readable from this device.
This in general this maximises access to data.

However the current implementation also stops an admin from removing
the last device by direct action.  This is rarely useful, but in many
case is not harmful and can make automation easier by removing special
cases.

Also, if an attempt to write metadata fails the device must be marked
as faulty, else an infinite loop will result, attempting to update
the metadata on all non-faulty devices.

So add 'fail_last_dev' member to 'struct mddev', then we can bypasses
the 'last disk' checks for RAID1 and RAID10, and control the behavior
per array by change sysfs node.

Signed-off-by: NeilBrown <neilb@suse.de>
[add sysfs node for fail_last_dev by Guoqing]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Andy Shevchenko
cf89160793 md: Convert to use int_pow()
Instead of linear approach to calculate power of 10, use generic int_pow()
which does it better.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Yufen Yu
7cee6d4e60 md/raid10: end bio when the device faulty
Just like raid1, we do not queue write error bio to retry write
and acknowlege badblocks, when the device is faulty.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Yufen Yu
eeba6809d8 md/raid1: end bio when the device faulty
When write bio return error, it would be added to conf->retry_list
and wait for raid1d thread to retry write and acknowledge badblocks.

In narrow_write_error(), the error bio will be split in the unit of
badblock shift (such as one sector) and raid1d thread issues them
one by one. Until all of the splited bio has finished, raid1d thread
can go on processing other things, which is time consuming.

But, there is a scene for error handling that is not necessary.
When the device has been set faulty, flush_bio_list() may end
bios in pending_bio_list with error status. Since these bios
has not been issued to the device actually, error handlding to
retry write and acknowledge badblocks make no sense.

Even without that scene, when the device is faulty, badblocks info
can not be written out to the device. Thus, we also no need to
handle the error IO.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Xiao Ni
143f6e733b md/raid6: Set R5_ReadError when there is read failure on parity disk
7471fb77ce4d ("md/raid6: Fix anomily when recovering a single device in
RAID6.") avoids rereading P when it can be computed from other members.
However, this misses the chance to re-write the right data to P. This
patch sets R5_ReadError if the re-read fails.

Also, when re-read is skipped, we also missed the chance to reset
rdev->read_errors to 0. It can fail the disk when there are many read
errors on P member disk (other disks don't have read error)

V2: upper layer read request don't read parity/Q data. So there is no
need to consider such situation.

This is Reported-by: kbuild test robot <lkp@intel.com>

Fixes: 7471fb77ce4d ("md/raid6: Fix anomily when recovering a single device in RAID6.")
Cc: <stable@vger.kernel.org> #4.4+
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Hou Tao
4675719d0f raid1: use an int as the return value of raise_barrier()
Using a sector_t as the return value is misleading, because
raise_barrier() only return 0 or -EINTR.

Also add comments for the return values of raise_barrier().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07 10:25:02 -07:00
Ming Lei
226b4fc75c blk-mq: add callback of .cleanup_rq
SCSI maintains its own driver private data hooked off of each SCSI
request, and the pridate data won't be freed after scsi_queue_rq()
returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
(e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
fully dispatched them, due to a lower level SCSI driver's resource
limitation identified in scsi_queue_rq(). Currently SCSI's per-request
private data is leaked when the upper layer driver (dm-rq) frees and
then retries these requests in response to BLK_STS_RESOURCE or
BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().

This usecase is so specialized that it doesn't warrant training an
existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
account for freeing its driver private data -- doing so would add an
extra branch for handling a special case that all other consumers of
SCSI (and blk-mq) won't ever need to worry about.

So the most pragmatic way forward is to delegate freeing SCSI driver
private data to the upper layer driver (dm-rq).  Do so by adding
new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
from dm-rq.  A following commit will implement the .cleanup_rq() hook
in scsi_mq_ops.

Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: <stable@vger.kernel.org>
Fixes: 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-04 21:42:06 -06:00
Mike Snitzer
9c50a98f55 dm table: fix various whitespace issues with recent DAX code
Also, rename device_synchronous to device_dax_synchronous.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-30 18:59:24 -04:00
Pankaj Gupta
5348deb138 dm table: fix dax_dev NULL dereference in device_synchronous()
If a device doesn't support DAX its 'dax_dev' is NULL.  Fix
device_synchronous() to first check if dax_dev is NULL before
dereferencing it.

Fixes: 2e9ee0955d3c ("dm: enable synchronous dax")
Reported-by: jencce.kernel@gmail.com
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-30 18:58:54 -04:00
Linus Torvalds
0441281965 for-linus-20190726
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl07DGAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplf5EADOOvOdsz9N/Iw8ZHHHJXCqKR26zZv75G1z
 0h1PGC7p0JZQbYFo0Zo7mjiRBGlg6tlXc2d4Gyl94XJKDwjeYTcFDvbvERdYa+MH
 d2RiFkAfR967Ri4fb+FP5L3mYOQdMJ/zk0xCDHLv/DcxeFLa5a9EJS1+vBSR+AcB
 0JpJWuHypGqGmbTaL0z9q2pmx0mgA1ERlWQtkMLrsEr2Vqg/rrjGwe2bGFY00lXc
 vKtFkpfugKc4zVAPSzC1YZgojfDDpGNEA4QMtxMsEH4hqyMpHhrtUedNY5QrjC0B
 p9h6aPXXYr2KhGP0grrEytzaYUOzK2crK5h+q+1vu6nOgx2EgmnLM9tBu/LuRH1j
 uUzKJOa3/AE+bU7uZEsaUerTBsHrgEBa1x8G92obYRnjgW3aCD2CaSbjjBhNxTZ4
 1dXyr0DTHFXZmfcfWja5tO26JTPzjwVOrwiRyU0S727UsdVJupoHiYLr5fwaDfgn
 /Du2I/XWvFtflm5i0ND0sdcX1yRlFiGZ9e45z1QFaFmcteKKWzRBDlC6mQzI/lw3
 oc583mhDR3tRtJxow+wn6AuMUehFRh8wj0UhL/MEMjLW8GiqXU5aRtanT+22Xz4L
 saNDQieeEnV7raMYXMP0qIhkJtrNASmJQos+MOJAEGOWcS2ePIUUio2kSXie+071
 BphJd2RamQ==
 =HIzH
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Several io_uring fixes/improvements:
     - Blocking fix for O_DIRECT (me)
     - Latter page slowness for registered buffers (me)
     - Fix poll hang under certain conditions (me)
     - Defer sequence check fix for wrapped rings (Zhengyuan)
     - Mismatch in async inc/dec accounting (Zhengyuan)
     - Memory ordering issue that could cause stall (Zhengyuan)
      - Track sequential defer in bytes, not pages (Zhengyuan)

 - NVMe pull request from Christoph

 - Set of hang fixes for wbt (Josef)

 - Redundant error message kill for libahci (Ding)

 - Remove unused blk_mq_sched_started_request() and related ops (Marcos)

 - drbd dynamic alloc shash descriptor to reduce stack use (Arnd)

 - blkcg ->pd_stat() non-debug print (Tejun)

 - bcache memory leak fix (Wei)

 - Comment fix (Akinobu)

 - BFQ perf regression fix (Paolo)

* tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
  io_uring: ensure ->list is initialized for poll commands
  Revert "nvme-pci: don't create a read hctx mapping without read queues"
  nvme: fix multipath crash when ANA is deactivated
  nvme: fix memory leak caused by incorrect subsystem free
  nvme: ignore subnqn for ADATA SX6000LNP
  drbd: dynamically allocate shash descriptor
  block: blk-mq: Remove blk_mq_sched_started_request and started_request
  bcache: fix possible memory leak in bch_cached_dev_run()
  io_uring: track io length in async_list based on bytes
  io_uring: don't use iov_iter_advance() for fixed buffers
  block: properly handle IOCB_NOWAIT for async O_DIRECT IO
  blk-mq: allow REQ_NOWAIT to return an error inline
  io_uring: add a memory barrier before atomic_read
  rq-qos: use a mb for got_token
  rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
  rq-qos: don't reset has_sleepers on spurious wakeups
  rq-qos: fix missed wake-ups in rq_qos_throttle
  wait: add wq_has_single_sleeper helper
  block, bfq: check also in-flight I/O in dispatch plugging
  block: fix sysfs module parameters directory path in comment
  ...
2019-07-26 10:32:12 -07:00
Wei Yongjun
5d9e06d60e bcache: fix possible memory leak in bch_cached_dev_run()
memory malloced in bch_cached_dev_run() and should be freed before
leaving from the error handling cases, otherwise it will cause
memory leak.

Fixes: 0b13efecf5f2 ("bcache: add return value check to bch_cached_dev_run()")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-22 08:15:17 -06:00
Linus Torvalds
3bfe1fc467 - Fix zone state management race in DM zoned target by eliminating
the unnecessary DMZ_ACTIVE state.
 
 - A couple fixes for issues the DM snapshot target's optional discard
   support added during first week of the 5.3 merge.
 
 - Increase default size of outstanding IO that is allowed for a each
   dm-kcopyd client and introduce tunable to allow user adjust.
 
 - Update DM core to use printk ratelimiting functions rather than
   duplicate them and in doing so fix an issue where DMDEBUG_LIMIT() rate
   limited KERN_DEBUG messages had excessive "callbacks suppressed"
   messages.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl0w2IsTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWgPeCACMtDQrHqMTOT7OPDRxSJjZixefzL32
 lFi31mjjEb7GoxiS3dBepdJmQiUwROINdGLIGTfBAlH05b/8fgFgE6iCGZ9uzad4
 0PNe9q7pbtfQDLXx+mVMjEdK6P/ilmVFXCW8VQpAAeUFL+dwXYHHIbmQZ/rahOz5
 8nn6wGBQ/LRRcbV0hBHNXQymIXPxMweMxO3usSuKbfhe7JjRwslThGbZ4KVwjCwl
 sLl5mEWXwTKUemGXsXFbCbtH/rnZpbaiAkBedT0oV8g8atRBeQyj0vk48htincj7
 Uv6xGjJGXuqUkcvQnTx3C1fk3lH5xJb5MTL3WEN0g6fOmyJd1sMd0gd/
 =4Rpg
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull more device mapper updates from Mike Snitzer:

 - Fix zone state management race in DM zoned target by eliminating the
   unnecessary DMZ_ACTIVE state.

 - A couple fixes for issues the DM snapshot target's optional discard
   support added during first week of the 5.3 merge.

 - Increase default size of outstanding IO that is allowed for a each
   dm-kcopyd client and introduce tunable to allow user adjust.

 - Update DM core to use printk ratelimiting functions rather than
   duplicate them and in doing so fix an issue where DMDEBUG_LIMIT()
   rate limited KERN_DEBUG messages had excessive "callbacks suppressed"
   messages.

* tag 'for-5.3/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: use printk ratelimiting functions
  dm kcopyd: Increase default sub-job size to 512KB
  dm snapshot: fix oversights in optional discard support
  dm zoned: fix zone state management race
2019-07-18 14:49:33 -07:00
Linus Torvalds
f8c3500cd1 - virtio_pmem: The new virtio_pmem facility introduces a paravirtualized
persistent memory device that allows a guest VM to use DAX mechanisms to
   access a host-file with host-page-cache. It arranges for MAP_SYNC to
   be disabled and instead triggers a host fsync() when a 'write-cache
   flush' command is sent to the virtual disk device.
 
 - Miscellaneous small fixups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdMHwpAAoJEB7SkWpmfYgCUYoP/3vcgYBAaXNksyALF0iowPoP
 z4J0KoaOA1CzRFEQtCWUQa84CWj+XoSewwSeyrIkqKQvx/gghXblK+GVjVzBn0BD
 hmmiKr8af4DdxfzYdEXJp65cCpIiVMaJiGr20Aj9ObwvWJb4QZbz9q7hnPt6KgiI
 jVND3BpP3OERb4ZFcibdmJT5foKooMcXVG6+luVe+hc1+ZZQxJBsBaqie4brQIFq
 j59NX3HfHH2fr1vVwnVH0CO4tgbgYg9wZ2EivGu6wBWvORjrr7KiSSbOYP68EBtd
 lUoNps+vQtGnfXGwNzAjp1wuknrQYYh4/KMKjep7hiZD39rgyvBpbHbyynKzQCWV
 REe8cXr/nwphsENvBAUBiqY999EWVIxdT2iaVaSA6K/31JQAC5AFyxVK/P2Ke1SK
 rvePZ++iLQ1o4phTxQPNlVUqF9jOrFVVICGwMDqaqSkOsD9YKQdFClfOF/1ntlDz
 V0bs+Y0Pe8AJCd9ESep4X+vHAWRRIb4EQIuwLaX8RJoY+r1fGye9RPthpYYzvXKp
 DI2iJztFO3anzj2i9htNPUFIaiUmIhzEvG32O2If2yc5FL02hMpHPoFx6vHhe6s3
 f8OJ+olsJK+/IIrV8+DHqYvhzylOYIhmRTvIxIxaNDPHkhR1i2RDQ6KKK1YZmsr8
 MjAZ+Ym0GadDivs+wcM6
 =uAMG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "Primarily just the virtio_pmem driver:

   - virtio_pmem

     The new virtio_pmem facility introduces a paravirtualized
     persistent memory device that allows a guest VM to use DAX
     mechanisms to access a host-file with host-page-cache. It arranges
     for MAP_SYNC to be disabled and instead triggers a host fsync()
     when a 'write-cache flush' command is sent to the virtual disk
     device.

   - Miscellaneous small fixups"

* tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  virtio_pmem: fix sparse warning
  xfs: disable map_sync for async flush
  ext4: disable map_sync for async flush
  dax: check synchronous mapping is supported
  dm: enable synchronous dax
  libnvdimm: add dax_dev sync flag
  virtio-pmem: Add virtio pmem driver
  libnvdimm: nd_region flush callback support
  libnvdimm, namespace: Drop uuid_t implementation detail
2019-07-18 10:52:08 -07:00
Nikos Tsironis
c663e04097 dm kcopyd: Increase default sub-job size to 512KB
Currently, kcopyd has a sub-job size of 64KB and a maximum number of 8
sub-jobs. As a result, for any kcopyd job, we have a maximum of 512KB of
I/O in flight.

This upper limit to the amount of in-flight I/O under-utilizes fast
devices and results in decreased throughput, e.g., when writing to a
snapshotted thin LV with I/O size less than the pool's block size (so
COW is performed using kcopyd).

Increase kcopyd's default sub-job size to 512KB, so we have a maximum of
4MB of I/O in flight for each kcopyd job. This results in an up to 96%
improvement of bandwidth when writing to a snapshotted thin LV, with I/O
sizes less than the pool's block size.

Also, add dm_mod.kcopyd_subjob_size_kb module parameter to allow users
to fine tune the sub-job size of kcopyd. The default value of this
parameter is 512KB and the maximum allowed value is 1024KB.

We evaluate the performance impact of the change by running the
snap_breaking_throughput benchmark, from the device mapper test suite
[1].

The benchmark:

  1. Creates a 1G thin LV
  2. Provisions the thin LV
  3. Takes a snapshot of the thin LV
  4. Writes to the thin LV with:

      dd if=/dev/zero of=/dev/vg/thin_lv oflag=direct bs=<I/O size>

Running this benchmark with various thin pool block sizes and dd I/O
sizes (all combinations triggering the use of kcopyd) we get the
following results:

+-----------------+-------------+------------------+-----------------+
| Pool block size | dd I/O size | BW before (MB/s) | BW after (MB/s) |
+-----------------+-------------+------------------+-----------------+
|       1 MB      |      256 KB |       242        |       280       |
|       1 MB      |      512 KB |       238        |       295       |
|                 |             |                  |                 |
|       2 MB      |      256 KB |       238        |       354       |
|       2 MB      |      512 KB |       241        |       380       |
|       2 MB      |        1 MB |       245        |       394       |
|                 |             |                  |                 |
|       4 MB      |      256 KB |       248        |       412       |
|       4 MB      |      512 KB |       234        |       432       |
|       4 MB      |        1 MB |       251        |       474       |
|       4 MB      |        2 MB |       257        |       504       |
|                 |             |                  |                 |
|       8 MB      |      256 KB |       239        |       420       |
|       8 MB      |      512 KB |       256        |       431       |
|       8 MB      |        1 MB |       264        |       467       |
|       8 MB      |        2 MB |       264        |       502       |
|       8 MB      |        4 MB |       281        |       537       |
+-----------------+-------------+------------------+-----------------+

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:24:16 -04:00
Mike Snitzer
3ee25485ba dm snapshot: fix oversights in optional discard support
__find_snapshots_sharing_cow() should always be used with _origins_lock
held so fix snapshot_io_hints() accordingly.  Also, once a snapshot is
being merged discards must not be allowed -- otherwise incorrect or
duplicate work will be performed.

Fixes: 2e6023850e177d ("dm snapshot: add optional discard support features")
Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:12:30 -04:00
Damien Le Moal
3b8cafdd54 dm zoned: fix zone state management race
dm-zoned uses the zone flag DMZ_ACTIVE to indicate that a zone of the
backend device is being actively read or written and so cannot be
reclaimed. This flag is set as long as the zone atomic reference
counter is not 0. When this atomic is decremented and reaches 0 (e.g.
on BIO completion), the active flag is cleared and set again whenever
the zone is reused and BIO issued with the atomic counter incremented.
These 2 operations (atomic inc/dec and flag set/clear) are however not
always executed atomically under the target metadata mutex lock and
this causes the warning:

WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));

in dmz_deactivate_zone() to be displayed. This problem is regularly
triggered with xfstests generic/209, generic/300, generic/451 and
xfs/077 with XFS being used as the file system on the dm-zoned target
device. Similarly, xfstests ext4/303, ext4/304, generic/209 and
generic/300 trigger the warning with ext4 use.

This problem can be easily fixed by simply removing the DMZ_ACTIVE flag
and managing the "ACTIVE" state by directly looking at the reference
counter value. To do so, the functions dmz_activate_zone() and
dmz_deactivate_zone() are changed to inline functions respectively
calling atomic_inc() and atomic_dec(), while the dmz_is_active() macro
is changed to an inline function calling atomic_read().

Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:03:48 -04:00
Linus Torvalds
c309b6f242 docs conversion for v5.3-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAl0tpocACgkQCF8+vY7k
 4RWoxA//b/fmDXP3WPzrjjSmpyB9ml0/epKzPbT5S2j0lftqKBmet29k+PCjVrTx
 Nq2QauehY9ug5h8UMVUCmzPr95F0tSIGRoqk1vrn7z0K3q6k1SHrtvqbY1Bgb2Uk
 Qvh2YFU4fQLJg8WAbExCjxCdbdmBKQVGKTwCtM+tP5OMxwAFOmQrjGaUaKCKIIA2
 7Wzrx8CpSji+bJ3uK/d36c+4M9oDly5eaxBhoboL3BI0y+GqwiSASGwTO7BxrPOg
 0wq5IZHnqS8+bprT9xQdDOqf+UOY9U1cxE/+sqsHxblfUEx9gfLy/R+FLmJn+SS9
 Z3yLy4SqVHQMpWBjEAGodohikF60PAuTdymSC11jqFaKCUxWrIZg5xO+0blMrxPF
 7vYIexutCkaBMHBlNaNsHIqB7B/2FGGKoN7QW64hwvwJCGvF7OmJcV+R4bROGvh4
 nFuis9/Nm66Fq7I3aw37ThyZ0aWZdaQ0QJTH9ksxU/ZCz2hhMNYu/rXggrDvkS4U
 nr77ZT5Gd7nj4b110zf8+99uiGiinY6hTfzPAuTCLBhaxwrv4/xDHAhpwdEB5T4j
 8gOkxV8c0XWtL7sKqhGJvs/RRe2za0Y9XH6fyxsYfWcfuLjEvug8ouXMad9gxFWH
 DL3WnKJEMGLScei2wux4kGOwEbkR1bUf2cHJfh3GpCB/y8vgLOc=
 =smxY
 -----END PGP SIGNATURE-----

Merge tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull rst conversion of docs from Mauro Carvalho Chehab:
 "As agreed with Jon, I'm sending this big series directly to you, c/c
  him, as this series required a special care, in order to avoid
  conflicts with other trees"

* tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
  docs: kbuild: fix build with pdf and fix some minor issues
  docs: block: fix pdf output
  docs: arm: fix a breakage with pdf output
  docs: don't use nested tables
  docs: gpio: add sysfs interface to the admin-guide
  docs: locking: add it to the main index
  docs: add some directories to the main documentation index
  docs: add SPDX tags to new index files
  docs: add a memory-devices subdir to driver-api
  docs: phy: place documentation under driver-api
  docs: serial: move it to the driver-api
  docs: driver-api: add remaining converted dirs to it
  docs: driver-api: add xilinx driver API documentation
  docs: driver-api: add a series of orphaned documents
  docs: admin-guide: add a series of orphaned documents
  docs: cgroup-v1: add it to the admin-guide book
  docs: aoe: add it to the driver-api book
  docs: add some documentation dirs to the driver-api book
  docs: driver-model: move it to the driver-api book
  docs: lp855x-driver.rst: add it to the driver-api book
  ...
2019-07-16 12:21:41 -07:00
Linus Torvalds
9637d51734 for-linus-20190715
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0s1ZEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiCEEACE9H/pXoegTTWIVPVajMlsa19UHIeilk4N
 GI7oKSiirQEMZnAOmrEzgB4/0zyYQsVypys0gZlYUD3GJVsXDT3zzjNXL5NpVg/O
 nqwSGWMHBSjWkLbaM40Pb2QLXsYgveptNL+9PtxrgtoYPoT5/+TyrJMFrRfi72EK
 WFeNDKOu6aJxpJ26JSsckJ0gluKeeEpRoEqsgHGIwaMIGHQf+b+ikk7tel5FAIgA
 uDwwD+Oxsdgh/ChsXL0d90GkcbcSp6GQ7GybxVmw/tPijx6mpeIY72xY3Zx+t8zF
 b71UNk6NmCKjOPO/6fiuYKKTYw+KhzlyEKO0j675HKfx2AhchEwKw0irp4yUlydA
 zxWYmz4U7iRgktJtymv3J4FEQQ3S6d1EnuQkQNX1LwiOsEsfzhkWi+7jy7KFhZoJ
 AqtYzqnOXvLx92q0vloj06HtK6zo+I/MINldy0+qn9lq0N0VF+dctyztAHLsF7P6
 pUtS6i7l1JSFKAmMhC31sIj5TImaehM2e/TWMUPEDZaO96oKCmQwOF1oiloc6vlW
 h4xWsxP/9zOFcWNyPzy6Vo3JUXWRvFA7K+jV3Hsukw6rVHiNCGVYGSlTv8Roi5b7
 I4ggu9R2JOGyku7UIlL50IRxEyjAp11LaO8yHhcCnRB65rmyBuNMQNcfOsfxpZ5Y
 1mtSNhm5TQ==
 =g8xI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190715' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
 "A later pull request with some followup items. I had some vacation
  coming up to the merge window, so certain things items were delayed a
  bit. This pull request also contains fixes that came in within the
  last few days of the merge window, which I didn't want to push right
  before sending you a pull request.

  This contains:

   - NVMe pull request, mostly fixes, but also a few minor items on the
     feature side that were timing constrained (Christoph et al)

   - Report zones fixes (Damien)

   - Removal of dead code (Damien)

   - Turn on cgroup psi memstall (Josef)

   - block cgroup MAINTAINERS entry (Konstantin)

   - Flush init fix (Josef)

   - blk-throttle low iops timing fix (Konstantin)

   - nbd resize fixes (Mike)

   - nbd 0 blocksize crash fix (Xiubo)

   - block integrity error leak fix (Wenwen)

   - blk-cgroup writeback and priority inheritance fixes (Tejun)"

* tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits)
  MAINTAINERS: add entry for block io cgroup
  null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED
  block: Limit zone array allocation size
  sd_zbc: Fix report zones buffer allocation
  block: Kill gfp_t argument of blkdev_report_zones()
  block: Allow mapping of vmalloc-ed buffers
  block/bio-integrity: fix a memory leak bug
  nvme: fix NULL deref for fabrics options
  nbd: add netlink reconfigure resize support
  nbd: fix crash when the blksize is zero
  block: Disable write plugging for zoned block devices
  block: Fix elevator name declaration
  block: Remove unused definitions
  nvme: fix regression upon hot device removal and insertion
  blk-throttle: fix zero wait time for iops throttled group
  block: Fix potential overflow in blk_report_zones()
  blkcg: implement REQ_CGROUP_PUNT
  blkcg, writeback: Implement wbc_blkcg_css()
  blkcg, writeback: Add wbc->no_cgroup_owner
  blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
  ...
2019-07-15 21:20:52 -07:00
Mauro Carvalho Chehab
6cf2a73cb2 docs: device-mapper: move it to the admin-guide
The DM support describes lots of aspects related to mapped
disk partitions from the userspace PoV.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-07-15 11:03:01 -03:00
Linus Torvalds
a1240cf74e Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
 "This includes changes to let percpu_ref release the backing percpu
  memory earlier after it has been switched to atomic in cases where the
  percpu ref is not revived.

  This will help recycle percpu memory earlier in cases where the
  refcounts are pinned for prolonged periods of time"

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  percpu_ref: release percpu memory early without PERCPU_REF_ALLOW_REINIT
  md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
  io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
  percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag
2019-07-14 16:17:18 -07:00
Linus Torvalds
2260840592 - Add encrypted byte-offset initialization vector (eboiv) to DM crypt.
- Add optional discard features to DM snapshot which allow freeing space
   from a DM device whose free space was exhausted.
 
 - Various small improvements to use struct_size() and kzalloc().
 
 - Fix to check if DM thin metadata is in fail_io mode before attempting
   to update the superblock to set the needs_check flag.  Otherwise the
   DM thin-pool can hang.
 
 - Fix DM bufio shrinker's potential for ABBA recursion deadlock with DM
   thin provisioning on loop usecase.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl0o0/YTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWiG3B/wI9//FMbHHd9KboFdDQpBNGKaYEIa+
 ZQCPRghzvODBW416yujC1xlIA4bdYyVcQ1wPqCqCDJhXndaDUpMzyRxnPTI0zm4U
 PTZNmWuXO3SmSv7QuHgaCuMIWXIvyOcGLHEb5wqWZJMZ+t4Hf14RrwWQ19d98/hO
 ff7MO70h8sAlFb8lMv6Mxa/TU8O7FoE3EBssfNOF8kHfdFNZnvrOSTvBRhmFTXPQ
 P5RsgTC2KSo8bt5lqqpcMa3XTolx+CE3Dww1SaOFNU+jM4P6n6HUTHeNDcLuyYSc
 JlaV19nFMrarTwzVbyJJqiJwlZzlH/J5arplytg5TldE37EPcl8lHuaU
 =2oWT
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Add encrypted byte-offset initialization vector (eboiv) to DM crypt.

 - Add optional discard features to DM snapshot which allow freeing
   space from a DM device whose free space was exhausted.

 - Various small improvements to use struct_size() and kzalloc().

 - Fix to check if DM thin metadata is in fail_io mode before attempting
   to update the superblock to set the needs_check flag. Otherwise the
   DM thin-pool can hang.

 - Fix DM bufio shrinker's potential for ABBA recursion deadlock with DM
   thin provisioning on loop usecase.

* tag 'for-5.3/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm bufio: fix deadlock with loop device
  dm snapshot: add optional discard support features
  dm crypt: implement eboiv - encrypted byte-offset initialization vector
  dm crypt: remove obsolete comment about plumb IV
  dm crypt: wipe private IV struct after key invalid flag is set
  dm integrity: use kzalloc() instead of kmalloc() + memset()
  dm: update stale comment in end_clone_bio()
  dm log writes: fix incorrect comment about the logged sequence example
  dm log writes: use struct_size() to calculate size of pending_block
  dm crypt: use struct_size() when allocating encryption context
  dm integrity: always set version on superblock update
  dm thin metadata: check if in fail_io mode when setting needs_check
2019-07-13 15:24:31 -07:00
Junxiao Bi
bd293d071f dm bufio: fix deadlock with loop device
When thin-volume is built on loop device, if available memory is low,
the following deadlock can be triggered:

One process P1 allocates memory with GFP_FS flag, direct alloc fails,
memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
IO to complete in __try_evict_buffer().

But this IO may never complete if issued to an underlying loop device
that forwards it using direct-IO, which allocates memory using
GFP_KERNEL (see: do_blockdev_direct_IO()).  If allocation fails, memory
reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
will be invoked, and since the mutex is already held by P1 the loop
thread will hang, and IO will never complete.  Resulting in ABBA
deadlock.

Cc: stable@vger.kernel.org
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-12 09:59:37 -04:00
Mike Snitzer
2e6023850e dm snapshot: add optional discard support features
discard_zeroes_cow - a discard issued to the snapshot device that maps
to entire chunks to will zero the corresponding exception(s) in the
snapshot's exception store.

discard_passdown_origin - a discard to the snapshot device is passed down
to the snapshot-origin's underlying device.  This doesn't cause copy-out
to the snapshot exception store because the snapshot-origin target is
bypassed.

The discard_passdown_origin feature depends on the discard_zeroes_cow
feature being enabled.

When these 2 features are enabled they allow a temporarily read-only
device that has completely exhausted its free space to recover space.
To do so dm-snapshot provides temporary buffer to accommodate writes
that the temporarily read-only device cannot handle yet.  Once the upper
layer frees space (e.g. fstrim to XFS) the discards issued to the
dm-snapshot target will be issued to underlying read-only device whose
free space was exhausted.  In addition those discards will also cause
zeroes to be written to the snapshot exception store if corresponding
exceptions exist.  If the underlying origin device provides
deduplication for zero blocks then if/when the snapshot is merged backed
to the origin those blocks will become unused.  Once the origin has
gained adequate space, merging the snapshot back to the thinly
provisioned device will permit continued use of that device without the
temporary space provided by the snapshot.

Requested-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-12 09:59:36 -04:00
Damien Le Moal
bd976e5272 block: Kill gfp_t argument of blkdev_report_zones()
Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
preparation of using vmalloc() for large report buffer and zone array
allocations used by this function, remove its "gfp_t gfp_mask" argument
and rely on the caller context to use memalloc_noio_save/restore() where
necessary (block layer zone revalidation and dm-zoned I/O error path).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-11 20:04:37 -06:00
Linus Torvalds
028db3e290 Revert "Merge tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs"
This reverts merge 0f75ef6a9cff49ff612f7ce0578bced9d0b38325 (and thus
effectively commits

   7a1ade847596 ("keys: Provide KEYCTL_GRANT_PERMISSION")
   2e12256b9a76 ("keys: Replace uid/gid/perm permissions checking with an ACL")

that the merge brought in).

It turns out that it breaks booting with an encrypted volume, and Eric
biggers reports that it also breaks the fscrypt tests [1] and loading of
in-kernel X.509 certificates [2].

The root cause of all the breakage is likely the same, but David Howells
is off email so rather than try to work it out it's getting reverted in
order to not impact the rest of the merge window.

 [1] https://lore.kernel.org/lkml/20190710011559.GA7973@sol.localdomain/
 [2] https://lore.kernel.org/lkml/20190710013225.GB7973@sol.localdomain/

Link: https://lore.kernel.org/lkml/CAHk-=wjxoeMJfeBahnWH=9zShKp2bsVy527vo3_y8HfOdhwAAw@mail.gmail.com/
Reported-by: Eric Biggers <ebiggers@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-10 18:43:43 -07:00
Linus Torvalds
e9a83bd232 It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro.  These create more
    than the usual number of simple but annoying merge conflicts with other
    trees, unfortunately.  He has a lot more of these waiting on the wings
    that, I think, will go to you directly later on.
 
  - A new document on how to use merges and rebases in kernel repos, and one
    on Spectre vulnerabilities.
 
  - Various improvements to the build system, including automatic markup of
    function() references because some people, for reasons I will never
    understand, were of the opinion that :c:func:``function()`` is
    unattractive and not fun to type.
 
  - We now recommend using sphinx 1.7, but still support back to 1.4.
 
  - Lots of smaller improvements, warning fixes, typo fixes, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl0krAEPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Yg98H/AuLqO9LpOgUjF4LhyjxGPdzJkY9RExSJ7km
 gznyreLCZgFaJR+AY6YDsd4Jw6OJlPbu1YM/Qo3C3WrZVFVhgL/s2ebvBgCo50A8
 raAFd8jTf4/mGCHnAqRotAPQ3mETJUk315B66lBJ6Oc+YdpRhwXWq8ZW2bJxInFF
 3HDvoFgMf0KhLuMHUkkL0u3fxH1iA+KvDu8diPbJYFjOdOWENz/CV8wqdVkXRSEW
 DJxIq89h/7d+hIG3d1I7Nw+gibGsAdjSjKv4eRKauZs4Aoxd1Gpl62z0JNk6aT3m
 dtq4joLdwScydonXROD/Twn2jsu4xYTrPwVzChomElMowW/ZBBY=
 =D0eO
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.3' of git://git.lwn.net/linux

Pull Documentation updates from Jonathan Corbet:
 "It's been a relatively busy cycle for docs:

   - A fair pile of RST conversions, many from Mauro. These create more
     than the usual number of simple but annoying merge conflicts with
     other trees, unfortunately. He has a lot more of these waiting on
     the wings that, I think, will go to you directly later on.

   - A new document on how to use merges and rebases in kernel repos,
     and one on Spectre vulnerabilities.

   - Various improvements to the build system, including automatic
     markup of function() references because some people, for reasons I
     will never understand, were of the opinion that
     :c:func:``function()`` is unattractive and not fun to type.

   - We now recommend using sphinx 1.7, but still support back to 1.4.

   - Lots of smaller improvements, warning fixes, typo fixes, etc"

* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
  docs: automarkup.py: ignore exceptions when seeking for xrefs
  docs: Move binderfs to admin-guide
  Disable Sphinx SmartyPants in HTML output
  doc: RCU callback locks need only _bh, not necessarily _irq
  docs: format kernel-parameters -- as code
  Doc : doc-guide : Fix a typo
  platform: x86: get rid of a non-existent document
  Add the RCU docs to the core-api manual
  Documentation: RCU: Add TOC tree hooks
  Documentation: RCU: Rename txt files to rst
  Documentation: RCU: Convert RCU UP systems to reST
  Documentation: RCU: Convert RCU linked list to reST
  Documentation: RCU: Convert RCU basic concepts to reST
  docs: filesystems: Remove uneeded .rst extension on toctables
  scripts/sphinx-pre-install: fix out-of-tree build
  docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
  Documentation: PGP: update for newer HW devices
  Documentation: Add section about CPU vulnerabilities for Spectre
  Documentation: platform: Delete x86-laptop-drivers.txt
  docs: Note that :c:func: should no longer be used
  ...
2019-07-09 12:34:26 -07:00
Milan Broz
b9411d73bd dm crypt: implement eboiv - encrypted byte-offset initialization vector
This IV is used in some BitLocker devices with CBC encryption mode.

IV is encrypted little-endian byte-offset (with the same key and cipher
as the volume).

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-09 14:14:23 -04:00