5028 Commits

Author SHA1 Message Date
Mikulas Patocka
93e6442c76 dm: add basic support for using the select or poll function
Add the ability to poll on the /dev/mapper/control device.  The select
or poll function waits until any event happens on any dm device since
opening the /dev/mapper/control device.  When select or poll returns the
device as readable, we must close and reopen the device to wait for new
dm events.

Usage:
1. open the /dev/mapper/control device
2. scan the event numbers of all devices we are interested in and process
   them
3. call select, poll or epoll on the handle (it waits until some new event
   happens since opening the device)
4. close the /dev/mapper/control handle
5. go to step 1

The next commit allows to re-arm the polling without closing and
reopening the device.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-06-19 11:03:49 -04:00
Ming Lei
f660174e8b blk-mq: use the introduced blk_mq_unquiesce_queue()
blk_mq_unquiesce_queue() is used for unquiescing the
queue explicitly, so replace blk_mq_start_stopped_hw_queues()
with it.

For the scsi part, this patch takes Bart's suggestion to
switch to block quiesce/unquiesce API completely.

Cc: linux-nvme@lists.infradead.org
Cc: linux-scsi@vger.kernel.org
Cc: dm-devel@redhat.com
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 14:20:04 -06:00
NeilBrown
9b10f6a9c2 block: remove bio_clone() and all references.
bio_clone() is no longer used.
Only bio_clone_bioset() or bio_clone_fast().
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().

So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 12:40:59 -06:00
NeilBrown
5a136fdf5a bcache: use kmalloc to allocate bio in bch_data_verify()
This function allocates a bio, then a collection
of pages.  It copes with failure.

It currently uses a mempool() to allocate the bio,
but alloc_page() to allocate the pages.  These fail
in different ways, so the usage is inconsistent.

Change the bio_clone() to bio_clone_kmalloc()
so that no pool is used either for the bio or the pages.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by : Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 12:40:59 -06:00
NeilBrown
47e0fb461f blk: make the bioset rescue_workqueue optional.
This patch converts bioset_create() to not create a workqueue by
default, so alloctions will never trigger punt_bios_to_rescuer().  It
also introduces a new flag BIOSET_NEED_RESCUER which tells
bioset_create() to preserve the old behavior.

All callers of bioset_create() that are inside block device drivers,
are given the BIOSET_NEED_RESCUER flag.

biosets used by filesystems or other top-level users do not
need rescuing as the bio can never be queued behind other
bios.  This includes fs_bio_set, blkdev_dio_pool,
btrfs_bioset, xfs_ioend_bioset, and one allocated by
target_core_iblock.c.

biosets used by md/raid do not need rescuing as
their usage was recently audited and revised to never
risk deadlock.

It is hoped that most, if not all, of the remaining biosets
can end up being the non-rescued version.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes)
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 12:40:59 -06:00
NeilBrown
011067b056 blk: replace bioset_create_nobvec() with a flags arg to bioset_create()
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 12:40:59 -06:00
NeilBrown
af67c31fba blk: remove bio_set arg from blk_queue_split()
blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary.  Remove the last arg and always use
q->bio_split inside blk_queue_split()

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed)
Reviewed-by: Javier González <javier@cnexlabs.com>
Tested-by: Javier González <javier@cnexlabs.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 12:40:59 -06:00
Lidong Zhong
8df7202439 md: change the initialization value for a spare device spot to MD_DISK_ROLE_SPARE
The value for spare spot of sb->dev_roles is changed from
MD_DISK_ROLE_FAULTY to MD_DISK_ROLE_SPARE to keep align
with the value when the superblock is firstly created in
userspace.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-16 12:04:09 -07:00
Guoqing Jiang
037d2ff62c md/raid1: remove unused bio in sync_request_write
The "bio" is not used in sync_request_write after commit a68e58703575
("md/raid1: split out two sub-functions from sync_request_write").

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-16 12:04:09 -07:00
Guoqing Jiang
1cdd125794 md/raid10: fix FailFast test for wrong device
We need to test FailFast flag for replacement device here
since the set up for writing is for the replacement, so we
need fix it like:

- if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
+ if (test_bit(FailFast, &conf->mirrors[d].replacement->flags))

Since commit f90145f317ef ("md/raid10: add rcu protection
to rdev access in raid10_sync_request.") had added the rcu
protection for the part, so let's extend the range protected
by rcu and use rdev directly.

Fixes: 1919cbb ("md/raid10: add failfast handling for writes.")
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-16 12:04:08 -07:00
Jens Axboe
c27b2d634f Merge branch 'nvme-4.13' of git://git.infradead.org/nvme into for-4.13/block
Pull NVMe changes for 4.13 from Christoph:

Highlights:

 - UUID identifier support from Johannes
 - Lots of cleanups from Sagi
 - Host Memory Buffer support from me

And lots of cleanups and smaller fixes of course.

Note that the UUID identifier changes are based on top of the uuid tree.
I am the maintainer of that tree and will send it to Linus as soon as
4.12 is released as various other trees depend on it as well (and the
diffstat includes those changes unfortunately)
2017-06-16 10:14:59 -06:00
Dan Williams
abebfbe2f7 dm: add ->flush() dax operation support
Allow device-mapper to route flush operations to the
per-target implementation. In order for the device stacking to work we
need a dax_dev and a pgoff relative to that device. This gives each
layer of the stack the information it needs to look up the operation
pointer for the next level.

This conceptually allows for an array of mixed device drivers with
varying flush implementations.

Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-15 14:34:59 -07:00
Mike Snitzer
cd15fb64ee Revert "dm mirror: use all available legs on multiple failures"
This reverts commit 12a7cf5ba6c776a2621d8972c7d42e8d3d959d20.

This commit apparently attempted to fix an issue that didn't really
exist, furthermore: this commit is the source of deadlocks and crashes
seen in multiple cases related to failing the primary mirror dev while
syncing.

Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-06-15 08:39:15 -04:00
Dan Carpenter
047385b3dd dm: missing break in process_queued_bios()
his used to be a fall through case, but we shifted code around and I
think we want a break here now.

Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t")

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-14 08:22:55 -06:00
Mikulas Patocka
f9c79bc05a md: don't use flush_signals in userspace processes
The function flush_signals clears all pending signals for the process. It
may be used by kernel threads when we need to prepare a kernel thread for
responding to signals. However using this function for an userspaces
processes is incorrect - clearing signals without the program expecting it
can cause misbehavior.

The raid1 and raid5 code uses flush_signals in its request routine because
it wants to prepare for an interruptible wait. This patch drops
flush_signals and uses sigprocmask instead to block all signals (including
SIGKILL) around the schedule() call. The signals are not lost, but the
schedule() call won't respond to them.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-13 10:18:02 -07:00
NeilBrown
cc27b0c78c md: fix deadlock between mddev_suspend() and md_write_start()
If mddev_suspend() races with md_write_start() we can deadlock
with mddev_suspend() waiting for the request that is currently
in md_write_start() to complete the ->make_request() call,
and md_write_start() waiting for the metadata to be updated
to mark the array as 'dirty'.
As metadata updates done by md_check_recovery() only happen then
the mddev_lock() can be claimed, and as mddev_suspend() is often
called with the lock held, these threads wait indefinitely for each
other.

We fix this by having md_write_start() abort if mddev_suspend()
is happening, and ->make_request() aborts if md_write_start()
aborted.
md_make_request() can detect this abort, decrease the ->active_io
count, and wait for mddev_suspend().

Reported-by: Nix <nix@esperi.org.uk>
Fix: 68866e425be2(MD: no sync IO while suspended)
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-13 10:18:01 -07:00
Christoph Hellwig
fdd050b5b3 Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base 2017-06-13 11:45:14 +02:00
Ondrej Mosnáček
2ad50606f8 dm integrity: reject mappings too large for device
dm-integrity would successfully create mappings with the number of
sectors greater than the provided data sector count.  Attempts to read
sectors of this mapping that were beyond the provided data sector count
would then yield run-time messages of the form "device-mapper:
integrity: Too big sector number: ...".

Fix this by emitting an error when the requested mapping size is bigger
than the provided data sector count.

Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-06-12 17:05:55 -04:00
Jens Axboe
8f66439eec Linux 4.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
 biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
 kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
 4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
 2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
 W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
 =m2Gx
 -----END PGP SIGNATURE-----

Merge tag 'v4.12-rc5' into for-4.13/block

We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.

Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-12 08:30:13 -06:00
Dan Williams
7e026c8c0a dm: add ->copy_from_iter() dax operation support
Allow device-mapper to route copy_from_iter operations to the
per-target implementation. In order for the device stacking to work we
need a dax_dev and a pgoff relative to that device. This gives each
layer of the stack the information it needs to look up the operation
pointer for the next level.

This conceptually allows for an array of mixed device drivers with
varying copy_from_iter implementations.

Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-06-09 09:22:21 -07:00
Christoph Hellwig
4e4cbee93d block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
fc17b6534e blk-mq: switch ->queue_rq return value to blk_status_t
Use the same values for use for request completion errors as the return
value from ->queue_rq.  BLK_STS_RESOURCE is special cased to cause
a requeue, and all the others are completed as-is.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
2a842acab1 block: introduce new block status code type
Currently we use nornal Linux errno values in the block layer, and while
we accept any error a few have overloaded magic meanings.  This patch
instead introduces a new  blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning.  Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have a
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return somethings that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.

For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low hanging
fruite to improve it.

blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
1be5690984 dm: change ->end_io calling convention
Turn the error paramter into a pointer so that target drivers can change
the value, and make sure only DM_ENDIO_* values are returned from the
methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
846785e6a5 dm: don't return errnos from ->map
Instead use the special DM_MAPIO_KILL return value to return -EIO just
like we do for the request based path.  Note that dm-log-writes returned
-ENOMEM in a few places, which now becomes -EIO instead.  No consumer
treats -ENOMEM special so this shouldn't be an issue (and it should
use a mempool to start with to make guaranteed progress).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
14ef1e4826 dm mpath: merge do_end_io_bio into multipath_end_io_bio
This simplifies the code and especially the error passing a bit and
will help with the next patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig
9966afaf91 dm: fix REQ_RAHEAD handling
A few (but not all) dm targets use a special EWOULDBLOCK error code for
failing REQ_RAHEAD requests that fail due to a lack of available resources.
But no one else knows about this magic code, and lower level drivers also
don't generate it when failing read-ahead requests for similar reasons.

So remove this special casing and ignore all additional error handling for
REQ_RAHEAD - if this was a real underlying error we'd get a normal read
once the real read comes in.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
NeilBrown
a415c0f106 md: initialise ->writes_pending in personality modules.
The new per-cpu counter for writes_pending is initialised in
md_alloc(), which is not called by dm-raid.
So dm-raid fails when md_write_start() is called.

Move the initialization to the personality modules
that need it.  This way it is always initialised when needed,
but isn't unnecessarily initialized (requiring memory allocation)
when the personality doesn't use writes_pending.

Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-06-05 16:04:35 -07:00
Amir Goldstein
e6fd2093a8 md: namespace private helper names
The md private helper uuid_equal() collides with a generic helper
of the same name.

Rename the md private helper to md_uuid_equal() and do the same for
md_sb_equal().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaohua Li <shli@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:56:37 +02:00
Linus Torvalds
60c42a31dc - a DM verity fix for a mode when no salt is used
- a fix to DM to account for the possibility that PREFLUSH or FUA are
   used without the SYNC flag if the underlying storage doesn't have a
   volatile write-cache
 
 - a DM ioctl memory allocation flag fix to use __GFP_HIGH to allow
   emergency forward progress (by using memory reserves as last resort)
 
 - a small DM integrity cleanup to use kvmalloc() instead of duplicating
   the same
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJZMZqnAAoJEMUj8QotnQNaB3UIAKDWS1bAAuPR6iiIdlffmZzf
 DxbCkpgCIHcmuihNKm+57P+Vwa7a5OFb7LSZhU7p2ormncrzEmLRP4D5twtyxTNt
 f1zT85gB3FXZVRDbcOP7yx2JuZdimu41kYBc9PaQuyTaNUl8lisnQhTVuJ9m0VO3
 jdgQpSD0jQPhSFd/rA4hi75l2MeKwOJA2cXsfBgSxIW+rflYFi7SoNfG7wQattpU
 ArdDEgU++lRuk9FaGVHOIgTduhutk8NtRQFmxXZlE10KEf05SbBLVr1EGgKLVnyC
 72+NI8OKeqic9R4phyWNZhErWnteXy9bbQtvBOpEIs8byFY176byzBkGBXjHebE=
 =MLRg
 -----END PGP SIGNATURE-----

Merge tag 'for-4.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - a DM verity fix for a mode when no salt is used

 - a fix to DM to account for the possibility that PREFLUSH or FUA are
   used without the SYNC flag if the underlying storage doesn't have a
   volatile write-cache

 - a DM ioctl memory allocation flag fix to use __GFP_HIGH to allow
   emergency forward progress (by using memory reserves as last resort)

 - a small DM integrity cleanup to use kvmalloc() instead of duplicating
   the same

* tag 'for-4.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: make flush bios explicitly sync
  dm ioctl: restore __GFP_HIGH in copy_params()
  dm integrity: use kvmalloc() instead of dm_integrity_kvmalloc()
  dm verity: fix no salt use case
2017-06-02 11:50:37 -07:00
Jan Kara
5a8948f8a3 md: Make flush bios explicitely sync
Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

CC: linux-raid@vger.kernel.org
CC: Shaohua Li <shli@kernel.org>
Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-31 09:25:53 -07:00
Jan Kara
ff0361b34a dm: make flush bios explicitly sync
Commit b685d3d65ac7 ("block: treat REQ_FUA and REQ_PREFLUSH as
synchronous") removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions.

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

Fixes: b685d3d65ac7 ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous")
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-31 10:50:23 -04:00
Nix
e153903686 md: report sector of stripes with check mismatches
This makes it possible, with appropriate filesystem support, for a
sysadmin to tell what is affected by the mismatch, and whether
it should be ignored (if it's inside a swap partition, for
instance).

We ratelimit to prevent log flooding: if there are so many
mismatches that ratelimiting is necessary, the individual messages
are relatively unlikely to be important (either the machine is
swapping like crazy or something is very wrong with the disk).

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-24 16:31:13 -07:00
Kyungchan Koh
4179bc30b2 md: uuid debug statement now in processor byte order.
Previously, the uuid debug statements were printed in little-endian
format, which wasn't consistent in machines that might not be in
little-endian byte order. With this change, the output will be
consistent for all machines with different byte-ordering.

Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-24 15:58:43 -07:00
Junaid Shahid
8c1e2162f2 dm ioctl: restore __GFP_HIGH in copy_params()
Commit d224e9381897 ("drivers/md/dm-ioctl.c: use kvmalloc rather than
opencoded variant") left out the __GFP_HIGH flag when converting from
__vmalloc to kvmalloc.  This can cause the DM ioctl to fail in some low
memory situations where it wouldn't have failed earlier.  Add __GFP_HIGH
back to avoid any potential regression.

Fixes: d224e9381897 ("drivers/md/dm-ioctl.c: use kvmalloc rather than opencoded variant")
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-22 19:30:03 -04:00
Mikulas Patocka
702a6204f8 dm integrity: use kvmalloc() instead of dm_integrity_kvmalloc()
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-22 14:09:52 -04:00
Gilad Ben-Yossef
f52236e0b0 dm verity: fix no salt use case
DM-Verity has an (undocumented) mode where no salt is used.  This was
never handled directly by the DM-Verity code, instead working due to the
fact that calling crypto_shash_update() with a zero length data is an
implicit noop.

This is no longer the case now that we have switched to
crypto_ahash_update().  Fix the issue by introducing explicit handling
of the no salt use case to DM-Verity.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Reported-by: Marian Csontos <mcsontos@redhat.com>
Fixes: d1ac3ff ("dm verity: switch to using asynchronous hash crypto API")
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-22 13:49:03 -04:00
Guoqing Jiang
2dffdc0724 md-cluster: fix potential lock issue in add_new_disk
The add_new_disk returns with communication locked if
__sendmsg returns failure, fix it with call unlock_comm
before return.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
CC: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-21 20:37:09 -07:00
Linus Torvalds
8b4822de59 Merge tag 'md/4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:

 - Several bug fixes for raid5-cache from Song Liu, mainly handle
   journal disk error

 - Fix bad block handling in choosing raid1 disk from Tomasz Majchrzak

 - Simplify external metadata array sysfs handling from Artur
   Paszkiewicz

 - Optimize raid0 discard handling from me, now raid0 will dispatch
   large discard IO directly to underlayer disks.

* tag 'md/4.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  raid1: prefer disk without bad blocks
  md/r5cache: handle sync with data in write back cache
  md/r5cache: gracefully handle journal device errors for writeback mode
  md/raid1/10: avoid unnecessary locking
  md/raid5-cache: in r5l_do_submit_io(), submit io->split_bio first
  md/md0: optimize raid0 discard handling
  md: don't return -EAGAIN in md_allow_write for external metadata arrays
  md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock
2017-05-18 12:04:41 -07:00
Colin Ian King
7e1b9521f5 dm cache: handle kmalloc failure allocating background_tracker struct
Currently there is no kmalloc failure check on the allocation of
the background_tracker struct in btracker_create(), and so a NULL return
will lead to a NULL pointer dereference.  Add a NULL check.

Detected by CoverityScan, CID#1416587 ("Dereference null return value")

Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-17 09:44:53 -04:00
Mikulas Patocka
13840d3801 dm bufio: make the parameter "retain_bytes" unsigned long
Change the type of the parameter "retain_bytes" from unsigned to
unsigned long, so that on 64-bit machines the user can set more than
4GiB of data to be retained.

Also, change the type of the variable "count" in the function
"__evict_old_buffers" to unsigned long.  The assignment
"count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];"
could result in unsigned long to unsigned overflow and that could result
in buffers not being freed when they should.

While at it, avoid division in get_retain_buffers().  Division is slow,
we can change it to shift because we have precalculated the log2 of
block size.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-16 15:12:08 -04:00
Christoph Hellwig
f98e0eb680 dm mpath: multipath_clone_and_map must not return -EIO
Since 412445ac ("dm: introduce a new DM_MAPIO_KILL return value"), the
clone_and_map_rq methods must not return errno values, so fix it up
to properly return DM_MAPIO_KILL, instead of the -EIO value that snuck
in due to a conflict between two patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-15 15:09:53 -04:00
Christoph Hellwig
18a482f524 dm mpath: don't return -EIO from dm_report_EIO
Instead just turn the macro into a helper for the warning message.
This removes an unnecessary assignment and will allow the next commit to
fix a place where -EIO is the wrong return value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-15 15:09:52 -04:00
Christoph Hellwig
ece0728037 dm rq: add a missing break to map_request
We don't want to bug when receiving a DM_MAPIO_KILL value..

Fixes: 412445ac ("dm: introduce a new DM_MAPIO_KILL return value")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-15 15:09:51 -04:00
Joe Thornber
0377a07c7a dm space map disk: fix some book keeping in the disk space map
When decrementing the reference count for a block, the free count wasn't
being updated if the reference count went to zero.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-15 15:09:50 -04:00
Joe Thornber
91bcdb92d3 dm thin metadata: call precommit before saving the roots
These calls were the wrong way round in __write_initial_superblock.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-15 15:09:49 -04:00
Joe Thornber
2e63309507 dm cache policy smq: don't do any writebacks unless IDLE
If there are no clean blocks to be demoted the writeback will be
triggered at that point.  Preemptively writing back can hurt high IO
load scenarios.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-14 21:54:33 -04:00
Joe Thornber
49b7f76890 dm cache: simplify the IDLE vs BUSY state calculation
Drop the MODERATE state since it wasn't buying us much.

Also, in check_migrations(), prepare for the next commit ("dm cache
policy smq: don't do any writebacks unless IDLE") by deferring to the
policy to make the final decision on whether writebacks can be
serviced.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-14 21:54:33 -04:00
Joe Thornber
701e03e4e1 dm cache: track all IO to the cache rather than just the origin device's IO
IO tracking used to throttle writebacks when the origin device is busy.

Even if all the IO is going to the fast device, writebacks can
significantly degrade performance.  So track all IO to gauge whether the
cache is busy or not.

Otherwise, synthetic IO tests (e.g. fio) that might send all IO to the
fast device wouldn't cause writebacks to get throttled.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-14 21:54:33 -04:00
Joe Thornber
6cf4cc8f8b dm cache policy smq: stop preemptively demoting blocks
It causes a lot of churn if the working set's size is close to the fast
device's size.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-05-14 21:54:33 -04:00