IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Journal device stores data in a log structure. We need record the log
start. Here we override md superblock recovery_offset for this purpose.
This field of a journal device is meaningless otherwise.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Next patches will use a disk as raid5/6 journaling. We need a new disk
role to present the journal device and add MD_FEATURE_JOURNAL to
feature_map for backward compability.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Add the following two macros for special roles: spare and faulty
MD_DISK_ROLE_SPARE 0xffff
MD_DISK_ROLE_FAULTY 0xfffe
Add MD_DISK_ROLE_MAX 0xff00 as the maximal possible regular role,
and minimal value of special role.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
To incorporate --grow feature executed on one node, other nodes need to
acknowledge the change in number of disks. Call update_raid_disks()
to update internal data structures.
This leads to call check_reshape() -> md_allow_write() -> md_update_sb(),
this results in a deadlock. This is done so it can safely allocate memory
(which might trigger writeback which might write to raid1). This is
not required for md with a bitmap.
In the clustered case, we don't perform md_update_sb() in md_allow_write(),
but in do_md_run(). Also we disable safemode for clustered mode.
mddev->recovery_cp need not be set in check_sb_changes() because this
is required only when a node reads another node's bitmap. mddev->recovery_cp
(which is read from sb->resync_offset), is set only if mddev is in_sync.
Since we disabled safemode, in_sync is set to zero.
In a clustered environment, the MD may not be in sync because another
node could be writing to it. So make sure that in_sync is not set in
case of clustered node in __md_stop_writes().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
This patches fixes sparse warnings like incorrect type in assignment
(different base types), cast to restricted __le64.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
In Linux 3.9 we introduce a new 'far' layout for RAID10 which was
supposed to rotate the replicas differently and so provide better
resilience. In particular it could survive more combinations of 2
drive failures.
Unfortunately. due to a coding error, this some did what was wanted,
sometimes improved less than we hoped, and sometimes - in very
unlikely circumstances - put multiple replicas on the same device so
the redundancy was harmed.
No public user-space tool has created arrays using this layout so it
is very unlikely that zero-redundancy arrays actually exist. Probably
no arrays using any form of the new layout exist. But we cannot be
certain.
So use another bit in the 'layout' number and introduce a bug-fixed
version of the layout.
Also when assembling an array, if it has a zero-redundancy layout,
give a warning.
Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
When a write fails and a bad-block-list is present, we can
update the bad-block-list instead of writing the data. If
this succeeds then it is OK clear the relevant bitmap-bit as
no further 'sync' of the block is needed.
However if writing the bad-block-list fails then we need to
treat the write as failed and particularly must not clear
the bitmap bit. Otherwise the device can be re-added (after
any hardware connection issues are resolved) and because the
relevant bit in the bitmap is clear, that block will not be
resynced. This leads to data corruption.
We already delay the final bio_endio() on the write until
the bad-block-list is written so that when the write
returns: either that data is safe, the bad-block record is
safe, or the fact that the device is faulty is safe.
However we *don't* delay the clearing of the bitmap, so the
bitmap bit can be recorded as cleared before we know if the
bad-block-list was written safely.
So: delay that until the write really is safe.
i.e. move the call to close_write() until just before
calling bio_endio(), and recheck the 'is array degraded'
status before making that call.
This bug goes back to v3.1 when bad-block-lists were
introduced, though it only affects arrays created with
mdadm-3.3 or later as only those have bad-block lists.
Backports will require at least
Commit: 95af587e95aa ("md/raid10: ensure device failure recorded before write request returns.")
as well. I'll send that to 'stable' separately.
Note that of the two tests of R10BIO_WriteError that this
patch adds, the first is certain to fail and the second is
certain to succeed. However doing it this way makes the
patch more obviously correct. I will tidy the code up in a
future merge window.
Reported-by: Nate Dailey <nate.dailey@stratus.com>
Fixes: bd870a16c594 ("md/raid10: Handle write errors by updating badblock log.")
Signed-off-by: NeilBrown <neilb@suse.com>
When a write fails and a bad-block-list is present, we can
update the bad-block-list instead of writing the data. If
this succeeds then it is OK clear the relevant bitmap-bit as
no further 'sync' of the block is needed.
However if writing the bad-block-list fails then we need to
treat the write as failed and particularly must not clear
the bitmap bit. Otherwise the device can be re-added (after
any hardware connection issues are resolved) and because the
relevant bit in the bitmap is clear, that block will not be
resynced. This leads to data corruption.
We already delay the final bio_endio() on the write until
the bad-block-list is written so that when the write
returns: either that data is safe, the bad-block record is
safe, or the fact that the device is faulty is safe.
However we *don't* delay the clearing of the bitmap, so the
bitmap bit can be recorded as cleared before we know if the
bad-block-list was written safely.
So: delay that until the write really is safe.
i.e. move the call to close_write() until just before
calling bio_endio(), and recheck the 'is array degraded'
status before making that call.
This bug goes back to v3.1 when bad-block-lists were
introduced, though it only affects arrays created with
mdadm-3.3 or later as only those have bad-block lists.
Backports will require at least
Commit: 55ce74d4bfe1 ("md/raid1: ensure device failure recorded before write request returns.")
as well. I'll send that to 'stable' separately.
Note that of the two tests of R1BIO_WriteError that this
patch adds, the first is certain to fail and the second is
certain to succeed. However doing it this way makes the
patch more obviously correct. I will tidy the code up in a
future merge window.
Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com>
Cc: Jes Sorensen <Jes.Sorensen@redhat.com>
Fixes: cd5ff9a16f08 ("md/raid1: Handle write errors by updating badblock log.")
Signed-off-by: NeilBrown <neilb@suse.com>
If the CLEAN_SHUTDOWN flag is not set when a cache is loaded then all cache
blocks are marked as dirty and a full writeback occurs.
__commit_transaction() is responsible for setting/clearing
CLEAN_SHUTDOWN (based the flags_mutator that is passed in).
Fix this issue, of the cache's on-disk flags being wrong, by making sure
__commit_transaction() does not reset the flags after the mutator has
altered the flags in preparation for them being serialized to disk.
before:
sb_flags = mutator(le32_to_cpu(disk_super->flags));
disk_super->flags = cpu_to_le32(sb_flags);
disk_super->flags = cpu_to_le32(cmd->flags);
after:
disk_super->flags = cpu_to_le32(cmd->flags);
sb_flags = mutator(le32_to_cpu(disk_super->flags));
disk_super->flags = cpu_to_le32(sb_flags);
Reported-by: Bogdan Vasiliev <bogdan.vasiliev@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
btree_split_beneath()'s error path had an outstanding FIXME that speaks
directly to the potential for _not_ cleaning up a previously allocated
bufio-backed block.
Fix this by releasing the previously allocated bufio block using
unlock_block().
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <thornber@redhat.com>
Cc: stable@vger.kernel.org
Commit 4c7e309340ff ("dm btree remove: fix bug in redistribute3") wasn't
a complete fix for redistribute3().
The redistribute3 function takes 3 btree nodes and shares out the entries
evenly between them. If the three nodes in total contained
(MAX_ENTRIES * 3) - 1 entries between them then this was erroneously getting
rebalanced as (MAX_ENTRIES - 1) on the left and right, and (MAX_ENTRIES + 1) in
the center.
Fix this issue by being more careful about calculating the target number
of entries for the left and right nodes.
Unit tested in userspace using this program:
https://github.com/jthornber/redistribute3-test/blob/master/redistribute3_t.c
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Synchronize pending i/o against a change in the integrity profile to
avoid the possibility of spurious integrity errors. Given linear_add()
is suspending the mddev before manipulating the mddev, do the same for
the other personalities.
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that the integrity profile is statically allocated there is no work
to do when shutting down an integrity enabled block device.
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <JBottomley@Odin.com>
Acked-by: NeilBrown <neilb@suse.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Up until now the_integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.
This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.
Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:
- Moving the blk_integrity definition to genhd.h.
- Inlining blk_integrity in struct gendisk.
- Removing the dynamic allocation code.
- Adding helper functions which allow gendisk to set up and tear down
the integrity sysfs dir when a disk is added/deleted.
- Adding a blk_integrity_revalidate() callback for updating the stable
pages bdi setting.
- The calls that depend on whether a device has an integrity profile or
not now key off of the bi->profile pointer.
- Simplifying the integrity support routines in DM (Mike Snitzer).
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This was introduced with 9e882242c6193ae6f416f2d8d8db0d9126bd996b
which changed the return value of submit_bio_wait() to return != 0 on
error, but didn't update the caller accordingly.
Fixes: 9e882242c6 ("block: Add submit_bio_wait(), remove from md")
Cc: stable@vger.kernel.org (v3.10)
Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
This was introduced with 9e882242c6193ae6f416f2d8d8db0d9126bd996b
which changed the return value of submit_bio_wait() to return != 0 on
error, but didn't update the caller accordingly.
Fixes: 9e882242c6 ("block: Add submit_bio_wait(), remove from md")
Cc: stable@vger.kernel.org (v3.10)
Reported-by: Bill Kuzeja <William.Kuzeja@stratus.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
As cmsg.raid_slot is le32, comparing for >0 is not meaningful.
So introduce cpu-endian 'raid_slot' and only assign to cmsg.raid_slot
when we know value is valid.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.com>
md-cluster: A better way for METADATA_UPDATED processing
The processing of METADATA_UPDATED message is too simple and prone to
errors. Besides, it would not update the internal data structures as
required.
This set of patches reads the superblock from one of the device of the MD
and checks for changes in the in-memory data structures. If there is a change,
it performs the necessary actions to keep the internal data structures
as it would be in the primary node.
An example is if a devices turns faulty. The algorithm is:
1. The initiator node marks the device as faulty and updates the superblock
2. The initiator node sends METADATA_UPDATED with an advisory device number to the rest of the nodes.
3. The receiving node on receiving the METADATA_UPDATED message
3.1 Reads the superblock
3.2 Detects a device has failed by comparing with memory structure
3.3 Calls the necessary functions to record the failure and get the device out of the active array.
3.4 Acknowledges the message.
The patch series also fixes adding the disk which was impacted because of
the changes.
Patches can also be found at
https://github.com/goldwynr/linux branch md-next
Changes since V2:
- Fix status synchrnoization after --add and --re-add operations
- Included Guoqing's patches on endian correctness, zeroing cmsg etc
- Restructure add_new_disk() and cancel()
If an unsupported option is given then the early return from
persistent_ctr() leaked memory allocated for the 'pstore' and never
destroyed the 'metadata_wq'.
Fixes: b0d3cc011e53 ("dm snapshot: add new persistent store option to support overflow")
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
For cluster raid, we should not kick it from array if the disk can't be
remove from array successfully.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
During the past test, the node occasionally received the msg which is
sent from itself, this case should not happen in theory, but it is
better to avoid it in case something wrong happened.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
The receive daemon prints kernel messages for every network message
received. This would fill the kernel message log with unnecessary messages.
Remove the pr_info() messages.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Adding the disk worked incorrectly with the new reload code. Fix it:
- No operation should be performed on rdev marked as Candidate
- After a metadata update operation, kick disk if role is 0xfffe
else clear Candidate bit and continue with the regular change check.
- Saving the mode of the lock resource to check if token lock is already
locked, because it can be called twice while adding a disk. However,
unlock_comm() must be called only once.
- add_new_disk() is called by the node initiating the --add operation.
If it needs to be canceled, call add_new_disk_cancel(). The operation
is completed by md_update_sb() which will write and unlock the
communication.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Resync or recovery must be performed by only one node at a time.
A DLM lock resource, resync_lockres provides the mutual exclusion
so that only one node performs the recovery/resync at a time.
If a node is unable to get the resync_lockres, because recovery is
being performed by another node, it set MD_RECOVER_NEEDED so as
to schedule recovery in the future.
Remove the debug message in resync_info_update()
used during development.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
In a clustered environment, a change such as marking a device faulty,
can be recorded by any of the nodes. This is communicated to all the
nodes and re-recording such a change is unnecessary, and quite often
pretty disruptive.
With this patch, just before the update, we detect for the changes
and if the changes are already in superblock, we abort the update
after clearing all the flags
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
md_reload_sb is too simplistic and it explicitly needs to determine
the changes made by the writing node. However, there are multiple areas
where a simple reload could fail.
Instead, read the superblock of one of the "good" rdevs and update
the necessary information:
- read the superblock into a newly allocated page, by temporarily
swapping out rdev->sb_page and calling ->load_super.
- if that fails return
- if it succeeds, call check_sb_changes
1. iterates over list of active devices and checks the matching
dev_roles[] value.
If that is 'faulty', the device must be marked as faulty
- call md_error to mark the device as faulty. Make sure
not to set CHANGE_DEVS and wakeup mddev->thread or else
it would initiate a resync process, which is the responsibility
of the "primary" node.
- clear the Blocked bit
- Call remove_and_add_spares() to hot remove the device.
If the device is 'spare':
- call remove_and_add_spares() to get the number of spares
added in this operation.
- Reduce mddev->degraded to mark the array as not degraded.
2. reset recovery_cp
- read the rest of the rdevs to update recovery_offset. If recovery_offset
is equal to MaxSector, call spare_active() to set it In_sync
This required that recovery_offset be initialized to MaxSector, as
opposed to zero so as to communicate the end of sync for a rdev.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
remove_and_add_spares() checks for all devices to activate spare.
Change it to activate a specific device if a non-null rdev
argument is passed.
remove_and_add_spares() can be used to activate spares in
slot_store() as well.
For hot_remove_disk(), check if rdev->raid_disk == -1 before
calling remove_and_add_spares()
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
When the suspended_area is deleted, the suspended processes
must be woken up in order to complete their I/O.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Previously, BITMAP_NEEDS_SYNC message is sent when the resyc
aborts, but it could abort for different reasons, and not all
of reasons require another node to take over the resync ownship.
It is better make BITMAP_NEEDS_SYNC message only be sent when
the node is leaving cluster with dirty bitmap. And we also need
to ensure dlm connection is ok.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Suspending the entire device for resync could take too long. Resync
in small chunks.
cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:
1. Set the cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window, if not issue a
wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it in their suspension
list.
bitmap_cond_end_sync is modified to allow to force a sync inorder
to get the curr_resync_completed uptodate with the sector passed.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels
to assemble a clustered device.
In order to maximize compatibility, the major version is set to
BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered.
Added MD_FEATURE_CLUSTERED in order to return error for older
kernels which would assemble MD even if the bitmap is corrupted.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
process_suspend_info - which handles the RESYNCING request - must not
reply until all writes which were initiated before the request arrived,
have completed.
As a by-product, all process_* functions now take mddev as their
first arguement making it uniform.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Very careless bug earler in 4.3-rc, now fixed :-)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWGYrdAAoJEDnsnt1WYoG56vcP/03E7DoycDQ7uH46i2szf3M/
isy0dk+2S9D87iBz/xf4RXvAY7FTHy9/vIG/o6UKYkhSRzm6T6xCbrwd+duS2TSc
yQM33BJ1VdM+trGj5ywrdF8guwRMjW4NFPnez16moVSVZDbNK2pUdZiw8kGSi39n
hpjftyefojISG6rbDGBGK2JiVTNOqDjMH2Ny8MhX2J5ryQQOsd6+9ojgri3nfTbP
6PmP08QyVxdYA3ZUlTZaKUNZ8AQHgoydhiEyGbdCewcE8pYaeEUqvcBi4DrDOil8
9BGHnf755Wl3k26P8uBsvri9zp+SZl0LEZLhSpyFpRmCTaFGn0pnSKJ0intnRTPc
JZ7gTY6q1Bt5DXToZw7hHVWgxjos8aweS2JLzSJloB6FFlDCckypkvSR4GQL7R9N
jIYntfwaQaJIgUSzVo/Aw6vjBTWbqyLHf3DP8ImsPSe/z0gjtRiyPkjoZgthuYp2
ErLoVe/JgKstR0gmobbdRhShIfXMFVAIwasXOXfq4Ye4LRwvfAwP2UDHrC25mb0O
IJi6fMqf3bWxmLIzFUcTe8Z2nzuKolAgP2rcd6kb0bbLxE4Y5xtzCV8fgnhk2obw
HvP4zZnacLKx8Nvet+YGUKjVJU3wx4RTgyGLU4WqC13fwZREeJLWwxgK859ZJ8yl
k+TQud5fKgfkX20+eTA0
=qdcM
-----END PGP SIGNATURE-----
Merge tag 'md/4.3-rc4-fix' of git://neil.brown.name/md
Pull md bugfix from Neil Brown:
"One bug fix for raid1/raid10.
Very careless bug earler in 4.3-rc, now fixed :-)"
* tag 'md/4.3-rc4-fix' of git://neil.brown.name/md:
crash in md-raid1 and md-raid10 due to incorrect list manipulation
- DM core AB-BA deadlock fix in the device destruction path (vs device
creation's DM table swap).
- DM raid fix to properly round up the region_size to the next
power-of-2.
- DM cache fix for a NULL pointer seen while switching from the
"cleaner" cache policy.
2 fixes for regressions introduced during the 4.3 merge:
- request-based DM error propagation regressed due to incorrect
changes introduced when adding the bi_error field to bio.
- DM snapshot fix to only support snapshots that overflow if the client
(e.g. lvm2) is prepared to deal with the associated snapshot status
interface change.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWGC/jAAoJEMUj8QotnQNaTgYIAJz1AG5IcHz8D3zi8+MBWXFL
WAYrXfXSxexsymVKFsqi6z9fYiW5fRZ41/+Kl8/dYnhBIS8uUzWlad2qw/JFg+zC
l/EzdHWjakzuGm9/quK2h/CBC/3pmRH9UeKgzOPODOpAzkJfrKoO4/J7JPIi3JyP
esE/2F2TBwERL4oC74UB7/nuM/xckS/DRjbd3B82/IsfM5n+MARvuSSrqWcPEu8h
Hh5k42KyA+Tq7uElLnXF8phFOCJCn9IyI+QLdxj33PfDxwrtXMvV6Sxw7FS8b7oF
/gw3Dod4sEv+EJZ1A+O9mxGBk3ajCpMvUYbcY6owIHyB1mKWiSKyvyBPyIY6RiQ=
=2z9t
-----END PGP SIGNATURE-----
Merge tag 'dm-4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull dm fixes from Mike Snitzer:
"Three stable fixes:
- DM core AB-BA deadlock fix in the device destruction path (vs
device creation's DM table swap).
- DM raid fix to properly round up the region_size to the next
power-of-2.
- DM cache fix for a NULL pointer seen while switching from the
"cleaner" cache policy.
Two fixes for regressions introduced during the 4.3 merge:
- request-based DM error propagation regressed due to incorrect
changes introduced when adding the bi_error field to bio.
- DM snapshot fix to only support snapshots that overflow if the
client (e.g. lvm2) is prepared to deal with the associated
snapshot status interface change"
* tag 'dm-4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm snapshot: add new persistent store option to support overflow
dm cache: fix NULL pointer when switching from cleaner policy
dm: fix request-based dm error reporting
dm raid: fix round up of default region size
dm: fix AB-BA deadlock in __dm_destroy()
Commit 76c44f6d80 introduced the possibly for "Overflow" to be reported
by the snapshot device's status. Older userspace (e.g. lvm2) does not
handle the "Overflow" status response.
Fix this incompatibility by requiring newer userspace code, that can
cope with "Overflow", request the persistent store with overflow support
by using "PO" (Persistent with Overflow) for the snapshot store type.
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Fixes: 76c44f6d80 ("dm snapshot: don't invalidate on-disk image on snapshot write overflow")
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The cleaner policy doesn't make use of the per cache block hint space in
the metadata (unlike the other policies). When switching from the
cleaner policy to mq or smq a NULL pointer crash (in dm_tm_new_block)
was observed. The crash was caused by bugs in dm-cache-metadata.c
when trying to skip creation of the hint btree.
The minimal fix is to change hint size for the cleaner policy to 4 bytes
(only hint size supported).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The commit 55ce74d4bfe1b9444436264c637f39a152d1e5ac (md/raid1: ensure
device failure recorded before write request returns) is causing crash in
the LVM2 testsuite test shell/lvchange-raid.sh. For me the crash is 100%
reproducible.
The reason for the crash is that the newly added code in raid1d moves the
list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and
then incorrectly pops the bio from conf->bio_end_io_list (which is empty
because the list was alrady moved).
Raid-10 has a similar bug.
Kernel Fault: Code=15 regs=000000006ccb8640 (Addr=0000000100000000)
CPU: 3 PID: 1930 Comm: mdX_raid1 Not tainted 4.2.0-rc5-bisect+ #35
task: 000000006cc1f258 ti: 000000006ccb8000 task.ti: 000000006ccb8000
YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001001111111000001111 Not tainted
r00-03 000000ff0804fe0f 000000001059d000 000000001059f818 000000007f16be38
r04-07 000000001059d000 000000007f16be08 0000000000200200 0000000000000001
r08-11 000000006ccb8260 000000007b7934d0 0000000000000001 0000000000000000
r12-15 000000004056f320 0000000000000000 0000000000013dd0 0000000000000000
r16-19 00000000f0d00ae0 0000000000000000 0000000000000000 0000000000000001
r20-23 000000000800000f 0000000042200390 0000000000000000 0000000000000000
r24-27 0000000000000001 000000000800000f 000000007f16be08 000000001059d000
r28-31 0000000100000000 000000006ccb8560 000000006ccb8640 0000000000000000
sr00-03 0000000000249800 0000000000000000 0000000000000000 0000000000249800
sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001059f61c 000000001059f620
IIR: 0f8010c6 ISR: 0000000000000000 IOR: 0000000100000000
CPU: 3 CR30: 000000006ccb8000 CR31: 0000000000000000
ORIG_R28: 000000001059d000
IAOQ[0]: call_bio_endio+0x34/0x1a8 [raid1]
IAOQ[1]: call_bio_endio+0x38/0x1a8 [raid1]
RP(r2): raid_end_bio_io+0x88/0x168 [raid1]
Backtrace:
[<000000001059f818>] raid_end_bio_io+0x88/0x168 [raid1]
[<00000000105a4f64>] raid1d+0x144/0x1640 [raid1]
[<000000004017fd5c>] kthread+0x144/0x160
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 55ce74d4bfe1 ("md/raid1: ensure device failure recorded before write request returns.")
Fixes: 95af587e95aa ("md/raid10: ensure device failure recorded before write request returns.")
Signed-off-by: NeilBrown <neilb@suse.com>
end_clone_bio() is a endio callback for clone bio and should check
and save the clone's bi_error for error reporting. However,
4246a0b63bd8 ("block: add a bi_error field to struct bio") changed
the function to check the original bio's bi_error, which is 0.
Without this fix, clone's error is ignored and reported to the
original request as success. Thus data corruption will be observed.
Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Two tagged for -stable
One is really a cleanup to match and improve kmemcache interface.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJWDjJNAAoJEDnsnt1WYoG5bOkP/ioJ8DZWkobOWSpnjbNCKIyg
xrX3FlTq8MJHPfeqGDzfznjYTZ7vb9ZYkZNkn1HUIOXKCkG0hqr1GL1eVZmKAbgZ
B3nuyIuArZe+IXQ5mMoMXn5qpp7/2mO/JPaqBBrUmxHMx+c+Xx0LC0QUdL7GXzY5
oQ8SahoLrl7Xl4/i9dSuhVD9rDhzuC7ZmykLkYrtquxFC69tH4PRUWak0RXXvHsE
mzADdqCwATLUu2FvEudoaCecXHxRmcn47CuALcqdaZF+VVPe8WsjIySmeVDRCixZ
k9njCdNiqtoKzb87MJECclYbCdHUVcKMNqaOoBkLaZnJumNFABwrPP3LnMtdaNpy
TrjYh3x5/xrdOgmWBML2gK/suEtaN2hgT6KyI38rAwlYQlEppxd94ZbIH0Q0wY+L
Unhcn28h56janKYVzyumA0Z5p6fbpxkI2OLEws4HzSqq6Ajpuc7yxDSCbUmE2vXL
WIoVAgH6PEr5sUCMH7xxqWejoXDi1KinPPVELKuMTWCiwRFr3CnZZzPXGJX5DXSG
nS9HCR35WmXuQx9pqC4/YOk7HBmllnNMHUrFlOYCzAn2qbjsCZ0whNlKe78qvN2z
+OYiVRF8KmSNAkP+S47sxeyEEYMi4aKVNe1ur1jVjYmA5keIdmjbnIRjGXfSNzff
PdvMqZcGouq4jsz2fqQf
=yqg5
-----END PGP SIGNATURE-----
Merge tag 'md/4.3-fixes' of git://neil.brown.name/md
Pull md fixes from Neil Brown:
"Assorted fixes for md in 4.3-rc.
Two tagged for -stable, and one is really a cleanup to match and
improve kmemcache interface.
* tag 'md/4.3-fixes' of git://neil.brown.name/md:
md/bitmap: don't pass -1 to bitmap_storage_alloc.
md/raid1: Avoid raid1 resync getting stuck
md: drop null test before destroy functions
md: clear CHANGE_PENDING in readonly array
md/raid0: apply base queue limits *before* disk_stack_limits
md/raid5: don't index beyond end of array in need_this_block().
raid5: update analysis state for failed stripe
md: wait for pending superblock updates before switching to read-only
Commit 3a0f9aaee028 ("dm raid: round region_size to power of two")
intended to make sure that the default region size is a power of two.
However, the logic in that commit is incorrect and sets the variable
region_size to 0 or 1, depending on whether min_region_size is a power
of two.
Fix this logic, using roundup_pow_of_two(), so that region_size is
properly rounded up to the next power of two.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 3a0f9aaee028 ("dm raid: round region_size to power of two")
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Passing -1 to bitmap_storage_alloc() causes page->index to be set to
-1, which is quite problematic.
So only pass ->cluster_slot if mddev_is_clustered().
Fixes: b97e92574c0b ("Use separate bitmaps for each nodes in the cluster")
Cc: stable@vger.kernel.org (v4.1+)
Signed-off-by: NeilBrown <neilb@suse.com>