IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
To implement sync I/O support within the LightNVM core, the end_io
functions are refactored to take an end_io function pointer instead of
testing for initialized media manager, followed by calling its end_io
function.
Sync I/O can then be implemented using a callback that signal I/O
completion. This is similar to the logic found in blk_to_execute_io().
By implementing it this way, the underlying device I/Os submission logic
is abstracted away from core, targets, and media managers.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
The modified variables are only used in the file mtip32xx.c.
As such, the static keyword is inserted to define that object
to be only visible to the current code module during compilation.
Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The max outstanding commands is being printed with a 0x prefix to
suggest it is a hex value, when in fact the integer decimal %d format
specifier is being used and this is a bit confusing. Use %x instead to
match the proceeding 0x prefix.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
We have split the setting up of all the resources in two steps:
1) talk_to_blkback - which figures out the num_ring_pages (from
the default value of zero), sets up shadow and so
2) blkfront_connect - does the real part of filling out the
internal structures.
The problem is if we bypass the 1) step and go straight to 2)
and call blkfront_setup_indirect where we use the macro
BLK_RING_SIZE - which returns an negative value (because
sz is zero - since num_ring_pages is zero - since it has never
been set).
We can fix this by making sure that we always have called
talk_to_blkback before going to blkfront_connect.
Or we could set in blkfront_probe info->nr_ring_pages = 1
to have a default value. But that looks odd - as we haven't
actually negotiated any ring size.
This patch changes XenbusStateConnected state to detect if
we haven't done the initial handshake - and if so continue
on as if were in XenbusStateInitWait state.
We also roll the error recovery (freeing the structure) into
talk_to_blkback error path - which is safe since that function
is only called from blkback_changed.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch fixs two memleaks:
backtrace:
[<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50
[<ffffffff81205e3b>] kmem_cache_alloc+0xbb/0x1d0
[<ffffffff81534028>] xen_blkbk_probe+0x58/0x230
[<ffffffff8146adb6>] xenbus_dev_probe+0x76/0x130
[<ffffffff81511716>] driver_probe_device+0x166/0x2c0
[<ffffffff815119bc>] __device_attach_driver+0xac/0xb0
[<ffffffff8150fa57>] bus_for_each_drv+0x67/0x90
[<ffffffff81511ab7>] __device_attach+0xc7/0x120
[<ffffffff81511b23>] device_initial_probe+0x13/0x20
[<ffffffff8151059a>] bus_probe_device+0x9a/0xb0
[<ffffffff8150f0a1>] device_add+0x3b1/0x5c0
[<ffffffff8150f47e>] device_register+0x1e/0x30
[<ffffffff8146a9e8>] xenbus_probe_node+0x158/0x170
[<ffffffff8146abaf>] xenbus_dev_changed+0x1af/0x1c0
[<ffffffff8146b1bb>] backend_changed+0x1b/0x20
[<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160
unreferenced object 0xffff880007ba8ef8 (size 224):
backtrace:
[<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50
[<ffffffff81205c73>] __kmalloc+0xd3/0x1e0
[<ffffffff81534d87>] frontend_changed+0x2c7/0x580
[<ffffffff8146af12>] xenbus_otherend_changed+0xa2/0xb0
[<ffffffff8146b2c0>] frontend_changed+0x10/0x20
[<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160
[<ffffffff810d3e97>] kthread+0xd7/0xf0
[<ffffffff817c4a9f>] ret_from_fork+0x3f/0x70
[<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8800048dcd38 (size 224):
The first leak is caused by not put() the be->blkif reference
which we had gotten in xen_blkif_alloc(), while the second is
us not freeing blkif->rings in the right place.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Make st_* statistics per ring and the VBD sysfs would iterate over all the
rings.
Note: xenvbd_sysfs_delif() is called in xen_blkbk_remove() before all rings
are torn down, so it's safe.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Aligned the variables on the same column.
The minimal size of request in the block framework is always PAGE_SIZE.
It means that when 64KB guest is support, the request will at least be
64KB.
Although, if the backend doesn't support indirect descriptor (such as QDISK
in QEMU), a ring request is only able to accommodate 11 segments of 4KB
(i.e 44KB).
The current frontend is assuming that an I/O request will always fit in
a ring request. This is not true any more when using 64KB page
granularity and will therefore crash during boot.
On ARM64, the ABI is completely neutral to the page granularity used by
the domU. The guest has the choice between different page granularity
supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB).
This can't be enforced by the hypervisor and therefore it's possible to
run guests using different page granularity.
So we can't mandate the block backend to support indirect descriptor
when the frontend is using 64KB page granularity and have to fix it
properly in the frontend.
The solution exposed below is based on modifying directly the frontend
guest rather than asking the block framework to support smaller size
(i.e < PAGE_SIZE). This is because the change is the block framework are
not trivial as everything seems to relying on a struct *page (see [1]).
Although, it may be possible that someone succeed to do it in the future
and we would therefore be able to use it.
Given that a block request may not fit in a single ring request, a
second request is introduced for the data that cannot fit in the first
one. This means that the second ring request should never be used on
Linux if the page size is smaller than 44KB.
To achieve the support of the extra ring request, the block queue size
is divided by two. Therefore, the ring will always contain enough space
to accommodate 2 ring requests. While this will reduce the overall
performance, it will make the implementation more contained. The way
forward to get better performance is to implement in the backend either
indirect descriptor or multiple grants ring.
Note that the parameters blk_queue_max_* helpers haven't been updated.
The block code will set the mimimum size supported and we may be able
to support directly any change in the block framework that lower down
the minimal size of a request.
[1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The code to get a request is always the same. Therefore we can factorize
it in a single function.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen_blkif_schedule() kthread calls try_to_freeze() at the beginning of
every attempt to purge the LRU. This operation can't ever succeed though,
as the kthread hasn't marked itself as freezable.
Before (hopefully eventually) kthread freezing gets converted to fileystem
freezing, we'd rather mark xen_blkif_schedule() freezable (as it can
generate I/O during suspend).
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
With the multi-queue support we could fail at setting up
some of the rings and fail the connection. That meant that
all resources tied to rings[0..n-1] (where n is the ring
that failed to be setup). Eventually the frontend will switch
to the states and we will call xen_blkif_disconnect.
However we do not want to be at the mercy of the frontend
deciding when to change states. This allows us to do the
cleanup right away and freeing resources.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Make pool of persistent grants and free pages per-queue/ring instead of
per-device to get better scalability.
Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB
[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting
Results:
iops1: After patch "xen/blkfront: make persistent grants per-queue".
iops2: After this patch.
Queues: 1 4 8 16
Iops orig(k): 810 1064 780 700
Iops1(k): 810 1230(~20%) 1024(~20%) 850(~20%)
Iops2(k): 810 1410(~35%) 1354(~75%) 1440(~100%)
With 4 queues after this commit we can get ~75% increase in IOPS, and
performance won't drop if increasing queue numbers.
Please find the respective chart in this link:
https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Backend advertises "multi-queue-max-queues" to front, also get the negotiated
number from "multi-queue-num-queues" written by blkfront.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1, larger number will be enabled in
"xen/blkback: get the number of hardware queues/rings from blkfront".
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Align variables in the structures.
Split per ring information to an new structure "xen_blkif_ring", so that one vbd
device can be associated with one or more rings/hardware queues.
Introduce 'pers_gnts_lock' to protect the pool of persistent grants since we
may have multi backend threads.
This patch is a preparation for supporting multi hardware queues/rings.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Align the variables in the structure.
According to this piece code:
"
pr_info("Invalid max_ring_order (%d), will use default max: %d.\n",
xen_blkif_max_ring_order, XENBUS_MAX_RING_GRANT_ORDER);
"
if xen_blkif_max_ring_order is bigger that XENBUS_MAX_RING_GRANT_ORDER,
need to set xen_blkif_max_ring_order using XENBUS_MAX_RING_GRANT_ORDER,
but not 0.
Signed-off-by: Peng Fan <van.freenix@gmail.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Make persistent grants per-queue/ring instead of per-device, so that we can
drop the 'dev_lock' and get better scalability.
Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB
[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting
Queues: 1 4 8 16
Iops orig(k): 810 1064 780 700
Iops patched(k): 810 1230(~20%) 1024(~20%) 850(~20%)
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We do the same exact operations a bit earlier in the
function.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The max number of hardware queues for xen/blkfront is set by parameter
'max_queues'(default 4), while it is also capped by the max value that the
xen/blkback exposes through XenStore key 'multi-queue-max-queues'.
The negotiated number is the smaller one and would be written back to xenstore
as "multi-queue-num-queues", blkback needs to read this negotiated number.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
After patch "xen/blkfront: separate per ring information out of device
info", per-ring data is protected by a per-device lock ('io_lock').
This is not a good way and will effect the scalability, so introduce a
per-ring lock ('ring_lock').
The old 'io_lock' is renamed to 'dev_lock' which protects the ->grants list and
->persistent_gnts_c which are shared by all rings.
Note that in 'blkfront_probe' the 'blkfront_info' is setup via kzalloc
so setting ->persistent_gnts_c to zero is not needed.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1, larger number will be enabled in
patch "xen/blkfront: negotiate number of queues/rings to be used with backend"
so as to make review easier.
Note that blkfront_gather_backend_features does not call
blkfront_setup_indirect anymore (as that needs to be done per ring).
That means that in blkif_recover/blkif_connect we have to do it in a loop
(bounded by nr_rings).
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Split per ring information to a new structure "blkfront_ring_info".
A ring is the representation of a hardware queue, every vbd device can associate
with one or more rings depending on how many hardware queues/rings to be used.
This patch is a preparation for supporting real multi hardware queues/rings.
We also add a backpointer to 'struct blkfront_info' (dev_info) which
is not needed (we could use containers_of) but further patch
("xen/blkfront: pseudo support for multi hardware queues/rings")
will make allocation of 'blkfront_ring_info' dynamic.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If null_blk is run in NULL_IRQ_TIMER mode and with queue_mode NULL_Q_RQ,
we need to restart the queue from the hrtimer interrupt. We can't
directly invoke the request_fn from that context, so punt the queue run
to async kblockd context.
Tested-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 8182503df1ba used monotonic time, but if the adapter is
using the seconds for logging entries, then we'll get duplicate
entries if the system is rebooted. Use real time instead.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 8182503df1ba ("block: sx8.c: Replace timeval with ktime_t")
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block layer fixes from Jens Axboe:
"Three small fixes for 4.4 final. Specifically:
- The segment issue fix from Junichi, where the old IO path does a
bio limit split before potentially bouncing the pages. We need to
do that in the right order, to ensure that limitations are met.
- A NVMe surprise removal IO hang fix from Keith.
- A use-after-free in null_blk, introduced by a previous patch in
this series. From Mike Krinkin"
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: fix use-after-free error
block: ensure to split after potentially bouncing a bio
NVMe: IO ending fixes on surprise removal
blk_end_request_all may free request, so we need to save
request_queue pointer before blk_end_request_all call.
The problem was introduced in commit cf8ecc5a8455266f8d51
("null_blk: guarantee device restart in all irq modes")
and causes general protection fault with slab poisoning
enabled.
Fixes: cf8ecc5a8455266f8d51 ("null_blk: guarantee device
restart in all irq modes")
Signed-off-by: Mike Krinkin <krinkin.m.u@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
32-bit systems using 'struct timeval' will break in the year 2038,
in order to avoid that replace the code with more appropriate types.
This patch replaces timeval with 64 bit ktime_t which is y2038 safe.
Since st->timestamp is only interested in seconds, directly using
time64_t here. Function ktime_get_seconds is used since it uses
monotonic instead of real time and thus will not cause overflow.
Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
- XSA-155 security fixes to backend drivers.
- XSA-157 security fixes to pciback.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWdDrXAAoJEFxbo/MsZsTR3N0H/0Lvz6MWBARCje7livbz7nqE
PS0Bea+2yAfNhCDDiDlpV0lor8qlyfWDF6lGhLjItldAzahag3ZDKDf1Z/lcQvhf
3MwFOcOVZE8lLtvLT6LGnPuehi1Mfdi1Qk1/zQhPhsq6+FLPLT2y+whmBihp8mMh
C12f7KRg5r3U7eZXNB6MEtGA0RFrOp0lBdvsiZx3qyVLpezj9mIe0NueQqwY3QCS
xQ0fILp/x2EnZNZuzgghFTPRxMAx5ReOezgn9Rzvq4aThD+irz1y6ghkYN4rG2s2
tyYOTqBnjJEJEQ+wmYMhnfCwVvDffztG+uI9hqN31QFJiNB0xsjSWFCkDAWchiU=
=Argz
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.4-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel:
- XSA-155 security fixes to backend drivers.
- XSA-157 security fixes to pciback.
* tag 'for-linus-4.4-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen-pciback: fix up cleanup path when alloc fails
xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
xen/pciback: Do not install an IRQ handler for MSI interrupts.
xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
xen/pciback: Save xen_pci_op commands before processing it
xen-scsiback: safely copy requests
xen-blkback: read from indirect descriptors only once
xen-blkback: only read request operation from shared ring once
xen-netback: use RING_COPY_REQUEST() throughout
xen-netback: don't use last request to determine minimum Tx credit
xen: Add RING_COPY_REQUEST()
xen/x86/pvh: Use HVM's flush_tlb_others op
xen: Resume PMU from non-atomic context
xen/events/fifo: Consume unprocessed events when a CPU dies
Since indirect descriptors are in memory shared with the frontend, the
frontend could alter the first_sect and last_sect values after they have
been validated but before they are recorded in the request. This may
result in I/O requests that overflow the foreign page, possibly
overwriting local pages when the I/O request is executed.
When parsing indirect descriptors, only read first_sect and last_sect
once.
This is part of XSA155.
CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
A compiler may load a switch statement value multiple times, which could
be bad when the value is in memory shared with the frontend.
When converting a non-native request to a native one, ensure that
src->operation is only loaded once by using READ_ONCE().
This is part of XSA155.
CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Pull block layer fixes from Jens Axboe:
"A set of fixes for the current series. This contains:
- A bunch of fixes for lightnvm, should be the last round for this
series. From Matias and Wenwei.
- A writeback detach inode fix from Ilya, also marked for stable.
- A block (though it says SCSI) fix for an OOPS in SCSI runtime power
management.
- Module init error path fixes for null_blk from Minfei"
* 'for-linus' of git://git.kernel.dk/linux-block:
null_blk: Fix error path in module initialization
lightnvm: do not compile in debugging by default
lightnvm: prevent gennvm module unload on use
lightnvm: fix media mgr registration
lightnvm: replace req queue with nvmdev for lld
lightnvm: comments on constants
lightnvm: check mm before use
lightnvm: refactor spin_unlock in gennvm_get_blk
lightnvm: put blks when luns configure failed
lightnvm: use flags in rrpc_get_blk
block: detach bdev inode from its wb in __blkdev_put()
SCSI: Fix NULL pointer dereference in runtime PM
Module couldn't release resource properly during the initialization. To
fix this issue, we will clean up the proper resource before returning.
Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
There's no reason for temparea to be static, since it's only used for
temporary sprintf output. It's not immediately obvious that the output
will always fit (in the worst case, the output including '\0' is
exactly 32 bytes), so save a future reader from worrying about that.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
In the case where a request queue is passed to the low lever lightnvm
device drive integration, the device driver might pass its admin
commands through another queue. Instead pass nvm_dev, and let the
low level drive the appropriate queue.
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull Ceph fix from Sage Weil:
"This addresses a refcounting bug that leads to a use-after-free"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: don't put snap_context twice in rbd_queue_workfn()
Commit 4e752f0ab0e8 ("rbd: access snapshot context and mapping size
safely") moved ceph_get_snap_context() out of rbd_img_request_create()
and into rbd_queue_workfn(), adding a ceph_put_snap_context() to the
error path in rbd_queue_workfn(). However, rbd_img_request_create()
consumes a ref on snapc, so calling ceph_put_snap_context() after
a successful rbd_img_request_create() leads to an extra put. Fix it.
Cc: stable@vger.kernel.org # 3.18+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t. Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This commit at least doubles the maximum value for
completion_nsec. This helps in special cases where one wants/needs to
emulate an extremely slow I/O (for example to spot bugs).
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In single-queue (block layer) mode,the function null_rq_prep_fn stops
the device if alloc_cmd fails. Then, once stopped, the device must be
restarted on the next command completion, so that the request(s) for
which alloc_cmd failed can be requeued. Otherwise the device hangs.
Unfortunately, device restart is currently performed only for delayed
completions, i.e., in irqmode==2. This fact causes hangs, for the
above reasons, with the other irqmodes in combination with single-queue
block layer.
This commits addresses this issue by making sure that, if stopped, the
device is properly restarted for all irqmodes on completions.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna AVanzini <avanzini@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For the Timer IRQ mode (i.e., when command completions are delayed),
there is one timer for each CPU. Each of these timers
. has a completion queue associated with it, containing all the
command completions to be executed when the timer fires;
. is set, and a new completion-to-execute is inserted into its
completion queue, every time the dispatch code for a new command
happens to be executed on the CPU related to the timer.
This implies that, if the dispatch of a new command happens to be
executed on a CPU whose timer has already been set, but has not yet
fired, then the timer is set again, to the completion time of the
newly arrived command. When the timer eventually fires, all its queued
completions are executed.
This way of handling delayed command completions entails the following
problem: if more than one command completion is inserted into the
queue of a timer before the timer fires, then the expiration time for
the timer is moved forward every time each of these completions is
enqueued. As a consequence, only the last completion enqueued enjoys a
correct execution time, while all previous completions are unjustly
delayed until the last completion is executed (and at that time they
are executed all together).
Specifically, if all the above completions are enqueued almost at the
same time, then the problem is negligible. On the opposite end, if
every completion is enqueued a while after the previous completion was
enqueued (in the extreme case, it is enqueued only right before the
timer would have expired), then every enqueued completion, except for
the last one, experiences an inflated delay, proportional to the number
of completions enqueued after it. In the end, commands, and thus I/O
requests, may be completed at an arbitrarily lower rate than the
desired one.
This commit addresses this issue by replacing per-CPU timers with
per-command timers, i.e., by associating an individual timer with each
command.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In case the lower level device size changed, but some other internal
details of the resize did not work out, drbd_determine_dev_size() would
try to restore the previous settings, trusting
drbd_md_set_sector_offsets() to "do the right thing", but overlooked
that this internally may set the meta data base offset based on device size.
This could end up with incomplete on-disk meta data layout change, and
ultimately lead to data corruption (if the failure was not noticed or
ignored by the operator, and other things go wrong as well).
Just remember all meta data related offsets/sizes,
and on error restore them all.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
During handshake communication, we also reconsider our device size,
using drbd_determine_dev_size(). Just in case we need to change the
offsets or layout of our on-disk metadata, we lock out application
and other meta data IO, and wait for the activity log to be "idle"
(no more referenced extents).
If this handshake happens just after a connection loss, with a fencing
policy of "resource-and-stonith", we have frozen IO.
If, additionally, the activity log was "starving" (too many incoming
random writes at that point in time), it won't become idle, ever,
because of the frozen IO, and this would be a lockup of the receiver
thread, and consquentially of DRBD.
Previous logic (re-)initialized with a special "empty" transaction
block, which required the activity log to fully drain first.
Instead, write out some standard activity log transactions.
Using lc_try_lock_for_transaction() instead of lc_try_lock() does not
care about pending activity log references, avoiding the potential
deadlock.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
To be able to "force out" an activity log transaction,
even if there are no pending updates.
This will be used to relocate the on-disk activity log,
if the on-disk offsets have to be changed,
without the need to empty the activity log first.
While at it, move the definition,
so we can drop the forward declaration of a static helper.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>