IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
ZRAM performs direct LZO compression algorithm calls, making it the one
and only option. While LZO is generally performs well, LZ4 algorithm
tends to have a faster decompression (see http://code.google.com/p/lz4/
for full report)
Name Ratio C.speed D.speed
MB/s MB/s
LZ4 (r101) 2.084 422 1820
LZO 2.06 2.106 414 600
Thus, users who have mostly read (decompress) usage scenarious or mixed
workflow (writes with relatively high read ops number) will benefit from
using LZ4 compression backend.
Introduce compressing backend abstraction zcomp in order to support
multiple compression algorithms with the following set of operations:
.create
.destroy
.compress
.decompress
Schematically zram write() usually contains the following steps:
0) preparation (decompression of partioal IO, etc.)
1) lock buffer_lock mutex (protects meta compress buffers)
2) compress (using meta compress buffers)
3) alloc and map zs_pool object
4) copy compressed data (from meta compress buffers) to object allocated by 3)
5) free previous pool page, assign a new one
6) unlock buffer_lock mutex
As we can see, compressing buffers must remain untouched from 1) to 4),
because, otherwise, concurrent write() can overwrite data. At the same
time, zram_meta must be aware of a) specific compression algorithm memory
requirements and b) necessary locking to protect compression buffers. To
remove requirement a) new struct zcomp_strm introduced, which contains a
compress/decompress `buffer' and compression algorithm `private' part.
While struct zcomp implements zcomp_strm stream handling and locking and
removes requirement b) from zram meta. zcomp ->create() and ->destroy(),
respectively, allocate and deallocate algorithm specific zcomp_strm
`private' part.
Every zcomp has zcomp stream and mutex to protect its compression stream.
Stream usage semantics remains the same -- only one write can hold stream
lock and use its buffers. zcomp_strm_find() turns caller into exclusive
user of a stream (holding stream mutex until zram release stream), and
zcomp_strm_release() makes zcomp stream available (unlock the stream
mutex). Hence no concurrent write (compression) operations possible at
the moment.
iozone -t 3 -R -r 16K -s 60M -I +Z
test base patched
--------------------------------------------------
Initial write 597992.91 591660.58
Rewrite 609674.34 616054.97
Read 2404771.75 2452909.12
Re-read 2459216.81 2470074.44
Reverse Read 1652769.66 1589128.66
Stride read 2202441.81 2202173.31
Random read 2236311.47 2276565.31
Mixed workload 1423760.41 1709760.06
Random write 579584.08 615933.86
Pwrite 597550.02 594933.70
Pread 1703672.53 1718126.72
Fwrite 1330497.06 1461054.00
Fread 3922851.00 3957242.62
Usage examples:
comp = zcomp_create(NAME) /* NAME e.g. "lzo" */
which initialises compressing backend if requested algorithm is supported.
Compress:
zstrm = zcomp_strm_find(comp)
zcomp_compress(comp, zstrm, src, &dst_len)
[..] /* copy compressed data */
zcomp_strm_release(comp, zstrm)
Decompress:
zcomp_decompress(comp, src, src_len, dst);
Free compessing backend and its zcomp stream:
zcomp_destroy(comp)
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
allocate new `zram_meta' in disksize_store() only for uninitialised zram
device, saving a number of allocations and deallocations in case if
disksize_store() was called on currently used device. at the same time
zram_meta stack variable is not necessary, because we can set ->meta
directly. there is also no need in setting QUEUE_FLAG_NONROT queue on
every disksize_store(), set it once during device creation.
[minchan@kernel.org: handle zram->meta alloc fail case]
[minchan@kernel.org: prevent lockdep spew of init_lock]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zram accounted but did not report numbers of failed read and write
queries. make these stats available as failed_reads and failed_writes
attrs.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a preparation patch for stats code duplication removal.
1) use atomic64_t for `pages_zero' and `pages_stored' zram stats.
2) `compr_size' and `pages_zero' struct zram_stats members did not
follow the existing device attr naming scheme: zram_stats.ATTR has
ATTR_show() function. rename them:
-- compr_size -> compr_data_size
-- pages_zero -> zero_pages
Minchan Kim's note:
If we really have trouble with atomic stat operation, we could
change it with percpu_counter so that it could solve atomic overhead and
unnecessary memory space by introducing unsigned long instead of 64bit
atomic_t.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove `good' and `bad' compressed sub-requests stats. RW request may
cause a number of RW sub-requests. zram used to account `good' compressed
sub-queries (with compressed size less than 50% of original size), `bad'
compressed sub-queries (with compressed size greater that 75% of original
size), leaving sub-requests with compression size between 50% and 75% of
original size not accounted and not reported. zram already accounts each
sub-request's compression size so we can calculate real device compression
ratio.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not pass rw argument down the __zram_make_request() -> zram_bvec_rw()
chain, decode it in zram_bvec_rw() instead. Besides, this is the place
where we distinguish READ and WRITE bio data directions, so account zram
RW stats here, instead of __zram_make_request(). This also allows to
account a real number of zram READ/WRITE operations, not just requests
(single RW request may cause a number of zram RW ops with separate
locking, compression/decompression, etc).
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce init_done() helper function which allows us to drop `init_done'
struct zram member. init_done() uses the fact that ->init_done == 1
equals to ->meta != NULL.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph updates from Sage Weil:
"The biggest chunk is a series of patches from Ilya that add support
for new Ceph osd and crush map features, including some new tunables,
primary affinity, and the new encoding that is needed for erasure
coding support. This brings things into parity with the server side
and the looming firefly release. There is also support for allocation
hints in RBD that help limit fragmentation on the server side.
There is also a series of patches from Zheng fixing NFS reexport,
directory fragmentation support, flock vs fnctl behavior, and some
issues with clustered MDS.
Finally, there are some miscellaneous fixes from Yunchuan Wen for
fscache, Fabian Frederick for ACLs, and from me for fsync(dirfd)
behavior"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (79 commits)
ceph: skip invalid dentry during dcache readdir
libceph: dump pool {read,write}_tier to debugfs
libceph: output primary affinity values on osdmap updates
ceph: flush cap release queue when trimming session caps
ceph: don't grabs open file reference for aborted request
ceph: drop extra open file reference in ceph_atomic_open()
ceph: preallocate buffer for readdir reply
libceph: enable PRIMARY_AFFINITY feature bit
libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
libceph: add support for osd primary affinity
libceph: add support for primary_temp mappings
libceph: return primary from ceph_calc_pg_acting()
libceph: switch ceph_calc_pg_acting() to new helpers
libceph: introduce apply_temps() helper
libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
libceph: ceph_can_shift_osds(pool) and pool type defines
libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
libceph: enable OSDMAP_ENC feature bit
libceph: primary_affinity decode bits
libceph: primary_affinity infrastructure
...
In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order). Backwards compatibility is taken care
of on the libceph/osd side.
"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once. The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into. To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
In preparation for prefixing rbd writes with an allocation hint
introduce a num_ops parameter for rbd_osd_req_create(). The rationale
is that not every write request is a write op that needs to be prefixed
(e.g. watch op), so the num_ops logic needs to be in the callers.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Our longest osd request now contains 3 ops: copyup+hint+write.
Also, CEPH_OSD_MAX_OP value in a BUG_ON in rbd_osd_req_callback() was
hard-coded to 2. Fix it, and switch to rbd_assert while at it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Doing rbd_obj_request_put() in rbd_img_request_fill() error paths is
not only insufficient, but also triggers an rbd_assert() in
rbd_obj_request_destroy():
Assertion failure in rbd_obj_request_destroy() at line 1867:
rbd_assert(obj_request->img_request == NULL);
rbd_img_obj_request_add() adds obj_requests to the img_request, the
opposite is rbd_img_obj_request_del(). Use it.
Fixes: http://tracker.ceph.com/issues/7327
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Commit 03507db631c94 ("rbd: fix buffer size for writes to images with
snapshots") moved the call to rbd_img_obj_request_add() up, making the
out_partial label bogus. Remove it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
doubling of the default queue length though.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJTOirvAAoJENkgDmzRrbjxpqAP/3114WOubf4MNoi24eXNkU2x
ZC++orq408A72g8dB7A0XAhatKc8Ay0bDP5hOZNAgTHm2hjYxmhpC9UlKEzzElpW
yjwR0wYBXA0RoJuZqCp9MNgOtkU54QoQZ0c4EZblagslUZBmKQPUDE7XdgkaqO6o
A3azfCjAFDu523Azep9Npj1sk+H+VH3OIXyMGY+uyHBw1a2rnmIhn4lQCQ+pX+YO
3wpqxEragpjBAizs1CAAB9wWm2O8zVACxkoUXVYI8Tu60n99lwr9Abxlc0oHSIig
E7kBnyzQVHfkDrPXR3EdwTi3Hwd/BaOiW4dPvQ3IJKvPOzoiS4H3IpJCnCV5PfRb
VHl3q//SzPQ+GXH7WH2Fhb9JxoLxBRzFcy3kdIR1wBHYahAOiQjcLgapOO5mVq3X
PJy9CDs2L9rjbtxvQWtnl62V3JFGw+ZdhhG/BjeC5Who/aSh/mDoss7/qdfrxKJx
z5IWYSlJw7ighOuF0dPdCKAX9WiWqENvga31Q2svrH4Hxlx6eGumEmX+YQw0iRAf
ddMYA+1qLT4myPTN0nORFM+T/mkZHNkNMCr0qylRFH0j6hSiDxwWqQG0eXA661By
W6nIkW++sj2Vkk4knMGCXSyMmy9Nrv+1R+8unQJCXixYevotP5JEY0DoCQwlGuuq
xa0UR+2q9Htnbytu8S0K
=AoMS
-----END PGP SIGNATURE-----
Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio updates from Rusty Russell:
"Nothing exciting: virtio-blk users might see a bit of a boost from the
doubling of the default queue length though"
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
virtio-blk: base queue-depth on virtqueue ringsize or module param
Revert a02bbb1ccfe8: MAINTAINERS: add virtio-dev ML for virtio
virtio: fail adding buffer on broken queues.
virtio-rng: don't crash if virtqueue is broken.
virtio_balloon: don't crash if virtqueue is broken.
virtio_blk: don't crash, report error if virtqueue is broken.
virtio_net: don't crash if virtqueue is broken.
virtio_balloon: don't softlockup on huge balloon changes.
virtio: Use pci_enable_msix_exact() instead of pci_enable_msix()
MAINTAINERS: virtio-dev is subscribers only
tools/virtio: add a missing )
tools/virtio: fix missing kmemleak_ignore symbol
tools/virtio: update internal copies of headers
Pull block driver update from Jens Axboe:
"On top of the core pull request, here's the pull request for the
driver related changes for 3.15. It contains:
- Improvements for msi-x registration for block drivers (mtip32xx,
skd, cciss, nvme) from Alexander Gordeev.
- A round of cleanups and improvements for drbd from Andreas
Gruenbacher and Rashika Kheria.
- A round of clanups and improvements for bcache from Kent.
- Removal of sleep_on() and friends in DAC960, ataflop, swim3 from
Arnd Bergmann.
- Bug fix for a bug in the mtip32xx async completion code from Sam
Bradshaw.
- Bug fix for accidentally bouncing IO on 32-bit platforms with
mtip32xx from Felipe Franciosi"
* 'for-3.15/drivers' of git://git.kernel.dk/linux-block: (103 commits)
bcache: remove nested function usage
bcache: Kill bucket->gc_gen
bcache: Kill unused freelist
bcache: Rework btree cache reserve handling
bcache: Kill btree_io_wq
bcache: btree locking rework
bcache: Fix a race when freeing btree nodes
bcache: Add a real GC_MARK_RECLAIMABLE
bcache: Add bch_keylist_init_single()
bcache: Improve priority_stats
bcache: Better alloc tracepoints
bcache: Kill dead cgroup code
bcache: stop moving_gc marking buckets that can't be moved.
bcache: Fix moving_pred()
bcache: Fix moving_gc deadlocking with a foreground write
bcache: Fix discard granularity
bcache: Fix another bug recovering from unclean shutdown
bcache: Fix a bug recovering from unclean shutdown
bcache: Fix a journalling reclaim after recovery bug
bcache: Fix a null ptr deref in journal replay
...
Pull core block layer updates from Jens Axboe:
"This is the pull request for the core block IO bits for the 3.15
kernel. It's a smaller round this time, it contains:
- Various little blk-mq fixes and additions from Christoph and
myself.
- Cleanup of the IPI usage from the block layer, and associated
helper code. From Frederic Weisbecker and Jan Kara.
- Duplicate code cleanup in bio-integrity from Gu Zheng. This will
give you a merge conflict, but that should be easy to resolve.
- blk-mq notify spinlock fix for RT from Mike Galbraith.
- A blktrace partial accounting bug fix from Roman Pen.
- Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"
* 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
blk-mq: add REQ_SYNC early
rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
blk-mq: support partial I/O completions
blk-mq: merge blk_mq_insert_request and blk_mq_run_request
blk-mq: remove blk_mq_alloc_rq
blk-mq: don't dump CPU -> hw queue map on driver load
blk-mq: fix wrong usage of hctx->state vs hctx->flags
blk-mq: allow blk_mq_init_commands() to return failure
block: remove old blk_iopoll_enabled variable
blktrace: fix accounting of partially completed requests
smp: Rename __smp_call_function_single() to smp_call_function_single_async()
smp: Remove wait argument from __smp_call_function_single()
watchdog: Simplify a little the IPI call
smp: Move __smp_call_function_single() below its safe version
smp: Consolidate the various smp_call_function_single() declensions
smp: Teach __smp_call_function_single() to check for offline cpus
smp: Remove unused list_head from csd
smp: Iterate functions through llist_for_each_entry_safe()
block: Stop abusing rq->csd.list in blk-softirq
block: Remove useless IPI struct initialization
...
Pull workqueue changes from Tejun Heo:
"PREPARE_[DELAYED_]WORK() were used to change the work function of work
items without fully reinitializing it; however, this makes workqueue
consider the work item as a different one from before and allows the
work item to start executing before the previous instance is finished
which can lead to extremely subtle issues which are painful to debug.
The interface has never been popular. This pull request contains
patches to remove existing usages and kill the interface. As one of
the changes was routed during the last devel cycle and another
depended on a pending change in nvme, for-3.15 contains a couple merge
commits.
In addition, interfaces which were deprecated quite a while ago -
__cancel_delayed_work() and WQ_NON_REENTRANT - are removed too"
* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: remove deprecated WQ_NON_REENTRANT
workqueue: Spelling s/instensive/intensive/
workqueue: remove PREPARE_[DELAYED_]WORK()
staging/fwserial: don't use PREPARE_WORK
afs: don't use PREPARE_WORK
nvme: don't use PREPARE_WORK
usb: don't use PREPARE_DELAYED_WORK
floppy: don't use PREPARE_[DELAYED_]WORK
ps3-vuart: don't use PREPARE_WORK
wireless/rt2x00: don't use PREPARE_WORK in rt2800usb.c
workqueue: Remove deprecated __cancel_delayed_work()
Pull Ceph fix from Sage Weil:
"This drops a bad assert that a few users have been hitting but we've
only recently been able to track down"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: drop an unsafe assertion
Olivier Bonvalet reported having repeated crashes due to a failed
assertion he was hitting in rbd_img_obj_callback():
Assertion failure in rbd_img_obj_callback() at line 2165:
rbd_assert(which >= img_request->next_completion);
With a lot of help from Olivier with reproducing the problem
we were able to determine the object and image requests had
already been completed (and often freed) at the point the
assertion failed.
There was a great deal of discussion on the ceph-devel mailing list
about this. The problem only arose when there were two (or more)
object requests in an image request, and the problem was always
seen when the second request was being completed.
The problem is due to a race in the window between setting the
"done" flag on an object request and checking the image request's
next completion value. When the first object request completes, it
checks to see if its successor request is marked "done", and if
so, that request is also completed. In the process, the image
request's next_completion value is updated to reflect that both
the first and second requests are completed. By the time the
second request is able to check the next_completion value, it
has been set to a value *greater* than its own "which" value,
which caused an assertion to fail.
Fix this problem by skipping over any completion processing
unless the completing object request is the next one expected.
Test only for inequality (not >=), and eliminate the bad
assertion.
Tested-by: Olivier Bonvalet <ob@daevel.fr>
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Checkpatch has started warning against using DEFINE_PCI_DEVICE_TABLE,
so replace it. Also update the copyright date and bump the module
version number to 0.9.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
dev->max_hw_sectors may be zero to indicate the device has no limit on
the number of sectors. nvme_trans_do_nvme_io() should use the software
limit, since this is guaranteed to be non-zero.
Reported-by: Mundu <mundu2510@gmail.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
This adds rcu protected access to a queue in the nvme IOCTL path
to fix potential races between a surprise removal and queue usage in
nvme_submit_sync_cmd. The fix holds the rcu_read_lock() here to prevent
the nvme_queue from freeing while this path is executing so it can't
sleep, and so this path will no longer wait for a available command
id should they all be in use at the time a passthrough IOCTL request
is received.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
This adds rcu protected access to nvme_queue to fix a race between a
surprise removal freeing the queue and a thread with open reference on
a NVMe block device using that queue.
The queues do not need to be rcu protected during the initialization or
shutdown parts, so I've added a helper function for raw deferencing
to get around the sparse errors.
There is still a hole in the IOCTL path for the same problem, which is
fixed in a subsequent patch.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
If an NVMe device becomes ready but fails to create IO queues, the driver
creates a character device handle so the device can be managed. The
device reference count needs to be initialized before creating the
character device.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Add CONFIG_PM_SLEEP to suspend/resume functions to fix the following
build warning when CONFIG_PM_SLEEP is not selected. This is because
sleep PM callbacks defined by SIMPLE_DEV_PM_OPS are only used when
the CONFIG_PM_SLEEP is enabled.
drivers/block/nvme-core.c:2541:12: warning: 'nvme_suspend' defined but not used [-Wunused-function]
drivers/block/nvme-core.c:2550:12: warning: 'nvme_resume' defined but not used [-Wunused-function]
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Venkatash spake thus:
virtio-blk set the default queue depth to 64 requests, which was
insufficient for high-IOPS devices. Instead set the blk-queue depth to
the device's virtqueue depth divided by two (each I/O requires at least
two VQ entries).
But behold, Ted added a module parameter:
Also allow the queue depth to be something which can be set at module
load time or via a kernel boot-time parameter, for
testing/benchmarking purposes.
And I rewrote it substantially, mainly to take
VIRTIO_RING_F_INDIRECT_DESC into account.
As QEMU sets the vq size for PCI to 128, Venkatash's patch wouldn't
have made a change. This version does (since QEMU also offers
VIRTIO_RING_F_INDIRECT_DESC.
Inspired-by: "Theodore Ts'o" <tytso@mit.edu>
Based-on-the-true-story-of: Venkatesh Srinivas <venkateshs@google.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtio-dev@lists.oasis-open.org
Cc: virtualization@lists.linux-foundation.org
Cc: Frank Swiderski <fes@google.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If drivers do dynamic allocation in the hardware command init
path, then we need to be able to handle and return failures.
And if they do allocations or mappings in the init command path,
then we need a cleanup function to free up that space at exit
time. So add blk_mq_free_commands() as the cleanup function.
This is required for the mtip32xx driver conversion to blk-mq.
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch fixes 2 issues in the fast completion path:
1) Possible double completions / double dma_unmap_sg() calls due to lack
of atomicity in the check and subsequent dereference of the upper layer
callback function. Fixed with cmpxchg before unmap and callback.
2) Regression in unaligned IO constraining workaround for p420m devices.
Fixed by checking if IO is unaligned and using proper semaphore if so.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
If the buffers are unmapped after completing a request, then stale data
might be in the request.
Signed-off-by: Felipe Franciosi <felipe@paradoxo.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
We need to set the queue bounce limit during the device initialization to
prevent excessive bouncing on 32 bit architectures.
Signed-off-by: Felipe Franciosi <felipe@paradoxo.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
As result of deprecation of MSI-X/MSI enablement functions
pci_enable_msix() and pci_enable_msi_block() all drivers
using these two interfaces need to be updated to use the
new pci_enable_msi_range() or pci_enable_msi_exact()
and pci_enable_msix_range() or pci_enable_msix_exact()
interfaces.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-nvme@lists.infradead.org
Cc: linux-pci@vger.kernel.org
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently the driver falls back to INTx mode when MSI-X
initialization failed. This is a suboptimal behaviour
for chips that also support MSI. This update changes that
behaviour and falls back to MSI mode in case MSI-X mode
initialization failed.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: iss_storagedev@hp.com
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
interruptible_sleep_on is racy and going away. This replaces the one
caller in the swim3 driver with the equivalent race-free
wait_event_interruptible call. Since we're here already, this
also fixes the case where we get interrupted from atomic context,
which used to just spin in the loop.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
sleep_on() is inherently racy, and has been deprecated for a long time.
This fixes two instances in the atari floppy driver:
* fdc_wait/fdc_busy becomes an open-coded mutex. We cannot use the
regular mutex since it gets released in interrupt context. The
open-coded version using wait_event() and cmpxchg() is equivalent
to the existing code but does the checks atomically, and we can
now safely check the condition with irqs enabled.
* format_wait becomes a completion, which is the natural structure
here. The format ioctl waits for the background task to either
complete or abort.
This does not attempt to fix the preexisting bug of calling schedule
with local interrupts disabled.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michael Schmitz <schmitz@biophys.uni-duesseldorf.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
sleep_on and its variants are going away. The use of sleep_on() in
DAC960_V2_ExecuteUserCommand seems to be bogus because the command
by the time we get there, the command has completed already and
we just enter the timeout. Based on this interpretation, I concluded
that we can replace it with a simple msleep(1000) and rearrange the
code around it slightly.
The interruptible_sleep_on_timeout in DAC960_gam_ioctl seems equivalent
to the race-free version using wait_event_interruptible_timeout.
I left the driver to return -EINTR rather than -ERESTARTSYS to preserve
the timeout behavior.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit "mtip32xx: Use pci_enable_msix_range() instead of
pci_enable_msix()" was unnecessary, since pci_enable_msi()
function is not deprecated and is still preferable for
enabling the single MSI mode. This update reverts usage of
pci_enable_msi() function.
Besides, the changelog for that commit was bogus, since
mtip32xx driver uses MSI interrupt, not MSI-X.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
A bad implementation of virtio might cause us to mark the virtqueue
broken: we'll dev_err() in that case, and the device is useless, but
let's not BUG_ON().
ENOMEM or ENOSPC implies the ring is full, and we should try again
later (-ENOMEM is documented to happen, but doesn't, as we fall
through to ENOSPC).
EIO means it's broken.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
mtip_pci_probe() dumps the current CPU when loaded, but it does
so in a preemptible context. Hence smp_processor_id() correctly
warns:
BUG: using smp_processor_id() in preemptible [00000000] code: systemd-udevd/155
caller is mtip_pci_probe+0x53/0x880 [mtip32xx]
Switch to raw_smp_processor_id(), since it's just informational
and persistent accuracy isn't important.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block fixes from Jens Axboe:
"Small collection of fixes for 3.14-rc. It contains:
- Three minor update to blk-mq from Christoph.
- Reduce number of unaligned (< 4kb) in-flight writes on mtip32xx to
two. From Micron.
- Make the blk-mq CPU notify spinlock raw, since it can't be a
sleeper spinlock on RT. From Mike Galbraith.
- Drop now bogus BUG_ON() for bio iteration with blk integrity. From
Nic Bellinger.
- Properly propagate the SYNC flag on requests. From Shaohua"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: add REQ_SYNC early
rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
bio-integrity: Drop bio_integrity_verify BUG_ON in post bip->bip_iter world
blk-mq: support partial I/O completions
blk-mq: merge blk_mq_insert_request and blk_mq_run_request
blk-mq: remove blk_mq_alloc_rq
mtip32xx: Reduce the number of unaligned writes to 2
PREPARE_[DELAYED_]WORK() are being phased out. They have few users
and a nasty surprise in terms of reentrancy guarantee as workqueue
considers work items to be different if they don't have the same work
function.
nvme_dev->reset_work is multiplexed with multiple work functions.
Introduce nvme_reset_workfn() which invokes nvme_dev->reset_workfn and
always use it as the work function and update the users to set the
->reset_workfn field instead of overriding the work function using
PREPARE_WORK().
It would probably be best to route this with other related updates
through the workqueue tree.
Compile tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: linux-nvme@lists.infradead.org
PREPARE_[DELAYED_]WORK() are being phased out. They have few users
and a nasty surprise in terms of reentrancy guarantee as workqueue
considers work items to be different if they don't have the same work
function.
floppy has been multiplexing floppy_work and fd_timer with multiple
work functions. Introduce floppy_work_workfn() and fd_timer_workfn()
which invoke floppy_work_fn and fd_timer_fn respectively and always
use the two functions as the work functions and update the users to
set floppy_work_fn and fd_timer_fn instead of overriding work
functions using PREPARE_[DELAYED_]WORK().
It would probably be best to route this with other related updates
through the workqueue tree.
Lightly tested using qemu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jiri Kosina <jkosina@suse.cz>
zram_meta_alloc could fail so caller should check it. Otherwise, your
system will hang.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit bf6bddf1924e ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).
This results in a very rare NULL pointer dereference on the
aforementioned page_count(page). Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.
This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation. This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling. The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.
This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.
Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As result of deprecation of MSI-X/MSI enablement functions
pci_enable_msix() and pci_enable_msi_block() all drivers
using these two interfaces need to be updated to use the
new pci_enable_msi_range() and pci_enable_msix_range()
interfaces.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>