IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The macro for_each_memcg_cache_index contains a silly yet potentially
deadly mistake. Although the macro parameter is _idx, the loop tests
are done over i, not _idx.
This hasn't generated any problems so far, because all users use i as a
loop index. However, while playing with an extension of the code I
ended using another loop index and the compiler was quick to complain.
Unfortunately, this is not the kind of thing that testing reveals =(
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We use rwsem since commit 5a505085f043 ("mm/rmap: Convert the struct
anon_vma::mutex to an rwsem"). And most of comments are converted to
the new rwsem lock; while just 2 more missed from:
$ git grep 'anon_vma->mutex'
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Web resource http://avr32linux.org/ is no longer available. We add the
mirror of the web page foud at http://mirror.egtvedt.no/avr32linux.org/.
Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When setting a huge PTE, besides calling pte_mkhuge(), we also need to
call arch_make_huge_pte(), which we indeed do in make_huge_pte(), but we
forget to do in hugetlb_change_protection() and remove_migration_pte().
Signed-off-by: Zhigang Lu <zlu@tilera.com>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The year field is incorrectly masked when setting the date. If the year
is beyond 2099, the year field will be incorrectly updated in hardware.
This patch masks the year field correctly.
Signed-off-by: Edgar Toernig <froese@gmx.de>
Signed-off-by: Tony Prisk <linux@prisktech.co.nz>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no .gitignore in tools/vm, so 'git status' always show built
binaries. To ignore this, add .gitignore.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No reason to preserve the huge zero page in core dumps.
Reported-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There exists a situation when GC can work in background alone without
any other filesystem activity during significant time.
The nilfs_clean_segments() method calls nilfs_segctor_construct() that
updates superblocks in the case of NILFS_SC_SUPER_ROOT and
THE_NILFS_DISCONTINUED flags are set. But when GC is working alone the
nilfs_clean_segments() is called with unset THE_NILFS_DISCONTINUED flag.
As a result, the update of superblocks doesn't occurred all this time
and in the case of SPOR superblocks keep very old values of last super
root placement.
SYMPTOMS:
Trying to mount a NILFS2 volume after SPOR in such environment ends with
very long mounting time (it can achieve about several hours in some
cases).
REPRODUCING PATH:
1. It needs to use external USB HDD, disable automount and doesn't
make any additional filesystem activity on the NILFS2 volume.
2. Generate temporary file with size about 100 - 500 GB (for example,
dd if=/dev/zero of=<file_name> bs=1073741824 count=200). The size of
file defines duration of GC working.
3. Then it needs to delete file.
4. Start GC manually by means of command "nilfs-clean -p 0". When you
start GC by means of such way then, at the end, superblocks is updated
by once. So, for simulation of SPOR, it needs to wait sometime (15 -
40 minutes) and simply switch off USB HDD manually.
5. Switch on USB HDD again and try to mount NILFS2 volume. As a
result, NILFS2 volume will mount during very long time.
REPRODUCIBILITY: 100%
FIX:
This patch adds checking that superblocks need to update and set
THE_NILFS_DISCONTINUED flag before nilfs_clean_segments() call.
Reported-by: Sergey Alexandrov <splavgm@gmail.com>
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Tested-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch addresses a long standing issue of the driver with the
mac80211 .flush() callback. Since implementing the .flush() callback
a number of issues have been fixed, but a WARN_ON_ONCE() was still
triggered because the timeout on the flush could still occur.
This patch changes the awkward design using msleep() into one using
a waitqueue. The waiting flush() context will kick the transmit dma
when it is idle and the timeout used waiting for the event is set
to 500 ms. Worst case there can be 64 frames outstanding for transmit
in the driver. At a rate of 1Mbps that would take 1.5 seconds assuming
MTU is 1500 bytes and ignoring retries. The WARN_ON_ONCE() is also
removed as this was put in to indicate the flush timeout as a reason
for the driver to stall. That was not happening since fixing endless
AMPDU retries with following upstream commit:
commit 85091fc0a75653e239dc8379658515e577544927
Author: Arend van Spriel <arend@broadcom.com>
Date: Thu Feb 23 18:38:22 2012 +0100
brcm80211: smac: fix endless retry of A-MPDU transmissions
bugzilla: 42840 <https://bugzilla.kernel.org/show_bug.cgi?id=42840>
bugzilla@redhat: <https://bugzilla.redhat.com/show_bug.cgi?id=799168>
bugzilla@redhat: <https://bugzilla.redhat.com/show_bug.cgi?id=787649>
Cc: Jonathan Nieder <jrnieder@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Camaleón <noelamac@gmail.com>
Cc: Milan Bouchet-Valat <nalimilan@club-internet.fr>
Cc: Seth Forshee <seth.forshee@canonical.com>
Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com>
Reviewed-by: Hante Meuleman <meuleman@broadcom.com>
Reviewed-by: Piotr Haber <phaber@broadcom.com>
Signed-off-by: Arend van Spriel <arend@broadcom.com>
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch unregisters the gpio chip before ssb gets unloaded.
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch unregisters the gpio chip before bcma gets unloaded.
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Reported-by: Piotr Haber <phaber@broadcom.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Kernel commits 41affd5 and 6539306 changed the locking in rtl_lps_leave()
from a spinlock to a mutex by doing the calls indirectly from a work queue
to reduce the time that interrupts were disabled. This change was fine for
most systems; however a scheduling while atomic bug was reported in
https://bugzilla.redhat.com/show_bug.cgi?id=903881. The backtrace indicates
that routine rtl_is_special(), which calls rtl_lps_leave() in three places
was entered in atomic context. These direct calls are replaced by putting a
request on the appropriate work queue.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Reported-and-tested-by: Nathaniel Doherty <ntdoherty@gmail.com>
Cc: Nathaniel Doherty <ntdoherty@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Stable <stable@vger.kernel.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Return EINVAL from write if the size is larger than
allowed. Do this before allocating kernel memory for
the bogus size, which could lead to OOM.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Jana Saout <jana@saout.de>
Signed-off-by: David Teigland <teigland@redhat.com>
Pull x86 fixes from Ingo Molnar:
"Three small fixlets"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel/cacheinfo: Shut up annoying warning
x86, doc: Boot protocol 2.12 is in 3.8
x86-64: Replace left over sti/cli in ia32 audit exit code
Pull scheduler fixes from Ingo Molnar:
"Three small fixlets"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Fix format string for 32-bit platforms
sched: Fix warning in kernel/sched/fair.c
sched/rt: Use root_domain of rt_rq not current processor
Pull perf fixes from Ingo Molnar:
"Three fixlets and two small (and low risk) hw-enablement changes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix event group context move
x86/perf: Add IvyBridge EP support
perf/x86: Fix P6 driver section warning
arch/x86/tools/insn_sanity.c: Identify source of messages
perf/x86: Enable Intel Lincroft/Penwell/Cloverview Atom support
Pull two small RCU fixlets from Ingo Molnar.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Make rcu_nocb_poll an early_param instead of module_param
rcu: Prevent soft-lockup complaints about no-CBs CPUs
1. Optimize the match rules with new macro for Huawei USB storage devices,
to avoid to load USB storage driver for the modem interface
with Huawei devices.
2. Add to support new switch command for new Huawei USB dongles.
Signed-off-by: fangxiaozhi <huananhu@huawei.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1. Define a new macro for USB storage match rules:
matching with Vendor ID and interface descriptors.
Signed-off-by: fangxiaozhi <huananhu@huawei.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It is normal for minidrivers accumulating frames to return NULL
from their tx_fixup function. We do not want to count this as a
drop, or log any debug messages. A different exit path is
therefore chosen for such drivers, skipping the debug message
and the tx_dropped increment.
The test for accumulating drivers was however completely bogus,
making the exit path selection depend on whether the user had
enabled tx_err logging or not. This would arbitrarily mess up
accounting for both accumulating and non-accumulating minidrivers,
and would result in unwanted debug messages for the accumulating
drivers.
Fix by testing for FLAG_MULTI_PACKET instead, which probably was
the intention from the beginning. This usage match the documented
behaviour of this flag:
Indicates to usbnet, that USB driver accumulates multiple IP packets.
Affects statistic (counters) and short packet handling.
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates LINUX_MIB_LISTENDROPS and LINUX_MIB_LISTENOVERFLOWS in
tcp_v6_conn_request() and tcp_v6_err(). tcp_v6_conn_request() in particular can
drop SYNs for various reasons which are not currently tracked.
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch updates LINUX_MIB_LISTENDROPS in tcp_v4_conn_request() and
tcp_v4_err(). tcp_v4_conn_request() in particular can drop SYNs for various
reasons which are not currently tracked.
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On tile architecture (with "make allyesconfig") including
<linux/swiotlb.h> is required to call swiotlb_nr_tbl().
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Unfortunately, this name conflicts with a different use of
the name in various places through the tree, so don't provide
it for the kernel. We preserve it for userspace to avoid
breaking any userspace code that relies on this definition.
This fixes a number of compile errors for various drivers that
are enabled by "allyesconfig".
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
I've been getting the following warning when doing randbuilds
since forever. Now it finally pissed me off just the perfect
amount so that I can fix it.
arch/x86/kernel/cpu/intel_cacheinfo.c:489:27: warning: ‘cache_disable_0’ defined but not used [-Wunused-variable]
arch/x86/kernel/cpu/intel_cacheinfo.c:491:27: warning: ‘cache_disable_1’ defined but not used [-Wunused-variable] arch/x86/kernel/cpu/intel_cacheinfo.c:524:27: warning: ‘subcaches’ defined but not used [-Wunused-variable]
It happens because in randconfigs where CONFIG_SYSFS is not set,
the whole sysfs-interface to L3 cache index disabling is
remaining unused and gcc correctly warns about it. Make it
optional, depending on CONFIG_SYSFS too, as is the case with
other sysfs-related machinery in this file.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/1359969195-27362-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull powerpc update from Benjamin Herrenschmidt:
"Just so that you don't get too bored on your Island here's a patch for
3.8 fixing a nasty bug that affects the new 64T support that was
merged in 3.7. Please apply whenever you have a chance (and an
internet connection!)"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc/mm: Fix hash computation function
Pull radeon fixes from Dave Airlie:
"I got these late last week, the main chunks of these fix a rendering
regression since 3.7, and the settle ones all fix the issue where we
don't wait long enough for the memory controller to settle after
turning it off which causes bad memory reads, they all fix real users
bugs, and most of them are destined for stable.
Can't remember if you had net connection on that island :-)"
I don't know if the "two tin-cans and a string" thing here on "that
island" can really be considered internet, but I guess I can pull
things. Barely.
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon: switch back to the CP ring for VM PT updates
drm/radeon: prevent crash in the ring space allocation
drm/radeon: Calling object_unrefer() when creating fb failure
drm/radeon/r5xx-r7xx: wait for the MC to settle after MC blackout
drm/radeon/evergreen+: wait for the MC to settle after MC blackout
drm/radeon: protect against div by 0 in backend setup
drm/radeon: fix backend map setup on 1 RB sumo boards
drm/radeon: add quirk for RV100 board
drm/radeon: add WAIT_UNTIL to the non-VM safe regs list for cayman/TN
drm/radeon: fix MC blackout on evergreen+
The ASM version of hash computation function was truncating the upper bit.
Make the ASM version similar to hpt_hash function. Remove masking vsid bits.
Without this patch, we observed hang during bootup due to not satisfying page
fault request correctly. The fault handler used wrong hash values to update
the HPTE. Hence we kept looping with page fault.
hash_page(ea=000001003e260008, access=203, trap=300 ip=3fff91787134 dsisr 42000000
The computed value of hash 000000000f22f390
update: avpnv=4003e46054003e00, hash=000000000722f390, f=80000006, psize: 2 ...
BenH: The over-masking has been there for ever but only hurts with the
new 64T support introduced in 3.7
Reported-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org> [v3.7]
When releasing a packet socket, the routine packet_set_ring() is reused
to free rings instead of allocating them. But when calling it for the
first time, it fills req->tp_block_nr with the value of rb->pg_vec_len
which in the second invocation makes it bail out since req->tp_block_nr
is greater zero but req->tp_block_size is zero.
This patch solves the problem by passing a zeroed auto-variable to
packet_set_ring() upon each invocation from packet_release().
As far as I can tell, this issue exists even since 69e3c75 (net: TX_RING
and packet mmap), i.e. the original inclusion of TX ring support into
af_packet, but applies only to sockets with both RX and TX ring
allocated, which is probably why this was unnoticed all the time.
Signed-off-by: Phil Sutter <phil.sutter@viprinet.com>
Cc: Johann Baudy <johann.baudy@gnu-log.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use correct inner offset to set inner_network_offset.
Found by inspection.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9dc274151a548 (tcp: fix ABC in tcp_slow_start())
uncovered a bug in FRTO code :
tcp_process_frto() is setting snd_cwnd to 0 if the number
of in flight packets is 0.
As Neal pointed out, if no packet is in flight we lost our
chance to disambiguate whether a loss timeout was spurious.
We should assume it was a proper loss.
Reported-by: Pasi Kärkkäinen <pasik@iki.fi>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 9dc274151a548 (tcp: fix ABC in tcp_slow_start()),
a nul snd_cwnd triggers an infinite loop in tcp_slow_start()
Avoid this infinite loop and log a one time error for further
analysis. FRTO code is suspected to cause this bug.
Reported-by: Pasi Kärkkäinen <pasik@iki.fi>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we have group with mixed events (hw/sw) we want to end up
with group leader being in hw context. So if group leader is
initialy sw event, we move all the events under hw context.
The move is done for each event by removing it from its context
and adding it back into proper one. As a part of the removal the
event is automatically disabled, which is not what we want at
this stage of creating groups.
The fix is to initialize event state after removal from sw
context.
This fix resulted from the following discussion:
http://thread.gmane.org/gmane.linux.kernel.perf.user/1144
Reported-by: Andreas Hollmann <hollmann@in.tum.de>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vince@deater.net>
Link: http://lkml.kernel.org/r/1359714225-4231-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Marc Kleine-Budde says:
====================
here's a patch for net for the v3.8 release cycle. Alexander Stein noticed that
the c_can hardware has a fixed bit in the IFx_MASK2 register. His patch fixes
writing of this register by always setting this bit.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
1) rhine_tx() should use dev_kfree_skb() not dev_kfree_skb_irq()
2) rhine_slow_event_task's NAPI triggering logic is racey, it
should just hit the interrupt mask register. This is the
same as commit 7dbb491878a2c51d372a8890fa45a8ff80358af1
("r8169: avoid NAPI scheduling delay.") made to fix the same
problem in the r8169 driver. From Francois Romieu.
Reported-by: Jamie Gloudon <jamie.gloudon@gmail.com>
Tested-by: Jamie Gloudon <jamie.gloudon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add PID/VID entries for ELV WS 300 PC II weather station
Signed-off-by: Sven Killig <sven@killig.de>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull scsi target fixes from Nicholas Bellinger:
"Here's the current set of v3.8-rc fixes in the target-pending.git
queue. Apologies in advance for these missing the -rc6 release, and
having to be destined for -rc7 code.
The majority of these patches are regression bugfixes specific to
v3.8-rc code changes, namely the zero-length CDB handling breakage
after the sense_reason_t conversion, and preventing configfs port
linking for unconfigured devices after the recent struct
se_subsystem_dev removal. These is also one (the divide by zero bug
for unconfigured devices) that is CC'ed to stable."
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
target: Fix divide by zero bug in fabric_max_sectors for unconfigured devices
target: Fix regression allowing unconfigured devices to fabric port link
tcm_vhost: fix pr_err on early kick
target: Fix zero-length READ_CAPACITY_16 regression
target: Fix zero-length MODE_SENSE regression
target: Fix zero-length INQUIRY additional sense code regression
John W. Linville says:
====================
This is a small batch of fixes intended for the 3.8 stream...
There are two pulls from Johannes. Regarding mac80211, Johannes says:
"One fix from Dan for a possible memory overrun."
Regarding iwlwifi, Johannes says:
"I have one fix from Emmanuel reverting a previous fix that caused
more trouble than it's worth."
Along with those:
Arend van Spriel fixes a fatal error in brcsmac related to tx status processing.
Bing Zhao corrects a problem where mwifiex would fail to complete a scan
in the event of an IE processing error.
Larry Finger fixes a thinko in rtlwifi in which the wrong skb variable
was being used in some cases.
Rafał Miłecki fixes a thinko in an ID check in the bcma flash code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We batch up operations to the extent allocation tree, which allows
us to deal with the recursive nature of using the extent allocation
tree to allocate extents to the extent allocation tree.
It also provides a mechanism to sort and collect extent
operations, which makes it much more efficient to record extents
that are close together.
The delayed extent operations must all be finished before the
running transaction commits, so we have code to make sure and run a few
of the batched operations when closing our transaction handles.
This creates a great deal of contention for the locks in the
delayed extent operation tree, and also contention for the lock on the
extent allocation tree itself. All the extra contention just slows
down the operations and doesn't get things done any faster.
This commit changes things to use a wait queue instead. As procs
want to run the delayed operations, one of them races in and gets
permission to hit the tree, and the others step back and wait for
progress to be made.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The extent buffers have a refs_lock which we use to make coordinate freeing
the extent buffer with operations on the radix tree. On tree roots and
other extent buffers that very cache hot, this can be highly contended.
These are also the extent buffers that are basically pinned in memory.
This commit adds code to cmpxchg our way through the ref modifications,
and as long as the result of the reference change is still pinned in
ram, we skip the expensive spinlock.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With the new raid56 code, we want to make sure we're
properly aligning our allocation clusters with -o ssd
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Buffered writes and DIRECT_IO writes will often break up
big contiguous changes to the file into sub-stripe writes.
This adds a plugging callback to gather those smaller writes full stripe
writes.
Example on flash:
fio job to do 64K writes in batches of 3 (which makes a full stripe):
With plugging: 450MB/s
Without plugging: 220MB/s
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The stripe cache allows us to avoid extra read/modify/write cycles
by caching the pages we read off the disk. Pages are cached when:
* They are read in during a read/modify/write cycle
* They are written during a read/modify/write cycle
* They are involved in a parity rebuild
Pages are not cached if we're doing a full stripe write. We're
assuming that a full stripe write won't be followed by another
partial stripe write any time soon.
This provides a substantial boost in performance for workloads that
synchronously modify adjacent offsets in the file, and for the parity
rebuild use case in general.
The size of the stripe cache isn't tunable (yet) and is set at 1024
entries.
Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct
Without the stripe cache -- 2.1MB/s
With the stripe cache 21MB/s
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.
Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio. This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.
But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle. This will be addressed in a
later commit.
Scrub is unable to repair crc errors on raid5/6 chunks.
Discard does not work on raid5/6 (yet)
The stripe size is fixed at 64KiB per disk. This will be tunable
in a later commit.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>