IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
vm counters are allowed to be racy. Use raw_cpu_ops to avoid the
local_irq_disable overhead and to avoid preemption checks which will be
added to the __this_cpu operations.
[akpm@linux-foundation.org: Add comment. Again.]
Signed-off-by: Christoph Lameter <cl@linux.com>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the preempt checking logic for __this_cpu_ops we will get false
positives from locations in the code that use numa_node_id.
Before the __this_cpu ops where introduced there were no checks for
preemption present either. smp_raw_processor_id() was used. See
http://www.spinics.net/lists/linux-numa/msg00641.html
Therefore we need to use raw_cpu_read here to avoid false postives.
Note that this issue has been discussed in prior years. If the process
changes nodes after retrieving the current numa node then that is
acceptable since most uses of numa_node etc are for optimization and not
for correctness.
There were suggestions to implement a raw_numa_node_id in order to do
preempt checks for numa_node_id as well. But I think we better defer
that to another patch since that would mean investigating how
numa_node_id() is used throughout the kernel which would increase the
scope of this patchset significantly. After all preemption was never
checked before when numa_node_id() was used.
Some sample traces:
__this_cpu_read operation in preemptible [00000000] code: login/1456
caller is __this_cpu_preempt_check+0x2b/0x2d
CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3-dirty #185
Call Trace:
dump_stack+0x4e/0x82
check_preemption_disabled+0xc5/0xe0
__this_cpu_preempt_check+0x2b/0x2d
get_task_policy+0x1d/0x49
get_vma_policy+0x14/0x76
alloc_pages_vma+0x35/0xff
handle_mm_fault+0x290/0x73b
__do_page_fault+0x3fe/0x44d
do_page_fault+0x9/0xc
page_fault+0x22/0x30
generic_file_aio_read+0x38e/0x624
do_sync_read+0x54/0x73
vfs_read+0x9d/0x12a
SyS_read+0x47/0x7e
cstar_dispatch+0x7/0x23
caller is __this_cpu_preempt_check+0x2b/0x2d
CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3-dirty #185
Call Trace:
dump_stack+0x4e/0x82
check_preemption_disabled+0xc5/0xe0
__this_cpu_preempt_check+0x2b/0x2d
alloc_pages_current+0x8f/0xbc
__page_cache_alloc+0xb/0xd
__do_page_cache_readahead+0xf4/0x219
ra_submit+0x1c/0x20
ondemand_readahead+0x28c/0x2b4
page_cache_sync_readahead+0x38/0x3a
generic_file_aio_read+0x261/0x624
do_sync_read+0x54/0x73
vfs_read+0x9d/0x12a
SyS_read+0x47/0x7e
cstar_dispatch+0x7/0x23
Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Alex Shi <alex.shi@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel has never been audited to ensure that this_cpu operations are
consistently used throughout the kernel. The code generated in many
places can be improved through the use of this_cpu operations (which
uses a segment register for relocation of per cpu offsets instead of
performing address calculations).
The patch set also addresses various consistency issues in general with
the per cpu macros.
A. The semantics of __this_cpu_ptr() differs from this_cpu_ptr only
because checks are skipped. This is typically shown through a raw_
prefix. So this patch set changes the places where __this_cpu_ptr()
is used to raw_cpu_ptr().
B. There has been the long term wish by some that __this_cpu operations
would check for preemption. However, there are cases where preemption
checks need to be skipped. This patch set adds raw_cpu operations that
do not check for preemption and then adds preemption checks to the
__this_cpu operations.
C. The use of __get_cpu_var is always a reference to a percpu variable
that can also be handled via a this_cpu operation. This patch set
replaces all uses of __get_cpu_var with this_cpu operations.
D. We can then use this_cpu RMW operations in various places replacing
sequences of instructions by a single one.
E. The use of this_cpu operations throughout will allow other arches than
x86 to implement optimized references and RMV operations to work with
per cpu local data.
F. The use of this_cpu operations opens up the possibility to
further optimize code that relies on synchronization through
per cpu data.
The patch set works in a couple of stages:
I. Patch 1 adds the additional raw_cpu operations and raw_cpu_ptr().
Also converts the existing __this_cpu_xx_# primitive in the x86
code to raw_cpu_xx_#.
II. Patch 2-4 use the raw_cpu operations in places that would give
us false positives once they are enabled.
III. Patch 5 adds preemption checks to __this_cpu operations to allow
checking if preemption is properly disabled when these functions
are used.
IV. Patches 6-20 are patches that simply replace uses of __get_cpu_var
with this_cpu_ptr. They do not depend on any changes to the percpu
code. No preemption tests are skipped if they are applied.
V. Patches 21-46 are conversion patches that use this_cpu operations
in various kernel subsystems/drivers or arch code.
VI. Patches 47/48 (not included in this series) remove no longer used
functions (__this_cpu_ptr and __get_cpu_var). These should only be
applied after all the conversion patches have made it and after we
have done additional passes through the kernel to ensure that none of
the uses of these functions remain.
This patch (of 46):
The patches following this one will add preemption checks to __this_cpu
ops so we need to have an alternative way to use this_cpu operations
without preemption checks.
raw_cpu_ops will be the basis for all other ops since these will be the
operations that do not implement any checks.
Primitive operations are renamed by this patch from __this_cpu_xxx to
raw_cpu_xxxx.
Also change the uses of the x86 percpu primitives in preempt.h.
These depend directly on asm/percpu.h (header #include nesting issue).
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Alex Shi <alex.shi@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bryan Wu <cooloney@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: David Daney <david.daney@cavium.com>
Cc: David Miller <davem@davemloft.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Hedi Berriche <hedi@sgi.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Mike Travis <travis@sgi.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Robert Richter <rric@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wim Van Sebroeck <wim@iguana.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we try to arrange sysfs entries for memcg caches in the same
manner as for global caches. Apart from turning /sys/kernel/slab into a
mess when there are a lot of kmem-active memcgs created, it actually
does not work properly - we won't create more than one link to a memcg
cache in case its parent is merged with another cache. For instance, if
A is a root cache merged with another root cache B, we will have the
following sysfs setup:
X
A -> X
B -> X
where X is some unique id (see create_unique_id()). Now if memcgs M and
N start to allocate from cache A (or B, which is the same), we will get:
X
X:M
X:N
A -> X
B -> X
A:M -> X:M
A:N -> X:N
Since B is an alias for A, we won't get entries B:M and B:N, which is
confusing.
It is more logical to have entries for memcg caches under the
corresponding root cache's sysfs directory. This would allow us to keep
sysfs layout clean, and avoid such inconsistencies like one described
above.
This patch does the trick. It creates a "cgroup" kset in each root
cache kobject to keep its children caches there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we destroy children caches at the very beginning of
kmem_cache_destroy(). This is wrong, because the root cache will not
necessarily be destroyed in the end - if it has aliases (refcount > 0),
kmem_cache_destroy() will simply decrement its refcount and return. In
this case, at best we will get a bunch of warnings in dmesg, like this
one:
kmem_cache_destroy kmalloc-32:0: Slab cache still has objects
CPU: 1 PID: 7139 Comm: modprobe Tainted: G B W 3.13.0+ #117
Call Trace:
dump_stack+0x49/0x5b
kmem_cache_destroy+0xdf/0xf0
kmem_cache_destroy_memcg_children+0x97/0xc0
kmem_cache_destroy+0xf/0xf0
xfs_mru_cache_uninit+0x21/0x30 [xfs]
exit_xfs_fs+0x2e/0xc44 [xfs]
SyS_delete_module+0x198/0x1f0
system_call_fastpath+0x16/0x1b
At worst - if kmem_cache_destroy() will race with an allocation from a
memcg cache - the kernel will panic.
This patch fixes this by moving children caches destruction after the
check if the cache has aliases. Plus, it forbids destroying a root
cache if it still has children caches, because each children cache keeps
a reference to its parent.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memcg-awareness turned kmem_cache_create() into a dirty interweaving of
memcg-only and except-for-memcg calls. To clean this up, let's move the
code responsible for memcg cache creation to a separate function.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch cleans up the memcg cache creation path as follows:
- Move memcg cache name creation to a separate function to be called
from kmem_cache_create_memcg(). This allows us to get rid of the mutex
protecting the temporary buffer used for the name formatting, because
the whole cache creation path is protected by the slab_mutex.
- Get rid of memcg_create_kmem_cache(). This function serves as a proxy
to kmem_cache_create_memcg(). After separating the cache name creation
path, it would be reduced to a function call, so let's inline it.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the renamed symbol is defined lib/iomap.c implements ioport_map and
ioport_unmap and currently (nearly) all platforms define the port
accessor functions outb/inb and friend unconditionally. So
HAS_IOPORT_MAP is the better name for this.
Consequently NO_IOPORT is renamed to NO_IOPORT_MAP.
The motivation for this change is to reintroduce a symbol HAS_IOPORT
that signals if outb/int et al are available. I will address that at
least one merge window later though to keep surprises to a minimum and
catch new introductions of (HAS|NO)_IOPORT.
The changes in this commit were done using:
$ git grep -l -E '(NO|HAS)_IOPORT' | xargs perl -p -i -e 's/\b((?:CONFIG_)?(?:NO|HAS)_IOPORT)\b/$1_MAP/'
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The stub version of WARN for !CONFIG_BUG completely ignored its format
string and subsequent arguments; make it check them instead, using
no_printk.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When !CONFIG_BUG, WARN_ON and family become simple passthroughs of their
condition argument; however, WARN_ON_ONCE and family still have conditions
and a boolean to detect one-time invocation, even though the warning
they'd emit doesn't exist. Make the existing definitions conditional on
CONFIG_BUG, and add definitions for !CONFIG_BUG that map to the
passthrough versions of WARN and WARN_ON.
This saves 4.4k on a minimized configuration (smaller than allnoconfig),
and 20.6k with defconfig plus CONFIG_BUG=n.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes an artificial RapidIO bus root device and establishes
actual device hierarchy by providing reference to real parent devices.
It also introduces device class for RapidIO controller devices (on-chip
or an eternal bridge, known as "mport").
Existing implementation was sufficient for SoC-based platforms that have
a single RapidIO controller. With introduction of devices using
multiple RapidIO controllers and PCIe-to-RapidIO bridges the old scheme
is very limiting or does not work at all. The implemented changes allow
to properly reference platform's local RapidIO mport devices and provide
device details needed for upper layers.
This change to RapidIO device hierarchy does not break any known
existing kernel or user space interfaces.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
Cc: Jerry Jacobs <jerry.jacobs@prodrive-technologies.com>
Cc: Arno Tiemersma <arno.tiemersma@prodrive-technologies.com>
Cc: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove no longer used deprecated code, and make local functions
static.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Jean Delvare <jdelvare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: George Spelvin <linux@horizon.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_task_state() uses the most significant bit to report the state to
user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
can be noticed via /proc as Z -> X -> Z change. Note that this was
possible even before EXIT_TRACE was introduced.
This is not really bad but imho it make sense to hide EXIT_TRACE from
user-space completely. So the patch simply swaps EXIT_ZOMBIE and
EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Lennart Poettering <lpoetter@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
drops tasklist_lock. If this task is not the natural child and it is
traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
The last transition is racy, this is even documented in 50b8d257486a
"ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
race". wait_consider_task() tries to detect this transition and clear
->notask_error but we can't rely on ptrace_reparented(), debugger can
exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
And there is another problem which were missed before: this transition
can also race with reparent_leader() which doesn't reset >exit_signal if
EXIT_DEAD, assuming that this task must be reaped by someone else. So
the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
/sbin/init doesn't use __WALL it becomes unreapable. This was fixed by
the previous commit, but it was the temporary hack.
1. Add the new exit_state, EXIT_TRACE. It means that the task is the
traced zombie, debugger is going to detach and notify its natural
parent.
This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
can avoid the changes in proc/kgdb code, get_task_state() still
reports "X (dead)" in this case.
Note: with or without this change userspace can see Z -> X -> Z
transition. Not really bad, but probably makes sense to fix.
2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
if we need to notify the ->real_parent.
3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
is always the final state we can safely ignore such a task.
4. Change wait_consider_task() to check EXIT_TRACE separately and kill
the racy and no longer needed ptrace_reparented() case.
If ptrace == T an EXIT_TRACE thread should be simply ignored, the
owner of this state is going to ptrace_unlink() this task. We can
pretend that it was already removed from ->ptraced list.
Otherwise we should skip this thread too but clear ->notask_error,
we must be the natural parent and debugger is going to untrace and
notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
even if the task was already untraced.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
Reported-by: Michal Schmidt <mschmidt@redhat.com>
Tested-by: Michal Schmidt <mschmidt@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Lennart Poettering <lpoetter@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Starting from commit c4ad8f98bef7 ("execve: use 'struct filename *' for
executable name passing") bprm->filename can not go away after
flush_old_exec(), so we do not need to save the binary name in
bprm->tcomm[] added by 96e02d158678 ("exec: fix use-after-free bug in
setup_new_exec()").
And there was never need for filename_to_taskname-like code, we can
simply do set_task_comm(kbasename(filename).
This patch has to change set_task_comm() and trace_task_rename() to
accept "const char *", but I think this change is also good.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
LAST_CPUPID_MASK is calculated using LAST_CPUPID_WIDTH. However
LAST_CPUPID_WIDTH itself can be 0. (when LAST_CPUPID_NOT_IN_PAGE_FLAGS is
set). In such a case LAST_CPUPID_MASK turns out to be 0.
But with recent commit 1ae71d0319: (mm: numa: bugfix for
LAST_CPUPID_NOT_IN_PAGE_FLAGS) if LAST_CPUPID_MASK is 0,
page_cpupid_xchg_last() and page_cpupid_reset_last() causes
page->_last_cpupid to be set to 0.
This causes performance regression. Its almost as if numa_balancing is
off.
Fix LAST_CPUPID_MASK by using LAST_CPUPID_SHIFT instead of
LAST_CPUPID_WIDTH.
Some performance numbers and perf stats with and without the fix.
(3.14-rc6)
----------
numa01
Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
12,27,462 cs [100.00%]
2,41,957 migrations [100.00%]
1,68,01,713 faults [100.00%]
7,99,35,29,041 cache-misses
98,808 migrate:mm_migrate_pages [100.00%]
1407.690148814 seconds time elapsed
numa02
Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
63,065 cs [100.00%]
14,364 migrations [100.00%]
2,08,118 faults [100.00%]
25,32,59,404 cache-misses
12 migrate:mm_migrate_pages [100.00%]
63.840827219 seconds time elapsed
(3.14-rc6 with fix)
-------------------
numa01
Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
9,68,911 cs [100.00%]
1,01,414 migrations [100.00%]
88,38,697 faults [100.00%]
4,42,92,51,042 cache-misses
4,25,060 migrate:mm_migrate_pages [100.00%]
685.965331189 seconds time elapsed
numa02
Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
17,543 cs [100.00%]
2,962 migrations [100.00%]
1,17,843 faults [100.00%]
11,80,61,644 cache-misses
12,358 migrate:mm_migrate_pages [100.00%]
20.380132343 seconds time elapsed
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
ra_submit with a single function call.
Move ra_submit to internal.h and inline it to save some stack. Thanks
to Andrew Morton for commenting different versions.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's only one caller of set_page_dirty_balance() and that will call it
with page_mkwrite == 0.
The page_mkwrite argument was unused since commit b827e496c893 "mm: close
page_mkwrite races".
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_newpage_charge is used only for charging anonymous memory so
it is better to rename it to mem_cgroup_charge_anon.
mem_cgroup_cache_charge is used for file backed memory so rename it to
mem_cgroup_charge_file.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of returning NULL from try_get_mem_cgroup_from_mm() when the mm
owner is exiting, just return root_mem_cgroup. This makes sense for all
callsites and gets rid of some of them having to fallback manually.
[fengguang.wu@intel.com: fix warnings]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I tried to use 'dump_page(page, __func__)' for debugging, but it triggers
warning:
warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]
Let's convert 'reason' to 'const char *' in dump_page() and friends: we
shouldn't modify it anyway.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The res_counter_{charge,uncharge}_locked() variants are not used in the
kernel outside of the resource counter code itself, so remove the
interface.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Tim Hockin <thockin@google.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
There's no significant performance degradation to checking
current->mempolicy rather than current->flags & PF_MEMPOLICY in the
allocation path, especially since this is considered unlikely().
Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
64GB of memory and without a mempolicy:
threads before after
16 1249409 1244487
32 1281786 1246783
48 1239175 1239138
64 1244642 1241841
80 1244346 1248918
96 1266436 1254316
112 1307398 1312135
128 1327607 1326502
Per-process flags are a scarce resource so we should free them up whenever
possible and make them available. We'll be using it shortly for memcg oom
reserves.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Tim Hockin <thockin@google.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slab_node() is actually a mempolicy function, so rename it to
mempolicy_slab_node() to make it clearer that it used for processes with
mempolicies.
At the same time, cleanup its code by saving numa_mem_id() in a local
variable (since we require a node with memory, not just any node) and
remove an obsolete comment that assumes the mempolicy is actually passed
into the function.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Tim Hockin <thockin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed. There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma(). Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.
We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality. On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.
The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number. The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed. Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question. Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:
1) System bootup: Most programs are single threaded, so the per-thread
scheme does improve ~50% hit rate by just adding a few more slots to
the cache.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 50.61% | 19.90 |
| patched | 73.45% | 13.58 |
+----------------+----------+------------------+
2) Kernel build: This one is already pretty good with the current
approach as we're dealing with good locality.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 75.28% | 11.03 |
| patched | 88.09% | 9.31 |
+----------------+----------+------------------+
3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 70.66% | 17.14 |
| patched | 91.15% | 12.57 |
+----------------+----------+------------------+
4) Ebizzy: There's a fair amount of variation from run to run, but this
approach always shows nearly perfect hit rates, while baseline is just
about non-existent. The amounts of cycles can fluctuate between
anywhere from ~60 to ~116 for the baseline scheme, but this approach
reduces it considerably. For instance, with 80 threads:
+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline | 1.06% | 91.54 |
| patched | 99.97% | 14.18 |
+----------------+----------+------------------+
[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.
It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here's new version of faultaround patchset. It took a while to tune it
and collect performance data.
First patch adds new callback ->map_pages to vm_operations_struct.
->map_pages() is called when VM asks to map easy accessible pages.
Filesystem should find and map pages associated with offsets from
"pgoff" till "max_pgoff". ->map_pages() is called with page table
locked and must not block. If it's not possible to reach a page without
blocking, filesystem should skip it. Filesystem should use do_set_pte()
to setup page table entry. Pointer to entry associated with offset
"pgoff" is passed in "pte" field in vm_fault structure. Pointers to
entries for other offsets should be calculated relative to "pte".
Currently VM use ->map_pages only on read page fault path. We try to
map FAULT_AROUND_PAGES a time. FAULT_AROUND_PAGES is 16 for now.
Performance data for different FAULT_AROUND_ORDER is below.
TODO:
- implement ->map_pages() for shmem/tmpfs;
- modify get_user_pages() to be able to use ->map_pages() and implement
mmap(MAP_POPULATE|MAP_NONBLOCK) on top.
=========================================================================
Tested on 4-socket machine (120 threads) with 128GiB of RAM.
Few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
somewhere between 3 and 5. Let's say 4 :)
Linux build (make -j60)
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
minor-faults 283,301,572 247,151,987 212,215,789 204,772,882 199,568,944 194,703,779 193,381,485
time, seconds 151.227629483 153.920996480 151.356125472 150.863792049 150.879207877 151.150764954 151.450962358
Linux rebuild (make -j60)
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
minor-faults 5,396,854 4,148,444 2,855,286 2,577,282 2,361,957 2,169,573 2,112,643
time, seconds 27.404543757 27.559725591 27.030057426 26.855045126 26.678618635 26.974523490 26.761320095
Git test suite (make -j60 test)
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
minor-faults 129,591,823 99,200,751 66,106,718 57,606,410 51,510,808 45,776,813 44,085,515
time, seconds 66.087215026 64.784546905 64.401156567 65.282708668 66.034016829 66.793780811 67.237810413
Two synthetic tests: access every word in file in sequential/random order.
It doesn't improve much after FAULT_AROUND_ORDER == 4.
Sequential access 16GiB file
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
1 thread
minor-faults 4,195,437 2,098,275 525,068 262,251 131,170 32,856 8,282
time, seconds 7.250461742 6.461711074 5.493859139 5.488488147 5.707213983 5.898510832 5.109232856
8 threads
minor-faults 33,557,540 16,892,728 4,515,848 2,366,999 1,423,382 442,732 142,339
time, seconds 16.649304881 9.312555263 6.612490639 6.394316732 6.669827501 6.75078944 6.371900528
32 threads
minor-faults 134,228,222 67,526,810 17,725,386 9,716,537 4,763,731 1,668,921 537,200
time, seconds 49.164430543 29.712060103 12.938649729 10.175151004 11.840094583 9.594081325 9.928461797
60 threads
minor-faults 251,687,988 126,146,952 32,919,406 18,208,804 10,458,947 2,733,907 928,217
time, seconds 86.260656897 49.626551828 22.335007632 17.608243696 16.523119035 16.339489186 16.326390902
120 threads
minor-faults 503,352,863 252,939,677 67,039,168 35,191,827 19,170,091 4,688,357 1,471,862
time, seconds 124.589206333 79.757867787 39.508707872 32.167281632 29.972989292 28.729834575 28.042251622
Random access 1GiB file
1 thread
minor-faults 262,636 132,743 34,369 17,299 8,527 3,451 1,222
time, seconds 15.351890914 16.613802482 16.569227308 15.179220992 16.557356122 16.578247824 15.365266994
8 threads
minor-faults 2,098,948 1,061,871 273,690 154,501 87,110 25,663 7,384
time, seconds 15.040026343 15.096933500 14.474757288 14.289129964 14.411537468 14.296316837 14.395635804
32 threads
minor-faults 8,390,734 4,231,023 1,054,432 528,847 269,242 97,746 26,881
time, seconds 20.430433109 21.585235358 22.115062928 14.872878951 14.880856305 14.883370649 14.821261690
60 threads
minor-faults 15,733,258 7,892,809 1,973,393 988,266 594,789 164,994 51,691
time, seconds 26.577302548 25.692397770 18.728863715 20.153026398 21.619101933 17.745086260 17.613215273
120 threads
minor-faults 31,471,111 15,816,616 3,959,209 1,978,685 1,008,299 264,635 96,010
time, seconds 41.835322703 40.459786095 36.085306105 35.313894834 35.814445675 36.552633793 34.289210594
Touch only one page in page table in 16GiB file
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
1 thread
minor-faults 8,372 8,324 8,270 8,260 8,249 8,239 8,237
time, seconds 0.039892712 0.045369149 0.051846126 0.063681685 0.079095975 0.17652406 0.541213386
8 threads
minor-faults 65,731 65,681 65,628 65,620 65,608 65,599 65,596
time, seconds 0.124159196 0.488600638 0.156854426 0.191901957 0.242631486 0.543569456 1.677303984
32 threads
minor-faults 262,388 262,341 262,285 262,276 262,266 262,257 263,183
time, seconds 0.452421421 0.488600638 0.565020946 0.648229739 0.789850823 1.651584361 5.000361559
60 threads
minor-faults 491,822 491,792 491,723 491,711 491,701 491,691 491,825
time, seconds 0.763288616 0.869620515 0.980727360 1.161732354 1.466915814 3.04041448 9.308612938
120 threads
minor-faults 983,466 983,655 983,366 983,372 983,363 984,083 984,164
time, seconds 1.595846553 1.667902182 2.008959376 2.425380942 2.941368804 5.977807890 18.401846125
This patch (of 2):
Introduce new vm_ops callback ->map_pages() and uses it for mapping easy
accessible pages around fault address.
On read page fault, if filesystem provides ->map_pages(), we try to map up
to FAULT_AROUND_PAGES pages around page fault address in hope to reduce
number of minor page faults.
We call ->map_pages first and use ->fault() as fallback if page by the
offset is not ready to be mapped (cold page cache or something).
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs. It
also adds a prctl control which allows us to set the THP disable bit in
mm->def_flags so that VMs will pick up the setting as they are created.
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The purpose of this single series of commits from Srivatsa S Bhat (with
a small piece from Gautham R Shenoy) touching multiple subsystems that use
CPU hotplug notifiers is to provide a way to register them that will not
lead to deadlocks with CPU online/offline operations as described in the
changelog of commit 93ae4f978ca7f (CPU hotplug: Provide lockless versions
of callback registration functions).
The first three commits in the series introduce the API and document it
and the rest simply goes through the users of CPU hotplug notifiers and
converts them to using the new method.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTQow2AAoJEILEb/54YlRxW4QQAJlYRDUzwFJzJzYhltQYuVR+
4D74XMtvXgoJfg3cwdSWvMKKpJZnA9BVN0f7Hcx9wYmgdexYUuHeZJmMNyc3S2+g
KjKBIsugvgmZhHbbLd6TJ6GBbhGT5JLt9VmSfL9zIkveInU1YHFUUqL/mxdHm4J0
BSGKjk2rN3waRJgmY+xfliFLtQjDKFwJpMuvrgtoUyfas3f4sIV43UNbqdvA/weJ
rzedxXOlKH/id4b56lj/4iIzcoL3mwvJJ7r6n0CEMsKv87z09kqR0O+69Tsq/cgs
j17CsvoJOmZGk3QTeKVMQWBsvk6aPoDu3zK83gLbQMt+qjOpSTbJLz/3HZw4/TrW
ss4nuZne1DLMGS+6hoxYbTP+6Ni//Kn+l/LrHc5jb7m1X3lMO4W2aV3IROtIE1rv
lEP1IG01NU4u9YwkVj1dyhrkSp8tLPul4SrUK8W+oNweOC5crjJV7vJbIPJgmYiM
IZN55wln0yVRtR4TX+rmvN0PixsInE8MeaVCmReApyF9pdzul/StxlBze5BKLSJD
cqo1kNPpsmdxoDucqUpQ/gSvy+IOl2qnlisB5PpV93sk7De6TFDYrGHxjYIW7jMf
StXwdCDDQhzd2Q8Kfpp895A1dbIl8rKtwA6bTU2eX+BfMVFzuMdT44cvosx1+UdQ
sWl//rg76nb13dFjvF+q
=SW7Q
-----END PGP SIGNATURE-----
Merge tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull CPU hotplug notifiers registration fixes from Rafael Wysocki:
"The purpose of this single series of commits from Srivatsa S Bhat
(with a small piece from Gautham R Shenoy) touching multiple
subsystems that use CPU hotplug notifiers is to provide a way to
register them that will not lead to deadlocks with CPU online/offline
operations as described in the changelog of commit 93ae4f978ca7f ("CPU
hotplug: Provide lockless versions of callback registration
functions").
The first three commits in the series introduce the API and document
it and the rest simply goes through the users of CPU hotplug notifiers
and converts them to using the new method"
* tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (52 commits)
net/iucv/iucv.c: Fix CPU hotplug callback registration
net/core/flow.c: Fix CPU hotplug callback registration
mm, zswap: Fix CPU hotplug callback registration
mm, vmstat: Fix CPU hotplug callback registration
profile: Fix CPU hotplug callback registration
trace, ring-buffer: Fix CPU hotplug callback registration
xen, balloon: Fix CPU hotplug callback registration
hwmon, via-cputemp: Fix CPU hotplug callback registration
hwmon, coretemp: Fix CPU hotplug callback registration
thermal, x86-pkg-temp: Fix CPU hotplug callback registration
octeon, watchdog: Fix CPU hotplug callback registration
oprofile, nmi-timer: Fix CPU hotplug callback registration
intel-idle: Fix CPU hotplug callback registration
clocksource, dummy-timer: Fix CPU hotplug callback registration
drivers/base/topology.c: Fix CPU hotplug callback registration
acpi-cpufreq: Fix CPU hotplug callback registration
zsmalloc: Fix CPU hotplug callback registration
scsi, fcoe: Fix CPU hotplug callback registration
scsi, bnx2fc: Fix CPU hotplug callback registration
scsi, bnx2i: Fix CPU hotplug callback registration
...
This is the final piece in the puzzle, as all patches to remove the
last users of \(interruptible_\|\)sleep_on\(_timeout\|\) have made it
into the 3.15 merge window. The work was long overdue, and this
interface in particular should not have survived the BKL removal
that was done a couple of years ago.
Citing Jon Corbet from http://lwn.net/2001/0201/kernel.php3":
"[...] it was suggested that the janitors look for and fix all code
that calls sleep_on() [...] since (1) almost all such code is
incorrect, and (2) Linus has agreed that those functions should
be removed in the 2.5 development series".
We haven't quite made it for 2.5, but maybe we can merge this for 3.15.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph updates from Sage Weil:
"The biggest chunk is a series of patches from Ilya that add support
for new Ceph osd and crush map features, including some new tunables,
primary affinity, and the new encoding that is needed for erasure
coding support. This brings things into parity with the server side
and the looming firefly release. There is also support for allocation
hints in RBD that help limit fragmentation on the server side.
There is also a series of patches from Zheng fixing NFS reexport,
directory fragmentation support, flock vs fnctl behavior, and some
issues with clustered MDS.
Finally, there are some miscellaneous fixes from Yunchuan Wen for
fscache, Fabian Frederick for ACLs, and from me for fsync(dirfd)
behavior"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (79 commits)
ceph: skip invalid dentry during dcache readdir
libceph: dump pool {read,write}_tier to debugfs
libceph: output primary affinity values on osdmap updates
ceph: flush cap release queue when trimming session caps
ceph: don't grabs open file reference for aborted request
ceph: drop extra open file reference in ceph_atomic_open()
ceph: preallocate buffer for readdir reply
libceph: enable PRIMARY_AFFINITY feature bit
libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting()
libceph: add support for osd primary affinity
libceph: add support for primary_temp mappings
libceph: return primary from ceph_calc_pg_acting()
libceph: switch ceph_calc_pg_acting() to new helpers
libceph: introduce apply_temps() helper
libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers
libceph: ceph_can_shift_osds(pool) and pool type defines
libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions
libceph: enable OSDMAP_ENC feature bit
libceph: primary_affinity decode bits
libceph: primary_affinity infrastructure
...
This patch-set includes the following major enhancement patches.
o introduce large directory support
o introduce f2fs_issue_flush to merge redundant flush commands
o merge write IOs as much as possible aligned to the segment
o add sysfs entries to tune the f2fs configuration
o use radix_tree for the free_nid_list to reduce in-memory operations
o remove costly bit operations in f2fs_find_entry
o enhance the readahead flow for CP/NAT/SIT/SSA blocks
The other bug fixes are as follows.
o recover xattr node blocks correctly after sudden-power-cut
o fix to calculate the maximum number of node ids
o enhance to handle many error cases
And, there are a bunch of cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJTQiQrAAoJEEAUqH6CSFDSlbIP/iq06BrUeMDLoQFhA2GQFKFD
wd0A5h9hCiFcKBcI/u/aAQqj/a5wdwzDl9XzH2PzJ45IM6sVGQZ0lv+kdLhab6rk
ipNbV7G0yLAX+8ygS6GZF7pSKfMzGSGTrRvfdtoiunIip1jCY1IkUxv1XMgBSPza
wnWYrE5HXEqRUDCqPXJyxrPmx0/0jw8/V82Ng9stnY34ySs+l/3Pvg65Kh0QuSSy
BRjJUGlOCF68KUBKd+6YB2T5KlbQde3/5lhP+GMOi+xm5sFB+j+59r/WpJpF2Nxs
ImxQs5GkiU01ErH/rn5FgHY/zzddQenBKwOvrjEeUA1eVpBurdsIr1JN0P6qDbgB
ho5U8LzCQq+HZiW444eQGkXSOagpUKqDhTVJO7Fji/wG88Atc9gLX3ix8TH2skxT
C5CvvrJM7DKBtkZyTzotKY/cWorOZhge6E/EkbGaM1sSHdK5b1Rg4YlFi9TDyz0n
QjGD1uuvEeukeKGdIG9pjc7o5ledbMDYwLpT2RuRXenLOTsn8BqDOo9aRTg+5Kag
tJNJLFumjPR2mEBNKjicJMUf381J/SKDwZszAz9mgvCZXldMza/Ax0LzJDJCVmkP
UuBiVzGxVzpd33IsESUDr0J9hc+t8kS10jfAeKnE3cpb6n7/RYxstHh6CHOFKNXM
gPUSYPN3CYiP47DnSfzA
=eSW+
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This patch-set includes the following major enhancement patches.
- introduce large directory support
- introduce f2fs_issue_flush to merge redundant flush commands
- merge write IOs as much as possible aligned to the segment
- add sysfs entries to tune the f2fs configuration
- use radix_tree for the free_nid_list to reduce in-memory operations
- remove costly bit operations in f2fs_find_entry
- enhance the readahead flow for CP/NAT/SIT/SSA blocks
The other bug fixes are as follows:
- recover xattr node blocks correctly after sudden-power-cut
- fix to calculate the maximum number of node ids
- enhance to handle many error cases
And, there are a bunch of cleanups"
* tag 'for-f2fs-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (62 commits)
f2fs: fix wrong statistics of inline data
f2fs: check the acl's validity before setting
f2fs: introduce f2fs_issue_flush to avoid redundant flush issue
f2fs: fix to cover io->bio with io_rwsem
f2fs: fix error path when fail to read inline data
f2fs: use list_for_each_entry{_safe} for simplyfying code
f2fs: avoid free slab cache under spinlock
f2fs: avoid unneeded lookup when xattr name length is too long
f2fs: avoid unnecessary bio submit when wait page writeback
f2fs: return -EIO when node id is not matched
f2fs: avoid RECLAIM_FS-ON-W warning
f2fs: skip unnecessary node writes during fsync
f2fs: introduce fi->i_sem to protect fi's info
f2fs: change reclaim rate in percentage
f2fs: add missing documentation for dir_level
f2fs: remove unnecessary threshold
f2fs: throttle the memory footprint with a sysfs entry
f2fs: avoid to drop nat entries due to the negative nr_shrink
f2fs: call f2fs_wait_on_page_writeback instead of native function
f2fs: introduce nr_pages_to_write for segment alignment
...
This contains OMAP related fbdev changes for 3.15. The bulk of the patches are
for adding Device Tree support for OMAP Display Subsystem:
* SoCs: OMAP2/3/4
* Boards: OMAP4 Panda, OMAP4 SDP, OMAP3 Beagle, OMAP3 Beagle-xM, OMAP3
IGEP0020, OMAP3 N900
* Devices: TFP410 Encoder, tpd12s015 HDMI companion chip, Sony acx565akm panel,
MIPI DSI Command mode panel and HDMI, DVI and Analog TV connectors
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJTQmaWAAoJEPo9qoy8lh71LS8P+QHkUWD22XhPqt2NaeIxFhzU
rq7uW7iMtzIawb+y2+GLHykk4obmJKzpanhe+p3dZMSEG8O4s6IyRx8LoJGLeWoP
LtnV77Fbh3p9Bn1nLkDmszbcdSSbH8VseykfV5yCJ0lhTPjUXwm3JBFJ57AtOqeF
ya/13NCKR0v2FFCy89rOc9Iu+dYbVVNakTWcN1qcZOSjNQ6cLWyq97OmC4BD2mkC
H9u4XC/+UC6WJZT0PVs9cPn7D2QCt1ZR8eULmSL3YO2rXdLLfL87brtCS1uMNXk2
MUq9KVS1Iw+6yk248F9AyHkg5nZOzWoeXxZj0m48kaLlFBfkGuuiRSJZhc5qgBtm
61X6eQDclJUo1tgEmtH1wCFe4M3vZ0ImK21bX3PXO+IBgz4Zb3e3uqOAJy4LREJk
KZnjfwn/Gyp0D21YSZvquAZ+qZeZMttiwIlZVD26+Xt7GincC5e/HPi58nJ5IcQQ
BuUsjI8fjFcYTf5O2DqdAaYo3cpt8tfUwCXL6Iw32SYi4xz1Rocmxe/xnGCUxZHk
j3qAT5NkHcQL5xSMZgNQUyyqST1Ey9lgmLlQmnxknRPCOW35tbRo621CNz+5Z0MR
TJDu5REahAS7jEmeAjD1m1Xs4X0tYltiPgnbCwFtNHCvTdnw3R916quqRcQKaP6d
Irzufbo2EoTG8cVhvE8e
=Mnc1
-----END PGP SIGNATURE-----
Merge tag 'fbdev-omap-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
Pull OMAP fbdev changes from Tomi Valkeinen:
"This is based on the already pulled fbdev-main changes, and this also
merges .dts branch from Tony Lindgren (which has also been pulled), so
that I was able to add the display related .dts changes.
This contains OMAP related fbdev changes for 3.15. The bulk of the
patches are for adding Device Tree support for OMAP Display Subsystem:
- SoCs: OMAP2/3/4
- Boards: OMAP4 Panda, OMAP4 SDP, OMAP3 Beagle, OMAP3 Beagle-xM,
OMAP3 IGEP0020, OMAP3 N900
- Devices: TFP410 Encoder, tpd12s015 HDMI companion chip, Sony
acx565akm panel, MIPI DSI Command mode panel and HDMI, DVI and
Analog TV connectors"
* tag 'fbdev-omap-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux: (45 commits)
OMAPDSS: HDMI: fix interlace output
OMAPDSS: add missing __init for dss_init_ports
ARM: OMAP2+: remove pdata quirks for displays
OMAPDSS: remove DT hacks for regulators
Doc/DT: Add DT binding documentation for tpd12s015 encoder
Doc/DT: Add DT binding documentation for TFP410 encoder
Doc/DT: Add DT binding documentation for Sony acx565akm panel
Doc/DT: Add DT binding documentation for MIPI DSI CM Panel
Doc/DT: Add DT binding documentation for HDMI Connector
Doc/DT: Add DT binding documentation for DVI Connector
Doc/DT: Add DT binding documentation for Analog TV Connector
ARM: omap3-n900.dts: add display information
ARM: omap3-igep0020.dts: add display information
ARM: omap3-beagle-xm.dts: add display information
ARM: omap3-beagle.dts: add display information
ARM: omap4-sdp.dts: add display information
Doc/DT: Add DT binding documentation for OMAP DSS
OMAPDSS: acx565akm: Add DT support
OMAPDSS: connector-analog-tv: Add DT support
OMAPDSS: hdmi-connector: Add DT support
...
- A few SPI NOR ID definitions
- Kill the NAND "max pagesize" restriction
- Fix some x16 bus-width NAND support
- Add NAND JEDEC parameter page support
- DT bindings for NAND ECC
- GPMI NAND updates (subpage reads)
- More OMAP NAND refactoring
- New STMicro SPI NOR driver (now in 40 patches!)
- A few other random bugfixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJTP6x+AAoJEFySrpd9RFgtit0P/jLWsjMK8G2ldPC4bMZsXmDF
n3c71GcCRlUq4Qzb4rtZx9DANLAh+JyRrMOKCPg6dAMegFdmUqDOpZpNp0vF57KG
myFjqTk+n5y0tfSkWLMUFt0tQ8ArDp3IBkQCUWkD5LgG50EWmjveIQGH0kFnkE39
Kytqvw17RV7f81tIs+WvKt8++YWD2X1VTpTi0S4fx2bJ99bJDBf/GgdoQpj2oirt
igXmloUFEsob9JHZ3qumcUm9vaHwv2TiouZTvRyGdJCCoPdpJEZO4Ka6e4uAVarT
6kMKXBk3lj2GsilOSFFCNetXfy5Bf0TkJkv4rDjh3R1Y4J/hSgraVCbWXdKhb6tj
RmwesdFMjsyS4f/Rhk5PXwJgGL9uK2mi6bk/SmXU0AMgCDSa5zjshY8Wq6C6uXwk
LqlnK8l3h8Txotbc/XJIL+QGMbMkYQI8gxWTHFaqzDtkMe36mnGs9Zec3oso/s2d
CNRpq5+dMZ6qF0z3zpOQHmFbaOekivMy7kCKMXer6ONsrQBNwTdmkwy+SdqnWsLF
YdJttwV/RRcE0SRvK6GrhvzkGlV83Z8RPny6hC1kbrgQ0ffoy2CmIqyWNObK1RXf
sYqoF8TCtR6Y8rHHi5dzZ1lria+nm8pb4+UfQLRI0mK8og7YW+fIfHXxqRrZZIrt
No8NCPBKWzyew2UE0AiQ
=CKih
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20140405' of git://git.infradead.org/linux-mtd
Pull MTD updates from Brian Norris:
- A few SPI NOR ID definitions
- Kill the NAND "max pagesize" restriction
- Fix some x16 bus-width NAND support
- Add NAND JEDEC parameter page support
- DT bindings for NAND ECC
- GPMI NAND updates (subpage reads)
- More OMAP NAND refactoring
- New STMicro SPI NOR driver (now in 40 patches!)
- A few other random bugfixes
* tag 'for-linus-20140405' of git://git.infradead.org/linux-mtd: (120 commits)
Fix index regression in nand_read_subpage
mtd: diskonchip: mem resource name is not optional
mtd: nand: fix mention to CONFIG_MTD_NAND_ECC_BCH
mtd: nand: fix GET/SET_FEATURES address on 16-bit devices
mtd: omap2: Use devm_ioremap_resource()
mtd: denali_dt: Use devm_ioremap_resource()
mtd: devices: elm: update DRIVER_NAME as "omap-elm"
mtd: devices: elm: configure parallel channels based on ecc_steps
mtd: devices: elm: clean elm_load_syndrome
mtd: devices: elm: check for hardware engine's design constraints
mtd: st_spi_fsm: Succinctly reorganise .remove()
mtd: st_spi_fsm: Allow loop to run at least once before giving up CPU
mtd: st_spi_fsm: Correct vendor name spelling issue - missing "M"
mtd: st_spi_fsm: Avoid duplicating MTD core code
mtd: st_spi_fsm: Remove useless consts from function arguments
mtd: st_spi_fsm: Convert ST SPI FSM (NOR) Flash driver to new DT partitions
mtd: st_spi_fsm: Move runtime configurable msg sequences into device's struct
mtd: st_spi_fsm: Supply the W25Qxxx chip specific configuration call-back
mtd: st_spi_fsm: Supply the S25FLxxx chip specific configuration call-back
mtd: st_spi_fsm: Supply the MX25xxx chip specific configuration call-back
...
Currently cpufreq frequency table has two fields: frequency and driver_data.
driver_data is only for drivers' internal use and cpufreq core shouldn't use
it at all. But with the introduction of BOOST frequencies, this assumption
was broken and we started using it as a flag instead.
There are two problems due to this:
- It is against the description of this field, as driver's data is used by
the core now.
- if drivers fill it with -3 for any frequency, then those frequencies are
never considered by cpufreq core as it is exactly same as value of
CPUFREQ_BOOST_FREQ, i.e. ~2.
The best way to get this fixed is by creating another field flags which
will be used for such flags. This patch does that. Along with that various
drivers need modifications due to the change of struct cpufreq_frequency_table.
Reviewed-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Split up __sbc_dif_verify_read() so that VERIFY READ emulation can
perform target-core specific READ_STRIP, seperate from the existing
FILEIO/RAMDISK backend emulation code.
Also add sbc_dif_read_strip() in order to determine number of sectors
using cmd->prot_length, and skip the extra sbc_dif_copy_prot().
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Quinn Tran <quinn.tran@qlogic.com>
Cc: Giridhar Malavali <giridhar.malavali@qlogic.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This patch adds WRITE_INSERT emulation within target-core
using TYPE1 / TYPE3 PI modes in sbc_dif_generate() code.
This is useful in order for existing legacy fabrics that do not
support protection offloads to interact with backend devices that
currently have T10 PI enabled.
v2 changes:
- Rename to sbc_dif_generate() (Sagi)
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Quinn Tran <quinn.tran@qlogic.com>
Cc: Giridhar Malavali <giridhar.malavali@qlogic.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
In order to support local WRITE_INSERT + READ_STRIP operations for
non PI enabled fabrics, the fabric driver needs to be able signal
what protection offload operations are supported.
This is done at session initialization time so the modes can be
signaled by individual se_wwn + se_portal_group endpoints, as well
as optionally across different transports on the same endpoint.
For iser-target, set TARGET_PROT_ALL if the underlying ib_device
has already signaled PI offload support, and allow this to be
exposed via a new iscsit_transport->iscsit_get_sup_prot_ops()
callback.
For loopback, set TARGET_PROT_ALL to signal SCSI initiator mode
operation.
For all other drivers, set TARGET_PROT_NORMAL to disable fabric
level PI.
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Quinn Tran <quinn.tran@qlogic.com>
Cc: Giridhar Malavali <giridhar.malavali@qlogic.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Now that TASK_ABORTED status is not generated for all cases by
TMR ABORT_TASK + LUN_RESET, a new TFO->abort_task() caller is
necessary in order to give fabric drivers a chance to unmap
hardware / software resources before the se_cmd descriptor is
released via the normal TFO->release_cmd() codepath.
This patch adds TFO->aborted_task() in core_tmr_abort_task()
in place of the original transport_send_task_abort(), and
also updates all fabric drivers to implement this caller.
The fabric drivers that include changes to perform cleanup
via ->aborted_task() are:
- iscsi-target
- iser-target
- srpt
- tcm_qla2xxx
The fabric drivers that currently set ->aborted_task() to
NOPs are:
- loopback
- tcm_fc
- usb-gadget
- sbp-target
- vhost-scsi
For the latter five, there appears to be no additional cleanup
required before invoking TFO->release_cmd() to release the
se_cmd descriptor.
v2 changes:
- Move ->aborted_task() call into transport_cmd_finish_abort (Alex)
Cc: Alex Leung <amleung21@yahoo.com>
Cc: Mark Rustad <mark.d.rustad@intel.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Vu Pham <vu@mellanox.com>
Cc: Chris Boot <bootc@bootc.net>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Giridhar Malavali <giridhar.malavali@qlogic.com>
Cc: Saurav Kashyap <saurav.kashyap@qlogic.com>
Cc: Quinn Tran <quinn.tran@qlogic.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This patch addresses three of long standing issues wrt to Task
Aborted Status (TAS) handling.
The first is the incorrect assumption in core_tmr_handle_tas_abort()
that TASK_ABORTED status is sent for the task referenced by TMR
ABORT_TASK, and sending TASK_ABORTED status for TMR LUN_RESET on
the same nexus the LUN_RESET was received.
The second is to ensure the lun reference count is dropped within
transport_cmd_finish_abort() by calling transport_lun_remove_cmd()
before invoking transport_cmd_check_stop_to_fabric().
The last is to fix the delayed TAS handling to allow outstanding
WRITEs to complete before sending the TASK_ABORTED status. This
includes changing transport_check_aborted_status() to avoid
processing when SCF_SEND_DELAYED_TAS has not be set, and updating
transport_send_task_abort() to drop the SCF_SENT_DELAYED_TAS
check.
Signed-off-by: Alex Leung <amleung21@yahoo.com>
Cc: Alex Leung <amleung21@yahoo.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
This is not going to be supported soon - so drop it.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Some transports (iSCSI/iSER/SRP/FC) support hardware INSERT/STRIP
capabilities while other transports like loopback/vhost-scsi need
perform this is software.
This patch allows fabrics using SCF_PASSTHROUGH_SG_TO_MEM_NOALLOC
to signal the early LUN scan handling case where PROTECT CDB bits
are set, but no fabric buffer has been provided.
For transports which use generic new command these buffers have yet
to be allocated.
Also this way, target may support protection information
against legacy initiators (writes are inserted and reads
are stripped).
(Only set prot_pto for loopback during early special case - nab)
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Highlights include:
- Stable fix for a use after free issue in the NFSv4.1 open code
- Fix the SUNRPC bi-directional RPC code to account for TCP segmentation
- Optimise usage of readdirplus when confronted with 'ls -l' situations
- Soft mount bugfixes
- NFS over RDMA bugfixes
- NFSv4 close locking fixes
- Various NFSv4.x client state management optimisations
- Rename/unlink code cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTQBayAAoJEGcL54qWCgDyUzgQAKzSlbcksMQT55M/KZJXabNW
KSctJeDrkTkRxOXTNxuF9NbIgeqenLijCokXty6BIUgup0zkOPMzFfRfgdQvplnp
YEj4sOEXEZ8CX+PoUTYOEayzt0ssEAOyidumiM+Gx2LD/E1d2xyCL7YaAOjIhVQS
OnXcX1cZw+dZSUxC9vu5fVDjrphJTnp4CXdbvR5PiJiXeKqzZd9e5M3hXgpAQ/AS
mWjYeUvM9mwyz7UmbLKkWEmzB3tFlGdTzDPxLRrkfcOSKI2Ham0lL3/Uv50/nRTu
99ts6KH8KLGcUuL9vD9KRebht2f71usBrWAdvpy1cUcf1Fh6lmEg4ktGfkqldaUu
9kNu9d5DCxJoGc6R2UTw5FeyPwYuDWoBwEGy1DcguJ5CeQn2R2nH4ps/P3J3DX4d
DZsJqCY9idKZCQhtyR0iF9j3x2bNFoENaL6WHI6b0J+xjMedIbHgeUQzIQP0RLBJ
h0IcjK0D+e7WdyC7jk4Nm3krtms5SNUG5/N9OUO36a7v8735PJBcbcgm9hZJt8Fh
t/4vqUmKIBXHioHsMhaFslqTWlYIR9a3MYmN7QtHFYbqUfNxH69v9y3d6jb4Igck
kqoEiui5aJOCR76s7oVdHCcm+klBwEPiACT+H9CUMzSoKzHSWsBSNZbJR3BEia4M
7dwScS1OfI2KuutshGQA
=weNx
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
- Stable fix for a use after free issue in the NFSv4.1 open code
- Fix the SUNRPC bi-directional RPC code to account for TCP segmentation
- Optimise usage of readdirplus when confronted with 'ls -l' situations
- Soft mount bugfixes
- NFS over RDMA bugfixes
- NFSv4 close locking fixes
- Various NFSv4.x client state management optimisations
- Rename/unlink code cleanups"
* tag 'nfs-for-3.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
nfs: pass string length to pr_notice message about readdir loops
NFSv4: Fix a use-after-free problem in open()
SUNRPC: rpc_restart_call/rpc_restart_call_prepare should clear task->tk_status
SUNRPC: Don't let rpc_delay() clobber non-timeout errors
SUNRPC: Ensure call_connect_status() deals correctly with SOFTCONN tasks
SUNRPC: Ensure call_status() deals correctly with SOFTCONN tasks
NFSv4: Ensure we respect soft mount timeouts during trunking discovery
NFSv4: Schedule recovery if nfs40_walk_client_list() is interrupted
NFS: advertise only supported callback netids
SUNRPC: remove KERN_INFO from dprintk() call sites
SUNRPC: Fix large reads on NFS/RDMA
NFS: Clean up: revert increase in READDIR RPC buffer max size
SUNRPC: Ensure that call_bind times out correctly
SUNRPC: Ensure that call_connect times out correctly
nfs: emit a fsnotify_nameremove call in sillyrename codepath
nfs: remove synchronous rename code
nfs: convert nfs_rename to use async_rename infrastructure
nfs: make nfs_async_rename non-static
nfs: abstract out code needed to complete a sillyrename
NFSv4: Clear the open state flags if the new stateid does not match
...