IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
With eBPF getting more extended and exposure to user space is on it's way,
hardening the memory range the interpreter uses to steer its command flow
seems appropriate. This patch moves the to be interpreted bytecode to
read-only pages.
In case we execute a corrupted BPF interpreter image for some reason e.g.
caused by an attacker which got past a verifier stage, it would not only
provide arbitrary read/write memory access but arbitrary function calls
as well. After setting up the BPF interpreter image, its contents do not
change until destruction time, thus we can setup the image on immutable
made pages in order to mitigate modifications to that code. The idea
is derived from commit 314beb9bcabf ("x86: bpf_jit_comp: secure bpf jit
against spraying attacks").
This is possible because bpf_prog is not part of sk_filter anymore.
After setup bpf_prog cannot be altered during its life-time. This prevents
any modifications to the entire bpf_prog structure (incl. function/JIT
image pointer).
Every eBPF program (including classic BPF that are migrated) have to call
bpf_prog_select_runtime() to select either interpreter or a JIT image
as a last setup step, and they all are being freed via bpf_prog_free(),
including non-JIT. Therefore, we can easily integrate this into the
eBPF life-time, plus since we directly allocate a bpf_prog, we have no
performance penalty.
Tested with seccomp and test_bpf testsuite in JIT/non-JIT mode and manual
inspection of kernel_page_tables. Brad Spengler proposed the same idea
via Twitter during development of this patch.
Joint work with Hannes Frederic Sowa.
Suggested-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Branching takes two cycles on MPC8xx. Lets duplicate the two instructions
and avoid the branching.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
By XORing the upper part of the instruction code, we get a value that can
directly be verified with the second test and we can remove the first test.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
r10 and r3 are only used inside FixupDAR function. So lets save them inside
that function only.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Since commit 2321f33790a6c5b80322d907a92d5739e7521a13, dirty handling is not
handled here anymore. So we fix the comment.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Since commit 2321f33790a6c5b80322d907a92d5739e7521a13, r10 is not used anymore
after FixupDAR. There is therefore no need to set it up with the value of DAR.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
SCRATCH0 and SCRATCH1 are only used in Exceptions prologs where no other
exception can happen. There is therefore no need to preserve them accross
TLB handlers, we can use them there as in other exceptions. One of the
advantages is that they do not suffer CPU6 errata unlike M_TW register.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Since commit 469d62be9263b92f2c3329540cbb1c076111f4f3, SPRG2 is used as a
scratch register just like SPRG0 and SPRG1. So Declare it as such and fix
the comment which is not valid anymore since that commit.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Allocate msis such that each time a new interrupt is requested,
the SRS (MSIR register select) to be used is allocated in a
round-robin fashion.
The end result is that the msi interrupts will be spread across
distinct MSIRs with the main benefit that now users can set
affinity to each msi int through the mpic irq backing up the
MSIR register.
This is achieved with the help of a newly introduced msi bitmap
api that allows specifying the starting point when searching
for a free msi interrupt.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Cc: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Rename the irq controller associated with a MSI
interrupt to fsl-msi-<V>, where <V> is the virq
of the cascade irq backing up this MSI interrupt.
This way, one can set the affinity of a MSI
through the cascade irq associated with said MSI
interrupt.
Given this example /proc/interrupts snippet:
CPU0 CPU1 CPU2 CPU3
16: 0 0 0 0 OpenPIC 16 Edge mpic-error-int
17: 0 4 0 0 fsl-msi-224 0 Edge eth0-rx-0
18: 0 5 0 0 fsl-msi-225 1 Edge eth0-tx-0
19: 0 2 0 0 fsl-msi-226 2 Edge eth0
[...]
224: 0 11 0 0 OpenPIC 224 Edge fsl-msi-cascade
225: 0 0 0 0 OpenPIC 225 Edge fsl-msi-cascade
226: 0 0 0 0 OpenPIC 226 Edge fsl-msi-cascade
[...]
To change the affinity of MSI interrupt 17
(having the irq controller named "fsl-msi-224")
instead of writing /proc/irq/17/smp_affinity, use
the associated MSI cascade irq, in this case,
interrupt 224, e.g.:
echo 6 > /proc/irq/224/smp_affinity
Note that a MSI cascade irq covers several MSI
interrupts, so changing the affinity on the
cascade will impact all of the associated MSI
interrupts.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Cc: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
As we do for other fsl-mpic related cascaded irqchips
(e.g. error ints, mpic timers), use a normal irq handler
for msi irqs too.
This brings some advantages such as mask/unmask/ack/eoi
and irq state taken care behind the scenes, kstats
updates a.s.o plus access to features provided by mpic,
such as affinity.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Cc: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Store cascade_data in an array inside the driver
data for later use.
Get rid of the msi_virq array since now we can
encapsulate the virqs in the cascade_data
directly and access them through the array
mentioned earlier.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Cc: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Change USB controller version to 2.5 in compatible string for T4240
Signed-off-by: Nikhil Badola <nikhil.badola@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
The following commit prevents the MPC8548E on the XPedite5200 PrPMC
module from enumerating its PCI/PCI-X bus:
powerpc/fsl-pci: use 'Header Type' to identify PCIE mode
The previous patch prevents any Freescale PCI-X bridge from enumerating
the bus, if it is hardware strapped into Agent mode.
In PCI-X, the Host is responsible for driving the PCI-X initialization
pattern to devices on the bus, so that they know whether to operate in
conventional PCI or PCI-X mode as well as what the bus timing will be.
For a PCI-X PrPMC, the pattern is driven by the mezzanine carrier it is
installed onto. Therefore, PrPMCs are PCI-X Agents, but one per system
may still enumerate the bus.
This patch causes the device node of any PCI/PCI-X bridge strapped into
Agent mode to be checked for the fsl,pci-agent-force-enum property. If
the property is present in the node, the bridge will be allowed to
enumerate the bus.
Cc: Minghuan Lian <Minghuan.Lian@freescale.com>
Signed-off-by: Aaron Sierra <asierra@xes-inc.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Add CONFIG_NLS_CODEPAGE_437, CONFIG_NLS_CODEPAGE_850,
CONFIG_NLS_ISO8859_1 in default configs for 85xx
and 86xx socs. Required for mounting vfat file-systems
on USB devices
Signed-off-by: Ramneek Mehresh <ramneek.mehresh@freescale.com>
Signed-off-by: Nikhil Badola <nikhil.badola@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
The new MSI block in MPIC 4.3 added the MSIIR1 register,
with a different layout, in order to support 16 MSIR
registers. The msi binding was also updated so that
the "reg" reflects the newly introduced MSIIR1 register.
Virtual machines advertise these msi nodes by using the
compatible "fsl,vmpic-msi-v4.3" so add support for it.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Cc: Scott Wood <scottwood@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
FSL PCI cannot directly address the whole lower 4 GiB due to
conflicts with PCICSRBAR and outbound windows. By the time
max_direct_dma_addr is set to the precise limit, it will be too late to
alter the zone limits, but we should always have at least 2 GiB mapped
(unless RAM is smaller than that).
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
A DMA zone is still needed with swiotlb, for coherent allocations.
This doesn't affect platforms that don't use swiotlb or that don't call
swiotlb_detect_4g().
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
FSL PCI cannot directly address the whole lower 4 GiB due to
conflicts with PCICSRBAR and outbound windows, and thus
max_direct_dma_addr is less than 4GiB. Honor that limit in
dma_direct_alloc_coherent().
Note that setting the DMA mask to 31 bits is not an option, since many
PCI drivers would fail if we reject 32-bit DMA in dma_supported(), and
we have no control over the setting of coherent_dma_mask if
dma_supported() returns true.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
Platform code can call limit_zone_pfn() to set appropriate limits
for ZONE_DMA and ZONE_DMA32, and dma_direct_alloc_coherent() will
select a suitable zone based on a device's mask and the pfn limits that
platform code has configured.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
fc95ca7284bc54953165cba76c3228bd2cdb9591 introduces a memset in
kvmppc_alloc_hpt since the general CMA doesn't clear the memory it
allocates.
However, the size argument passed to memset is computed from a signed value
and its signed bit is extended by the cast the compiler is doing. This lead
to extremely large size value when dealing with order value >= 31, and
almost all the memory following the allocated space is cleaned. As a
consequence, the system is panicing and may even fail spawning the kdump
kernel.
This fix makes use of an unsigned value for the memset's size argument to
avoid sign extension. Among this fix, another shift operation which may
lead to signed extended value too is also fixed.
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
New system call depends on crypto. As it did not have a separate config
option, CONFIG_KEXEC was modified to select CRYPTO and CRYPTO_SHA256.
But now previous patch introduced a new config option for new syscall.
So CONFIG_KEXEC does not require crypto. Remove that dependency.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Shaun Ruffell <sruffell@digium.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the beggining was on_each_cpu(), which required an unused argument to
kvm_arch_ops.hardware_{en,dis}able, but this was soon forgotten.
Remove unnecessary arguments that stem from this.
Signed-off-by: Radim KrÄmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using static inline is going to save few bytes and cycles.
For example on powerpc, the difference is 700 B after stripping.
(5 kB before)
This patch also deals with two overlooked empty functions:
kvm_arch_flush_shadow was not removed from arch/mips/kvm/mips.c
2df72e9bc KVM: split kvm_arch_flush_shadow
and kvm_arch_sched_in never made it into arch/ia64/kvm/kvm-ia64.c.
e790d9ef6 KVM: add kvm_arch_sched_in
Signed-off-by: Radim KrÄmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Opaque KVM structs are useful for prototypes in asm/kvm_host.h, to avoid
"'struct foo' declared inside parameter list" warnings (and consequent
breakage due to conflicting types).
Move them from individual files to a generic place in linux/kvm_types.h.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 5828f666c069af74e00db21559f1535103c9f79a due to
build failure after merging with pending powerpc changes.
Link: http://lkml.kernel.org/g/20140827142243.6277eaff@canb.auug.org.au
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.
Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.
__get_cpu_var() is defined as :
#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.
this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.
This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.
At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.
The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e. using a global
register that may be set to the per cpu base.
Transformations done to __get_cpu_var()
1. Determine the address of the percpu instance of the current processor.
DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(&y);
2. Same as #1 but this time an array structure is involved.
DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(y);
3. Retrieve the content of the current processors instance of a per cpu
variable.
DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)
Converts to
int x = __this_cpu_read(y);
4. Retrieve the content of a percpu struct
DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);
Converts to
memcpy(&x, this_cpu_ptr(&y), sizeof(x));
5. Assignment to a per cpu variable
DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;
Converts to
__this_cpu_write(y, x);
6. Increment/Decrement etc of a per cpu variable
DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++
Converts to
__this_cpu_inc(y)
tj: Folded a fix patch.
http://lkml.kernel.org/g/alpine.DEB.2.11.1408172143020.9652@gentwo.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.
Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.
__get_cpu_var() is defined as :
#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.
this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.
This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.
At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.
The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e. using a global
register that may be set to the per cpu base.
Transformations done to __get_cpu_var()
1. Determine the address of the percpu instance of the current processor.
DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(&y);
2. Same as #1 but this time an array structure is involved.
DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(y);
3. Retrieve the content of the current processors instance of a per cpu
variable.
DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)
Converts to
int x = __this_cpu_read(y);
4. Retrieve the content of a percpu struct
DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);
Converts to
memcpy(&x, this_cpu_ptr(&y), sizeof(x));
5. Assignment to a per cpu variable
DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;
Converts to
__this_cpu_write(y, x);
6. Increment/Decrement etc of a per cpu variable
DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++
Converts to
__this_cpu_inc(y)
CC: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Acked-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Introduce preempt notifiers for architecture specific code.
Advantage over creating a new notifier in every arch is slightly simpler
code and guaranteed call order with respect to kvm_sched_in.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
fc95ca7284bc54953165cba76c3228bd2cdb9591 claims that there is no
functional change but this is not true as it calls get_order() (which
takes bytes) where it should have called order_base_2() and the kernel
stops on VM_BUG_ON().
This replaces get_order() with order_base_2() (round-up version of ilog2).
Suggested-by: Paul Mackerras <paulus@samba.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull more powerpc updates from Ben Herrenschmidt:
"Here are some more powerpc bits for 3.17, essentially fixes.
The biggest series, also aimed at -stable, is from Aneesh and is the
result of weeks and weeks of debugging to find out why the heck or THP
implementation was occasionally triggering multi-hit errors in our
level 1 TLB. It ended up being a combination of issues including
subtleties as to how we should invalidate those special 'MPSS' pages
we use to allow the use of 16M pages inside 4K/64K "base page size"
segments (you really have to love our MMU !)
Another interesting one in the "OMG" category is the series from
Michael adding memory barriers to spin_is_locked(). That's also the
result of many days of debugging to figure out why the semaphore code
would occasionally crash in ways that made no sense. It ended up
being some creative lock stacking that was defeated by the fact that
our locks allow a load inside the locked section to be re-ordered with
the load of the lock value itself (I'm still of two mind about whether
to kill that once and for all by putting a heavier barrier back into
our lock implementation...). The fixes come with a long explanation
in the cset comments, feel free to read it if you feel like having a
headache today"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (25 commits)
powerpc/thp: Add tracepoints to track hugepage invalidate
powerpc/mm: Use read barrier when creating real_pte
powerpc/thp: Use ACCESS_ONCE when loading pmdp
powerpc/thp: Invalidate with vpn in loop
powerpc/thp: Handle combo pages in invalidate
powerpc/thp: Invalidate old 64K based hash page mapping before insert of 4k pte
powerpc/thp: Don't recompute vsid and ssize in loop on invalidate
powerpc/thp: Add write barrier after updating the valid bit
powerpc: reorder per-cpu NUMA information's initialization
powerpc/perf/hv-24x7: Use kmem_cache_free
powerpc/pseries/hvcserver: Fix endian issue in hvcs_get_partner_info
powerpc: Hard disable interrupts in xmon
powerpc: remove duplicate definition of TEXASR_FS
powerpc/pseries: Avoid deadlock on removing ddw
powerpc/pseries: Failure on removing device node
powerpc/boot: Use correct zlib types for comparison
powerpc/powernv: Interface to register/unregister opal dump region
printk: Add function to return log buffer address and size
powerpc: Add POWER8 features to CPU_FTRS_POSSIBLE/ALWAYS
powerpc/ppc476: Disable BTAC
...
window:
Group changes to the device tree. In preparation for adding device tree
overlay support, OF_DYNAMIC is reworked so that a set of device tree
changes can be prepared and applied to the tree all at once. OF_RECONFIG
notifiers see the most significant change here so that users always get
a consistent view of the tree. Notifiers generation is moved from before
a change to after it, and notifiers for a group of changes are emitted
after the entire block of changes have been applied
Automatic console selection from DT. Console drivers can now use
of_console_check() to see if the device node is specified as a console
device. If so then it gets added as a preferred console. UART devices
get this support automatically when uart_add_one_port() is called.
DT unit tests no longer depend on pre-loaded data in the device tree.
Data is loaded dynamically at the start of unit tests, and then unloaded
again when the tests have completed.
Also contains a few bugfixes for reserved regions and early memory setup.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT6M7XAAoJEMWQL496c2LNTTgP/2rXyrTTGZpK/qrLKHWKYHvr
XL7tcTkhA0OLU64E37fB+xtDXyBYnLsharuwUFSd1LlL1Wnc1cZN5ORRlMJbmjUR
Wvwl0A8/mkhGl4tzzKgJ4z4rMJXvlZfJRpnVoRB5FOn90LI7k/jsf5rIwF/6S90B
6D6II0r4RG9ku1m7g70cToxcIFCzp0V+eu2tym9GnhsyGKlunPT9iNiTpwfVhPAj
QUvMPKIQXReOv6xDU5q6E07839IMf7SdAvciBTHGnCDD7sGziHvnBIShj/2vTDAF
27sGRKrWUnnuxRUMOoIudiYyeHXIdt1WXp6FsS/ztVI37Ijh9YPShJwWGhQDppnp
4tcoSdefqw9IRUajAVWsB0RUW/tCL4a7tggWofylOA6itDi+HZnw6YafE1G1YzxF
q8OFo9uqLcmFQfHDJpk+sdtXoMZzOgrxlEscsGsQ8kd2Uoe8+chgR9EY1sqPkWVF
Zw0FJCTB6spBlsnXeboBGrnvpbPkacwhvesIFO0IANy4j4R55xlEeTcs1fe3ubUf
UhNyyFdnCAUA7e5CAabcAQYsdbEKG6v0Il3H6xts36c8lXCSFXVgNcw5mdCpFCcQ
DZ3/1FGSVzkYNXX8hlzIn1W0dqvwn4x0ZbnpXUJrGUBpSmjt9qkXx0XbdeFwartU
7X4tNGS1jfNYhOdP87jF
=8IFz
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux
Pull device tree updates from Grant Likely:
"The branch contains the following device tree changes the v3.17 merge
window:
Group changes to the device tree. In preparation for adding device
tree overlay support, OF_DYNAMIC is reworked so that a set of device
tree changes can be prepared and applied to the tree all at once.
OF_RECONFIG notifiers see the most significant change here so that
users always get a consistent view of the tree. Notifiers generation
is moved from before a change to after it, and notifiers for a group
of changes are emitted after the entire block of changes have been
applied
Automatic console selection from DT. Console drivers can now use
of_console_check() to see if the device node is specified as a console
device. If so then it gets added as a preferred console. UART
devices get this support automatically when uart_add_one_port() is
called.
DT unit tests no longer depend on pre-loaded data in the device tree.
Data is loaded dynamically at the start of unit tests, and then
unloaded again when the tests have completed.
Also contains a few bugfixes for reserved regions and early memory
setup"
* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux: (21 commits)
of: Fixing OF Selftest build error
drivers: of: add automated assignment of reserved regions to client devices
of: Use proper types for checking memory overflow
of: typo fix in __of_prop_dup()
Adding selftest testdata dynamically into live tree
of: Add todo tasklist for Devicetree
of: Transactional DT support.
of: Reorder device tree changes and notifiers
of: Move dynamic node fixups out of powerpc and into common code
of: Make sure attached nodes don't carry along extra children
of: Make devicetree sysfs update functions consistent.
of: Create unlocked versions of node and property add/remove functions
OF: Utility helper functions for dynamic nodes
of: Move CONFIG_OF_DYNAMIC code into a separate file
of: rename of_aliases_mutex to just of_mutex
of/platform: Fix of_platform_device_destroy iteration of devices
of: Migrate of_find_node_by_name() users to for_each_node_by_name()
tty: Update hypervisor tty drivers to use core stdout parsing code.
arm/versatile: Add the uart as the stdout device.
of: Enable console on serial ports specified by /chosen/stdout-path
...
Many of the atomic op implementations are the same except for one
instruction; fold the lot into a few CPP macros and reduce LoC.
Requires asm_op because PPC asm is weird :-)
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20140508135852.713980957@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add tracepoint to track hugepage invalidate. This help us
in debugging difficult to track bugs.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On ppc64 we support 4K hash pte with 64K page size. That requires
us to track the hash pte slot information on a per 4k basis. We do that
by storing the slot details in the second half of pte page. The pte bit
_PAGE_COMBO is used to indicate whether the second half need to be
looked while building real_pte. We need to use read memory barrier while
doing that so that load of hidx is not reordered w.r.t _PAGE_COMBO
check. On the store side we already do a lwsync in __hash_page_4K
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We would get wrong results in compiler recomputed old_pmd. Avoid
that by using ACCESS_ONCE
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As per ISA, for 4k base page size we compare 14..65 bits of VA specified
with the entry_VA in tlb. That implies we need to make sure we do a
tlbie with all the possible 4k va we used to access the 16MB hugepage.
With 64k base page size we compare 14..57 bits of VA. Hence we cannot
ignore the lower 24 bits of va while tlbie .We also cannot tlb
invalidate a 16MB entry with just one tlbie instruction because
we don't track which va was used to instantiate the tlb entry.
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we changed base page size of the segment, either via sub_page_protect
or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash
table entries. We do a lazy hash page table flush for all mapped pages
in the demoted segment. This happens when we handle hash page fault for
these pages.
We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
that implies that we could possibly have older 64K hash pte entries in
the hash page table and we need to invalidate those entries.
Use _PAGE_COMBO to determine the page size with which we should
invalidate the hash table entries on unmap.
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we changed base page size of the segment, either via sub_page_protect
or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash
table entries. We do a lazy hash page table flush for all mapped pages
in the demoted segment. This happens when we handle hash page fault
for these pages.
We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
that implies that we could possibly have older 64K hash pte entries in
the hash page table and we need to invalidate those entries.
Handle this correctly for 16M pages
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The segment identifier and segment size will remain the same in
the loop, So we can compute it outside. We also change the
hugepage_invalidate interface so that we can use it the later patch
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
With hugepages, we store the hpte valid information in the pte page
whose address is stored in the second half of the PMD. Use a
write barrier to make sure clearing pmd busy bit and updating
hpte valid info are ordered properly.
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
after start_secondary(), similar to ia64, which is invoked via
smp_init().
Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.
Additionally, the following commits changed init_workqueues() to use
cpu_to_node to determine the node to use for kthread_create_on_node:
bce903809ab3f ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad469342 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")
Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).
Fix this by initializing the powerpc-specific CPU<->node/local memory
node mapping as early as possible, which on powerpc is
do_init_bootmem(). Currently that function initializes the mapping for
the boot CPU, but we extend it to setup the mapping for all possible
CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the
per-cpu values for all possible CPUs. That ensures that before the
early_initcalls run (and really as early as possible), the per-cpu NUMA
mapping is accurate.
While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology of:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node 0 1
0: 10 40
1: 40 10
the slab consumption decreases from
Slab: 932416 kB
SUnreclaim: 902336 kB
to
Slab: 395264 kB
SUnreclaim: 359424 kB
And we a corresponding increase in the slab efficiency from
slab mem objs slabs
used active active
------------------------------------------------------------
kmalloc-16384 337 MB 11.28% 100.00%
task_struct 288 MB 9.93% 100.00%
to
slab mem objs slabs
used active active
------------------------------------------------------------
kmalloc-16384 37 MB 100.00% 100.00%
task_struct 31 MB 100.00% 100.00%
Powerpc didn't support memoryless nodes until recently (64bb80d87f01
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261194d
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption with these kind of environments.
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Free memory allocated using kmem_cache_zalloc using kmem_cache_free
rather than kfree.
The Coccinelle semantic patch that makes this change is as follows:
// <smpl>
@@
expression x,E,c;
@@
x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
... when != x = E
when != &x
?-kfree(x)
+kmem_cache_free(c,x)
// </smpl>
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
A buffer returned by H_VTERM_PARTNER_INFO contains device information
in big endian format, causing problems for little endian architectures.
This patch ensures that they are in cpu endian.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
xmon only soft disables interrupts. This seems like a bad idea - we
certainly don't want decrementer and PMU exceptions going off when
we are debugging something inside xmon.
This issue was uncovered when the hard lockup detector went off
inside xmon. To ensure we wont get a spurious hard lockup warning,
I also call touch_nmi_watchdog() when exiting xmon.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
It appears that commits 7f06f21d40a6 ("powerpc/tm: Add checking to
treclaim/trechkpt") and e4e38121507a ("KVM: PPC: Book3S HV: Add
transactional memory support") both added definitions of TEXASR_FS.
Remove one of them. At the same time, fix the alignment of the remaining
definition (should be tab-separated like the rest of the #defines).
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>