IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
In btf_parse_hdr(), the length of the btf data header is firstly copied
from the user space to 'hdr_len' and checked to see whether it is larger
than 'btf_data_size'. If yes, an error code EINVAL is returned. Otherwise,
the whole header is copied again from the user space to 'btf->hdr'.
However, after the second copy, there is no check between
'btf->hdr->hdr_len' and 'hdr_len' to confirm that the two copies get the
same value. Given that the btf data is in the user space, a malicious user
can race to change the data between the two copies. By doing so, the user
can provide malicious data to the kernel and cause undefined behavior.
This patch adds a necessary check after the second copy, to make sure
'btf->hdr->hdr_len' has the same value as 'hdr_len'. Otherwise, an error
code EINVAL will be returned.
Signed-off-by: Wenwen Wang <wang6495@umn.edu>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Drop BUG_ON()s and do normal error handling instead, in
find_next_iomem_res().
- Align function arguments on opening braces.
- Get rid of local var sibling_only in find_next_iomem_res().
- Shorten unnecessarily long first_level_children_only arg name.
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Bjorn Helgaas <bhelgaas@google.com>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
Link: <new submission>
Previously find_next_iomem_res() used "*res" as both an input parameter for
the range to search and the type of resource to search for, and an output
parameter for the resource we found, which makes the interface confusing.
The current callers use find_next_iomem_res() incorrectly because they
allocate a single struct resource and use it for repeated calls to
find_next_iomem_res(). When find_next_iomem_res() returns a resource, it
overwrites the start, end, flags, and desc members of the struct. If we
call find_next_iomem_res() again, we must update or restore these fields.
The previous code restored res.start and res.end, but not res.flags or
res.desc.
Since the callers did not restore res.flags, if they searched for flags
IORESOURCE_MEM | IORESOURCE_BUSY and found a resource with flags
IORESOURCE_MEM | IORESOURCE_BUSY | IORESOURCE_SYSRAM, the next search would
incorrectly skip resources unless they were also marked as
IORESOURCE_SYSRAM.
Fix this by restructuring the interface so it takes explicit "start, end,
flags" parameters and uses "*res" only as an output parameter.
Based on a patch by Lianbo Jiang <lijiang@redhat.com>.
[ bp: While at it:
- make comments kernel-doc style.
-
Originally-by: http://lore.kernel.org/lkml/20180921073211.20097-2-lijiang@redhat.com
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
CC: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/153805812916.1157.177580438135143788.stgit@bhelgaas-glaptop.roam.corp.google.com
find_next_iomem_res() finds an iomem resource that covers part of a range
described by "start, end". All callers expect that range to be inclusive,
i.e., both start and end are included, but find_next_iomem_res() doesn't
handle the end address correctly.
If it finds an iomem resource that contains exactly the end address, it
skips it, e.g., if "start, end" is [0x0-0x10000] and there happens to be an
iomem resource [mem 0x10000-0x10000] (the single byte at 0x10000), we skip
it:
find_next_iomem_res(...)
{
start = 0x0;
end = 0x10000;
for (p = next_resource(...)) {
# p->start = 0x10000;
# p->end = 0x10000;
# we *should* return this resource, but this condition is false:
if ((p->end >= start) && (p->start < end))
break;
Adjust find_next_iomem_res() so it allows a resource that includes the
single byte at the end of the range. This is a corner case that we
probably don't see in practice.
Fixes: 58c1b5b07907 ("[PATCH] memory hotadd fixes: find_next_system_ram catch range fix")
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
CC: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/153805812254.1157.16736368485811773752.stgit@bhelgaas-glaptop.roam.corp.google.com
Respect the DMA_ATTR_NO_WARN flags for allocations in dma-direct.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Robin Murphy <robin.murphy@arm.com>
The current printk() is ready to handle log buffer size up to 2G.
Give an explicit error for users who want to use larger log buffer.
Also fix printk formatting to show the 2G as a positive number.
Link: http://lkml.kernel.org/r/20181008135916.gg4kkmoki5bgtco5@pathway.suse.cz
Cc: rostedt@goodmis.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: He Zhe <zhe.he@windriver.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
[pmladek: Fixed to the really safe limit 2GB.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
lockdep_assert_held() is better suited for checking locking requirements,
since it won't get confused when the lock is held by some other task. This
is also a step towards possibly removing spin_is_locked().
Signed-off-by: Lance Roy <ldr709@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Link: https://lkml.kernel.org/r/20181003053902.6910-12-ldr709@gmail.com
A sizable portion of the CPU cycles spent on the __lock_acquire() is used
up by the atomic increment of the class->ops stat counter. By taking it out
from the lock_class structure and changing it to a per-cpu per-lock-class
counter, we can reduce the amount of cacheline contention on the class
structure when multiple CPUs are trying to acquire locks of the same
class simultaneously.
To limit the increase in memory consumption because of the percpu nature
of that counter, it is now put back under the CONFIG_DEBUG_LOCKDEP
config option. So the memory consumption increase will only occur if
CONFIG_DEBUG_LOCKDEP is defined. The lock_class structure, however,
is reduced in size by 16 bytes on 64-bit archs after ops removal and
a minor restructuring of the fields.
This patch also fixes a bug in the increment code as the counter is of
the 'unsigned long' type, but atomic_inc() was used to increment it.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/d66681f3-8781-9793-1dcf-2436a284550b@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-10-08
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) sk_lookup_[tcp|udp] and sk_release helpers from Joe Stringer which allow
BPF programs to perform lookups for sockets in a network namespace. This would
allow programs to determine early on in processing whether the stack is
expecting to receive the packet, and perform some action (eg drop,
forward somewhere) based on this information.
2) per-cpu cgroup local storage from Roman Gushchin.
Per-cpu cgroup local storage is very similar to simple cgroup storage
except all the data is per-cpu. The main goal of per-cpu variant is to
implement super fast counters (e.g. packet counters), which don't require
neither lookups, neither atomic operations in a fast path.
The example of these hybrid counters is in selftests/bpf/netcnt_prog.c
3) allow HW offload of programs with BPF-to-BPF function calls from Quentin Monnet
4) support more than 64-byte key/value in HW offloaded BPF maps from Jakub Kicinski
5) rename of libbpf interfaces from Andrey Ignatov.
libbpf is maturing as a library and should follow good practices in
library design and implementation to play well with other libraries.
This patch set brings consistent naming convention to global symbols.
6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause from Alexei Starovoitov
to let Apache2 projects use libbpf
7) various AF_XDP fixes from Björn and Magnus
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a grammar mistake in <linux/interrupt.h>.
[ mingo: While at it also fix another similar error in another comment as well. ]
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jiri Kosina <trivial@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181008111726.26286-1-geert%2Brenesas@glider.be
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that there is at least one driver supporting BPF-to-BPF function
calls, lift the restriction, in the verifier, on hardware offload of
eBPF programs containing such calls. But prevent jit_subprogs(), still
in the verifier, from being run for offloaded programs.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In preparation for BPF-to-BPF calls in offloaded programs, add a new
function attribute to the struct bpf_prog_offload_ops so that drivers
supporting eBPF offload can hook at the end of program verification, and
potentially extract information collected by the verifier.
Implement a minimal callback (returning 0) in the drivers providing the
structs, namely netdevsim and nfp.
This will be useful in the nfp driver, in later commits, to extract the
number of subprograms as well as the stack depth for those subprograms.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
I recently debugged a DMA mapping oops where a driver was trying to map
a buffer returned from request_firmware() with dma_map_single(). Memory
returned from request_firmware() is mapped into the vmalloc region and
this isn't a valid region to map with dma_map_single() per the DMA
documentation's "What memory is DMA'able?" section.
Unfortunately, we don't really check that in the DMA debugging code, so
enabling DMA debugging doesn't help catch this problem. Let's add a new
DMA debug function to check for a vmalloc address or an invalid virtual
address and print a warning if this happens. This makes it a little
easier to debug these sorts of problems, instead of seeing odd behavior
or crashes when drivers attempt to map the vmalloc space for DMA.
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Andrei Vagin <avagin@gmail.com> reported:
> Accoding to the man page, the user should not set si_signo, it has to be set
> by kernel.
>
> $ man 2 rt_sigqueueinfo
>
> The uinfo argument specifies the data to accompany the signal. This
> argument is a pointer to a structure of type siginfo_t, described in
> sigaction(2) (and defined by including <sigaction.h>). The caller
> should set the following fields in this structure:
>
> si_code
> This must be one of the SI_* codes in the Linux kernel source
> file include/asm-generic/siginfo.h, with the restriction that
> the code must be negative (i.e., cannot be SI_USER, which is
> used by the kernel to indicate a signal sent by kill(2)) and
> cannot (since Linux 2.6.39) be SI_TKILL (which is used by the
> kernel to indicate a signal sent using tgkill(2)).
>
> si_pid This should be set to a process ID, typically the process ID of
> the sender.
>
> si_uid This should be set to a user ID, typically the real user ID of
> the sender.
>
> si_value
> This field contains the user data to accompany the signal. For
> more information, see the description of the last (union sigval)
> argument of sigqueue(3).
>
> Internally, the kernel sets the si_signo field to the value specified
> in sig, so that the receiver of the signal can also obtain the signal
> number via that field.
>
> On Tue, Sep 25, 2018 at 07:19:02PM +0200, Eric W. Biederman wrote:
>>
>> If there is some application that calls sigqueueinfo directly that has
>> a problem with this added sanity check we can revisit this when we see
>> what kind of crazy that application is doing.
>
>
> I already know two "applications" ;)
>
> https://github.com/torvalds/linux/blob/master/tools/testing/selftests/ptrace/peeksiginfo.c
> https://github.com/checkpoint-restore/criu/blob/master/test/zdtm/static/sigpending.c
>
> Disclaimer: I'm the author of both of them.
Looking at the kernel code the historical behavior has alwasy been to prefer
the signal number passed in by the kernel.
So sigh. Implmenet __copy_siginfo_from_user and __copy_siginfo_from_user32 to
take that signal number and prefer it. The user of ptrace will still
use copy_siginfo_from_user and copy_siginfo_from_user32 as they do not and
never have had a signal number there.
Luckily this change has never made it farther than linux-next.
Fixes: e75dc036c445 ("signal: Fail sigqueueinfo if si_signo != sig")
Reported-by: Andrei Vagin <avagin@gmail.com>
Tested-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Dave writes:
"Networking fixes:
1) Fix truncation of 32-bit right shift in bpf, from Jann Horn.
2) Fix memory leak in wireless wext compat, from Stefan Seyfried.
3) Use after free in cfg80211's reg_process_hint(), from Yu Zhao.
4) Need to cancel pending work when unbinding in smsc75xx otherwise
we oops, also from Yu Zhao.
5) Don't allow enslaving a team device to itself, from Ido Schimmel.
6) Fix backwards compat with older userspace for rtnetlink FDB dumps.
From Mauricio Faria.
7) Add validation of tc policy netlink attributes, from David Ahern.
8) Fix RCU locking in rawv6_send_hdrinc(), from Wei Wang."
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
net: mvpp2: Extract the correct ethtype from the skb for tx csum offload
ipv6: take rcu lock in rawv6_send_hdrinc()
net: sched: Add policy validation for tc attributes
rtnetlink: fix rtnl_fdb_dump() for ndmsg header
yam: fix a missing-check bug
net: bpfilter: Fix type cast and pointer warnings
net: cxgb3_main: fix a missing-check bug
bpf: 32-bit RSH verification must truncate input before the ALU op
net: phy: phylink: fix SFP interface autodetection
be2net: don't flip hw_features when VXLANs are added/deleted
net/packet: fix packet drop as of virtio gso
net: dsa: b53: Keep CPU port as tagged in all VLANs
openvswitch: load NAT helper
bnxt_en: get the reduced max_irqs by the ones used by RDMA
bnxt_en: free hwrm resources, if driver probe fails.
bnxt_en: Fix enables field in HWRM_QUEUE_COS2BW_CFG request
bnxt_en: Fix VNIC reservations on the PF.
team: Forbid enslaving team device to itself
net/usb: cancel pending work when unbinding smsc75xx
mlxsw: spectrum: Delete RIF when VLAN device is removed
...
Ingo writes:
"perf fixes:
- fix a CPU#0 hot unplug bug and a PCI enumeration bug in the x86 Intel uncore PMU driver
- fix a CPU event enumeration bug in the x86 AMD PMU driver
- fix a perf ring-buffer corruption bug when using tracepoints
- fix a PMU unregister locking bug"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events
perf/x86/intel/uncore: Fix PCI BDF address of M3UPI on SKX
perf/ring_buffer: Prevent concurent ring buffer access
perf/x86/intel/uncore: Use boot_cpu_data.phys_proc_id instead of hardcorded physical package ID 0
perf/core: Fix perf_pmu_unregister() locking
Ingo writes:
"scheduler fixes:
These fixes address a rather involved performance regression between
v4.17->v4.19 in the sched/numa auto-balancing code. Since distros
really need this fix we accelerated it to sched/urgent for a faster
upstream merge.
NUMA scheduling and balancing performance is now largely back to
v4.17 levels, without reintroducing the NUMA placement bugs that
v4.18 and v4.19 fixed.
Many thanks to Srikar Dronamraju, Mel Gorman and Jirka Hladky, for
reporting, testing, re-testing and solving this rather complex set of
bugs."
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task
mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration
sched/numa: Avoid task migration for small NUMA improvement
mm/migrate: Use spin_trylock() while resetting rate limit
sched/numa: Limit the conditions where scan period is reset
sched/numa: Reset scan rate whenever task moves across nodes
sched/numa: Pass destination CPU as a parameter to migrate_task_rq
sched/numa: Stop multiple tasks from moving to the CPU at the same time
Daniel Borkmann says:
====================
pull-request: bpf 2018-10-05
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix to truncate input on ALU operations in 32 bit mode, from Jann.
2) Fixes for cgroup local storage to reject reserved flags on element
update and rejection of map allocation with zero-sized value, from Roman.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When I wrote commit 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification"), I
assumed that, in order to emulate 64-bit arithmetic with 32-bit logic, it
is sufficient to just truncate the output to 32 bits; and so I just moved
the register size coercion that used to be at the start of the function to
the end of the function.
That assumption is true for almost every op, but not for 32-bit right
shifts, because those can propagate information towards the least
significant bit. Fix it by always truncating inputs for 32-bit ops to 32
bits.
Also get rid of the coerce_reg_to_size() after the ALU op, since that has
no effect.
Fixes: 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
log_buf_len_setup does not check input argument before passing it to
simple_strtoull. The argument would be a NULL pointer if "log_buf_len",
without its value, is set in command line and thus causes the following
panic.
PANIC: early exception 0xe3 IP 10:ffffffffaaeacd0d error 0 cr2 0x0
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc4-yocto-standard+ #1
[ 0.000000] RIP: 0010:_parse_integer_fixup_radix+0xd/0x70
...
[ 0.000000] Call Trace:
[ 0.000000] simple_strtoull+0x29/0x70
[ 0.000000] memparse+0x26/0x90
[ 0.000000] log_buf_len_setup+0x17/0x22
[ 0.000000] do_early_param+0x57/0x8e
[ 0.000000] parse_args+0x208/0x320
[ 0.000000] ? rdinit_setup+0x30/0x30
[ 0.000000] parse_early_options+0x29/0x2d
[ 0.000000] ? rdinit_setup+0x30/0x30
[ 0.000000] parse_early_param+0x36/0x4d
[ 0.000000] setup_arch+0x336/0x99e
[ 0.000000] start_kernel+0x6f/0x4ee
[ 0.000000] x86_64_start_reservations+0x24/0x26
[ 0.000000] x86_64_start_kernel+0x6f/0x72
[ 0.000000] secondary_startup_64+0xa4/0xb0
This patch adds a check to prevent the panic.
Link: http://lkml.kernel.org/r/1538239553-81805-1-git-send-email-zhe.he@windriver.com
Cc: stable@vger.kernel.org
Cc: rostedt@goodmis.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: He Zhe <zhe.he@windriver.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
When booting with "nosmt=force" a message is issued into dmesg to
confirm that SMT has been force-disabled but such a message is not
issued when only "nosmt" is on the kernel command line.
Fix that.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181004172227.10094-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It appears that in commit 9d7a224b463e ("dma-direct: always allow dma mask
<= physiscal memory size") the logic of the test was changed from a "<" to
a ">=" however I don't see any reason for that change. I am assuming that
there was some additional change planned, specifically I suspect the logic
was intended to be reversed and possibly used for a return. Since that is
the case I have gone ahead and done that.
This addresses issues I had on my system that prevented me from booting
with the above mentioned commit applied on an x86_64 system w/ Intel IOMMU.
Fixes: 9d7a224b463e ("dma-direct: always allow dma mask <= physiscal memory size")
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Architectures have extra archdata in the clocksource, e.g. for VDSO
support. There are no sanity checks or general initializations for this
available. Add support for that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Matt Rickard <matt@softrans.com.au>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: devel@linuxdriverproject.org
Cc: virtualization@lists.linux-foundation.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juergen Gross <jgross@suse.com>
Link: https://lkml.kernel.org/r/20180917130706.973042587@linutronix.de
A cgroup which is already a threaded domain may be converted into a
threaded cgroup if the prerequisite conditions are met. When this
happens, all threaded descendant should also have their ->dom_cgrp
updated to the new threaded domain cgroup. Unfortunately, this
propagation was missing leading to the following failure.
# cd /sys/fs/cgroup/unified
# cat cgroup.subtree_control # show that no controllers are enabled
# mkdir -p mycgrp/a/b/c
# echo threaded > mycgrp/a/b/cgroup.type
At this point, the hierarchy looks as follows:
mycgrp [d]
a [dt]
b [t]
c [inv]
Now let's make node "a" threaded (and thus "mycgrp" s made "domain threaded"):
# echo threaded > mycgrp/a/cgroup.type
By this point, we now have a hierarchy that looks as follows:
mycgrp [dt]
a [t]
b [t]
c [inv]
But, when we try to convert the node "c" from "domain invalid" to
"threaded", we get ENOTSUP on the write():
# echo threaded > mycgrp/a/b/c/cgroup.type
sh: echo: write error: Operation not supported
This patch fixes the problem by
* Moving the opencoded ->dom_cgrp save and restoration in
cgroup_enable_threaded() into cgroup_{save|restore}_control() so
that mulitple cgroups can be handled.
* Updating all threaded descendants' ->dom_cgrp to point to the new
dom_cgrp when enabling threaded mode.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Reported-by: Amin Jamali <ajamali@pivotal.io>
Reported-by: Joao De Almeida Pereira <jpereira@pivotal.io>
Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
Fixes: 454000adaa2a ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
Cc: stable@vger.kernel.org # v4.14+
The comment related to nr_iowait_cpu() and get_iowait_load() confuses
cpufreq with cpuidle and is not very useful for this reason, so fix it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux PM <linux-pm@vger.kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e33a9bba85a8 "sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler"
Link: http://lkml.kernel.org/r/3803514.xkx7zY50tF@aspire.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net'
overlapped the renaming of a netlink attribute in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
We reserve 128 bytes for struct siginfo but only use about 48 bytes on
64bit and 32 bytes on 32bit. Someday we might use more but it is unlikely
to be anytime soon.
Userspace seems content with just enough bytes of siginfo to implement
sigqueue. Or in the case of checkpoint/restart reinjecting signals
the kernel has sent.
Reducing the stack footprint and the work to copy siginfo around from
2 cachelines to 1 cachelines seems worth doing even if I don't have
benchmarks to show a performance difference.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.
The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.
So split siginfo in two; kernel_siginfo and siginfo. Keeping the
traditional name for the userspace definition. While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.
The definition of struct kernel_siginfo I have put in include/signal_types.h
A set of buildtime checks has been added to verify the two structures have
the same field offsets.
To make it easy to verify the change kernel_siginfo retains the same
size as siginfo. The reduction in size comes in a following change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In preparation for using a smaller version of siginfo in the kernel
introduce copy_siginfo_from_user and use it when siginfo is copied from
userspace.
Make the pattern for using copy_siginfo_from_user and
copy_siginfo_from_user32 to capture the return value and return that
value on error.
This is a necessary prerequisite for using a smaller siginfo
in the kernel than the kernel exports to userspace.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Rework the defintion of struct siginfo so that the array padding
struct siginfo to SI_MAX_SIZE can be placed in a union along side of
the rest of the struct siginfo members. The result is that we no
longer need the __ARCH_SI_PREAMBLE_SIZE or SI_PAD_SIZE definitions.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The kernel needs to validate that the contents of struct siginfo make
sense as siginfo is copied into the kernel, so that the proper union
members can be put in the appropriate locations. The field si_signo
is a fundamental part of that validation. As such changing the
contents of si_signo after the validation make no sense and can result
in nonsense values in the kernel.
As such simply fail if someone is silly enough to set si_signo out of
sync with the signal number passed to sigqueueinfo.
I don't expect a problem as glibc's sigqueue implementation sets
"si_signo = sig" and CRIU just returns to the kernel what the kernel
gave to it.
If there is some application that calls sigqueueinfo directly that has
a problem with this added sanity check we can revisit this when we see
what kind of crazy that application is doing.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When moving all of the architectures specific si_codes into
siginfo.h, I apparently overlooked EMT_TAGOVF. Move it now.
Remove the now redundant test in siginfo_layout for SIGEMT
as now NSIGEMT is always defined.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If CONFIG_WW_MUTEX_SELFTEST=y is enabled, booting an image
in an arm64 virtual machine results in the following
traceback if 8 CPUs are enabled:
DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current)
WARNING: CPU: 2 PID: 537 at kernel/locking/mutex.c:1033 __mutex_unlock_slowpath+0x1a8/0x2e0
...
Call trace:
__mutex_unlock_slowpath()
ww_mutex_unlock()
test_cycle_work()
process_one_work()
worker_thread()
kthread()
ret_from_fork()
If requesting b_mutex fails with -EDEADLK, the error variable
is reassigned to the return value from calling ww_mutex_lock
on a_mutex again. If this call fails, a_mutex is not locked.
It is, however, unconditionally unlocked subsequently, causing
the reported warning. Fix the problem by using two error variables.
With this change, the selftest still fails as follows:
cyclic deadlock not resolved, ret[7/8] = -35
However, the traceback is gone.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: d1b42b800e5d0 ("locking/ww_mutex: Add kselftests for resolving ww_mutex cyclic deadlocks")
Link: http://lkml.kernel.org/r/1538516929-9734-1-git-send-email-linux@roeck-us.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When __lock_release() is called, the most likely unlock scenario is
on the innermost lock in the chain. In this case, we can skip some of
the checks and provide a faster path to completion.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1538511560-10090-4-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The static __lock_acquire() function has only two callers:
1) lock_acquire()
2) reacquire_held_locks()
In lock_acquire(), raw_local_irq_save() is called beforehand. So
IRQs must have been disabled. So the check:
DEBUG_LOCKS_WARN_ON(!irqs_disabled())
is kind of redundant in this case. So move the above check
to reacquire_held_locks() to eliminate redundant code in the
lock_acquire() path.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1538511560-10090-3-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The inline function add_chain_cache_classes() is defined, but has no
caller. Just remove it.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1538511560-10090-2-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds new BPF helper functions, bpf_sk_lookup_tcp() and
bpf_sk_lookup_udp() which allows BPF programs to find out if there is a
socket listening on this host, and returns a socket pointer which the
BPF program can then access to determine, for instance, whether to
forward or drop traffic. bpf_sk_lookup_xxx() may take a reference on the
socket, so when a BPF program makes use of this function, it must
subsequently pass the returned pointer into the newly added sk_release()
to return the reference.
By way of example, the following pseudocode would filter inbound
connections at XDP if there is no corresponding service listening for
the traffic:
struct bpf_sock_tuple tuple;
struct bpf_sock_ops *sk;
populate_tuple(ctx, &tuple); // Extract the 5tuple from the packet
sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof tuple, netns, 0);
if (!sk) {
// Couldn't find a socket listening for this traffic. Drop.
return TC_ACT_SHOT;
}
bpf_sk_release(sk, 0);
return TC_ACT_OK;
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow helper functions to acquire a reference and return it into a
register. Specific pointer types such as the PTR_TO_SOCKET will
implicitly represent such a reference. The verifier must ensure that
these references are released exactly once in each path through the
program.
To achieve this, this commit assigns an id to the pointer and tracks it
in the 'bpf_func_state', then when the function or program exits,
verifies that all of the acquired references have been freed. When the
pointer is passed to a function that frees the reference, it is removed
from the 'bpf_func_state` and all existing copies of the pointer in
registers are marked invalid.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
An upcoming commit will need very similar copy/realloc boilerplate, so
refactor the existing stack copy/realloc functions into macros to
simplify it.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Teach the verifier a little bit about a new type of pointer, a
PTR_TO_SOCKET. This pointer type is accessed from BPF through the
'struct bpf_sock' structure.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This check will be reused by an upcoming commit for conditional jump
checks for sockets. Refactor it a bit to simplify the later commit.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The array "reg_type_str" provides canonical formatting of register
types, however a couple of places would previously check whether a
register represented the context and write the name "context" directly.
An upcoming commit will add another pointer type to these statements, so
to provide more accurate error messages in the verifier, update these
error messages to use "reg_type_str" instead.
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>