27800 Commits

Author SHA1 Message Date
Mathieu Malaterre
66448bc274 workqueue: move function definitions within CONFIG_SMP block
In commit 7ee681b25284 ("workqueue: Convert to state machine callbacks"),
three new function definitions were added: ‘workqueue_prepare_cpu’,
‘workqueue_online_cpu’ and ‘workqueue_offline_cpu’.

Move these function definitions within a CONFIG_SMP block since they are
not used outside of it. This will match function declarations in header
<include/linux/workqueue.h>, and silence the following gcc warning (W=1):

  kernel/workqueue.c:4743:5: warning: no previous prototype for ‘workqueue_prepare_cpu’ [-Wmissing-prototypes]
  kernel/workqueue.c:4756:5: warning: no previous prototype for ‘workqueue_online_cpu’ [-Wmissing-prototypes]
  kernel/workqueue.c:4783:5: warning: no previous prototype for ‘workqueue_offline_cpu’ [-Wmissing-prototypes]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-05-23 11:16:58 -07:00
Tejun Heo
d8742e2290 cgroup: css_set_lock should nest inside tasklist_lock
cgroup_enable_task_cg_lists() incorrectly nests non-irq-safe
tasklist_lock inside irq-safe css_set_lock triggering the following
lockdep warning.

  WARNING: possible irq lock inversion dependency detected
  4.17.0-rc1-00027-gb37d049 #6 Not tainted
  --------------------------------------------------------
  systemd/1 just changed the state of lock:
  00000000fe57773b (css_set_lock){..-.}, at: cgroup_free+0xf2/0x12a
  but this lock took another, SOFTIRQ-unsafe lock in the past:
   (tasklist_lock){.+.+}

  and interrupts could create inverse lock ordering between them.

  other info that might help us debug this:
   Possible interrupt unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(tasklist_lock);
				 local_irq_disable();
				 lock(css_set_lock);
				 lock(tasklist_lock);
    <Interrupt>
      lock(css_set_lock);

   *** DEADLOCK ***

The condition is highly unlikely to actually happen especially given
that the path is executed only once per boot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Boqun Feng <boqun.feng@gmail.com>
2018-05-23 11:04:54 -07:00
Alexei Starovoitov
449325b52b umh: introduce fork_usermode_blob() helper
Introduce helper:
int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
struct umh_info {
       struct file *pipe_to_umh;
       struct file *pipe_from_umh;
       pid_t pid;
};

that GPLed kernel modules (signed or unsigned) can use it to execute part
of its own data as swappable user mode process.

The kernel will do:
- allocate a unique file in tmpfs
- populate that file with [data, data + len] bytes
- user-mode-helper code will do_execve that file and, before the process
  starts, the kernel will create two unix pipes for bidirectional
  communication between kernel module and umh
- close tmpfs file, effectively deleting it
- the fork_usermode_blob will return zero on success and populate
  'struct umh_info' with two unix pipes and the pid of the user process

As the first step in the development of the bpfilter project
the fork_usermode_blob() helper is introduced to allow user mode code
to be invoked from a kernel module. The idea is that user mode code plus
normal kernel module code are built as part of the kernel build
and installed as traditional kernel module into distro specified location,
such that from a distribution point of view, there is
no difference between regular kernel modules and kernel modules + umh code.
Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
by a kernel module doesn't make it any special from kernel and user space
tooling point of view.

Such approach enables kernel to delegate functionality traditionally done
by the kernel modules into the user space processes (either root or !root) and
reduces security attack surface of the new code. The buggy umh code would crash
the user process, but not the kernel. Another advantage is that umh code
of the kernel module can be debugged and tested out of user space
(e.g. opening the possibility to run clang sanitizers, fuzzers or
user space test suites on the umh code).
In case of the bpfilter project such architecture allows complex control plane
to be done in the user space while bpf based data plane stays in the kernel.

Since umh can crash, can be oom-ed by the kernel, killed by the admin,
the kernel module that uses them (like bpfilter) needs to manage life
time of umh on its own via two unix pipes and the pid of umh.

The exit code of such kernel module should kill the umh it started,
so that rmmod of the kernel module will cleanup the corresponding umh.
Just like if the kernel module does kmalloc() it should kfree() it
in the exit code.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-23 13:23:39 -04:00
Martin KaFai Lau
9b2cf328b2 bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info
In "struct bpf_map_info", the name "btf_id", "btf_key_id" and "btf_value_id"
could cause confusion because the "id" of "btf_id" means the BPF obj id
given to the BTF object while
"btf_key_id" and "btf_value_id" means the BTF type id within
that BTF object.

To make it clear, btf_key_id and btf_value_id are
renamed to btf_key_type_id and btf_value_type_id.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23 12:03:32 +02:00
Martin KaFai Lau
aea2f7b891 bpf: btf: Remove unused bits from uapi/linux/btf.h
This patch does the followings:
1. Limit BTF_MAX_TYPES and BTF_MAX_NAME_OFFSET to 64k.  We can
   raise it later.

2. Remove the BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID.  They are
   currently encoded at the highest bit of a u32.
   It is because the current use case does not require supporting
   parent type (i.e type_id referring to a type in another BTF file).
   It also does not support referring to a string in ELF.

   The BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID checks are replaced
   by BTF_TYPE_ID_CHECK and BTF_STR_OFFSET_CHECK which are
   defined in btf.c instead of uapi/linux/btf.h.

3. Limit the BTF_INFO_KIND from 5 bits to 4 bits which is enough.
   There is unused bits headroom if we ever needed it later.

4. The root bit in BTF_INFO is also removed because it is not
   used in the current use case.

5. Remove BTF_INT_VARARGS since func type is not supported now.
   The BTF_INT_ENCODING is limited to 4 bits instead of 8 bits.

The above can be added back later because the verifier
ensures the unused bits are zeros.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23 12:03:32 +02:00
Martin KaFai Lau
4ef5f5741e bpf: btf: Check array->index_type
Instead of ingoring the array->index_type field.  Enforce that
it must be a BTF_KIND_INT in size 1/2/4/8 bytes.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23 12:03:32 +02:00
Martin KaFai Lau
f80442a4cd bpf: btf: Change how section is supported in btf_header
There are currently unused section descriptions in the btf_header.  Those
sections are here to support future BTF use cases.  For example, the
func section (func_off) is to support function signature (e.g. the BPF
prog function signature).

Instead of spelling out all potential sections up-front in the btf_header.
This patch makes changes to btf_header such that extending it (e.g. adding
a section) is possible later.  The unused ones can be removed for now and
they can be added back later.

This patch:
1. adds a hdr_len to the btf_header.  It will allow adding
sections (and other info like parent_label and parent_name)
later.  The check is similar to the existing bpf_attr.
If a user passes in a longer hdr_len, the kernel
ensures the extra tailing bytes are 0.

2. allows the section order in the BTF object to be
different from its sec_off order in btf_header.

3. each sec_off is followed by a sec_len.  It must not have gap or
overlapping among sections.

The string section is ensured to be at the end due to the 4 bytes
alignment requirement of the type section.

The above changes will allow enough flexibility to
add new sections (and other info) to the btf_header later.

This patch also removes an unnecessary !err check
at the end of btf_parse().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23 12:03:31 +02:00
Martin KaFai Lau
dcab51f19b bpf: Expose check_uarg_tail_zero()
This patch exposes check_uarg_tail_zero() which will
be reused by a later BTF patch.  Its name is changed to
bpf_check_uarg_tail_zero().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-23 12:03:31 +02:00
Joel Fernandes (Google)
152db033d7 schedutil: Allow cpufreq requests to be made even when kthread kicked
Currently there is a chance of a schedutil cpufreq update request to be
dropped if there is a pending update request. This pending request can
be delayed if there is a scheduling delay of the irq_work and the wake
up of the schedutil governor kthread.

A very bad scenario is when a schedutil request was already just made,
such as to reduce the CPU frequency, then a newer request to increase
CPU frequency (even sched deadline urgent frequency increase requests)
can be dropped, even though the rate limits suggest that its Ok to
process a request. This is because of the way the work_in_progress flag
is used.

This patch improves the situation by allowing new requests to happen
even though the old one is still being processed. Note that in this
approach, if an irq_work was already issued, we just update next_freq
and don't bother to queue another request so there's no extra work being
done to make this happen.

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-23 10:37:56 +02:00
Viresh Kumar
036399782b cpufreq: Rename cpufreq_can_do_remote_dvfs()
This routine checks if the CPU running this code belongs to the policy
of the target CPU or if not, can it do remote DVFS for it remotely. But
the current name of it implies as if it is only about doing remote
updates.

Rename it to make it more relevant.

Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-23 10:37:08 +02:00
Peter Zijlstra
f64c6013a2 rcu/x86: Provide early rcu_cpu_starting() callback
The x86/mtrr code does horrific things because hardware. It uses
stop_machine_from_inactive_cpu(), which does a wakeup (of the stopper
thread on another CPU), which uses RCU, all before the CPU is onlined.

RCU complains about this, because wakeups use RCU and RCU does
(rightfully) not consider offline CPUs for grace-periods.

Fix this by initializing RCU way early in the MTRR case.

Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Add !SMP support, per 0day Test Robot report. ]
2018-05-22 16:12:26 -07:00
Dan Williams
e763848843 mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
be able to rely on the fact that they will get wakeups on dev_pagemap
page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
generic_dax_page_free() as common indicator / infrastructure for dax
filesytems to require. With this change there are no users of the
MEMORY_DEVICE_HOST designation, so remove it.

The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1 it requires an additional branch to
check the page-type at put_page() time. Given put_page() is a hot-path
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.

Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.

For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
However, we still need to support FS_DAX in the FS_DAX_LIMITED case
implemented by the s390/dcssblk driver.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Reported-by: Dave Jiang <dave.jiang@intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22 06:59:39 -07:00
Patrick Bellasi
fd7d5287fd cpufreq: schedutil: Cleanup and document iowait boost
The iowait boosting code has been recently updated to add a progressive
boosting behavior which allows to be less aggressive in boosting tasks
doing only sporadic IO operations, thus being more energy efficient for
example on mobile platforms.

The current code is now however a bit convoluted. Some functionalities
(e.g. iowait boost reset) are replicated in different paths and their
documentation is slightly misaligned.

Let's cleanup the code by consolidating all the IO wait boosting related
functionality within within few dedicated functions and better define
their role:

- sugov_iowait_boost: set/increase the IO wait boost of a CPU
- sugov_iowait_apply: apply/reduce the IO wait boost of a CPU

Both these two function are used at every sugov update and they make
use of a unified IO wait boost reset policy provided by:

- sugov_iowait_reset: reset/disable the IO wait boost of a CPU
     if a CPU is not updated for more then one tick

This makes possible a cleaner and more self-contained design for the IO
wait boosting code since the rest of the sugov update routines, both for
single and shared frequency domains, follow the same template:

   /* Configure IO boost, if required */
   sugov_iowait_boost()

   /* Return here if freq change is in progress or throttled */

   /* Collect and aggregate utilization information */
   sugov_get_util()
   sugov_aggregate_util()

   /*
    * Add IO boost, if currently enabled, on top of the aggregated
    * utilization value
    */
   sugov_iowait_apply()

As a extra bonus, let's also add the documentation for the new
functions and better align the in-code documentation.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-22 14:05:05 +02:00
Patrick Bellasi
295f1a9953 cpufreq: schedutil: Fix iowait boost reset
A more energy efficient update of the IO wait boosting mechanism has
been introduced in:

   commit a5a0809bc58e ("cpufreq: schedutil: Make iowait boost more energy efficient")

where the boost value is expected to be:

 - doubled at each successive wakeup from IO
   staring from the minimum frequency supported by a CPU

 - reset when a CPU is not updated for more then one tick
   by either disabling the IO wait boost or resetting its value to the
   minimum frequency if this new update requires an IO boost.

This approach is supposed to "ignore" boosting for sporadic wakeups from
IO, while still getting the frequency boosted to the maximum to benefit
long sequence of wakeup from IO operations.

However, these assumptions are not always satisfied.
For example, when an IO boosted CPU enters idle for more the one tick
and then wakes up after an IO wait, since in sugov_set_iowait_boost() we
first check the IOWAIT flag, we keep doubling the iowait boost instead
of restarting from the minimum frequency value.

This misbehavior could happen mainly on non-shared frequency domains,
thus defeating the energy efficiency optimization, but it can also
happen on shared frequency domain systems.

Let fix this issue in sugov_set_iowait_boost() by:
 - first check the IO wait boost reset conditions
   to eventually reset the boost value
 - then applying the correct IO boost value
   if required by the caller

Fixes: a5a0809bc58e (cpufreq: schedutil: Make iowait boost more energy efficient)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-22 14:05:05 +02:00
David S. Miller
6f6e434aa2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.

TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.

The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.

Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-21 16:01:54 -04:00
Ondrej Mosnáček
5b71388663 audit: Fix wrong task in comparison of session ID
The audit_filter_rules() function in auditsc.c compared the session ID
with the credentials of the current task, while it should use the
credentials of the task given to audit_filter_rules() as a parameter
(tsk).

GitHub issue:
https://github.com/linux-audit/audit-kernel/issues/82

Fixes: 8fae47705685 ("audit: add support for session ID user filter")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
[PM: not user visible, dropped stable]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-21 14:27:43 -04:00
Linus Torvalds
3b78ce4a34 Merge branch 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Merge speculative store buffer bypass fixes from Thomas Gleixner:

 - rework of the SPEC_CTRL MSR management to accomodate the new fancy
   SSBD (Speculative Store Bypass Disable) bit handling.

 - the CPU bug and sysfs infrastructure for the exciting new Speculative
   Store Bypass 'feature'.

 - support for disabling SSB via LS_CFG MSR on AMD CPUs including
   Hyperthread synchronization on ZEN.

 - PRCTL support for dynamic runtime control of SSB

 - SECCOMP integration to automatically disable SSB for sandboxed
   processes with a filter flag for opt-out.

 - KVM integration to allow guests fiddling with SSBD including the new
   software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on
   AMD.

 - BPF protection against SSB

.. this is just the core and x86 side, other architecture support will
come separately.

* 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
  bpf: Prevent memory disambiguation attack
  x86/bugs: Rename SSBD_NO to SSB_NO
  KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD
  x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
  x86/bugs: Rework spec_ctrl base and mask logic
  x86/bugs: Remove x86_spec_ctrl_set()
  x86/bugs: Expose x86_spec_ctrl_base directly
  x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}
  x86/speculation: Rework speculative_store_bypass_update()
  x86/speculation: Add virtualized speculative store bypass disable support
  x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
  x86/speculation: Handle HT correctly on AMD
  x86/cpufeatures: Add FEATURE_ZEN
  x86/cpufeatures: Disentangle SSBD enumeration
  x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS
  x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
  KVM: SVM: Move spec control call after restore of GS
  x86/cpu: Make alternative_msr_write work for 32-bit code
  x86/bugs: Fix the parameters alignment and missing void
  x86/bugs: Make cpu_show_common() static
  ...
2018-05-21 11:23:26 -07:00
Linus Torvalds
5aef268ace Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix refcounting bug for connections in on-packet scheduling mode of
    IPVS, from Julian Anastasov.

 2) Set network header properly in AF_PACKET's packet_snd, from Willem
    de Bruijn.

 3) Fix regressions in 3c59x by converting to generic DMA API. It was
    relying upon the hack that the PCI DMA interfaces would accept NULL
    for EISA devices. From Christoph Hellwig.

 4) Remove RDMA devices before unregistering netdev in QEDE driver, from
    Michal Kalderon.

 5) Use after free in TUN driver ptr_ring usage, from Jason Wang.

 6) Properly check for missing netlink attributes in SMC_PNETID
    requests, from Eric Biggers.

 7) Set DMA mask before performaing any DMA operations in vmxnet3
    driver, from Regis Duchesne.

 8) Fix mlx5 build with SMP=n, from Saeed Mahameed.

 9) Classifier fixes in bcm_sf2 driver from Florian Fainelli.

10) Tuntap use after free during release, from Jason Wang.

11) Don't use stack memory in scatterlists in tls code, from Matt
    Mullins.

12) Not fully initialized flow key object in ipv4 routing code, from
    David Ahern.

13) Various packet headroom bug fixes in ip6_gre driver, from Petr
    Machata.

14) Remove queues from XPS maps using correct index, from Amritha
    Nambiar.

15) Fix use after free in sock_diag, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (64 commits)
  net: ip6_gre: fix tunnel metadata device sharing.
  cxgb4: fix offset in collecting TX rate limit info
  net: sched: red: avoid hashing NULL child
  sock_diag: fix use-after-free read in __sk_free
  sh_eth: Change platform check to CONFIG_ARCH_RENESAS
  net: dsa: Do not register devlink for unused ports
  net: Fix a bug in removing queues from XPS map
  bpf: fix truncated jump targets on heavy expansions
  bpf: parse and verdict prog attach may race with bpf map update
  bpf: sockmap update rollback on error can incorrectly dec prog refcnt
  net: test tailroom before appending to linear skb
  net: ip6_gre: Fix ip6erspan hlen calculation
  net: ip6_gre: Split up ip6gre_changelink()
  net: ip6_gre: Split up ip6gre_newlink()
  net: ip6_gre: Split up ip6gre_tnl_change()
  net: ip6_gre: Split up ip6gre_tnl_link_config()
  net: ip6_gre: Fix headroom request in ip6erspan_tunnel_xmit()
  net: ip6_gre: Request headroom in __gre6_xmit()
  selftests/bpf: check return value of fopen in test_verifier.c
  erspan: fix invalid erspan version.
  ...
2018-05-21 08:37:48 -07:00
Tejun Heo
197f6accac workqueue: Make sure struct worker is accessible for wq_worker_comm()
The worker struct could already be freed when wq_worker_comm() tries
to access it for reporting.  This patch protects PF_WQ_WORKER
modifications with wq_pool_attach_mutex and makes wq_worker_comm()
test the flag before dereferencing worker from kthread_data(), which
ensures that it only dereferences when the worker struct is valid.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Lai Jiangshan <jiangshanlai@gmail.com>
Fixes: 6b59808bfe48 ("workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}")
2018-05-21 08:04:35 -07:00
Linus Torvalds
b9aad92236 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull UP timer fix from Thomas Gleixner:
 "Work around the for_each_cpu() oddity on UP kernels in the tick
  broadcast code which causes boot failures because the CPU0 bit is
  always reported as set independent of the cpumask content"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/broadcast: Use for_each_cpu() specially on UP kernels
2018-05-20 11:25:54 -07:00
Linus Torvalds
441cab960d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixlets from Thomas Gleixner:
 "Three trivial fixlets for the scheduler:

   - move print_rt_rq() and print_dl_rq() declarations to the right
     place

   - make grub_reclaim() static

   - fix the bogus documentation reference in Kconfig"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix documentation file path
  sched/deadline: Make the grub_reclaim() function static
  sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h
2018-05-20 11:23:34 -07:00
Alexei Starovoitov
af86ca4e30 bpf: Prevent memory disambiguation attack
Detect code patterns where malicious 'speculative store bypass' can be used
and sanitize such patterns.

 39: (bf) r3 = r10
 40: (07) r3 += -216
 41: (79) r8 = *(u64 *)(r7 +0)   // slow read
 42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
 43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
 44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
 45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                 // is now sanitized

Above code after x86 JIT becomes:
 e5: mov    %rbp,%rdx
 e8: add    $0xffffffffffffff28,%rdx
 ef: mov    0x0(%r13),%r14
 f3: movq   $0x0,-0x48(%rbp)
 fb: mov    %rdx,0x0(%r14)
 ff: mov    0x0(%rbx),%rdi
103: movzbq 0x0(%rdi),%rsi

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-19 20:44:24 +02:00
Arnd Bergmann
b9ff604cff timekeeping: Add ktime_get_coarse_with_offset
I have run into a couple of drivers using current_kernel_time()
suffering from the y2038 problem, and they could be converted
to using ktime_t, but don't have interfaces that skip the nanosecond
calculation at the moment.

This introduces ktime_get_coarse_with_offset() as a simpler
variant of ktime_get_with_offset(), and adds wrappers for the
three time domains we support with the existing function.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: y2038@lists.linaro.org
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/20180427134016.2525989-5-arnd@arndb.de
2018-05-19 13:57:32 +02:00
Arnd Bergmann
fb7fcc96a8 timekeeping: Standardize on ktime_get_*() naming
The current_kernel_time64, get_monotonic_coarse64, getrawmonotonic64,
get_monotonic_boottime64 and timekeeping_clocktai64 interfaces have
rather inconsistent naming, and they differ in the calling conventions
by passing the output either by reference or as a return value.

Rename them to ktime_get_coarse_real_ts64, ktime_get_coarse_ts64,
ktime_get_raw_ts64, ktime_get_boottime_ts64 and ktime_get_clocktai_ts64
respectively, and provide the interfaces with macros or inline
functions as needed.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: y2038@lists.linaro.org
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/20180427134016.2525989-4-arnd@arndb.de
2018-05-19 13:57:32 +02:00
Arnd Bergmann
edca71fecb timekeeping: Clean up ktime_get_real_ts64
In a move to make ktime_get_*() the preferred driver interface into the
timekeeping code, sanitizes ktime_get_real_ts64() to be a proper exported
symbol rather than an alias for getnstimeofday64().

The internal __getnstimeofday64() is no longer used, so remove that
and merge it into ktime_get_real_ts64().

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: y2038@lists.linaro.org
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/20180427134016.2525989-3-arnd@arndb.de
2018-05-19 13:57:32 +02:00
Arnd Bergmann
4f0fad9a60 timekeeping: Remove timespec64 hack
At this point, we have converted most of the kernel to use timespec64
consistently in place of timespec, so it seems it's time to make
timespec64 the native structure and define timespec in terms of that
one on 64-bit architectures.

Starting with gcc-5, the compiler can completely optimize away the
timespec_to_timespec64 and timespec64_to_timespec functions on 64-bit
architectures. With older compilers, we introduce a couple of extra
copies of local variables, but those are easily avoided by using
the timespec64 based interfaces consistently, as we do in most of the
important code paths already.

The main upside of removing the hack is that printing the tv_sec
field of a timespec64 structure can now use the %lld format
string on all architectures without a cast to time64_t. Without
this patch, the field is a 'long' type and would have to be printed
using %ld on 64-bit architectures.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: y2038@lists.linaro.org
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/20180427134016.2525989-2-arnd@arndb.de
2018-05-19 13:57:31 +02:00
Thomas Gleixner
b563ea676a Merge branch 'linus' into timers/2038
Merge upstream to pick up changes on which pending patches depend on.
2018-05-19 13:55:40 +02:00
John Fastabend
303def35f6 bpf: allow sk_msg programs to read sock fields
Currently sk_msg programs only have access to the raw data. However,
it is often useful when building policies to have the policies specific
to the socket endpoint. This allows using the socket tuple as input
into filters, etc.

This patch adds ctx access to the sock fields.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-18 22:44:10 +02:00
Richard Guy Briggs
5c5b8d8beb audit: use existing session info function
Use the existing audit_log_session_info() function rather than
hardcoding its functionality.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-18 15:47:54 -04:00
Tejun Heo
6b59808bfe workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}
There can be a lot of workqueue workers and they all show up with the
cryptic kworker/* names making it difficult to understand which is
doing what and how they came to be.

  # ps -ef | grep kworker
  root           4       2  0 Feb25 ?        00:00:00 [kworker/0:0H]
  root           6       2  0 Feb25 ?        00:00:00 [kworker/u112:0]
  root          19       2  0 Feb25 ?        00:00:00 [kworker/1:0H]
  root          25       2  0 Feb25 ?        00:00:00 [kworker/2:0H]
  root          31       2  0 Feb25 ?        00:00:00 [kworker/3:0H]
  ...

This patch makes workqueue workers report the latest workqueue it was
executing for through /proc/PID/{comm,stat,status}.  The extra
information is appended to the kthread name with intervening '+' if
currently executing, otherwise '-'.

  # cat /proc/25/comm
  kworker/2:0-events_power_efficient
  # cat /proc/25/stat
  25 (kworker/2:0-events_power_efficient) I 2 0 0 0 -1 69238880 0 0...
  # grep Name /proc/25/status
  Name:   kworker/2:0-events_power_efficient

Unfortunately, ps(1) truncates comm to 15 characters,

  # ps 25
    PID TTY      STAT   TIME COMMAND
     25 ?        I      0:00 [kworker/2:0-eve]

making it a lot less useful; however, this should be an easy fix from
ps(1) side.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Craig Small <csmall@enc.com.au>
2018-05-18 08:47:13 -07:00
Tejun Heo
8bf895931e workqueue: Set worker->desc to workqueue name by default
Work functions can use set_worker_desc() to improve the visibility of
what the worker task is doing.  Currently, the desc field is unset at
the beginning of each execution and there is a separate field to track
the field is set during the current execution.

Instead of leaving empty till desc is set, worker->desc can be used to
remember the last workqueue the worker worked on by default and users
that use set_worker_desc() can override it to something more
informative as necessary.

This simplifies desc handling and helps tracking the last workqueue
that the worker exected on to improve visibility.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-05-18 08:47:13 -07:00
Tejun Heo
a2d812a27a workqueue: Make worker_attach/detach_pool() update worker->pool
For historical reasons, the worker attach/detach functions don't
currently manage worker->pool and the callers are manually and
inconsistently updating it.

This patch moves worker->pool updates into the worker attach/detach
functions.  This makes worker->pool consistent and clearly defines how
worker->pool updates are synchronized.

This will help later workqueue visibility improvements by allowing
safe access to workqueue information from worker->task.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-05-18 08:47:13 -07:00
Tejun Heo
1258fae73c workqueue: Replace pool->attach_mutex with global wq_pool_attach_mutex
To improve workqueue visibility, we want to be able to access
workqueue information from worker tasks.  The per-pool attach mutex
makes that difficult because there's no way of stabilizing task ->
worker pool association without knowing the pool first.

Worker attach/detach is a slow path and there's no need for different
pools to be able to perform them concurrently.  This patch replaces
the per-pool attach_mutex with global wq_pool_attach_mutex to prepare
for visibility improvement changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2018-05-18 08:47:13 -07:00
Steffen Maier
5c750d58e9 scsi: zfcp: workqueue: set description for port work items with their WWPN as context
As a prerequisite, complement commit 3d1cb2059d93 ("workqueue: include
workqueue info when printing debug dump of a worker task") to be usable with
kernel modules by exporting the symbol set_worker_desc().  Current built-in
user was introduced with commit ef3b101925f2 ("writeback: set worker desc to
identify writeback workers in task dumps").

Can help distinguishing work items which do not have adapter scope.
Description is printed out with task dump for debugging on WARN, BUG, panic,
or magic-sysrq [show-task-states(t)].

Example:
$ echo 0 >| /sys/bus/ccw/drivers/zfcp/0.0.1880/0x50050763031bd327/failed &
$ echo 't' >| /proc/sysrq-trigger
$ dmesg
sysrq: SysRq : Show State
  task                        PC stack   pid father
...
zfcp_q_0.0.1880 S14640  2165      2 0x02000000
Call Trace:
([<00000000009df464>] __schedule+0xbf4/0xc78)
 [<00000000009df57c>] schedule+0x94/0xc0
 [<0000000000168654>] rescuer_thread+0x33c/0x3a0
 [<000000000016f8be>] kthread+0x166/0x178
 [<00000000009e71f2>] kernel_thread_starter+0x6/0xc
 [<00000000009e71ec>] kernel_thread_starter+0x0/0xc
no locks held by zfcp_q_0.0.1880/2165.
...
kworker/u512:2  D11280  2193      2 0x02000000
Workqueue: zfcp_q_0.0.1880 zfcp_scsi_rport_work [zfcp] (zrpd-50050763031bd327)
                                                        ^^^^^^^^^^^^^^^^^^^^^
Call Trace:
([<00000000009df464>] __schedule+0xbf4/0xc78)
 [<00000000009df57c>] schedule+0x94/0xc0
 [<00000000009e50c0>] schedule_timeout+0x488/0x4d0
 [<00000000001e425c>] msleep+0x5c/0x78                  >>test code only<<
 [<000003ff8008a21e>] zfcp_scsi_rport_work+0xbe/0x100 [zfcp]
 [<0000000000167154>] process_one_work+0x3b4/0x718
 [<000000000016771c>] worker_thread+0x264/0x408
 [<000000000016f8be>] kthread+0x166/0x178
 [<00000000009e71f2>] kernel_thread_starter+0x6/0xc
 [<00000000009e71ec>] kernel_thread_starter+0x0/0xc
2 locks held by kworker/u512:2/2193:
 #0:  (name){++++.+}, at: [<0000000000166f4e>] process_one_work+0x1ae/0x718
 #1:  ((&(&port->rport_work)->work)){+.+.+.}, at: [<0000000000166f4e>] process_one_work+0x1ae/0x718
...

=============================================
Showing busy workqueues and worker pools:
workqueue zfcp_q_0.0.1880: flags=0x2000a
  pwq 512: cpus=0-255 flags=0x4 nice=0 active=1/1
    in-flight: 2193:zfcp_scsi_rport_work [zfcp]
pool 512: cpus=0-255 flags=0x4 nice=0 hung=0s workers=4 idle: 5 2354 2311

Work items with adapter scope are already identified by the workqueue name
"zfcp_q_<devbusid>" and the work item function name.

Signed-off-by: Steffen Maier <maier@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2018-05-18 11:27:21 -04:00
Björn Töpel
dac09149d9 xsk: clean up SPDX headers
Clean up SPDX-License-Identifier and removing licensing leftovers.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-18 16:07:02 +02:00
Amir Goldstein
b249f5be61 fsnotify: add fsnotify_add_inode_mark() wrappers
Before changing the arguments of the functions fsnotify_add_mark()
and fsnotify_add_mark_locked(), convert most callers to use a wrapper.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-18 14:58:22 +02:00
Amir Goldstein
5b0457ad02 fsnotify: remove redundant arguments to handle_event()
inode_mark and vfsmount_mark arguments are passed to handle_event()
operation as function arguments as well as on iter_info struct.
The difference is that iter_info struct may contain marks that should
not be handled and are represented as NULL arguments to inode_mark or
vfsmount_mark.

Instead of passing the inode_mark and vfsmount_mark arguments, add
a report_mask member to iter_info struct to indicate which marks should
be handled, versus marks that should only be kept alive during user
wait.

This change is going to be used for passing more mark types
with handle_event() (i.e. super block marks).

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-05-18 14:58:22 +02:00
Mathieu Malaterre
3febfc8a21 sched/deadline: Make the grub_reclaim() function static
Since the grub_reclaim() function can be made static, make it so.

Silences the following GCC warning (W=1):

  kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-18 09:05:22 +02:00
Mathieu Malaterre
f6a3463063 sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h
In the following commit:

  6b55c9654fcc ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")

the print_cfs_rq() prototype was added to <kernel/sched/sched.h>,
right next to the prototypes for print_cfs_stats(), print_rt_stats()
and print_dl_stats().

Finish this previous commit and also move related prototypes for
print_rt_rq() and print_dl_rq().

Remove existing extern declarations now that they not needed anymore.

Silences the following GCC warning, triggered by W=1:

  kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
  kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-18 09:05:14 +02:00
Daniel Borkmann
050fad7c45 bpf: fix truncated jump targets on heavy expansions
Recently during testing, I ran into the following panic:

  [  207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP
  [  207.901637] Modules linked in: binfmt_misc [...]
  [  207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G        W         4.17.0-rc3+ #7
  [  207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
  [  207.982428] pstate: 60400005 (nZCv daif +PAN -UAO)
  [  207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
  [  207.992603] lr : 0xffff000000bdb754
  [  207.996080] sp : ffff000013703ca0
  [  207.999384] x29: ffff000013703ca0 x28: 0000000000000001
  [  208.004688] x27: 0000000000000001 x26: 0000000000000000
  [  208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00
  [  208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000
  [  208.020599] x21: fffffffffeff2a6f x20: 000000000000000a
  [  208.025903] x19: ffff000009578000 x18: 0000000000000a03
  [  208.031206] x17: 0000000000000000 x16: 0000000000000000
  [  208.036510] x15: 0000ffff9de83000 x14: 0000000000000000
  [  208.041813] x13: 0000000000000000 x12: 0000000000000000
  [  208.047116] x11: 0000000000000001 x10: ffff0000089e7f18
  [  208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000
  [  208.057723] x7 : 000000000000000a x6 : 00280c6160000000
  [  208.063026] x5 : 0000000000000018 x4 : 0000000000007db6
  [  208.068329] x3 : 000000000008647a x2 : 19868179b1484500
  [  208.073632] x1 : 0000000000000000 x0 : ffff000009578c08
  [  208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974)
  [  208.086235] Call trace:
  [  208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
  [  208.093713]  0xffff000000bdb754
  [  208.096845]  bpf_test_run+0x78/0xf8
  [  208.100324]  bpf_prog_test_run_skb+0x148/0x230
  [  208.104758]  sys_bpf+0x314/0x1198
  [  208.108064]  el0_svc_naked+0x30/0x34
  [  208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680)
  [  208.117717] ---[ end trace 263cb8a59b5bf29f ]---

The program itself which caused this had a long jump over the whole
instruction sequence where all of the inner instructions required
heavy expansions into multiple BPF instructions. Additionally, I also
had BPF hardening enabled which requires once more rewrites of all
constant values in order to blind them. Each time we rewrite insns,
bpf_adj_branches() would need to potentially adjust branch targets
which cross the patchlet boundary to accommodate for the additional
delta. Eventually that lead to the case where the target offset could
not fit into insn->off's upper 0x7fff limit anymore where then offset
wraps around becoming negative (in s16 universe), or vice versa
depending on the jump direction.

Therefore it becomes necessary to detect and reject any such occasions
in a generic way for native eBPF and cBPF to eBPF migrations. For
the latter we can simply check bounds in the bpf_convert_filter()'s
BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
of subsequent hardening) is a bit more complex in that we need to
detect such truncations before hitting the bpf_prog_realloc(). Thus
the latter is split into an extra pass to probe problematic offsets
on the original program in order to fail early. With that in place
and carefully tested I no longer hit the panic and the rewrites are
rejected properly. The above example panic I've seen on bpf-next,
though the issue itself is generic in that a guard against this issue
in bpf seems more appropriate in this case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-17 16:05:35 -07:00
John Fastabend
9617456054 bpf: parse and verdict prog attach may race with bpf map update
In the sockmap design BPF programs (SK_SKB_STREAM_PARSER,
SK_SKB_STREAM_VERDICT and SK_MSG_VERDICT) are attached to the sockmap
map type and when a sock is added to the map the programs are used by
the socket. However, sockmap updates from both userspace and BPF
programs can happen concurrently with the attach and detach of these
programs.

To resolve this we use the bpf_prog_inc_not_zero and a READ_ONCE()
primitive to ensure the program pointer is not refeched and
possibly NULL'd before the refcnt increment. This happens inside
a RCU critical section so although the pointer reference in the map
object may be NULL (by a concurrent detach operation) the reference
from READ_ONCE will not be free'd until after grace period. This
ensures the object returned by READ_ONCE() is valid through the
RCU criticl section and safe to use as long as we "know" it may
be free'd shortly.

Daniel spotted a case in the sock update API where instead of using
the READ_ONCE() program reference we used the pointer from the
original map, stab->bpf_{verdict|parse|txmsg}. The problem with this
is the logic checks the object returned from the READ_ONCE() is not
NULL and then tries to reference the object again but using the
above map pointer, which may have already been NULL'd by a parallel
detach operation. If this happened bpf_porg_inc_not_zero could
dereference a NULL pointer.

Fix this by using variable returned by READ_ONCE() that is checked
for NULL.

Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-18 00:27:37 +02:00
John Fastabend
a593f70831 bpf: sockmap update rollback on error can incorrectly dec prog refcnt
If the user were to only attach one of the parse or verdict programs
then it is possible a subsequent sockmap update could incorrectly
decrement the refcnt on the program. This happens because in the
rollback logic, after an error, we have to decrement the program
reference count when its been incremented. However, we only increment
the program reference count if the user has both a verdict and a
parse program. The reason for this is because, at least at the
moment, both are required for any one to be meaningful. The problem
fixed here is in the rollback path we decrement the program refcnt
even if only one existing. But we never incremented the refcnt in
the first place creating an imbalance.

This patch fixes the error path to handle this case.

Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-18 00:27:37 +02:00
Gustavo A. R. Silva
a78622932c bpf: sockmap, fix double-free
`e' is being freed twice.

Fix this by removing one of the kfree() calls.

Addresses-Coverity-ID: 1468983 ("Double free")
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-17 22:44:49 +02:00
Gustavo A. R. Silva
0e43645603 bpf: sockmap, fix uninitialized variable
There is a potential execution path in which variable err is
returned without being properly initialized previously.

Fix this by initializing variable err to 0.

Addresses-Coverity-ID: 1468964 ("Uninitialized scalar variable")
Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with hashmap")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-17 22:44:43 +02:00
Richard Guy Briggs
38f8059048 audit: normalize loginuid read access
Recognizing that the loginuid is an internal audit value, use an access
function to retrieve the audit loginuid value for the task rather than
reaching directly into the task struct to get it.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-17 16:41:19 -04:00
Richard Guy Briggs
8982a1fbe0 audit: use new audit_context access funciton for seccomp_actions_logged
On the rebase of the following commit on the new seccomp actions_logged
function, one audit_context access was missed.

commit cdfb6b341f0f2409aba24b84f3b4b2bba50be5c5
("audit: use inline function to get audit context")

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-17 15:56:20 -04:00
David S. Miller
b9f672af14 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-05-17

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Provide a new BPF helper for doing a FIB and neighbor lookup
   in the kernel tables from an XDP or tc BPF program. The helper
   provides a fast-path for forwarding packets. The API supports
   IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
   implemented in this initial work, from David (Ahern).

2) Just a tiny diff but huge feature enabled for nfp driver by
   extending the BPF offload beyond a pure host processing offload.
   Offloaded XDP programs are allowed to set the RX queue index and
   thus opening the door for defining a fully programmable RSS/n-tuple
   filter replacement. Once BPF decided on a queue already, the device
   data-path will skip the conventional RSS processing completely,
   from Jakub.

3) The original sockmap implementation was array based similar to
   devmap. However unlike devmap where an ifindex has a 1:1 mapping
   into the map there are use cases with sockets that need to be
   referenced using longer keys. Hence, sockhash map is added reusing
   as much of the sockmap code as possible, from John.

4) Introduce BTF ID. The ID is allocatd through an IDR similar as
   with BPF maps and progs. It also makes BTF accessible to user
   space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
   through BPF_OBJ_GET_INFO_BY_FD, from Martin.

5) Enable BPF stackmap with build_id also in NMI context. Due to the
   up_read() of current->mm->mmap_sem build_id cannot be parsed.
   This work defers the up_read() via a per-cpu irq_work so that
   at least limited support can be enabled, from Song.

6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
   JIT conversion as well as implementation of an optimized 32/64 bit
   immediate load in the arm64 JIT that allows to reduce the number of
   emitted instructions; in case of tested real-world programs they
   were shrinking by three percent, from Daniel.

7) Add ifindex parameter to the libbpf loader in order to enable
   BPF offload support. Right now only iproute2 can load offloaded
   BPF and this will also enable libbpf for direct integration into
   other applications, from David (Beckett).

8) Convert the plain text documentation under Documentation/bpf/ into
   RST format since this is the appropriate standard the kernel is
   moving to for all documentation. Also add an overview README.rst,
   from Jesper.

9) Add __printf verification attribute to the bpf_verifier_vlog()
   helper. Though it uses va_list we can still allow gcc to check
   the format string, from Mathieu.

10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
    is a bash 4.0+ feature which is not guaranteed to be available
    when calling out to shell, therefore use a more portable variant,
    from Joe.

11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
    instead of relying on the gcc built-in, from Björn.

12) Fix a sock hashmap kmalloc warning reported by syzbot when an
    overly large key size is used in hashmap then causing overflows
    in htab->elem_size. Reject bogus attr->key_size early in the
    sock_hash_alloc(), from Yonghong.

13) Ensure in BPF selftests when urandom_read is being linked that
    --build-id is always enabled so that test_stacktrace_build_id[_nmi]
    won't be failing, from Alexei.

14) Add bitsperlong.h as well as errno.h uapi headers into the tools
    header infrastructure which point to one of the arch specific
    uapi headers. This was needed in order to fix a build error on
    some systems for the BPF selftests, from Sirio.

15) Allow for short options to be used in the xdp_monitor BPF sample
    code. And also a bpf.h tools uapi header sync in order to fix a
    selftest build failure. Both from Prashant.

16) More formally clarify the meaning of ID in the direct packet access
    section of the BPF documentation, from Wang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16 22:47:11 -04:00
John Fastabend
e23afe5e7c bpf: sockmap, on update propagate errors back to userspace
When an error happens in the update sockmap element logic also pass
the err up to the user.

Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with hashmap")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-17 01:48:22 +02:00
Yonghong Song
683d2ac390 bpf: fix sock hashmap kmalloc warning
syzbot reported a kernel warning below:
  WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
  Kernel panic - not syncing: panic_on_warn set ...

  CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x1b9/0x294 lib/dump_stack.c:113
   panic+0x22f/0x4de kernel/panic.c:184
   __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
   report_bug+0x252/0x2d0 lib/bug.c:186
   fixup_bug arch/x86/kernel/traps.c:178 [inline]
   do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
   do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
   invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
  RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996
  RSP: 0018:ffff8801d907fc58 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff8801aeecb280 RCX: ffffffff8185ebd7
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffe1
  RBP: ffff8801d907fc58 R08: ffff8801adb5e1c0 R09: ffffed0035a84700
  R10: ffffed0035a84700 R11: ffff8801ad423803 R12: ffff8801aeecb280
  R13: 00000000fffffff4 R14: ffff8801ad891a00 R15: 00000000014200c0
   __do_kmalloc mm/slab.c:3713 [inline]
   __kmalloc+0x25/0x760 mm/slab.c:3727
   kmalloc include/linux/slab.h:517 [inline]
   map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858
   __do_sys_bpf kernel/bpf/syscall.c:2131 [inline]
   __se_sys_bpf kernel/bpf/syscall.c:2096 [inline]
   __x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096
   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

The test case is against sock hashmap with a key size 0xffffffe1.
Such a large key size will cause the below code in function
sock_hash_alloc() overflowing and produces a smaller elem_size,
hence map creation will be successful.
    htab->elem_size = sizeof(struct htab_elem) +
                      round_up(htab->map.key_size, 8);

Later, when map_get_next_key is called and kernel tries
to allocate the key unsuccessfully, it will issue
the above warning.

Similar to hashtab, ensure the key size is at most
MAX_BPF_STACK for a successful map creation.

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Reported-by: syzbot+e4566d29080e7f3460ff@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-17 00:56:06 +02:00
Mathieu Malaterre
db6f9e55c8 clocksource: Move inline keyword to the beginning of function declarations
The inline keyword was not at the beginning of the function declarations.
Fix the following warnings triggered when using W=1:

  kernel/time/clocksource.c:456:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
  kernel/time/clocksource.c:457:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/20180516195943.31924-1-malat@debian.org
2018-05-16 22:21:32 +02:00