32855 Commits

Author SHA1 Message Date
Ian Rogers
95ed6c707f perf/cgroup: Order events in RB tree by cgroup id
If one is monitoring 6 events on 20 cgroups the per-CPU RB tree will
hold 120 events. The scheduling in of the events currently iterates
over all events looking to see which events match the task's cgroup or
its cgroup hierarchy. If a task is in 1 cgroup with 6 events, then 114
events are considered unnecessarily.

This change orders events in the RB tree by cgroup id if it is present.
This means scheduling in may go directly to events associated with the
task's cgroup if one is present. The per-CPU iterator storage in
visit_groups_merge is sized sufficent for an iterator per cgroup depth,
where different iterators are needed for the task's cgroup and parent
cgroups. By considering the set of iterators when visiting, the lowest
group_index event may be selected and the insertion order group_index
property is maintained. This also allows event rotation to function
correctly, as although events are grouped into a cgroup, rotation always
selects the lowest group_index event to rotate (delete/insert into the
tree) and the min heap of iterators make it so that the group_index order
is maintained.

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20190724223746.153620-3-irogers@google.com
2020-03-06 11:57:01 +01:00
Ian Rogers
c2283c9368 perf/cgroup: Grow per perf_cpu_context heap storage
Allow the per-CPU min heap storage to have sufficient space for per-cgroup
iterators.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-6-irogers@google.com
2020-03-06 11:57:00 +01:00
Ian Rogers
836196beb3 perf/core: Add per perf_cpu_context min_heap storage
The storage required for visit_groups_merge's min heap needs to vary in
order to support more iterators, such as when multiple nested cgroups'
events are being visited. This change allows for 2 iterators and doesn't
support growth.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-5-irogers@google.com
2020-03-06 11:57:00 +01:00
Ian Rogers
6eef8a7116 perf/core: Use min_heap in visit_groups_merge()
visit_groups_merge will pick the next event based on when it was
inserted in to the context (perf_event group_index). Events may be per CPU
or for any CPU, but in the future we'd also like to have per cgroup events
to avoid searching all events for the events to schedule for a cgroup.
Introduce a min heap for the events that maintains a property that the
earliest inserted event is always at the 0th element. Initialize the heap
with per-CPU and any-CPU events for the context.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-4-irogers@google.com
2020-03-06 11:56:59 +01:00
Peter Zijlstra
98add2af89 perf/cgroup: Reorder perf_cgroup_connect()
Move perf_cgroup_connect() after perf_event_alloc(), such that we can
find/use the PMU's cpu context.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-2-irogers@google.com
2020-03-06 11:56:58 +01:00
Peter Zijlstra
2c2366c754 perf/core: Remove 'struct sched_in_data'
We can deduce the ctx and cpuctx from the event, no need to pass them
along. Remove the structure and pass in can_add_hw directly.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-06 11:56:58 +01:00
Peter Zijlstra
ab6f824cfd perf/core: Unify {pinned,flexible}_sched_in()
Less is more; unify the two very nearly identical function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-06 11:56:55 +01:00
Ingo Molnar
1941011a8b Merge branch 'perf/urgent' into perf/core, to pick up the latest fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-06 11:56:40 +01:00
Peter Zijlstra
4b39f99c22 futex: Remove {get,drop}_futex_key_refs()
Now that {get,drop}_futex_key_refs() have become a glorified NOP,
remove them entirely.

The only thing get_futex_key_refs() is still doing is an smp_mb(), and
now that we don't need to (ab)use existing atomic ops to obtain them,
we can place it explicitly where we need it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-03-06 11:06:19 +01:00
Peter Zijlstra
222993395e futex: Remove pointless mmgrap() + mmdrop()
We always set 'key->private.mm' to 'current->mm', getting an extra
reference on 'current->mm' is quite pointless, because as long as the
task is blocked it isn't going to go away.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-03-06 11:06:18 +01:00
Peter Zijlstra
3867913c45 Merge branch 'locking/urgent' 2020-03-06 11:06:17 +01:00
Peter Zijlstra
8019ad13ef futex: Fix inode life-time issue
As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.

This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.

Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-03-06 11:06:15 +01:00
Eric Biggers
07b24c7c08 crypto: pcrypt - simplify error handling in pcrypt_create_aead()
Simplify the error handling in pcrypt_create_aead() by taking advantage
of crypto_grab_aead() now handling an ERR_PTR() name and by taking
advantage of crypto_drop_aead() now accepting (as a no-op) a spawn that
hasn't been grabbed yet.

This required also making padata_free_shell() accept a NULL argument.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-03-06 12:28:24 +11:00
KP Singh
3e7c67d90e bpf: Fix bpf_prog_test_run_tracing for !CONFIG_NET
test_run.o is not built when CONFIG_NET is not set and
bpf_prog_test_run_tracing being referenced in bpf_trace.o causes the
linker error:

ld: kernel/trace/bpf_trace.o:(.rodata+0x38): undefined reference to
 `bpf_prog_test_run_tracing'

Add a __weak function in bpf_trace.c to handle this.

Fixes: da00d2f117a0 ("bpf: Add test ops for BPF_PROG_TYPE_TRACING")
Signed-off-by: KP Singh <kpsingh@google.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305220127.29109-1-kpsingh@chromium.org
2020-03-05 15:14:58 -08:00
KP Singh
69191754ff bpf: Remove unnecessary CAP_MAC_ADMIN check
While well intentioned, checking CAP_MAC_ADMIN for attaching
BPF_MODIFY_RETURN tracing programs to "security_" functions is not
necessary as tracing BPF programs already require CAP_SYS_ADMIN.

Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305204955.31123-1-kpsingh@chromium.org
2020-03-05 14:27:22 -08:00
Martin KaFai Lau
849b4d9458 bpf: Do not allow map_freeze in struct_ops map
struct_ops map cannot support map_freeze.  Otherwise, a struct_ops
cannot be unregistered from the subsystem.

Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305013454.535397-1-kafai@fb.com
2020-03-05 14:15:49 -08:00
Martin KaFai Lau
8e5290e710 bpf: Return better error value in delete_elem for struct_ops map
The current always succeed behavior in bpf_struct_ops_map_delete_elem()
is not ideal for userspace tool.  It can be improved to return proper
error value.

If it is in TOBEFREE, it means unregistration has been already done
before but it is in progress and waiting for the subsystem to clear
the refcnt to zero, so -EINPROGRESS.

If it is INIT, it means the struct_ops has not been registered yet,
so -ENOENT.

Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305013447.535326-1-kafai@fb.com
2020-03-05 14:15:48 -08:00
Yonghong Song
1bc7896e9e bpf: Fix deadlock with rq_lock in bpf_send_signal()
When experimenting with bpf_send_signal() helper in our production
environment (5.2 based), we experienced a deadlock in NMI mode:
   #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
   #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
   #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
   #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
   #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
  #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
  #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
  #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
  #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
  #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
  #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
  #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
  #17 [ffffc9002219fc90] schedule at ffffffff81a3e219
  #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
  #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
  #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
  #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
  #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
  #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068

The above call stack is actually very similar to an issue
reported by Commit eac9153f2b58 ("bpf/stackmap: Fix deadlock with
rq_lock in bpf_get_stack()") by Song Liu. The only difference is
bpf_send_signal() helper instead of bpf_get_stack() helper.

The above deadlock is triggered with a perf_sw_event.
Similar to Commit eac9153f2b58, the below almost identical reproducer
used tracepoint point sched/sched_switch so the issue can be easily caught.
  /* stress_test.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <pthread.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>

  #define THREAD_COUNT 1000
  char *filename;
  void *worker(void *p)
  {
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
  }

  int main(int argc, char *argv[])
  {
        void *ptr;
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
  }
and the following command:
  1. run `stress_test /bin/ls` in one windown
  2. hack bcc trace.py with the following change:
     --- a/tools/trace.py
     +++ b/tools/trace.py
     @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
              __data.tgid = __tgid;
              __data.pid = __pid;
              bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
     +        bpf_send_signal(10);
      %s
      %s
              %s.perf_submit(%s, &__data, sizeof(__data));
  3. in a different window run
     ./trace.py -p $(pidof stress_test) t:sched:sched_switch

The deadlock can be reproduced in our production system.

Similar to Song's fix, the fix is to delay sending signal if
irqs is disabled to avoid deadlocks involving with rq_lock.
With this change, my above stress-test in our production system
won't cause deadlock any more.

I also implemented a scale-down version of reproducer in the
selftest (a subsequent commit). With latest bpf-next,
it complains for the following potential deadlock.
  [   32.832450] -> #1 (&p->pi_lock){-.-.}:
  [   32.833100]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.833696]        task_rq_lock+0x2c/0xa0
  [   32.834182]        task_sched_runtime+0x59/0xd0
  [   32.834721]        thread_group_cputime+0x250/0x270
  [   32.835304]        thread_group_cputime_adjusted+0x2e/0x70
  [   32.835959]        do_task_stat+0x8a7/0xb80
  [   32.836461]        proc_single_show+0x51/0xb0
  ...
  [   32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
  [   32.840275]        __lock_acquire+0x1358/0x1a20
  [   32.840826]        lock_acquire+0xc7/0x1d0
  [   32.841309]        _raw_spin_lock_irqsave+0x44/0x80
  [   32.841916]        __lock_task_sighand+0x79/0x160
  [   32.842465]        do_send_sig_info+0x35/0x90
  [   32.842977]        bpf_send_signal+0xa/0x10
  [   32.843464]        bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
  [   32.844301]        trace_call_bpf+0x115/0x270
  [   32.844809]        perf_trace_run_bpf_submit+0x4a/0xc0
  [   32.845411]        perf_trace_sched_switch+0x10f/0x180
  [   32.846014]        __schedule+0x45d/0x880
  [   32.846483]        schedule+0x5f/0xd0
  ...

  [   32.853148] Chain exists of:
  [   32.853148]   &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
  [   32.853148]
  [   32.854451]  Possible unsafe locking scenario:
  [   32.854451]
  [   32.855173]        CPU0                    CPU1
  [   32.855745]        ----                    ----
  [   32.856278]   lock(&rq->lock);
  [   32.856671]                                lock(&p->pi_lock);
  [   32.857332]                                lock(&rq->lock);
  [   32.857999]   lock(&(&sighand->siglock)->rlock);

  Deadlock happens on CPU0 when it tries to acquire &sighand->siglock
  but it has been held by CPU1 and CPU1 tries to grab &rq->lock
  and cannot get it.

  This is not exactly the callstack in our production environment,
  but sympotom is similar and both locks are using spin_lock_irqsave()
  to acquire the lock, and both involves rq_lock. The fix to delay
  sending signal when irq is disabled also fixed this issue.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com
2020-03-05 14:02:22 -08:00
Cengiz Can
153031a301 blktrace: fix dereference after null check
There was a recent change in blktrace.c that added a RCU protection to
`q->blk_trace` in order to fix a use-after-free issue during access.

However the change missed an edge case that can lead to dereferencing of
`bt` pointer even when it's NULL:

Coverity static analyzer marked this as a FORWARD_NULL issue with CID
1460458.

```
/kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store()
1898            ret = 0;
1899            if (bt == NULL)
1900                    ret = blk_trace_setup_queue(q, bdev);
1901
1902            if (ret == 0) {
1903                    if (attr == &dev_attr_act_mask)
>>>     CID 1460458:  Null pointer dereferences  (FORWARD_NULL)
>>>     Dereferencing null pointer "bt".
1904                            bt->act_mask = value;
1905                    else if (attr == &dev_attr_pid)
1906                            bt->pid = value;
1907                    else if (attr == &dev_attr_start_lba)
1908                            bt->start_lba = value;
1909                    else if (attr == &dev_attr_end_lba)
```

Added a reassignment with RCU annotation to fix the issue.

Fixes: c780e86dd48 ("blktrace: Protect q->blk_trace with RCU")
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Cengiz Can <cengiz@kernel.wtf>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-05 13:42:40 -07:00
Tycho Andersen
51891498f2 seccomp: allow TSYNC and USER_NOTIF together
The restriction introduced in 7a0df7fbc145 ("seccomp: Make NEW_LISTENER and
TSYNC flags exclusive") is mostly artificial: there is enough information
in a seccomp user notification to tell which thread triggered a
notification. The reason it was introduced is because TSYNC makes the
syscall return a thread-id on failure, and NEW_LISTENER returns an fd, and
there's no way to distinguish between these two cases (well, I suppose the
caller could check all fds it has, then do the syscall, and if the return
value was an fd that already existed, then it must be a thread id, but
bleh).

Matthew would like to use these two flags together in the Chrome sandbox
which wants to use TSYNC for video drivers and NEW_LISTENER to proxy
syscalls.

So, let's fix this ugliness by adding another flag, TSYNC_ESRCH, which
tells the kernel to just return -ESRCH on a TSYNC error. This way,
NEW_LISTENER (and any subsequent seccomp() commands that want to return
positive values) don't conflict with each other.

Suggested-by: Matthew Denton <mpdenton@google.com>
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Link: https://lore.kernel.org/r/20200304180517.23867-1-tycho@tycho.ws
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-03-04 14:48:54 -08:00
KP Singh
da00d2f117 bpf: Add test ops for BPF_PROG_TYPE_TRACING
The current fexit and fentry tests rely on a different program to
exercise the functions they attach to. Instead of doing this, implement
the test operations for tracing which will also be used for
BPF_MODIFY_RETURN in a subsequent patch.

Also, clean up the fexit test to use the generated skeleton.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-7-kpsingh@chromium.org
2020-03-04 13:41:06 -08:00
KP Singh
6ba43b761c bpf: Attachment verification for BPF_MODIFY_RETURN
- Allow BPF_MODIFY_RETURN attachment only to functions that are:

    * Whitelisted for error injection by checking
      within_error_injection_list. Similar discussions happened for the
      bpf_override_return helper.

    * security hooks, this is expected to be cleaned up with the LSM
      changes after the KRSI patches introduce the LSM_HOOK macro:

        https://lore.kernel.org/bpf/20200220175250.10795-1-kpsingh@chromium.org/

- The attachment is currently limited to functions that return an int.
  This can be extended later other types (e.g. PTR).

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-5-kpsingh@chromium.org
2020-03-04 13:41:05 -08:00
KP Singh
ae24082331 bpf: Introduce BPF_MODIFY_RETURN
When multiple programs are attached, each program receives the return
value from the previous program on the stack and the last program
provides the return value to the attached function.

The fmod_ret bpf programs are run after the fentry programs and before
the fexit programs. The original function is only called if all the
fmod_ret programs return 0 to avoid any unintended side-effects. The
success value, i.e. 0 is not currently configurable but can be made so
where user-space can specify it at load time.

For example:

int func_to_be_attached(int a, int b)
{  <--- do_fentry

do_fmod_ret:
   <update ret by calling fmod_ret>
   if (ret != 0)
        goto do_fexit;

original_function:

    <side_effects_happen_here>

}  <--- do_fexit

The fmod_ret program attached to this function can be defined as:

SEC("fmod_ret/func_to_be_attached")
int BPF_PROG(func_name, int a, int b, int ret)
{
        // This will skip the original function logic.
        return 1;
}

The first fmod_ret program is passed 0 in its return argument.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-4-kpsingh@chromium.org
2020-03-04 13:41:05 -08:00
KP Singh
88fd9e5352 bpf: Refactor trampoline update code
As we need to introduce a third type of attachment for trampolines, the
flattened signature of arch_prepare_bpf_trampoline gets even more
complicated.

Refactor the prog and count argument to arch_prepare_bpf_trampoline to
use bpf_tramp_progs to simplify the addition and accounting for new
attachment types.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-2-kpsingh@chromium.org
2020-03-04 13:41:05 -08:00
Tycho Andersen
2e5383d790 cgroup1: don't call release_agent when it is ""
Older (and maybe current) versions of systemd set release_agent to "" when
shutting down, but do not set notify_on_release to 0.

Since 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate
call_usermodehelper()"), we filter out such calls when the user mode helper
path is "". However, when used in conjunction with an actual (i.e. non "")
STATIC_USERMODEHELPER, the path is never "", so the real usermode helper
will be called with argv[0] == "".

Let's avoid this by not invoking the release_agent when it is "".

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-04 11:53:33 -05:00
Yu Chen
2333e82995 workqueue: Make workqueue_init*() return void
The return values of workqueue_init() and workqueue_early_int() are
always 0, and there is no usage of their return value.  So just make
them return void.

Signed-off-by: Yu Chen <chen.yu@easystack.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-04 11:21:49 -05:00
Qian Cai
190ecb190a cgroup: fix psi_show() crash on 32bit ino archs
Similar to the commit d7495343228f ("cgroup: fix incorrect
WARN_ON_ONCE() in cgroup_setup_root()"), cgroup_id(root_cgrp) does not
equal to 1 on 32bit ino archs which triggers all sorts of issues with
psi_show() on s390x. For example,

 BUG: KASAN: slab-out-of-bounds in collect_percpu_times+0x2d0/
 Read of size 4 at addr 000000001e0ce000 by task read_all/3667
 collect_percpu_times+0x2d0/0x798
 psi_show+0x7c/0x2a8
 seq_read+0x2ac/0x830
 vfs_read+0x92/0x150
 ksys_read+0xe2/0x188
 system_call+0xd8/0x2b4

Fix it by using cgroup_ino().

Fixes: 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.5
2020-03-04 11:15:58 -05:00
Waiman Long
d441dceb5d tick/common: Make tick_periodic() check for missing ticks
The tick_periodic() function is used at the beginning part of the
bootup process for time keeping while the other clock sources are
being initialized.

The current code assumes that all the timer interrupts are handled in
a timely manner with no missing ticks. That is not actually true. Some
ticks are missed and there are some discrepancies between the tick time
(jiffies) and the timestamp reported in the kernel log.  Some systems,
however, are more prone to missing ticks than the others.  In the extreme
case, the discrepancy can actually cause a soft lockup message to be
printed by the watchdog kthread. For example, on a Cavium ThunderX2
Sabre arm64 system:

 [   25.496379] watchdog: BUG: soft lockup - CPU#14 stuck for 22s!

On that system, the missing ticks are especially prevalent during the
smp_init() phase of the boot process. With an instrumented kernel,
it was found that it took about 24s as reported by the timestamp for
the tick to accumulate 4s of time.

Investigation and bisection done by others seemed to point to the
commit 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or
lack thereof") as the culprit. It could also be a firmware issue as
new firmware was promised that would fix the issue.

To properly address this problem, stop assuming that there will be no
missing tick in tick_periodic(). Modify it to follow the example of
tick_do_update_jiffies64() by using another reference clock to check for
missing ticks. Since the watchdog timer uses running_clock(), it is used
here as the reference. With this applied, the soft lockup problem in the
affected arm64 system is gone and tick time tracks much more closely to the
timestamp time.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200207193929.27308-1-longman@redhat.com
2020-03-04 10:18:11 +01:00
Wen Yang
38f7b0b131 hrtimer: Cast explicitely to u32t in __ktime_divns()
do_div() does a 64-by-32 division at least on 32bit platforms, while the
divisor 'div' is explicitly casted to unsigned long, thus 64-bit on 64-bit
platforms.

The code already ensures that the divisor is less than 2^32. Hence the
proper cast type is u32.

Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200130130851.29204-1-wenyang@linux.alibaba.com
2020-03-04 10:17:51 +01:00
Wen Yang
4cbbc3a0ee timekeeping: Prevent 32bit truncation in scale64_check_overflow()
While unlikely the divisor in scale64_check_overflow() could be >= 32bit in
scale64_check_overflow(). do_div() truncates the divisor to 32bit at least
on 32bit platforms.

Use div64_u64() instead to avoid the truncation to 32-bit.

[ tglx: Massaged changelog ]

Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200120100523.45656-1-wenyang@linux.alibaba.com
2020-03-04 10:17:51 +01:00
Eric W. Biederman
b95e31c07c posix-cpu-timers: Stop disabling timers on mt-exec
The reasons why the extra posix_cpu_timers_exit_group() invocation has been
added are not entirely clear from the commit message.  Today all that
posix_cpu_timers_exit_group() does is stop timers that are tracking the
task from firing.  Every other operation on those timers is still allowed.

The practical implication of this is posix_cpu_timer_del() which could
not get the siglock after the thread group leader has exited (because
sighand == NULL) would be able to run successfully because the timer
was already dequeued.

With that locking issue fixed there is no point in disabling all of the
timers.  So remove this ``tempoary'' hack.

Fixes: e0a70217107e ("posix-cpu-timers: workaround to suppress the problems with mt exec")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87o8tityzs.fsf@x220.int.ebiederm.org
2020-03-04 09:54:55 +01:00
Eric W. Biederman
55e8c8eb2c posix-cpu-timers: Store a reference to a pid not a task
posix cpu timers do not handle the death of a process well.

This is most clearly seen when a multi-threaded process calls exec from a
thread that is not the leader of the thread group.  The posix cpu timer code
continues to pin the old thread group leader and is unable to find the
siglock from there.

This results in posix_cpu_timer_del being unable to delete a timer,
posix_cpu_timer_set being unable to set a timer.  Further to compensate for
the problems in posix_cpu_timer_del on a multi-threaded exec all timers
that point at the multi-threaded task are stopped.

The code for the timers fundamentally needs to check if the target
process/thread is alive.  This needs an extra level of indirection. This
level of indirection is already available in struct pid.

So replace cpu.task with cpu.pid to get the needed extra layer of
indirection.

In addition to handling things more cleanly this reduces the amount of
memory a timer can pin when a process exits and then is reaped from
a task_struct to the vastly smaller struct pid.

Fixes: e0a70217107e ("posix-cpu-timers: workaround to suppress the problems with mt exec")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87wo86tz6d.fsf@x220.int.ebiederm.org
2020-03-04 09:54:55 +01:00
Qian Cai
a534e924c5 PM: QoS: annotate data races in pm_qos_*_value()
The target_value field in struct pm_qos_constraints is used for
lockless access to the effective constraint value of a given QoS
list, so the readers of it cannot expect it to always reflect the
most recent effective constraint value.  However, they can and do
expect it to be equal to a valid effective constraint value computed
at a certain time in the past (event though it may not be the most
recent one), so add READ|WRITE_ONCE() annotations around the
target_value accesses to prevent the compiler from possibly causing
that expectation to be unmet by generating code in an exceptionally
convoluted way.

Signed-off-by: Qian Cai <cai@lca.pw>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-03-03 23:34:51 +01:00
Steven Rostedt (VMware)
5412e0b763 tracing: Remove unused TRACE_BUFFER bits
Commit 567cd4da54ff ("ring-buffer: User context bit recursion checking")
added the TRACE_BUFFER bits to be used in the current task's trace_recursion
field. But the final submission of the logic removed the use of those bits,
but never removed the bits themselves (they were never used in upstream
Linux). These can be safely removed.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-03 17:34:02 -05:00
Steven Rostedt (VMware)
b396bfdebf tracing: Have hwlat ts be first instance and record count of instances
The hwlat tracer runs a loop of width time during a given window. It then
reports the max latency over a given threshold and records a timestamp. But
this timestamp is the time after the width has finished, and not the time it
actually triggered.

Record the actual time when the latency was greater than the threshold as
well as the number of times it was greater in a given width per window.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-03 17:33:43 -05:00
Cyril Hrubis
ecc421e05b sys/sysinfo: Respect boottime inside time namespace
The sysinfo() syscall includes uptime in seconds but has no correction for
time namespaces which makes it inconsistent with the /proc/uptime inside of
a time namespace.

Add the missing time namespace adjustment call.

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dmitry Safonov <dima@arista.com>
Link: https://lkml.kernel.org/r/20200303150638.7329-1-chrubis@suse.cz
2020-03-03 19:34:32 +01:00
Andrii Nakryiko
70ed506c3b bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
reference from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.

Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS, allowing to explicitly prevent BPF program
detachment on process exit by pinning it in a BPF FS and let it open from
independent other process to keep working with it.

Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 22:06:27 -08:00
Oleg Nesterov
6fb614920b task_work_run: don't take ->pi_lock unconditionally
As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg()
if task->task_works == NULL && !PF_EXITING.

And in fact the only reason why task_work_run() needs ->pi_lock is
the possible race with task_work_cancel(), we can optimize this code
and make the locking more clear.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:06:33 -07:00
Linus Torvalds
c105df5d86 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Fix a scheduler statistics bug"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix statistics for find_idlest_group()
2020-03-02 06:51:43 -06:00
Eric W. Biederman
beb41d9cbe posix-cpu-timers: Pass the task into arm_timer()
The task has been already computed to take siglock before calling
arm_timer. So pass the benefit of that labor into arm_timer().

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/8736auvdt1.fsf@x220.int.ebiederm.org
2020-03-01 11:21:44 +01:00
Eric W. Biederman
60f2ceaa81 posix-cpu-timers: Remove unnecessary locking around cpu_clock_sample_group
As of e78c3496790e ("time, signal: Protect resource use statistics
with seqlock") cpu_clock_sample_group no longers needs siglock
protection.  Unfortunately no one realized it at the time.

Remove the extra locking that is for cpu_clock_sample_group and not
for cpu_clock_sample.  This significantly simplifies the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/878skmvdts.fsf@x220.int.ebiederm.org
2020-03-01 11:21:44 +01:00
Eric W. Biederman
a2efdbf4fc posix-cpu-timers: cpu_clock_sample_group() no longer needs siglock
As of e78c3496790e ("time, signal: Protect resource use statistics with
seqlock") cpu_clock_sample_group() no longer needs siglock protection so
remove the stale comment.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87eeuevduq.fsf@x220.int.ebiederm.org
2020-03-01 11:21:43 +01:00
David S. Miller
9f0ca0c1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-02-28

The following pull-request contains BPF updates for your *net-next* tree.

We've added 41 non-merge commits during the last 7 day(s) which contain
a total of 49 files changed, 1383 insertions(+), 499 deletions(-).

The main changes are:

1) BPF and Real-Time nicely co-exist.

2) bpftool feature improvements.

3) retrieve bpf_sk_storage via INET_DIAG.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 15:53:35 -08:00
Eric W. Biederman
af9fe6d607 pid: Improve the comment about waiting in zap_pid_ns_processes
Oleg wrote a very informative comment, but with the removal of
proc_cleanup_work it is no longer accurate.

Rewrite the comment so that it only talks about the details
that are still relevant, and hopefully is a little clearer.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-02-28 16:29:12 -06:00
Linus Torvalds
2edc78b9a4 block-5.6-2020-02-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5ZXl0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmltEACSA4yxdvWsVMYRCijjm/FzBEq7C8PSsWNK
 H8KPmjQiNpbSiZSi1uMVsHMlhBmBM8ZQ6Zc+gbZSs6xMqa4yP/iRtmzxnGonC7TB
 f5Ne2QuC0+TKMFJJTG8cCTzrgEOrWYkFKkmabzDml7HtloJtuzgArrmPzRj2sUfY
 J+d0osdp1b4U4sqhhAnxSm/zYJkGrQb+9UgNdVjhZCUzaX6oCcuK8xUwu2reLGlM
 qPkSKOywnl3WHCSCJXsCrNLKX0QWtIfMzlWDr40GYgHauPBbWfa8+1yHR1/lWP4R
 zyxGk63I9f6/+iQSUC72wP77bAVWKW674c53jgd7r1pNL9TiuK+a3E4lgf7eU+rl
 ymA/rM6Iy3SjTgiLT57PPOecsILJns3cwZ6mhvSRs0+zpao7LOQZXWdu9V0+Fyqo
 jur+7Ll/Qfdv/CLlM94DeBJtwhaTWiHTfDoaDHlG9p1/vvcWWXTUTIVPwAD+YGbj
 geio/bIWECnQxDtZL5Jikf5zsC76aQ46vvxK4F6RJlXj6jaugIbN3mWLsg17sUVf
 Y4h+IEVtQr0zA0LkPrfVdAS9IqVlTrMRDCkrrlhsDt7FI0orCOag7JOcmN2/nPn/
 2H22nl6i02b0gdGrScU5pyBswSPaImddH5tqE9uL2rK4hrFe6oKxL5EicTFDZmTh
 tHnukoc+Yg==
 =1bzv
 -----END PGP SIGNATURE-----

Merge tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Passthrough insertion fix (Ming)

 - Kill off some unused arguments (John)

 - blktrace RCU fix (Jan)

 - Dead fields removal for null_blk (Dongli)

 - NVMe polled IO fix (Bijan)

* tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
  nvme-pci: Hold cq_poll_lock while completing CQEs
  blk-mq: Remove some unused function arguments
  null_blk: remove unused fields in 'nullb_cmd'
  blktrace: Protect q->blk_trace with RCU
  blk-mq: insert passthrough request into hctx->dispatch directly
2020-02-28 11:43:30 -08:00
Eric W. Biederman
69879c01a0 proc: Remove the now unnecessary internal mount of proc
There remains no more code in the kernel using pids_ns->proc_mnt,
therefore remove it from the kernel.

The big benefit of this change is that one of the most error prone and
tricky parts of the pid namespace implementation, maintaining kernel
mounts of proc is removed.

In addition removing the unnecessary complexity of the kernel mount
fixes a regression that caused the proc mount options to be ignored.
Now that the initial mount of proc comes from userspace, those mount
options are again honored.  This fixes Android's usage of the proc
hidepid option.

Reported-by: Alistair Strachan <astrachan@google.com>
Fixes: e94591d0d90c ("proc: Convert proc_mount to use mount_ns.")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-02-28 12:06:14 -06:00
Linus Torvalds
3642859812 Power management fixes for 5.6-rc4
Fix a recent cpufreq initialization regression (Rafael Wysocki),
 revert a devfreq commit that made incompatible changes and broke
 user land on some systems (Orson Zhai), drop a stale reference to
 a document that has gone away recently (Jonathan Neuschäfer) and
 fix a typo in a hibernation code comment (Alexandre Belloni).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl5Y6MYSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxqCAP/3zmwDsBMG07c+g9kOEJzuGLkpUrGIFp
 t1vR2+EmV7bCq7onCLSBz090wvogLCmNPpXmq45Ddt5hx3ltZSx4stwYdbsS/xmU
 t+9pFiCSSFufs1vaUzZqCuw53jpizRD4KKoUbCT2kpCgH4wjF7ftRIC5y2DYI57Y
 hOIfrI9XzwR2UZYRWZqCYtwpPkMicao8lXLfhs1mrZFk+AxwwwUFSp8/Gw94LBPD
 +ESctvsrGfmqEDEvcZUaYd2i0i5PsqnbnVy6wxWb3rhmhSXJTfiCzCGECItBJuSA
 HcZj3m2dohOXDxXK/Yevwiv+fBkqeeeP+KVlvtjzmYzTDW1bryLYQpxAsdpeYGEE
 bANjAl6MARqBDKUjvw+N6yDdAPPNbsT6zEK3f0w4eQNRpfmjhHP7TxrOsjC0DRJu
 EZpN6hWGQs1Vmndr+xaK0ITeRXB+7IQeXG6t6XDYZg2gS65hkhF5z1E5vjmn1Cen
 wpQsqvek6K7/DqqLqCH27yATNNKff7PNNCfpIazPE6CjDKBAiM9jsMpWkQf4E9/7
 VKJFzmhy5P9VsHnUNonRN1K8Vl5eo+b7YOfmR8kvfsmNpfvKqaXM813owVMkOEq6
 44kc2D38t3TnlH+GNgGfyPBnHchqlw/P0y0mypQ73Wyg+PQB+fx+W9vAXWDe6fBW
 /tIhZYILu90f
 =ySea
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "Fix a recent cpufreq initialization regression (Rafael Wysocki),
  revert a devfreq commit that made incompatible changes and broke user
  land on some systems (Orson Zhai), drop a stale reference to a
  document that has gone away recently (Jonathan Neuschäfer), and fix a
  typo in a hibernation code comment (Alexandre Belloni)"

* tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Fix policy initialization for internal governor drivers
  Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"
  PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
  Documentation: power: Drop reference to interface.rst
2020-02-28 08:49:52 -08:00
Madhuparna Bhowmik
22a34c6fe0
exit: Fix Sparse errors and warnings
This patch fixes the following sparse error:
kernel/exit.c:627:25: error: incompatible types in comparison expression

And the following warning:
kernel/exit.c:626:40: warning: incorrect type in assignment

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[christian.brauner@ubuntu.com: edit commit message]
Link: https://lore.kernel.org/r/20200130062028.4870-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-02-28 13:34:39 +01:00
Madhuparna Bhowmik
0c282b068e
fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer()
Use RCU_INIT_POINTER() instead of rcu_access_pointer() in
copy_sighand().

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[christian.brauner@ubuntu.com: edit commit message]
Link: https://lore.kernel.org/r/20200127175821.10833-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-02-28 13:34:36 +01:00
Rafael J. Wysocki
189c6967fe Merge branches 'pm-sleep' and 'pm-devfreq'
* pm-sleep:
  PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
  Documentation: power: Drop reference to interface.rst

* pm-devfreq:
  Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"
2020-02-28 11:00:50 +01:00