IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
cpu_stop_queue_work() checks stopper->enabled before it queues the
work, but ->enabled == T can only guarantee cpu_stop_signal_done()
if we race with cpu_down().
This is not enough for stop_two_cpus() or stop_machine(), they will
deadlock if multi_cpu_stop() won't be called by one of the target
CPU's. stop_machine/stop_cpus are fine, they rely on stop_cpus_mutex.
But stop_two_cpus() has to check cpu_active() to avoid the same race
with hotplug, and this check is very unobvious and probably not even
correct if we race with cpu_up().
Change cpu_down() pass to clear ->enabled before cpu_stopper_thread()
flushes the pending ->works and returns with KTHREAD_SHOULD_PARK set.
Note also that smpboot_thread_call() calls cpu_stop_unpark() which
sets enabled == T at CPU_ONLINE stage, so this CPU can't go away until
cpu_stopper_thread() is called at least once. This all means that if
cpu_stop_queue_work() succeeds, we know that work->fn() will be called.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: heiko.carstens@de.ibm.com
Link: http://lkml.kernel.org/r/20151008145131.GA18139@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
9d51426242 ("sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target")
broke select_task_rq_dl() and find_lock_later_rq(), because it introduced
a comparison between the local task's deadline and dl.earliest_dl.curr of
the remote queue.
However, if the remote runqueue does not contain any SCHED_DEADLINE
task its earliest_dl.curr is 0 (always smaller than the deadline of
the local task) and the remote runqueue is not selected for pushing.
As a result, if an application creates multiple SCHED_DEADLINE
threads, they will never be pushed to runqueues that do not already
contain SCHED_DEADLINE tasks.
This patch fixes the issue by checking if dl.dl_nr_running == 0.
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@linux.intel.com>
Fixes: 9d51426242 ("sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target")
Link: http://lkml.kernel.org/r/1444982781-15608-1-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This reverts:
8cb9764fc8 ("nohz: Set isolcpus when nohz_full is set")
We assumed that full-nohz users always want scheduler isolation on full
dynticks CPUs, therefore we included full-nohz CPUs on cpu_isolated_map.
This means that tasks run by default on CPUs outside the nohz_full range
unless their affinity is explicity overwritten.
This suits pure isolation workloads but when the machine is needed to
run common workloads, the available sets of CPUs to run common tasks
becomes reduced.
We reach an extreme case when CONFIG_NO_HZ_FULL_ALL is enabled as it
leaves only CPU 0 for non-isolation tasks, which makes people think that
their supercomputer regressed to 90's UP - which is true in a sense.
Some full-nohz users appear to be interested in running normal workloads
either before or after an isolation workload. Full-nohz isn't optimized
toward normal workloads but it's still better than UP performance.
We are reaching a limitation in kernel presets here. Lets revert this
cpu_isolated_map inclusion and let userspace do its own scheduler
isolation using cpusets or explicit affinity settings.
Reported-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1444663283-30068-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU updates from Paul E. McKenney:
- Miscellaneous fixes. (Paul E. McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier)
- Improvements to expedited grace periods. (Paul E. McKenney)
- Performance improvements to and locktorture tests for percpu-rwsem.
(Oleg Nesterov, Paul E. McKenney)
- Torture-test changes. (Paul E. McKenney, Davidlohr Bueso)
- Documentation updates. (Paul E. McKenney)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull irq/timer fixes from Thomas Gleixner:
"irq: a fix for the new hierarchical MSI interrupt handling which
unbreaks PCI=n configurations.
timers: a fix for the new hrtimer clock offset update mechanism to
ensure that the boot time offset is respected"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/msi: Do not use pci_msi_[un]mask_irq as default methods
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Increment clock_was_set_seq in timekeeping_init()
"Not" is too abstract variable name - changed to clear_filter.
Removed ftrace_match_module_records function: comparison with !* or *
not does the general code in filter_parse_regex() as it works without
mod command for
sh# echo '!*' > /sys/kernel/debug/tracing/set_ftrace_filter
Link: http://lkml.kernel.org/r/1443545176-3215-2-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
timekeeping_init() can set the wall time offset, so we need to
increment the clock_was_set_seq counter. That way hrtimers will pick
up the early offset immediately. Otherwise on a machine which does not
set wall time later in the boot process the hrtimer offset is stale at
0 and wall time timers are going to expire with a delay of 45 years.
Fixes: 868a3e915f "hrtimer: Make offset update smarter"
Reported-and-tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stefan Liebler <stli@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
When we create a generic MSI domain, that MSI_FLAG_USE_DEF_CHIP_OPS
is set, and that any of .mask or .unmask are NULL in the irq_chip
structure, we set them to pci_msi_[un]mask_irq.
This is a bad idea for at least two reasons:
- PCI_MSI might not be selected, kernel fails to build (yes, this is
legitimate, at least on arm64!)
- This may not be a PCI/MSI domain at all (platform MSI, for example)
Either way, this looks wrong. Move the overriding of mask/unmask to
the PCI counterpart, and panic is any of these two methods is not
set in the core code (they really should be present).
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Link: http://lkml.kernel.org/r/1444760085-27857-1-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently, is only called from __prog_put_rcu in the bpf_prog_release
path. Need this to call this from bpf_prog_put also to get correct
accounting.
Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Tom Herbert <tom@herbertland.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that interfaces for the major three controllers - cpu, memory, io
- are shaping up, there's no reason to have an option to force legacy
files to show up on the unified hierarchy for testing. Drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
The init sequence shouldn't fail short of bugs and even when it does
it's better to continue with the rest of initialization and we were
silently ignoring /proc/cgroups creation failure.
Drop the explicit error handling and wrap sysfs_create_mount_point(),
register_filesystem() and proc_create() with WARN_ON()s.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
pids controller is completely broken in that it uncharges when a task
exits allowing zombies to escape resource control. With the recent
updates, cgroup core now maintains cgroup association till task free
and pids controller can be fixed by uncharging on free instead of
exit.
This patch adds cgroup_subsys->free() method and update pids
controller to use it instead of ->exit() for uncharging.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
cgroup_exit() is called when a task exits and disassociates the
exiting task from its cgroups and half-attach it to the root cgroup.
This is unnecessary and undesirable.
No controller actually needs an exiting task to be disassociated with
non-root cgroups. Both cpu and perf_event controllers update the
association to the root cgroup from their exit callbacks just to keep
consistent with the cgroup core behavior.
Also, this disassociation makes it difficult to track resources held
by zombies or determine where the zombies came from. Currently, pids
controller is completely broken as it uncharges on exit and zombies
always escape the resource restriction. With cgroup association being
reset on exit, fixing it is pretty painful.
There's no reason to reset cgroup membership on exit. The zombie can
be removed from its css_set so that it doesn't show up on
"cgroup.procs" and thus can't be migrated or interfere with cgroup
removal. It can still pin and point to the css_set so that its cgroup
membership is maintained. This patch makes cgroup core keep zombies
associated with their cgroups at the time of exit.
* Previous patches decoupled populated_cnt tracking from css_set
lifetime, so a dying task can be simply unlinked from its css_set
while pinning and pointing to the css_set. This keeps css_set
association from task side alive while hiding it from "cgroup.procs"
and populated_cnt tracking. The css_set reference is dropped when
the task_struct is freed.
* ->exit() callback no longer needs the css arguments as the
associated css never changes once PF_EXITING is set. Removed.
* cpu and perf_events controllers no longer need ->exit() callbacks.
There's no reason to explicitly switch away on exit. The final
schedule out is enough. The callbacks are removed.
* On traditional hierarchies, nothing changes. "/proc/PID/cgroup"
still reports "/" for all zombies. On the default hierarchy,
"/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
to at the time of exit. If the cgroup gets removed before the task
is reaped, " (deleted)" is appended.
v2: Build brekage due to missing dummy cgroup_free() when
!CONFIG_CGROUP fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
css_set_rwsem is the inner lock protecting css_sets and is accessed
from hot paths such as fork and exit. Internally, it has no reason to
be a rwsem or even mutex. There are no internal blocking operations
while holding it. This was rwsem because css task iteration used to
expose it to external iterator users. As the previous patch updated
css task iteration such that the locking is not leaked to its users,
there's no reason to keep it a rwsem.
This patch converts css_set_rwsem to a spinlock and rename it to
css_set_lock. It uses bh-safe operations as a planned usage needs to
access it from RCU callback context.
Signed-off-by: Tejun Heo <tj@kernel.org>
css_sets are synchronized through css_set_rwsem but the locking scheme
is kinda bizarre. The hot paths - fork and exit - have to write lock
the rwsem making the rw part pointless; furthermore, many readers
already hold cgroup_mutex.
One of the readers is css task iteration. It read locks the rwsem
over the entire duration of iteration. This leads to silly locking
behavior. When cpuset tries to migrate processes of a cgroup to a
different NUMA node, css_set_rwsem is held across the entire migration
attempt which can take a long time locking out forking, exiting and
other cgroup operations.
This patch updates css task iteration so that it locks css_set_rwsem
only while the iterator is being advanced. css task iteration
involves two levels - css_set and task iteration. As css_sets in use
are practically immutable, simply pinning the current one is enough
for resuming iteration afterwards. Task iteration is tricky as tasks
may leave their css_set while iteration is in progress. This is
solved by keeping track of active iterators and advancing them if
their next task leaves its css_set.
v2: put_task_struct() in css_task_iter_next() moved outside
css_set_rwsem. A later patch will add cgroup operations to
task_struct free path which may grab the same lock and this avoids
deadlock possibilities.
css_set_move_task() updated to use list_for_each_entry_safe() when
walking task_iters and advancing them. This is necessary as
advancing an iter may remove it from the list.
Signed-off-by: Tejun Heo <tj@kernel.org>
* Rename css_advance_task_iter() to css_task_iter_advance_css_set()
and make it clear it->task_pos too at the end of the iteration.
* Factor out css_task_iter_advance() from css_task_iter_next(). The
new function whines if called on a terminated iterator.
Except for the termination check, this is pure reorganization and
doesn't introduce any behavior changes. This will help the planned
locking update for css_task_iter.
Signed-off-by: Tejun Heo <tj@kernel.org>
A task is associated and disassociated with its css_set in three
places - during migration, after a new task is created and when a task
exits. The first is handled by cgroup_task_migrate() and the latter
two are open-coded.
These are similar operations and spreading them over multiple places
makes it harder to follow and update. This patch collects all task
css_set [dis]association operations into css_set_move_task().
While css_set_move_task() may check whether populated state needs to
be updated when not strictly necessary, the behavior is essentially
equivalent before and after this patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
css task iteration will be updated to not leak cgroup internal locking
to iterator users. In preparation, update css_set and task lists to
be in chronological order.
For tasks, as migration path is already using list_splice_tail_init(),
only cgroup_enable_task_cg_lists() and cgroup_post_fork() need
updating. For css_sets, link_css_set() is the only place which needs
to be updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_destroy_locked() currently tests whether any css_sets are
associated to reject removal if the cgroup contains tasks. This works
because a css_set's refcnt converges with the number of tasks linked
to it and thus there's no css_set linked to a cgroup if it doesn't
have any live tasks.
To help tracking resource usage of zombie tasks, putting the ref of
css_set will be separated from disassociating the task from the
css_set which means that a cgroup may have css_sets linked to it even
when it doesn't have any live tasks.
This patch updates cgroup_destroy_locked() so that it tests
cgroup_is_populated(), which counts the number of populated css_sets,
instead of whether cgrp->cset_links is empty to determine whether the
cgroup is populated or not. This ensures that rmdirs won't be
incorrectly rejected for cgroups which only contain zombie tasks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, css_sets don't pin the associated cgroups. This is okay as
a cgroup with css_sets associated are not allowed to be removed;
however, to help resource tracking for zombie tasks, this is scheduled
to change such that a cgroup can be removed even when it has css_sets
associated as long as none of them are populated.
To ensure that a cgroup doesn't go away while css_sets are still
associated with it, make each associated css_set hold a reference on
the cgroup if non-root.
v2: Root cgroups are special and shouldn't be ref'd by css_sets.
Signed-off-by: Tejun Heo <tj@kernel.org>
Relocate cgroup_get(), cgroup_tryget() and cgroup_put() upwards. This
is pure code reorganization to prepare for future changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
To trigger release agent when the last task leaves the cgroup,
check_for_release() is called from put_css_set_locked(); however,
css_set being unlinked is being decoupled from task leaving the cgroup
and the correct condition to test is cgroup->nr_populated dropping to
zero which check_for_release() is already updated to test.
This patch moves check_for_release() invocation from
put_css_set_locked() to cgroup_update_populated().
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, cgroup_has_tasks() tests whether the target cgroup has any
css_set linked to it. This works because a css_set's refcnt converges
with the number of tasks linked to it and thus there's no css_set
linked to a cgroup if it doesn't have any live tasks.
To help tracking resource usage of zombie tasks, putting the ref of
css_set will be separated from disassociating the task from the
css_set which means that a cgroup may have css_sets linked to it even
when it doesn't have any live tasks.
This patch replaces cgroup_has_tasks() with cgroup_is_populated()
which tests cgroup->nr_populated instead which locally counts the
number of populated css_sets. Unlike cgroup_has_tasks(),
cgroup_is_populated() is recursive - if any of the descendants is
populated, the cgroup is populated too. While this changes the
meaning of the test, all the existing users are okay with the change.
While at it, replace the open-coded ->populated_cnt test in
cgroup_events_show() with cgroup_is_populated().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Currently, cgroup->nr_populated counts whether the cgroup has any
css_sets linked to it and the number of children which has non-zero
->nr_populated. This works because a css_set's refcnt converges with
the number of tasks linked to it and thus there's no css_set linked to
a cgroup if it doesn't have any live tasks.
To help tracking resource usage of zombie tasks, putting the ref of
css_set will be separated from disassociating the task from the
css_set which means that a cgroup may have css_sets linked to it even
when it doesn't have any live tasks.
This patch updates cgroup->nr_populated so that for the cgroup itself
it counts the number of css_sets which have tasks associated with them
so that empty css_sets don't skew the populated test.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull workqueue fixlet from Tejun Heo:
"Single patch to make delayed work always be queued on the local CPU"
This is not actually something we should guarantee, but it's something
we by accident have historically done, and at least one call site has
grown to depend on it.
I'm going to fix that known broken callsite, but in the meantime this
makes the accidental behavior be explicit, just in case there are other
cases that might depend on it.
* 'for-4.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: make sure delayed work run in local cpu
It was found while running a database workload on large systems that
significant time was spent trying to acquire the sighand lock.
The issue was that whenever an itimer expired, many threads ended up
simultaneously trying to send the signal. Most of the time, nothing
happened after acquiring the sighand lock because another thread
had just already sent the signal and updated the "next expire" time.
The fastpath_timer_check() didn't help much since the "next expire"
time was updated after the threads exit fastpath_timer_check().
This patch addresses this by having the thread_group_cputimer structure
maintain a boolean to signify when a thread in the group is already
checking for process wide timers, and adds extra logic in the fastpath
to check the boolean.
Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: George Spelvin <linux@horizon.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: hideaki.kimura@hpe.com
Cc: terry.rudd@hpe.com
Cc: scott.norton@hpe.com
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1444849677-29330-5-git-send-email-jason.low2@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
By now there isn't any subcommand for mod.
Before:
sh$ echo '*:mod:ipv6:a' > set_ftrace_filter
sh$ echo '*:mod:ipv6' > set_ftrace_filter
had the same results, but now first will result in:
sh$ echo '*:mod:ipv6:a' > set_ftrace_filter
-bash: echo: write error: Invalid argument
Also, I clarified ftrace_mod_callback code a little.
Link: http://lkml.kernel.org/r/1443545176-3215-1-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
[ converted 'if (ret == 0)' to 'if (!ret)' ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are quite a few cases in which device drivers, bus types or
even the PM core itself may benefit from knowing whether or not
the platform firmware will be involved in the upcoming system power
transition (during system suspend) or whether or not it was involved
in it (during system resume).
For this reason, introduce global system suspend flags that can be
used by the platform code to expose that information for the benefit
of the other parts of the kernel and make the ACPI core set them
as appropriate.
Users of the new flags will be added later.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
since eBPF programs and maps use kernel memory consider it 'locked' memory
from user accounting point of view and charge it against RLIMIT_MEMLOCK limit.
This limit is typically set to 64Kbytes by distros, so almost all
bpf+tracing programs would need to increase it, since they use maps,
but kernel charges maximum map size upfront.
For example the hash map of 1024 elements will be charged as 64Kbyte.
It's inconvenient for current users and changes current behavior for root,
but probably worth doing to be consistent root vs non-root.
Similar accounting logic is done by mmap of perf_event.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to let unprivileged users load and execute eBPF programs
teach verifier to prevent pointer leaks.
Verifier will prevent
- any arithmetic on pointers
(except R10+Imm which is used to compute stack addresses)
- comparison of pointers
(except if (map_value_ptr == 0) ... )
- passing pointers to helper functions
- indirectly passing pointers in stack to helper functions
- returning pointer from bpf program
- storing pointers into ctx or maps
Spill/fill of pointers into stack is allowed, but mangling
of pointers stored in the stack or reading them byte by byte is not.
Within bpf programs the pointers do exist, since programs need to
be able to access maps, pass skb pointer to LD_ABS insns, etc
but programs cannot pass such pointer values to the outside
or obfuscate them.
Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
so that socket filters (tcpdump), af_packet (quic acceleration)
and future kcm can use it.
tracing and tc cls/act program types still require root permissions,
since tracing actually needs to be able to see all kernel pointers
and tc is for root only.
For example, the following unprivileged socket filter program is allowed:
int bpf_prog1(struct __sk_buff *skb)
{
u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
u64 *value = bpf_map_lookup_elem(&my_map, &index);
if (value)
*value += skb->len;
return 0;
}
but the following program is not:
int bpf_prog1(struct __sk_buff *skb)
{
u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
u64 *value = bpf_map_lookup_elem(&my_map, &index);
if (value)
*value += (u64) skb;
return 0;
}
since it would leak the kernel address into the map.
Unprivileged socket filter bpf programs have access to the
following helper functions:
- map lookup/update/delete (but they cannot store kernel pointers into them)
- get_random (it's already exposed to unprivileged user space)
- get_smp_processor_id
- tail_call into another socket filter program
- ktime_get_ns
The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
This toggle defaults to off (0), but can be set true (1). Once true,
bpf programs and maps cannot be accessed from unprivileged process,
and the toggle cannot be set back to false.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>