// SPDX-License-Identifier: GPL-2.0-only
/* Copyright (c) 2017 Facebook
 */
#include <linux/bpf.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/etherdevice.h>
#include <linux/filter.h>
#include <linux/sched/signal.h>
bpf: Introduce bpf sk local storage
After allowing a bpf prog to
- directly read the skb->sk ptr
- get the fullsock bpf_sock by "bpf_sk_fullsock()"
- get the bpf_tcp_sock by "bpf_tcp_sock()"
- get the listener sock by "bpf_get_listener_sock()"
- avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
into different bpf running contexts,
this patch is another effort to make bpf's network programming
more intuitive (together with memory and performance benefits).
When a bpf prog needs to store data for a sk, the current practice is to
define a map with the usual 4-tuple (src/dst ip/port) as the key.
If multiple bpf progs need to store different sk data, multiple maps
have to be defined, which wastes memory by storing the duplicated
keys (i.e. the 4-tuples here) in each of the bpf maps.
[ The smallest key could be the sk pointer itself which requires
some enhancement in the verifier and it is a separate topic. ]
Also, the bpf prog needs to clean up the elem when sk is freed.
Otherwise, the bpf map will become full and un-usable quickly.
The sk-free tracking currently could be done during sk state
transition (e.g. BPF_SOCK_OPS_STATE_CB).
The size of the map needs to be predefined, which usually ends up
with an over-provisioned map in production. Even if the map were resizable,
sks naturally come and go, so this potential resize
operation is arguably redundant if the data can be directly connected
to the sk itself instead of proxying through a bpf map.
This patch introduces sk->sk_bpf_storage to provide local storage space
at sk for bpf prog to use. The space will be allocated when the first bpf
prog has created data for this particular sk.
The design optimizes the bpf prog's lookup (and then optionally followed by
an inline update). bpf_spin_lock should be used if the inline update needs
to be protected.
BPF_MAP_TYPE_SK_STORAGE:
-----------------------
To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can
be created to fit different bpf progs' needs. The map enforces
BTF to allow printing the sk-local-storage during a system-wise
sk dump (e.g. "ss -ta") in the future.
The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not to lookup/update/delete
"sk-local-storage" data on a particular sk.
Think of the map as a meta-data (or "type") of a "sk-local-storage". This
particular "type" of "sk-local-storage" data can then be stored in any sk.
The main purposes of this map are mostly:
1. Define the size of a "sk-local-storage" type.
2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
map-id, map-btf...etc.)
3. Keep track of all sk's storages of this "type" and clean them up
when the map is freed.
sk->sk_bpf_storage:
------------------
The main lookup/update/delete is done on sk->sk_bpf_storage (which
is a "struct bpf_sk_storage"). When doing a lookup,
the "map" pointer is now used as the "key" to search on the
sk_storage->list. The "map" pointer is actually serving
as the "type" of the "sk-local-storage" that is being
requested.
To allow very fast lookup, it should be as fast as looking up an
array at a stable-offset. At the same time, it is not ideal to
set a hard limit on the number of sk-local-storage "type" that the
system can have. Hence, this patch takes a cache approach.
The last search result from sk_storage->list is cached in
sk_storage->cache[] which is a stable sized array. Each
"sk-local-storage" type has a stable offset to the cache[] array.
In the future, a map's flag could be introduced to do cache
opt-out/enforcement if it became necessary.
The cache size is 16 (i.e. 16 types of "sk-local-storage").
Programs can share a map. On the program side, having a few bpf_progs
running in the networking hotpath is already a lot. The bpf_prog
should have already consolidated the existing sock-key-ed map usage
to minimize the map lookup penalty. 16 has enough runway to grow.
All sk-local-storage data will be removed from sk->sk_bpf_storage
during sk destruction.
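The cache-then-list lookup described above can be sketched in plain userspace C. This is a minimal illustration, not the kernel's code: the struct names, field names and the singly linked list are hypothetical stand-ins, and it assumes that two types may end up sharing one cache slot (in which case the slot simply remembers the most recent hit).

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 16	/* the cache size stated in the text above */

/* Hypothetical stand-in for the map, which acts as the storage "type". */
struct storage_type {
	unsigned int cache_idx;	/* stable slot in cache[] for this type */
};

/* Hypothetical stand-in for one elem stored on a sk. */
struct storage_elem {
	const struct storage_type *type;	/* compared as the lookup "key" */
	struct storage_elem *next;
	int data;
};

/* Hypothetical stand-in for the per-sk storage. */
struct sock_storage {
	struct storage_elem *list;		/* every elem stored on this sk */
	struct storage_elem *cache[CACHE_SIZE];	/* last hit for each slot */
};

struct storage_elem *storage_lookup(struct sock_storage *st,
				    const struct storage_type *type)
{
	struct storage_elem *e;

	assert(type->cache_idx < CACHE_SIZE);

	/* Fast path: the slot already holds an elem of the requested type. */
	e = st->cache[type->cache_idx];
	if (e && e->type == type)
		return e;

	/* Slow path: walk the list and remember the hit in the slot. */
	for (e = st->list; e; e = e->next) {
		if (e->type == type) {
			st->cache[type->cache_idx] = e;
			return e;
		}
	}
	return NULL;
}
```

With this shape a repeated lookup of the same type costs one array access and one pointer compare, while the number of types stays unbounded because a miss falls back to the list.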
bpf_sk_storage_get() and bpf_sk_storage_delete():
------------------------------------------------
Instead of using bpf_map_(lookup|update|delete)_elem(),
the bpf prog needs to use the new helper bpf_sk_storage_get() and
bpf_sk_storage_delete(). The verifier can then enforce the
ARG_PTR_TO_SOCKET argument. The bpf_sk_storage_get() also allows to
"create" new elem if one does not exist in the sk. It is done by
the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be
provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together,
it has eliminated the potential use cases for an equivalent
bpf_map_update_elem() API (for bpf_prog) in this patch.
Misc notes:
----------
1. map_get_next_key is not supported. From the userspace syscall
perspective, the map has the socket fd as the key while the map
can be shared by pinned-file or map-id.
Since btf is enforced, the existing "ss" could be enhanced to pretty
print the local-storage.
Supporting a kernel defined btf with 4 tuples as the return key could
be explored later also.
2. The sk->sk_lock cannot be acquired. Atomic operations are used instead.
e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
Please refer to the source code comments for the details in
synchronization cases and considerations.
3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.
Benchmark:
---------
Here is the benchmark data collected by turning on
the "kernel.bpf_stats_enabled" sysctl.
Two bpf progs are tested:
One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
sk ptr as the key. (The verifier is modified to support the sk ptr as the key.
That should have shortened the key lookup time.)
Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.
Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
each egress skb and then bump the cnt. netperf is used to drive
data with 4096 connected UDP sockets.
BPF_MAP_TYPE_HASH with a modified verifier (152ns per bpf run)
27: cgroup_skb name egress_sk_map tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
loaded_at 2019-04-15T13:46:39-0700 uid 0
xlated 344B jited 258B memlock 4096B map_ids 16
btf_id 5
BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
30: cgroup_skb name egress_sk_stora tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
loaded_at 2019-04-15T13:47:54-0700 uid 0
xlated 168B jited 156B memlock 4096B map_ids 17
btf_id 6
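The per-run figures follow from dividing run_time_ns by run_cnt in the two dumps above. The helper below is only an illustrative sketch (the function name is made up here). Integer division truncates, so it yields 152 and 65; the exact ratios are roughly 152.8 ns and 65.5 ns, so the quoted "152ns" and "66ns" are approximations of those values.

```c
#include <assert.h>
#include <stdint.h>

/* Average nanoseconds per program run, truncated toward zero. */
uint64_t avg_ns_per_run(uint64_t run_time_ns, uint64_t run_cnt)
{
	assert(run_cnt != 0);
	return run_time_ns / run_cnt;
}
```

For example, avg_ns_per_run(58280107540, 381347633) covers the hashmap prog and avg_ns_per_run(25617093319, 390989739) covers the sk storage prog.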
Here is a high-level picture of how the objects are organized:
       sk
    ┌──────┐
    │      │
    │      │
    │      │
    │*sk_bpf_storage─────▶ bpf_sk_storage
    └──────┘                 ┌───────┐
                 ┌───────────┤ list  │
                 │           │       │
                 │           │       │
                 │           │       │
                 │           └───────┘
                 │
                 │     elem
                 │  ┌────────┐
                 ├─▶│ snode  │
                 │  ├────────┤
                 │  │  data  │          bpf_map
                 │  ├────────┤        ┌─────────┐
                 │  │map_node│◀─┬─────┤  list   │
                 │  └────────┘  │     │         │
                 │              │     │         │
                 │     elem     │     │         │
                 │  ┌────────┐  │     └─────────┘
                 └─▶│ snode  │  │
                    ├────────┤  │
   bpf_map          │  data  │  │
 ┌─────────┐        ├────────┤  │
 │  list   ├───────▶│map_node│  │
 │         │        └────────┘  │
 │         │                    │
 │         │           elem     │
 └─────────┘        ┌────────┐  │
                 ┌─▶│ snode  │  │
                 │  ├────────┤  │
                 │  │  data  │  │
                 │  ├────────┤  │
                 │  │map_node│◀─┘
                 │  └────────┘
                 │
                 │
                 │          ┌───────┐
     sk          └──────────│ list  │
  ┌──────┐                  │       │
  │      │                  │       │
  │      │                  │       │
  │      │                  └───────┘
  │*sk_bpf_storage───────▶bpf_sk_storage
  └──────┘
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
#include <net/bpf_sk_storage.h>
#include <net/sock.h>
#include <net/tcp.h>

#define CREATE_TRACE_POINTS
#include <trace/events/bpf_test_run.h>

static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
			u32 *retval, u32 *time)
{
	struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE] = { NULL };
	enum bpf_cgroup_storage_type stype;
	u64 time_start, time_spent = 0;
	int ret = 0;
	u32 i;

	for_each_cgroup_storage_type(stype) {
		storage[stype] = bpf_cgroup_storage_alloc(prog, stype);
		if (IS_ERR(storage[stype])) {
			storage[stype] = NULL;
			for_each_cgroup_storage_type(stype)
				bpf_cgroup_storage_free(storage[stype]);
			return -ENOMEM;
		}
	}

	if (!repeat)
		repeat = 1;

	rcu_read_lock();
	preempt_disable();
	time_start = ktime_get_ns();
	for (i = 0; i < repeat; i++) {
		bpf_cgroup_storage_set(storage);
		*retval = BPF_PROG_RUN(prog, ctx);

		if (signal_pending(current)) {
			ret = -EINTR;
			break;
		}

		if (need_resched()) {
			time_spent += ktime_get_ns() - time_start;
			preempt_enable();
			rcu_read_unlock();

			cond_resched();

			rcu_read_lock();
			preempt_disable();
			time_start = ktime_get_ns();
		}
	}
	time_spent += ktime_get_ns() - time_start;
	preempt_enable();
	rcu_read_unlock();

	do_div(time_spent, repeat);
	*time = time_spent > U32_MAX ? U32_MAX : (u32)time_spent;

	for_each_cgroup_storage_type(stype)
		bpf_cgroup_storage_free(storage[stype]);

	return ret;
}

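The last step of bpf_test_run() turns the accumulated time into a per-run average and saturates it at the 32-bit maximum. A minimal userspace sketch of just that arithmetic follows; the function name is hypothetical, plain division stands in for do_div(), and UINT32_MAX stands in for U32_MAX.

```c
#include <assert.h>
#include <stdint.h>

/* Average time per run in ns, saturated to what fits in a u32. */
uint32_t avg_duration_ns(uint64_t time_spent, uint32_t repeat)
{
	/* bpf_test_run() has already turned repeat == 0 into 1 by this point. */
	assert(repeat != 0);

	time_spent /= repeat;
	return time_spent > UINT32_MAX ? UINT32_MAX : (uint32_t)time_spent;
}
```

Saturating rather than truncating means an implausibly long run reports the largest representable duration instead of wrapping around to a small one.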
static int bpf_test_finish(const union bpf_attr *kattr,
			   union bpf_attr __user *uattr, const void *data,
			   u32 size, u32 retval, u32 duration)
{
	void __user *data_out = u64_to_user_ptr(kattr->test.data_out);
	int err = -EFAULT;
	u32 copy_size = size;

	/* Clamp copy if the user has provided a size hint, but copy the full
	 * buffer if not to retain old behaviour.
	 */
	if (kattr->test.data_size_out &&
	    copy_size > kattr->test.data_size_out) {
		copy_size = kattr->test.data_size_out;
		err = -ENOSPC;
	}

	if (data_out && copy_to_user(data_out, data, copy_size))
		goto out;
	if (copy_to_user(&uattr->test.data_size_out, &size, sizeof(size)))
		goto out;
	if (copy_to_user(&uattr->test.retval, &retval, sizeof(retval)))
		goto out;
	if (copy_to_user(&uattr->test.duration, &duration, sizeof(duration)))
		goto out;
	if (err != -ENOSPC)
		err = 0;
out:
	trace_bpf_test_finish(&err);
	return err;
}

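The size-hint handling in bpf_test_finish() can be modelled in userspace as follows. This is a sketch under stated assumptions: clamp_copy() is a hypothetical name, memcpy() stands in for copy_to_user() so the -EFAULT paths are not modelled, and out_cap plays the role of test.data_size_out (0 meaning "no hint").

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy at most out_cap bytes when a hint is given, but always report the
 * full size, so the caller can tell that the output was truncated.
 */
int clamp_copy(void *out, uint32_t out_cap, const void *data, uint32_t size,
	       uint32_t *size_out)
{
	uint32_t copy_size = size;
	int err = 0;

	assert(size_out != NULL);

	if (out_cap && copy_size > out_cap) {
		copy_size = out_cap;
		err = -ENOSPC;
	}
	if (out)
		memcpy(out, data, copy_size);

	*size_out = size;	/* full size, even when the copy was clamped */
	return err;
}
```

A caller that sees -ENOSPC can read *size_out, allocate a buffer of that size, and retry.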
static void *bpf_test_init(const union bpf_attr *kattr, u32 size,
			   u32 headroom, u32 tailroom)
{
	void __user *data_in = u64_to_user_ptr(kattr->test.data_in);
	void *data;

	if (size < ETH_HLEN || size > PAGE_SIZE - headroom - tailroom)
		return ERR_PTR(-EINVAL);

	data = kzalloc(size + headroom + tailroom, GFP_USER);
	if (!data)
		return ERR_PTR(-ENOMEM);

	if (copy_from_user(data + headroom, data_in, size)) {
		kfree(data);
		return ERR_PTR(-EFAULT);
	}
	return data;
}

static void *bpf_ctx_init(const union bpf_attr *kattr, u32 max_size)
{
	void __user *data_in = u64_to_user_ptr(kattr->test.ctx_in);
	void __user *data_out = u64_to_user_ptr(kattr->test.ctx_out);
	u32 size = kattr->test.ctx_size_in;
	void *data;
	int err;

	if (!data_in && !data_out)
		return NULL;

	data = kzalloc(max_size, GFP_USER);
	if (!data)
		return ERR_PTR(-ENOMEM);

	if (data_in) {
		err = bpf_check_uarg_tail_zero(data_in, max_size, size);
		if (err) {
			kfree(data);
			return ERR_PTR(err);
		}

		size = min_t(u32, max_size, size);
		if (copy_from_user(data, data_in, size)) {
			kfree(data);
			return ERR_PTR(-EFAULT);
		}
	}
	return data;
}

static int bpf_ctx_finish(const union bpf_attr *kattr,
			  union bpf_attr __user *uattr, const void *data,
			  u32 size)
{
	void __user *data_out = u64_to_user_ptr(kattr->test.ctx_out);
	int err = -EFAULT;
	u32 copy_size = size;

	if (!data || !data_out)
		return 0;

	if (copy_size > kattr->test.ctx_size_out) {
		copy_size = kattr->test.ctx_size_out;
		err = -ENOSPC;
	}

	if (copy_to_user(data_out, data, copy_size))
		goto out;
	if (copy_to_user(&uattr->test.ctx_size_out, &size, sizeof(size)))
		goto out;
	if (err != -ENOSPC)
		err = 0;
out:
	return err;
}

/**
 * range_is_zero - test whether a range of the buffer is all zeroes
 * @buf: buffer to check
 * @from: check from this position
 * @to: check up until (excluding) this position
 *
 * This function returns true if there is no non-zero byte
 * in the buf in the range [from,to).
 */
static inline bool range_is_zero(void *buf, size_t from, size_t to)
{
	return !memchr_inv((u8 *)buf + from, 0, to - from);
}

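Because range_is_zero() negates the result of memchr_inv(), it is true exactly when every byte in [from, to) is zero. A userspace stand-in makes that contract easy to check; the function name is hypothetical and a plain loop replaces memchr_inv(), which is a kernel-only helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when every byte of buf in the half-open range [from, to) is zero. */
bool range_is_zero_user(const void *buf, size_t from, size_t to)
{
	const unsigned char *p = buf;
	size_t i;

	assert(from <= to);

	for (i = from; i < to; i++)
		if (p[i] != 0)
			return false;
	return true;	/* also the answer for an empty range */
}
```

Checks of this kind are what let a caller reject input whose unsupported fields carry data, while still accepting values in the specific fields it chooses to skip over.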
static int convert___skb_to_skb(struct sk_buff *skb, struct __sk_buff *__skb)
{
	struct qdisc_skb_cb *cb = (struct qdisc_skb_cb *)skb->cb;

	if (!__skb)
		return 0;

	/* make sure the fields we don't use are zeroed */
	if (!range_is_zero(__skb, 0, offsetof(struct __sk_buff, priority)))
		return -EINVAL;

	/* priority is allowed */

	if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
			   FIELD_SIZEOF(struct __sk_buff, priority),
			   offsetof(struct __sk_buff, cb)))
		return -EINVAL;

	/* cb is allowed */

	if (!range_is_zero(__skb, offsetof(struct __sk_buff, cb) +
			   FIELD_SIZEOF(struct __sk_buff, cb),
			   sizeof(struct __sk_buff)))
		return -EINVAL;

	skb->priority = __skb->priority;
	memcpy(&cb->data, __skb->cb, QDISC_CB_PRIV_LEN);

	return 0;
}

static void convert_skb_to___skb(struct sk_buff *skb, struct __sk_buff *__skb)
{
	struct qdisc_skb_cb *cb = (struct qdisc_skb_cb *)skb->cb;

	if (!__skb)
		return;

	__skb->priority = skb->priority;
	memcpy(__skb->cb, &cb->data, QDISC_CB_PRIV_LEN);
}

int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
			  union bpf_attr __user *uattr)
{
	bool is_l2 = false, is_direct_pkt_access = false;
	u32 size = kattr->test.data_size_in;
	u32 repeat = kattr->test.repeat;
	struct __sk_buff *ctx = NULL;
	u32 retval, duration;
	int hh_len = ETH_HLEN;
	struct sk_buff *skb;
	struct sock *sk;
	void *data;
	int ret;

	data = bpf_test_init(kattr, size, NET_SKB_PAD + NET_IP_ALIGN,
			     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
	if (IS_ERR(data))
		return PTR_ERR(data);

	ctx = bpf_ctx_init(kattr, sizeof(struct __sk_buff));
	if (IS_ERR(ctx)) {
		kfree(data);
		return PTR_ERR(ctx);
	}

	switch (prog->type) {
	case BPF_PROG_TYPE_SCHED_CLS:
	case BPF_PROG_TYPE_SCHED_ACT:
		is_l2 = true;
		/* fall through */
	case BPF_PROG_TYPE_LWT_IN:
	case BPF_PROG_TYPE_LWT_OUT:
	case BPF_PROG_TYPE_LWT_XMIT:
		is_direct_pkt_access = true;
		break;
	default:
		break;
	}

	sk = kzalloc(sizeof(struct sock), GFP_USER);
	if (!sk) {
		kfree(data);
		kfree(ctx);
		return -ENOMEM;
	}
	sock_net_set(sk, current->nsproxy->net_ns);
	sock_init_data(NULL, sk);

	skb = build_skb(data, 0);
	if (!skb) {
		kfree(data);
		kfree(ctx);
		kfree(sk);
		return -ENOMEM;
	}
	skb->sk = sk;

	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
	__skb_put(skb, size);
	skb->protocol = eth_type_trans(skb, current->nsproxy->net_ns->loopback_dev);
	skb_reset_network_header(skb);

	if (is_l2)
		__skb_push(skb, hh_len);
	if (is_direct_pkt_access)
		bpf_compute_data_pointers(skb);
	ret = convert___skb_to_skb(skb, ctx);
	if (ret)
		goto out;
	ret = bpf_test_run(prog, skb, repeat, &retval, &duration);
	if (ret)
		goto out;
	if (!is_l2) {
		if (skb_headroom(skb) < hh_len) {
			int nhead = HH_DATA_ALIGN(hh_len - skb_headroom(skb));

			if (pskb_expand_head(skb, nhead, 0, GFP_USER)) {
				ret = -ENOMEM;
				goto out;
			}
		}
		memset(__skb_push(skb, hh_len), 0, hh_len);
	}
	convert_skb_to___skb(skb, ctx);

	size = skb->len;
	/* bpf program can never convert linear skb to non-linear */
	if (WARN_ON_ONCE(skb_is_nonlinear(skb)))
		size = skb_headlen(skb);
	ret = bpf_test_finish(kattr, uattr, skb->data, size, retval, duration);
	if (!ret)
		ret = bpf_ctx_finish(kattr, uattr, ctx,
				     sizeof(struct __sk_buff));
out:
	kfree_skb(skb);
	bpf_sk_storage_free(sk);
	kfree(sk);
	kfree(ctx);
	return ret;
}

int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
			  union bpf_attr __user *uattr)
{
	u32 size = kattr->test.data_size_in;
	u32 repeat = kattr->test.repeat;
	struct netdev_rx_queue *rxqueue;
	struct xdp_buff xdp = {};
	u32 retval, duration;
	void *data;
	int ret;

	if (kattr->test.ctx_in || kattr->test.ctx_out)
		return -EINVAL;

	data = bpf_test_init(kattr, size, XDP_PACKET_HEADROOM + NET_IP_ALIGN, 0);
	if (IS_ERR(data))
		return PTR_ERR(data);

	xdp.data_hard_start = data;
	xdp.data = data + XDP_PACKET_HEADROOM + NET_IP_ALIGN;
bpf: add meta pointer for direct access
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.
xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.
The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from its
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
	xdp.data_meta = xdp.data;
	xdp.data_end = xdp.data + size;

	rxqueue = __netif_get_rx_queue(current->nsproxy->net_ns->loopback_dev, 0);
	xdp.rxq = &rxqueue->xdp_rxq;

	ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration);
	if (ret)
		goto out;
	if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN ||
	    xdp.data_end != xdp.data + size)
		size = xdp.data_end - xdp.data;
	ret = bpf_test_finish(kattr, uattr, xdp.data, size, retval, duration);
out:
	kfree(data);
	return ret;
}

int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
				     const union bpf_attr *kattr,
				     union bpf_attr __user *uattr)
{
	u32 size = kattr->test.data_size_in;
	struct bpf_flow_dissector ctx = {};
	u32 repeat = kattr->test.repeat;
	struct bpf_flow_keys flow_keys;
	u64 time_start, time_spent = 0;
	const struct ethhdr *eth;
	u32 retval, duration;
	void *data;
	int ret;
	u32 i;

	if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR)
		return -EINVAL;

	if (kattr->test.ctx_in || kattr->test.ctx_out)
		return -EINVAL;

	if (size < ETH_HLEN)
		return -EINVAL;

	data = bpf_test_init(kattr, size, 0, 0);
	if (IS_ERR(data))
		return PTR_ERR(data);

	eth = (struct ethhdr *)data;

	if (!repeat)
		repeat = 1;

	ctx.flow_keys = &flow_keys;
	ctx.data = data;
	ctx.data_end = (__u8 *)data + size;

	rcu_read_lock();
	preempt_disable();
	time_start = ktime_get_ns();
	for (i = 0; i < repeat; i++) {
		retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN,
					  size);

		if (signal_pending(current)) {
			preempt_enable();
			rcu_read_unlock();

			ret = -EINTR;
			goto out;
		}

		if (need_resched()) {
			time_spent += ktime_get_ns() - time_start;
			preempt_enable();
			rcu_read_unlock();

			cond_resched();

			rcu_read_lock();
			preempt_disable();
			time_start = ktime_get_ns();
		}
	}
	time_spent += ktime_get_ns() - time_start;
	preempt_enable();
	rcu_read_unlock();

	do_div(time_spent, repeat);
	duration = time_spent > U32_MAX ? U32_MAX : (u32)time_spent;

	ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
			      retval, duration);

out:
	kfree(data);
	return ret;
}