// SPDX-License-Identifier: GPL-2.0-or-later
/*
 *  TUN - Universal TUN/TAP device driver.
 *  Copyright (C) 1999-2002 Maxim Krasnyansky <maxk@qualcomm.com>
 *
 *  $Id: tun.c,v 1.15 2002/03/01 02:44:24 maxk Exp $
 */

/*
 *  Changes:
 *
 *  Mike Kershaw <dragorn@kismetwireless.net> 2005/08/14
 *    Add TUNSETLINK ioctl to set the link encapsulation
 *
 *  Mark Smith <markzzzsmith@yahoo.com.au>
 *    Use eth_random_addr() for tap MAC address.
 *
 *  Harald Roelle <harald.roelle@ifi.lmu.de>  2004/04/20
 *    Fixes in packet dropping, queue length setting and queue wakeup.
 *    Increased default tx queue length.
 *    Added ethtool API.
 *    Minor cleanups
 *
 *  Daniel Podlejski <underley@underley.eu.org>
 *    Modifications for 2.3.99-pre5 kernel.
 */

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#define DRV_NAME	"tun"
#define DRV_VERSION	"1.6"
#define DRV_DESCRIPTION	"Universal TUN/TAP device driver"
#define DRV_COPYRIGHT	"(C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>"

#include <linux/module.h>
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/sched/signal.h>
#include <linux/major.h>
#include <linux/slab.h>
#include <linux/poll.h>
#include <linux/fcntl.h>
#include <linux/init.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/miscdevice.h>
#include <linux/ethtool.h>
#include <linux/rtnetlink.h>
#include <linux/compat.h>
#include <linux/if.h>
#include <linux/if_arp.h>
#include <linux/if_ether.h>
#include <linux/if_tun.h>
#include <linux/if_vlan.h>
#include <linux/crc32.h>
#include <linux/nsproxy.h>
#include <linux/virtio_net.h>
#include <linux/rcupdate.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>
#include <net/rtnetlink.h>
#include <net/sock.h>
#include <net/xdp.h>
#include <net/ip_tunnels.h>
#include <linux/seq_file.h>
#include <linux/uio.h>
#include <linux/skb_array.h>
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
#include <linux/mutex.h>
#include <linux/ieee802154.h>
#include <linux/if_ltalk.h>
#include <uapi/linux/if_fddi.h>
#include <uapi/linux/if_hippi.h>
#include <uapi/linux/if_fc.h>
#include <net/ax25.h>
#include <net/rose.h>
#include <net/6lowpan.h>

#include <linux/uaccess.h>
#include <linux/proc_fs.h>

static void tun_default_link_ksettings(struct net_device *dev,
				       struct ethtool_link_ksettings *cmd);

#define TUN_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)

/* TUN device flags */

/* IFF_ATTACH_QUEUE is never stored in device flags,
 * overload it to mean fasync when stored there.
 */
#define TUN_FASYNC	IFF_ATTACH_QUEUE

/* High bits in flags field are unused. */
#define TUN_VNET_LE	0x80000000
#define TUN_VNET_BE	0x40000000

#define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
		      IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)

#define GOODCOPY_LEN 128

#define FLT_EXACT_COUNT 8
struct tap_filter {
	unsigned int count;	/* Number of addrs. Zero means disabled */
	u32 mask[2];		/* Mask of the hashed addrs */
	unsigned char addr[FLT_EXACT_COUNT][ETH_ALEN];
};

/* MAX_TAP_QUEUES 256 is chosen to allow rx/tx queues to be equal
 * to max number of VCPUs in guest. */
#define MAX_TAP_QUEUES 256
#define MAX_TAP_FLOWS  4096

#define TUN_FLOW_EXPIRE (3 * HZ)

/* A tun_file connects an open character device to a tuntap netdevice. It
 * also contains all socket related structures (except sock_fprog and tap_filter)
 * to serve as one transmit queue for tuntap device. The sock_fprog and
 * tap_filter were kept in tun_struct since they were used for filtering for the
 * netdevice not for a specific queue (at least I didn't see the requirement for
 * this).
 *
 * RCU usage:
 * The tun_file and tun_struct are loosely coupled, the pointer from one to the
 * other can only be read while rcu_read_lock or rtnl_lock is held.
 */
struct tun_file {
	struct sock sk;
	struct socket socket;
	struct tun_struct __rcu *tun;
	struct fasync_struct *fasync;
	/* only used for fasync */
	unsigned int flags;
	union {
		u16 queue_index;
		unsigned int ifindex;
	};
	struct napi_struct napi;
	bool napi_enabled;
	bool napi_frags_enabled;
	struct mutex napi_mutex;	/* Protects access to the above napi */
	struct list_head next;
	struct tun_struct *detached;
	struct ptr_ring tx_ring;
	struct xdp_rxq_info xdp_rxq;
};

struct tun_page {
	struct page *page;
	int count;
};

struct tun_flow_entry {
	struct hlist_node hash_link;
	struct rcu_head rcu;
	struct tun_struct *tun;

	u32 rxhash;
	u32 rps_rxhash;
	int queue_index;
	unsigned long updated ____cacheline_aligned_in_smp;
};

#define TUN_NUM_FLOW_ENTRIES 1024
#define TUN_MASK_FLOW_ENTRIES (TUN_NUM_FLOW_ENTRIES - 1)

struct tun_prog {
	struct rcu_head rcu;
	struct bpf_prog *prog;
};

/* Since the socket was moved to tun_file, to preserve the behavior of a
 * persistent device, the socket filter, sndbuf and vnet header size are
 * restored when the file is attached to a persistent device.
 */
struct tun_struct {
	struct tun_file __rcu *tfiles[MAX_TAP_QUEUES];
	unsigned int numqueues;
	unsigned int flags;
	kuid_t owner;
	kgid_t group;

	struct net_device *dev;
	netdev_features_t set_features;
#define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
			  NETIF_F_TSO6 | NETIF_F_GSO_UDP_L4)

	int align;
	int vnet_hdr_sz;
	int sndbuf;
	struct tap_filter txflt;
	struct sock_fprog fprog;
	/* protected by rtnl lock */
	bool filter_attached;
	u32 msg_enable;
	spinlock_t lock;
	struct hlist_head flows[TUN_NUM_FLOW_ENTRIES];
	struct timer_list flow_gc_timer;
	unsigned long ageing_time;
	unsigned int numdisabled;
	struct list_head disabled;
	void *security;
	u32 flow_count;
	u32 rx_batched;
	atomic_long_t rx_frame_errors;
	struct bpf_prog __rcu *xdp_prog;
	struct tun_prog __rcu *steering_prog;
	struct tun_prog __rcu *filter_prog;
	struct ethtool_link_ksettings link_ksettings;
	/* init args */
	struct file *file;
	struct ifreq *ifr;
};

struct veth {
	__be16 h_vlan_proto;
	__be16 h_vlan_TCI;
};

static void tun_flow_init(struct tun_struct *tun);
static void tun_flow_uninit(struct tun_struct *tun);

static int tun_napi_receive(struct napi_struct *napi, int budget)
{
	struct tun_file *tfile = container_of(napi, struct tun_file, napi);
	struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
	struct sk_buff_head process_queue;
	struct sk_buff *skb;
	int received = 0;

	__skb_queue_head_init(&process_queue);

	spin_lock(&queue->lock);
	skb_queue_splice_tail_init(queue, &process_queue);
	spin_unlock(&queue->lock);

	while (received < budget && (skb = __skb_dequeue(&process_queue))) {
		napi_gro_receive(napi, skb);
		++received;
	}

	if (!skb_queue_empty(&process_queue)) {
		spin_lock(&queue->lock);
		skb_queue_splice(&process_queue, queue);
		spin_unlock(&queue->lock);
	}

	return received;
}

static int tun_napi_poll(struct napi_struct *napi, int budget)
{
	unsigned int received;

	received = tun_napi_receive(napi, budget);

	if (received < budget)
		napi_complete_done(napi, received);

	return received;
}
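
/*
 * Illustration only, not part of the driver: a minimal userspace sketch of
 * the budgeted drain pattern that tun_napi_receive() follows. The whole
 * shared queue is moved to a private list, at most "budget" packets are
 * handled, and any leftovers go back to the front of the shared queue so
 * that ordering is preserved. The names struct pkt, struct pkt_queue,
 * drain_with_budget() and record_pkt() are hypothetical stand-ins for
 * sk_buff, sk_buff_head and the NAPI calls. Locking is left out, so this
 * is a single-threaded approximation rather than the driver's logic.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for sk_buff and sk_buff_head. */
struct pkt {
	struct pkt *next;
	int id;
};

struct pkt_queue {
	struct pkt *head;
	struct pkt *tail;
};

static void pkt_queue_init(struct pkt_queue *q)
{
	q->head = q->tail = NULL;
}

/* Append one packet to the tail of the queue. */
static void pkt_enqueue(struct pkt_queue *q, struct pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
}

/* Remove and return the packet at the head, or NULL if the queue is empty. */
static struct pkt *pkt_dequeue(struct pkt_queue *q)
{
	struct pkt *p = q->head;

	if (p) {
		q->head = p->next;
		if (!q->head)
			q->tail = NULL;
		p->next = NULL;
	}
	return p;
}

/* Move everything from src to the front of dst, leaving src empty. */
static void pkt_splice_head(struct pkt_queue *src, struct pkt_queue *dst)
{
	if (!src->head)
		return;
	src->tail->next = dst->head;
	if (!dst->tail)
		dst->tail = src->tail;
	dst->head = src->head;
	pkt_queue_init(src);
}

/* Example handler: remembers the id of the last packet it saw. */
static int last_handled_id;

static void record_pkt(struct pkt *p)
{
	last_handled_id = p->id;
}

/* Handle at most budget packets and return how many were handled. */
static int drain_with_budget(struct pkt_queue *shared, int budget,
			     void (*handle)(struct pkt *))
{
	struct pkt_queue local;
	struct pkt *p;
	int done = 0;

	assert(budget >= 0);
	pkt_queue_init(&local);
	pkt_splice_head(shared, &local);	/* take the whole backlog */

	while (done < budget && (p = pkt_dequeue(&local))) {
		handle(p);
		++done;
	}

	pkt_splice_head(&local, shared);	/* requeue leftovers in order */
	return done;
}
```

/*
 * A caller that gets back a value smaller than the budget knows the backlog
 * was fully drained, which is the condition tun_napi_poll() uses before
 * calling napi_complete_done().
 */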

static void tun_napi_init(struct tun_struct *tun, struct tun_file *tfile,
			  bool napi_en, bool napi_frags)
{
	tfile->napi_enabled = napi_en;
	tfile->napi_frags_enabled = napi_en && napi_frags;
	if (napi_en) {
		netif_napi_add_tx(tun->dev, &tfile->napi, tun_napi_poll);
		napi_enable(&tfile->napi);
	}
}

static void tun_napi_enable(struct tun_file *tfile)
{
	if (tfile->napi_enabled)
		napi_enable(&tfile->napi);
}

static void tun_napi_disable(struct tun_file *tfile)
{
	if (tfile->napi_enabled)
		napi_disable(&tfile->napi);
}

static void tun_napi_del(struct tun_file *tfile)
{
	if (tfile->napi_enabled)
		netif_napi_del(&tfile->napi);
}

static bool tun_napi_frags_enabled(const struct tun_file *tfile)
{
	return tfile->napi_frags_enabled;
}

#ifdef CONFIG_TUN_VNET_CROSS_LE
static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
{
	return tun->flags & TUN_VNET_BE ? false :
		virtio_legacy_is_little_endian();
}

static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
{
	int be = !!(tun->flags & TUN_VNET_BE);

	if (put_user(be, argp))
		return -EFAULT;

	return 0;
}

static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
{
	int be;

	if (get_user(be, argp))
		return -EFAULT;

	if (be)
		tun->flags |= TUN_VNET_BE;
	else
		tun->flags &= ~TUN_VNET_BE;

	return 0;
}
#else
static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
{
	return virtio_legacy_is_little_endian();
}

static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
{
	return -EINVAL;
}

static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
{
	return -EINVAL;
}
#endif /* CONFIG_TUN_VNET_CROSS_LE */

static inline bool tun_is_little_endian(struct tun_struct *tun)
{
	return tun->flags & TUN_VNET_LE ||
		tun_legacy_is_little_endian(tun);
}

static inline u16 tun16_to_cpu(struct tun_struct *tun, __virtio16 val)
{
	return __virtio16_to_cpu(tun_is_little_endian(tun), val);
}

static inline __virtio16 cpu_to_tun16(struct tun_struct *tun, u16 val)
{
	return __cpu_to_virtio16(tun_is_little_endian(tun), val);
}
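
/*
 * Illustration only, not part of the driver: a userspace sketch of how the
 * two endianness flags might combine when a 16-bit vnet header field is
 * decoded, mirroring the CONFIG_TUN_VNET_CROSS_LE variant above. The
 * SKETCH_* names and sketch_* functions are hypothetical; the flag values
 * are copied from TUN_VNET_LE and TUN_VNET_BE. Working on raw bytes instead
 * of __virtio16 is an assumption made to keep the example self-contained.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_VNET_LE 0x80000000u	/* same bit as TUN_VNET_LE */
#define SKETCH_VNET_BE 0x40000000u	/* same bit as TUN_VNET_BE */

/* Detect host byte order at run time. */
static bool sketch_host_is_little_endian(void)
{
	const uint16_t probe = 1;

	return *(const uint8_t *)&probe == 1;
}

/*
 * The LE flag always wins. Otherwise the legacy rule applies: big-endian
 * if the BE flag is set, host byte order if it is not.
 */
static bool sketch_is_little_endian(uint32_t flags)
{
	bool legacy_le = (flags & SKETCH_VNET_BE) ?
		false : sketch_host_is_little_endian();

	return (flags & SKETCH_VNET_LE) || legacy_le;
}

/* Decode a 16-bit header field from its two bytes as laid out in memory. */
static uint16_t sketch_field16_to_cpu(uint32_t flags, const uint8_t raw[2])
{
	assert(raw != NULL);
	if (sketch_is_little_endian(flags))
		return (uint16_t)(raw[0] | (raw[1] << 8));
	return (uint16_t)((raw[0] << 8) | raw[1]);
}

/* Encode a CPU value into the two bytes of a header field. */
static void sketch_cpu_to_field16(uint32_t flags, uint16_t val, uint8_t raw[2])
{
	assert(raw != NULL);
	if (sketch_is_little_endian(flags)) {
		raw[0] = (uint8_t)(val & 0xff);
		raw[1] = (uint8_t)(val >> 8);
	} else {
		raw[0] = (uint8_t)(val >> 8);
		raw[1] = (uint8_t)(val & 0xff);
	}
}
```

/*
 * With neither flag set the result depends on the host, which is why the
 * sketch only promises a fixed layout when one of the flags is present.
 */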

static inline u32 tun_hashfn(u32 rxhash)
{
	return rxhash & TUN_MASK_FLOW_ENTRIES;
}
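
/*
 * Illustration only, not part of the driver: tun_hashfn() reduces a 32-bit
 * hash to a bucket index with a bit mask, which matches a modulo only
 * because the table size is a power of two. The sketch below shows that
 * relationship in userspace; SKETCH_NUM_BUCKETS, SKETCH_BUCKET_MASK and
 * sketch_bucket() are hypothetical names, with the size copied from
 * TUN_NUM_FLOW_ENTRIES.
 */

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NUM_BUCKETS 1024u
#define SKETCH_BUCKET_MASK (SKETCH_NUM_BUCKETS - 1u)

/* The mask trick is only valid for a power-of-two bucket count. */
static_assert((SKETCH_NUM_BUCKETS & SKETCH_BUCKET_MASK) == 0,
	      "bucket count must be a power of two");

/* Keep the low 10 bits of the hash as the bucket index. */
static uint32_t sketch_bucket(uint32_t rxhash)
{
	return rxhash & SKETCH_BUCKET_MASK;
}
```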

static struct tun_flow_entry *tun_flow_find(struct hlist_head *head, u32 rxhash)
{
	struct tun_flow_entry *e;

	hlist_for_each_entry_rcu(e, head, hash_link) {
		if (e->rxhash == rxhash)
			return e;
	}
	return NULL;
}

static struct tun_flow_entry *tun_flow_create(struct tun_struct *tun,
					      struct hlist_head *head,
					      u32 rxhash, u16 queue_index)
{
	struct tun_flow_entry *e = kmalloc(sizeof(*e), GFP_ATOMIC);

	if (e) {
		netif_info(tun, tx_queued, tun->dev,
			   "create flow: hash %u index %u\n",
			   rxhash, queue_index);
		e->updated = jiffies;
		e->rxhash = rxhash;
		e->rps_rxhash = 0;
		e->queue_index = queue_index;
		e->tun = tun;
		hlist_add_head_rcu(&e->hash_link, head);
		++tun->flow_count;
	}
	return e;
}
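
/*
 * Illustration only, not part of the driver: a userspace sketch of the
 * find-or-create pattern that tun_flow_find() and tun_flow_create() support,
 * using a chained hash table where new entries go to the head of their
 * bucket. All sketch_* names are hypothetical. RCU, the spinlock, the
 * timestamps and the flow limit are deliberately left out, and malloc()
 * stands in for kmalloc(), so this shows the data structure only and not
 * the driver's concurrency rules.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_TABLE_SIZE 1024u	/* power of two, as in the driver */

struct sketch_flow {
	struct sketch_flow *next;
	uint32_t rxhash;
	uint16_t queue_index;
};

struct sketch_table {
	struct sketch_flow *buckets[SKETCH_TABLE_SIZE];
	uint32_t flow_count;
};

/* Walk one bucket chain looking for an exact hash match. */
static struct sketch_flow *sketch_flow_find(struct sketch_flow *head,
					    uint32_t rxhash)
{
	struct sketch_flow *e;

	for (e = head; e; e = e->next)
		if (e->rxhash == rxhash)
			return e;
	return NULL;
}

/*
 * Return the existing entry for rxhash, or allocate one, link it at the
 * head of its bucket and bump the counter. Returns NULL if allocation fails.
 */
static struct sketch_flow *sketch_flow_get(struct sketch_table *t,
					   uint32_t rxhash,
					   uint16_t queue_index)
{
	struct sketch_flow **head;
	struct sketch_flow *e;

	assert(t != NULL);
	head = &t->buckets[rxhash & (SKETCH_TABLE_SIZE - 1u)];
	e = sketch_flow_find(*head, rxhash);
	if (e)
		return e;

	e = malloc(sizeof(*e));
	if (!e)
		return NULL;

	e->rxhash = rxhash;
	e->queue_index = queue_index;
	e->next = *head;
	*head = e;
	++t->flow_count;
	return e;
}
```

/*
 * Two hashes that differ only above the low 10 bits land in the same bucket
 * and are told apart by the full rxhash comparison in the lookup.
 */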

static void tun_flow_delete(struct tun_struct *tun, struct tun_flow_entry *e)
{
	netif_info(tun, tx_queued, tun->dev, "delete flow: hash %u index %u\n",
		   e->rxhash, e->queue_index);
	hlist_del_rcu(&e->hash_link);
	kfree_rcu(e, rcu);
	--tun->flow_count;
}

static void tun_flow_flush(struct tun_struct *tun)
{
	int i;

	spin_lock_bh(&tun->lock);
	for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) {
		struct tun_flow_entry *e;
		struct hlist_node *n;

		hlist_for_each_entry_safe(e, n, &tun->flows[i], hash_link)
			tun_flow_delete(tun, e);
	}
	spin_unlock_bh(&tun->lock);
}

static void tun_flow_delete_by_queue(struct tun_struct *tun, u16 queue_index)
{
	int i;

	spin_lock_bh(&tun->lock);
	for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) {
		struct tun_flow_entry *e;
		struct hlist_node *n;

2013-02-27 17:06:00 -08:00
		hlist_for_each_entry_safe(e, n, &tun->flows[i], hash_link) {
2012-10-31 19:46:02 +00:00
			if (e->queue_index == queue_index)
				tun_flow_delete(tun, e);
		}
	}
	spin_unlock_bh(&tun->lock);
}
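The loop above deletes entries while walking the bucket, which is why it uses the `_safe` iterator with a spare cursor `n`. A minimal userspace sketch of the same idea follows. The `struct node`, `push()` and `delete_by_queue()` names are hypothetical stand-ins on a plain singly linked list, not the kernel's hlist API.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list node; the kernel uses struct hlist_node instead. */
struct node {
	struct node *next;
	unsigned int queue_index;
};

/* Prepend a node; returns the new head (or the old head if malloc fails). */
static struct node *push(struct node *head, unsigned int queue_index)
{
	struct node *e = malloc(sizeof(*e));

	if (!e)
		return head;
	e->next = head;
	e->queue_index = queue_index;
	return e;
}

/*
 * Free every node whose queue_index matches. The successor is saved in n
 * before e can be freed, mirroring the extra cursor that
 * hlist_for_each_entry_safe() takes. Returns the number of nodes removed.
 */
static int delete_by_queue(struct node **head, unsigned int queue_index)
{
	struct node **link = head;	/* slot that points at the current node */
	struct node *e, *n;
	int removed = 0;

	for (e = *head; e; e = n) {
		n = e->next;		/* read before e may be freed */
		if (e->queue_index == queue_index) {
			*link = n;	/* unlink e */
			free(e);
			removed++;
		} else {
			link = &e->next;
		}
	}
	return removed;
}
```

Calling `delete_by_queue(&head, 1)` on a list built with `push()` removes every node for queue 1 and leaves the rest linked. A non-safe iterator would read `e->next` after `e` had already been freed.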
treewide: setup_timer() -> timer_setup()
This converts all remaining cases of the old setup_timer() API into using
timer_setup(), where the callback argument is the structure already
holding the struct timer_list. These should have no behavioral changes,
since they just change which pointer is passed into the callback with
the same available pointers after conversion. It handles the following
examples, in addition to some other variations.
Casting from unsigned long:
    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, ptr);
and forced object casts:
    void my_callback(struct something *ptr)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);
become:
    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);
Direct function assignments:
    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    ptr->my_timer.function = my_callback;
have a temporary cast added, along with converting the args:
    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;
And finally, callbacks without a data assignment:
    void my_callback(unsigned long data)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, 0);
have their argument renamed to verify they're unused during conversion:
    void my_callback(struct timer_list *unused)
    {
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);
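The converted callbacks above recover their owning structure from the timer pointer rather than from a casted data word. A minimal userspace sketch of that pattern follows. `struct fake_timer`, `fake_timer_setup()`, `fake_timer_fire()` and `container_of_sketch()` are hypothetical stand-ins that only imitate the shape of `struct timer_list`, `timer_setup()` and `from_timer()`; they are not kernel APIs.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct timer_list; not the kernel type. */
struct fake_timer {
	void (*function)(struct fake_timer *t);
};

/* Hypothetical owner structure that embeds its timer. */
struct something {
	int fired;
	struct fake_timer my_timer;
};

/* Step back from a member pointer to the structure that contains it. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* New-style callback: receives the timer and derives the owner from it. */
static void my_callback(struct fake_timer *t)
{
	struct something *ptr = container_of_sketch(t, struct something, my_timer);

	ptr->fired++;
}

/* Registers the callback; no separate data argument is stored. */
static void fake_timer_setup(struct fake_timer *t,
			     void (*fn)(struct fake_timer *))
{
	t->function = fn;
}

/* Simulates timer expiry by invoking the callback with the timer itself. */
static void fake_timer_fire(struct fake_timer *t)
{
	t->function(t);
}
```

Because the owner is computed from the timer's address, no `unsigned long` cast is needed and the callback's argument type can be checked by the compiler.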
The conversion is done with the following Coccinelle script:
spatch --very-quiet --all-includes --include-headers \
-I ./arch/x86/include -I ./arch/x86/include/generated \
-I ./include -I ./arch/x86/include/uapi \
-I ./arch/x86/include/generated/uapi -I ./include/uapi \
-I ./include/generated/uapi --include ./include/linux/kconfig.h \
--dir . \
--cocci-file ~/src/data/timer_setup.cocci
@fix_address_of@
expression e;
@@
setup_timer(
-&(e)
+&e
, ...)
// Update any raw setup_timer() usages that have a NULL callback, but
// would otherwise match change_timer_function_usage, since the latter
// will update all function assignments done in the face of a NULL
// function initialization in setup_timer().
@change_timer_function_usage_NULL@
expression _E;
identifier _timer;
type _cast_data;
@@
(
-setup_timer(&_E->_timer, NULL, _E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E->_timer, NULL, (_cast_data)_E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, &_E);
+timer_setup(&_E._timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, (_cast_data)&_E);
+timer_setup(&_E._timer, NULL, 0);
)
@change_timer_function_usage@
expression _E;
identifier _timer;
struct timer_list _stl;
identifier _callback;
type _cast_func, _cast_data;
@@
(
-setup_timer(&_E->_timer, _callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
_E->_timer@_stl.function = _callback;
|
_E->_timer@_stl.function = &_callback;
|
_E->_timer@_stl.function = (_cast_func)_callback;
|
_E->_timer@_stl.function = (_cast_func)&_callback;
|
_E._timer@_stl.function = _callback;
|
_E._timer@_stl.function = &_callback;
|
_E._timer@_stl.function = (_cast_func)_callback;
|
_E._timer@_stl.function = (_cast_func)&_callback;
)
// callback(unsigned long arg)
@change_callback_handle_cast
depends on change_timer_function_usage@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
identifier _handle;
@@
void _callback(
-_origtype _origarg
+struct timer_list *t
)
{
(
... when != _origarg
_handletype *_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
|
... when != _origarg
_handletype *_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
|
... when != _origarg
_handletype *_handle;
... when != _handle
_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
|
... when != _origarg
_handletype *_handle;
... when != _handle
_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
)
}
// callback(unsigned long arg) without existing variable
@change_callback_handle_cast_no_arg
depends on change_timer_function_usage &&
!change_callback_handle_cast@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
@@
void _callback(
-_origtype _origarg
+struct timer_list *t
)
{
+ _handletype *_origarg = from_timer(_origarg, t, _timer);
+
... when != _origarg
- (_handletype *)_origarg
+ _origarg
... when != _origarg
}
// Avoid already converted callbacks.
@match_callback_converted
depends on change_timer_function_usage &&
!change_callback_handle_cast &&
!change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier t;
@@
void _callback(struct timer_list *t)
{ ... }
// callback(struct something *handle)
@change_callback_handle_arg
depends on change_timer_function_usage &&
!match_callback_converted &&
!change_callback_handle_cast &&
!change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
@@
void _callback(
-_handletype *_handle
+struct timer_list *t
)
{
+ _handletype *_handle = from_timer(_handle, t, _timer);
...
}
// If change_callback_handle_arg ran on an empty function, remove
// the added handler.
@unchange_callback_handle_arg
depends on change_timer_function_usage &&
change_callback_handle_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
identifier t;
@@
void _callback(struct timer_list *t)
{
- _handletype *_handle = from_timer(_handle, t, _timer);
}
// We only want to refactor the setup_timer() data argument if we've found
// the matching callback. This undoes changes in change_timer_function_usage.
@unchange_timer_function_usage
depends on change_timer_function_usage &&
!change_callback_handle_cast &&
!change_callback_handle_cast_no_arg &&
!change_callback_handle_arg@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type change_timer_function_usage._cast_data;
@@
(
-timer_setup(&_E->_timer, _callback, 0);
+setup_timer(&_E->_timer, _callback, (_cast_data)_E);
|
-timer_setup(&_E._timer, _callback, 0);
+setup_timer(&_E._timer, _callback, (_cast_data)&_E);
)
// If we fixed a callback from a .function assignment, fix the
// assignment cast now.
@change_timer_function_assignment
depends on change_timer_function_usage &&
(change_callback_handle_cast ||
change_callback_handle_cast_no_arg ||
change_callback_handle_arg)@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_func;
typedef TIMER_FUNC_TYPE;
@@
(
_E->_timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E->_timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E->_timer.function =
-(_cast_func)_callback;
+(TIMER_FUNC_TYPE)_callback
;
|
_E->_timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-&_callback;
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
;
)
// Sometimes timer functions are called directly. Replace matched args.
@change_timer_function_calls
depends on change_timer_function_usage &&
(change_callback_handle_cast ||
change_callback_handle_cast_no_arg ||
change_callback_handle_arg)@
expression _E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_data;
@@
_callback(
(
-(_cast_data)_E
+&_E->_timer
|
-(_cast_data)&_E
+&_E._timer
|
-_E
+&_E->_timer
)
)
// If a timer has been configured without a data argument, it can be
// converted without regard to the callback argument, since it is unused.
@match_timer_function_unused_data@
expression _E;
identifier _timer;
identifier _callback;
@@
(
-setup_timer(&_E->_timer, _callback, 0);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0L);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0UL);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0L);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0UL);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0L);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0UL);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0L);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0UL);
+timer_setup(_timer, _callback, 0);
)
@change_callback_unused_data
depends on match_timer_function_unused_data@
identifier match_timer_function_unused_data._callback;
type _origtype;
identifier _origarg;
@@
void _callback(
-_origtype _origarg
+struct timer_list *unused
)
{
... when != _origarg
}
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-10-16 14:43:17 -07:00
static void tun_flow_cleanup(struct timer_list *t)
2012-10-31 19:46:02 +00:00
{
2017-10-16 14:43:17 -07:00
	struct tun_struct *tun = from_timer(tun, t, flow_gc_timer);
2012-10-31 19:46:02 +00:00
	unsigned long delay = tun->ageing_time;
	unsigned long next_timer = jiffies + delay;
	unsigned long count = 0;
	int i;
2017-10-20 11:29:55 -07:00
	spin_lock(&tun->lock);
2012-10-31 19:46:02 +00:00
	for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) {
		struct tun_flow_entry *e;
2013-02-27 17:06:00 -08:00
		struct hlist_node *n;
2012-10-31 19:46:02 +00:00
hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 17:06:00 -08:00
hlist_for_each_entry_safe ( e , n , & tun - > flows [ i ] , hash_link ) {
2012-10-31 19:46:02 +00:00
unsigned long this_timer ;
2017-10-20 11:29:56 -07:00
2012-10-31 19:46:02 +00:00
this_timer = e - > updated + delay ;
2017-10-20 11:29:56 -07:00
if ( time_before_eq ( this_timer , jiffies ) ) {
2012-10-31 19:46:02 +00:00
tun_flow_delete ( tun , e ) ;
2017-10-20 11:29:56 -07:00
continue ;
}
count + + ;
if ( time_before ( this_timer , next_timer ) )
2012-10-31 19:46:02 +00:00
next_timer = this_timer ;
}
}
if ( count )
mod_timer ( & tun - > flow_gc_timer , round_jiffies_up ( next_timer ) ) ;
2017-10-20 11:29:55 -07:00
spin_unlock ( & tun - > lock ) ;
2012-10-31 19:46:02 +00:00
}
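The garbage-collection loop above relies on time_before() and time_before_eq() staying correct when the jiffies counter wraps. A minimal userspace sketch of that comparison follows. The names tick_before, tick_before_eq and entry_expired are hypothetical stand-ins, not kernel API, and the sketch assumes the usual two's-complement conversion from unsigned long to long.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the kernel's time_before(): true when
 * tick 'a' is earlier than tick 'b', even across a counter wrap. */
static int tick_before(unsigned long a, unsigned long b)
{
	/* The unsigned difference is reinterpreted as signed, so a
	 * small "negative" distance means 'a' precedes 'b'. */
	return (long)(a - b) < 0;
}

/* Hypothetical stand-in for time_before_eq(). */
static int tick_before_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) <= 0;
}

/* Returns 1 when an entry last updated at 'updated' has aged out at
 * 'now', mirroring the "this_timer = e->updated + delay" check. */
static int entry_expired(unsigned long updated, unsigned long delay,
			 unsigned long now)
{
	return tick_before_eq(updated + delay, now);
}
```

A plain `updated + delay <= now` would give the wrong answer once either side wraps past ULONG_MAX, which is presumably why the driver uses the helper macros instead.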
2012-12-12 19:22:57 +00:00
static void tun_flow_update ( struct tun_struct * tun , u32 rxhash ,
2013-01-28 01:05:19 +00:00
struct tun_file * tfile )
2012-10-31 19:46:02 +00:00
{
struct hlist_head * head ;
struct tun_flow_entry * e ;
unsigned long delay = tun - > ageing_time ;
2013-01-28 01:05:19 +00:00
u16 queue_index = tfile - > queue_index ;
2012-10-31 19:46:02 +00:00
2018-12-06 16:28:11 +08:00
head = & tun - > flows [ tun_hashfn ( rxhash ) ] ;
2012-10-31 19:46:02 +00:00
rcu_read_lock ( ) ;
e = tun_flow_find ( head , rxhash ) ;
if ( likely ( e ) ) {
/* TODO: keep queueing to old queue until it's empty? */
2019-10-09 09:20:02 -07:00
if ( READ_ONCE ( e - > queue_index ) ! = queue_index )
WRITE_ONCE ( e - > queue_index , queue_index ) ;
2018-12-06 16:08:17 +08:00
if ( e - > updated ! = jiffies )
e - > updated = jiffies ;
2013-12-22 18:54:32 +08:00
sock_rps_record_flow_hash ( e - > rps_rxhash ) ;
2012-10-31 19:46:02 +00:00
} else {
spin_lock_bh ( & tun - > lock ) ;
2013-01-23 03:59:13 +00:00
if ( ! tun_flow_find ( head , rxhash ) & &
tun - > flow_count < MAX_TAP_FLOWS )
2012-10-31 19:46:02 +00:00
tun_flow_create ( tun , head , rxhash , queue_index ) ;
if ( ! timer_pending ( & tun - > flow_gc_timer ) )
mod_timer ( & tun - > flow_gc_timer ,
round_jiffies_up ( jiffies + delay ) ) ;
spin_unlock_bh ( & tun - > lock ) ;
}
rcu_read_unlock ( ) ;
}
2020-03-04 17:23:59 +01:00
/* Save the hash received in the stack receive path and update the
2013-12-22 18:54:32 +08:00
* flow_hash table accordingly .
*/
static inline void tun_flow_save_rps_rxhash ( struct tun_flow_entry * e , u32 hash )
{
net: rfs: add hash collision detection
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host while the flow hash table is populated.
Also, clearing the flow in inet_release() makes RFS not very good
for short-lived flows, as many packets can follow close()
(FIN, ACK packets, ...).
This patch extends the information stored in the global hash table
to include not only the cpu number, but also the upper part of the hash value.
I use a 32-bit value, and dynamically split it in two parts.
For hosts with fewer than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection uses the low-order bits of the hash, we have
a full hash match if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in the flow table does not match, we fall back to RPS (if
it is enabled for the rxqueue).
This means that a packet for a non-connected flow can avoid the
IPI through an unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps the performance of short-lived flows.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06 12:59:01 -08:00
if ( unlikely ( e - > rps_rxhash ! = hash ) )
2013-12-22 18:54:32 +08:00
e - > rps_rxhash = hash ;
}
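The RFS commit message above describes splitting one 32-bit table entry into a cpu number and the upper part of the flow hash. The following sketch illustrates that split for the 6-bit case the message mentions. CPU_BITS, flow_pack, flow_matches and flow_cpu are hypothetical names chosen for illustration, and the fixed 6-bit width is an assumption (the message says the split is chosen dynamically).

```c
#include <assert.h>
#include <stdint.h>

#define CPU_BITS 6u                      /* assumption: fewer than 64 cpus */
#define CPU_MASK ((1u << CPU_BITS) - 1u) /* low 6 bits hold the cpu number */

/* Pack an entry: keep the upper 26 bits of the hash and store the
 * cpu number in the low 6 bits. */
static uint32_t flow_pack(uint32_t hash, uint32_t cpu)
{
	assert(cpu <= CPU_MASK);
	return (hash & ~CPU_MASK) | cpu;
}

/* An entry belongs to 'hash' when the upper hash bits agree; a
 * mismatch is treated as a collision with another flow. */
static int flow_matches(uint32_t entry, uint32_t hash)
{
	return ((entry ^ hash) & ~CPU_MASK) == 0;
}

/* Extract the cpu number stored in the entry. */
static uint32_t flow_cpu(uint32_t entry)
{
	return entry & CPU_MASK;
}
```

Because the bucket index already consumes the low-order bits of the hash, comparing only the upper bits is enough to detect most collisions, as the message argues.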
2018-10-09 10:32:04 +08:00
/* We try to identify a flow through its rxhash. The reason that
2013-12-05 20:42:58 -08:00
* we do not check rxq no . is because some cards ( e . g . 82599 ) choose
2012-10-31 19:46:00 +00:00
* the rxq based on the txq where the last packet of the flow comes . As
* the userspace application moves between processors , we may get a
2018-10-09 10:32:04 +08:00
* different rxq no . here .
2012-10-31 19:46:00 +00:00
*/
2017-12-04 17:31:23 +08:00
static u16 tun_automq_select_queue ( struct tun_struct * tun , struct sk_buff * skb )
2012-10-31 19:46:00 +00:00
{
2012-10-31 19:46:02 +00:00
struct tun_flow_entry * e ;
2012-10-31 19:46:00 +00:00
u32 txq = 0 ;
u32 numqueues = 0 ;
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-23 14:07:29 -07:00
numqueues = READ_ONCE ( tun - > numqueues ) ;
2012-10-31 19:46:00 +00:00
2017-06-06 14:09:49 +08:00
txq = __skb_get_hash_symmetric ( skb ) ;
2018-10-09 10:32:04 +08:00
e = tun_flow_find ( & tun - > flows [ tun_hashfn ( txq ) ] , txq ) ;
if ( e ) {
tun_flow_save_rps_rxhash ( e , txq ) ;
txq = e - > queue_index ;
} else {
/* use multiply and shift instead of expensive divide */
txq = ( ( u64 ) txq * numqueues ) > > 32 ;
2012-10-31 19:46:00 +00:00
}
return txq ;
}
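The fallback branch above maps a 32-bit hash onto a queue number with a multiply and a shift rather than a modulo. A small standalone sketch of that mapping is shown here; hash_to_queue is a hypothetical helper name, not a function in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a 32-bit hash into the range [0, numqueues) without dividing.
 * The 64-bit product hash * numqueues is below numqueues * 2^32, so
 * its top 32 bits are always a valid queue index. With numqueues == 0
 * the result is 0. */
static uint32_t hash_to_queue(uint32_t hash, uint32_t numqueues)
{
	return (uint32_t)(((uint64_t)hash * numqueues) >> 32);
}
```

Unlike `hash % numqueues`, this uses the high-order bits of the hash, so consecutive hash ranges map to the same queue.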
2017-12-04 17:31:23 +08:00
static u16 tun_ebpf_select_queue ( struct tun_struct * tun , struct sk_buff * skb )
{
2018-01-16 16:31:01 +08:00
struct tun_prog * prog ;
2019-05-08 23:20:17 -04:00
u32 numqueues ;
2017-12-04 17:31:23 +08:00
u16 ret = 0 ;
2019-05-08 23:20:17 -04:00
numqueues = READ_ONCE ( tun - > numqueues ) ;
if ( ! numqueues )
return 0 ;
2017-12-04 17:31:23 +08:00
prog = rcu_dereference ( tun - > steering_prog ) ;
if ( prog )
ret = bpf_prog_run_clear_cb ( prog - > prog , skb ) ;
2019-05-08 23:20:17 -04:00
return ret % numqueues ;
2017-12-04 17:31:23 +08:00
}
static u16 tun_select_queue ( struct net_device * dev , struct sk_buff * skb ,
2019-03-20 11:02:06 +01:00
struct net_device * sb_dev )
2017-12-04 17:31:23 +08:00
{
struct tun_struct * tun = netdev_priv ( dev ) ;
u16 ret ;
rcu_read_lock ( ) ;
if ( rcu_dereference ( tun - > steering_prog ) )
ret = tun_ebpf_select_queue ( tun , skb ) ;
else
ret = tun_automq_select_queue ( tun , skb ) ;
rcu_read_unlock ( ) ;
return ret ;
}
2012-10-31 19:46:01 +00:00
static inline bool tun_not_capable ( struct tun_struct * tun )
{
const struct cred * cred = current_cred ( ) ;
2012-11-18 21:34:11 +00:00
struct net * net = dev_net ( tun - > dev ) ;
2012-10-31 19:46:01 +00:00
return ( ( uid_valid ( tun - > owner ) & & ! uid_eq ( cred - > euid , tun - > owner ) ) | |
( gid_valid ( tun - > group ) & & ! in_egroup_p ( tun - > group ) ) ) & &
2012-11-18 21:34:11 +00:00
! ns_capable ( net - > user_ns , CAP_NET_ADMIN ) ;
2012-10-31 19:46:01 +00:00
}
2012-10-31 19:46:00 +00:00
static void tun_set_real_num_queues ( struct tun_struct * tun )
{
netif_set_real_num_tx_queues ( tun - > dev , tun - > numqueues ) ;
netif_set_real_num_rx_queues ( tun - > dev , tun - > numqueues ) ;
}
2012-12-13 23:53:30 +00:00
static void tun_disable_queue ( struct tun_struct * tun , struct tun_file * tfile )
{
tfile - > detached = tun ;
list_add_tail ( & tfile - > next , & tun - > disabled ) ;
+ + tun - > numdisabled ;
}
2012-12-18 11:00:27 +08:00
static struct tun_struct * tun_enable_queue ( struct tun_file * tfile )
2012-12-13 23:53:30 +00:00
{
struct tun_struct * tun = tfile - > detached ;
tfile - > detached = NULL ;
list_del_init ( & tfile - > next ) ;
- - tun - > numdisabled ;
return tun ;
}
2018-03-09 14:50:34 +08:00
void tun_ptr_free ( void * ptr )
2018-01-04 11:14:28 +08:00
{
if ( ! ptr )
return ;
2018-04-17 16:45:47 +02:00
if ( tun_is_xdp_frame ( ptr ) ) {
struct xdp_frame * xdpf = tun_ptr_to_xdp ( ptr ) ;
2018-01-04 11:14:28 +08:00
xdp: transition into using xdp_frame for return API
Changing the xdp_return_frame() API to take struct xdp_frame as its argument
seems like a natural choice. But there are some subtle performance
details here that need extra care, which is a deliberate choice.
De-referencing xdp_frame on a remote CPU during DMA-TX
completion results in the cache-line changing to the "Shared"
state. Later, when the page is reused for RX, this xdp_frame
cache-line is written, which changes the state to "Modified".
This situation already happens (naturally) for virtio_net, tun and
cpumap, as the xdp_frame pointer is the queued object. In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-reference xdp_frame.
It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame. The driver already has a
TX-ring queue, which (in case of remote DMA-TX completion) has to be
transferred between CPUs anyhow. In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.
To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern. My benchmarks show that
this prefetchw is enough to compensate for the change in the ixgbe driver.
V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 16:46:32 +02:00
xdp_return_frame ( xdpf ) ;
2018-01-04 11:14:28 +08:00
} else {
__skb_array_destroy_skb ( ptr ) ;
}
}
2018-03-09 14:50:34 +08:00
EXPORT_SYMBOL_GPL ( tun_ptr_free ) ;
2018-01-04 11:14:28 +08:00
2013-09-05 17:53:59 +08:00
static void tun_queue_purge ( struct tun_file * tfile )
{
2018-01-04 11:14:28 +08:00
void * ptr ;
2016-06-30 14:45:36 +08:00
2018-01-04 11:14:28 +08:00
while ( ( ptr = ptr_ring_consume ( & tfile - > tx_ring ) ) ! = NULL )
tun_ptr_free ( ptr ) ;
2016-06-30 14:45:36 +08:00
2017-01-18 15:02:03 +08:00
skb_queue_purge ( & tfile - > sk . sk_write_queue ) ;
2013-09-05 17:53:59 +08:00
skb_queue_purge ( & tfile - > sk . sk_error_queue ) ;
}
2012-10-31 19:46:00 +00:00
static void __tun_detach ( struct tun_file * tfile , bool clean )
{
struct tun_file * ntfile ;
struct tun_struct * tun ;
2013-01-11 16:59:32 +00:00
tun = rtnl_dereference ( tfile - > tun ) ;
2017-09-22 13:49:14 -07:00
if ( tun & & clean ) {
2022-06-29 11:19:10 -07:00
if ( ! tfile - > detached )
tun_napi_disable ( tfile ) ;
2018-09-28 14:51:47 -07:00
tun_napi_del ( tfile ) ;
2017-09-22 13:49:14 -07:00
}
2013-01-28 01:05:19 +00:00
if ( tun & & ! tfile - > detached ) {
2012-10-31 19:46:00 +00:00
u16 index = tfile - > queue_index ;
BUG_ON ( index > = tun - > numqueues ) ;
rcu_assign_pointer ( tun - > tfiles [ index ] ,
tun - > tfiles [ tun - > numqueues - 1 ] ) ;
2013-01-11 16:59:32 +00:00
ntfile = rtnl_dereference ( tun - > tfiles [ index ] ) ;
2012-10-31 19:46:00 +00:00
ntfile - > queue_index = index ;
2019-05-08 23:20:18 -04:00
rcu_assign_pointer ( tun - > tfiles [ tun - > numqueues - 1 ] ,
NULL ) ;
2012-10-31 19:46:00 +00:00
- - tun - > numqueues ;
2013-01-28 01:05:19 +00:00
if ( clean ) {
2014-03-24 00:02:32 +05:30
RCU_INIT_POINTER ( tfile - > tun , NULL ) ;
2012-12-13 23:53:30 +00:00
sock_put ( & tfile - > sk ) ;
2022-06-22 21:21:05 -07:00
} else {
2012-12-13 23:53:30 +00:00
tun_disable_queue ( tun , tfile ) ;
2022-06-22 21:21:05 -07:00
tun_napi_disable ( tfile ) ;
}
2012-10-31 19:46:00 +00:00
synchronize_net ( ) ;
2012-10-31 19:46:02 +00:00
tun_flow_delete_by_queue ( tun , tun - > numqueues + 1 ) ;
2012-10-31 19:46:00 +00:00
/* Drop read queue */
2013-09-05 17:53:59 +08:00
tun_queue_purge ( tfile ) ;
2012-10-31 19:46:00 +00:00
tun_set_real_num_queues ( tun ) ;
2013-01-11 16:59:34 +00:00
} else if ( tfile - > detached & & clean ) {
2012-12-13 23:53:30 +00:00
tun = tun_enable_queue ( tfile ) ;
2013-01-11 16:59:34 +00:00
sock_put ( & tfile - > sk ) ;
}
2012-10-31 19:46:00 +00:00
if ( clean ) {
2013-01-28 00:38:02 +00:00
if ( tun & & tun - > numqueues = = 0 & & tun - > numdisabled = = 0 ) {
netif_carrier_off ( tun - > dev ) ;
2014-11-19 15:17:31 +02:00
if ( ! ( tun - > flags & IFF_PERSIST ) & &
2013-01-28 00:38:02 +00:00
tun - > dev - > reg_state = = NETREG_REGISTERED )
2012-12-13 23:53:30 +00:00
unregister_netdevice ( tun - > dev ) ;
2013-01-28 00:38:02 +00:00
}
2018-05-11 10:49:25 +08:00
if ( tun )
xdp_rxq_info_unreg ( & tfile - > xdp_rxq ) ;
2018-05-16 20:39:33 +08:00
ptr_ring_cleanup ( & tfile - > tx_ring , tun_ptr_free ) ;
2012-10-31 19:46:00 +00:00
}
}
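__tun_detach() above keeps the tfiles array dense by moving the last queue into the slot being vacated and fixing up its queue_index. The sketch below shows the same swap-with-last removal on a plain array. struct queue, struct dev, dev_attach and dev_detach are hypothetical userspace types and helpers, and the RCU publication, locking and flow cleanup of the real driver are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_Q 8 /* arbitrary capacity for the sketch */

struct queue {
	int id;
	unsigned index; /* position of this queue in dev->slots */
};

struct dev {
	struct queue *slots[MAX_Q];
	unsigned num; /* number of attached queues */
};

/* Append a queue at the end and record its index. */
static void dev_attach(struct dev *d, struct queue *q)
{
	assert(d->num < MAX_Q);
	q->index = d->num;
	d->slots[d->num++] = q;
}

/* Remove a queue in O(1): the last queue takes over the vacated slot,
 * so slots[0..num-1] stays free of holes. */
static void dev_detach(struct dev *d, struct queue *q)
{
	unsigned index = q->index;
	struct queue *last;

	assert(index < d->num && d->slots[index] == q);
	last = d->slots[d->num - 1];
	d->slots[index] = last;      /* also correct when q is the last */
	last->index = index;
	d->slots[d->num - 1] = NULL;
	d->num--;
}
```

The cost of this design is that queue indices are not stable: the moved queue changes its index, which appears to be why the driver afterwards deletes flow entries that still point at the old last index.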
static void tun_detach ( struct tun_file * tfile , bool clean )
{
2018-04-10 16:28:56 +02:00
struct tun_struct * tun ;
struct net_device * dev ;
2012-10-31 19:46:00 +00:00
rtnl_lock ( ) ;
2018-04-10 16:28:56 +02:00
tun = rtnl_dereference ( tfile - > tun ) ;
dev = tun ? tun - > dev : NULL ;
2012-10-31 19:46:00 +00:00
__tun_detach ( tfile , clean ) ;
2018-04-10 16:28:56 +02:00
if ( dev )
netdev_state_change ( dev ) ;
2012-10-31 19:46:00 +00:00
rtnl_unlock ( ) ;
2022-11-25 02:51:34 +09:00
if ( clean )
sock_put ( & tfile - > sk ) ;
2012-10-31 19:46:00 +00:00
}
static void tun_detach_all ( struct net_device * dev )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
2012-12-13 23:53:30 +00:00
struct tun_file * tfile , * tmp ;
2012-10-31 19:46:00 +00:00
int i , n = tun - > numqueues ;
for ( i = 0 ; i < n ; i + + ) {
2013-01-11 16:59:32 +00:00
tfile = rtnl_dereference ( tun - > tfiles [ i ] ) ;
2012-10-31 19:46:00 +00:00
BUG_ON ( ! tfile ) ;
2018-09-28 14:51:47 -07:00
tun_napi_disable ( tfile ) ;
2016-05-19 13:36:51 +08:00
tfile - > socket . sk - > sk_shutdown = RCV_SHUTDOWN ;
2014-05-16 15:11:48 -07:00
tfile - > socket . sk - > sk_data_ready ( tfile - > socket . sk ) ;
2014-03-24 00:02:32 +05:30
RCU_INIT_POINTER ( tfile - > tun , NULL ) ;
2012-10-31 19:46:00 +00:00
- - tun - > numqueues ;
}
2013-01-28 01:05:19 +00:00
list_for_each_entry ( tfile , & tun - > disabled , next ) {
2016-05-19 13:36:51 +08:00
tfile - > socket . sk - > sk_shutdown = RCV_SHUTDOWN ;
2014-05-16 15:11:48 -07:00
tfile - > socket . sk - > sk_data_ready ( tfile - > socket . sk ) ;
2014-03-24 00:02:32 +05:30
RCU_INIT_POINTER ( tfile - > tun , NULL ) ;
2013-01-28 01:05:19 +00:00
}
2012-10-31 19:46:00 +00:00
BUG_ON ( tun - > numqueues ! = 0 ) ;
synchronize_net ( ) ;
for ( i = 0 ; i < n ; i + + ) {
2013-01-11 16:59:32 +00:00
tfile = rtnl_dereference ( tun - > tfiles [ i ] ) ;
2018-09-28 14:51:47 -07:00
tun_napi_del ( tfile ) ;
2012-10-31 19:46:00 +00:00
/* Drop read queue */
2013-09-05 17:53:59 +08:00
tun_queue_purge ( tfile ) ;
2018-05-11 10:49:25 +08:00
xdp_rxq_info_unreg ( & tfile - > xdp_rxq ) ;
2012-10-31 19:46:00 +00:00
sock_put ( & tfile - > sk ) ;
}
2012-12-13 23:53:30 +00:00
list_for_each_entry_safe ( tfile , tmp , & tun - > disabled , next ) {
2022-06-22 21:20:39 -07:00
tun_napi_del ( tfile ) ;
2012-12-13 23:53:30 +00:00
tun_enable_queue ( tfile ) ;
2013-09-05 17:53:59 +08:00
tun_queue_purge ( tfile ) ;
2018-05-11 10:49:25 +08:00
xdp_rxq_info_unreg ( & tfile - > xdp_rxq ) ;
2012-12-13 23:53:30 +00:00
sock_put ( & tfile - > sk ) ;
}
BUG_ON ( tun - > numdisabled ! = 0 ) ;
2013-01-11 16:59:34 +00:00
2014-11-19 15:17:31 +02:00
if ( tun - > flags & IFF_PERSIST )
2013-01-11 16:59:34 +00:00
module_put ( THIS_MODULE ) ;
2012-10-31 19:46:00 +00:00
}
2017-09-22 13:49:14 -07:00
static int tun_attach ( struct tun_struct * tun , struct file * file ,
2019-09-10 18:56:57 +08:00
bool skip_filter , bool napi , bool napi_frags ,
bool publish_tun )
2009-01-20 10:57:48 +00:00
{
2009-01-20 11:00:40 +00:00
struct tun_file * tfile = file - > private_data ;
2016-06-30 14:45:36 +08:00
struct net_device * dev = tun - > dev ;
2009-01-20 11:02:28 +00:00
int err ;
2009-01-20 10:57:48 +00:00
2013-01-14 07:12:19 +00:00
err = security_tun_dev_attach ( tfile - > socket . sk , tun - > security ) ;
if ( err < 0 )
goto out ;
2009-01-20 11:02:28 +00:00
err = - EINVAL ;
2013-01-28 01:05:19 +00:00
if ( rtnl_dereference ( tfile - > tun ) & & ! tfile - > detached )
2009-01-20 11:02:28 +00:00
goto out ;
err = - EBUSY ;
2014-11-19 15:17:31 +02:00
if ( ! ( tun - > flags & IFF_MULTI_QUEUE ) & & tun - > numqueues = = 1 )
2012-10-31 19:46:00 +00:00
goto out ;
err = - E2BIG ;
2012-12-13 23:53:30 +00:00
if ( ! tfile - > detached & &
tun - > numqueues + tun - > numdisabled = = MAX_TAP_QUEUES )
2009-01-20 11:02:28 +00:00
goto out ;
err = 0 ;
2012-10-31 19:45:57 +00:00
2013-12-05 20:42:58 -08:00
/* Re-attach the filter to persist device */
tun: Allow to skip filter on attach
There's a small problem with sk-filters on tun devices. Consider
an application doing this sequence of steps:
fd = open("/dev/net/tun");
ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
ioctl(fd, TUNATTACHFILTER, &my_filter);
ioctl(fd, TUNSETPERSIST, 1);
close(fd);
At that point the tun0 will remain in the system and will keep in
mind that there should be a socket filter at address '&my_filter'.
If after that we do
fd = open("/dev/net/tun");
ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
we most likely receive the -EFAULT error, since tun_attach() would
try to connect the filter back. But (!) if we provide a filter at
address &my_filter, then tun0 will be created and the "new" filter
would be attached, but application may not know about that.
This may create certain problems for anyone using tun devices, but it is
a critical problem for c/r (checkpoint/restore): if we meet a persistent tun device
with a filter in mind, we will not be able to attach to it to dump
its state (flags, owner, address, vnethdr size, etc.).
The proposal is to allow attaching to a tun device (with TUNSETIFF)
w/o attaching the filter to the tun-file's socket. After such an
attach the app may, e.g., clean the device by dropping the filter if it
doesn't want to have one, or (in case of c/r) get information
about the device with tun ioctls.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-21 14:32:21 +04:00
if ( ! skip_filter & & ( tun - > filter_attached = = true ) ) {
2016-04-05 17:10:16 +02:00
lock_sock ( tfile - > socket . sk ) ;
err = sk_attach_filter ( & tun - > fprog , tfile - > socket . sk ) ;
release_sock ( tfile - > socket . sk ) ;
2012-10-31 19:45:57 +00:00
if ( ! err )
goto out ;
}
2016-06-30 14:45:36 +08:00
if ( ! tfile - > detached & &
2018-05-11 10:49:25 +08:00
ptr_ring_resize ( & tfile - > tx_ring , dev - > tx_queue_len ,
GFP_KERNEL , tun_ptr_free ) ) {
2016-06-30 14:45:36 +08:00
err = - ENOMEM ;
goto out ;
}
2012-10-31 19:46:00 +00:00
tfile - > queue_index = tun - > numqueues ;
2016-05-19 13:36:51 +08:00
tfile - > socket . sk - > sk_shutdown & = ~ RCV_SHUTDOWN ;
2018-01-03 11:25:59 +01:00
if ( tfile - > detached ) {
/* Re-attach detached tfile, updating XDP queue_index */
WARN_ON ( ! xdp_rxq_info_is_reg ( & tfile - > xdp_rxq ) ) ;
if ( tfile - > xdp_rxq . queue_index ! = tfile - > queue_index )
tfile - > xdp_rxq . queue_index = tfile - > queue_index ;
} else {
/* Setup XDP RX-queue info, for new tfile getting attached */
err = xdp_rxq_info_reg ( & tfile - > xdp_rxq ,
2020-11-30 19:52:01 +01:00
tun - > dev , tfile - > queue_index , 0 ) ;
2018-01-03 11:25:59 +01:00
if ( err < 0 )
goto out ;
xdp: rhashtable with allocator ID to pointer mapping
Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info. Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable, for creating ID to
pointer lookup table, because this is faster.
The problem being solved here is that the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as the
guaranteed lifetime is too short. The info is needed on a
(potentially) remote CPU during DMA-TX completion time. The
xdp_mem_info is stored in an xdp_frame when it is converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.
For more advanced allocators there is a need to store a pointer to the
registered allocator. Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID map to pointer. The removal and validity of the
allocator and helper struct xdp_mem_allocator are guarded by RCU. The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().
It is up for debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function. In any case,
this must happen via RCU freeing.
V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
XDP_REDIRECT, even-though it's not strictly necessary when
allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch
V8: Address sparse should be static warnings (from kbuild test robot)
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 16:46:12 +02:00
err = xdp_rxq_info_reg_mem_model ( & tfile - > xdp_rxq ,
MEM_TYPE_PAGE_SHARED , NULL ) ;
if ( err < 0 ) {
xdp_rxq_info_unreg ( & tfile - > xdp_rxq ) ;
goto out ;
}
2018-01-03 11:25:59 +01:00
err = 0 ;
}
2017-09-22 13:49:14 -07:00
if ( tfile - > detached ) {
2012-12-13 23:53:30 +00:00
tun_enable_queue ( tfile ) ;
2022-06-22 21:21:05 -07:00
tun_napi_enable ( tfile ) ;
2017-09-22 13:49:14 -07:00
} else {
2012-12-13 23:53:30 +00:00
sock_hold ( & tfile - > sk ) ;
2018-09-28 14:51:49 -07:00
tun_napi_init ( tun , tfile , napi , napi_frags ) ;
2017-09-22 13:49:14 -07:00
}
2012-12-13 23:53:30 +00:00
2018-09-12 11:16:59 +08:00
if ( rtnl_dereference ( tun - > xdp_prog ) )
sock_set_flag ( & tfile - > sk , SOCK_XDP ) ;
2012-10-31 19:46:00 +00:00
/* device is allowed to go away first, so no need to hold extra
* refcnt .
*/
2019-01-07 13:38:38 -08:00
/* Publish tfile->tun and tun->tfiles only after we've fully
* initialized tfile ; otherwise we risk using half - initialized
* object .
*/
2019-09-10 18:56:57 +08:00
if ( publish_tun )
rcu_assign_pointer ( tfile - > tun , tun ) ;
2019-01-07 13:38:38 -08:00
rcu_assign_pointer ( tun - > tfiles [ tun - > numqueues ] , tfile ) ;
tun - > numqueues + + ;
2019-01-29 22:50:13 -05:00
tun_set_real_num_queues ( tun ) ;
2012-10-31 19:46:00 +00:00
out :
return err ;
2009-01-20 11:00:40 +00:00
}
2017-09-23 22:36:52 +08:00
static struct tun_struct * tun_get ( struct tun_file * tfile )
2009-01-20 11:00:40 +00:00
{
2012-10-31 19:45:58 +00:00
struct tun_struct * tun ;
2009-01-20 11:07:17 +00:00
2012-10-31 19:45:58 +00:00
rcu_read_lock ( ) ;
tun = rcu_dereference ( tfile - > tun ) ;
if ( tun )
dev_hold ( tun - > dev ) ;
rcu_read_unlock ( ) ;
2009-01-20 11:07:17 +00:00
return tun ;
2009-01-20 11:00:40 +00:00
}
static void tun_put ( struct tun_struct * tun )
{
2012-10-31 19:45:58 +00:00
dev_put ( tun - > dev ) ;
2009-01-20 11:00:40 +00:00
}
2011-03-02 07:18:10 +00:00
/* TAP filtering */
2008-07-14 22:18:19 -07:00
static void addr_hash_set ( u32 * mask , const u8 * addr )
{
int n = ether_crc ( ETH_ALEN , addr ) > > 26 ;
mask [ n > > 5 ] | = ( 1 < < ( n & 31 ) ) ;
}
static unsigned int addr_hash_test ( const u32 * mask , const u8 * addr )
{
int n = ether_crc ( ETH_ALEN , addr ) > > 26 ;
return mask [ n > > 5 ] & ( 1 < < ( n & 31 ) ) ;
}
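addr_hash_set() and addr_hash_test() above treat two 32-bit words as a 64-bit bitmap indexed by a 6-bit hash: `n >> 5` picks the word and `n & 31` picks the bit. A standalone sketch of that indexing follows. mask_set and mask_test are hypothetical names, and the caller supplies the 6-bit index directly instead of computing it with ether_crc(). The sketch uses an unsigned constant so that setting bit 31 is well defined in ISO C.

```c
#include <assert.h>
#include <stdint.h>

/* Set bit n (0..63) in a 64-bit mask stored as two 32-bit words. */
static void mask_set(uint32_t mask[2], unsigned n)
{
	assert(n < 64);
	mask[n >> 5] |= (uint32_t)1 << (n & 31);
}

/* Return 1 if bit n (0..63) is set, 0 otherwise. */
static int mask_test(const uint32_t mask[2], unsigned n)
{
	assert(n < 64);
	return (mask[n >> 5] >> (n & 31)) & 1u;
}
```

Since many addresses share each of the 64 bits, a set bit only means "possibly wanted"; the hash stage can pass extra multicast frames but never drops one whose address was added.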
static int update_filter ( struct tap_filter * filter , void __user * arg )
{
struct { u8 u [ ETH_ALEN ] ; } * addr ;
struct tun_filter uf ;
int err , alen , n , nexact ;
if ( copy_from_user ( & uf , arg , sizeof ( uf ) ) )
return - EFAULT ;
if ( ! uf . count ) {
/* Disabled */
filter - > count = 0 ;
return 0 ;
}
alen = ETH_ALEN * uf . count ;
2016-08-20 08:54:15 +02:00
addr = memdup_user ( arg + sizeof ( uf ) , alen ) ;
if ( IS_ERR ( addr ) )
return PTR_ERR ( addr ) ;
2008-07-14 22:18:19 -07:00
/* The filter is updated without holding any locks. Which is
* perfectly safe . We disable it first and in the worst
* case we ' ll accept a few undesired packets . */
filter - > count = 0 ;
wmb ( ) ;
/* Use first set of addresses as an exact filter */
for ( n = 0 ; n < uf . count & & n < FLT_EXACT_COUNT ; n + + )
memcpy ( filter - > addr [ n ] , addr [ n ] . u , ETH_ALEN ) ;
nexact = n ;
2009-02-08 17:49:17 -08:00
/* Remaining multicast addresses are hashed,
* unicast will leave the filter disabled . */
2008-07-14 22:18:19 -07:00
memset ( filter - > mask , 0 , sizeof ( filter - > mask ) ) ;
2009-02-08 17:49:17 -08:00
for ( ; n < uf . count ; n + + ) {
if ( ! is_multicast_ether_addr ( addr [ n ] . u ) ) {
err = 0 ; /* no filter */
2016-08-20 09:00:34 +02:00
goto free_addr ;
2009-02-08 17:49:17 -08:00
}
2008-07-14 22:18:19 -07:00
addr_hash_set ( filter - > mask , addr [ n ] . u ) ;
2009-02-08 17:49:17 -08:00
}
2008-07-14 22:18:19 -07:00
/* For ALLMULTI just set the mask to all ones.
* This overrides the mask populated above . */
if ( ( uf . flags & TUN_FLT_ALLMULTI ) )
memset ( filter - > mask , ~ 0 , sizeof ( filter - > mask ) ) ;
/* Now enable the filter */
wmb ( ) ;
filter - > count = nexact ;
/* Return the number of exact filters */
err = nexact ;
2016-08-20 09:00:34 +02:00
free_addr :
2008-07-14 22:18:19 -07:00
kfree ( addr ) ;
return err ;
}
/* Returns: 0 - drop, !=0 - accept */
static int run_filter ( struct tap_filter * filter , const struct sk_buff * skb )
{
/* Cannot use eth_hdr(skb) here because skb_mac_hdr() is incorrect
* at this point . */
struct ethhdr * eh = ( struct ethhdr * ) skb - > data ;
int i ;
/* Exact match */
for ( i = 0 ; i < filter - > count ; i + + )
drivers/net: Convert compare_ether_addr to ether_addr_equal
Use the new bool function ether_addr_equal to add
some clarity and reduce the likelihood for misuse
of compare_ether_addr for sorting.
Done via cocci script:
$ cat compare_ether_addr.cocci
@@
expression a,b;
@@
- !compare_ether_addr(a, b)
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- compare_ether_addr(a, b)
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- !ether_addr_equal(a, b) == 0
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- !ether_addr_equal(a, b) != 0
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- ether_addr_equal(a, b) == 0
+ !ether_addr_equal(a, b)
@@
expression a,b;
@@
- ether_addr_equal(a, b) != 0
+ ether_addr_equal(a, b)
@@
expression a,b;
@@
- !!ether_addr_equal(a, b)
+ ether_addr_equal(a, b)
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-09 17:17:46 +00:00
if ( ether_addr_equal ( eh - > h_dest , filter - > addr [ i ] ) )
2008-07-14 22:18:19 -07:00
return 1 ;
/* Inexact match (multicast only) */
if ( is_multicast_ether_addr ( eh - > h_dest ) )
return addr_hash_test ( filter - > mask , eh - > h_dest ) ;
return 0 ;
}
/*
* Checks whether the packet is accepted or not .
* Returns : 0 - drop , ! = 0 - accept
*/
static int check_filter ( struct tap_filter * filter , const struct sk_buff * skb )
{
if ( ! filter - > count )
return 1 ;
return run_filter ( filter , skb ) ;
}
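check_filter() and run_filter() above accept a frame in three steps: a disabled filter accepts everything, then the exact-match list is scanned, and finally multicast destinations are looked up in the hash mask. The sketch below reproduces that decision order in userspace. struct filter, toy_hash, is_multicast and filter_accepts are hypothetical, EXACT_MAX is an assumed stand-in for FLT_EXACT_COUNT, and toy_hash is a made-up 6-bit hash rather than the kernel's ether_crc().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ADDR_LEN 6
#define EXACT_MAX 8 /* assumption; stands in for FLT_EXACT_COUNT */

struct filter {
	unsigned count;                    /* 0 means filtering is off */
	uint8_t addr[EXACT_MAX][ADDR_LEN]; /* exact-match entries */
	uint32_t mask[2];                  /* 64-bit multicast hash mask */
};

/* Toy 6-bit hash for illustration only, NOT ether_crc(). */
static unsigned toy_hash(const uint8_t *a)
{
	unsigned h = 0, i;

	for (i = 0; i < ADDR_LEN; i++)
		h = h * 31 + a[i];
	return h & 63;
}

/* Multicast addresses have the low bit of the first octet set. */
static int is_multicast(const uint8_t *a)
{
	return a[0] & 1;
}

/* Returns 0 to drop the frame, non-zero to accept it. */
static int filter_accepts(const struct filter *f, const uint8_t *dest)
{
	unsigned i, n;

	if (!f->count)
		return 1; /* filter disabled: accept everything */
	for (i = 0; i < f->count; i++)
		if (!memcmp(dest, f->addr[i], ADDR_LEN))
			return 1; /* exact match */
	if (is_multicast(dest)) {
		n = toy_hash(dest);
		return (f->mask[n >> 5] >> (n & 31)) & 1u; /* inexact match */
	}
	return 0;
}
```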
2005-04-16 15:20:36 -07:00
/* Network device part of the driver */
2006-09-13 14:30:00 -04:00
static const struct ethtool_ops tun_ethtool_ops ;
2005-04-16 15:20:36 -07:00
2021-12-16 13:25:32 -05:00
static int tun_net_init ( struct net_device * dev )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
struct ifreq * ifr = tun - > ifr ;
int err ;
dev - > tstats = netdev_alloc_pcpu_stats ( struct pcpu_sw_netstats ) ;
if ( ! dev - > tstats )
return - ENOMEM ;
spin_lock_init ( & tun - > lock ) ;
err = security_tun_dev_alloc_security ( & tun - > security ) ;
if ( err < 0 ) {
free_percpu ( dev - > tstats ) ;
return err ;
}
tun_flow_init ( tun ) ;
dev - > hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
TUN_USER_FEATURES | NETIF_F_HW_VLAN_CTAG_TX |
NETIF_F_HW_VLAN_STAG_TX ;
dev - > features = dev - > hw_features | NETIF_F_LLTX ;
dev - > vlan_features = dev - > features &
~ ( NETIF_F_HW_VLAN_CTAG_TX |
NETIF_F_HW_VLAN_STAG_TX ) ;
tun - > flags = ( tun - > flags & ~ TUN_FEATURES ) |
( ifr - > ifr_flags & TUN_FEATURES ) ;
INIT_LIST_HEAD ( & tun - > disabled ) ;
err = tun_attach ( tun , tun - > file , false , ifr - > ifr_flags & IFF_NAPI ,
ifr - > ifr_flags & IFF_NAPI_FRAGS , false ) ;
if ( err < 0 ) {
tun_flow_uninit ( tun ) ;
security_tun_dev_free_security ( tun - > security ) ;
free_percpu ( dev - > tstats ) ;
return err ;
}
return 0 ;
}
2009-01-20 11:07:17 +00:00
/* Net device detach from fd. */
static void tun_net_uninit ( struct net_device * dev )
{
2012-10-31 19:46:00 +00:00
tun_detach_all ( dev ) ;
2009-01-20 11:07:17 +00:00
}
2005-04-16 15:20:36 -07:00
/* Net device open. */
static int tun_net_open ( struct net_device * dev )
{
2012-10-31 19:46:00 +00:00
netif_tx_start_all_queues ( dev ) ;
2017-03-13 00:00:26 +01:00
2005-04-16 15:20:36 -07:00
return 0 ;
}
/* Net device close. */
static int tun_net_close ( struct net_device * dev )
{
2012-10-31 19:46:00 +00:00
netif_tx_stop_all_queues ( dev ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
/* Net device start xmit */
2017-12-04 17:31:23 +08:00
static void tun_automq_xmit ( struct tun_struct * tun , struct sk_buff * skb )
2005-04-16 15:20:36 -07:00
{
2016-04-25 23:13:42 -04:00
# ifdef CONFIG_RPS
2019-03-22 08:56:38 -07:00
if ( tun - > numqueues = = 1 & & static_branch_unlikely ( & rps_needed ) ) {
2013-12-22 18:54:32 +08:00
/* Select queue was not called for the skbuff, so we extract the
* RPS hash and save it into the flow_table here .
*/
2018-10-09 10:32:04 +08:00
struct tun_flow_entry * e ;
2013-12-22 18:54:32 +08:00
__u32 rxhash ;
2017-06-06 14:09:49 +08:00
rxhash = __skb_get_hash_symmetric ( skb ) ;
2018-10-09 10:32:04 +08:00
e = tun_flow_find ( & tun - > flows [ tun_hashfn ( rxhash ) ] , rxhash ) ;
if ( e )
tun_flow_save_rps_rxhash ( e , rxhash ) ;
2013-12-22 18:54:32 +08:00
}
2016-04-25 23:13:42 -04:00
# endif
2017-12-04 17:31:23 +08:00
}
2018-01-16 16:31:02 +08:00
static unsigned int run_ebpf_filter ( struct tun_struct * tun ,
struct sk_buff * skb ,
int len )
{
struct tun_prog * prog = rcu_dereference ( tun - > filter_prog ) ;
if ( prog )
len = bpf_prog_run_clear_cb ( prog - > prog , skb ) ;
return len ;
}
2017-12-04 17:31:23 +08:00
/* Net device start xmit */
static netdev_tx_t tun_net_xmit ( struct sk_buff * skb , struct net_device * dev )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
2022-03-04 06:55:07 -08:00
enum skb_drop_reason drop_reason ;
2017-12-04 17:31:23 +08:00
int txq = skb - > queue_mapping ;
2021-11-12 08:56:03 +01:00
struct netdev_queue * queue ;
2017-12-04 17:31:23 +08:00
struct tun_file * tfile ;
2018-01-16 16:31:02 +08:00
int len = skb - > len ;
2017-12-04 17:31:23 +08:00
rcu_read_lock ( ) ;
tfile = rcu_dereference ( tun - > tfiles [ txq ] ) ;
/* Drop packet if interface is not attached */
2022-03-04 06:55:07 -08:00
if ( ! tfile ) {
drop_reason = SKB_DROP_REASON_DEV_READY ;
2017-12-04 17:31:23 +08:00
goto drop ;
2022-03-04 06:55:07 -08:00
}
2017-12-04 17:31:23 +08:00
if ( ! rcu_dereference ( tun - > steering_prog ) )
tun_automq_xmit ( tun , skb ) ;
2013-12-22 18:54:32 +08:00
2020-03-04 17:24:14 +01:00
netif_info ( tun , tx_queued , tun - > dev , " %s %d \n " , __func__ , skb - > len ) ;
2012-10-31 19:45:58 +00:00
2008-07-14 22:18:19 -07:00
/* Drop if the filter does not like it.
* This is a noop if the filter is disabled .
* Filter can be enabled only for the TAP devices . */
2022-03-04 06:55:07 -08:00
if ( ! check_filter ( & tun - > txflt , skb ) ) {
drop_reason = SKB_DROP_REASON_TAP_TXFILTER ;
2008-07-14 22:18:19 -07:00
goto drop ;
2022-03-04 06:55:07 -08:00
}
2008-07-14 22:18:19 -07:00
2012-10-31 19:45:57 +00:00
if ( tfile - > socket . sk - > sk_filter & &
2022-03-04 06:55:07 -08:00
sk_filter ( tfile - > socket . sk , skb ) ) {
drop_reason = SKB_DROP_REASON_SOCKET_FILTER ;
2010-02-14 01:01:10 +00:00
goto drop ;
2022-03-04 06:55:07 -08:00
}
2010-02-14 01:01:10 +00:00
2018-01-16 16:31:02 +08:00
len = run_ebpf_filter ( tun , skb , len ) ;
2022-03-04 06:55:07 -08:00
if ( len = = 0 ) {
drop_reason = SKB_DROP_REASON_TAP_FILTER ;
2022-03-04 06:55:06 -08:00
goto drop ;
2022-03-04 06:55:07 -08:00
}
2022-03-04 06:55:06 -08:00
2022-03-04 06:55:07 -08:00
if ( pskb_trim ( skb , len ) ) {
drop_reason = SKB_DROP_REASON_NOMEM ;
2018-01-16 16:31:02 +08:00
goto drop ;
2022-03-04 06:55:07 -08:00
}
2018-01-16 16:31:02 +08:00
2022-03-04 06:55:07 -08:00
if ( unlikely ( skb_orphan_frags_rx ( skb , GFP_ATOMIC ) ) ) {
drop_reason = SKB_DROP_REASON_SKB_UCOPY_FAULT ;
2013-09-05 17:54:00 +08:00
goto drop ;
2022-03-04 06:55:07 -08:00
}
2013-09-05 17:54:00 +08:00
2016-08-23 18:22:33 -04:00
skb_tx_timestamp ( skb ) ;
2013-07-19 19:40:10 +02:00
tun: orphan an skb on tx
The following situation was observed in the field:
tap1 sends packets, tap2 does not consume them, as a result
tap1 can not be closed. This happens because
tun/tap devices can hang on to skbs indefinitely.
As noted by Herbert, possible solutions include a timeout followed by a
copy/change of ownership of the skb, or always copying/changing
ownership if we're going into a hostile device.
This patch implements the second approach.
Note: one issue still remaining is that since skbs
keep reference to tun socket and tun socket has a
reference to tun device, we won't flush backlog,
instead simply waiting for all skbs to get transmitted.
At least this is not user-triggerable, and
this was not reported in practice, my assumption is
other devices besides tap complete an skb
within finite time after it has been queued.
A possible solution for the second issue
would be not to have the socket reference the device,
instead, implement dev->destructor for tun, and
wait for all skbs to complete there, but this
needs some thought, probably too risky for 2.6.34.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-13 04:59:44 +00:00
/* Orphan the skb - required as we might hang on to it
2013-09-05 17:54:00 +08:00
* for indefinite time .
*/
2010-04-13 04:59:44 +00:00
skb_orphan ( skb ) ;
2019-09-29 20:54:03 +02:00
nf_reset_ct ( skb ) ;
2013-03-06 11:02:37 +00:00
2022-03-04 06:55:07 -08:00
if ( ptr_ring_produce ( & tfile - > tx_ring , skb ) ) {
drop_reason = SKB_DROP_REASON_FULL_RING ;
2016-06-30 14:45:36 +08:00
goto drop ;
2022-03-04 06:55:07 -08:00
}
2005-04-16 15:20:36 -07:00
2021-11-12 08:56:03 +01:00
/* With NETIF_F_LLTX we must update trans_start ourselves */
queue = netdev_get_tx_queue ( dev , txq ) ;
2022-04-12 15:58:52 +02:00
txq_trans_cond_update ( queue ) ;
2021-11-12 08:56:03 +01:00
2005-04-16 15:20:36 -07:00
/* Notify and wake up reader process */
2012-10-31 19:45:57 +00:00
if ( tfile - > flags & TUN_FASYNC )
kill_fasync ( & tfile - > fasync , SIGIO , POLL_IN ) ;
2014-05-16 15:11:48 -07:00
tfile - > socket . sk - > sk_data_ready ( tfile - > socket . sk ) ;
2012-10-31 19:45:58 +00:00
rcu_read_unlock ( ) ;
2009-06-23 06:03:08 +00:00
return NETDEV_TX_OK ;
2005-04-16 15:20:36 -07:00
drop :
2022-03-10 21:14:20 -08:00
dev_core_stats_tx_dropped_inc ( dev ) ;
2012-11-01 09:16:32 +00:00
skb_tx_error ( skb ) ;
2022-03-04 06:55:07 -08:00
kfree_skb_reason ( skb , drop_reason ) ;
2012-10-31 19:45:58 +00:00
rcu_read_unlock ( ) ;
2014-11-18 13:20:41 +08:00
return NET_XMIT_DROP ;
2005-04-16 15:20:36 -07:00
}
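The drop on a full ring in tun_net_xmit() can be illustrated in user space. The following is a minimal sketch under stated assumptions, not the kernel's ptr_ring: struct ring, RING_SIZE, ring_produce() and ring_consume() are hypothetical names, there is no locking, and a NULL slot is taken to mean "empty", so a non-NULL slot at the producer index means the ring is full.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define RING_SIZE 4 /* hypothetical capacity, chosen for illustration */

/* Single-threaded pointer ring: a NULL slot is empty, non-NULL is occupied. */
struct ring {
	void *slot[RING_SIZE];
	size_t producer; /* next slot to fill */
	size_t consumer; /* next slot to drain */
};

/* Returns 0 on success, -ENOSPC when the ring is full (caller drops). */
static int ring_produce(struct ring *r, void *ptr)
{
	assert(ptr != NULL); /* NULL is reserved as the "empty" marker */
	if (r->slot[r->producer])
		return -ENOSPC;
	r->slot[r->producer] = ptr;
	r->producer = (r->producer + 1) % RING_SIZE;
	return 0;
}

/* Returns the oldest queued pointer, or NULL when the ring is empty. */
static void *ring_consume(struct ring *r)
{
	void *ptr = r->slot[r->consumer];

	if (ptr) {
		r->slot[r->consumer] = NULL;
		r->consumer = (r->consumer + 1) % RING_SIZE;
	}
	return ptr;
}
```

A producer that gets a nonzero return keeps ownership of the pointer and has to free or account for it itself, which is the role of the drop path in the function above.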
2008-07-14 22:18:19 -07:00
static void tun_net_mclist ( struct net_device * dev )
2005-04-16 15:20:36 -07:00
{
2008-07-14 22:18:19 -07:00
/*
* This callback is supposed to deal with mc filter in
* _rx_ path and has nothing to do with the _tx_ path .
* In rx path we always accept everything userspace gives us .
*/
2005-04-16 15:20:36 -07:00
}
2011-11-15 15:29:55 +00:00
static netdev_features_t tun_net_fix_features ( struct net_device * dev ,
netdev_features_t features )
2011-04-19 06:13:10 +00:00
{
struct tun_struct * tun = netdev_priv ( dev ) ;
return ( features & tun - > set_features ) | ( features & ~ TUN_USER_FEATURES ) ;
}
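The masking in tun_net_fix_features() can be tried out in user space. The following is a minimal sketch, not driver code: the F_* bit values and USER_FEATURES are hypothetical stand-ins for the kernel's NETIF_F_* flags and TUN_USER_FEATURES, and fix_features() is an illustrative name. The assert is a precondition of this sketch only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits; the values are illustrative only. */
#define F_SG   (1u << 0) /* not controlled by userspace */
#define F_CSUM (1u << 1) /* controlled by userspace */
#define F_TSO  (1u << 2) /* controlled by userspace */
#define USER_FEATURES (F_CSUM | F_TSO)

/*
 * Same shape as the expression in tun_net_fix_features():
 * a user-controlled bit survives only if it is also in set_features,
 * while every other bit passes through unchanged.
 */
static uint32_t fix_features(uint32_t features, uint32_t set_features)
{
	/* Sketch-only assumption: userspace can enable only user bits. */
	assert((set_features & ~USER_FEATURES) == 0);
	return (features & set_features) | (features & ~USER_FEATURES);
}
```

With features F_SG | F_CSUM | F_TSO requested and only F_CSUM enabled by userspace, the result keeps F_SG and F_CSUM and clears F_TSO.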
2016-02-26 10:45:40 +01:00
static void tun_set_headroom ( struct net_device * dev , int new_hr )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
if ( new_hr < NET_SKB_PAD )
new_hr = NET_SKB_PAD ;
tun - > align = new_hr ;
}
2017-01-06 19:12:52 -08:00
static void
2016-04-13 10:52:20 +02:00
tun_net_get_stats64 ( struct net_device * dev , struct rtnl_link_stats64 * stats )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
2020-11-07 21:50:56 +01:00
dev_get_tstats64 ( dev , stats ) ;
2016-04-13 10:52:20 +02:00
2020-11-07 21:50:56 +01:00
stats - > rx_frame_errors + =
( unsigned long ) atomic_long_read ( & tun - > rx_frame_errors ) ;
2016-04-13 10:52:20 +02:00
}
2017-08-11 19:41:18 +08:00
static int tun_xdp_set ( struct net_device * dev , struct bpf_prog * prog ,
struct netlink_ext_ack * extack )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
2018-09-12 11:16:59 +08:00
struct tun_file * tfile ;
2017-08-11 19:41:18 +08:00
struct bpf_prog * old_prog ;
2018-09-12 11:16:59 +08:00
int i ;
2017-08-11 19:41:18 +08:00
old_prog = rtnl_dereference ( tun - > xdp_prog ) ;
rcu_assign_pointer ( tun - > xdp_prog , prog ) ;
if ( old_prog )
bpf_prog_put ( old_prog ) ;
2018-09-12 11:16:59 +08:00
for ( i = 0 ; i < tun - > numqueues ; i + + ) {
tfile = rtnl_dereference ( tun - > tfiles [ i ] ) ;
if ( prog )
sock_set_flag ( & tfile - > sk , SOCK_XDP ) ;
else
sock_reset_flag ( & tfile - > sk , SOCK_XDP ) ;
}
list_for_each_entry ( tfile , & tun - > disabled , next ) {
if ( prog )
sock_set_flag ( & tfile - > sk , SOCK_XDP ) ;
else
sock_reset_flag ( & tfile - > sk , SOCK_XDP ) ;
}
2017-08-11 19:41:18 +08:00
return 0 ;
}
2017-11-03 13:56:16 -07:00
static int tun_xdp ( struct net_device * dev , struct netdev_bpf * xdp )
2017-08-11 19:41:18 +08:00
{
switch ( xdp - > command ) {
case XDP_SETUP_PROG :
return tun_xdp_set ( dev , xdp - > prog , xdp - > extack ) ;
default :
return - EINVAL ;
}
}
2018-11-28 19:12:56 +01:00
static int tun_net_change_carrier ( struct net_device * dev , bool new_carrier )
{
if ( new_carrier ) {
struct tun_struct * tun = netdev_priv ( dev ) ;
if ( ! tun - > numqueues )
return - EPERM ;
netif_carrier_on ( dev ) ;
} else {
netif_carrier_off ( dev ) ;
}
return 0 ;
}
2008-11-19 22:10:37 -08:00
static const struct net_device_ops tun_netdev_ops = {
2021-12-16 13:25:32 -05:00
. ndo_init = tun_net_init ,
2009-01-20 11:07:17 +00:00
. ndo_uninit = tun_net_uninit ,
2008-11-19 22:10:37 -08:00
. ndo_open = tun_net_open ,
. ndo_stop = tun_net_close ,
2008-11-20 20:14:53 -08:00
. ndo_start_xmit = tun_net_xmit ,
2011-04-19 06:13:10 +00:00
. ndo_fix_features = tun_net_fix_features ,
2012-10-31 19:46:00 +00:00
. ndo_select_queue = tun_select_queue ,
2016-02-26 10:45:40 +01:00
. ndo_set_rx_headroom = tun_set_headroom ,
2016-04-13 10:52:20 +02:00
. ndo_get_stats64 = tun_net_get_stats64 ,
2018-11-28 19:12:56 +01:00
. ndo_change_carrier = tun_net_change_carrier ,
2008-11-19 22:10:37 -08:00
} ;
2018-05-31 11:00:03 +02:00
static void __tun_xdp_flush_tfile ( struct tun_file * tfile )
{
/* Notify and wake up reader process */
if ( tfile - > flags & TUN_FASYNC )
kill_fasync ( & tfile - > fasync , SIGIO , POLL_IN ) ;
tfile - > socket . sk - > sk_data_ready ( tfile - > socket . sk ) ;
}
2018-05-31 10:59:47 +02:00
static int tun_xdp_xmit ( struct net_device * dev , int n ,
struct xdp_frame * * frames , u32 flags )
2018-01-04 11:14:28 +08:00
{
struct tun_struct * tun = netdev_priv ( dev ) ;
struct tun_file * tfile ;
u32 numqueues ;
2021-03-08 12:06:58 +01:00
int nxmit = 0 ;
xdp: change ndo_xdp_xmit API to support bulking
This patch changes the API for ndo_xdp_xmit to support bulking
xdp_frames.
When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown.
Most of the slowdown is caused by DMA API indirect function calls, but
also the net_device->ndo_xdp_xmit() call.
Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with
single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed
performance improved:
for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps
With frames avail as a bulk inside the driver ndo_xdp_xmit call,
further optimizations are possible, like bulk DMA-mapping for TX.
Testing without CONFIG_RETPOLINE shows the same performance for
physical NIC drivers.
The virtual NIC driver tun sees a huge performance boost, as it can
avoid doing per frame producer locking, but instead amortize the
locking cost over the bulk.
V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
V4: Isolated ndo, driver changes and callers.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24 16:46:12 +02:00
int i ;
2018-01-04 11:14:28 +08:00
2018-05-31 11:00:03 +02:00
if ( unlikely ( flags & ~ XDP_XMIT_FLAGS_MASK ) )
2018-05-31 10:59:47 +02:00
return - EINVAL ;
2018-01-04 11:14:28 +08:00
rcu_read_lock ( ) ;
2019-05-08 23:20:18 -04:00
resample :
2018-01-04 11:14:28 +08:00
numqueues = READ_ONCE ( tun - > numqueues ) ;
if ( ! numqueues ) {
2018-05-24 16:46:12 +02:00
rcu_read_unlock ( ) ;
return - ENXIO ; /* Caller will free/return all frames */
2018-01-04 11:14:28 +08:00
}
tfile = rcu_dereference ( tun - > tfiles [ smp_processor_id ( ) %
numqueues ] ) ;
2019-05-08 23:20:18 -04:00
if ( unlikely ( ! tfile ) )
goto resample ;
2018-05-24 16:46:12 +02:00
spin_lock ( & tfile - > tx_ring . producer_lock ) ;
for ( i = 0 ; i < n ; i + + ) {
struct xdp_frame * xdp = frames [ i ] ;
/* Set the XDP flag in the lowest pointer bit so the consumer can
 * tell an XDP frame from an sk_buff.
 */
void * frame = tun_xdp_to_ptr ( xdp ) ;
if ( __ptr_ring_produce ( & tfile - > tx_ring , frame ) ) {
2022-03-10 21:14:20 -08:00
dev_core_stats_tx_dropped_inc ( dev ) ;
2021-03-08 12:06:58 +01:00
break ;
2018-05-24 16:46:12 +02:00
}
2021-03-08 12:06:58 +01:00
nxmit + + ;
2018-01-04 11:14:28 +08:00
}
2018-05-24 16:46:12 +02:00
spin_unlock ( & tfile - > tx_ring . producer_lock ) ;
2018-01-04 11:14:28 +08:00
2018-05-31 11:00:03 +02:00
if ( flags & XDP_XMIT_FLUSH )
__tun_xdp_flush_tfile ( tfile ) ;
2018-01-04 11:14:28 +08:00
rcu_read_unlock ( ) ;
2021-03-08 12:06:58 +01:00
return nxmit ;
2018-01-04 11:14:28 +08:00
}
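The low-bit encoding mentioned in the comment inside tun_xdp_xmit() can be sketched in user space. This is an illustration under stated assumptions, not the driver's helpers: XDP_FLAG, tag_xdp(), is_xdp() and untag() are hypothetical names, and the scheme only works when the tagged objects are at least 2-byte aligned so that bit 0 of their address is always clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XDP_FLAG ((uintptr_t)1) /* hypothetical name for the tag bit */

/* Mark a pointer as "XDP frame" by setting its lowest bit. */
static void *tag_xdp(void *p)
{
	/* The tag bit must be free, i.e. the object must be aligned. */
	assert(((uintptr_t)p & XDP_FLAG) == 0);
	return (void *)((uintptr_t)p | XDP_FLAG);
}

/* Consumer side: true if the ring entry carries the XDP tag. */
static bool is_xdp(const void *p)
{
	return ((uintptr_t)p & XDP_FLAG) != 0;
}

/* Recover the original pointer by clearing the tag bit. */
static void *untag(void *p)
{
	return (void *)((uintptr_t)p & ~XDP_FLAG);
}
```

Because both kinds of object travel through the same ring as plain void pointers, the tag lets the consumer pick the right type without a separate side channel.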
2018-04-17 16:46:37 +02:00
static int tun_xdp_tx ( struct net_device * dev , struct xdp_buff * xdp )
{
2020-05-28 22:47:29 +02:00
struct xdp_frame * frame = xdp_convert_buff_to_frame ( xdp ) ;
2021-03-08 12:06:58 +01:00
int nxmit ;
2018-04-17 16:46:37 +02:00
if ( unlikely ( ! frame ) )
return - EOVERFLOW ;
2021-03-08 12:06:58 +01:00
nxmit = tun_xdp_xmit ( dev , 1 , & frame , XDP_XMIT_FLUSH ) ;
if ( ! nxmit )
xdp_return_frame_rx_napi ( frame ) ;
return nxmit ;
2018-01-04 11:14:28 +08:00
}
2008-11-19 22:10:37 -08:00
static const struct net_device_ops tap_netdev_ops = {
2021-12-16 13:25:32 -05:00
. ndo_init = tun_net_init ,
2009-01-20 11:07:17 +00:00
. ndo_uninit = tun_net_uninit ,
2008-11-19 22:10:37 -08:00
. ndo_open = tun_net_open ,
. ndo_stop = tun_net_close ,
2008-11-20 20:14:53 -08:00
. ndo_start_xmit = tun_net_xmit ,
2011-04-19 06:13:10 +00:00
. ndo_fix_features = tun_net_fix_features ,
2011-08-16 06:29:01 +00:00
. ndo_set_rx_mode = tun_net_mclist ,
2008-11-19 22:10:37 -08:00
. ndo_set_mac_address = eth_mac_addr ,
. ndo_validate_addr = eth_validate_addr ,
2012-10-31 19:46:00 +00:00
. ndo_select_queue = tun_select_queue ,
2015-07-31 15:03:27 +09:00
. ndo_features_check = passthru_features_check ,
2016-02-26 10:45:40 +01:00
. ndo_set_rx_headroom = tun_set_headroom ,
2020-11-07 21:50:56 +01:00
. ndo_get_stats64 = dev_get_tstats64 ,
2017-11-03 13:56:16 -07:00
. ndo_bpf = tun_xdp ,
2018-01-04 11:14:28 +08:00
. ndo_xdp_xmit = tun_xdp_xmit ,
2018-11-28 19:12:56 +01:00
. ndo_change_carrier = tun_net_change_carrier ,
2008-11-19 22:10:37 -08:00
} ;
2013-06-11 17:01:08 +04:00
static void tun_flow_init ( struct tun_struct * tun )
2012-10-31 19:46:02 +00:00
{
int i ;
for ( i = 0 ; i < TUN_NUM_FLOW_ENTRIES ; i + + )
INIT_HLIST_HEAD ( & tun - > flows [ i ] ) ;
tun - > ageing_time = TUN_FLOW_EXPIRE ;
treewide: setup_timer() -> timer_setup()
This converts all remaining cases of the old setup_timer() API into using
timer_setup(), where the callback argument is the structure already
holding the struct timer_list. These should have no behavioral changes,
since they just change which pointer is passed into the callback with
the same available pointers after conversion. It handles the following
examples, in addition to some other variations.
Casting from unsigned long:
void my_callback(unsigned long data)
{
struct something *ptr = (struct something *)data;
...
}
...
setup_timer(&ptr->my_timer, my_callback, ptr);
and forced object casts:
void my_callback(struct something *ptr)
{
...
}
...
setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);
become:
void my_callback(struct timer_list *t)
{
struct something *ptr = from_timer(ptr, t, my_timer);
...
}
...
timer_setup(&ptr->my_timer, my_callback, 0);
Direct function assignments:
void my_callback(unsigned long data)
{
struct something *ptr = (struct something *)data;
...
}
...
ptr->my_timer.function = my_callback;
have a temporary cast added, along with converting the args:
void my_callback(struct timer_list *t)
{
struct something *ptr = from_timer(ptr, t, my_timer);
...
}
...
ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;
And finally, callbacks without a data assignment:
void my_callback(unsigned long data)
{
...
}
...
setup_timer(&ptr->my_timer, my_callback, 0);
have their argument renamed to verify they're unused during conversion:
void my_callback(struct timer_list *unused)
{
...
}
...
timer_setup(&ptr->my_timer, my_callback, 0);
The conversion is done with the following Coccinelle script:
spatch --very-quiet --all-includes --include-headers \
-I ./arch/x86/include -I ./arch/x86/include/generated \
-I ./include -I ./arch/x86/include/uapi \
-I ./arch/x86/include/generated/uapi -I ./include/uapi \
-I ./include/generated/uapi --include ./include/linux/kconfig.h \
--dir . \
--cocci-file ~/src/data/timer_setup.cocci
@fix_address_of@
expression e;
@@
setup_timer(
-&(e)
+&e
, ...)
// Update any raw setup_timer() usages that have a NULL callback, but
// would otherwise match change_timer_function_usage, since the latter
// will update all function assignments done in the face of a NULL
// function initialization in setup_timer().
@change_timer_function_usage_NULL@
expression _E;
identifier _timer;
type _cast_data;
@@
(
-setup_timer(&_E->_timer, NULL, _E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E->_timer, NULL, (_cast_data)_E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, &_E);
+timer_setup(&_E._timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, (_cast_data)&_E);
+timer_setup(&_E._timer, NULL, 0);
)
@change_timer_function_usage@
expression _E;
identifier _timer;
struct timer_list _stl;
identifier _callback;
type _cast_func, _cast_data;
@@
(
-setup_timer(&_E->_timer, _callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
_E->_timer@_stl.function = _callback;
|
_E->_timer@_stl.function = &_callback;
|
_E->_timer@_stl.function = (_cast_func)_callback;
|
_E->_timer@_stl.function = (_cast_func)&_callback;
|
_E._timer@_stl.function = _callback;
|
_E._timer@_stl.function = &_callback;
|
_E._timer@_stl.function = (_cast_func)_callback;
|
_E._timer@_stl.function = (_cast_func)&_callback;
)
// callback(unsigned long arg)
@change_callback_handle_cast
depends on change_timer_function_usage@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
identifier _handle;
@@
void _callback(
-_origtype _origarg
+struct timer_list *t
)
{
(
... when != _origarg
_handletype *_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
|
... when != _origarg
_handletype *_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
|
... when != _origarg
_handletype *_handle;
... when != _handle
_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
|
... when != _origarg
_handletype *_handle;
... when != _handle
_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
... when != _origarg
)
}
// callback(unsigned long arg) without existing variable
@change_callback_handle_cast_no_arg
depends on change_timer_function_usage &&
!change_callback_handle_cast@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
@@
void _callback(
-_origtype _origarg
+struct timer_list *t
)
{
+ _handletype *_origarg = from_timer(_origarg, t, _timer);
+
... when != _origarg
- (_handletype *)_origarg
+ _origarg
... when != _origarg
}
// Avoid already converted callbacks.
@match_callback_converted
depends on change_timer_function_usage &&
!change_callback_handle_cast &&
!change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier t;
@@
void _callback(struct timer_list *t)
{ ... }
// callback(struct something *handle)
@change_callback_handle_arg
depends on change_timer_function_usage &&
!match_callback_converted &&
!change_callback_handle_cast &&
!change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
@@
void _callback(
-_handletype *_handle
+struct timer_list *t
)
{
+ _handletype *_handle = from_timer(_handle, t, _timer);
...
}
// If change_callback_handle_arg ran on an empty function, remove
// the added handler.
@unchange_callback_handle_arg
depends on change_timer_function_usage &&
change_callback_handle_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
identifier t;
@@
void _callback(struct timer_list *t)
{
- _handletype *_handle = from_timer(_handle, t, _timer);
}
// We only want to refactor the setup_timer() data argument if we've found
// the matching callback. This undoes changes in change_timer_function_usage.
@unchange_timer_function_usage
depends on change_timer_function_usage &&
!change_callback_handle_cast &&
!change_callback_handle_cast_no_arg &&
!change_callback_handle_arg@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type change_timer_function_usage._cast_data;
@@
(
-timer_setup(&_E->_timer, _callback, 0);
+setup_timer(&_E->_timer, _callback, (_cast_data)_E);
|
-timer_setup(&_E._timer, _callback, 0);
+setup_timer(&_E._timer, _callback, (_cast_data)&_E);
)
// If we fixed a callback from a .function assignment, fix the
// assignment cast now.
@change_timer_function_assignment
depends on change_timer_function_usage &&
(change_callback_handle_cast ||
change_callback_handle_cast_no_arg ||
change_callback_handle_arg)@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_func;
typedef TIMER_FUNC_TYPE;
@@
(
_E->_timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E->_timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E->_timer.function =
-(_cast_func)_callback;
+(TIMER_FUNC_TYPE)_callback
;
|
_E->_timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-&_callback;
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
;
|
_E._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
;
)
// Sometimes timer functions are called directly. Replace matched args.
@change_timer_function_calls
depends on change_timer_function_usage &&
(change_callback_handle_cast ||
change_callback_handle_cast_no_arg ||
change_callback_handle_arg)@
expression _E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_data;
@@
_callback(
(
-(_cast_data)_E
+&_E->_timer
|
-(_cast_data)&_E
+&_E._timer
|
-_E
+&_E->_timer
)
)
// If a timer has been configured without a data argument, it can be
// converted without regard to the callback argument, since it is unused.
@match_timer_function_unused_data@
expression _E;
identifier _timer;
identifier _callback;
@@
(
-setup_timer(&_E->_timer, _callback, 0);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0L);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0UL);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0L);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0UL);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0L);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0UL);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0L);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0UL);
+timer_setup(_timer, _callback, 0);
)
@change_callback_unused_data
depends on match_timer_function_unused_data@
identifier match_timer_function_unused_data._callback;
type _origtype;
identifier _origarg;
@@
void _callback(
-_origtype _origarg
+struct timer_list *unused
)
{
... when != _origarg
}
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-10-16 14:43:17 -07:00
timer_setup ( & tun - > flow_gc_timer , tun_flow_cleanup , 0 ) ;
mod_timer ( & tun - > flow_gc_timer ,
round_jiffies_up ( jiffies + tun - > ageing_time ) ) ;
2012-10-31 19:46:02 +00:00
}
static void tun_flow_uninit ( struct tun_struct * tun )
{
del_timer_sync ( & tun - > flow_gc_timer ) ;
tun_flow_flush ( tun ) ;
}
net: use core MTU range checking in core net infra
geneve:
- Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
- This one isn't quite as straight-forward as others, could use some
closer inspection and testing
macvlan:
- set min/max_mtu
tun:
- set min/max_mtu, remove tun_net_change_mtu
vxlan:
- Merge __vxlan_change_mtu back into vxlan_change_mtu
- Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
change_mtu function
- This one is also not as straight-forward and could use closer inspection
and testing from vxlan folks
bridge:
- set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
change_mtu function
openvswitch:
- set min/max_mtu, remove internal_dev_change_mtu
- note: max_mtu wasn't checked previously, it's been set to 65535, which
is the largest possible size supported
sch_teql:
- set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)
macsec:
- min_mtu = 0, max_mtu = 65535
macvlan:
- min_mtu = 0, max_mtu = 65535
ntb_netdev:
- min_mtu = 0, max_mtu = 65535
veth:
- min_mtu = 68, max_mtu = 65535
8021q:
- min_mtu = 0, max_mtu = 65535
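The idea of a core MTU range check can be sketched in userspace as follows. This is an illustration under assumptions, not the kernel implementation: the `sketch_` names are hypothetical, and treating `max_mtu == 0` as "no upper bound" is an assumption of this sketch. The tun range shown (68 up to 65535 minus the link-layer header length) follows what the driver sets in tun_net_initialize().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct net_device. */
struct sketch_netdev {
	unsigned int min_mtu;
	unsigned int max_mtu;
	unsigned int hard_header_len;
};

/* Range check done once in common code instead of in every driver.
 * Returns 0 when new_mtu is acceptable, -EINVAL otherwise. */
int sketch_validate_mtu(const struct sketch_netdev *dev, int new_mtu)
{
	assert(dev != NULL);
	if (new_mtu < 0 || (unsigned int)new_mtu < dev->min_mtu)
		return -EINVAL;
	/* Assumption of this sketch: max_mtu == 0 means "unbounded". */
	if (dev->max_mtu > 0 && (unsigned int)new_mtu > dev->max_mtu)
		return -EINVAL;
	return 0;
}

/* The driver only declares its limits; it no longer needs its own
 * change_mtu handler just to reject out-of-range values. */
void sketch_tun_set_mtu_range(struct sketch_netdev *dev)
{
	dev->min_mtu = 68;
	dev->max_mtu = 65535 - dev->hard_header_len;
}
```

With a 14-byte Ethernet header (the TAP case), the sketch accepts MTUs from 68 through 65521 and rejects anything outside that range.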
CC: netdev@vger.kernel.org
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Tom Herbert <tom@herbertland.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Jiri Benc <jbenc@redhat.com>
CC: WANG Cong <xiyou.wangcong@gmail.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Pravin B Shelar <pshelar@ovn.org>
CC: Sabrina Dubroca <sd@queasysnail.net>
CC: Patrick McHardy <kaber@trash.net>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Pravin Shelar <pshelar@nicira.com>
CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 13:55:20 -04:00
# define MIN_MTU 68
# define MAX_MTU 65535
2005-04-16 15:20:36 -07:00
/* Initialize net device. */
2021-12-16 13:25:32 -05:00
static void tun_net_initialize ( struct net_device * dev )
2005-04-16 15:20:36 -07:00
{
struct tun_struct * tun = netdev_priv ( dev ) ;
2006-09-13 13:24:59 -04:00
2005-04-16 15:20:36 -07:00
switch ( tun - > flags & TUN_TYPE_MASK ) {
2014-11-19 15:17:31 +02:00
case IFF_TUN :
2008-11-19 22:10:37 -08:00
dev - > netdev_ops = & tun_netdev_ops ;
2020-06-29 19:06:22 -06:00
dev - > header_ops = & ip_tunnel_header_ops ;
2008-11-19 22:10:37 -08:00
2005-04-16 15:20:36 -07:00
/* Point-to-Point TUN Device */
dev - > hard_header_len = 0 ;
dev - > addr_len = 0 ;
dev - > mtu = 1500 ;
/* Zero header length */
2006-09-13 13:24:59 -04:00
dev - > type = ARPHRD_NONE ;
2005-04-16 15:20:36 -07:00
dev - > flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST ;
break ;
2014-11-19 15:17:31 +02:00
case IFF_TAP :
2008-12-29 18:23:28 -08:00
dev - > netdev_ops = & tap_netdev_ops ;
2005-04-16 15:20:36 -07:00
/* Ethernet TAP Device */
ether_setup ( dev ) ;
2011-07-26 06:05:38 +00:00
dev - > priv_flags & = ~ IFF_TX_SKB_SHARING ;
2012-12-10 15:16:00 +00:00
dev - > priv_flags | = IFF_LIVE_ADDR_CHANGE ;
2007-04-26 01:00:55 -07:00
2012-02-15 06:45:39 +00:00
eth_hw_addr_random ( dev ) ;
2007-04-26 01:00:55 -07:00
drivers: net: turn on XDP features
A summary of the flags being set for various drivers is given below.
Note that XDP_F_REDIRECT_TARGET and XDP_F_FRAG_TARGET are features
that can be turned off and on at runtime. This means that these flags
may be set and unset under RTNL lock protection by the driver. Hence,
READ_ONCE must be used by code loading the flag value.
Also, these flags are not used for synchronization against the availability
of XDP resources on a device. They are merely a hint, and hence the read
may race with the actual teardown of XDP resources on the device. This
may change in the future, e.g. operations taking a reference on the XDP
resources of the driver, and in turn inhibiting turning off this flag.
However, for now, they can only be used as a hint to check whether a device
supports becoming a redirection target.
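The "hint" semantics described above can be pictured with a small userspace sketch. It uses C11 relaxed atomics as a rough analogue of READ_ONCE; the `sketch_` names and the bit value are hypothetical, and in the kernel the writer side is serialized by the RTNL lock rather than by an atomic read-modify-write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit standing in for a runtime-togglable XDP feature. */
#define SKETCH_XDP_F_REDIRECT_TARGET (1u << 0)

/* Hypothetical, reduced stand-in for a net device. */
struct sketch_dev {
	_Atomic uint32_t xdp_features;
};

/* Writer side: turn one feature bit on or off. */
void sketch_set_feature(struct sketch_dev *dev, uint32_t flag, bool on)
{
	assert(dev != NULL);
	if (on)
		atomic_fetch_or_explicit(&dev->xdp_features, flag,
					 memory_order_relaxed);
	else
		atomic_fetch_and_explicit(&dev->xdp_features, ~flag,
					  memory_order_relaxed);
}

/* Reader side: a single untorn load. The answer is only a hint; the
 * flag may change right after the load, so callers must still handle
 * failure of the actual redirect. */
bool sketch_may_redirect_to(struct sketch_dev *dev)
{
	uint32_t features;

	assert(dev != NULL);
	features = atomic_load_explicit(&dev->xdp_features,
					memory_order_relaxed);
	return (features & SKETCH_XDP_F_REDIRECT_TARGET) != 0;
}
```

The relaxed ordering is deliberate: nothing is being synchronized through the flag, so the reader only needs a load that cannot be torn or re-read by the compiler.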
Turn 'hw-offload' feature flag on for:
- netronome (nfp)
- netdevsim.
Turn 'native' and 'zerocopy' feature flags on for:
- intel (i40e, ice, ixgbe, igc)
- mellanox (mlx5).
- stmmac
- netronome (nfp)
Turn 'native' feature flags on for:
- amazon (ena)
- broadcom (bnxt)
- freescale (dpaa, dpaa2, enetc)
- funeth
- intel (igb)
- marvell (mvneta, mvpp2, octeontx2)
- mellanox (mlx4)
- mtk_eth_soc
- qlogic (qede)
- sfc
- socionext (netsec)
- ti (cpsw)
- tap
- tsnep
- veth
- xen
- virtio_net.
Turn 'basic' (tx, pass, aborted and drop) feature flags on for:
- netronome (nfp)
- cavium (thunder)
- hyperv.
Turn 'redirect_target' feature flag on for:
- amazon (ena)
- broadcom (bnxt)
- freescale (dpaa, dpaa2)
- intel (i40e, ice, igb, ixgbe)
- ti (cpsw)
- marvell (mvneta, mvpp2)
- sfc
- socionext (netsec)
- qlogic (qede)
- mellanox (mlx5)
- tap
- veth
- virtio_net
- xen
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Marek Majtyka <alardam@gmail.com>
Link: https://lore.kernel.org/r/3eca9fafb308462f7edb1f58e451d59209aa07eb.1675245258.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-01 11:24:18 +01:00
/* Currently tun does not support XDP, only tap does. */
dev - > xdp_features = NETDEV_XDP_ACT_BASIC |
NETDEV_XDP_ACT_REDIRECT |
NETDEV_XDP_ACT_NDO_XMIT ;
2005-04-16 15:20:36 -07:00
break ;
}
2016-10-20 13:55:20 -04:00
dev - > min_mtu = MIN_MTU ;
dev - > max_mtu = MAX_MTU - dev - > hard_header_len ;
2005-04-16 15:20:36 -07:00
}
2018-05-22 14:21:04 +08:00
static bool tun_sock_writeable ( struct tun_struct * tun , struct tun_file * tfile )
{
struct sock * sk = tfile - > socket . sk ;
return ( tun - > dev - > flags & IFF_UP ) & & sock_writeable ( sk ) ;
}
2005-04-16 15:20:36 -07:00
/* Character device part */
/* Poll */
2017-07-03 06:39:46 -04:00
static __poll_t tun_chr_poll ( struct file * file , poll_table * wait )
2006-09-13 13:24:59 -04:00
{
2009-01-20 11:03:21 +00:00
struct tun_file * tfile = file - > private_data ;
2017-09-23 22:36:52 +08:00
struct tun_struct * tun = tun_get ( tfile ) ;
2009-07-05 19:48:35 +00:00
struct sock * sk ;
2017-07-03 06:39:46 -04:00
__poll_t mask = 0 ;
2005-04-16 15:20:36 -07:00
if ( ! tun )
2018-02-11 14:34:03 -08:00
return EPOLLERR ;
2005-04-16 15:20:36 -07:00
2012-10-31 19:45:57 +00:00
sk = tfile - > socket . sk ;
2009-07-05 19:48:35 +00:00
2014-05-16 15:11:48 -07:00
poll_wait ( file , sk_sleep ( sk ) , wait ) ;
2006-09-13 13:24:59 -04:00
2018-01-04 11:14:27 +08:00
if ( ! ptr_ring_empty ( & tfile - > tx_ring ) )
2018-02-11 14:34:03 -08:00
mask | = EPOLLIN | EPOLLRDNORM ;
2005-04-16 15:20:36 -07:00
2018-05-22 14:21:04 +08:00
/* Make sure SOCKWQ_ASYNC_NOSPACE is set if the socket is not writable,
* so that EPOLLOUT is guaranteed to be raised either here or by
* tun_sock_write_space ( ) . That way a process still gets a notification
* after it writes to a down device and gets - EIO .
*/
if ( tun_sock_writeable ( tun , tfile ) | |
( ! test_and_set_bit ( SOCKWQ_ASYNC_NOSPACE , & sk - > sk_socket - > flags ) & &
tun_sock_writeable ( tun , tfile ) ) )
2018-02-11 14:34:03 -08:00
mask | = EPOLLOUT | EPOLLWRNORM ;
2009-02-05 21:25:32 -08:00
2009-01-20 11:07:17 +00:00
if ( tun - > dev - > reg_state ! = NETREG_REGISTERED )
2018-02-11 14:34:03 -08:00
mask = EPOLLERR ;
2009-01-20 11:07:17 +00:00
2009-01-20 11:00:40 +00:00
tun_put ( tun ) ;
2005-04-16 15:20:36 -07:00
return mask ;
}
2017-09-22 13:49:15 -07:00
static struct sk_buff * tun_napi_alloc_frags ( struct tun_file * tfile ,
size_t len ,
const struct iov_iter * it )
{
struct sk_buff * skb ;
size_t linear ;
int err ;
int i ;
net: tun: fix bugs for oversize packet when napi frags enabled
Recently, we got two syzkaller reports caused by oversized packets
when napi frags is enabled.
In the first, the first segment of the iov_iter from user space is
very large: 2147479538 bytes, which exceeds the threshold at which
__alloc_pages() bails out early. Because skb->pfmemalloc is true,
__kmalloc_reserve() uses the pfmemalloc reserves without the
__GFP_NOWARN flag, so we get the following warning:
========================================================
WARNING: CPU: 1 PID: 17965 at mm/page_alloc.c:5295 __alloc_pages+0x1308/0x16c4 mm/page_alloc.c:5295
...
Call trace:
__alloc_pages+0x1308/0x16c4 mm/page_alloc.c:5295
__alloc_pages_node include/linux/gfp.h:550 [inline]
alloc_pages_node include/linux/gfp.h:564 [inline]
kmalloc_large_node+0x94/0x350 mm/slub.c:4038
__kmalloc_node_track_caller+0x620/0x8e4 mm/slub.c:4545
__kmalloc_reserve.constprop.0+0x1e4/0x2b0 net/core/skbuff.c:151
pskb_expand_head+0x130/0x8b0 net/core/skbuff.c:1654
__skb_grow include/linux/skbuff.h:2779 [inline]
tun_napi_alloc_frags+0x144/0x610 drivers/net/tun.c:1477
tun_get_user+0x31c/0x2010 drivers/net/tun.c:1835
tun_chr_write_iter+0x98/0x100 drivers/net/tun.c:2036
The second is caused by odd IPv6 packets that have no NEXTHDR_NONE
extension header and a large packet length: 2127925 bytes, which is
bigger than ETH_MAX_MTU (65535). After ipv6_gso_pull_exthdrs() in
ipv6_gro_receive(), the network_header and transport_header offsets
are both bigger than U16_MAX. That overflows skb->network_header
and skb->transport_header, because both are of type '__u16'.
This in turn corrupts the value passed to __skb_push(skb, value),
making it very large. After __skb_push() in ipv6_gro_receive(),
skb->data is less than skb->head and an out-of-bounds memory access
occurs, which triggers the following report:
==================================================================
BUG: KASAN: use-after-free in eth_type_trans+0x100/0x260
...
Call trace:
dump_backtrace+0xd8/0x130
show_stack+0x1c/0x50
dump_stack_lvl+0x64/0x7c
print_address_description.constprop.0+0xbc/0x2e8
print_report+0x100/0x1e4
kasan_report+0x80/0x120
__asan_load8+0x78/0xa0
eth_type_trans+0x100/0x260
napi_gro_frags+0x164/0x550
tun_get_user+0xda4/0x1270
tun_chr_write_iter+0x74/0x130
do_iter_readv_writev+0x130/0x1ec
do_iter_write+0xbc/0x1e0
vfs_writev+0x13c/0x26c
To fix the problems, restrict the packet size to at most
(ETH_MAX_MTU - NET_SKB_PAD - NET_IP_ALIGN), which accounts for the skb
space reserved in napi_alloc_skb(), because transport_header is an
offset from skb->head. Simply add a len check in tun_napi_alloc_frags().
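The arithmetic behind the fix can be sketched in userspace. This is an illustration only: the `SKETCH_` constants are stand-ins (the kernel's NET_SKB_PAD and NET_IP_ALIGN depend on the architecture, and the values below are assumed), and the functions are hypothetical. The second helper shows why an offset above U16_MAX is dangerous once it is stored in a 16-bit field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins; the real values come from kernel headers
 * and the two padding constants vary by architecture. */
#define SKETCH_ETH_MAX_MTU  65535u
#define SKETCH_NET_SKB_PAD  64u
#define SKETCH_NET_IP_ALIGN 2u

/* Largest length for which header offsets measured from skb->head
 * still fit in 16 bits after the reserved headroom is added. */
size_t sketch_max_frags_len(void)
{
	assert(SKETCH_NET_SKB_PAD + SKETCH_NET_IP_ALIGN < SKETCH_ETH_MAX_MTU);
	return SKETCH_ETH_MAX_MTU - SKETCH_NET_SKB_PAD - SKETCH_NET_IP_ALIGN;
}

/* Returns 0 if len is acceptable and -1 if it must be rejected
 * (the driver returns -EMSGSIZE in that case). */
int sketch_check_len(size_t len)
{
	return len > sketch_max_frags_len() ? -1 : 0;
}

/* Models storing an offset in a 16-bit header-offset field:
 * only the low 16 bits survive. */
uint16_t sketch_store_offset(size_t offset)
{
	return (uint16_t)offset;
}
```

With the assumed padding values the limit is 65469 bytes. The 2127925-byte length from the report is rejected, and storing that value in a 16-bit field would leave 30773, which is the kind of silent truncation the check is meant to prevent.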
Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221029094101.1653855-1-william.xuanziyang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-29 17:41:01 +08:00
if ( it - > nr_segs > MAX_SKB_FRAGS + 1 | |
len > ( ETH_MAX_MTU - NET_SKB_PAD - NET_IP_ALIGN ) )
2020-12-25 10:52:16 +08:00
return ERR_PTR ( - EMSGSIZE ) ;
2017-09-22 13:49:15 -07:00
local_bh_disable ( ) ;
skb = napi_get_frags ( & tfile - > napi ) ;
local_bh_enable ( ) ;
if ( ! skb )
return ERR_PTR ( - ENOMEM ) ;
linear = iov_iter_single_seg_count ( it ) ;
err = __skb_grow ( skb , linear ) ;
if ( err )
goto free ;
skb - > len = len ;
skb - > data_len = len - linear ;
skb - > truesize + = skb - > data_len ;
for ( i = 1 ; i < it - > nr_segs ; i + + ) {
2023-03-29 08:52:15 -06:00
const struct iovec * iov = iter_iov ( it ) ;
size_t fragsz = iov - > iov_len ;
2018-11-18 07:37:33 -08:00
struct page * page ;
void * frag ;
2017-09-22 13:49:15 -07:00
if ( fragsz = = 0 | | fragsz > PAGE_SIZE ) {
err = - EINVAL ;
goto free ;
}
2018-11-18 07:37:33 -08:00
frag = netdev_alloc_frag ( fragsz ) ;
if ( ! frag ) {
2017-09-22 13:49:15 -07:00
err = - ENOMEM ;
goto free ;
}
2018-11-18 07:37:33 -08:00
page = virt_to_head_page ( frag ) ;
skb_fill_page_desc ( skb , i - 1 , page ,
frag - page_address ( page ) , fragsz ) ;
2017-09-22 13:49:15 -07:00
}
return skb ;
free :
/* frees skb and all frags allocated with netdev_alloc_frag() */
napi_free_frags ( & tfile - > napi ) ;
return ERR_PTR ( err ) ;
}
2008-08-15 15:15:10 -07:00
/* prepad is the amount to reserve at front. len is length after that.
* linear is a hint as to how much to copy ( usually headers ) . */
2012-10-31 19:45:57 +00:00
static struct sk_buff * tun_alloc_skb ( struct tun_file * tfile ,
2011-06-08 14:33:08 +00:00
size_t prepad , size_t len ,
size_t linear , int noblock )
2008-08-15 15:15:10 -07:00
{
2012-10-31 19:45:57 +00:00
struct sock * sk = tfile - > socket . sk ;
2008-08-15 15:15:10 -07:00
struct sk_buff * skb ;
2009-02-05 21:25:32 -08:00
int err ;
2008-08-15 15:15:10 -07:00
/* Under a page? Don't bother with paged skb. */
2023-08-09 09:47:52 -07:00
if ( prepad + len < PAGE_SIZE )
2009-02-05 21:25:32 -08:00
linear = len ;
2008-08-15 15:15:10 -07:00
2023-08-01 20:52:52 +00:00
if ( len - linear > MAX_SKB_FRAGS * ( PAGE_SIZE < < PAGE_ALLOC_COSTLY_ORDER ) )
linear = len - MAX_SKB_FRAGS * ( PAGE_SIZE < < PAGE_ALLOC_COSTLY_ORDER ) ;
2009-02-05 21:25:32 -08:00
skb = sock_alloc_send_pskb ( sk , prepad + linear , len - linear , noblock ,
2023-08-01 20:52:52 +00:00
& err , PAGE_ALLOC_COSTLY_ORDER ) ;
2008-08-15 15:15:10 -07:00
if ( ! skb )
2009-02-05 21:25:32 -08:00
return ERR_PTR ( err ) ;
2008-08-15 15:15:10 -07:00
skb_reserve ( skb , prepad ) ;
skb_put ( skb , linear ) ;
2009-02-05 21:25:32 -08:00
skb - > data_len = len - linear ;
skb - > len + = len - linear ;
2008-08-15 15:15:10 -07:00
return skb ;
}
2017-01-18 15:02:03 +08:00
static void tun_rx_batched ( struct tun_struct * tun , struct tun_file * tfile ,
struct sk_buff * skb , int more )
{
struct sk_buff_head * queue = & tfile - > sk . sk_write_queue ;
struct sk_buff_head process_queue ;
u32 rx_batched = tun - > rx_batched ;
bool rcv = false ;
if ( ! rx_batched | | ( ! more & & skb_queue_empty ( queue ) ) ) {
local_bh_disable ( ) ;
2018-11-18 00:46:00 -07:00
skb_record_rx_queue ( skb , tfile - > queue_index ) ;
2017-01-18 15:02:03 +08:00
netif_receive_skb ( skb ) ;
local_bh_enable ( ) ;
return ;
}
spin_lock ( & queue - > lock ) ;
if ( ! more | | skb_queue_len ( queue ) = = rx_batched ) {
__skb_queue_head_init ( & process_queue ) ;
skb_queue_splice_tail_init ( queue , & process_queue ) ;
rcv = true ;
} else {
__skb_queue_tail ( queue , skb ) ;
}
spin_unlock ( & queue - > lock ) ;
if ( rcv ) {
struct sk_buff * nskb ;
local_bh_disable ( ) ;
2018-11-18 00:46:00 -07:00
while ( ( nskb = __skb_dequeue ( & process_queue ) ) ) {
skb_record_rx_queue ( nskb , tfile - > queue_index ) ;
2017-01-18 15:02:03 +08:00
netif_receive_skb ( nskb ) ;
2018-11-18 00:46:00 -07:00
}
skb_record_rx_queue ( skb , tfile - > queue_index ) ;
2017-01-18 15:02:03 +08:00
netif_receive_skb ( skb ) ;
local_bh_enable ( ) ;
}
}
2017-08-11 19:41:16 +08:00
static bool tun_can_build_skb ( struct tun_struct * tun , struct tun_file * tfile ,
int len , int noblock , bool zerocopy )
{
if ( ( tun - > flags & TUN_TYPE_MASK ) ! = IFF_TAP )
return false ;
if ( tfile - > socket . sk - > sk_sndbuf ! = INT_MAX )
return false ;
if ( ! noblock )
return false ;
if ( zerocopy )
return false ;
2023-08-03 20:59:48 +02:00
if ( SKB_DATA_ALIGN ( len + TUN_RX_PAD + XDP_PACKET_HEADROOM ) +
2017-08-11 19:41:16 +08:00
SKB_DATA_ALIGN ( sizeof ( struct skb_shared_info ) ) > PAGE_SIZE )
return false ;
return true ;
}
2019-07-23 16:23:01 +02:00
static struct sk_buff * __tun_build_skb ( struct tun_file * tfile ,
struct page_frag * alloc_frag , char * buf ,
2018-09-12 11:17:04 +08:00
int buflen , int len , int pad )
2018-09-12 11:17:03 +08:00
{
struct sk_buff * skb = build_skb ( buf , buflen ) ;
if ( ! skb )
return ERR_PTR ( - ENOMEM ) ;
2018-09-12 11:17:04 +08:00
skb_reserve ( skb , pad ) ;
2018-09-12 11:17:03 +08:00
skb_put ( skb , len ) ;
2019-07-23 16:23:01 +02:00
skb_set_owner_w ( skb , tfile - > socket . sk ) ;
2018-09-12 11:17:03 +08:00
get_page ( alloc_frag - > page ) ;
alloc_frag - > offset + = buflen ;
return skb ;
}
2018-09-12 11:17:04 +08:00
static int tun_xdp_act ( struct tun_struct * tun , struct bpf_prog * xdp_prog ,
struct xdp_buff * xdp , u32 act )
{
int err ;
switch ( act ) {
case XDP_REDIRECT :
err = xdp_do_redirect ( tun - > dev , xdp , xdp_prog ) ;
if ( err )
return err ;
break ;
case XDP_TX :
err = tun_xdp_tx ( tun - > dev , xdp ) ;
if ( err < 0 )
return err ;
break ;
case XDP_PASS :
break ;
default :
2021-11-30 11:08:07 +01:00
bpf_warn_invalid_xdp_action ( tun - > dev , xdp_prog , act ) ;
2020-08-23 17:36:59 -05:00
fallthrough ;
2018-09-12 11:17:04 +08:00
case XDP_ABORTED :
trace_xdp_exception ( tun - > dev , xdp_prog , act ) ;
2020-08-23 17:36:59 -05:00
fallthrough ;
2018-09-12 11:17:04 +08:00
case XDP_DROP :
2022-03-10 21:14:20 -08:00
dev_core_stats_rx_dropped_inc ( tun - > dev ) ;
2018-09-12 11:17:04 +08:00
break ;
}
return act ;
}
2017-08-11 19:41:18 +08:00
static struct sk_buff * tun_build_skb ( struct tun_struct * tun ,
struct tun_file * tfile ,
2017-08-11 19:41:16 +08:00
struct iov_iter * from ,
2017-08-11 19:41:18 +08:00
struct virtio_net_hdr * hdr ,
2017-09-04 11:36:09 +08:00
int len , int * skb_xdp )
2017-08-11 19:41:16 +08:00
{
2017-08-16 22:14:33 +08:00
struct page_frag * alloc_frag = & current - > task_frag ;
2017-08-11 19:41:18 +08:00
struct bpf_prog * xdp_prog ;
2017-09-04 11:36:08 +08:00
int buflen = SKB_DATA_ALIGN ( sizeof ( struct skb_shared_info ) ) ;
2017-08-11 19:41:16 +08:00
char * buf ;
size_t copied ;
2018-09-12 11:17:04 +08:00
int pad = TUN_RX_PAD ;
int err = 0 ;
2017-09-04 11:36:08 +08:00
rcu_read_lock ( ) ;
xdp_prog = rcu_dereference ( tun - > xdp_prog ) ;
if ( xdp_prog )
2018-09-12 11:17:00 +08:00
pad + = XDP_PACKET_HEADROOM ;
2017-09-04 11:36:08 +08:00
buflen + = SKB_DATA_ALIGN ( len + pad ) ;
rcu_read_unlock ( ) ;
2017-08-11 19:41:16 +08:00
2017-10-27 11:05:44 +08:00
alloc_frag - > offset = ALIGN ( ( u64 ) alloc_frag - > offset , SMP_CACHE_BYTES ) ;
2017-08-11 19:41:16 +08:00
if ( unlikely ( ! skb_page_frag_refill ( buflen , alloc_frag , GFP_KERNEL ) ) )
return ERR_PTR ( - ENOMEM ) ;
buf = ( char * ) page_address ( alloc_frag - > page ) + alloc_frag - > offset ;
copied = copy_page_from_iter ( alloc_frag - > page ,
2017-09-04 11:36:08 +08:00
alloc_frag - > offset + pad ,
2017-08-11 19:41:16 +08:00
len , from ) ;
if ( copied ! = len )
return ERR_PTR ( - EFAULT ) ;
2017-09-04 11:36:08 +08:00
/* There's a small window in which XDP may be set after the check
* of xdp_prog above . This should be rare , and for simplicity
* we do XDP on the skb in case the headroom is not enough .
*/
2018-09-12 11:17:03 +08:00
if ( hdr - > gso_type | | ! xdp_prog ) {
2017-09-04 11:36:09 +08:00
* skb_xdp = 1 ;
2019-07-23 16:23:01 +02:00
return __tun_build_skb ( tfile , alloc_frag , buf , buflen , len ,
pad ) ;
2018-09-12 11:17:03 +08:00
}
* skb_xdp = 0 ;
2017-08-11 19:41:18 +08:00
2018-05-28 19:37:49 +09:00
local_bh_disable ( ) ;
2017-08-11 19:41:18 +08:00
rcu_read_lock ( ) ;
xdp_prog = rcu_dereference ( tun - > xdp_prog ) ;
2018-09-12 11:17:04 +08:00
if ( xdp_prog ) {
2017-08-11 19:41:18 +08:00
struct xdp_buff xdp ;
u32 act ;
2020-12-22 22:09:28 +01:00
xdp_init_buff ( & xdp , buflen , & tfile - > xdp_rxq ) ;
2020-12-22 22:09:29 +01:00
xdp_prepare_buff ( & xdp , buf , pad , len , false ) ;
2017-08-11 19:41:18 +08:00
2018-09-12 11:17:04 +08:00
act = bpf_prog_run_xdp ( xdp_prog , & xdp ) ;
if ( act = = XDP_REDIRECT | | act = = XDP_TX ) {
2018-03-14 11:23:40 +08:00
get_page ( alloc_frag - > page ) ;
alloc_frag - > offset + = buflen ;
2017-08-11 19:41:18 +08:00
}
2018-09-12 11:17:04 +08:00
err = tun_xdp_act ( tun , xdp_prog , & xdp , act ) ;
2020-04-03 16:13:21 +01:00
if ( err < 0 ) {
if ( act = = XDP_REDIRECT | | act = = XDP_TX )
put_page ( alloc_frag - > page ) ;
goto out ;
}
2018-09-12 11:17:05 +08:00
if ( err = = XDP_REDIRECT )
xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths
Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
we can re-use the bulking for the non-map version of the bpf_redirect()
helper. This is a simple matter of having xdp_do_redirect_slow() queue the
frame on the bulk queue instead of sending it out with __bpf_tx_xdp().
Unfortunately we can't make the bpf_redirect() helper return an error if
the ifindex doesn't exist (as bpf_redirect_map() does), because we don't
have a reference to the network namespace of the ingress device at the time
the helper is called. So we have to leave it as-is and keep the device
lookup in xdp_do_redirect_slow().
Since this leaves less reason to have the non-map redirect code in a
separate function, we get rid of the xdp_do_redirect_slow() function
entirely. This does lose us the tracepoint disambiguation, but fortunately
the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
entry structures. This means both can contain a map index, so we can just
amend the tracepoint definitions so we always emit the xdp_redirect(_err)
tracepoints, but with the map ID only populated if a map is present. This
means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
the definitions around in case someone is still listening for them.
With this change, the performance of the xdp_redirect sample program goes
from 5Mpps to 8.4Mpps (a 68% increase).
Since the flush functions are no longer map-specific, rename the flush()
functions to drop _map from their names. One of the renamed functions is
the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
keep from having to update all drivers, use a #define to keep the old name
working, and only update the virtual drivers in this patch.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
2020-01-16 16:14:45 +01:00
xdp_do_flush ( ) ;
2018-09-12 11:17:04 +08:00
if ( err ! = XDP_PASS )
goto out ;
pad = xdp . data - xdp . data_hard_start ;
len = xdp . data_end - xdp . data ;
2017-08-11 19:41:18 +08:00
}
2018-09-12 11:17:01 +08:00
rcu_read_unlock ( ) ;
local_bh_enable ( ) ;
2017-08-11 19:41:18 +08:00
2019-07-23 16:23:01 +02:00
return __tun_build_skb ( tfile , alloc_frag , buf , buflen , len , pad ) ;
2017-08-11 19:41:18 +08:00
2018-09-12 11:17:02 +08:00
out :
2017-08-11 19:41:18 +08:00
rcu_read_unlock ( ) ;
2018-05-28 19:37:49 +09:00
local_bh_enable ( ) ;
2017-08-11 19:41:18 +08:00
return NULL ;
2017-08-11 19:41:16 +08:00
}
2005-04-16 15:20:36 -07:00
/* Get packet from user space buffer */
2012-10-31 19:45:57 +00:00
static ssize_t tun_get_user ( struct tun_struct * tun , struct tun_file * tfile ,
2014-06-19 15:36:49 -04:00
void * msg_control , struct iov_iter * from ,
2017-01-18 15:02:03 +08:00
int noblock , bool more )
2005-04-16 15:20:36 -07:00
{
2009-02-01 00:45:17 -08:00
struct tun_pi pi = { 0 , cpu_to_be16 ( ETH_P_IP ) } ;
2005-04-16 15:20:36 -07:00
struct sk_buff * skb ;
2014-06-19 15:36:49 -04:00
size_t total_len = iov_iter_count ( from ) ;
2016-02-26 10:45:40 +01:00
size_t len = total_len , align = tun - > align , linear ;
2008-07-03 03:48:02 -07:00
struct virtio_net_hdr gso = { 0 } ;
2013-11-13 14:00:39 +08:00
int good_linear ;
2012-07-20 09:23:23 +00:00
int copylen ;
bool zerocopy = false ;
int err ;
2017-12-04 17:31:23 +08:00
u32 rxhash = 0 ;
2017-09-04 11:36:09 +08:00
int skb_xdp = 1 ;
2018-09-28 14:51:49 -07:00
bool frags = tun_napi_frags_enabled ( tfile ) ;
2022-11-10 15:31:25 +08:00
enum skb_drop_reason drop_reason = SKB_DROP_REASON_NOT_SPECIFIED ;
2005-04-16 15:20:36 -07:00
2014-11-19 15:17:31 +02:00
if ( ! ( tun - > flags & IFF_NO_PI ) ) {
2013-08-15 15:52:57 +03:00
if ( len < sizeof ( pi ) )
2005-04-16 15:20:36 -07:00
return - EINVAL ;
2013-08-15 15:52:57 +03:00
len - = sizeof ( pi ) ;
2005-04-16 15:20:36 -07:00
2016-11-01 22:09:04 -04:00
if ( ! copy_from_iter_full ( & pi , sizeof ( pi ) , from ) )
2005-04-16 15:20:36 -07:00
return - EFAULT ;
}
2014-11-19 15:17:31 +02:00
if ( tun - > flags & IFF_VNET_HDR ) {
2017-02-03 18:20:48 -05:00
int vnet_hdr_sz = READ_ONCE ( tun - > vnet_hdr_sz ) ;
if ( len < vnet_hdr_sz )
2008-07-03 03:48:02 -07:00
return - EINVAL ;
2017-02-03 18:20:48 -05:00
len - = vnet_hdr_sz ;
2008-07-03 03:48:02 -07:00
2016-11-01 22:09:04 -04:00
if ( ! copy_from_iter_full ( & gso , sizeof ( gso ) , from ) )
2008-07-03 03:48:02 -07:00
return - EFAULT ;
2009-06-08 00:20:01 -07:00
if ( ( gso . flags & VIRTIO_NET_HDR_F_NEEDS_CSUM ) & &
2014-10-23 22:59:31 +03:00
tun16_to_cpu ( tun , gso . csum_start ) + tun16_to_cpu ( tun , gso . csum_offset ) + 2 > tun16_to_cpu ( tun , gso . hdr_len ) )
gso . hdr_len = cpu_to_tun16 ( tun , tun16_to_cpu ( tun , gso . csum_start ) + tun16_to_cpu ( tun , gso . csum_offset ) + 2 ) ;
2009-06-08 00:20:01 -07:00
2014-10-23 22:59:31 +03:00
if ( tun16_to_cpu ( tun , gso . hdr_len ) > len )
2008-07-03 03:48:02 -07:00
return - EINVAL ;
2017-02-03 18:20:48 -05:00
iov_iter_advance ( from , vnet_hdr_sz - sizeof ( gso ) ) ;
2008-07-03 03:48:02 -07:00
}
2014-11-19 15:17:31 +02:00
if ( ( tun - > flags & TUN_TYPE_MASK ) = = IFF_TAP ) {
2011-06-08 14:33:07 +00:00
align + = NET_IP_ALIGN ;
2009-04-14 02:09:43 -07:00
if ( unlikely ( len < ETH_HLEN | |
2014-10-23 22:59:31 +03:00
( gso . hdr_len & & tun16_to_cpu ( tun , gso . hdr_len ) < ETH_HLEN ) ) )
2008-04-12 18:49:30 -07:00
return - EINVAL ;
}
2006-09-13 13:24:59 -04:00
2013-11-13 14:00:39 +08:00
good_linear = SKB_MAX_HEAD ( align ) ;
2013-07-18 10:55:15 +08:00
if ( msg_control ) {
2014-06-19 15:36:49 -04:00
struct iov_iter i = * from ;
2013-07-18 10:55:15 +08:00
/* There are 256 bytes to be copied in skb, so there is
* enough room for skb expand head in case it is used .
2012-07-20 09:23:23 +00:00
* The rest of the buffer is mapped from userspace .
*/
2014-10-23 22:59:31 +03:00
copylen = gso . hdr_len ? tun16_to_cpu ( tun , gso . hdr_len ) : GOODCOPY_LEN ;
2013-11-13 14:00:39 +08:00
if ( copylen > good_linear )
copylen = good_linear ;
2013-07-10 13:43:27 +08:00
linear = copylen ;
2014-06-19 15:36:49 -04:00
iov_iter_advance ( & i , copylen ) ;
if ( iov_iter_npages ( & i , INT_MAX ) < = MAX_SKB_FRAGS )
2013-07-18 10:55:15 +08:00
zerocopy = true ;
}
2017-09-22 13:49:15 -07:00
if ( ! frags & & tun_can_build_skb ( tun , tfile , len , noblock , zerocopy ) ) {
2017-09-04 11:36:09 +08:00
/* For packets that are not easy to process
* ( e . g . GSO or jumbo packets ) , XDP is run after the
* skb has been created , using the generic XDP routine .
*/
skb = tun_build_skb ( tun , tfile , from , & gso , len , & skb_xdp ) ;
2022-11-10 15:31:25 +08:00
err = PTR_ERR_OR_ZERO ( skb ) ;
if ( err )
goto drop ;
2017-08-11 19:41:18 +08:00
if ( ! skb )
return total_len ;
2017-08-11 19:41:16 +08:00
} else {
if ( ! zerocopy ) {
copylen = len ;
if ( tun16_to_cpu ( tun , gso . hdr_len ) > good_linear )
linear = good_linear ;
else
linear = tun16_to_cpu ( tun , gso . hdr_len ) ;
}
2005-04-16 15:20:36 -07:00
2017-09-22 13:49:15 -07:00
if ( frags ) {
mutex_lock ( & tfile - > napi_mutex ) ;
skb = tun_napi_alloc_frags ( tfile , copylen , from ) ;
/* tun_napi_alloc_frags() enforces a layout for the skb.
* If zerocopy is enabled , then this layout will be
* overwritten by zerocopy_sg_from_iter ( ) .
*/
zerocopy = false ;
} else {
2023-08-09 09:47:52 -07:00
if ( ! linear )
linear = min_t ( size_t , good_linear , copylen ) ;
2017-09-22 13:49:15 -07:00
skb = tun_alloc_skb ( tfile , align , copylen , linear ,
noblock ) ;
}
2022-11-10 15:31:25 +08:00
err = PTR_ERR_OR_ZERO ( skb ) ;
if ( err )
goto drop ;
2012-07-20 09:23:23 +00:00
2017-08-11 19:41:16 +08:00
if ( zerocopy )
err = zerocopy_sg_from_iter ( skb , from ) ;
else
err = skb_copy_datagram_from_iter ( skb , 0 , from , len ) ;
if ( err ) {
2019-03-14 20:19:47 -07:00
err = - EFAULT ;
2022-03-04 06:55:07 -08:00
drop_reason = SKB_DROP_REASON_SKB_UCOPY_FAULT ;
2022-11-10 15:31:25 +08:00
goto drop ;
2017-08-11 19:41:16 +08:00
}
2006-03-11 18:49:13 -08:00
}
2005-04-16 15:20:36 -07:00
2016-11-18 15:40:38 -08:00
if ( virtio_net_hdr_to_skb ( skb , & gso , tun_is_little_endian ( tun ) ) ) {
2020-11-07 21:50:56 +01:00
atomic_long_inc ( & tun - > rx_frame_errors ) ;
2022-11-10 15:31:25 +08:00
err = - EINVAL ;
goto free_skb ;
2016-06-14 00:00:04 +02:00
}
2005-04-16 15:20:36 -07:00
switch ( tun - > flags & TUN_TYPE_MASK ) {
2014-11-19 15:17:31 +02:00
case IFF_TUN :
if ( tun - > flags & IFF_NO_PI ) {
tun: bail out from tun_get_user() if the skb is empty
KMSAN (https://github.com/google/kmsan) reported accessing uninitialized
skb->data[0] in the case the skb is empty (i.e. skb->len is 0):
================================================
BUG: KMSAN: use of uninitialized memory in tun_get_user+0x19ba/0x3770
CPU: 0 PID: 3051 Comm: probe Not tainted 4.13.0+ #3140
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
...
__msan_warning_32+0x66/0xb0 mm/kmsan/kmsan_instr.c:477
tun_get_user+0x19ba/0x3770 drivers/net/tun.c:1301
tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
call_write_iter ./include/linux/fs.h:1743
new_sync_write fs/read_write.c:457
__vfs_write+0x6c3/0x7f0 fs/read_write.c:470
vfs_write+0x3e4/0x770 fs/read_write.c:518
SYSC_write+0x12f/0x2b0 fs/read_write.c:565
SyS_write+0x55/0x80 fs/read_write.c:557
do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:245
...
origin:
...
kmsan_poison_shadow+0x6e/0xc0 mm/kmsan/kmsan.c:211
slab_alloc_node mm/slub.c:2732
__kmalloc_node_track_caller+0x351/0x370 mm/slub.c:4351
__kmalloc_reserve net/core/skbuff.c:138
__alloc_skb+0x26a/0x810 net/core/skbuff.c:231
alloc_skb ./include/linux/skbuff.h:903
alloc_skb_with_frags+0x1d7/0xc80 net/core/skbuff.c:4756
sock_alloc_send_pskb+0xabf/0xfe0 net/core/sock.c:2037
tun_alloc_skb drivers/net/tun.c:1144
tun_get_user+0x9a8/0x3770 drivers/net/tun.c:1274
tun_chr_write_iter+0x19f/0x300 drivers/net/tun.c:1365
call_write_iter ./include/linux/fs.h:1743
new_sync_write fs/read_write.c:457
__vfs_write+0x6c3/0x7f0 fs/read_write.c:470
vfs_write+0x3e4/0x770 fs/read_write.c:518
SYSC_write+0x12f/0x2b0 fs/read_write.c:565
SyS_write+0x55/0x80 fs/read_write.c:557
do_syscall_64+0x242/0x330 arch/x86/entry/common.c:284
return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:245
================================================
Make sure tun_get_user() doesn't touch skb->data[0] unless there is
actual data.
C reproducer below:
==========================
// autogenerated by syzkaller (http://github.com/google/syzkaller)
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/if_tun.h>
#include <netinet/ip.h>
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
int main()
{
int sock = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);
int tun_fd = open("/dev/net/tun", O_RDWR);
struct ifreq req;
memset(&req, 0, sizeof(struct ifreq));
strcpy((char*)&req.ifr_name, "gre0");
req.ifr_flags = IFF_UP | IFF_MULTICAST;
ioctl(tun_fd, TUNSETIFF, &req);
ioctl(sock, SIOCSIFFLAGS, "gre0");
write(tun_fd, "hi", 0);
return 0;
}
==========================
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-28 11:32:37 +02:00
u8 ip_version = skb - > len ? ( skb - > data [ 0 ] > > 4 ) : 0 ;
switch ( ip_version ) {
case 4 :
2008-06-17 21:10:33 -07:00
pi . proto = htons ( ETH_P_IP ) ;
break ;
2017-09-28 11:32:37 +02:00
case 6 :
2008-06-17 21:10:33 -07:00
pi . proto = htons ( ETH_P_IPV6 ) ;
break ;
default :
2022-11-10 15:31:25 +08:00
err = - EINVAL ;
goto drop ;
2008-06-17 21:10:33 -07:00
}
}
2007-03-19 15:30:44 -07:00
skb_reset_mac_header ( skb ) ;
2005-04-16 15:20:36 -07:00
skb - > protocol = pi . proto ;
2007-04-25 17:40:23 -07:00
skb - > dev = tun - > dev ;
2005-04-16 15:20:36 -07:00
break ;
2014-11-19 15:17:31 +02:00
case IFF_TAP :
2020-05-30 15:41:31 -04:00
if ( frags & & ! pskb_may_pull ( skb , ETH_HLEN ) ) {
err = - ENOMEM ;
2022-03-04 06:55:07 -08:00
drop_reason = SKB_DROP_REASON_HDR_TRUNC ;
2020-05-30 15:41:31 -04:00
goto drop ;
}
skb - > protocol = eth_type_trans ( skb , tun - > dev ) ;
2005-04-16 15:20:36 -07:00
break ;
2011-06-03 11:51:20 +00:00
}
2005-04-16 15:20:36 -07:00
2012-07-20 09:23:23 +00:00
/* copy skb_ubuf_info for callback when skb has no error */
if ( zerocopy ) {
2021-01-06 14:18:40 -08:00
skb_zcopy_init ( skb , msg_control ) ;
2016-11-30 13:17:51 +08:00
} else if ( msg_control ) {
struct ubuf_info * uarg = msg_control ;
2021-01-06 14:18:34 -08:00
uarg - > callback ( NULL , uarg , false ) ;
2012-07-20 09:23:23 +00:00
}
2015-02-03 16:36:16 -05:00
skb_reset_network_header ( skb ) ;
net: Don't set transport offset to invalid value
If the socket was created with socket(AF_PACKET, SOCK_RAW, 0),
skb->protocol will be unset, __skb_flow_dissect() will fail, and
skb_probe_transport_header() will fall back to the offset_hint, making
the resulting skb_transport_offset incorrect.
If, however, there is no transport header in the packet,
transport_header shouldn't be set to an arbitrary value.
Fix it by leaving the transport offset unset if it couldn't be found, to
be explicit rather than to fill it with some wrong value. It changes the
behavior, but if some code relied on the old behavior, it would be
broken anyway, as the old one is incorrect.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 12:39:57 +00:00
skb_probe_transport_header ( skb ) ;
2020-04-10 18:20:59 +02:00
skb_record_rx_queue ( skb , tfile - > queue_index ) ;
2013-03-25 20:19:56 +00:00
2017-09-04 11:36:09 +08:00
if ( skb_xdp ) {
2017-08-11 19:41:18 +08:00
struct bpf_prog * xdp_prog ;
int ret ;
2018-05-28 19:37:49 +09:00
local_bh_disable ( ) ;
2017-08-11 19:41:18 +08:00
rcu_read_lock ( ) ;
xdp_prog = rcu_dereference ( tun - > xdp_prog ) ;
if ( xdp_prog ) {
ret = do_xdp_generic ( xdp_prog , skb ) ;
if ( ret ! = XDP_PASS ) {
rcu_read_unlock ( ) ;
2018-05-28 19:37:49 +09:00
local_bh_enable ( ) ;
2022-11-10 15:31:25 +08:00
goto unlock_frags ;
2017-08-11 19:41:18 +08:00
}
}
rcu_read_unlock ( ) ;
2018-05-28 19:37:49 +09:00
local_bh_enable ( ) ;
2017-08-11 19:41:18 +08:00
}
2018-04-20 13:18:16 +02:00
/* Compute the costly rx hash only if needed for flow updates.
* There is a very small chance of out-of-order delivery during
* switching, which is not worth optimizing for.
*/
if ( ! rcu_access_pointer ( tun - > steering_prog ) & & tun - > numqueues > 1 & &
! tfile - > detached )
2017-12-04 17:31:23 +08:00
rxhash = __skb_get_hash_symmetric ( skb ) ;
2017-09-22 13:49:14 -07:00
2019-03-14 20:19:47 -07:00
rcu_read_lock ( ) ;
if ( unlikely ( ! ( tun - > dev - > flags & IFF_UP ) ) ) {
err = - EIO ;
2019-03-16 13:09:53 -07:00
rcu_read_unlock ( ) ;
2022-03-04 06:55:07 -08:00
drop_reason = SKB_DROP_REASON_DEV_READY ;
2019-03-14 20:19:47 -07:00
goto drop ;
}
2017-09-22 13:49:15 -07:00
if ( frags ) {
2020-05-30 15:41:31 -04:00
u32 headlen ;
2017-09-22 13:49:15 -07:00
/* Exercise flow dissector code path. */
2020-05-30 15:41:31 -04:00
skb_push ( skb , ETH_HLEN ) ;
headlen = eth_get_headlen ( tun - > dev , skb - > data ,
skb_headlen ( skb ) ) ;
2017-09-22 13:49:15 -07:00
2017-10-17 10:07:44 -07:00
if ( unlikely ( headlen > skb_headlen ( skb ) ) ) {
2022-11-07 18:00:11 +00:00
WARN_ON_ONCE ( 1 ) ;
err = - ENOMEM ;
2022-03-10 21:14:20 -08:00
dev_core_stats_rx_dropped_inc ( tun - > dev ) ;
2022-11-07 18:00:11 +00:00
napi_busy :
2017-09-22 13:49:15 -07:00
napi_free_frags ( & tfile - > napi ) ;
2019-03-14 20:19:47 -07:00
rcu_read_unlock ( ) ;
2017-09-22 13:49:15 -07:00
mutex_unlock ( & tfile - > napi_mutex ) ;
2022-11-07 18:00:11 +00:00
return err ;
2017-09-22 13:49:15 -07:00
}
2022-11-07 18:00:11 +00:00
if ( likely ( napi_schedule_prep ( & tfile - > napi ) ) ) {
local_bh_disable ( ) ;
napi_gro_frags ( & tfile - > napi ) ;
napi_complete ( & tfile - > napi ) ;
local_bh_enable ( ) ;
} else {
err = - EBUSY ;
goto napi_busy ;
}
2017-09-22 13:49:15 -07:00
mutex_unlock ( & tfile - > napi_mutex ) ;
2017-10-18 12:12:09 -07:00
} else if ( tfile - > napi_enabled ) {
2017-09-22 13:49:14 -07:00
struct sk_buff_head * queue = & tfile - > sk . sk_write_queue ;
int queue_len ;
spin_lock_bh ( & queue - > lock ) ;
tun: Fix memory leak for detached NAPI queue.
syzkaller reported [0] memory leaks of sk and skb related to the TUN
device with no repro, but we can reproduce it easily with:
struct ifreq ifr = {};
int fd_tun, fd_tmp;
char buf[4] = {};
fd_tun = openat(AT_FDCWD, "/dev/net/tun", O_WRONLY, 0);
ifr.ifr_flags = IFF_TUN | IFF_NAPI | IFF_MULTI_QUEUE;
ioctl(fd_tun, TUNSETIFF, &ifr);
ifr.ifr_flags = IFF_DETACH_QUEUE;
ioctl(fd_tun, TUNSETQUEUE, &ifr);
fd_tmp = socket(AF_PACKET, SOCK_PACKET, 0);
ifr.ifr_flags = IFF_UP;
ioctl(fd_tmp, SIOCSIFFLAGS, &ifr);
write(fd_tun, buf, sizeof(buf));
close(fd_tun);
If we enable NAPI and multi-queue on a TUN device, we can put skb into
tfile->sk.sk_write_queue after the queue is detached. We should prevent
it by checking tfile->detached before queuing skb.
Note this must be done under tfile->sk.sk_write_queue.lock because write()
and ioctl(IFF_DETACH_QUEUE) can run concurrently. Otherwise, there would
be a small race window:
  write()                             ioctl(IFF_DETACH_QUEUE)
  `- tun_get_user                     `- __tun_detach
     |- if (tfile->detached)             |- tun_disable_queue
     |  `-> false                        |  `- tfile->detached = tun
     |                                   `- tun_queue_purge
     |- spin_lock_bh(&queue->lock)
     `- __skb_queue_tail(queue, skb)
Another solution is to call tun_queue_purge() when closing and
reattaching the detached queue, but it could paper over other
problems. Also, we do the same kind of test for IFF_NAPI_FRAGS.
[0]:
unreferenced object 0xffff88801edbc800 (size 2048):
comm "syz-executor.1", pid 33269, jiffies 4295743834 (age 18.756s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
backtrace:
[<000000008c16ea3d>] __do_kmalloc_node mm/slab_common.c:965 [inline]
[<000000008c16ea3d>] __kmalloc+0x4a/0x130 mm/slab_common.c:979
[<000000003addde56>] kmalloc include/linux/slab.h:563 [inline]
[<000000003addde56>] sk_prot_alloc+0xef/0x1b0 net/core/sock.c:2035
[<000000003e20621f>] sk_alloc+0x36/0x2f0 net/core/sock.c:2088
[<0000000028e43843>] tun_chr_open+0x3d/0x190 drivers/net/tun.c:3438
[<000000001b0f1f28>] misc_open+0x1a6/0x1f0 drivers/char/misc.c:165
[<000000004376f706>] chrdev_open+0x111/0x300 fs/char_dev.c:414
[<00000000614d379f>] do_dentry_open+0x2f9/0x750 fs/open.c:920
[<000000008eb24774>] do_open fs/namei.c:3636 [inline]
[<000000008eb24774>] path_openat+0x143f/0x1a30 fs/namei.c:3791
[<00000000955077b5>] do_filp_open+0xce/0x1c0 fs/namei.c:3818
[<00000000b78973b0>] do_sys_openat2+0xf0/0x260 fs/open.c:1356
[<00000000057be699>] do_sys_open fs/open.c:1372 [inline]
[<00000000057be699>] __do_sys_openat fs/open.c:1388 [inline]
[<00000000057be699>] __se_sys_openat fs/open.c:1383 [inline]
[<00000000057be699>] __x64_sys_openat+0x83/0xf0 fs/open.c:1383
[<00000000a7d2182d>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<00000000a7d2182d>] do_syscall_64+0x3c/0x90 arch/x86/entry/common.c:80
[<000000004cc4e8c4>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
unreferenced object 0xffff88802f671700 (size 240):
comm "syz-executor.1", pid 33269, jiffies 4295743854 (age 18.736s)
hex dump (first 32 bytes):
68 c9 db 1e 80 88 ff ff 68 c9 db 1e 80 88 ff ff h.......h.......
00 c0 7b 2f 80 88 ff ff 00 c8 db 1e 80 88 ff ff ..{/............
backtrace:
[<00000000e9d9fdb6>] __alloc_skb+0x223/0x250 net/core/skbuff.c:644
[<000000002c3e4e0b>] alloc_skb include/linux/skbuff.h:1288 [inline]
[<000000002c3e4e0b>] alloc_skb_with_frags+0x6f/0x350 net/core/skbuff.c:6378
[<00000000825f98d7>] sock_alloc_send_pskb+0x3ac/0x3e0 net/core/sock.c:2729
[<00000000e9eb3df3>] tun_alloc_skb drivers/net/tun.c:1529 [inline]
[<00000000e9eb3df3>] tun_get_user+0x5e1/0x1f90 drivers/net/tun.c:1841
[<0000000053096912>] tun_chr_write_iter+0xac/0x120 drivers/net/tun.c:2035
[<00000000b9282ae0>] call_write_iter include/linux/fs.h:1868 [inline]
[<00000000b9282ae0>] new_sync_write fs/read_write.c:491 [inline]
[<00000000b9282ae0>] vfs_write+0x40f/0x530 fs/read_write.c:584
[<00000000524566e4>] ksys_write+0xa1/0x170 fs/read_write.c:637
[<00000000a7d2182d>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<00000000a7d2182d>] do_syscall_64+0x3c/0x90 arch/x86/entry/common.c:80
[<000000004cc4e8c4>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
Fixes: cde8b15f1aab ("tuntap: add ioctl to attach or detach a file form tuntap device")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15 11:42:04 -07:00
if ( unlikely ( tfile - > detached ) ) {
spin_unlock_bh ( & queue - > lock ) ;
rcu_read_unlock ( ) ;
err = - EBUSY ;
goto free_skb ;
}
2017-09-22 13:49:14 -07:00
__skb_queue_tail ( queue , skb ) ;
queue_len = skb_queue_len ( queue ) ;
spin_unlock ( & queue - > lock ) ;
if ( ! more | | queue_len > NAPI_POLL_WEIGHT )
napi_schedule ( & tfile - > napi ) ;
local_bh_enable ( ) ;
} else if ( ! IS_ENABLED ( CONFIG_4KSTACKS ) ) {
tun_rx_batched ( tun , tfile , skb , more ) ;
} else {
2022-03-06 22:57:46 +01:00
netif_rx ( skb ) ;
2017-09-22 13:49:14 -07:00
}
2019-03-14 20:19:47 -07:00
rcu_read_unlock ( ) ;
2006-09-13 13:24:59 -04:00
2020-11-07 21:50:56 +01:00
preempt_disable ( ) ;
dev_sw_netstats_rx_add ( tun - > dev , len ) ;
preempt_enable ( ) ;
2005-04-16 15:20:36 -07:00
2017-12-04 17:31:23 +08:00
if ( rxhash )
tun_flow_update ( tun , rxhash , tfile ) ;
2012-07-20 09:23:23 +00:00
return total_len ;
2022-11-10 15:31:25 +08:00
drop :
if ( err ! = - EAGAIN )
dev_core_stats_rx_dropped_inc ( tun - > dev ) ;
free_skb :
if ( ! IS_ERR_OR_NULL ( skb ) )
kfree_skb_reason ( skb , drop_reason ) ;
unlock_frags :
if ( frags ) {
tfile - > napi . skb = NULL ;
mutex_unlock ( & tfile - > napi_mutex ) ;
}
return err ? : total_len ;
2006-09-13 13:24:59 -04:00
}
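When IFF_NO_PI is set, tun_get_user() above has no packet-info header to read, so it infers the protocol from the high nibble of the first payload byte and treats an empty skb as "unknown". The following is a minimal userspace sketch of that check, not the driver's code. The `sketch_` names and the host-order return value are assumptions made for illustration; the driver itself stores `htons(ETH_P_IP)` or `htons(ETH_P_IPV6)` in `pi.proto` and fails with -EINVAL otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_P_IP   0x0800u /* EtherType for IPv4 */
#define SKETCH_ETH_P_IPV6 0x86DDu /* EtherType for IPv6 */

/*
 * Hypothetical helper: guess the EtherType (host byte order) of a raw
 * IP packet. Returns 0 when the buffer is empty or the version nibble
 * is neither 4 nor 6, which is where the driver would return -EINVAL.
 */
static uint16_t sketch_guess_proto(const uint8_t *data, size_t len)
{
	uint8_t ip_version;

	assert(data != NULL || len == 0);

	/* Never read data[0] unless there is at least one byte. */
	ip_version = len ? (uint8_t)(data[0] >> 4) : 0;

	switch (ip_version) {
	case 4:
		return SKETCH_ETH_P_IP;
	case 6:
		return SKETCH_ETH_P_IPV6;
	default:
		return 0;
	}
}
```

The length test comes first on purpose: it is the guard that the "bail out from tun_get_user() if the skb is empty" commit added.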
2005-04-16 15:20:36 -07:00
2014-06-19 15:36:49 -04:00
static ssize_t tun_chr_write_iter ( struct kiocb * iocb , struct iov_iter * from )
2005-04-16 15:20:36 -07:00
{
2009-02-05 21:25:32 -08:00
struct file * file = iocb - > ki_filp ;
2012-10-31 19:45:57 +00:00
struct tun_file * tfile = file - > private_data ;
2017-09-23 22:36:52 +08:00
struct tun_struct * tun = tun_get ( tfile ) ;
2009-01-20 11:00:40 +00:00
ssize_t result ;
2020-11-20 07:59:54 -07:00
int noblock = 0 ;
2005-04-16 15:20:36 -07:00
if ( ! tun )
return - EBADFD ;
2020-11-20 07:59:54 -07:00
if ( ( file - > f_flags & O_NONBLOCK ) | | ( iocb - > ki_flags & IOCB_NOWAIT ) )
noblock = 1 ;
result = tun_get_user ( tun , tfile , NULL , from , noblock , false ) ;
2009-01-20 11:00:40 +00:00
tun_put ( tun ) ;
return result ;
2005-04-16 15:20:36 -07:00
}
2018-01-04 11:14:28 +08:00
static ssize_t tun_put_user_xdp ( struct tun_struct * tun ,
struct tun_file * tfile ,
2018-04-17 16:45:47 +02:00
struct xdp_frame * xdp_frame ,
2018-01-04 11:14:28 +08:00
struct iov_iter * iter )
{
int vnet_hdr_sz = 0 ;
2018-04-17 16:45:47 +02:00
size_t size = xdp_frame - > len ;
2018-01-04 11:14:28 +08:00
size_t ret ;
if ( tun - > flags & IFF_VNET_HDR ) {
struct virtio_net_hdr gso = { 0 } ;
vnet_hdr_sz = READ_ONCE ( tun - > vnet_hdr_sz ) ;
if ( unlikely ( iov_iter_count ( iter ) < vnet_hdr_sz ) )
return - EINVAL ;
if ( unlikely ( copy_to_iter ( & gso , sizeof ( gso ) , iter ) ! =
sizeof ( gso ) ) )
return - EFAULT ;
iov_iter_advance ( iter , vnet_hdr_sz - sizeof ( gso ) ) ;
}
2018-04-17 16:45:47 +02:00
ret = copy_to_iter ( xdp_frame - > data , size , iter ) + vnet_hdr_sz ;
2018-01-04 11:14:28 +08:00
2020-11-07 21:50:56 +01:00
preempt_disable ( ) ;
dev_sw_netstats_tx_add ( tun - > dev , 1 , ret ) ;
preempt_enable ( ) ;
2018-01-04 11:14:28 +08:00
return ret ;
}
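tun_put_user_xdp() above writes a zeroed virtio-net header, skips any extra header bytes the user configured (`vnet_hdr_sz - sizeof(gso)`), and then copies the frame. The layout can be sketched in userspace with a flat buffer instead of an iov_iter. This is an illustration under assumptions: `sketch_put_frame` and `SKETCH_GSO_HDR_LEN` are hypothetical names, and the 10-byte constant stands in for `sizeof(struct virtio_net_hdr)`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_GSO_HDR_LEN 10 /* stand-in for sizeof(struct virtio_net_hdr) */

/*
 * Hypothetical helper: lay out [zeroed header][padding][frame] in `out`.
 * Returns the number of bytes accounted for (header size plus copied
 * frame bytes), or -1 if `out` cannot hold the configured header.
 */
static long sketch_put_frame(unsigned char *out, size_t out_len,
			     size_t vnet_hdr_sz,
			     const unsigned char *frame, size_t frame_len)
{
	size_t off = 0;
	size_t n;

	/* A configured header is never smaller than the base header. */
	assert(vnet_hdr_sz == 0 || vnet_hdr_sz >= SKETCH_GSO_HDR_LEN);

	if (vnet_hdr_sz) {
		if (out_len < vnet_hdr_sz)
			return -1;
		/* Only the base header is written ... */
		memset(out, 0, SKETCH_GSO_HDR_LEN);
		/* ... the remaining header bytes are skipped, not written. */
		off = vnet_hdr_sz;
	}

	/* Copy as much of the frame as fits, like a short copy_to_iter(). */
	n = out_len - off < frame_len ? out_len - off : frame_len;
	memcpy(out + off, frame, n);
	return (long)(off + n);
}
```

As in the driver, the padding between the base header and the frame is advanced over rather than cleared, so the caller should not assume those bytes are zero.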
2005-04-16 15:20:36 -07:00
/* Put packet to the user space buffer */
2011-06-08 14:33:08 +00:00
static ssize_t tun_put_user ( struct tun_struct * tun ,
2012-10-31 19:45:57 +00:00
struct tun_file * tfile ,
2011-06-08 14:33:08 +00:00
struct sk_buff * skb ,
2014-11-07 21:22:23 +08:00
struct iov_iter * iter )
2005-04-16 15:20:36 -07:00
{
struct tun_pi pi = { 0 , skb - > protocol } ;
2014-11-07 21:22:23 +08:00
ssize_t total ;
2014-11-13 16:54:14 +08:00
int vlan_offset = 0 ;
2014-11-03 04:30:13 +08:00
int vlan_hlen = 0 ;
2014-11-03 04:30:14 +08:00
int vnet_hdr_sz = 0 ;
2014-11-03 04:30:13 +08:00
2015-01-13 17:13:44 +01:00
if ( skb_vlan_tag_present ( skb ) )
2014-11-03 04:30:13 +08:00
vlan_hlen = VLAN_HLEN ;
2005-04-16 15:20:36 -07:00
2014-11-19 15:17:31 +02:00
if ( tun - > flags & IFF_VNET_HDR )
2017-02-03 18:20:48 -05:00
vnet_hdr_sz = READ_ONCE ( tun - > vnet_hdr_sz ) ;
2005-04-16 15:20:36 -07:00
2014-11-07 21:22:23 +08:00
total = skb - > len + vlan_hlen + vnet_hdr_sz ;
2014-11-19 15:17:31 +02:00
if ( ! ( tun - > flags & IFF_NO_PI ) ) {
2014-11-07 21:22:23 +08:00
if ( iov_iter_count ( iter ) < sizeof ( pi ) )
2005-04-16 15:20:36 -07:00
return - EINVAL ;
2014-11-07 21:22:23 +08:00
total + = sizeof ( pi ) ;
if ( iov_iter_count ( iter ) < total ) {
2005-04-16 15:20:36 -07:00
/* Packet will be stripped */
pi . flags | = TUN_PKT_STRIP ;
}
2006-09-13 13:24:59 -04:00
2014-11-07 21:22:23 +08:00
if ( copy_to_iter ( & pi , sizeof ( pi ) , iter ) ! = sizeof ( pi ) )
2005-04-16 15:20:36 -07:00
return - EFAULT ;
2006-09-13 13:24:59 -04:00
}
2005-04-16 15:20:36 -07:00
2014-11-03 04:30:14 +08:00
if ( vnet_hdr_sz ) {
2016-11-18 15:40:40 -08:00
struct virtio_net_hdr gso ;
2016-06-08 16:09:20 +03:00
2014-11-07 21:22:23 +08:00
if ( iov_iter_count ( iter ) < vnet_hdr_sz )
2008-07-03 03:48:02 -07:00
return - EINVAL ;
2016-11-18 15:40:38 -08:00
if ( virtio_net_hdr_from_skb ( skb , & gso ,
2018-06-06 11:23:01 -04:00
tun_is_little_endian ( tun ) , true ,
vlan_hlen ) ) {
2008-07-03 03:48:02 -07:00
struct skb_shared_info * sinfo = skb_shinfo ( skb ) ;
2016-06-08 16:09:20 +03:00
pr_err ( " unexpected GSO type: "
" 0x%x, gso_size %d, hdr_len %d \n " ,
sinfo - > gso_type , tun16_to_cpu ( tun , gso . gso_size ) ,
tun16_to_cpu ( tun , gso . hdr_len ) ) ;
print_hex_dump ( KERN_ERR , " tun: " ,
DUMP_PREFIX_NONE ,
16 , 1 , skb - > head ,
min ( ( int ) tun16_to_cpu ( tun , gso . hdr_len ) , 64 ) , true ) ;
WARN_ON_ONCE ( 1 ) ;
return - EINVAL ;
}
2008-07-03 03:48:02 -07:00
2014-11-07 21:22:23 +08:00
if ( copy_to_iter ( & gso , sizeof ( gso ) , iter ) ! = sizeof ( gso ) )
2008-07-03 03:48:02 -07:00
return - EFAULT ;
2014-11-13 16:54:14 +08:00
iov_iter_advance ( iter , vnet_hdr_sz - sizeof ( gso ) ) ;
2008-07-03 03:48:02 -07:00
}
2014-11-03 04:30:13 +08:00
if ( vlan_hlen ) {
2014-11-07 21:22:23 +08:00
int ret ;
2018-01-16 16:31:02 +08:00
struct veth veth ;
2013-07-25 13:00:33 +08:00
veth . h_vlan_proto = skb - > vlan_proto ;
2015-01-13 17:13:44 +01:00
veth . h_vlan_TCI = htons ( skb_vlan_tag_get ( skb ) ) ;
2013-07-25 13:00:33 +08:00
vlan_offset = offsetof ( struct vlan_ethhdr , h_vlan_proto ) ;
2014-11-07 21:22:23 +08:00
ret = skb_copy_datagram_iter ( skb , 0 , iter , vlan_offset ) ;
if ( ret | | ! iov_iter_count ( iter ) )
2013-07-25 13:00:33 +08:00
goto done ;
2014-11-07 21:22:23 +08:00
ret = copy_to_iter ( & veth , sizeof ( veth ) , iter ) ;
if ( ret ! = sizeof ( veth ) | | ! iov_iter_count ( iter ) )
2013-07-25 13:00:33 +08:00
goto done ;
}
2005-04-16 15:20:36 -07:00
2014-11-07 21:22:23 +08:00
skb_copy_datagram_iter ( skb , vlan_offset , iter , skb - > len - vlan_offset ) ;
2005-04-16 15:20:36 -07:00
2013-07-25 13:00:33 +08:00
done :
2016-04-13 10:52:20 +02:00
/* caller is in process context */
2020-11-07 21:50:56 +01:00
preempt_disable ( ) ;
dev_sw_netstats_tx_add ( tun - > dev , 1 , skb - > len + vlan_hlen ) ;
preempt_enable ( ) ;
2005-04-16 15:20:36 -07:00
return total ;
}
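When the skb carries an out-of-band VLAN tag, tun_put_user() above rebuilds the on-wire frame in three copies: the first `offsetof(struct vlan_ethhdr, h_vlan_proto)` bytes (the two MAC addresses), the 4-byte tag, then the rest of the frame. A minimal userspace sketch of that reassembly follows. The `sketch_` names are hypothetical, the offset of 12 assumes a standard Ethernet header, and unlike the driver (which stops early when the user buffer runs out) this sketch simply refuses a buffer that is too small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_VLAN_OFFSET 12 /* destination MAC + source MAC */
#define SKETCH_VLAN_HLEN    4 /* TPID + TCI, both big-endian */

/*
 * Hypothetical helper: copy `frame` into `out` with a VLAN tag inserted
 * after the MAC addresses. Returns the tagged length, or 0 if `out` is
 * too small.
 */
static size_t sketch_insert_vlan(uint8_t *out, size_t out_len,
				 const uint8_t *frame, size_t frame_len,
				 uint16_t tpid, uint16_t tci)
{
	assert(frame_len >= SKETCH_VLAN_OFFSET);

	if (out_len < frame_len + SKETCH_VLAN_HLEN)
		return 0;

	/* 1. MAC addresses, unchanged. */
	memcpy(out, frame, SKETCH_VLAN_OFFSET);

	/* 2. The tag, written in network byte order. */
	out[12] = (uint8_t)(tpid >> 8);
	out[13] = (uint8_t)(tpid & 0xff);
	out[14] = (uint8_t)(tci >> 8);
	out[15] = (uint8_t)(tci & 0xff);

	/* 3. Original EtherType and payload follow the tag. */
	memcpy(out + SKETCH_VLAN_OFFSET + SKETCH_VLAN_HLEN,
	       frame + SKETCH_VLAN_OFFSET, frame_len - SKETCH_VLAN_OFFSET);

	return frame_len + SKETCH_VLAN_HLEN;
}
```

This is also why `total` in tun_put_user() includes `vlan_hlen`: the reader sees 4 more bytes than `skb->len`.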
2018-01-04 11:14:28 +08:00
static void * tun_ring_recv ( struct tun_file * tfile , int noblock , int * err )
2016-06-30 14:45:36 +08:00
{
DECLARE_WAITQUEUE ( wait , current ) ;
2018-01-04 11:14:28 +08:00
void * ptr = NULL ;
2016-07-04 13:53:38 +08:00
int error = 0 ;
2016-06-30 14:45:36 +08:00
2018-01-04 11:14:28 +08:00
ptr = ptr_ring_consume ( & tfile - > tx_ring ) ;
if ( ptr )
2016-06-30 14:45:36 +08:00
goto out ;
if ( noblock ) {
2016-07-04 13:53:38 +08:00
error = - EAGAIN ;
2016-06-30 14:45:36 +08:00
goto out ;
}
2019-07-05 20:14:16 +01:00
add_wait_queue ( & tfile - > socket . wq . wait , & wait ) ;
2016-06-30 14:45:36 +08:00
while ( 1 ) {
2019-02-23 12:53:13 +01:00
set_current_state ( TASK_INTERRUPTIBLE ) ;
2018-01-04 11:14:28 +08:00
ptr = ptr_ring_consume ( & tfile - > tx_ring ) ;
if ( ptr )
2016-06-30 14:45:36 +08:00
break ;
if ( signal_pending ( current ) ) {
2016-07-04 13:53:38 +08:00
error = - ERESTARTSYS ;
2016-06-30 14:45:36 +08:00
break ;
}
if ( tfile - > socket . sk - > sk_shutdown & RCV_SHUTDOWN ) {
2016-07-04 13:53:38 +08:00
error = - EFAULT ;
2016-06-30 14:45:36 +08:00
break ;
}
schedule ( ) ;
}
2019-02-25 21:13:13 +01:00
__set_current_state ( TASK_RUNNING ) ;
2019-07-05 20:14:16 +01:00
remove_wait_queue ( & tfile - > socket . wq . wait , & wait ) ;
2016-06-30 14:45:36 +08:00
out :
2016-07-04 13:53:38 +08:00
* err = error ;
2018-01-04 11:14:28 +08:00
return ptr ;
2016-06-30 14:45:36 +08:00
}
2012-10-31 19:45:57 +00:00
static ssize_t tun_do_read ( struct tun_struct * tun , struct tun_file * tfile ,
2014-11-07 13:52:07 -05:00
struct iov_iter * to ,
2018-01-04 11:14:28 +08:00
int noblock , void * ptr )
2005-04-16 15:20:36 -07:00
{
2014-11-07 13:52:07 -05:00
ssize_t ret ;
2016-06-30 14:45:36 +08:00
int err ;
2005-04-16 15:20:36 -07:00
2017-12-01 05:10:37 -05:00
if ( ! iov_iter_count ( to ) ) {
2018-01-04 11:14:28 +08:00
tun_ptr_free ( ptr ) ;
2014-11-07 13:52:07 -05:00
return 0 ;
2017-12-01 05:10:37 -05:00
}
2005-04-16 15:20:36 -07:00
2018-01-04 11:14:28 +08:00
if ( ! ptr ) {
2017-05-17 12:14:43 +08:00
/* Read frames from ring */
2018-01-04 11:14:28 +08:00
ptr = tun_ring_recv ( tfile , noblock , & err ) ;
if ( ! ptr )
2017-05-17 12:14:43 +08:00
return err ;
}
2014-11-07 21:22:23 +08:00
2018-04-17 16:45:47 +02:00
if ( tun_is_xdp_frame ( ptr ) ) {
struct xdp_frame * xdpf = tun_ptr_to_xdp ( ptr ) ;
2018-01-04 11:14:28 +08:00
2018-04-17 16:45:47 +02:00
ret = tun_put_user_xdp ( tun , tfile , xdpf , to ) ;
xdp: transition into using xdp_frame for return API
Changing the xdp_return_frame() API to take struct xdp_frame as its
argument seems like a natural choice. But there are some subtle
performance details here that need extra care, so this is a
deliberate choice.
De-referencing xdp_frame on a remote CPU during DMA-TX completion
changes the cache-line to "Shared" state. Later, when the page is
reused for RX, this xdp_frame cache-line is written, which changes
the state to "Modified".
This situation already happens (naturally) for virtio_net, tun and
cpumap, as the xdp_frame pointer is the queued object. In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-reference xdp_frame.
Only the ixgbe driver had an optimization that let it avoid
de-referencing xdp_frame. The driver already has a TX-ring queue,
which (in case of remote DMA-TX completion) has to be transferred
between CPUs anyhow. In this data area, we stored a struct
xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.
To compensate for this, a prefetchw is used to tell the cache
coherency protocol about our access pattern. My benchmarks show that
this prefetchw is enough to compensate for the change in the ixgbe
driver.
V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17 16:46:32 +02:00
xdp_return_frame ( xdpf ) ;
2018-01-04 11:14:28 +08:00
} else {
struct sk_buff * skb = ptr ;
ret = tun_put_user ( tun , tfile , skb , to ) ;
if ( unlikely ( ret < 0 ) )
kfree_skb ( skb ) ;
else
consume_skb ( skb ) ;
}
2005-04-16 15:20:36 -07:00
2010-01-14 06:17:09 +00:00
return ret ;
}
2014-11-07 13:52:07 -05:00
static ssize_t tun_chr_read_iter ( struct kiocb * iocb , struct iov_iter * to )
2010-01-14 06:17:09 +00:00
{
struct file * file = iocb - > ki_filp ;
struct tun_file * tfile = file - > private_data ;
2017-09-23 22:36:52 +08:00
struct tun_struct * tun = tun_get ( tfile ) ;
2014-11-07 13:52:07 -05:00
ssize_t len = iov_iter_count ( to ) , ret ;
2020-11-20 07:59:54 -07:00
int noblock = 0 ;
2010-01-14 06:17:09 +00:00
if ( ! tun )
return - EBADFD ;
2020-11-20 07:59:54 -07:00
if ( ( file - > f_flags & O_NONBLOCK ) | | ( iocb - > ki_flags & IOCB_NOWAIT ) )
noblock = 1 ;
ret = tun_do_read ( tun , tfile , to , noblock , NULL ) ;
2013-12-10 22:05:45 -05:00
ret = min_t ( ssize_t , ret , len ) ;
2013-12-06 14:16:51 +08:00
if ( ret > 0 )
iocb - > ki_pos = ret ;
2009-01-20 11:00:40 +00:00
tun_put ( tun ) ;
2005-04-16 15:20:36 -07:00
return ret ;
}
2018-01-16 16:31:01 +08:00
static void tun_prog_free ( struct rcu_head * rcu )
2017-12-04 17:31:23 +08:00
{
2018-01-16 16:31:01 +08:00
struct tun_prog * prog = container_of ( rcu , struct tun_prog , rcu ) ;
2017-12-04 17:31:23 +08:00
bpf_prog_destroy ( prog - > prog ) ;
kfree ( prog ) ;
}
2018-01-22 10:55:38 +08:00
static int __tun_set_ebpf ( struct tun_struct * tun ,
struct tun_prog __rcu * * prog_p ,
2018-01-16 16:31:01 +08:00
struct bpf_prog * prog )
2017-12-04 17:31:23 +08:00
{
2018-01-16 16:31:01 +08:00
struct tun_prog * old , * new = NULL ;
2017-12-04 17:31:23 +08:00
if ( prog ) {
new = kmalloc ( sizeof ( * new ) , GFP_KERNEL ) ;
if ( ! new )
return - ENOMEM ;
new - > prog = prog ;
}
2017-12-08 12:02:30 +08:00
spin_lock_bh ( & tun - > lock ) ;
2018-01-16 16:31:01 +08:00
old = rcu_dereference_protected ( * prog_p ,
2017-12-08 12:02:30 +08:00
lockdep_is_held ( & tun - > lock ) ) ;
2018-01-16 16:31:01 +08:00
rcu_assign_pointer ( * prog_p , new ) ;
2017-12-08 12:02:30 +08:00
spin_unlock_bh ( & tun - > lock ) ;
2017-12-04 17:31:23 +08:00
if ( old )
2018-01-16 16:31:01 +08:00
call_rcu ( & old - > rcu , tun_prog_free ) ;
2017-12-04 17:31:23 +08:00
return 0 ;
}
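__tun_set_ebpf() above follows a common publish-then-retire pattern: allocate the new wrapper first, swap the pointer while holding the lock, and free the old wrapper only after readers are done (via call_rcu()). A rough userspace analogue using C11 atomics is sketched below. It is an assumption-laden illustration, not the kernel mechanism: `sketch_prog` and `sketch_set_prog` are hypothetical, `atomic_exchange` replaces the spinlock plus `rcu_assign_pointer()`, and the immediate `free()` is only safe because the sketch has no concurrent readers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tun_prog. */
struct sketch_prog {
	int id;
};

/*
 * Hypothetical helper: publish a new program with the given id, or
 * clear the slot when id is negative. Returns 0 on success and -1 if
 * the allocation fails, in which case the slot is left untouched.
 */
static int sketch_set_prog(struct sketch_prog *_Atomic *slot, int id)
{
	struct sketch_prog *old;
	struct sketch_prog *fresh = NULL;

	assert(slot != NULL);

	/* Allocate before touching the slot so failure changes nothing. */
	if (id >= 0) {
		fresh = malloc(sizeof(*fresh));
		if (!fresh)
			return -1;
		fresh->id = id;
	}

	/* Swap in one step; readers see either the old or the new pointer. */
	old = atomic_exchange(slot, fresh);

	/*
	 * The kernel defers this with call_rcu() until all readers have
	 * finished. Freeing immediately is valid only in this
	 * single-threaded sketch.
	 */
	free(old);
	return 0;
}
```

Passing NULL as the new program, as tun_free_netdev() does for both `steering_prog` and `filter_prog`, corresponds to the negative-id case here.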
2012-10-31 19:46:02 +00:00
static void tun_free_netdev ( struct net_device * dev )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
2012-12-13 23:53:30 +00:00
BUG_ON ( ! ( list_empty ( & tun - > disabled ) ) ) ;
2019-10-07 12:21:05 -07:00
2020-11-07 21:50:56 +01:00
free_percpu ( dev - > tstats ) ;
2012-10-31 19:46:02 +00:00
tun_flow_uninit ( tun ) ;
2013-01-14 07:12:19 +00:00
security_tun_dev_free_security ( tun - > security ) ;
2018-01-16 16:31:01 +08:00
__tun_set_ebpf ( tun , & tun - > steering_prog , NULL ) ;
2018-01-16 16:31:02 +08:00
__tun_set_ebpf ( tun , & tun - > filter_prog , NULL ) ;
2012-10-31 19:46:02 +00:00
}
2005-04-16 15:20:36 -07:00
static void tun_setup ( struct net_device * dev )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
2012-02-07 16:48:55 -08:00
tun - > owner = INVALID_UID ;
tun - > group = INVALID_GID ;
2018-06-02 17:49:53 -04:00
tun_default_link_ksettings ( dev , & tun - > link_ksettings ) ;
2005-04-16 15:20:36 -07:00
dev - > ethtool_ops = & tun_ethtool_ops ;
net: Fix inconsistent teardown and release of private netdev state.
Network devices can allocate resources and private memory using
netdev_ops->ndo_init(). However, the release of these resources
can occur in one of two different places.
Either netdev_ops->ndo_uninit() or netdev->destructor().
The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.
netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.
netdev->destructor(), on the other hand, does not run until the
netdev references all go away.
Further complicating the situation is that netdev->destructor()
almost universally also does a free_netdev().
This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.
If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
it is not able to invoke netdev->destructor().
This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.
However, this means that the resources that would normally be released
by netdev->destructor() will not be.
Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.
Many drivers do not try to deal with this, and instead we have leaks.
Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().
netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().
netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().
Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().
And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08 12:52:56 -04:00
dev - > needs_free_netdev = true ;
dev - > priv_destructor = tun_free_netdev ;
2016-04-08 13:26:48 +08:00
/* We prefer our own queue length */
dev - > tx_queue_len = TUN_READQ_SIZE ;
2005-04-16 15:20:36 -07:00
}
2009-01-21 16:02:16 -08:00
/* Trivial set of netlink ops to allow deleting tun or tap
* device with netlink .
*/
2017-06-25 23:56:01 +02:00
static int tun_validate ( struct nlattr * tb [ ] , struct nlattr * data [ ] ,
struct netlink_ext_ack * extack )
2009-01-21 16:02:16 -08:00
{
2018-11-29 14:45:39 +01:00
NL_SET_ERR_MSG ( extack ,
" tun/tap creation via rtnetlink is not supported. " ) ;
return - EOPNOTSUPP ;
2009-01-21 16:02:16 -08:00
}
2018-02-16 11:03:07 +01:00
static size_t tun_get_size ( const struct net_device * dev )
{
BUILD_BUG_ON ( sizeof ( u32 ) ! = sizeof ( uid_t ) ) ;
BUILD_BUG_ON ( sizeof ( u32 ) ! = sizeof ( gid_t ) ) ;
return nla_total_size ( sizeof ( uid_t ) ) + /* OWNER */
nla_total_size ( sizeof ( gid_t ) ) + /* GROUP */
nla_total_size ( sizeof ( u8 ) ) + /* TYPE */
nla_total_size ( sizeof ( u8 ) ) + /* PI */
nla_total_size ( sizeof ( u8 ) ) + /* VNET_HDR */
nla_total_size ( sizeof ( u8 ) ) + /* PERSIST */
nla_total_size ( sizeof ( u8 ) ) + /* MULTI_QUEUE */
nla_total_size ( sizeof ( u32 ) ) + /* NUM_QUEUES */
nla_total_size ( sizeof ( u32 ) ) + /* NUM_DISABLED_QUEUES */
0 ;
}
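tun_get_size() above sums `nla_total_size()` over nine attributes. Each attribute costs a 4-byte netlink attribute header plus its payload, rounded up to a 4-byte boundary, so a u8 attribute occupies as much space as a u32 one. The arithmetic can be checked with a small sketch. The `sketch_` names are hypothetical, and the two constants mirror the kernel's NLA_HDRLEN and NLA_ALIGNTO values.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NLA_ALIGNTO 4u /* attributes are padded to 4 bytes */
#define SKETCH_NLA_HDRLEN  4u /* aligned size of struct nlattr */

/* Space one attribute occupies: header + payload, rounded up to 4. */
static size_t sketch_nla_total_size(size_t payload)
{
	/* nla_len is a 16-bit field, so the payload must fit in it. */
	assert(payload <= 0xFFFFu - SKETCH_NLA_HDRLEN);

	return (SKETCH_NLA_HDRLEN + payload + SKETCH_NLA_ALIGNTO - 1) &
	       ~(size_t)(SKETCH_NLA_ALIGNTO - 1);
}

/* Same attribute list as tun_get_size(), with sizes written out. */
static size_t sketch_tun_get_size(void)
{
	return sketch_nla_total_size(4) + /* OWNER (u32) */
	       sketch_nla_total_size(4) + /* GROUP (u32) */
	       sketch_nla_total_size(1) + /* TYPE */
	       sketch_nla_total_size(1) + /* PI */
	       sketch_nla_total_size(1) + /* VNET_HDR */
	       sketch_nla_total_size(1) + /* PERSIST */
	       sketch_nla_total_size(1) + /* MULTI_QUEUE */
	       sketch_nla_total_size(4) + /* NUM_QUEUES */
	       sketch_nla_total_size(4);  /* NUM_DISABLED_QUEUES */
}
```

Under these assumptions every attribute takes 8 bytes, giving 72 bytes in total. This is an upper bound, since tun_fill_info() emits the owner, group and queue-count attributes only conditionally.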
static int tun_fill_info ( struct sk_buff * skb , const struct net_device * dev )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
if ( nla_put_u8 ( skb , IFLA_TUN_TYPE , tun - > flags & TUN_TYPE_MASK ) )
goto nla_put_failure ;
if ( uid_valid ( tun - > owner ) & &
nla_put_u32 ( skb , IFLA_TUN_OWNER ,
from_kuid_munged ( current_user_ns ( ) , tun - > owner ) ) )
goto nla_put_failure ;
if ( gid_valid ( tun - > group ) & &
nla_put_u32 ( skb , IFLA_TUN_GROUP ,
from_kgid_munged ( current_user_ns ( ) , tun - > group ) ) )
goto nla_put_failure ;
if ( nla_put_u8 ( skb , IFLA_TUN_PI , ! ( tun - > flags & IFF_NO_PI ) ) )
goto nla_put_failure ;
if ( nla_put_u8 ( skb , IFLA_TUN_VNET_HDR , ! ! ( tun - > flags & IFF_VNET_HDR ) ) )
goto nla_put_failure ;
if ( nla_put_u8 ( skb , IFLA_TUN_PERSIST , ! ! ( tun - > flags & IFF_PERSIST ) ) )
goto nla_put_failure ;
if ( nla_put_u8 ( skb , IFLA_TUN_MULTI_QUEUE ,
! ! ( tun - > flags & IFF_MULTI_QUEUE ) ) )
goto nla_put_failure ;
if ( tun - > flags & IFF_MULTI_QUEUE ) {
if ( nla_put_u32 ( skb , IFLA_TUN_NUM_QUEUES , tun - > numqueues ) )
goto nla_put_failure ;
if ( nla_put_u32 ( skb , IFLA_TUN_NUM_DISABLED_QUEUES ,
tun - > numdisabled ) )
goto nla_put_failure ;
}
return 0 ;
nla_put_failure :
return - EMSGSIZE ;
}
2009-01-21 16:02:16 -08:00
static struct rtnl_link_ops tun_link_ops __read_mostly = {
. kind = DRV_NAME ,
. priv_size = sizeof ( struct tun_struct ) ,
. setup = tun_setup ,
. validate = tun_validate ,
2018-02-16 11:03:07 +01:00
. get_size = tun_get_size ,
. fill_info = tun_fill_info ,
2009-01-21 16:02:16 -08:00
} ;
2009-02-05 21:25:32 -08:00
static void tun_sock_write_space ( struct sock * sk )
{
2012-10-31 19:45:57 +00:00
struct tun_file * tfile ;
2010-04-29 11:01:49 +00:00
wait_queue_head_t * wqueue ;
2009-02-05 21:25:32 -08:00
if ( ! sock_writeable ( sk ) )
return ;
2015-11-29 20:03:10 -08:00
if ( ! test_and_clear_bit ( SOCKWQ_ASYNC_NOSPACE , & sk - > sk_socket - > flags ) )
2009-02-05 21:25:32 -08:00
return ;
2010-04-29 11:01:49 +00:00
wqueue = sk_sleep ( sk ) ;
if ( wqueue & & waitqueue_active ( wqueue ) )
2018-02-11 14:34:03 -08:00
wake_up_interruptible_sync_poll ( wqueue , EPOLLOUT |
EPOLLWRNORM | EPOLLWRBAND ) ;
2009-06-03 21:45:55 -07:00
2012-10-31 19:45:57 +00:00
tfile = container_of ( sk , struct tun_file , sk ) ;
kill_fasync ( & tfile - > fasync , SIGIO , POLL_OUT ) ;
2009-02-05 21:25:32 -08:00
}
2018-11-15 17:43:10 +08:00
static void tun_put_page ( struct tun_page * tpage )
{
if ( tpage - > page )
__page_frag_cache_drain ( tpage - > page , tpage - > count ) ;
}
2018-09-12 11:17:07 +08:00
static int tun_xdp_one ( struct tun_struct * tun ,
struct tun_file * tfile ,
2018-11-15 17:43:10 +08:00
struct xdp_buff * xdp , int * flush ,
struct tun_page * tpage )
2018-09-12 11:17:07 +08:00
{
2018-12-03 18:09:24 +09:00
unsigned int datasize = xdp - > data_end - xdp - > data ;
2018-09-12 11:17:07 +08:00
struct tun_xdp_hdr * hdr = xdp - > data_hard_start ;
struct virtio_net_hdr * gso = & hdr - > gso ;
struct bpf_prog * xdp_prog ;
struct sk_buff * skb = NULL ;
2022-02-28 11:38:05 +08:00
struct sk_buff_head * queue ;
2018-09-12 11:17:07 +08:00
u32 rxhash = 0 , act ;
int buflen = hdr - > buflen ;
2022-02-28 11:38:05 +08:00
int ret = 0 ;
2018-09-12 11:17:07 +08:00
bool skb_xdp = false ;
2018-11-15 17:43:10 +08:00
struct page * page ;
2018-09-12 11:17:07 +08:00
xdp_prog = rcu_dereference ( tun - > xdp_prog ) ;
if ( xdp_prog ) {
if ( gso - > gso_type ) {
skb_xdp = true ;
goto build ;
}
2020-12-22 22:09:28 +01:00
xdp_init_buff ( xdp , buflen , & tfile - > xdp_rxq ) ;
2018-09-12 11:17:07 +08:00
xdp_set_data_meta_invalid ( xdp ) ;
act = bpf_prog_run_xdp ( xdp_prog , xdp ) ;
2022-02-28 11:38:05 +08:00
ret = tun_xdp_act ( tun , xdp_prog , xdp , act ) ;
if ( ret < 0 ) {
2018-09-12 11:17:07 +08:00
put_page ( virt_to_head_page ( xdp - > data ) ) ;
2022-02-28 11:38:05 +08:00
return ret ;
2018-09-12 11:17:07 +08:00
}
2022-02-28 11:38:05 +08:00
switch ( ret ) {
2018-09-12 11:17:07 +08:00
case XDP_REDIRECT :
* flush = true ;
2020-08-23 17:36:59 -05:00
fallthrough ;
2018-09-12 11:17:07 +08:00
case XDP_TX :
return 0 ;
case XDP_PASS :
break ;
default :
2018-11-15 17:43:10 +08:00
page = virt_to_head_page ( xdp - > data ) ;
if ( tpage - > page = = page ) {
+ + tpage - > count ;
} else {
tun_put_page ( tpage ) ;
tpage - > page = page ;
tpage - > count = 1 ;
}
2018-09-12 11:17:07 +08:00
return 0 ;
}
}
build :
skb = build_skb ( xdp - > data_hard_start , buflen ) ;
if ( ! skb ) {
2022-02-28 11:38:05 +08:00
ret = - ENOMEM ;
2018-09-12 11:17:07 +08:00
goto out ;
}
skb_reserve ( skb , xdp - > data - xdp - > data_hard_start ) ;
skb_put ( skb , xdp - > data_end - xdp - > data ) ;
if ( virtio_net_hdr_to_skb ( skb , gso , tun_is_little_endian ( tun ) ) ) {
2020-11-07 21:50:56 +01:00
atomic_long_inc ( & tun - > rx_frame_errors ) ;
2018-09-12 11:17:07 +08:00
kfree_skb ( skb ) ;
2022-02-28 11:38:05 +08:00
ret = - EINVAL ;
2018-09-12 11:17:07 +08:00
goto out ;
}
skb - > protocol = eth_type_trans ( skb , tun - > dev ) ;
skb_reset_network_header ( skb ) ;
net: Don't set transport offset to invalid value
If the socket was created with socket(AF_PACKET, SOCK_RAW, 0),
skb->protocol will be unset, __skb_flow_dissect() will fail, and
skb_probe_transport_header() will fall back to the offset_hint, making
the resulting skb_transport_offset incorrect.
If, however, there is no transport header in the packet,
transport_header shouldn't be set to an arbitrary value.
Fix it by leaving the transport offset unset if it couldn't be found, to
be explicit rather than to fill it with some wrong value. It changes the
behavior, but if some code relied on the old behavior, it would be
broken anyway, as the old one is incorrect.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 12:39:57 +00:00
skb_probe_transport_header ( skb ) ;
2020-04-10 18:20:59 +02:00
skb_record_rx_queue ( skb , tfile - > queue_index ) ;
2018-09-12 11:17:07 +08:00
if ( skb_xdp ) {
2022-02-28 11:38:05 +08:00
ret = do_xdp_generic ( xdp_prog , skb ) ;
if ( ret ! = XDP_PASS ) {
ret = 0 ;
2018-09-12 11:17:07 +08:00
goto out ;
2022-02-28 11:38:05 +08:00
}
2018-09-12 11:17:07 +08:00
}
2018-11-07 10:34:36 +01:00
if ( ! rcu_dereference ( tun - > steering_prog ) & & tun - > numqueues > 1 & &
! tfile - > detached )
2018-09-12 11:17:07 +08:00
rxhash = __skb_get_hash_symmetric ( skb ) ;
2022-02-28 11:38:05 +08:00
if ( tfile - > napi_enabled ) {
queue = & tfile - > sk . sk_write_queue ;
spin_lock ( & queue - > lock ) ;
tun: Fix memory leak for detached NAPI queue.
syzkaller reported [0] memory leaks of sk and skb related to the TUN
device with no repro, but we can reproduce it easily with:
struct ifreq ifr = {};
int fd_tun, fd_tmp;
char buf[4] = {};
fd_tun = openat(AT_FDCWD, "/dev/net/tun", O_WRONLY, 0);
ifr.ifr_flags = IFF_TUN | IFF_NAPI | IFF_MULTI_QUEUE;
ioctl(fd_tun, TUNSETIFF, &ifr);
ifr.ifr_flags = IFF_DETACH_QUEUE;
ioctl(fd_tun, TUNSETQUEUE, &ifr);
fd_tmp = socket(AF_PACKET, SOCK_PACKET, 0);
ifr.ifr_flags = IFF_UP;
ioctl(fd_tmp, SIOCSIFFLAGS, &ifr);
write(fd_tun, buf, sizeof(buf));
close(fd_tun);
If we enable NAPI and multi-queue on a TUN device, we can put skb into
tfile->sk.sk_write_queue after the queue is detached. We should prevent
it by checking tfile->detached before queuing skb.
Note this must be done under tfile->sk.sk_write_queue.lock because write()
and ioctl(IFF_DETACH_QUEUE) can run concurrently. Otherwise, there would
be a small race window:
write() ioctl(IFF_DETACH_QUEUE)
`- tun_get_user `- __tun_detach
|- if (tfile->detached) |- tun_disable_queue
| `-> false | `- tfile->detached = tun
| `- tun_queue_purge
|- spin_lock_bh(&queue->lock)
`- __skb_queue_tail(queue, skb)
Another solution is to call tun_queue_purge() when closing and
reattaching the detached queue, but it could paper over other
problems. Also, we do the same kind of test for IFF_NAPI_FRAGS.
[0]:
unreferenced object 0xffff88801edbc800 (size 2048):
comm "syz-executor.1", pid 33269, jiffies 4295743834 (age 18.756s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
backtrace:
[<000000008c16ea3d>] __do_kmalloc_node mm/slab_common.c:965 [inline]
[<000000008c16ea3d>] __kmalloc+0x4a/0x130 mm/slab_common.c:979
[<000000003addde56>] kmalloc include/linux/slab.h:563 [inline]
[<000000003addde56>] sk_prot_alloc+0xef/0x1b0 net/core/sock.c:2035
[<000000003e20621f>] sk_alloc+0x36/0x2f0 net/core/sock.c:2088
[<0000000028e43843>] tun_chr_open+0x3d/0x190 drivers/net/tun.c:3438
[<000000001b0f1f28>] misc_open+0x1a6/0x1f0 drivers/char/misc.c:165
[<000000004376f706>] chrdev_open+0x111/0x300 fs/char_dev.c:414
[<00000000614d379f>] do_dentry_open+0x2f9/0x750 fs/open.c:920
[<000000008eb24774>] do_open fs/namei.c:3636 [inline]
[<000000008eb24774>] path_openat+0x143f/0x1a30 fs/namei.c:3791
[<00000000955077b5>] do_filp_open+0xce/0x1c0 fs/namei.c:3818
[<00000000b78973b0>] do_sys_openat2+0xf0/0x260 fs/open.c:1356
[<00000000057be699>] do_sys_open fs/open.c:1372 [inline]
[<00000000057be699>] __do_sys_openat fs/open.c:1388 [inline]
[<00000000057be699>] __se_sys_openat fs/open.c:1383 [inline]
[<00000000057be699>] __x64_sys_openat+0x83/0xf0 fs/open.c:1383
[<00000000a7d2182d>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<00000000a7d2182d>] do_syscall_64+0x3c/0x90 arch/x86/entry/common.c:80
[<000000004cc4e8c4>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
unreferenced object 0xffff88802f671700 (size 240):
comm "syz-executor.1", pid 33269, jiffies 4295743854 (age 18.736s)
hex dump (first 32 bytes):
68 c9 db 1e 80 88 ff ff 68 c9 db 1e 80 88 ff ff h.......h.......
00 c0 7b 2f 80 88 ff ff 00 c8 db 1e 80 88 ff ff ..{/............
backtrace:
[<00000000e9d9fdb6>] __alloc_skb+0x223/0x250 net/core/skbuff.c:644
[<000000002c3e4e0b>] alloc_skb include/linux/skbuff.h:1288 [inline]
[<000000002c3e4e0b>] alloc_skb_with_frags+0x6f/0x350 net/core/skbuff.c:6378
[<00000000825f98d7>] sock_alloc_send_pskb+0x3ac/0x3e0 net/core/sock.c:2729
[<00000000e9eb3df3>] tun_alloc_skb drivers/net/tun.c:1529 [inline]
[<00000000e9eb3df3>] tun_get_user+0x5e1/0x1f90 drivers/net/tun.c:1841
[<0000000053096912>] tun_chr_write_iter+0xac/0x120 drivers/net/tun.c:2035
[<00000000b9282ae0>] call_write_iter include/linux/fs.h:1868 [inline]
[<00000000b9282ae0>] new_sync_write fs/read_write.c:491 [inline]
[<00000000b9282ae0>] vfs_write+0x40f/0x530 fs/read_write.c:584
[<00000000524566e4>] ksys_write+0xa1/0x170 fs/read_write.c:637
[<00000000a7d2182d>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<00000000a7d2182d>] do_syscall_64+0x3c/0x90 arch/x86/entry/common.c:80
[<000000004cc4e8c4>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
Fixes: cde8b15f1aab ("tuntap: add ioctl to attach or detach a file form tuntap device")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15 11:42:04 -07:00
if ( unlikely ( tfile - > detached ) ) {
spin_unlock ( & queue - > lock ) ;
kfree_skb ( skb ) ;
return - EBUSY ;
}
2022-02-28 11:38:05 +08:00
__skb_queue_tail ( queue , skb ) ;
spin_unlock ( & queue - > lock ) ;
ret = 1 ;
} else {
netif_receive_skb ( skb ) ;
ret = 0 ;
}
2018-09-12 11:17:07 +08:00
2020-11-07 21:50:56 +01:00
/* No need to disable preemption here since this function is
2018-12-11 11:43:07 +09:00
* always called with bh disabled
*/
2020-11-07 21:50:56 +01:00
dev_sw_netstats_rx_add ( tun - > dev , datasize ) ;
2018-09-12 11:17:07 +08:00
if ( rxhash )
tun_flow_update ( tun , rxhash , tfile ) ;
out :
2022-02-28 11:38:05 +08:00
return ret ;
2018-09-12 11:17:07 +08:00
}
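The default branch of the switch in tun_xdp_one() defers page releases: consecutive frames that share a head page only bump a counter, and the references are dropped in a single call when a different page shows up or when the caller finally invokes tun_put_page(). The following is a user-space sketch of that batching idea only. The sketch_* types and the plain integer reference counter are hypothetical simplifications, not the kernel's struct page handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page with a plain reference counter. */
struct sketch_page {
	int refs;
};

/* Mirrors the shape of struct tun_page: last page seen and pending drops. */
struct sketch_tun_page {
	struct sketch_page *page;
	int count;
};

/* Counts how many batched release calls were issued. */
static int sketch_drain_calls;

/* Analogue of tun_put_page(): drop all pending references in one call. */
static void sketch_put_page(struct sketch_tun_page *tpage)
{
	if (tpage->page) {
		assert(tpage->page->refs >= tpage->count);
		tpage->page->refs -= tpage->count;
		sketch_drain_calls++;
	}
}

/* Analogue of the default branch: coalesce drops that hit the same page. */
static void sketch_drop_frame(struct sketch_tun_page *tpage,
			      struct sketch_page *page)
{
	if (tpage->page == page) {
		++tpage->count;
	} else {
		sketch_put_page(tpage);
		tpage->page = page;
		tpage->count = 1;
	}
}
```

Dropping frames on pages A, A, B, A and then flushing issues three release calls instead of four in this model; the saving grows with longer runs of frames on the same page.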
2015-03-02 15:37:48 +08:00
static int tun_sendmsg ( struct socket * sock , struct msghdr * m , size_t total_len )
2010-01-14 06:17:09 +00:00
{
2018-09-12 11:17:07 +08:00
int ret , i ;
2012-10-31 19:45:57 +00:00
struct tun_file * tfile = container_of ( sock , struct tun_file , socket ) ;
2017-09-23 22:36:52 +08:00
struct tun_struct * tun = tun_get ( tfile ) ;
2018-09-12 11:17:06 +08:00
struct tun_msg_ctl * ctl = m - > msg_control ;
2018-09-12 11:17:07 +08:00
struct xdp_buff * xdp ;
2012-10-31 19:45:57 +00:00
if ( ! tun )
return - EBADFD ;
2014-06-19 15:36:49 -04:00
2022-03-03 10:24:40 +08:00
if ( m - > msg_controllen = = sizeof ( struct tun_msg_ctl ) & &
ctl & & ctl - > type = = TUN_MSG_PTR ) {
2018-11-17 16:53:46 -08:00
struct tun_page tpage ;
2018-09-12 11:17:07 +08:00
int n = ctl - > num ;
2022-02-28 11:38:05 +08:00
int flush = 0 , queued = 0 ;
2018-09-12 11:17:07 +08:00
2018-11-17 16:53:46 -08:00
memset ( & tpage , 0 , sizeof ( tpage ) ) ;
2018-09-12 11:17:07 +08:00
local_bh_disable ( ) ;
rcu_read_lock ( ) ;
for ( i = 0 ; i < n ; i + + ) {
xdp = & ( ( struct xdp_buff * ) ctl - > ptr ) [ i ] ;
2022-02-28 11:38:05 +08:00
ret = tun_xdp_one ( tun , tfile , xdp , & flush , & tpage ) ;
if ( ret > 0 )
queued + = ret ;
2018-09-12 11:17:07 +08:00
}
if ( flush )
xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths
Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
we can re-use the bulking for the non-map version of the bpf_redirect()
helper. This is a simple matter of having xdp_do_redirect_slow() queue the
frame on the bulk queue instead of sending it out with __bpf_tx_xdp().
Unfortunately we can't make the bpf_redirect() helper return an error if
the ifindex doesn't exist (as bpf_redirect_map() does), because we don't
have a reference to the network namespace of the ingress device at the time
the helper is called. So we have to leave it as-is and keep the device
lookup in xdp_do_redirect_slow().
Since this leaves less reason to have the non-map redirect code in a
separate function, we get rid of the xdp_do_redirect_slow() function
entirely. This does lose us the tracepoint disambiguation, but fortunately
the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
entry structures. This means both can contain a map index, so we can just
amend the tracepoint definitions so we always emit the xdp_redirect(_err)
tracepoints, but with the map ID only populated if a map is present. This
means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
the definitions around in case someone is still listening for them.
With this change, the performance of the xdp_redirect sample program goes
from 5Mpps to 8.4Mpps (a 68% increase).
Since the flush functions are no longer map-specific, rename the flush()
functions to drop _map from their names. One of the renamed functions is
the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
keep from having to update all drivers, use a #define to keep the old name
working, and only update the virtual drivers in this patch.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
2020-01-16 16:14:45 +01:00
xdp_do_flush ( ) ;
2018-09-12 11:17:07 +08:00
2022-02-28 11:38:05 +08:00
if ( tfile - > napi_enabled & & queued > 0 )
napi_schedule ( & tfile - > napi ) ;
2018-09-12 11:17:07 +08:00
rcu_read_unlock ( ) ;
local_bh_enable ( ) ;
2018-11-15 17:43:10 +08:00
tun_put_page ( & tpage ) ;
2018-09-12 11:17:07 +08:00
ret = total_len ;
goto out ;
}
2018-09-12 11:17:06 +08:00
ret = tun_get_user ( tun , tfile , ctl ? ctl - > ptr : NULL , & m - > msg_iter ,
2017-01-18 15:02:03 +08:00
m - > msg_flags & MSG_DONTWAIT ,
m - > msg_flags & MSG_MORE ) ;
2018-09-12 11:17:07 +08:00
out :
2012-10-31 19:45:57 +00:00
tun_put ( tun ) ;
return ret ;
2010-01-14 06:17:09 +00:00
}
2015-03-02 15:37:48 +08:00
static int tun_recvmsg ( struct socket * sock , struct msghdr * m , size_t total_len ,
2010-01-14 06:17:09 +00:00
int flags )
{
2012-10-31 19:45:57 +00:00
struct tun_file * tfile = container_of ( sock , struct tun_file , socket ) ;
2017-09-23 22:36:52 +08:00
struct tun_struct * tun = tun_get ( tfile ) ;
2018-01-04 11:14:28 +08:00
void * ptr = m - > msg_control ;
2010-01-14 06:17:09 +00:00
int ret ;
2012-10-31 19:45:57 +00:00
2017-12-01 05:10:37 -05:00
if ( ! tun ) {
ret = - EBADFD ;
2018-01-04 11:14:28 +08:00
goto out_free ;
2017-12-01 05:10:37 -05:00
}
2012-10-31 19:45:57 +00:00
2013-07-19 19:40:10 +02:00
if ( flags & ~ ( MSG_DONTWAIT | MSG_TRUNC | MSG_ERRQUEUE ) ) {
2013-04-24 21:59:23 +00:00
ret = - EINVAL ;
2017-12-01 05:10:37 -05:00
goto out_put_tun ;
2013-04-24 21:59:23 +00:00
}
2013-07-19 19:40:10 +02:00
if ( flags & MSG_ERRQUEUE ) {
ret = sock_recv_errqueue ( sock - > sk , m , total_len ,
SOL_PACKET , TUN_TX_TIMESTAMP ) ;
goto out ;
}
2018-01-04 11:14:28 +08:00
ret = tun_do_read ( tun , tfile , & m - > msg_iter , flags & MSG_DONTWAIT , ptr ) ;
2014-12-25 23:05:03 -08:00
if ( ret > ( ssize_t ) total_len ) {
2013-12-10 22:05:45 -05:00
m - > msg_flags | = MSG_TRUNC ;
ret = flags & MSG_TRUNC ? ret : total_len ;
}
2013-04-24 21:59:23 +00:00
out :
2012-10-31 19:45:57 +00:00
tun_put ( tun ) ;
2010-01-14 06:17:09 +00:00
return ret ;
2017-12-01 05:10:37 -05:00
out_put_tun :
tun_put ( tun ) ;
2018-01-04 11:14:28 +08:00
out_free :
tun_ptr_free ( ptr ) ;
2017-12-01 05:10:37 -05:00
return ret ;
2010-01-14 06:17:09 +00:00
}
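The truncation handling at the end of tun_recvmsg() follows the usual datagram convention: when the packet is larger than the caller's buffer, MSG_TRUNC is set in msg_flags, and the return value is the full packet length only if the caller also passed MSG_TRUNC. The following is a small user-space model of just that decision. The function name sketch_recv_result is hypothetical and the read itself is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h> /* MSG_TRUNC */
#include <sys/types.h>  /* ssize_t */

/* pkt_len:   what the read produced (packet length, or a negative errno)
 * total_len: size of the caller's buffer
 * flags:     flags passed by the caller
 * msg_flags: output flags reported back to the caller
 */
static ssize_t sketch_recv_result(ssize_t pkt_len, size_t total_len,
				  int flags, int *msg_flags)
{
	assert(msg_flags != NULL);

	if (pkt_len > (ssize_t)total_len) {
		*msg_flags |= MSG_TRUNC;
		/* Report the real length only when the caller asked for it. */
		return (flags & MSG_TRUNC) ? pkt_len : (ssize_t)total_len;
	}
	/* Packets that fit, and negative error codes, pass through unchanged. */
	return pkt_len;
}
```

A caller that wants to learn the real size of an oversized packet therefore passes MSG_TRUNC and compares the return value with its buffer size.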
2018-01-04 11:14:28 +08:00
static int tun_ptr_peek_len ( void * ptr )
{
if ( likely ( ptr ) ) {
2018-04-17 16:45:47 +02:00
if ( tun_is_xdp_frame ( ptr ) ) {
struct xdp_frame * xdpf = tun_ptr_to_xdp ( ptr ) ;
2018-01-04 11:14:28 +08:00
2018-04-17 16:45:47 +02:00
return xdpf - > len ;
2018-01-04 11:14:28 +08:00
}
return __skb_array_len_with_tag ( ptr ) ;
} else {
return 0 ;
}
}
2016-06-30 14:45:36 +08:00
static int tun_peek_len ( struct socket * sock )
{
struct tun_file * tfile = container_of ( sock , struct tun_file , socket ) ;
struct tun_struct * tun ;
int ret = 0 ;
2017-09-23 22:36:52 +08:00
tun = tun_get ( tfile ) ;
2016-06-30 14:45:36 +08:00
if ( ! tun )
return 0 ;
2018-01-04 11:14:28 +08:00
ret = PTR_RING_PEEK_CALL ( & tfile - > tx_ring , tun_ptr_peek_len ) ;
2016-06-30 14:45:36 +08:00
tun_put ( tun ) ;
return ret ;
}
2010-01-14 06:17:09 +00:00
/* Ops structure to mimic raw sockets with tun */
static const struct proto_ops tun_socket_ops = {
2016-06-30 14:45:36 +08:00
. peek_len = tun_peek_len ,
2010-01-14 06:17:09 +00:00
. sendmsg = tun_sendmsg ,
. recvmsg = tun_recvmsg ,
} ;
2009-02-05 21:25:32 -08:00
static struct proto tun_proto = {
. name = " tun " ,
. owner = THIS_MODULE ,
2012-10-31 19:45:57 +00:00
. obj_size = sizeof ( struct tun_file ) ,
2009-02-05 21:25:32 -08:00
} ;
2009-01-21 16:02:16 -08:00
2009-05-09 22:54:21 -07:00
static int tun_flags ( struct tun_struct * tun )
{
2014-11-19 14:44:40 +02:00
return tun - > flags & ( TUN_FEATURES | IFF_PERSIST | IFF_TUN | IFF_TAP ) ;
2009-05-09 22:54:21 -07:00
}
2021-05-19 10:38:50 +08:00
static ssize_t tun_flags_show ( struct device * dev , struct device_attribute * attr ,
2009-05-09 22:54:21 -07:00
char * buf )
{
struct tun_struct * tun = netdev_priv ( to_net_dev ( dev ) ) ;
2022-09-28 19:49:43 +08:00
return sysfs_emit ( buf , " 0x%x \n " , tun_flags ( tun ) ) ;
2009-05-09 22:54:21 -07:00
}
2021-05-19 10:38:50 +08:00
static ssize_t owner_show ( struct device * dev , struct device_attribute * attr ,
char * buf )
2009-05-09 22:54:21 -07:00
{
struct tun_struct * tun = netdev_priv ( to_net_dev ( dev ) ) ;
2012-02-07 16:48:55 -08:00
return uid_valid ( tun - > owner ) ?
2022-09-28 19:49:43 +08:00
sysfs_emit ( buf , " %u \n " ,
from_kuid_munged ( current_user_ns ( ) , tun - > owner ) ) :
sysfs_emit ( buf , " -1 \n " ) ;
2009-05-09 22:54:21 -07:00
}
2021-05-19 10:38:50 +08:00
static ssize_t group_show ( struct device * dev , struct device_attribute * attr ,
char * buf )
2009-05-09 22:54:21 -07:00
{
struct tun_struct * tun = netdev_priv ( to_net_dev ( dev ) ) ;
2012-02-07 16:48:55 -08:00
return gid_valid ( tun - > group ) ?
2022-09-28 19:49:43 +08:00
sysfs_emit ( buf , " %u \n " ,
from_kgid_munged ( current_user_ns ( ) , tun - > group ) ) :
sysfs_emit ( buf , " -1 \n " ) ;
2009-05-09 22:54:21 -07:00
}
2021-05-19 10:38:50 +08:00
static DEVICE_ATTR_RO ( tun_flags ) ;
static DEVICE_ATTR_RO ( owner ) ;
static DEVICE_ATTR_RO ( group ) ;
2009-05-09 22:54:21 -07:00
2015-02-04 14:37:34 +01:00
static struct attribute * tun_dev_attrs [ ] = {
& dev_attr_tun_flags . attr ,
& dev_attr_owner . attr ,
& dev_attr_group . attr ,
NULL
} ;
static const struct attribute_group tun_attr_group = {
. attrs = tun_dev_attrs
} ;
2008-04-16 00:41:16 -07:00
static int tun_set_iff ( struct net * net , struct file * file , struct ifreq * ifr )
2005-04-16 15:20:36 -07:00
{
struct tun_struct * tun ;
2012-10-31 19:45:57 +00:00
struct tun_file * tfile = file - > private_data ;
2005-04-16 15:20:36 -07:00
struct net_device * dev ;
int err ;
2013-01-11 16:59:33 +00:00
if ( tfile - > detached )
return - EINVAL ;
2017-09-22 13:49:15 -07:00
if ( ( ifr - > ifr_flags & IFF_NAPI_FRAGS ) ) {
if ( ! capable ( CAP_NET_ADMIN ) )
return - EPERM ;
if ( ! ( ifr - > ifr_flags & IFF_NAPI ) | |
( ifr - > ifr_flags & TUN_TYPE_MASK ) ! = IFF_TAP )
return - EINVAL ;
}
2009-01-20 10:56:20 +00:00
dev = __dev_get_by_name ( net , ifr - > ifr_name ) ;
if ( dev ) {
2009-04-27 03:23:54 -07:00
if ( ifr - > ifr_flags & IFF_TUN_EXCL )
return - EBUSY ;
2009-01-20 10:56:20 +00:00
if ( ( ifr - > ifr_flags & IFF_TUN ) & & dev - > netdev_ops = = & tun_netdev_ops )
tun = netdev_priv ( dev ) ;
else if ( ( ifr - > ifr_flags & IFF_TAP ) & & dev - > netdev_ops = = & tap_netdev_ops )
tun = netdev_priv ( dev ) ;
else
return - EINVAL ;
2013-05-28 18:32:11 +00:00
if ( ! ! ( ifr - > ifr_flags & IFF_MULTI_QUEUE ) ! =
2014-11-19 15:17:31 +02:00
! ! ( tun - > flags & IFF_MULTI_QUEUE ) )
2013-05-28 18:32:11 +00:00
return - EINVAL ;
2012-10-31 19:46:01 +00:00
if ( tun_not_capable ( tun ) )
2009-08-28 18:12:43 -04:00
return - EPERM ;
2013-01-14 07:12:19 +00:00
err = security_tun_dev_open ( tun - > security ) ;
2009-08-28 18:12:43 -04:00
if ( err < 0 )
return err ;
2017-09-22 13:49:14 -07:00
err = tun_attach ( tun , file , ifr - > ifr_flags & IFF_NOFILTER ,
2018-09-28 14:51:49 -07:00
ifr - > ifr_flags & IFF_NAPI ,
2019-09-10 18:56:57 +08:00
ifr - > ifr_flags & IFF_NAPI_FRAGS , true ) ;
2009-01-20 10:57:48 +00:00
if ( err < 0 )
return err ;
2012-12-13 23:53:30 +00:00
2014-11-19 15:17:31 +02:00
if ( tun - > flags & IFF_MULTI_QUEUE & &
2013-04-22 20:40:39 +00:00
( tun - > numqueues + tun - > numdisabled > 1 ) ) {
/* One or more queues have already been attached, no need
* to initialize the device again .
*/
2018-04-10 16:28:56 +02:00
netdev_state_change ( dev ) ;
2013-04-22 20:40:39 +00:00
return 0 ;
}
2018-04-10 16:28:55 +02:00
tun - > flags = ( tun - > flags & ~ TUN_FEATURES ) |
( ifr - > ifr_flags & TUN_FEATURES ) ;
2018-04-10 16:28:56 +02:00
netdev_state_change ( dev ) ;
} else {
2005-04-16 15:20:36 -07:00
char * name ;
unsigned long flags = 0 ;
2013-01-23 03:59:12 +00:00
int queues = ifr - > ifr_flags & IFF_MULTI_QUEUE ?
MAX_TAP_QUEUES : 1 ;
2005-04-16 15:20:36 -07:00
2012-11-18 21:34:11 +00:00
if ( ! ns_capable ( net - > user_ns , CAP_NET_ADMIN ) )
2006-06-22 16:07:52 -07:00
return - EPERM ;
2009-08-28 18:12:43 -04:00
err = security_tun_dev_create ( ) ;
if ( err < 0 )
return err ;
2006-06-22 16:07:52 -07:00
2005-04-16 15:20:36 -07:00
/* Set dev type */
if ( ifr - > ifr_flags & IFF_TUN ) {
/* TUN device */
2014-11-19 15:17:31 +02:00
flags | = IFF_TUN ;
2005-04-16 15:20:36 -07:00
name = " tun%d " ;
} else if ( ifr - > ifr_flags & IFF_TAP ) {
/* TAP device */
2014-11-19 15:17:31 +02:00
flags | = IFF_TAP ;
2005-04-16 15:20:36 -07:00
name = " tap%d " ;
2006-09-13 13:24:59 -04:00
} else
2009-09-16 21:36:13 +00:00
return - EINVAL ;
2006-09-13 13:24:59 -04:00
2005-04-16 15:20:36 -07:00
if ( * ifr - > ifr_name )
name = ifr - > ifr_name ;
2012-10-31 19:46:00 +00:00
dev = alloc_netdev_mqs ( sizeof ( struct tun_struct ) , name ,
net: set name_assign_type in alloc_netdev()
Extend alloc_netdev{,_mq{,s}}() to take name_assign_type as argument, and convert
all users to pass NET_NAME_UNKNOWN.
Coccinelle patch:
@@
expression sizeof_priv, name, setup, txqs, rxqs, count;
@@
(
-alloc_netdev_mqs(sizeof_priv, name, setup, txqs, rxqs)
+alloc_netdev_mqs(sizeof_priv, name, NET_NAME_UNKNOWN, setup, txqs, rxqs)
|
-alloc_netdev_mq(sizeof_priv, name, setup, count)
+alloc_netdev_mq(sizeof_priv, name, NET_NAME_UNKNOWN, setup, count)
|
-alloc_netdev(sizeof_priv, name, setup)
+alloc_netdev(sizeof_priv, name, NET_NAME_UNKNOWN, setup)
)
v9: move comments here from the wrong commit
Signed-off-by: Tom Gundersen <teg@jklm.no>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-14 16:37:24 +02:00
NET_NAME_UNKNOWN , tun_setup , queues ,
queues ) ;
2013-01-23 03:59:12 +00:00
2005-04-16 15:20:36 -07:00
if ( ! dev )
return - ENOMEM ;
2008-04-16 00:41:53 -07:00
dev_net_set ( dev , net ) ;
2009-01-21 16:02:16 -08:00
dev - > rtnl_link_ops = & tun_link_ops ;
tun: Add ability to create tun device with given index
Tun devices cannot be created with the ifindex the user wants, but this
is required by the checkpoint-restore project.
Long time ago such ability was implemented for rtnl_ops-based
interface for creating links (9c7dafbf net: Allow to create links
with given ifindex), but the only API for creating and managing
tuntap devices is ioctl-based and is evolving with adding new ones
(cde8b15f tuntap: add ioctl to attach or detach a file form tuntap
device).
Following that trend, this patch adds a new ioctl that sets the ifindex
for the device that _will_ be created by the TUNSETIFF ioctl.
So those who want a tuntap device with the ifindex N, should open
the tun device, call ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF.
If the index N is busy, register_netdev will detect this
and the ioctl will fail with -EBUSY.
If setifindex is not called, then it will be generated as before.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-21 14:31:38 +04:00
dev - > ifindex = tfile - > ifindex ;
2015-02-04 14:37:34 +01:00
dev - > sysfs_groups [ 0 ] = & tun_attr_group ;
2008-11-19 22:10:37 -08:00
2005-04-16 15:20:36 -07:00
tun = netdev_priv ( dev ) ;
tun - > dev = dev ;
tun - > flags = flags ;
2008-07-14 22:18:19 -07:00
tun - > txflt . count = 0 ;
2010-03-17 17:45:01 +02:00
tun - > vnet_hdr_sz = sizeof ( struct virtio_net_hdr ) ;
2009-02-05 21:25:32 -08:00
2016-02-26 10:45:40 +01:00
tun - > align = NET_SKB_PAD ;
2012-10-31 19:45:57 +00:00
tun - > filter_attached = false ;
tun - > sndbuf = tfile - > socket . sk - > sk_sndbuf ;
2017-01-18 15:02:03 +08:00
tun - > rx_batched = 0 ;
2017-12-04 17:31:23 +08:00
RCU_INIT_POINTER ( tun - > steering_prog , NULL ) ;
2009-02-05 21:25:32 -08:00
2021-12-16 13:25:32 -05:00
tun - > ifr = ifr ;
tun - > file = file ;
2016-04-13 10:52:20 +02:00
2021-12-16 13:25:32 -05:00
tun_net_initialize ( dev ) ;
2012-12-02 17:19:45 +00:00
2005-04-16 15:20:36 -07:00
err = register_netdevice ( tun - > dev ) ;
2021-12-16 13:25:32 -05:00
if ( err < 0 ) {
free_netdev ( dev ) ;
return err ;
}
2021-01-18 03:15:39 -08:00
/* free_netdev() won't check refcnt, to avoid race
2019-09-10 18:56:57 +08:00
* with dev_put ( ) we need to publish tun after registration .
*/
rcu_assign_pointer ( tfile - > tun , tun ) ;
2005-04-16 15:20:36 -07:00
}
2022-09-20 12:48:25 -07:00
if ( ifr - > ifr_flags & IFF_NO_CARRIER )
netif_carrier_off ( tun - > dev ) ;
else
netif_carrier_on ( tun - > dev ) ;
2013-01-28 00:38:02 +00:00
2008-07-10 16:59:11 -07:00
/* Make sure persistent devices do not get stuck in
* xoff state .
*/
if ( netif_running ( tun - > dev ) )
2012-10-31 19:46:00 +00:00
netif_tx_wake_all_queues ( tun - > dev ) ;
2008-07-10 16:59:11 -07:00
2005-04-16 15:20:36 -07:00
strcpy ( ifr - > ifr_name , tun - > dev - > name ) ;
return 0 ;
}
2019-03-20 12:16:53 +03:00
static void tun_get_iff ( struct tun_struct * tun , struct ifreq * ifr )
2008-08-15 15:09:56 -07:00
{
strcpy ( ifr - > ifr_name , tun - > dev - > name ) ;
2009-05-09 22:54:21 -07:00
ifr - > ifr_flags = tun_flags ( tun ) ;
2008-08-15 15:09:56 -07:00
}
2008-07-03 03:46:16 -07:00
/* This is like a cut-down ethtool ops, except done via tun fd so no
* privs required . */
2011-04-19 06:13:10 +00:00
static int set_offload ( struct tun_struct * tun , unsigned long arg )
2008-07-03 03:46:16 -07:00
{
2011-11-15 15:29:55 +00:00
netdev_features_t features = 0 ;
2008-07-03 03:46:16 -07:00
if ( arg & TUN_F_CSUM ) {
2011-04-19 06:13:10 +00:00
features | = NETIF_F_HW_CSUM ;
2008-07-03 03:46:16 -07:00
arg & = ~ TUN_F_CSUM ;
if ( arg & ( TUN_F_TSO4 | TUN_F_TSO6 ) ) {
if ( arg & TUN_F_TSO_ECN ) {
features | = NETIF_F_TSO_ECN ;
arg & = ~ TUN_F_TSO_ECN ;
}
if ( arg & TUN_F_TSO4 )
features | = NETIF_F_TSO ;
if ( arg & TUN_F_TSO6 )
features | = NETIF_F_TSO6 ;
arg & = ~ ( TUN_F_TSO4 | TUN_F_TSO6 ) ;
}
net: accept UFO datagrams from tuntap and packet
Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.
Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.
Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.
It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.
To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216fa8 ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643f1
("net: avoid skb_warn_bad_offload false positives on UFO").
(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32 and this is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.
Tested
Booted a v4.13 guest kernel with QEMU. On a host kernel before this
patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
enabled, same as on a v4.13 host kernel.
A UFO packet sent from the guest appears on the tap device:
host:
nc -l -u -p 8000 &
tcpdump -n -i tap0
guest:
dd if=/dev/zero of=payload.txt bs=1 count=2000
nc -u 192.16.1.1 8000 < payload.txt
Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
packets arriving fragmented:
./with_tap_pair.sh ./tap_send_ufo tap0 tap1
(from https://github.com/wdebruij/kerneltools/tree/master/tests)
Changes
v1 -> v2
- simplified set_offload change (review comment)
- documented test procedure
Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
Fixes: fb652fdfe837 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-21 10:22:25 -05:00
arg & = ~ TUN_F_UFO ;
2022-12-07 13:35:55 +02:00
/* TODO: for now USO4 and USO6 should work simultaneously */
if ( arg & TUN_F_USO4 & & arg & TUN_F_USO6 ) {
features | = NETIF_F_GSO_UDP_L4 ;
arg & = ~ ( TUN_F_USO4 | TUN_F_USO6 ) ;
}
2008-07-03 03:46:16 -07:00
}
/* This gives the user a way to test for new features in future by
* trying to set them . */
if ( arg )
return - EINVAL ;
2011-04-19 06:13:10 +00:00
tun - > set_features = features ;
2017-03-16 22:44:10 +03:00
tun - > dev - > wanted_features & = ~ TUN_USER_FEATURES ;
tun - > dev - > wanted_features | = features ;
2011-04-19 06:13:10 +00:00
netdev_update_features ( tun - > dev ) ;
2008-07-03 03:46:16 -07:00
return 0 ;
}
2012-10-31 19:46:00 +00:00
static void tun_detach_filter ( struct tun_struct * tun , int n )
{
int i ;
struct tun_file * tfile ;
for ( i = 0 ; i < n ; i + + ) {
2013-01-11 16:59:32 +00:00
tfile = rtnl_dereference ( tun - > tfiles [ i ] ) ;
2016-04-05 17:10:16 +02:00
lock_sock ( tfile - > socket . sk ) ;
sk_detach_filter ( tfile - > socket . sk ) ;
release_sock ( tfile - > socket . sk ) ;
2012-10-31 19:46:00 +00:00
}
tun - > filter_attached = false ;
}
static int tun_attach_filter ( struct tun_struct * tun )
{
int i , ret = 0 ;
struct tun_file * tfile ;
for ( i = 0 ; i < tun - > numqueues ; i + + ) {
2013-01-11 16:59:32 +00:00
tfile = rtnl_dereference ( tun - > tfiles [ i ] ) ;
2016-04-05 17:10:16 +02:00
lock_sock ( tfile - > socket . sk ) ;
ret = sk_attach_filter ( & tun - > fprog , tfile - > socket . sk ) ;
release_sock ( tfile - > socket . sk ) ;
2012-10-31 19:46:00 +00:00
if ( ret ) {
tun_detach_filter ( tun , i ) ;
return ret ;
}
}
tun - > filter_attached = true ;
return ret ;
}
static void tun_set_sndbuf ( struct tun_struct * tun )
{
struct tun_file * tfile ;
int i ;
for ( i = 0 ; i < tun - > numqueues ; i + + ) {
2013-01-11 16:59:32 +00:00
tfile = rtnl_dereference ( tun - > tfiles [ i ] ) ;
2012-10-31 19:46:00 +00:00
tfile - > socket . sk - > sk_sndbuf = tun - > sndbuf ;
}
}
2012-10-31 19:46:01 +00:00
static int tun_set_queue ( struct file * file , struct ifreq * ifr )
{
struct tun_file * tfile = file - > private_data ;
struct tun_struct * tun ;
int ret = 0 ;
rtnl_lock ( ) ;
if ( ifr - > ifr_flags & IFF_ATTACH_QUEUE ) {
2012-12-13 23:53:30 +00:00
tun = tfile - > detached ;
2013-01-14 07:12:19 +00:00
if ( ! tun ) {
2012-10-31 19:46:01 +00:00
ret = - EINVAL ;
2013-01-14 07:12:19 +00:00
goto unlock ;
}
ret = security_tun_dev_attach_queue ( tun - > security ) ;
if ( ret < 0 )
goto unlock ;
2018-09-28 14:51:49 -07:00
ret = tun_attach ( tun , file , false , tun - > flags & IFF_NAPI ,
2019-09-10 18:56:57 +08:00
tun - > flags & IFF_NAPI_FRAGS , true ) ;
2012-12-13 23:53:30 +00:00
} else if ( ifr - > ifr_flags & IFF_DETACH_QUEUE ) {
2013-01-11 16:59:32 +00:00
tun = rtnl_dereference ( tfile - > tun ) ;
2014-11-19 15:17:31 +02:00
if ( ! tun | | ! ( tun - > flags & IFF_MULTI_QUEUE ) | | tfile - > detached )
2012-12-13 23:53:30 +00:00
ret = - EINVAL ;
else
__tun_detach ( tfile , false ) ;
} else
2012-10-31 19:46:01 +00:00
ret = - EINVAL ;
2018-04-10 16:28:56 +02:00
if ( ret > = 0 )
netdev_state_change ( tun - > dev ) ;
2013-01-14 07:12:19 +00:00
unlock :
2012-10-31 19:46:01 +00:00
rtnl_unlock ( ) ;
return ret ;
}
2020-07-31 00:17:20 -04:00
static int tun_set_ebpf ( struct tun_struct * tun , struct tun_prog __rcu * * prog_p ,
2018-01-16 16:31:01 +08:00
void __user * data )
2017-12-04 17:31:23 +08:00
{
struct bpf_prog * prog ;
int fd ;
if ( copy_from_user ( & fd , data , sizeof ( fd ) ) )
return - EFAULT ;
if ( fd = = - 1 ) {
prog = NULL ;
} else {
prog = bpf_prog_get_type ( fd , BPF_PROG_TYPE_SOCKET_FILTER ) ;
if ( IS_ERR ( prog ) )
return PTR_ERR ( prog ) ;
}
2018-01-16 16:31:01 +08:00
return __tun_set_ebpf ( tun , prog_p , prog ) ;
2017-12-04 17:31:23 +08:00
}
2021-04-06 18:45:54 +01:00
/* Return correct value for tun->dev->addr_len based on tun->dev->type. */
static unsigned char tun_get_addr_len ( unsigned short type )
{
switch ( type ) {
case ARPHRD_IP6GRE :
case ARPHRD_TUNNEL6 :
return sizeof ( struct in6_addr ) ;
case ARPHRD_IPGRE :
case ARPHRD_TUNNEL :
case ARPHRD_SIT :
return 4 ;
case ARPHRD_ETHER :
return ETH_ALEN ;
case ARPHRD_IEEE802154 :
case ARPHRD_IEEE802154_MONITOR :
return IEEE802154_EXTENDED_ADDR_LEN ;
case ARPHRD_PHONET_PIPE :
case ARPHRD_PPP :
case ARPHRD_NONE :
return 0 ;
case ARPHRD_6LOWPAN :
return EUI64_ADDR_LEN ;
case ARPHRD_FDDI :
return FDDI_K_ALEN ;
case ARPHRD_HIPPI :
return HIPPI_ALEN ;
case ARPHRD_IEEE802 :
return FC_ALEN ;
case ARPHRD_ROSE :
return ROSE_ADDR_LEN ;
case ARPHRD_NETROM :
return AX25_ADDR_LEN ;
case ARPHRD_LOCALTLK :
return LTALK_ALEN ;
default :
return 0 ;
}
}
2009-11-06 22:52:32 -08:00
static long __tun_chr_ioctl ( struct file * file , unsigned int cmd ,
unsigned long arg , int ifreq_len )
2005-04-16 15:20:36 -07:00
{
2009-01-20 11:01:48 +00:00
struct tun_file * tfile = file - > private_data ;
2018-05-08 19:21:34 +03:00
struct net * net = sock_net ( & tfile - > sk ) ;
2009-01-20 11:00:40 +00:00
struct tun_struct * tun ;
2005-04-16 15:20:36 -07:00
void __user * argp = ( void __user * ) arg ;
2018-11-28 19:12:56 +01:00
unsigned int ifindex , carrier ;
2005-04-16 15:20:36 -07:00
struct ifreq ifr ;
2012-02-07 16:48:55 -08:00
kuid_t owner ;
kgid_t group ;
2009-02-05 21:25:32 -08:00
int sndbuf ;
2010-03-17 17:45:01 +02:00
int vnet_hdr_sz ;
2014-12-16 15:05:06 +02:00
int le ;
2008-07-14 22:18:19 -07:00
int ret ;
2018-04-10 16:28:56 +02:00
bool do_notify = false ;
2005-04-16 15:20:36 -07:00
2018-02-14 16:40:14 +03:00
if ( cmd = = TUNSETIFF | | cmd = = TUNSETQUEUE | |
( _IOC_TYPE ( cmd ) = = SOCK_IOC_TYPE & & cmd ! = SIOCGSKNS ) ) {
2009-11-06 22:52:32 -08:00
if ( copy_from_user ( & ifr , argp , ifreq_len ) )
2005-04-16 15:20:36 -07:00
return - EFAULT ;
2012-07-30 14:52:48 -07:00
} else {
2012-07-29 19:45:14 +00:00
memset ( & ifr , 0 , sizeof ( ifr ) ) ;
2012-07-30 14:52:48 -07:00
}
2009-01-20 11:00:40 +00:00
if ( cmd = = TUNGETFEATURES ) {
/* Currently this just means: "what IFF flags are valid?".
* This is needed because we never checked for invalid flags on
2014-11-19 14:44:40 +02:00
* TUNSETIFF .
*/
2022-09-20 12:48:25 -07:00
return put_user ( IFF_TUN | IFF_TAP | IFF_NO_CARRIER |
TUN_FEATURES , ( unsigned int __user * ) argp ) ;
2018-05-08 19:21:34 +03:00
} else if ( cmd = = TUNSETQUEUE ) {
2012-10-31 19:46:01 +00:00
return tun_set_queue ( file , & ifr ) ;
2018-05-08 19:21:34 +03:00
} else if ( cmd = = SIOCGSKNS ) {
if ( ! ns_capable ( net - > user_ns , CAP_NET_ADMIN ) )
return - EPERM ;
return open_related_ns ( & net - > ns , get_net_ns ) ;
}
2009-01-20 11:00:40 +00:00
2009-08-06 14:22:44 +00:00
rtnl_lock ( ) ;
2017-09-23 22:36:52 +08:00
tun = tun_get ( tfile ) ;
2016-10-25 22:26:09 +08:00
if ( cmd = = TUNSETIFF ) {
ret = - EEXIST ;
if ( tun )
goto unlock ;
2005-04-16 15:20:36 -07:00
ifr . ifr_name [ IFNAMSIZ - 1 ] = '\0' ;
2018-02-14 16:40:14 +03:00
ret = tun_set_iff ( net , file , & ifr ) ;
2005-04-16 15:20:36 -07:00
2009-08-06 14:22:44 +00:00
if ( ret )
goto unlock ;
2005-04-16 15:20:36 -07:00
2009-11-06 22:52:32 -08:00
if ( copy_to_user ( argp , & ifr , ifreq_len ) )
2009-08-06 14:22:44 +00:00
ret = - EFAULT ;
goto unlock ;
2005-04-16 15:20:36 -07:00
}
tun: Add ability to create tun device with given index
Tun devices cannot be created with the ifindex the user wants, but this
is required by the checkpoint-restore project.
Long time ago such ability was implemented for rtnl_ops-based
interface for creating links (9c7dafbf net: Allow to create links
with given ifindex), but the only API for creating and managing
tuntap devices is ioctl-based and is evolving with adding new ones
(cde8b15f tuntap: add ioctl to attach or detach a file form tuntap
device).
Following that trend, add a new ioctl that sets the ifindex for the
device that _will_ be created by the TUNSETIFF ioctl.
Those who want a tuntap device with ifindex N should open the tun
device, call ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF.
If index N is busy, register_netdev will detect this and the ioctl
will fail with -EBUSY.
If setifindex is not called, then it will be generated as before.
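The sequence described above (open, TUNSETIFINDEX, then TUNSETIFF) might look like this from userspace. This is a sketch under assumptions: `demo_tun_open_with_ifindex()` is a hypothetical helper, the choice of `IFF_TUN | IFF_NO_PI` is arbitrary, and the caller is assumed to have the privileges needed to create the device.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Hypothetical helper: returns an open fd on success, -errno on failure. */
int demo_tun_open_with_ifindex(const char *name, unsigned int ifindex)
{
	struct ifreq ifr;
	size_t len = strlen(name);
	int fd, err;

	/* Reject names that do not fit before touching the device. */
	if (len >= IFNAMSIZ)
		return -ENAMETOOLONG;

	fd = open("/dev/net/tun", O_RDWR);
	if (fd < 0)
		return -errno;

	/* Must precede TUNSETIFF: the ioctl handler refuses the request
	 * with -EPERM once a device is attached to this fd. */
	if (ioctl(fd, TUNSETIFINDEX, &ifindex) < 0)
		goto fail;

	memset(&ifr, 0, sizeof(ifr));
	memcpy(ifr.ifr_name, name, len);
	ifr.ifr_flags = IFF_TUN | IFF_NO_PI;	/* assumed flags, for illustration */

	/* Per the commit message, a busy index surfaces here as -EBUSY. */
	if (ioctl(fd, TUNSETIFF, &ifr) < 0)
		goto fail;

	return fd;

fail:
	err = errno;
	close(fd);
	return -err;
}
```

A checkpoint-restore tool would call this with the ifindex recorded at dump time and treat -EBUSY as "the index is already taken in this namespace".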
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-21 14:31:38 +04:00
if ( cmd = = TUNSETIFINDEX ) {
ret = - EPERM ;
if ( tun )
goto unlock ;
ret = - EFAULT ;
if ( copy_from_user ( & ifindex , argp , sizeof ( ifindex ) ) )
goto unlock ;
ret = 0 ;
tfile - > ifindex = ifindex ;
goto unlock ;
}
2005-04-16 15:20:36 -07:00
2009-08-06 14:22:44 +00:00
ret = - EBADFD ;
2005-04-16 15:20:36 -07:00
if ( ! tun )
2009-08-06 14:22:44 +00:00
goto unlock ;
2005-04-16 15:20:36 -07:00
2020-03-04 17:24:14 +01:00
netif_info ( tun , drv , tun - > dev , "tun_chr_ioctl cmd %u\n" , cmd ) ;
2005-04-16 15:20:36 -07:00
2019-03-20 12:16:42 +03:00
net = dev_net ( tun - > dev ) ;
2009-01-20 11:00:40 +00:00
ret = 0 ;
2005-04-16 15:20:36 -07:00
switch ( cmd ) {
2008-08-15 15:09:56 -07:00
case TUNGETIFF :
2019-03-20 12:16:53 +03:00
tun_get_iff ( tun , & ifr ) ;
2008-08-15 15:09:56 -07:00
2013-08-21 14:32:00 +04:00
if ( tfile - > detached )
ifr . ifr_flags | = IFF_DETACH_QUEUE ;
tun: Allow to skip filter on attach
There's a small problem with sk-filters on tun devices. Consider
an application doing this sequence of steps:
fd = open("/dev/net/tun");
ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
ioctl(fd, TUNATTACHFILTER, &my_filter);
ioctl(fd, TUNSETPERSIST, 1);
close(fd);
At that point the tun0 will remain in the system and will keep in
mind that there should be a socket filter at address '&my_filter'.
If after that we do
fd = open("/dev/net/tun");
ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
we most likely receive the -EFAULT error, since tun_attach() would
try to connect the filter back. But (!) if we provide a filter at
address &my_filter, then tun0 will be created and the "new" filter
would be attached, but application may not know about that.
This may cause problems for anyone using tun devices, but it is a
critical problem for c/r: if we meet a persistent tun device
that remembers a filter, we will not be able to attach to it to dump
its state (flags, owner, address, vnethdr size, etc.).
The proposal is to allow attaching to a tun device (with TUNSETIFF)
without attaching the filter to the tun-file's socket. After such an
attach the app may, e.g., clean the device by dropping the filter if it
doesn't want to have one, or (in the case of c/r) get information
about the device with tun ioctls.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-21 14:32:21 +04:00
if ( ! tfile - > socket . sk - > sk_filter )
ifr . ifr_flags | = IFF_NOFILTER ;
2013-08-21 14:32:00 +04:00
2009-11-06 22:52:32 -08:00
if ( copy_to_user ( argp , & ifr , ifreq_len ) )
2009-01-20 11:00:40 +00:00
ret = - EFAULT ;
2008-08-15 15:09:56 -07:00
break ;
2005-04-16 15:20:36 -07:00
case TUNSETNOCSUM :
/* Disable/Enable checksum */
2011-04-19 06:13:10 +00:00
/* [unimplemented] */
2020-03-04 17:24:14 +01:00
netif_info ( tun , drv , tun - > dev , "ignored: set checksum %s\n" ,
arg ? "disabled" : "enabled" ) ;
2005-04-16 15:20:36 -07:00
break ;
case TUNSETPERSIST :
2012-10-31 19:45:57 +00:00
/* Disable/Enable persist mode. Keep an extra reference to the
* module to prevent the module being unprobed .
*/
2014-11-19 15:17:31 +02:00
if ( arg & & ! ( tun - > flags & IFF_PERSIST ) ) {
tun - > flags | = IFF_PERSIST ;
2012-10-31 19:45:57 +00:00
__module_get ( THIS_MODULE ) ;
2018-04-10 16:28:56 +02:00
do_notify = true ;
2013-01-11 16:59:34 +00:00
}
2014-11-19 15:17:31 +02:00
if ( ! arg & & ( tun - > flags & IFF_PERSIST ) ) {
tun - > flags & = ~ IFF_PERSIST ;
2012-10-31 19:45:57 +00:00
module_put ( THIS_MODULE ) ;
2018-04-10 16:28:56 +02:00
do_notify = true ;
2012-10-31 19:45:57 +00:00
}
2005-04-16 15:20:36 -07:00
2020-03-04 17:24:14 +01:00
netif_info ( tun , drv , tun - > dev , "persist %s\n" ,
arg ? "enabled" : "disabled" ) ;
2005-04-16 15:20:36 -07:00
break ;
case TUNSETOWNER :
/* Set owner of the device */
2012-02-07 16:48:55 -08:00
owner = make_kuid ( current_user_ns ( ) , arg ) ;
if ( ! uid_valid ( owner ) ) {
ret = - EINVAL ;
break ;
}
tun - > owner = owner ;
2018-04-10 16:28:56 +02:00
do_notify = true ;
2020-03-04 17:24:14 +01:00
netif_info ( tun , drv , tun - > dev , "owner set to %u\n" ,
from_kuid ( & init_user_ns , tun - > owner ) ) ;
2005-04-16 15:20:36 -07:00
break ;
2007-07-02 22:50:25 -07:00
case TUNSETGROUP :
/* Set group of the device */
2012-02-07 16:48:55 -08:00
group = make_kgid ( current_user_ns ( ) , arg ) ;
if ( ! gid_valid ( group ) ) {
ret = - EINVAL ;
break ;
}
tun - > group = group ;
2018-04-10 16:28:56 +02:00
do_notify = true ;
2020-03-04 17:24:14 +01:00
netif_info ( tun , drv , tun - > dev , "group set to %u\n" ,
from_kgid ( & init_user_ns , tun - > group ) ) ;
2007-07-02 22:50:25 -07:00
break ;
2005-09-01 17:40:05 -07:00
case TUNSETLINK :
/* Only allow setting the type when the interface is down */
if ( tun - > dev - > flags & IFF_UP ) {
2020-03-04 17:24:14 +01:00
netif_info ( tun , drv , tun - > dev ,
"Linktype set failed because interface is up\n" ) ;
2008-04-23 19:37:58 -07:00
ret = - EBUSY ;
2005-09-01 17:40:05 -07:00
} else {
2020-11-18 07:39:19 +01:00
ret = call_netdevice_notifiers ( NETDEV_PRE_TYPE_CHANGE ,
tun - > dev ) ;
ret = notifier_to_errno ( ret ) ;
if ( ret ) {
netif_info ( tun , drv , tun - > dev ,
"Refused to change device type\n" ) ;
break ;
}
2005-09-01 17:40:05 -07:00
tun - > dev - > type = ( int ) arg ;
2021-04-06 18:45:54 +01:00
tun - > dev - > addr_len = tun_get_addr_len ( tun - > dev - > type ) ;
2020-03-04 17:24:14 +01:00
netif_info ( tun , drv , tun - > dev , "linktype set to %d\n" ,
tun - > dev - > type ) ;
2020-11-18 07:39:19 +01:00
call_netdevice_notifiers ( NETDEV_POST_TYPE_CHANGE ,
tun - > dev ) ;
2005-09-01 17:40:05 -07:00
}
2009-01-20 11:00:40 +00:00
break ;
2005-09-01 17:40:05 -07:00
2005-04-16 15:20:36 -07:00
case TUNSETDEBUG :
2020-03-04 17:24:14 +01:00
tun - > msg_enable = ( u32 ) arg ;
2005-04-16 15:20:36 -07:00
break ;
2020-03-04 17:24:14 +01:00
2008-07-03 03:46:16 -07:00
case TUNSETOFFLOAD :
2011-04-19 06:13:10 +00:00
ret = set_offload ( tun , arg ) ;
2009-01-20 11:00:40 +00:00
break ;
2008-07-03 03:46:16 -07:00
2008-07-14 22:18:19 -07:00
case TUNSETTXFILTER :
/* Can be set only for TAPs */
2009-01-20 11:00:40 +00:00
ret = - EINVAL ;
2014-11-19 15:17:31 +02:00
if ( ( tun - > flags & TUN_TYPE_MASK ) ! = IFF_TAP )
2009-01-20 11:00:40 +00:00
break ;
2008-07-16 12:45:34 -07:00
ret = update_filter ( & tun - > txflt , ( void __user * ) arg ) ;
2009-01-20 11:00:40 +00:00
break ;
2005-04-16 15:20:36 -07:00
case SIOCGIFHWADDR :
tree-wide: fix comment/printk typos
"gadget", "through", "command", "maintain", "maintain", "controller", "address",
"between", "initiali[zs]e", "instead", "function", "select", "already",
"equal", "access", "management", "hierarchy", "registration", "interest",
"relative", "memory", "offset", "already",
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-01 15:38:34 -04:00
/* Get hw address */
net: fix dev_ifsioc_locked() race condition
dev_ifsioc_locked() is called with only RCU read lock, so when
there is a parallel writer changing the mac address, it could
get a partially updated mac address, as shown below:
Thread 1 Thread 2
// eth_commit_mac_addr_change()
memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
// dev_ifsioc_locked()
memcpy(ifr->ifr_hwaddr.sa_data,
dev->dev_addr,...);
Close this race condition by guarding them with a RW semaphore,
like netdev_get_name(). We can not use seqlock here as it does not
allow blocking. The writers already take RTNL anyway, so this does
not affect the slow path. To avoid bothering existing
dev_set_mac_address() callers in drivers, introduce a new wrapper
just for user-facing callers on ioctl and rtnetlink paths.
Note, bonding also changes slave mac addresses but that requires
a separate patch due to the complexity of bonding code.
Fixes: 3710becf8a58 ("net: RCU locking for simple ioctl()")
Reported-by: "Gong, Sishuai" <sishuai@purdue.edu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-11 11:34:10 -08:00
dev_get_mac_address ( & ifr . ifr_hwaddr , net , tun - > dev - > name ) ;
2009-11-06 22:52:32 -08:00
if ( copy_to_user ( argp , & ifr , ifreq_len ) )
2009-01-20 11:00:40 +00:00
ret = - EFAULT ;
break ;
2005-04-16 15:20:36 -07:00
case SIOCSIFHWADDR :
2008-07-14 22:18:19 -07:00
/* Set hw address */
2021-02-11 11:34:10 -08:00
ret = dev_set_mac_address_user ( tun - > dev , & ifr . ifr_hwaddr , NULL ) ;
2009-01-20 11:00:40 +00:00
break ;
2009-02-05 21:25:32 -08:00
case TUNGETSNDBUF :
2012-10-31 19:45:57 +00:00
sndbuf = tfile - > socket . sk - > sk_sndbuf ;
2009-02-05 21:25:32 -08:00
if ( copy_to_user ( argp , & sndbuf , sizeof ( sndbuf ) ) )
ret = - EFAULT ;
break ;
case TUNSETSNDBUF :
if ( copy_from_user ( & sndbuf , argp , sizeof ( sndbuf ) ) ) {
ret = - EFAULT ;
break ;
}
2017-10-30 18:50:11 -04:00
if ( sndbuf < = 0 ) {
ret = - EINVAL ;
break ;
}
2009-02-05 21:25:32 -08:00
2012-10-31 19:46:00 +00:00
tun - > sndbuf = sndbuf ;
tun_set_sndbuf ( tun ) ;
2009-02-05 21:25:32 -08:00
break ;
2010-03-17 17:45:01 +02:00
case TUNGETVNETHDRSZ :
vnet_hdr_sz = tun - > vnet_hdr_sz ;
if ( copy_to_user ( argp , & vnet_hdr_sz , sizeof ( vnet_hdr_sz ) ) )
ret = - EFAULT ;
break ;
case TUNSETVNETHDRSZ :
if ( copy_from_user ( & vnet_hdr_sz , argp , sizeof ( vnet_hdr_sz ) ) ) {
ret = - EFAULT ;
break ;
}
if ( vnet_hdr_sz < ( int ) sizeof ( struct virtio_net_hdr ) ) {
ret = - EINVAL ;
break ;
}
tun - > vnet_hdr_sz = vnet_hdr_sz ;
break ;
2014-12-16 15:05:06 +02:00
case TUNGETVNETLE :
le = ! ! ( tun - > flags & TUN_VNET_LE ) ;
if ( put_user ( le , ( int __user * ) argp ) )
ret = - EFAULT ;
break ;
case TUNSETVNETLE :
if ( get_user ( le , ( int __user * ) argp ) ) {
ret = - EFAULT ;
break ;
}
if ( le )
tun - > flags | = TUN_VNET_LE ;
else
tun - > flags & = ~ TUN_VNET_LE ;
break ;
2015-04-24 14:50:36 +02:00
case TUNGETVNETBE :
ret = tun_get_vnet_be ( tun , argp ) ;
break ;
case TUNSETVNETBE :
ret = tun_set_vnet_be ( tun , argp ) ;
break ;
2010-02-14 01:01:10 +00:00
case TUNATTACHFILTER :
/* Can be set only for TAPs */
ret = - EINVAL ;
2014-11-19 15:17:31 +02:00
if ( ( tun - > flags & TUN_TYPE_MASK ) ! = IFF_TAP )
2010-02-14 01:01:10 +00:00
break ;
ret = - EFAULT ;
2012-10-31 19:45:57 +00:00
if ( copy_from_user ( & tun - > fprog , argp , sizeof ( tun - > fprog ) ) )
2010-02-14 01:01:10 +00:00
break ;
2012-10-31 19:46:00 +00:00
ret = tun_attach_filter ( tun ) ;
2010-02-14 01:01:10 +00:00
break ;
case TUNDETACHFILTER :
/* Can be set only for TAPs */
ret = - EINVAL ;
2014-11-19 15:17:31 +02:00
if ( ( tun - > flags & TUN_TYPE_MASK ) ! = IFF_TAP )
2010-02-14 01:01:10 +00:00
break ;
2012-10-31 19:46:00 +00:00
ret = 0 ;
tun_detach_filter ( tun , tun - > numqueues ) ;
2010-02-14 01:01:10 +00:00
break ;
2013-08-21 14:32:39 +04:00
case TUNGETFILTER :
ret = - EINVAL ;
2014-11-19 15:17:31 +02:00
if ( ( tun - > flags & TUN_TYPE_MASK ) ! = IFF_TAP )
2013-08-21 14:32:39 +04:00
break ;
ret = - EFAULT ;
if ( copy_to_user ( argp , & tun - > fprog , sizeof ( tun - > fprog ) ) )
break ;
ret = 0 ;
break ;
2017-12-04 17:31:23 +08:00
case TUNSETSTEERINGEBPF :
2018-01-16 16:31:01 +08:00
ret = tun_set_ebpf ( tun , & tun - > steering_prog , argp ) ;
2017-12-04 17:31:23 +08:00
break ;
2018-01-16 16:31:02 +08:00
case TUNSETFILTEREBPF :
ret = tun_set_ebpf ( tun , & tun - > filter_prog , argp ) ;
break ;
2018-11-28 19:12:56 +01:00
case TUNSETCARRIER :
ret = - EFAULT ;
if ( copy_from_user ( & carrier , argp , sizeof ( carrier ) ) )
goto unlock ;
ret = tun_net_change_carrier ( tun - > dev , ( bool ) carrier ) ;
break ;
2019-03-20 12:16:42 +03:00
case TUNGETDEVNETNS :
ret = - EPERM ;
if ( ! ns_capable ( net - > user_ns , CAP_NET_ADMIN ) )
goto unlock ;
ret = open_related_ns ( & net - > ns , get_net_ns ) ;
break ;
2005-04-16 15:20:36 -07:00
default :
2009-01-20 11:00:40 +00:00
ret = - EINVAL ;
break ;
2010-05-17 22:47:34 -07:00
}
2005-04-16 15:20:36 -07:00
2018-04-10 16:28:56 +02:00
if ( do_notify )
netdev_state_change ( tun - > dev ) ;
2009-08-06 14:22:44 +00:00
unlock :
rtnl_unlock ( ) ;
if ( tun )
tun_put ( tun ) ;
2009-01-20 11:00:40 +00:00
return ret ;
2005-04-16 15:20:36 -07:00
}
2009-11-06 22:52:32 -08:00
static long tun_chr_ioctl ( struct file * file ,
unsigned int cmd , unsigned long arg )
{
return __tun_chr_ioctl ( file , cmd , arg , sizeof ( struct ifreq ) ) ;
}
# ifdef CONFIG_COMPAT
static long tun_chr_compat_ioctl ( struct file * file ,
unsigned int cmd , unsigned long arg )
{
switch ( cmd ) {
case TUNSETIFF :
case TUNGETIFF :
case TUNSETTXFILTER :
case TUNGETSNDBUF :
case TUNSETSNDBUF :
case SIOCGIFHWADDR :
case SIOCSIFHWADDR :
arg = ( unsigned long ) compat_ptr ( arg ) ;
break ;
default :
arg = ( compat_ulong_t ) arg ;
break ;
}
/*
* compat_ifreq is shorter than ifreq , so we must not access beyond
* the end of that structure . All fields that are used in this
* driver are compatible though , we don't need to convert the
* contents .
*/
return __tun_chr_ioctl ( file , cmd , arg , sizeof ( struct compat_ifreq ) ) ;
}
# endif /* CONFIG_COMPAT */
2005-04-16 15:20:36 -07:00
static int tun_chr_fasync ( int fd , struct file * file , int on )
{
2012-10-31 19:45:57 +00:00
struct tun_file * tfile = file - > private_data ;
2005-04-16 15:20:36 -07:00
int ret ;
2012-10-31 19:45:57 +00:00
if ( ( ret = fasync_helper ( fd , file , on , & tfile - > fasync ) ) < 0 )
2008-06-19 15:50:37 -06:00
goto out ;
2006-09-13 13:24:59 -04:00
2005-04-16 15:20:36 -07:00
if ( on ) {
2017-07-16 22:05:57 -05:00
__f_setown ( file , task_pid ( current ) , PIDTYPE_TGID , 0 ) ;
2012-10-31 19:45:57 +00:00
tfile - > flags | = TUN_FASYNC ;
2006-09-13 13:24:59 -04:00
} else
2012-10-31 19:45:57 +00:00
tfile - > flags & = ~ TUN_FASYNC ;
2008-06-19 15:50:37 -06:00
ret = 0 ;
out :
return ret ;
2005-04-16 15:20:36 -07:00
}
static int tun_chr_open ( struct inode * inode , struct file * file )
{
2015-05-08 21:07:08 -05:00
struct net * net = current - > nsproxy - > net_ns ;
2009-01-20 11:00:40 +00:00
struct tun_file * tfile ;
2009-10-14 01:19:46 -07:00
2015-05-08 21:07:08 -05:00
tfile = ( struct tun_file * ) sk_alloc ( net , AF_UNSPEC , GFP_KERNEL ,
2015-05-08 21:09:13 -05:00
& tun_proto , 0 ) ;
2009-01-20 11:00:40 +00:00
if ( ! tfile )
return - ENOMEM ;
2018-05-11 10:49:25 +08:00
if ( ptr_ring_init ( & tfile - > tx_ring , 0 , GFP_KERNEL ) ) {
sk_free ( & tfile - > sk ) ;
return - ENOMEM ;
}
2018-09-28 14:51:48 -07:00
mutex_init ( & tfile - > napi_mutex ) ;
2014-03-24 00:02:32 +05:30
RCU_INIT_POINTER ( tfile - > tun , NULL ) ;
2012-10-31 19:45:57 +00:00
tfile - > flags = 0 ;
2013-08-21 14:31:38 +04:00
tfile - > ifindex = 0 ;
2012-10-31 19:45:57 +00:00
2019-07-05 20:14:16 +01:00
init_waitqueue_head ( & tfile - > socket . wq . wait ) ;
2012-10-31 19:45:57 +00:00
tfile - > socket . file = file ;
tfile - > socket . ops = & tun_socket_ops ;
net: tun_chr_open(): set sk_uid from current_fsuid()
Commit a096ccca6e50 initializes the "sk_uid" field in the protocol socket
(struct sock) from the "/dev/net/tun" device node's owner UID. Per
original commit 86741ec25462 ("net: core: Add a UID field to struct
sock.", 2016-11-04), that's wrong: the idea is to cache the UID of the
userspace process that creates the socket. Commit 86741ec25462 mentions
socket() and accept(); with "tun", the action that creates the socket is
open("/dev/net/tun").
Therefore the device node's owner UID is irrelevant. In most cases,
"/dev/net/tun" will be owned by root, so in practice, commit a096ccca6e50
has no observable effect:
- before, "sk_uid" would be zero, due to undefined behavior
(CVE-2023-1076),
- after, "sk_uid" would be zero, due to "/dev/net/tun" being owned by root.
What matters is the (fs)UID of the process performing the open(), so cache
that in "sk_uid".
Cc: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Pietro Borrello <borrello@diag.uniroma1.it>
Cc: netdev@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: a096ccca6e50 ("tun: tun_chr_open(): correctly initialize socket uid")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2173435
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-31 18:42:36 +02:00
sock_init_data_uid ( & tfile - > socket , & tfile - > sk , current_fsuid ( ) ) ;
2012-10-31 19:45:57 +00:00
tfile - > sk . sk_write_space = tun_sock_write_space ;
tfile - > sk . sk_sndbuf = INT_MAX ;
2009-01-20 11:00:40 +00:00
file - > private_data = tfile ;
2012-12-13 23:53:30 +00:00
INIT_LIST_HEAD ( & tfile - > next ) ;
2012-10-31 19:45:57 +00:00
2013-06-08 14:17:41 +08:00
sock_set_flag ( & tfile - > sk , SOCK_ZEROCOPY ) ;
2023-03-07 20:45:56 -07:00
/* tun groks IOCB_NOWAIT just fine, mark it as such */
file - > f_mode | = FMODE_NOWAIT ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
static int tun_chr_close ( struct inode * inode , struct file * file )
{
2009-01-20 11:00:40 +00:00
struct tun_file * tfile = file - > private_data ;
2005-04-16 15:20:36 -07:00
2012-10-31 19:46:00 +00:00
tun_detach ( tfile , true ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
2014-01-29 16:43:31 +09:00
# ifdef CONFIG_PROC_FS
2017-09-23 22:36:52 +08:00
static void tun_chr_show_fdinfo ( struct seq_file * m , struct file * file )
2014-01-29 16:43:31 +09:00
{
2017-09-23 22:36:52 +08:00
struct tun_file * tfile = file - > private_data ;
2014-01-29 16:43:31 +09:00
struct tun_struct * tun ;
struct ifreq ifr ;
memset ( & ifr , 0 , sizeof ( ifr ) ) ;
rtnl_lock ( ) ;
2017-09-23 22:36:52 +08:00
tun = tun_get ( tfile ) ;
2014-01-29 16:43:31 +09:00
if ( tun )
2019-03-20 12:16:53 +03:00
tun_get_iff ( tun , & ifr ) ;
2014-01-29 16:43:31 +09:00
rtnl_unlock ( ) ;
if ( tun )
tun_put ( tun ) ;
2014-09-29 16:08:25 -07:00
seq_printf ( m , "iff:\t%s\n" , ifr . ifr_name ) ;
2014-01-29 16:43:31 +09:00
}
# endif
2007-02-12 00:55:34 -08:00
static const struct file_operations tun_fops = {
2006-09-13 13:24:59 -04:00
. owner = THIS_MODULE ,
2005-04-16 15:20:36 -07:00
. llseek = no_llseek ,
2014-11-07 13:52:07 -05:00
. read_iter = tun_chr_read_iter ,
2014-06-19 15:36:49 -04:00
. write_iter = tun_chr_write_iter ,
2005-04-16 15:20:36 -07:00
. poll = tun_chr_poll ,
2009-11-06 22:52:32 -08:00
. unlocked_ioctl = tun_chr_ioctl ,
# ifdef CONFIG_COMPAT
. compat_ioctl = tun_chr_compat_ioctl ,
# endif
2005-04-16 15:20:36 -07:00
. open = tun_chr_open ,
. release = tun_chr_close ,
2014-01-29 16:43:31 +09:00
. fasync = tun_chr_fasync ,
# ifdef CONFIG_PROC_FS
. show_fdinfo = tun_chr_show_fdinfo ,
# endif
2005-04-16 15:20:36 -07:00
} ;
static struct miscdevice tun_miscdev = {
. minor = TUN_MINOR ,
. name = "tun" ,
2009-09-18 23:01:12 +02:00
. nodename = "net/tun" ,
2005-04-16 15:20:36 -07:00
. fops = & tun_fops ,
} ;
/* ethtool interface */
2018-06-02 17:49:53 -04:00
static void tun_default_link_ksettings ( struct net_device * dev ,
struct ethtool_link_ksettings * cmd )
2017-03-11 22:03:50 +01:00
{
ethtool_link_ksettings_zero_link_mode ( cmd , supported ) ;
ethtool_link_ksettings_zero_link_mode ( cmd , advertising ) ;
2022-10-31 18:39:53 +01:00
cmd - > base . speed = SPEED_10000 ;
2017-03-11 22:03:50 +01:00
cmd - > base . duplex = DUPLEX_FULL ;
cmd - > base . port = PORT_TP ;
cmd - > base . phy_address = 0 ;
cmd - > base . autoneg = AUTONEG_DISABLE ;
2018-06-02 17:49:53 -04:00
}
static int tun_get_link_ksettings ( struct net_device * dev ,
struct ethtool_link_ksettings * cmd )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
memcpy ( cmd , & tun - > link_ksettings , sizeof ( * cmd ) ) ;
return 0 ;
}
static int tun_set_link_ksettings ( struct net_device * dev ,
const struct ethtool_link_ksettings * cmd )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
memcpy ( & tun - > link_ksettings , cmd , sizeof ( * cmd ) ) ;
2005-04-16 15:20:36 -07:00
return 0 ;
}
static void tun_get_drvinfo ( struct net_device * dev , struct ethtool_drvinfo * info )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
2022-08-30 22:14:52 +02:00
strscpy ( info - > driver , DRV_NAME , sizeof ( info - > driver ) ) ;
strscpy ( info - > version , DRV_VERSION , sizeof ( info - > version ) ) ;
2005-04-16 15:20:36 -07:00
switch ( tun - > flags & TUN_TYPE_MASK ) {
2014-11-19 15:17:31 +02:00
case IFF_TUN :
2022-08-30 22:14:52 +02:00
strscpy ( info - > bus_info , "tun" , sizeof ( info - > bus_info ) ) ;
2005-04-16 15:20:36 -07:00
break ;
2014-11-19 15:17:31 +02:00
case IFF_TAP :
2022-08-30 22:14:52 +02:00
strscpy ( info - > bus_info , "tap" , sizeof ( info - > bus_info ) ) ;
2005-04-16 15:20:36 -07:00
break ;
}
}
static u32 tun_get_msglevel ( struct net_device * dev )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
2020-03-04 17:24:14 +01:00
return tun - > msg_enable ;
2005-04-16 15:20:36 -07:00
}
static void tun_set_msglevel ( struct net_device * dev , u32 value )
{
struct tun_struct * tun = netdev_priv ( dev ) ;
2020-03-04 17:24:14 +01:00
tun - > msg_enable = value ;
2005-04-16 15:20:36 -07:00
}
2017-01-18 15:02:03 +08:00
static int tun_get_coalesce ( struct net_device * dev ,
2021-08-20 15:35:18 +08:00
struct ethtool_coalesce * ec ,
struct kernel_ethtool_coalesce * kernel_coal ,
struct netlink_ext_ack * extack )
2017-01-18 15:02:03 +08:00
{
struct tun_struct * tun = netdev_priv ( dev ) ;
ec - > rx_max_coalesced_frames = tun - > rx_batched ;
return 0 ;
}
static int tun_set_coalesce ( struct net_device * dev ,
2021-08-20 15:35:18 +08:00
struct ethtool_coalesce * ec ,
struct kernel_ethtool_coalesce * kernel_coal ,
struct netlink_ext_ack * extack )
2017-01-18 15:02:03 +08:00
{
struct tun_struct * tun = netdev_priv ( dev ) ;
if ( ec - > rx_max_coalesced_frames > NAPI_POLL_WEIGHT )
tun - > rx_batched = NAPI_POLL_WEIGHT ;
else
tun - > rx_batched = ec - > rx_max_coalesced_frames ;
return 0 ;
}
2006-09-13 14:30:00 -04:00
static const struct ethtool_ops tun_ethtool_ops = {
2020-03-05 17:05:58 -08:00
. supported_coalesce_params = ETHTOOL_COALESCE_RX_MAX_FRAMES ,
2005-04-16 15:20:36 -07:00
. get_drvinfo = tun_get_drvinfo ,
. get_msglevel = tun_get_msglevel ,
. set_msglevel = tun_set_msglevel ,
2010-07-27 13:53:43 +00:00
. get_link = ethtool_op_get_link ,
2013-07-19 19:40:10 +02:00
. get_ts_info = ethtool_op_get_ts_info ,
2017-01-18 15:02:03 +08:00
. get_coalesce = tun_get_coalesce ,
. set_coalesce = tun_set_coalesce ,
2017-03-11 22:03:50 +01:00
. get_link_ksettings = tun_get_link_ksettings ,
2018-06-02 17:49:53 -04:00
. set_link_ksettings = tun_set_link_ksettings ,
2005-04-16 15:20:36 -07:00
} ;
2016-06-30 14:45:36 +08:00
static int tun_queue_resize ( struct tun_struct * tun )
{
struct net_device * dev = tun - > dev ;
struct tun_file * tfile ;
2018-01-04 11:14:27 +08:00
struct ptr_ring * * rings ;
2016-06-30 14:45:36 +08:00
int n = tun - > numqueues + tun - > numdisabled ;
int ret , i ;
2018-01-04 11:14:27 +08:00
rings = kmalloc_array ( n , sizeof ( * rings ) , GFP_KERNEL ) ;
if ( ! rings )
2016-06-30 14:45:36 +08:00
return - ENOMEM ;
for ( i = 0 ; i < tun - > numqueues ; i + + ) {
tfile = rtnl_dereference ( tun - > tfiles [ i ] ) ;
2018-01-04 11:14:27 +08:00
rings [ i ] = & tfile - > tx_ring ;
2016-06-30 14:45:36 +08:00
}
list_for_each_entry ( tfile , & tun - > disabled , next )
2018-01-04 11:14:27 +08:00
rings [ i + + ] = & tfile - > tx_ring ;
2016-06-30 14:45:36 +08:00
2018-01-04 11:14:27 +08:00
ret = ptr_ring_resize_multiple ( rings , n ,
dev - > tx_queue_len , GFP_KERNEL ,
2018-01-04 11:14:28 +08:00
tun_ptr_free ) ;
2016-06-30 14:45:36 +08:00
2018-01-04 11:14:27 +08:00
kfree ( rings ) ;
2016-06-30 14:45:36 +08:00
return ret ;
}
static int tun_device_event ( struct notifier_block * unused ,
unsigned long event , void * ptr )
{
struct net_device * dev = netdev_notifier_info_to_dev ( ptr ) ;
struct tun_struct * tun = netdev_priv ( dev ) ;
2019-06-17 21:26:36 +08:00
int i ;
2016-06-30 14:45:36 +08:00
2016-07-06 18:44:20 -04:00
if ( dev - > rtnl_link_ops ! = & tun_link_ops )
return NOTIFY_DONE ;
2016-06-30 14:45:36 +08:00
switch ( event ) {
case NETDEV_CHANGE_TX_QUEUE_LEN :
if ( tun_queue_resize ( tun ) )
return NOTIFY_BAD ;
break ;
2019-06-17 21:26:36 +08:00
case NETDEV_UP :
for ( i = 0 ; i < tun - > numqueues ; i + + ) {
struct tun_file * tfile ;
tfile = rtnl_dereference ( tun - > tfiles [ i ] ) ;
tfile - > socket . sk - > sk_write_space ( tfile - > socket . sk ) ;
}
break ;
2016-06-30 14:45:36 +08:00
default :
break ;
}
return NOTIFY_DONE ;
}
static struct notifier_block tun_notifier_block __read_mostly = {
. notifier_call = tun_device_event ,
} ;
2008-04-16 00:40:46 -07:00
2005-04-16 15:20:36 -07:00
static int __init tun_init ( void )
{
int ret = 0 ;
2011-03-02 07:18:10 +00:00
pr_info ( " %s, %s \n " , DRV_DESCRIPTION , DRV_VERSION ) ;
2005-04-16 15:20:36 -07:00
2009-01-21 16:02:16 -08:00
ret = rtnl_link_register ( & tun_link_ops ) ;
2008-04-16 00:40:46 -07:00
if ( ret ) {
2011-03-02 07:18:10 +00:00
pr_err ( " Can't register link_ops \n " ) ;
2009-01-21 16:02:16 -08:00
goto err_linkops ;
2008-04-16 00:40:46 -07:00
}
2005-04-16 15:20:36 -07:00
ret = misc_register ( & tun_miscdev ) ;
2008-04-16 00:40:46 -07:00
if ( ret ) {
2011-03-02 07:18:10 +00:00
pr_err ( " Can't register misc device %d \n " , TUN_MINOR ) ;
2008-04-16 00:40:46 -07:00
goto err_misc ;
}
2016-06-30 14:45:36 +08:00
2017-07-20 02:41:34 -07:00
ret = register_netdevice_notifier ( & tun_notifier_block ) ;
if ( ret ) {
pr_err ( " Can't register netdevice notifier \n " ) ;
goto err_notifier ;
}
2009-01-21 16:02:16 -08:00
return 0 ;
2017-07-20 02:41:34 -07:00
err_notifier :
misc_deregister ( & tun_miscdev ) ;
2008-04-16 00:40:46 -07:00
err_misc :
2009-01-21 16:02:16 -08:00
rtnl_link_unregister ( & tun_link_ops ) ;
err_linkops :
2005-04-16 15:20:36 -07:00
return ret ;
}
2023-08-14 16:30:00 +08:00
static void __exit tun_cleanup ( void )
2005-04-16 15:20:36 -07:00
{
2006-09-13 13:24:59 -04:00
misc_deregister ( & tun_miscdev ) ;
2009-01-21 16:02:16 -08:00
rtnl_link_unregister ( & tun_link_ops ) ;
2016-06-30 14:45:36 +08:00
unregister_netdevice_notifier ( & tun_notifier_block ) ;
2005-04-16 15:20:36 -07:00
}
2010-01-14 06:17:09 +00:00
/* Get an underlying socket object from tun file. Returns error unless file is
* attached to a device . The returned object works like a packet socket , it
* can be used for sock_sendmsg / sock_recvmsg . The caller is responsible for
* holding a reference to the file for as long as the socket is in use . */
struct socket * tun_get_socket ( struct file * file )
{
2012-10-31 19:45:58 +00:00
struct tun_file * tfile ;
2010-01-14 06:17:09 +00:00
if ( file - > f_op ! = & tun_fops )
return ERR_PTR ( - EINVAL ) ;
2012-10-31 19:45:58 +00:00
tfile = file - > private_data ;
if ( ! tfile )
2010-01-14 06:17:09 +00:00
return ERR_PTR ( - EBADFD ) ;
2012-10-31 19:45:57 +00:00
return & tfile - > socket ;
2010-01-14 06:17:09 +00:00
}
EXPORT_SYMBOL_GPL ( tun_get_socket ) ;
2018-01-04 11:14:27 +08:00
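/* Get the tx ptr_ring of a tun file. Returns ERR_PTR(-EINVAL) if the file is
 * not a tun file and ERR_PTR(-EBADFD) if no tun_file is attached to it. As
 * with tun_get_socket(), the caller is responsible for holding a reference to
 * the file for as long as the ring is in use. */
struct ptr_ring * tun_get_tx_ring ( struct file * file )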
2017-05-17 12:14:41 +08:00
{
struct tun_file * tfile ;
if ( file - > f_op ! = & tun_fops )
return ERR_PTR ( - EINVAL ) ;
tfile = file - > private_data ;
if ( ! tfile )
return ERR_PTR ( - EBADFD ) ;
2018-01-04 11:14:27 +08:00
return & tfile - > tx_ring ;
2017-05-17 12:14:41 +08:00
}
2018-01-04 11:14:27 +08:00
EXPORT_SYMBOL_GPL ( tun_get_tx_ring ) ;
2017-05-17 12:14:41 +08:00
2005-04-16 15:20:36 -07:00
module_init ( tun_init ) ;
module_exit ( tun_cleanup ) ;
MODULE_DESCRIPTION ( DRV_DESCRIPTION ) ;
MODULE_AUTHOR ( DRV_COPYRIGHT ) ;
MODULE_LICENSE ( " GPL " ) ;
MODULE_ALIAS_MISCDEV ( TUN_MINOR ) ;
driver core: add devname module aliases to allow module on-demand auto-loading
This adds:
alias: devname:<name>
to some common kernel modules, which will allow the on-demand loading
of the kernel module when the device node is accessed.
Ideally all these modules would be compiled-in, but distros seem so
much in love with their modularization that we need to cover the common
cases with this new facility. It will allow us to remove a bunch of pretty
useless modprobe calls from init scripts.
The static device node aliases will be carried in the module itself. The
program depmod will extract this information to a file in the module directory:
$ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
# Device nodes to trigger on-demand module loading.
microcode cpu/microcode c10:184
fuse fuse c10:229
ppp_generic ppp c108:0
tun net/tun c10:200
dm_mod mapper/control c10:235
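The entry layout shown above (module name, device node name, then a type letter followed by major:minor) can be read back with a few lines of C. The following is only a minimal sketch based on the sample listing: the names `struct devname_entry` and `parse_devname_line` are hypothetical and not part of depmod, udev or the kernel, and accepting `b` for block devices is an assumption, since the listing only shows `c` entries.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one modules.devname entry, e.g.
 * "tun net/tun c10:200". Field sizes are arbitrary for this sketch. */
struct devname_entry {
	char module[64];	/* kernel module name, e.g. "tun" */
	char devname[128];	/* node name relative to /dev, e.g. "net/tun" */
	char type;		/* 'c' (character) or, by assumption, 'b' (block) */
	unsigned int major;
	unsigned int minor;
};

/* Parse one line of the file. Returns 0 on success and -1 for comment
 * lines, blank lines and anything that does not match the layout. */
static int parse_devname_line(const char *line, struct devname_entry *out)
{
	assert(line && out);

	/* Skip leading blanks, then ignore comments and empty lines. */
	line += strspn(line, " \t");
	if (*line == '#' || *line == '\n' || *line == '\0')
		return -1;

	/* Field widths keep sscanf() within the buffers above. */
	if (sscanf(line, "%63s %127s %c%u:%u", out->module, out->devname,
		   &out->type, &out->major, &out->minor) != 5)
		return -1;

	if (out->type != 'c' && out->type != 'b')
		return -1;

	return 0;
}
```

For the tun line in the listing, this would yield module "tun", node name "net/tun", type 'c', major 10 and minor 200, which is the information a device manager needs to create the static node.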
Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed:
$ /sbin/udevd --debug
...
static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
static_dev_create_from_modules: mknod '/dev/fuse' c10:229
static_dev_create_from_modules: mknod '/dev/ppp' c108:0
static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666
A few device nodes are switched to statically allocated numbers, to allow
the static nodes to work. This might also be useful for systems which still run
a plain static /dev, which is completely unsafe to use with any dynamic minor
numbers.
Note:
The devname aliases must be limited to the *common* and *single-instance*
device nodes, like the misc devices, and never be used for conceptually limited
systems like the loop devices, which should rather get fixed properly and get a
control node for losetup to talk to, instead of creating a random number of
device nodes in advance, regardless of whether they are ever used.
This facility is to hide the mess distros are creating with overly modularized
kernels, and just to hide that these modules are not compiled-in, and not to
paper-over broken concepts. Thanks! :)
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-20 18:07:20 +02:00
MODULE_ALIAS ( " devname:net/tun " ) ;