/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
 * INET		An implementation of the TCP/IP protocol suite for the LINUX
 *		operating system.  INET is implemented using the BSD Socket
 *		interface as the means of communication with the user level.
 *
 *		Definitions for the IP router.
 *
 * Version:	@(#)route.h	1.0.4	05/27/93
 *
 * Authors:	Ross Biro
 *		Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 * Fixes:
 *		Alan Cox	:	Reformatted. Added ip_rt_local()
 *		Alan Cox	:	Support for TCP parameters.
 *		Alexey Kuznetsov:	Major changes for new routing code.
 *		Mike McLagan	:	Routing by source
 *		Robert Olsson	:	Added rt_cache statistics
 */
#ifndef _ROUTE_H
#define _ROUTE_H

#include <net/dst.h>
#include <net/inetpeer.h>
#include <net/flow.h>
#include <net/inet_sock.h>
#include <net/ip_fib.h>
#include <net/arp.h>
#include <net/ndisc.h>
#include <linux/in_route.h>
#include <linux/rtnetlink.h>
#include <linux/rcupdate.h>
#include <linux/route.h>
#include <linux/ip.h>
#include <linux/cache.h>
#include <linux/security.h>

#define RTO_ONLINK	0x01

static inline __u8 ip_sock_rt_scope(const struct sock *sk)
{
	if (sock_flag(sk, SOCK_LOCALROUTE))
		return RT_SCOPE_LINK;

	return RT_SCOPE_UNIVERSE;
}

static inline __u8 ip_sock_rt_tos(const struct sock *sk)
{
	return RT_TOS(READ_ONCE(inet_sk(sk)->tos));
}

struct ip_tunnel_info;
struct fib_nh;
struct fib_info;
struct uncached_list;
struct rtable {
	struct dst_entry	dst;

	int			rt_genid;
	unsigned int		rt_flags;
	__u16			rt_type;
	__u8			rt_is_input;
	__u8			rt_uses_gateway;

	int			rt_iif;

	u8			rt_gw_family;
	/* Info on neighbour */
	union {
		__be32		rt_gw4;
		struct in6_addr	rt_gw6;
	};

	/* Miscellaneous cached information */
	u32			rt_mtu_locked:1,
				rt_pmtu:31;
};

static inline bool rt_is_input_route(const struct rtable *rt)
{
	return rt->rt_is_input != 0;
}

static inline bool rt_is_output_route(const struct rtable *rt)
{
	return rt->rt_is_input == 0;
}

static inline __be32 rt_nexthop(const struct rtable *rt, __be32 daddr)
{
	if (rt->rt_gw_family == AF_INET)
		return rt->rt_gw4;
	return daddr;
}

struct ip_rt_acct {
	__u32	o_bytes;
	__u32	o_packets;
	__u32	i_bytes;
	__u32	i_packets;
};

struct rt_cache_stat {
	unsigned int in_slow_tot;
	unsigned int in_slow_mc;
	unsigned int in_no_route;
	unsigned int in_brd;
	unsigned int in_martian_dst;
	unsigned int in_martian_src;
	unsigned int out_slow_tot;
	unsigned int out_slow_mc;
};

extern struct ip_rt_acct __percpu *ip_rt_acct;

struct in_device;

int ip_rt_init(void);
void rt_cache_flush(struct net *net);
void rt_flush_dev(struct net_device *dev);
struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *flp,
					const struct sk_buff *skb);
struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *flp,
					    struct fib_result *res,
					    const struct sk_buff *skb);

static inline struct rtable *__ip_route_output_key(struct net *net,
						   struct flowi4 *flp)
{
	return ip_route_output_key_hash(net, flp, NULL);
}

struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
				    const struct sock *sk);
struct dst_entry *ipv4_blackhole_route(struct net *net,
				       struct dst_entry *dst_orig);

static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
{
	return ip_route_output_flow(net, flp, NULL);
}

static inline struct rtable *ip_route_output(struct net *net, __be32 daddr,
					     __be32 saddr, u8 tos, int oif)
{
	struct flowi4 fl4 = {
		.flowi4_oif = oif,
		.flowi4_tos = tos,
		.daddr = daddr,
		.saddr = saddr,
	};
	return ip_route_output_key(net, &fl4);
}

static inline struct rtable *ip_route_output_ports(struct net *net, struct flowi4 *fl4,
						   const struct sock *sk,
						   __be32 daddr, __be32 saddr,
						   __be16 dport, __be16 sport,
						   __u8 proto, __u8 tos, int oif)
{
	flowi4_init_output(fl4, oif, sk ? READ_ONCE(sk->sk_mark) : 0, tos,
			   sk ? ip_sock_rt_scope(sk) : RT_SCOPE_UNIVERSE,
			   proto, sk ? inet_sk_flowi_flags(sk) : 0,
			   daddr, saddr, dport, sport, sock_net_uid(net, sk));
	if (sk)
		security_sk_classify_flow(sk, flowi4_to_flowi_common(fl4));
	return ip_route_output_flow(net, fl4, sk);
}

static inline struct rtable *ip_route_output_gre(struct net *net, struct flowi4 *fl4,
						 __be32 daddr, __be32 saddr,
						 __be32 gre_key, __u8 tos, int oif)
{
	memset(fl4, 0, sizeof(*fl4));
	fl4->flowi4_oif = oif;
	fl4->daddr = daddr;
	fl4->saddr = saddr;
	fl4->flowi4_tos = tos;
	fl4->flowi4_proto = IPPROTO_GRE;
	fl4->fl4_gre_key = gre_key;
	return ip_route_output_key(net, fl4);
}

int ip_mc_validate_source(struct sk_buff *skb, __be32 daddr, __be32 saddr,
			  u8 tos, struct net_device *dev,
			  struct in_device *in_dev, u32 *itag);
int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src,
			 u8 tos, struct net_device *devin);
int ip_route_use_hint(struct sk_buff *skb, __be32 dst, __be32 src,
		      u8 tos, struct net_device *devin,
		      const struct sk_buff *hint);

static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
				 u8 tos, struct net_device *devin)
{
	int err;

	rcu_read_lock();
	err = ip_route_input_noref(skb, dst, src, tos, devin);
	if (!err) {
		skb_dst_force(skb);
		if (!skb_dst(skb))
			err = -EINVAL;
	}
	rcu_read_unlock();

	return err;
}

void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu, int oif,
		      u8 protocol);
void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu);
void ipv4_redirect(struct sk_buff *skb, struct net *net, int oif, u8 protocol);
void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
void ip_rt_send_redirect(struct sk_buff *skb);

unsigned int inet_addr_type(struct net *net, __be32 addr);
unsigned int inet_addr_type_table(struct net *net, __be32 addr, u32 tb_id);
unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
				__be32 addr);
unsigned int inet_addr_type_dev_table(struct net *net,
				      const struct net_device *dev,
				      __be32 addr);
void ip_rt_multicast_event(struct in_device *);
int ip_rt_ioctl(struct net *, unsigned int cmd, struct rtentry *rt);
void ip_rt_get_source(u8 *src, struct sk_buff *skb, struct rtable *rt);
struct rtable *rt_dst_alloc(struct net_device *dev,
			    unsigned int flags, u16 type, bool noxfrm);
struct rtable *rt_dst_clone(struct net_device *dev, struct rtable *rt);

struct in_ifaddr;
void fib_add_ifaddr(struct in_ifaddr *);
void fib_del_ifaddr(struct in_ifaddr *, struct in_ifaddr *);
void fib_modify_prefix_metric(struct in_ifaddr *ifa, u32 new_metric);

void rt_add_uncached_list(struct rtable *rt);
void rt_del_uncached_list(struct rtable *rt);

int fib_dump_info_fnhe(struct sk_buff *skb, struct netlink_callback *cb,
		       u32 table_id, struct fib_info *fi,
		       int *fa_index, int fa_start, unsigned int flags);

static inline void ip_rt_put(struct rtable *rt)
{
	/* dst_release() accepts a NULL parameter.
	 * We rely on dst being first structure in struct rtable
	 */
	BUILD_BUG_ON(offsetof(struct rtable, dst) != 0);
	dst_release(&rt->dst);
}

#define IPTOS_RT_MASK	(IPTOS_TOS_MASK & ~3)

extern const __u8 ip_tos2prio[16];

static inline char rt_tos2priority(u8 tos)
{
	return ip_tos2prio[IPTOS_TOS(tos) >> 1];
}

/* ip_route_connect() and ip_route_newports() work in tandem whilst
 * binding a socket for a new outgoing connection.
 *
 * In order to use IPSEC properly, we must, in the end, have a
 * route that was looked up using all available keys including source
 * and destination ports.
 *
 * However, if a source port needs to be allocated (the user specified
 * a wildcard source port) we need to obtain addressing information
 * in order to perform that allocation.
 *
 * So ip_route_connect() looks up a route using wildcarded source and
 * destination ports in the key, simply so that we can get a pair of
 * addresses to use for port allocation.
 *
 * Later, once the ports are allocated, ip_route_newports() will make
 * another route lookup if needed to make sure we catch any IPSEC
 * rules keyed on the port information.
 *
 * The callers allocate the flow key on their stack, and must pass in
 * the same flowi4 object to both the ip_route_connect() and the
 * ip_route_newports() calls.
 */
static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst,
					 __be32 src, int oif, u8 protocol,
					 __be16 sport, __be16 dport,
					 const struct sock *sk)
{
	__u8 flow_flags = 0;

	if (inet_test_bit(TRANSPARENT, sk))
		flow_flags |= FLOWI_FLAG_ANYSRC;

	flowi4_init_output(fl4, oif, READ_ONCE(sk->sk_mark), ip_sock_rt_tos(sk),
			   ip_sock_rt_scope(sk), protocol, flow_flags, dst,
			   src, dport, sport, sk->sk_uid);
}

static inline struct rtable *ip_route_connect(struct flowi4 *fl4, __be32 dst,
					      __be32 src, int oif, u8 protocol,
					      __be16 sport, __be16 dport,
					      const struct sock *sk)
{
	struct net *net = sock_net(sk);
	struct rtable *rt;

	ip_route_connect_init(fl4, dst, src, oif, protocol, sport, dport, sk);

	if (!dst || !src) {
		rt = __ip_route_output_key(net, fl4);
		if (IS_ERR(rt))
			return rt;
		ip_rt_put(rt);
		flowi4_update_output(fl4, oif, fl4->daddr, fl4->saddr);
	}
	security_sk_classify_flow(sk, flowi4_to_flowi_common(fl4));
	return ip_route_output_flow(net, fl4, sk);
}

static inline struct rtable *ip_route_newports(struct flowi4 *fl4, struct rtable *rt,
					       __be16 orig_sport, __be16 orig_dport,
					       __be16 sport, __be16 dport,
					       const struct sock *sk)
{
	if (sport != orig_sport || dport != orig_dport) {
		fl4->fl4_dport = dport;
		fl4->fl4_sport = sport;
		ip_rt_put(rt);
		flowi4_update_output(fl4, sk->sk_bound_dev_if, fl4->daddr,
				     fl4->saddr);
		security_sk_classify_flow(sk, flowi4_to_flowi_common(fl4));
		return ip_route_output_flow(sock_net(sk), fl4, sk);
	}
	return rt;
}
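
/* Illustrative sketch only, not part of the original header: one way a
 * connect()-style caller might pair ip_route_connect() and
 * ip_route_newports(). The function name example_connect_route() and the
 * use of IPPROTO_TCP are assumptions made for illustration; the argument
 * order follows the prototypes in this file. The same flowi4 is passed to
 * both calls, and ip_route_newports() drops the old route itself when it
 * has to repeat the lookup.
 *
 *	static int example_connect_route(struct sock *sk, struct flowi4 *fl4,
 *					 __be32 daddr, __be32 saddr,
 *					 __be16 orig_sport, __be16 orig_dport,
 *					 __be16 sport, __be16 dport)
 *	{
 *		struct rtable *rt;
 *
 *		rt = ip_route_connect(fl4, daddr, saddr, sk->sk_bound_dev_if,
 *				      IPPROTO_TCP, orig_sport, orig_dport, sk);
 *		if (IS_ERR(rt))
 *			return PTR_ERR(rt);
 *
 *		(source port allocation would happen here, using the
 *		 addresses now filled in as fl4->saddr and fl4->daddr)
 *
 *		rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
 *				       sport, dport, sk);
 *		if (IS_ERR(rt))
 *			return PTR_ERR(rt);
 *
 *		ip_rt_put(rt);
 *		return 0;
 *	}
 *
 * A real caller would normally keep the final route (for example by
 * attaching it to the socket) rather than releasing it straight away;
 * the ip_rt_put() above only keeps the sketch free of leaked references.
 */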

static inline int inet_iif(const struct sk_buff *skb)
{
	struct rtable *rt = skb_rtable(skb);

	if (rt && rt->rt_iif)
		return rt->rt_iif;

	return skb->skb_iif;
}

static inline int ip4_dst_hoplimit(const struct dst_entry *dst)
{
	int hoplimit = dst_metric_raw(dst, RTAX_HOPLIMIT);
	struct net *net = dev_net(dst->dev);

	if (hoplimit == 0)
		hoplimit = READ_ONCE(net->ipv4.sysctl_ip_default_ttl);
	return hoplimit;
}

static inline struct neighbour *ip_neigh_gw4(struct net_device *dev,
					     __be32 daddr)
{
	struct neighbour *neigh;

	neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)daddr);
	if (unlikely(!neigh))
		neigh = __neigh_create(&arp_tbl, &daddr, dev, false);

	return neigh;
}

static inline struct neighbour *ip_neigh_for_gw(struct rtable *rt,
						struct sk_buff *skb,
						bool *is_v6gw)
{
	struct net_device *dev = rt->dst.dev;
	struct neighbour *neigh;

	if (likely(rt->rt_gw_family == AF_INET)) {
		neigh = ip_neigh_gw4(dev, rt->rt_gw4);
	} else if (rt->rt_gw_family == AF_INET6) {
		neigh = ip_neigh_gw6(dev, &rt->rt_gw6);
		*is_v6gw = true;
	} else {
		neigh = ip_neigh_gw4(dev, ip_hdr(skb)->daddr);
	}
	return neigh;
}

#endif	/* _ROUTE_H */