Commit Graph

707857 Commits

Author SHA1 Message Date
908432ca84 bpf: add helper bpf_perf_event_read_value for perf event array map
Hardware pmu counters are limited resources. When there are more
pmu based perf events opened than available counters, kernel will
multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. In case that multiplexing happens,
the number of samples or counter value will not reflect the
case compared to no multiplexing. This makes comparison between
different runs difficult.

Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time enabled for event and time_running is
the time running for event since last normalization.

This patch adds helper bpf_perf_event_read_value for kprobed based perf
event array map, to read perf counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
To achieve scaling factor between two bpf invocations, users
can can use cpu_id as the key (which is typical for perf array usage model)
to remember the previous value and do the calculation inside the
bpf program.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 23:05:57 +01:00
97562633bc bpf: perf event change needed for subsequent bpf helpers
This patch does not impact existing functionalities.
It contains the changes in perf event area needed for
subsequent bpf_perf_event_read_value and
bpf_perf_prog_read_value helpers.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 23:05:57 +01:00
bdc476413d ip_tunnel: add mpls over gre support
This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel
API by simply adding ipgre_tunnel_encap_(add|del)_mpls_ops() and the new
tunnel type TUNNEL_ENCAP_MPLS.

Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:38:31 +01:00
2af48d430a Merge branch 'fib6-rcu'
Wei Wang says:

====================
ipv6: replace rwlock with rcu and spinlock in fib6 table

Currently, fib6 table is protected by rwlock. During route lookup,
reader lock is taken and during route insertion, deletion or
modification, writer lock is taken. This is a very inefficient
implementation because the fastpath always has to do the operation
to grab the reader lock.
According to my latest syn flood test on an iota ivybridage machine
with 2 10G mlx nics bonded together, each with 8 rx queues on 2 NUMA
nodes, and with the upstream net-next kernel:
ipv4 stack can handle around 4.2Mpps
ipv6 stack can handle around 1.3Mpps

In order to close the gap of the performance number between ipv4
and ipv6 stack, this patch series tries to get rid of the usage of
the rwlock and replace it with rcu and spinlock protection. This will
greatly speed up the fastpath performance as it only needs to hold
rcu which is much less expensive than grabbing the reader lock. It
also makes ipv6 fib implementation more consistent with ipv4.

In order to be able to replace the current rwlock with rcu and
spinlock, some preparation work is needed:
Patch 1-8 introduces a per-route hash table (protected by rcu and a
different spinlock) to store all cached routes created by pmtu and ip
redirect under its main route. This makes the main fib6 tree only
contain static routes.
Patch 9-14 prepares all the reader path to be ready to tolerate
concurrent writer.
Patch 15 finally does the rwlock to rcu and spinlock conversion.
Patch 16 takes care of rt6_stats.

After this patch series, in the same syn flood test,
ipv6 stack can now handle around 3.5Mpps compared to previous 1.3Mpps
in my test setup.

After this patch series, there are still some improvements that should
be done in ipv6 stack:
1. During route lookup, dst_use() is called everytime on the selected
route to update dst->__use and dst->lastuse. This dirties the cacheline
and causes extra cacheline miss and should be avoided.
2. when no route is found in the current table, net->ip6.ipv6_null_entry
is used and refcnt is taken on it. As there is no pcpu cache for this
specific route, frequent change on the refcnt for this route causes
quite some cacheline misses.
And to make things worse, if CONFIG_IPV6_MULTIPLE_TABLES is defined,
output path route lookup always starts with local table first and
guarantees to hit net->ipv6.ip6_null_entry before continuing to do
lookup in the main table.
These operations on net->ipv6.ip6_null_entry could potentially be
avoided.
3. ipv6 input path route lookup grabs refcnt on dst. This is different
from ipv4. We could potentially change this behavior to let ipv6 input
path route lookup not to grab refcnt on dst. However, it does not give
us much performance boost as we currently have pcpu route cache for
input path as well in ipv6. But this work probably is still worth doing
to unify ipv6 and ipv4 route lookup behavior.

The above issues will be addressed separately after this patch series
has been accepted.

This is a joint work with Martin KaFai Lau and Eric Dumazet. And many
many thanks to them for their inspiring ideas and big big code review
efforts.
====================

Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:59 +01:00
81eb8447da ipv6: take care of rt6_stats
Currently, most of the rt6_stats are not hooked up correctly. As the
last part of this patch series, hook up all existing rt6_stats and add
one new stat fib_rt_uncache to indicate the number of routes in the
uncached list.
For details of the stats, please refer to the comments added in
include/net/ip6_fib.h.

Note: fib_rt_alloc and fib_rt_uncache are not guaranteed to be modified
under a lock. So atomic_t is used for them.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:58 +01:00
66f5d6ce53 ipv6: replace rwlock with rcu and spinlock in fib6_table
With all the preparation work before, we are now ready to replace rwlock
with rcu and spinlock in fib6_table.
That means now all fib6_node in fib6_table are protected by rcu. And
when freeing fib6_node, call_rcu() is used to wait for the rcu grace
period before releasing the memory.
When accessing fib6_node, corresponding rcu APIs need to be used.
And all previous sessions protected by the write lock will now be
protected by the spin lock per table.
All previous sessions protected by read lock will now be protected by
rcu_read_lock().

A couple of things to note here:
1. As part of the work of replacing rwlock with rcu, the linked list of
fn->leaf now has to be rcu protected as well. So both fn->leaf and
rt->dst.rt6_next are now __rcu tagged and corresponding rcu APIs are
used when manipulating them.

2. For fn->rr_ptr, first of all, it also needs to be rcu protected now
and is tagged with __rcu and rcu APIs are used in corresponding places.
Secondly, fn->rr_ptr is changed in rt6_select() which is a reader
thread. This makes the issue a bit complicated. We think a valid
solution for it is to let rt6_select() grab the tb6_lock if it decides
to change it. As it is not in the normal operation and only happens when
there is no valid neighbor cache for the route, we think the performance
impact should be low.

3. fib6_walk_continue() has to be called with tb6_lock held even in the
route dumping related functions, e.g. inet6_dump_fib(),
fib6_tables_dump() and ipv6_route_seq_ops. It is because
fib6_walk_continue() makes modifications to the walker structure, and so
are fib6_repair_tree() and fib6_del_route(). In order to do proper
syncing between them, we need to let fib6_walk_continue() hold the lock.
We may be able to do further improvement on the way we do the tree walk
to get rid of the need for holding the spin lock. But not for now.

4. When fib6_del_route() removes a route from the tree, we no longer
mark rt->dst.rt6_next to NULL to make simultaneous reader be able to
further traverse the list with rcu. However, rt->dst.rt6_next is only
valid within this same rcu period. No one should access it later.

5. All the operation of atomic_inc(rt->rt6i_ref) is changed to be
performed before we publish this route (either by linking it to fn->leaf
or insert it in the list pointed by fn->leaf) just to be safe because as
soon as we publish the route, some read thread will be able to access it.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:58 +01:00
17ecf590b3 ipv6: add key length check into rt6_select()
After rwlock is replaced with rcu and spinlock, fib6_lookup() could
potentially return an intermediate node if other thread is doing
fib6_del() on a route which is the only route on the node so that
fib6_repair_tree() will be called on this node and potentially assigns
fn->leaf to the its child's fn->leaf.

In order to detect this situation in rt6_select(), we have to check if
fn->fn_bit is consistent with the key length stored in the route. And
depending on if the fn is in the subtree or not, the key is either
rt->rt6i_dst or rt->rt6i_src.
If any inconsistency is found, that means the node no longer holds valid
routes in it. So net->ipv6.ip6_null_entry is returned.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:58 +01:00
8d1040e808 ipv6: check fn->leaf before it is used
If rwlock is replaced with rcu and spinlock, it is possible that the
reader thread will see fn->leaf as NULL in the following scenarios:
1. fib6_add() is in progress and we have already inserted a new node but
not yet inserted the route.
2. fib6_del_route() is in progress and we have already set fn->leaf to
NULL but not yet freed the node because of rcu grace period.

This patch makes sure all the reader threads check fn->leaf first before
using it. And together with later patch to grab rcu_read_lock() and
rcu_dereference() fn->leaf, it makes sure reader threads are safe when
accessing fn->leaf.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:58 +01:00
bbd63f06d1 ipv6: update fn_sernum after route is inserted to tree
fib6_add() logic currently calls fib6_add_1() to figure out what node
should be used for the newly added route and then call
fib6_add_rt2node() to insert the route to the node.
And during the call of fib6_add_1(), fn_sernum is updated for all nodes
that share the same prefix as the new route.
This does not have issue in the current code because reader thread will
not be able to access the tree while writer thread is inserting new
route to it. However, it is not the case once we transition to use RCU.
Reader thread could potentially see the new fn_sernum before the new
route is inserted. As a result, reader thread's route lookup will return
a stale route with the new fn_sernum.

In order to solve this issue, we remove all the update of fn_sernum in
fib6_add_1(), and instead, introduce a new function that updates fn_sernum
for all related nodes and call this functions once the route is
successfully inserted to the tree.
Also, smp_wmb() is used after a route is successfully inserted into the
fib tree and right before the updated of fn->sernum. And smp_rmb() is
used right after fn->sernum is accessed in rt6_get_cookie_safe(). This
is to guarantee that when the reader thread sees the new fn->sernum, the
new route is already inserted in the tree in memory.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:58 +01:00
d3843fe5fd ipv6: replace dst_hold() with dst_hold_safe() in routing code
With rwlock, it is safe to call dst_hold() in the read thread because
read thread is guaranteed to be separated from write thread.
However, after we replace rwlock with rcu, it is no longer safe to use
dst_hold(). A dst might already have been deleted but is waiting for the
rcu grace period to pass before freeing the memory when a read thread is
trying to do dst_hold(). This could potentially cause double free issue.

So this commit replaces all dst_hold() with dst_hold_safe() in all read
thread to avoid this double free issue.
And in order to make the code more compact, a new function ip6_hold_safe()
is introduced. It calls dst_hold_safe() first, and if that fails, it will
either fall back to hold and return net->ipv6.ip6_null_entry or set rt to
NULL according to the caller's need.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:58 +01:00
51e398e86d ipv6: don't release rt->rt6i_pcpu memory during rt6_release()
After rwlock is replaced with rcu and spinlock, route lookup can happen
simultanously with route deletion.
This patch removes the call to free_percpu(rt->rt6i_pcpu) from
rt6_release() to avoid the race condition between rt6_release() and
rt6_get_pcpu_route(). And as free_percpu(rt->rt6i_pcpu) is already
called in ip6_dst_destroy() after the rcu grace period, it is safe to do
this change.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:58 +01:00
a94b9367e0 ipv6: grab rt->rt6i_ref before allocating pcpu rt
After rwlock is replaced with rcu and spinlock, ip6_pol_route() will be
called with only rcu held. That means rt6 route deletion could happen
simultaneously with rt6_make_pcpu_rt(). This could potentially cause
memory leak if rt6_release() is called right before rt6_make_pcpu_rt()
on the same route.

This patch grabs rt->rt6i_ref safely before calling rt6_make_pcpu_rt()
to make sure rt6_release() will not get triggered while
rt6_make_pcpu_rt() is in progress. And rt6_release() is called after
rt6_make_pcpu_rt() is finished.

Note: As we are incrementing rt->rt6i_ref in ip6_pol_route(), there is a
very slim chance that fib6_purge_rt() will be triggered unnecessarily
when deleting a route if ip6_pol_route() running on another thread picks
this route as well and tries to make pcpu cache for it.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:58 +01:00
2b760fcf5c ipv6: hook up exception table to store dst cache
This commit makes use of the exception hash table implementation to
store dst caches created by pmtu discovery and ip redirect into the hash
table under the rt_info and no longer inserts these routes into fib6
tree.
This makes the fib6 tree only contain static configured routes and could
now be protected by rcu instead of a rw lock.
With this change, in the route lookup related functions, after finding
the rt6_info with the longest prefix, we also need to search for the
exception table before doing backtracking.
In the route delete function, if the route being deleted is not a dst
cache, deletion of this route also need to flush the whole hash table
under it. If it is a dst cache, then only delete the cached dst in the
hash table.

Note: for fib6_walk_continue() function, w->root now is always pointing
to a root node considering that fib6_prune_clones() is removed from the
code. So we add a WARN_ON() msg to make sure w->root always points to a
root node and also removed the update of w->root in fib6_repair_tree().
This is a prerequisite for later patch because we don't need to make
w->root as rcu protected when replacing rwlock with RCU.
Also, we remove all prune related variables as it is no longer used.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:57 +01:00
38fbeeeecc ipv6: prepare fib6_locate() for exception table
fib6_locate() is used to find the fib6_node according to the passed in
prefix address key. It currently tries to find the fib6_node with the
exact match of the passed in key. However, when we move cached routes
into the exception table, fib6_locate() will fail to find the fib6_node
for it as the cached routes will be stored in the exception table under
the fib6_node with the longest prefix match of the cache's dst addr key.
This commit adds a new parameter to let the caller specify if it needs
exact match or longest prefix match.
Right now, all callers still does exact match when calling
fib6_locate(). It will be changed in later commit where exception table
is hooked up to store cached routes.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:57 +01:00
c757faa8bf ipv6: prepare fib6_age() for exception table
If all dst cache entries are stored in the exception table under the
main route, we have to go through them during fib6_age() when doing
garbage collecting.
Introduce a new function rt6_age_exception() which goes through all dst
entries in the exception table and remove those entries that are expired.
This function is called in fib6_age() so that all dst caches are also
garbage collected.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:57 +01:00
b16cb459d7 ipv6: prepare rt6_clean_tohost() for exception table
If we move all cached dst into the exception table under the main route,
current rt6_clean_tohost() will no longer be able to access them.
This commit makes fib6_clean_tohost() to also go through all cached
routes in exception table and removes cached gateway routes to the
passed in gateway.
This is a preparation in order to move all cached routes into the
exception table.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:57 +01:00
f5bbe7ee79 ipv6: prepare rt6_mtu_change() for exception table
If we move all cached dst into the exception table under the main route,
current rt6_mtu_change() will no longer be able to access them.
This commit makes rt6_mtu_change_route() function to also go through all
cached routes in the exception table under the main route and do proper
updates on the mtu.
This is a preparation in order to move all cached routes into the
exception table.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:57 +01:00
60006a4825 ipv6: prepare fib6_remove_prefsrc() for exception table
After we move cached dst entries into the exception table under its
parent route, current fib6_remove_prefsrc() no longer can access them.
This commit makes fib6_remove_prefsrc() also go through all routes
in the exception table to remove the pref src.
This is a preparation patch in order to move all cached dst into the
exception table.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:57 +01:00
35732d01fe ipv6: introduce a hash table to store dst cache
Add a hash table into struct rt6_info in order to store dst caches
created by pmtu discovery and ip redirect in ipv6 routing code.
APIs to add dst cache, delete dst cache, find dst cache and update
dst cache in the hash table are implemented and will be used in later
commits.
This is a preparation work to move all cache routes into the exception
table instead of getting inserted into the fib6 tree.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:57 +01:00
180ca444b9 ipv6: introduce a new function fib6_update_sernum()
This function takes a route as input and tries to update the sernum in
the fib6_node this route is associated with. It will be used in later
commit when adding a cached route into the exception table under that
route.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 21:22:57 +01:00
85b1bb2480 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:

 - a couple of serious fixes: use after free and blacklist for WRITE
   SAME

 - one error leg fix: write_pending failure

 - one user experience problem: do not override max_sectors_kb

 - one minor unused function removal

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ibmvscsis: Fix write_pending failure path
  scsi: libiscsi: Remove iscsi_destroy_session
  scsi: libiscsi: Fix use-after-free race during iscsi_session_teardown
  scsi: sd: Do not override max_sectors_kb sysfs setting
  scsi: sd: Implement blacklist option for WRITE SAME w/ UNMAP
2017-10-07 12:34:16 -07:00
67936a41e5 Merge branch 'i2c/for-current-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
 "I2C has three driver fixes for the newly introduced drivers and one ID
  addition for the i801 driver"

* 'i2c/for-current-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: i2c-stm32f7: make structure stm32f7_setup static const
  i2c: ensure termination of *_device_id tables
  i2c: i801: Add support for Intel Cedar Fork
  i2c: stm32f7: fix setup structure
2017-10-07 10:07:51 -07:00
031b814030 Merge tag 'mmc-v4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
 "MMC core:

   - Fix driver strength selection when selecting hs400es

   - Delete bounce buffer handling:

     This change fixes a problem related to how bounce buffers are being
     allocated. However, instead of trying to fix that, let's just
     remove the mmc bounce buffer code altogether, as it has practically
     no use.

  MMC host:

   - meson-gx: A couple of fixes related to clock/phase/tuning

   - sdhci-xenon: Fix clock resource by adding an optional bus clock"

* tag 'mmc-v4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: sdhci-xenon: Fix clock resource by adding an optional bus clock
  mmc: meson-gx: include tx phase in the tuning process
  mmc: meson-gx: fix rx phase reset
  mmc: meson-gx: make sure the clock is rounded down
  mmc: Delete bounce buffer handling
  mmc: core: add driver strength selection when selecting hs400es
2017-10-07 10:03:03 -07:00
0d7b70e836 bnxt_en: don't consider building bnxt_tc.o if option not enabled
Instead of zeroing out bnxt_tc.c with a #ifdef foo, instead don't compile
the file when the option is not enabled. Now make and the preprocessor do
not have to waste time compiling a no-op.

Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 03:00:26 +01:00
1c86f2e4c8 Merge tag 'hwmon-for-linus-v4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fix from Guenter Roeck:
 "Fix up error path in xgene driver"

* tag 'hwmon-for-linus-v4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (xgene) Fix up error handling path mixup in 'xgene_hwmon_probe()'
2017-10-06 17:59:32 -07:00
ca82214144 Merge branch 'tcp-rbtree-retransmit-queue'
Eric Dumazet says:

====================
tcp: implement rb-tree based retransmit queue

This patch series implement RB-tree based retransmit queue for TCP,
to better match modern BDP.

Tested:

 On receiver :
 netem on ingress : delay 150ms 200us loss 1
 GRO disabled to force stress and SACK storms.

for f in `seq 1 10`
do
 ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'

Before patch :

323.87  351.48  339.59  338.62  306.72
204.07  304.93  291.88  202.47  176.88
->   2840

After patch:

1700.83 2207.98 2070.17 1544.26 2114.76
2124.89 1693.14 1080.91 2216.82 1299.94
->  18053

Average of 1805 Mbits istead of 284 Mbits.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:54 +01:00
75c119afe1 tcp: implement rb-tree based retransmit queue
Using a linear list to store all skbs in write queue has been okay
for quite a while : O(N) is not too bad when N < 500.

Things get messy when N is the order of 100,000 : Modern TCP stacks
want 10Gbit+ of throughput even with 200 ms RTT flows.

40 ns per cache line miss means a full scan can use 4 ms,
blowing away CPU caches.

SACK processing often can use various hints to avoid parsing
whole retransmit queue. But with high packet losses and/or high
reordering, hints no longer work.

Sender has to process thousands of unfriendly SACK, accumulating
a huge socket backlog, burning a cpu and massively dropping packets.

Using an rb-tree for retransmit queue has been avoided for years
because it added complexity and overhead, but now is the time
to be more resistant and say no to quadratic behavior.

1) RTX queue is no longer part of the write queue : already sent skbs
are stored in one rb-tree.

2) Since reaching the head of write queue no longer needs
sk->sk_send_head, we added an union of sk_send_head and tcp_rtx_queue

Tested:

 On receiver :
 netem on ingress : delay 150ms 200us loss 1
 GRO disabled to force stress and SACK storms.

for f in `seq 1 10`
do
 ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'

Before patch :

323.87
351.48
339.59
338.62
306.72
204.07
304.93
291.88
202.47
176.88
   2840

After patch:

1700.83
2207.98
2070.17
1544.26
2114.76
2124.89
1693.14
1080.91
2216.82
1299.94
  18053

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:54 +01:00
f33198163a tcp: pass previous skb to tcp_shifted_skb()
No need to recompute previous skb, as it will be a bit more
expensive when rtx queue is converted to RB tree.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:54 +01:00
8ba6ddaaf8 tcp: reduce tcp_fastretrans_alert() verbosity
With upcoming rb-tree implementation, the checks will trigger
more often, and this is expected.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:53 +01:00
5e76ee4b8e tcp: tcp_mark_head_lost() optimization
It will be a bit more expensive to get the head of rtx queue
once rtx queue is converted to an rb-tree.

We can avoid this extra cost in case tp->lost_skb_hint is set.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:53 +01:00
4e8cc22803 tcp: tcp_tx_timestamp() cleanup
tcp_write_queue_tail() call can be factorized.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:53 +01:00
ac3f09ba3e tcp: uninline tcp_write_queue_purge()
Since the upcoming rtx rbtree will add some extra code,
it is time to not inline this fat function anymore.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:53 +01:00
18a4c0eab2 net: add rb_to_skb() and other rb tree helpers
Geeralize private netem_rb_to_skb()

TCP rtx queue will soon be converted to rb-tree,
so we will need skb_rbtree_walk() helpers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:53 +01:00
dbeb1a8ff5 Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:

 - build fix to export the clk_bulk_prepare() symbol

 - suspend fix for Samsung Exynos SoCs where we need to keep clks on
   across suspend

 - two critical clk markings for clks that shouldn't ever turn off on
   Rockchip SoCs

 - a fix for a copy-paste mistake on Rockchip rk3128 causing some clks
   to touch the same bit and trample over one another

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: samsung: exynos4: Enable VPLL and EPLL clocks for suspend/resume cycle
  clk: Export clk_bulk_prepare()
  clk: rockchip: add sclk_timer5 as critical clock on rk3128
  clk: rockchip: fix up rk3128 pvtm and mipi_24m gate regs error
  clk: rockchip: add pclk_pmu as critical clock on rk3128
2017-10-06 16:25:08 -07:00
ed0f72f4ea Merge tag 'arc-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC udpates from Vineet Gupta:

 - updates for various platforms

 - boot log updates for upcoming HS48 family of cores (dual issue)

* tag 'arc-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARC: [plat-hsdk]: Add reset controller node to manage ethernet reset
  ARC: [plat-hsdk]: Temporary fix to set CPU frequency to 1GHz
  ARC: fix allnoconfig build warning
  ARCv2: boot log: identify HS48 cores (dual issue)
  ARC: boot log: decontaminate ARCv2 ISA_CONFIG register
  arc: remove redundant UTS_MACHINE define in arch/arc/Makefile
  ARC: [plat-eznps] Update platform maintainer as Noam left
  ARC: [plat-hsdk] use actual clk driver to manage cpu clk
  ARC: [*defconfig] Reenable soft lock-up detector
  ARC: [plat-axs10x] sdio: Temporary fix of sdio ciu frequency
  ARC: [plat-hsdk] sdio: Temporary fix of sdio ciu frequency
  ARC: [plat-axs103] Add temporary quirk to reset ethernet IP
2017-10-06 15:57:08 -07:00
eab26ad197 Merge tag 'xfs-4.14-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:

 - fix a race between overlapping copy on write aio

 - fix cow fork swapping when we defragment reflinked files

* tag 'xfs-4.14-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: handle racy AIO in xfs_reflink_end_cow
  xfs: always swap the cow forks when swapping extents
2017-10-06 15:53:36 -07:00
f5333f80c3 Merge branch '40GbE' of ra.kernel.org:/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2017-10-06

This series contains updates to i40e and i40evf only.

Rami fixes a typo in the code comments.

Mitch adds an ethtool private flag to control source pruning to resolve an
issue where our default behavior is to enable source pruning which breaks ARP
monitoring in channel bonding.  Fixes a couple of register definitions, which
were incorrect.

Jake fixes an issue with multiple logical CPUs per core (simultaneous
multithreading - SMT) and how we set an affinity hint based on the v_idx of
that q_vector, which is an incremental value and might lead to multiple
offline CPUs being assigned to a q_vector.  Instead, we should only assign
hints for CPUs which are online, so look to use cpumask_local_spread().
Also fixed a VF VLAN tag stripping issue, where the flag created to change
this feature was seen as unchangeable.  Lastly, organized and re-numbered
the feature flags.

Alan re-enables PTP L4 for XL710 devices with firmware version 6.0 or
greater, now that the previous bug in the older firmware is fixed.
Implements the PCI error handlers for reset_prepare() and reset_done() to
allow us to handle function level resets.

Alice cleans up code that was added to the incorrect function during a
merge.

Filip adds a change to display an error message when a module is inserted
that does not meet the thermal requirements, Talking Heads "Burning Down
the House" comes to mind.  Also fixed a flow director filter issue where
a variable was not being cleared which stores the filter number to be
removed from the list when the firmware refused to add the requested
filter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-06 23:29:38 +01:00
17d084c8d1 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A collection of fixes for this series. This contains:

   - NVMe pull request from Christoph, one uuid attribute fix, and one
     fix for the controller memory buffer address for remapped BARs.

   - use-after-free fix for bsg, from Benjamin Block.

   - bcache race/use-after-free fix for a list traversal, fixing a
     regression in this merge window. From Coly Li.

   - null_blk change configfs dependency change from a 'depends' to a
     'select'. This is a change from this merge window as well. From me.

   - nbd signal fix from Josef, fixing a regression introduced with the
     status code changes.

   - nbd MAINTAINERS mailing list entry update.

   - blk-throttle stall fix from Joseph Qi.

   - blk-mq-debugfs fix from Omar, fixing an issue where we don't
     register the IO scheduler debugfs directory, if the driver is
     loaded with it. Only shows up if you switch through the sysfs
     interface"

* 'for-linus' of git://git.kernel.dk/linux-block:
  bsg-lib: fix use-after-free under memory-pressure
  nvme-pci: Use PCI bus address for data/queues in CMB
  blk-mq-debugfs: fix device sched directory for default scheduler
  null_blk: change configfs dependency to select
  blk-throttle: fix possible io stall when upgrade to max
  MAINTAINERS: update list for NBD
  nbd: fix -ERESTARTSYS handling
  nvme: fix visibility of "uuid" ns attribute
  bcache: use llist_for_each_entry_safe() in __closure_wake_up()
2017-10-06 12:13:50 -07:00
80cf1f8c16 Merge tag 'pci-v4.14-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
 "Fix legacy IDE probe issues exposed by recent PCI core IRQ mapping
  changes (Bartlomiej Zolnierkiewicz, Lorenzo Pieralisi)"

* tag 'pci-v4.14-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  ide: fix IRQ assignment for PCI bus order probing
  ide: pci: free PCI BARs on initialization failure
  ide: free hwif->portdev on hwif_init() failure
2017-10-06 12:07:09 -07:00
275490680c Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:

 - Bring initialisation of user space undefined instruction handling
   early (core_initcall) since late_initcall() happens after modprobe in
   initramfs is invoked. Similar fix for fpsimd initialisation

 - Increase the kernel stack when KASAN is enabled

 - Bring the PCI ACS enabling earlier via the
   iort_init_platform_devices()

 - Fix misleading data abort address printing (decimal vs hex)

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Ensure fpsimd support is ready before userspace is active
  arm64: Ensure the instruction emulation is ready for userspace
  arm64: Use larger stacks when KASAN is selected
  ACPI/IORT: Fix PCI ACS enablement
  arm64: fix misleading data abort decoding
2017-10-06 11:31:46 -07:00
8d473320ee Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:

 - fix PPC XIVE interrupt delivery

 - fix x86 RCU breakage from asynchronous page faults when built without
   PREEMPT_COUNT

 - fix x86 build with -frecord-gcc-switches

 - fix x86 build without X86_LOCAL_APIC

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: add X86_LOCAL_APIC dependency
  x86/kvm: Move kvm_fastop_exception to .fixup section
  kvm/x86: Avoid async PF preempting the kernel incorrectly
  KVM: PPC: Book3S: Fix server always zero from kvmppc_xive_get_xive()
2017-10-06 11:28:34 -07:00
d109d83fc8 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
 "This is a pretty small pull request. Only 6 patches in total. There
  are no outstanding -rc patches on the mailing list after this pull
  request, so only if some new issues are discovered in the remainder of
  the rc cycles will you hear from me again.

  Summary:
   - a fix for iwpm netlink usage
   - a fix for error unwinding in mlx5
   - two fixes to vlan handling in qedr
   - a couple small i40iw fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
  i40iw: Fix port number for query QP
  i40iw: Add missing memory barriers
  RDMA/qedr: Parse vlan priority as sl
  RDMA/qedr: Parse VLAN ID correctly and ignore the value of zero
  IB/mlx5: Fix label order in error path handling
  RDMA/iwpm: Properly mark end of NL messages
2017-10-06 11:25:55 -07:00
6151b8b37b ppp: fix race in ppp device destruction
ppp_release() tries to ensure that netdevices are unregistered before
decrementing the unit refcount and running ppp_destroy_interface().

This is all fine as long as the the device is unregistered by
ppp_release(): the unregister_netdevice() call, followed by
rtnl_unlock(), guarantee that the unregistration process completes
before rtnl_unlock() returns.

However, the device may be unregistered by other means (like
ppp_nl_dellink()). If this happens right before ppp_release() calling
rtnl_lock(), then ppp_release() has to wait for the concurrent
unregistration code to release the lock.
But rtnl_unlock() releases the lock before completing the device
unregistration process. This allows ppp_release() to proceed and
eventually call ppp_destroy_interface() before the unregistration
process completes. Calling free_netdev() on this partially unregistered
device will BUG():

 ------------[ cut here ]------------
 kernel BUG at net/core/dev.c:8141!
 invalid opcode: 0000 [#1] SMP

 CPU: 1 PID: 1557 Comm: pppd Not tainted 4.14.0-rc2+ #4
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014

 Call Trace:
  ppp_destroy_interface+0xd8/0xe0 [ppp_generic]
  ppp_disconnect_channel+0xda/0x110 [ppp_generic]
  ppp_unregister_channel+0x5e/0x110 [ppp_generic]
  pppox_unbind_sock+0x23/0x30 [pppox]
  pppoe_connect+0x130/0x440 [pppoe]
  SYSC_connect+0x98/0x110
  ? do_fcntl+0x2c0/0x5d0
  SyS_connect+0xe/0x10
  entry_SYSCALL_64_fastpath+0x1a/0xa5

 RIP: free_netdev+0x107/0x110 RSP: ffffc28a40573d88
 ---[ end trace ed294ff0cc40eeff ]---

We could set the ->needs_free_netdev flag on PPP devices and move the
ppp_destroy_interface() logic in the ->priv_destructor() callback. But
that'd be quite intrusive as we'd first need to unlink from the other
channels and units that depend on the device (the ones that used the
PPPIOCCONNECT and PPPIOCATTACH ioctls).

Instead, we can just let the netdevice hold a reference on its
ppp_file. This reference is dropped in ->priv_destructor(), at the very
end of the unregistration process, so that neither ppp_release() nor
ppp_disconnect_channel() can call ppp_destroy_interface() in the interim.

Reported-by: Beniamino Galvani <bgalvani@redhat.com>
Fixes: 8cb775bc0a ("ppp: fix device unregistration upon netns deletion")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-06 10:16:34 -07:00
4bc4e64c2c Merge tag 'batadv-next-for-davem-20171006' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - Cleanup patches to make checkpatch happy, by Sven Eckelmann (3 patches)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-06 10:12:52 -07:00
d2746fe538 bnx2x: Use pci_ari_enabled() instead of local copy
Use pci_ari_enabled() from the PCI core instead of the identical local copy
bnx2x_ari_enabled().  No functional change intended.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-06 10:06:06 -07:00
399e3514e1 Merge branch 'xdp_monitor-improve'
Jesper Dangaard Brouer says:

====================
Improve xdp_monitor samples/bpf

Here are some improvements to the xdp_monitor tool currently located
under samples/bpf/.  Once the tools library libbpf become more feature
complete, xdp_monitor should be converted to use it, and be moved into
tools/bpf/xdp/ or tools/xdp/.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-06 10:04:36 -07:00
c4eb7f4643 samples/bpf: xdp_monitor increase memory rlimit
Other concurrent running programs, like perf or the XDP program what
needed to be monitored, might take up part of the max locked memory
limit.  Thus, the xdp_monitor tool have to set the RLIMIT_MEMLOCK to
RLIM_INFINITY, as it cannot determine a more sane limit.

Using the man exit(3) specified EXIT_FAILURE return exit code, and
correct other users too.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-06 10:04:36 -07:00
280b058d48 samples/bpf: xdp_monitor also record xdp_exception tracepoint
Also monitor the tracepoint xdp_exception.  This tracepoint is usually
invoked by the drivers.  Programs themselves can activate this by
returning XDP_ABORTED, which will drop the packet but also trigger the
tracepoint.  This is useful for distinguishing intentional (XDP_DROP)
vs. ebpf-program error cases that cased a drop (XDP_ABORTED).

Drivers also use this tracepoint for reporting on XDP actions that are
unknown to the specific driver.  This can help the user to detect if a
driver e.g. doesn't implement XDP_REDIRECT yet.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-06 10:04:35 -07:00
f4ce0a0116 samples/bpf: xdp_monitor first 8 bytes are not accessible by bpf
The first 8 bytes of the tracepoint context struct are not accessible
by the bpf code.  This is a choice that dates back to the original
inclusion of this code.

See explaination in:
 commit 98b5c2c65c ("perf, bpf: allow bpf programs attach to tracepoints")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-06 10:04:35 -07:00
c6a157525b Merge branch 'nfp-extend-match-and-action'
Simon Horman says:

====================
nfp: extend match and action for flower offload

Pieter says:

This series extends flower offload match and action capabilities. It
specifically adds offload capabilities for matching on MPLS, TTL, TOS
and flow label. Furthermore offload capabilities for action have been
expanded to include set ethernet, ipv4, ipv6, tcp and udp headers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-06 09:56:36 -07:00