44719 Commits

Author SHA1 Message Date
David Howells
8732db67c6 rxrpc: Fix exclusive client connections
Exclusive connections are currently reusable (which they shouldn't be)
because rxrpc_alloc_client_connection() checks the exclusive flag in the
rxrpc_connection struct before it's initialised from the function
parameters.  This means that the DONT_REUSE flag doesn't get set.

Fix this by checking the function parameters for the exclusive flag.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-29 22:37:15 +01:00
Eric Dumazet
7836667cec net: do not export sk_stream_write_space
Since commit 900f65d361d3 ("tcp: move duplicate code from
tcp_v4_init_sock()/tcp_v6_init_sock()") we no longer need
to export sk_stream_write_space()

From: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 20:32:38 -04:00
Johannes Berg
8f7d99ba85 cfg80211: wext: really don't store non-WEP keys
Jouni reported that during (repeated) wext_pmf test runs (from the
wpa_supplicant hwsim test suite) the kernel crashes. The reason is
that after the key is set, the wext code still unnecessarily stores
it into the key cache. Despite smatch pointing out an overflow, I
failed to identify the possibility for this in the code and missed
it during development of the earlier patch series.

In order to fix this, simply check that we never store anything but
WEP keys into the cache, adding a comment as to why that's enough.

Also, since the cache is still allocated early even if it won't be
used in many cases, add a comment explaining why - otherwise we'd
have to roll back key settings to the driver in case of allocation
failures, which is far more difficult.

Fixes: 89b706fb28e4 ("cfg80211: reduce connect key caching struct size")
Reported-by: Jouni Malinen <j@w1.fi>
Bisected-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-28 23:55:23 +02:00
Lawrence Brakmo
3acf3ec3f4 tcp: Change txhash on every SYN and RTO retransmit
The current code changes txhash (flowlables) on every retransmitted
SYN/ACK, but only after the 2nd retransmitted SYN and only after
tcp_retries1 RTO retransmits.

With this patch:
1) txhash is changed with every SYN retransmits
2) txhash is changed with every RTO.

The result is that we can start re-routing around failed (or very
congested paths) as soon as possible. Otherwise application health
checks may fail and the connection may be terminated before we start
to change txhash.

v4: Removed sysctl, txhash is changed for all RTOs
v3: Removed text saying default value of sysctl is 0 (it is 100)
v2: Added sysctl documentation and cleaned code

Tested with packetdrill tests

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 07:52:34 -04:00
Jiri Pirko
347e3b28c1 switchdev: remove FIB offload infrastructure
Since this is now taken care of by FIB notifier, remove the code, with
all unused dependencies.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 04:48:00 -04:00
Jiri Pirko
c98501879b fib: introduce FIB info offload flag helpers
These helpers are to be used in case someone offloads the FIB entry. The
result is that if the entry is offloaded to at least one device, the
offload flag is set.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 04:48:00 -04:00
Jiri Pirko
b90eb75494 fib: introduce FIB notification infrastructure
This allows to pass information about added/deleted FIB entries/rules to
whoever is interested. This is done in a very similar way as devinet
notifies address additions/removals.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 04:48:00 -04:00
Hadar Hen Zion
eb523f42d7 net/sched: cls_flower: Use a proper mask value for enc key id parameter
The current code use the encapsulation key id value as the mask of that
parameter which is wrong. Fix that by using a full mask.

Fixes: bc3103f1ed40 ('net/sched: cls_flower: Classify packet in ip tunnels')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Acked-by: Amir Vadai <amir@vadai.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 03:11:22 -04:00
Deepa Dinamani
078cd8279e fs: Replace CURRENT_TIME with current_time() for inode timestamps
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:21 -04:00
Ke Wang
77b00bc037 sunrpc: queue work on system_power_efficient_wq
sunrpc uses workqueue to clean cache regulary. There is no real dependency
of executing work on the cpu which queueing it.

On a idle system, especially for a heterogeneous systems like big.LITTLE,
it is observed that the big idle cpu was woke up many times just to service
this work, which against the principle of power saving. It would be better
if we can schedule it on a cpu which the scheduler believes to be the most
appropriate one.

After apply this patch, system_wq will be replaced by
system_power_efficient_wq for sunrpc. This functionality is enabled when
CONFIG_WQ_POWER_EFFICIENT is selected.

Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-27 14:35:36 -04:00
Yotam Gigi
c006da0be0 act_ife: Fix false encoding
On ife encode side, the action stores the different tlvs inside the ife
header, where each tlv length field should refer to the length of the
whole tlv (without additional padding) and not just the data length.

On ife decode side, the action iterates over the tlvs in the ife header
and parses them one by one, where in each iteration the current pointer is
advanced according to the tlv size.

Before, the encoding encoded only the data length inside the tlv, which led
to false parsing of ife the header. In addition, due to the fact that the
loop counter was unsigned, it could lead to infinite parsing loop.

This fix changes the loop counter to be signed and fixes the encoding to
take into account the tlv type and size.

Fixes: 28a10c426e81 ("net sched: fix encoding to use real length")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27 09:53:17 -04:00
Yotam Gigi
4b1d488a28 act_ife: Fix external mac header on encode
On ife encode side, external mac header is copied from the original packet
and may be overridden if the user requests. Before, the mac header copy
was done from memory region that might not be accessible anymore, as
skb_cow_head might free it and copy the packet. This led to random values
in the external mac header once the values were not set by user.

This fix takes the internal mac header from the packet, after the call to
skb_cow_head.

Fixes: ef6980b6becb ("net sched: introduce IFE action")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27 09:53:16 -04:00
Jorgen Hansen
1190cfdb1a VSOCK: Don't dec ack backlog twice for rejected connections
If a pending socket is marked as rejected, we will decrease the
sk_ack_backlog twice. So don't decrement it for rejected sockets
in vsock_pending_work().

Testing of the rejected socket path was done through code
modifications.

Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-27 07:59:25 -04:00
Johannes Berg
8564e38206 cfg80211: add checks for beacon rate, extend to mesh
The previous commit added support for specifying the beacon rate
for AP mode. Add features checks to this, and extend it to also
support the rate configuration for mesh networks. For IBSS it's
not as simple due to joining etc., so that's not yet supported.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-26 10:23:48 +02:00
Purushottam Kushwaha
a7c7fbff6a cfg80211: Add support to configure a beacon data rate
This allows an option to configure a single beacon tx rate for an AP.

Signed-off-by: Purushottam Kushwaha <pkushwah@qti.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-09-26 10:23:48 +02:00
David S. Miller
71527eb2be Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2016-09-25

Here are a few more Bluetooth & 802.15.4 patches for the 4.9 kernel that
have popped up during the past week:

 - New USB ID for QCA_ROME Bluetooth device
 - NULL pointer dereference fix for Bluetooth mgmt sockets
 - Fixes for BCSP driver
 - Fix for updating LE scan response

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-25 23:52:22 -04:00
Nikolay Aleksandrov
2cf750704b ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route
Since the commit below the ipmr/ip6mr rtnl_unicast() code uses the portid
instead of the previous dst_pid which was copied from in_skb's portid.
Since the skb is new the portid is 0 at that point so the packets are sent
to the kernel and we get scheduling while atomic or a deadlock (depending
on where it happens) by trying to acquire rtnl two times.
Also since this is RTM_GETROUTE, it can be triggered by a normal user.

Here's the sleeping while atomic trace:
[ 7858.212557] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[ 7858.212748] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0
[ 7858.212881] 2 locks held by swapper/0/0:
[ 7858.213013]  #0:  (((&mrt->ipmr_expire_timer))){+.-...}, at: [<ffffffff810fbbf5>] call_timer_fn+0x5/0x350
[ 7858.213422]  #1:  (mfc_unres_lock){+.....}, at: [<ffffffff8161e005>] ipmr_expire_process+0x25/0x130
[ 7858.213807] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc7+ #179
[ 7858.213934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 7858.214108]  0000000000000000 ffff88005b403c50 ffffffff813a7804 0000000000000000
[ 7858.214412]  ffffffff81a1338e ffff88005b403c78 ffffffff810a4a72 ffffffff81a1338e
[ 7858.214716]  000000000000026c 0000000000000000 ffff88005b403ca8 ffffffff810a4b9f
[ 7858.215251] Call Trace:
[ 7858.215412]  <IRQ>  [<ffffffff813a7804>] dump_stack+0x85/0xc1
[ 7858.215662]  [<ffffffff810a4a72>] ___might_sleep+0x192/0x250
[ 7858.215868]  [<ffffffff810a4b9f>] __might_sleep+0x6f/0x100
[ 7858.216072]  [<ffffffff8165bea3>] mutex_lock_nested+0x33/0x4d0
[ 7858.216279]  [<ffffffff815a7a5f>] ? netlink_lookup+0x25f/0x460
[ 7858.216487]  [<ffffffff8157474b>] rtnetlink_rcv+0x1b/0x40
[ 7858.216687]  [<ffffffff815a9a0c>] netlink_unicast+0x19c/0x260
[ 7858.216900]  [<ffffffff81573c70>] rtnl_unicast+0x20/0x30
[ 7858.217128]  [<ffffffff8161cd39>] ipmr_destroy_unres+0xa9/0xf0
[ 7858.217351]  [<ffffffff8161e06f>] ipmr_expire_process+0x8f/0x130
[ 7858.217581]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.217785]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.217990]  [<ffffffff810fbc95>] call_timer_fn+0xa5/0x350
[ 7858.218192]  [<ffffffff810fbbf5>] ? call_timer_fn+0x5/0x350
[ 7858.218415]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
[ 7858.218656]  [<ffffffff810fde10>] run_timer_softirq+0x260/0x640
[ 7858.218865]  [<ffffffff8166379b>] ? __do_softirq+0xbb/0x54f
[ 7858.219068]  [<ffffffff816637c8>] __do_softirq+0xe8/0x54f
[ 7858.219269]  [<ffffffff8107a948>] irq_exit+0xb8/0xc0
[ 7858.219463]  [<ffffffff81663452>] smp_apic_timer_interrupt+0x42/0x50
[ 7858.219678]  [<ffffffff816625bc>] apic_timer_interrupt+0x8c/0xa0
[ 7858.219897]  <EOI>  [<ffffffff81055f16>] ? native_safe_halt+0x6/0x10
[ 7858.220165]  [<ffffffff810d64dd>] ? trace_hardirqs_on+0xd/0x10
[ 7858.220373]  [<ffffffff810298e3>] default_idle+0x23/0x190
[ 7858.220574]  [<ffffffff8102a20f>] arch_cpu_idle+0xf/0x20
[ 7858.220790]  [<ffffffff810c9f8c>] default_idle_call+0x4c/0x60
[ 7858.221016]  [<ffffffff810ca33b>] cpu_startup_entry+0x39b/0x4d0
[ 7858.221257]  [<ffffffff8164f995>] rest_init+0x135/0x140
[ 7858.221469]  [<ffffffff81f83014>] start_kernel+0x50e/0x51b
[ 7858.221670]  [<ffffffff81f82120>] ? early_idt_handler_array+0x120/0x120
[ 7858.221894]  [<ffffffff81f8243f>] x86_64_start_reservations+0x2a/0x2c
[ 7858.222113]  [<ffffffff81f8257c>] x86_64_start_kernel+0x13b/0x14a

Fixes: 2942e9005056 ("[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-25 23:41:39 -04:00
Pablo Neira Ayuso
f20fbc0717 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Conflicts:
	net/netfilter/core.c
	net/netfilter/nf_tables_netdev.c

Resolve two conflicts before pull request for David's net-next tree:

1) Between c73c24849011 ("netfilter: nf_tables_netdev: remove redundant
   ip_hdr assignment") from the net tree and commit ddc8b6027ad0
   ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").

2) Between e8bffe0cf964 ("net: Add _nf_(un)register_hooks symbols") and
   Aaron Conole's patches to replace list_head with single linked list.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:34:19 +02:00
Liping Zhang
8cb2a7d566 netfilter: nf_log: get rid of XT_LOG_* macros
nf_log is used by both nftables and iptables, so use XT_LOG_XXX macros
here is not appropriate. Replace them with NF_LOG_XXX.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:45 +02:00
Liping Zhang
ff107d2776 netfilter: nft_log: complete NFTA_LOG_FLAGS attr support
NFTA_LOG_FLAGS attribute is already supported, but the related
NF_LOG_XXX flags are not exposed to the userspace. So we cannot
explicitly enable log flags to log uid, tcp sequence, ip options
and so on, i.e. such rule "nft add rule filter output log uid"
is not supported yet.

So move NF_LOG_XXX macro definitions to the uapi/../nf_log.h. In
order to keep consistent with other modules, change NF_LOG_MASK to
refer to all supported log flags. On the other hand, add a new
NF_LOG_DEFAULT_MASK to refer to the original default log flags.

Finally, if user specify the unsupported log flags or NFTA_LOG_GROUP
and NFTA_LOG_FLAGS are set at the same time, report EINVAL to the
userspace.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:43 +02:00
Pablo Neira Ayuso
0f3cd9b369 netfilter: nf_tables: add range expression
Inverse ranges != [a,b] are not currently possible because rules are
composites of && operations, and we need to express this:

	data < a || data > b

This patch adds a new range expression. Positive ranges can be already
through two cmp expressions:

	cmp(sreg, data, >=)
	cmp(sreg, data, <=)

This new range expression provides an alternative way to express this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:42 +02:00
KOVACS Krisztian
7a682575ad netfilter: xt_socket: fix transparent match for IPv6 request sockets
The introduction of TCP_NEW_SYN_RECV state, and the addition of request
sockets to the ehash table seems to have broken the --transparent option
of the socket match for IPv6 (around commit a9407000).

Now that the socket lookup finds the TCP_NEW_SYN_RECV socket instead of the
listener, the --transparent option tries to match on the no_srccheck flag
of the request socket.

Unfortunately, that flag was only set for IPv4 sockets in tcp_v4_init_req()
by copying the transparent flag of the listener socket. This effectively
causes '-m socket --transparent' not match on the ACK packet sent by the
client in a TCP handshake.

Based on the suggestion from Eric Dumazet, this change moves the code
initializing no_srccheck to tcp_conn_request(), rendering the above
scenario working again.

Fixes: a940700003 ("netfilter: xt_socket: prepare for TCP_NEW_SYN_RECV support")
Signed-off-by: Alex Badics <alex.badics@balabit.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 23:16:11 +02:00
Florian Westphal
58e207e498 netfilter: evict stale entries when user reads /proc/net/nf_conntrack
Fabian reports a possible conntrack memory leak (could not reproduce so
far), however, one minor issue can be easily resolved:

> cat /proc/net/nf_conntrack | wc -l = 5
> 4 minutes required to clean up the table.

We should not report those timed-out entries to the user in first place.
And instead of just skipping those timed-out entries while iterating over
the table we can also zap them (we already do this during ctnetlink
walks, but I forgot about the /proc interface).

Fixes: f330a7fdbe16 ("netfilter: conntrack: get rid of conntrack timer")
Reported-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:08 +02:00
Vishwanath Pai
11d5f15723 netfilter: xt_hashlimit: Create revision 2 to support higher pps rates
Create a new revision for the hashlimit iptables extension module. Rev 2
will support higher pps of upto 1 million, Version 1 supports only 10k.

To support this we have to increase the size of the variables avg and
burst in hashlimit_cfg to 64-bit. Create two new structs hashlimit_cfg2
and xt_hashlimit_mtinfo2 and also create newer versions of all the
functions for match, checkentry and destroy.

Some of the functions like hashlimit_mt, hashlimit_mt_check etc are very
similar in both rev1 and rev2 with only minor changes, so I have split
those functions and moved all the common code to a *_common function.

Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:06 +02:00
Vishwanath Pai
0dc60a4546 netfilter: xt_hashlimit: Prepare for revision 2
I am planning to add a revision 2 for the hashlimit xtables module to
support higher packets per second rates. This patch renames all the
functions and variables related to revision 1 by adding _v1 at the
end of the names.

Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:05 +02:00
Liping Zhang
7bfdde7045 netfilter: nft_ct: report error if mark and dir specified simultaneously
NFT_CT_MARK is unrelated to direction, so if NFTA_CT_DIRECTION attr is
specified, report EINVAL to the userspace. This validation check was
already done at nft_ct_get_init, but we missed it in nft_ct_set_init.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:04 +02:00
Liping Zhang
d767ff2c84 netfilter: nft_ct: unnecessary to require dir when use ct l3proto/protocol
Currently, if the user want to match ct l3proto, we must specify the
direction, for example:
  # nft add rule filter input ct original l3proto ipv4
                                 ^^^^^^^^
Otherwise, error message will be reported:
  # nft add rule filter input ct l3proto ipv4
  nft add rule filter input ct l3proto ipv4
  <cmdline>:1:1-38: Error: Could not process rule: Invalid argument
  add rule filter input ct l3proto ipv4
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Actually, there's no need to require NFTA_CT_DIRECTION attr, because
ct l3proto and protocol are unrelated to direction.

And for compatibility, even if the user specify the NFTA_CT_DIRECTION
attr, do not report error, just skip it.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:02 +02:00
Gao Feng
8d11350f5f netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ack
It is valid that the TCP RST packet which does not set ack flag, and bytes
of ack number are zero. But current seqadj codes would adjust the "0" ack
to invalid ack number. Actually seqadj need to check the ack flag before
adjust it for these RST packets.

The following is my test case

client is 10.26.98.245, and add one iptable rule:
iptables  -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2:
--connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with
tcp-reset
This iptables rule could generate on TCP RST without ack flag.

server:10.172.135.55
Enable the synproxy with seqadjust by the following iptables rules
iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345
-m tcp --syn -j CT --notrack

iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7
--mss 1460
iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack
--ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT

The following is my test result.

1. packet trace on client
root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829,
win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7],
length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266,
ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884,
nop,wscale 7], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229,
options [nop,nop,TS val 452367885 ecr 15643479], length 0
IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226,
options [nop,nop,TS val 15643479 ecr 452367885], length 0
IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830,
win 0, length 0

2. seqadj log on server
[62873.867319] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.867644] Adjusting sequence number from 602341895->546723267,
ack from 3695959830->3695959830
[62873.869040] Adjusting sequence number from 3695959830->3695959830,
ack from 0->55618628

To summarize, it is clear that the seqadj codes adjust the 0 ack when receive
one TCP RST packet without ack.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:54:01 +02:00
Aaron Conole
e3b37f11e6 netfilter: replace list_head with single linked list
The netfilter hook list never uses the prev pointer, and so can be trimmed to
be a simple singly-linked list.

In addition to having a more light weight structure for hook traversal,
struct net becomes 5568 bytes (down from 6400) and struct net_device becomes
2176 bytes (down from 2240).

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25 14:38:48 +02:00
David S. Miller
21445c91dc RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+cFjvSw1s6N8H32AQIRyBAAkedosPtgW2JBsCEgvYcfQuSg36gBc0Km
 ozAcbzbWA50P+WSdIq+GhcnQLo3OWRsaGKp6vIrgDtJjgKZXQJyOwzlMm6xFf0A+
 Z7NtnWHuSZ90xcxVHCqXRqdNGqwmve8VTNWUbQmZpKIc6d6wNiO3rsg/WFTshroz
 FSMf67PsOf5i1hCR8x+3dPlSdkIjH/hWh3wyEmmnPjDPq49twnKkaklLylf9V9NO
 DGuW+/+wxkht3ylPRxGD7IQ1t0eA9iyU59Zk7SizuEmzrUNQrqQ3RM8ZsBe/cAA8
 VEr+RVOvVoxWUYnZbYEvwfwuKQIWhX5ptt6VLQH7sLJlR8AiMIV6/K47ZqK7ILoI
 xBlkQAxI5K4SdFkOux7JPZ31yrIQMTl5xAUkbZxQOFsIPh237mtvGatUIKYcKyi6
 uURrPtm13C1uqBZ+sfiUcjq82YdnLkliTUvUncarjZjUAddncQaSOwxcXZGCxThE
 24JAzBAJfWLn0760SzNDZytcWSU4tpLVI3X+FpujtaF447DfjzJ/NdyBXWXbmOyY
 ubX3pk0UOHVj5kMr7WXK5c+txOI2qIH4aV8cYZpVsFAXp63WjqgfRvRbNhICUA86
 3v2M/qLgAEFR3rYh9i+mm0D74aug0+NwST/QDtf5dYFiH53+BteEpQqd/kcYxt8m
 6bWwKxzUQG0=
 =PTOG
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160924' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Implement slow-start and other bits

This set of patches implements the RxRPC slow-start feature for AF_RXRPC to
improve performance and handling of occasional packet loss.  This is more or
less the same as TCP slow start [RFC 5681].  Firstly, there are some ACK
generation improvements:

 (1) Send ACKs regularly to apprise the peer of our state so that they can do
     congestion management of their own.

 (2) Send an ACK when we fill in a hole in the buffer so that the peer can
     find out that we did this thus forestalling retransmission.

 (3) Note the final DATA packet's serial number in the final ACK for
     correlation purposes.

and a couple of bug fixes:

 (4) Reinitialise the ACK state and clear the ACK and resend timers upon
     entering the client reply reception phase to kill off any pending probe
     ACKs.

 (5) Delay the resend timer to allow for nsec->jiffies conversion errors.

and then there's the slow-start pieces:

 (6) Summarise an ACK.

 (7) Schedule a PING or IDLE ACK if the reply to a client call is overdue to
     try and find out what happened to it.

 (8) Implement the slow start feature.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-25 05:56:05 -04:00
Lance Richardson
c2675de447 gre: use nla_get_be32() to extract flowinfo
Eliminate a sparse endianness mismatch warning, use nla_get_be32() to
extract a __be32 value instead of nla_get_u32().

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 20:07:44 -04:00
David Howells
57494343cb rxrpc: Implement slow-start
Implement RxRPC slow-start, which is similar to RFC 5681 for TCP.  A
tracepoint is added to log the state of the congestion management algorithm
and the decisions it makes.

Notes:

 (1) Since we send fixed-size DATA packets (apart from the final packet in
     each phase), counters and calculations are in terms of packets rather
     than bytes.

 (2) The ACK packet carries the equivalent of TCP SACK.

 (3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
     suited to SACK of a small number of packets.  It seems that, almost
     inevitably, by the time three 'duplicate' ACKs have been seen, we have
     narrowed the loss down to one or two missing packets, and the
     FLIGHT_SIZE calculation ends up as 2.

 (4) In rxrpc_resend(), if there was no data that apparently needed
     retransmission, we transmit a PING ACK to ask the peer to tell us what
     its Rx window state is.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
0d967960d3 rxrpc: Schedule an ACK if the reply to a client call appears overdue
If we've sent all the request data in a client call but haven't seen any
sign of the reply data yet, schedule an ACK to be sent to the server to
find out if the reply data got lost.

If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to
demand a response to find out whether we need to retransmit.

If the server says it has received all of the data, we send an IDLE ACK to
tell the server that we haven't received anything in the receive phase as
yet.

To make this work, a non-immediate PING ACK must carry a delay.  I've chosen
the same as the IDLE ACK for the moment.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
31a1b98950 rxrpc: Generate a summary of the ACK state for later use
Generate a summary of the Tx buffer packet state when an ACK is received
for use in a later patch that does congestion management.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
df0562a72d rxrpc: Delay the resend timer to allow for nsec->jiffies conv error
When determining the resend timer value, we have a value in nsec but the
timer is in jiffies which may be a million or more times more coarse.
nsecs_to_jiffies() rounds down - which means that the resend timeout
expressed as jiffies is very likely earlier than the one expressed as
nanoseconds from which it was derived.

The problem is that rxrpc_resend() gets triggered by the timer, but can't
then find anything to resend yet.  It sets the timer again - but gets
kicked off immediately again and again until the nanosecond-based expiry
time is reached and we actually retransmit.

Fix this by adding 1 to the jiffies-based resend_at value to counteract the
rounding and make sure that the timer happens after the nanosecond-based
expiry is passed.

Alternatives would be to adjust the timestamp on the packets to align
with the jiffie scale or to switch back to using jiffie-timestamps.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
dd7c1ee59a rxrpc: Reinitialise the call ACK and timer state for client reply phase
Clear the ACK reason, ACK timer and resend timer when entering the client
reply phase when the first DATA packet is received.  New ACKs will be
proposed once the data is queued.

The resend timer is no longer relevant and we need to cancel ACKs scheduled
to probe for a lost reply.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
b69d94d799 rxrpc: Include the last reply DATA serial number in the final ACK
In a client call, include the serial number of the last DATA packet of the
reply in the final ACK.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
a7056c5ba6 rxrpc: Send an immediate ACK if we fill in a hole
Send an immediate ACK if we fill in a hole in the buffer left by an
out-of-sequence packet.  This may allow the congestion management in the peer
to avoid a retransmission if packets got reordered on the wire.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
Aaron Conole
d4bb5caa9c netfilter: Only allow sane values in nf_register_net_hook
This commit adds an upfront check for sane values to be passed when
registering a netfilter hook.  This will be used in a future patch for a
simplified hook list traversal.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:30:19 +02:00
Aaron Conole
e2361cb90a netfilter: Remove explicit rcu_read_lock in nf_hook_slow
All of the callers of nf_hook_slow already hold the rcu_read_lock, so this
cleanup removes the recursive call.  This is just a cleanup, as the locking
code gracefully handles this situation.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:29:53 +02:00
Aaron Conole
2c1e2703ff netfilter: call nf_hook_ingress with rcu_read_lock
This commit ensures that the rcu read-side lock is held while the
ingress hook is called.  This ensures that a call to nf_hook_slow (and
ultimately nf_ingress) will be read protected.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:25:49 +02:00
Florian Westphal
c5136b15ea netfilter: bridge: add and use br_nf_hook_thresh
This replaces the last uses of NF_HOOK_THRESH().
Followup patch will remove it and rename nf_hook_thresh.

The reason is that inet (non-bridge) netfilter no longer invokes the
hooks from hooks, so we do no longer need the thresh value to skip hooks
with a lower priority.

The bridge netfilter however may need to do this. br_nf_hook_thresh is a
wrapper that is supposed to do this, i.e. only call hooks with a
priority that exceeds NF_BR_PRI_BRNF.

It's used only in the recursion cases of br_netfilter.  It invokes
nf_hook_slow while holding an rcu read-side critical section to make a
future cleanup simpler.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:25:48 +02:00
Gao Feng
50f4c7b73f netfilter: xt_TCPMSS: Refactor the codes to decrease one condition check and more readable
The origin codes perform two condition checks with dst_mtu(skb_dst(skb))
and in_mtu. And the last statement is "min(dst_mtu(skb_dst(skb)),
in_mtu) - minlen". It may let reader think about how about the result.
Would it be negative.

Now assign the result of min(dst_mtu(skb_dst(skb)), in_mtu) to a new
variable, then only perform one condition check, and it is more readable.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24 21:13:21 +02:00
David Howells
805b21b929 rxrpc: Send an ACK after every few DATA packets we receive
Send an ACK if we haven't sent one for the last two packets we've received.
This keeps the other end apprised of where we've got to - which is
important if they're doing slow-start.

We do this in recvmsg so that we can dispatch a packet directly without the
need to wake up the background thread.

This should possibly be made configurable in future.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 18:05:26 +01:00
Lance Richardson
db32e4e49c ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()
Similar to commit 3be07244b733 ("ip6_gre: fix flowi6_proto value in
xmit path"), set flowi6_proto to IPPROTO_GRE for output route lookup.

Up until now, ip6gre_xmit_other() has set flowi6_proto to a bogus value.
This affected output route lookup for packets sent on an ip6gretap device
in cases where routing was dependent on the value of flowi6_proto.

Since the correct proto is already set in the tunnel flowi6 template via
commit 252f3f5a1189 ("ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit
path."), simply delete the line setting the incorrect flowi6_proto value.

Suggested-by: Jiri Benc <jbenc@redhat.com>
Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 09:44:31 -04:00
David S. Miller
2a9aa41fd2 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+VA9/Sw1s6N8H32AQJufQ//bP1Aq6Ej+3km0o+S1w2lEl7kc1Gnown+
 PLKecg73rtMLEXDCJPKrh/OjGBX1GFaPC5HXWF/4TRav3/qjsi2/P7MSIYbt8ZAk
 0kVcdpIU1RWrPFt0tdhxTDSgPMhRWaKJTFV0CwQgnCqeyrVX44syx42B941Wsgcs
 N/6ANRr1qFLTwT1LKD0EhRWxnds1YI9WPgF5huaAn5RHpymCynwxsxyTDdU+NL0j
 YE75o5rjUzAZzMk5xrTjO8yd7xjUfjO7xT/CzjjVfXB4TOBmYExfOrLNZX0vOPPr
 LBOj3LTLwklWtx5/RJGe3A4y1+yfiUYGTu90ArARHUxP6AMwnxztvjuy2MQvBOgP
 xTlHcEOPoHrxxfIGdJH/0NRl1zgn6IF2lFOH3EgRcw8hzMdFp46AXV74ckst1vOQ
 LsqTbhndG2vkVFV8uZCZO7om0HRbpxbvy6+hEakdZ6k9kJA3amezS6Q8wiXxGhAE
 0wBns3XI/75Zd7QyXRkcHl3iUqS5uW0dKmAqitIlgf3CUsr4cZFlHO8IDD3rkTFD
 WbnvFhaL9Ee5IQjbsHgyzqPSI7sUJUYG1xBAz8wIWnjD0bGu5b/QzgdBSCVIFJ3e
 bNvpe1B6HEHj0EyIj1wK2H8311Y4lPNeJlOj7BDUqqRIZW8vviG26MTGML6GnJwU
 L35wa/s4ebo=
 =+8HJ
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160923' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Bug fixes and tracepoints

Here are a bunch of bug fixes:

 (1) Need to set the timestamp on a Tx packet before queueing it to avoid
     trouble with the retransmission function.

 (2) Don't send an ACK at the end of the service reply transmission; it's
     the responsibility of the client to send an ACK to close the call.
     The service can resend the last DATA packet or send a PING ACK.

 (3) Wake sendmsg() on abnormal call termination.

 (4) Use ktime_add_ms() not ktime_add_ns() to add millisecond offsets.

 (5) Use before_eq() & co. to compare serial numbers (which may wrap).

 (6) Start the resend timer on DATA packet transmission.

 (7) Don't accidentally cancel a retransmission upon receiving a NACK.

 (8) Fix the call timer setting function to deal with timeouts that are now
     or past.

 (9) Don't use a flag to communicate the presence of the last packet in the
     Tx buffer from sendmsg to the input routines where ACK and DATA
     reception is handled.  The problem is that there's a window between
     queueing the last packet for transmission and setting the flag in
     which ACKs or reply DATA packets can arrive, causing apparent state
     machine violation issues.

     Instead use the annotation buffer to mark the last packet and pick up
     and set the flag in the input routines.

(10) Don't call the tx_ack tracepoint and don't allocate a serial number if
     someone else nicked the ACK we were about to transmit.

There are also new tracepoints and one altered tracepoint used to track
down the above bugs:

(11) Call timer tracepoint.

(12) Data Tx tracepoint (and adjustments to ACK tracepoint).

(13) Injected Rx packet loss tracepoint.

(14) Ack proposal tracepoint.

(15) Retransmission selection tracepoint.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:24:19 -04:00
David S. Miller
1678c1134f Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2016-09-23

Only two patches this time:

1) Fix a comment reference to struct xfrm_replay_state_esn.
   From Richard Guy Briggs.

2) Convert xfrm_state_lookup to rcu, we don't need the
   xfrm_state_lock anymore in the input path.
   From Florian Westphal.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:18:19 -04:00
Moshe Shemesh
79aab093a0 net: Update API for VF vlan protocol 802.1ad support
Introduce new rtnl UAPI that exposes a list of vlans per VF, giving
the ability for user-space application to specify it for the VF, as an
option to support 802.1ad.
We adjusted IP Link tool to support this option.

For future use cases, the new UAPI supports multiple vlans. For now we
limit the list size to a single vlan in kernel.
Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward
compatibility with older versions of IP Link tool.

Add a vlan protocol parameter to the ndo_set_vf_vlan callback.
We kept 802.1Q as the drivers' default vlan protocol.
Suitable ip link tool command examples:
  Set vf vlan protocol 802.1ad:
    ip link set eth0 vf 1 vlan 100 proto 802.1ad
  Set vf to VST (802.1Q) mode:
    ip link set eth0 vf 1 vlan 100 proto 802.1Q
  Or by omitting the new parameter
    ip link set eth0 vf 1 vlan 100

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:01:26 -04:00
Arnd Bergmann
2ed6afdee7 netns: move {inc,dec}_net_namespaces into #ifdef
With the newly enforced limit on the number of namespaces,
we get a build warning if CONFIG_NETNS is disabled:

net/core/net_namespace.c:273:13: error: 'dec_net_namespaces' defined but not used [-Werror=unused-function]
net/core/net_namespace.c:268:24: error: 'inc_net_namespaces' defined but not used [-Werror=unused-function]

This moves the two added functions inside the #ifdef that guards
their callers.

Fixes: 703286608a22 ("netns: Add a limit on the number of net namespaces")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-23 13:39:43 -05:00
Christoph Hellwig
ed082d36a7 IB/core: add support to create a unsafe global rkey to ib_create_pd
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
less unchecked, this moves the capability of creating a global rkey into
the RDMA core, where it can be easily audited.  It also prints a warning
everytime this feature is used as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-09-23 13:47:44 -04:00