36488 Commits

Author SHA1 Message Date
Richard Alpe
9ab154658a tipc: convert legacy nl bearer enable/disable to nl compat
Introduce a framework for transcoding legacy nl action into actions
(.doit) calls from the new nl API. This is done by converting the
incoming TLV data into netlink data with nested netlink attributes.
Unfortunately due to the randomness of the legacy API we can't do this
generically so each legacy netlink command requires a specific
transcoding recipe. In this case for bearer enable and bearer disable.

Convert TIPC_CMD_ENABLE_BEARER and TIPC_CMD_DISABLE_BEARER into doit
compat calls.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09 13:20:47 -08:00
Richard Alpe
d0796d1ef6 tipc: convert legacy nl bearer dump to nl compat
Introduce a framework for dumping netlink data from the new netlink
API and formatting it to the old legacy API format. This is done by
looping the dump data and calling a format handler for each entity, in
this case a bearer.

We dump until either all data is dumped or we reach the limited buffer
size of the legacy API. Remember, the legacy API doesn't scale.

In this commit we convert TIPC_CMD_GET_BEARER_NAMES to use the compat
layer.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09 13:20:47 -08:00
Richard Alpe
bfb3e5dd8d tipc: move and rename the legacy nl api to "nl compat"
The new netlink API is no longer "v2" but rather the standard API and
the legacy API is now "nl compat". We split them into separate
start/stop and put them in different files in order to further
distinguish them.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09 13:20:47 -08:00
Trond Myklebust
54c0987492 SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG
Now that the linger code is gone, the xs_tcp_fin_timeout variable has
no real function. Keep it for now, since it is part of the /proc
interface, but only define it if that /proc interface is enabled.

Suggested-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09 11:27:45 -05:00
Trond Myklebust
b70ae915e4 SUNRPC: Handle connection reset more efficiently.
If the connection reset is due to an active call on our side, then
the state change is sometimes not reported. Catch those instances
using xs_error_report() instead.
Also remove the xs_tcp_shutdown() call in xs_tcp_send_request() as
the change in behaviour makes it redundant.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09 11:27:42 -05:00
Trond Myklebust
9e2b9f3776 SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09 11:26:06 -05:00
Trond Myklebust
caf4ccd4e8 SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release
Use of socket shutdown() means that we monitor the shutdown process
through the xs_tcp_state_change() callback, so it is preferable to
a full close in all cases unless we're destroying the transport.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09 11:20:44 -05:00
Trond Myklebust
0efeac261c SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection
The previous behaviour left the connection half-open in order to try
to scrape the last replies from the socket. Now that we have more reliable
reconnection, change the behaviour to close down the socket faster.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09 09:31:11 -05:00
Trond Myklebust
505936f59f SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09 09:20:40 -05:00
Trond Myklebust
9cbc94fb06 SUNRPC: Remove TCP socket linger code
Now that we no longer use the partial shutdown code when closing the
socket, we no longer need to worry about the TCP linger2 state.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-09 09:20:40 -05:00
Julian Anastasov
cd67cd5eb2 ipvs: use 64-bit rates in stats
IPVS stats are limited to 2^(32-10) conns/s and packets/s,
2^(32-5) bytes/s. It is time to use 64 bits:

* Change all conn/packet kernel counters to 64-bit and update
them in u64_stats_update_{begin,end} section

* In kernel use struct ip_vs_kstats instead of the user-space
struct ip_vs_stats_user and use new func ip_vs_export_stats_user
to export it to sockopt users to preserve compatibility with
32-bit values

* Rename cpu counters "ustats" to "cnt"

* To netlink users provide additionally 64-bit stats:
IPVS_SVC_ATTR_STATS64 and IPVS_DEST_ATTR_STATS64. Old stats
remain for old binaries.

* We can use ip_vs_copy_stats in ip_vs_stats_percpu_show

Thanks to Chris Caputo for providing initial patch for ip_vs_est.c

Signed-off-by: Chris Caputo <ccaputo@alt.net>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-02-09 16:59:03 +09:00
Eric Dumazet
93c1af6ca9 net:rfs: adjust table size checking
Make sure root user does not try something stupid.

Also make sure mask field in struct rps_sock_flow_table
does not share a cache line with the potentially often dirtied
flow table.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 567e4b79731c ("net: rfs: add hash collision detection")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 21:54:09 -08:00
Trond Myklebust
4efdd92c92 SUNRPC: Remove TCP client connection reset hack
Instead we rely on SO_REUSEPORT to provide the reconnection semantics
that we need for NFSv2/v3.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08 21:47:30 -05:00
Trond Myklebust
de84d89030 SUNRPC: TCP/UDP always close the old socket before reconnecting
It is not safe to call xs_reset_transport() from inside xs_udp_setup_socket()
or xs_tcp_setup_socket(), since they do not own the correct locks. Instead,
do it in xs_connect().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08 21:47:30 -05:00
Trond Myklebust
718ba5b873 SUNRPC: Add helpers to prevent socket create from racing
The socket lock is currently held by the task that is requesting the
connection be established. While that is efficient in the case where
the connection happens quickly, it is racy in the case where it doesn't.
What we really want is for the connect helper to be able to block access
to the socket while it is being set up.

This patch does so by arranging to transfer the socket lock from the
task that is requesting the connect attempt, and then releasing that
lock once everything is done.
This scheme also gives us automatic protection against collisions with
the RPC close code, so we can kill the cancel_delayed_work_sync()
call in xs_close().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08 21:47:29 -05:00
Trond Myklebust
6cc7e90836 SUNRPC: Ensure xs_reset_transport() resets the close connection flags
Otherwise, we may end up looping.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08 21:47:28 -05:00
Trond Myklebust
76698b2358 SUNRPC: Do not clear the source port in xs_reset_transport
Now that we can reuse bound ports after a close, we never really want to
clear the transport's source port after it has been set. Doing so really
messes up the NFSv3 DRC on the server.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08 21:47:28 -05:00
Trond Myklebust
3913c78c3a SUNRPC: Handle EADDRINUSE on connect
Now that we're setting SO_REUSEPORT, we still need to handle the
case where a connect() is attempted, but the old socket is still
lingering.
Essentially, all we want to do here is handle the error by waiting
a few seconds and then retrying.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08 21:47:27 -05:00
Eric Dumazet
567e4b7973 net: rfs: add hash collision detection
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.

Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)

This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.

I use a 32bit value, and dynamically split it in two parts.

For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.

Since hash bucket selection use low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.

If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).

This means that a packet for an non connected flow can avoid the
IPI through a unrelated/victim CPU.

This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 16:53:57 -08:00
Sabrina Dubroca
3e97fa7059 gre/ipip: use be16 variants of netlink functions
encap.sport and encap.dport are __be16, use nla_{get,put}_be16 instead
of nla_{get,put}_u16.

Fixes the sparse warnings:

warning: incorrect type in assignment (different base types)
   expected restricted __be32 [addressable] [usertype] o_key
   got restricted __be16 [addressable] [usertype] i_flags
warning: incorrect type in assignment (different base types)
   expected restricted __be16 [usertype] sport
   got unsigned short
warning: incorrect type in assignment (different base types)
   expected restricted __be16 [usertype] dport
   got unsigned short
warning: incorrect type in argument 3 (different base types)
   expected unsigned short [unsigned] [usertype] value
   got restricted __be16 [usertype] sport
warning: incorrect type in argument 3 (different base types)
   expected unsigned short [unsigned] [usertype] value
   got restricted __be16 [usertype] dport

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 16:28:06 -08:00
Trond Myklebust
4dda9c8a5e SUNRPC: Set SO_REUSEPORT socket option for TCP connections
When using TCP, we need the ability to reuse port numbers after
a disconnection, so that the NFSv3 server knows that we're the same
client. Currently we use a hack to work around the TCP socket's
TIME_WAIT: we send an RST instead of closing, which doesn't
always work...
The SO_REUSEPORT option added in Linux 3.9 allows us to bind multiple
TCP connections to the same source address+port combination, and thus
to use ordinary TCP close() instead of the current hack.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-08 18:52:11 -05:00
Jon Paul Maloy
51a00daf73 tipc: fix bug in socket reception function
In commit c637c1035534867b85b78b453c38c495b58e2c5a ("tipc: resolve race
problem at unicast message reception") we introduced a time limit
for how long the function tipc_sk_eneque() would be allowed to execute
its loop. Unfortunately, the test for when this limit is passed was put
in the wrong place, resulting in a lost message when the test is true.

We fix this by moving the test to before we dequeue the next buffer
from the input queue.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 13:09:25 -08:00
Michael Büsch
662f5533c4 rt6_probe_deferred: Do not depend on struct ordering
rt6_probe allocates a struct __rt6_probe_work and schedules a work handler rt6_probe_deferred.
But rt6_probe_deferred kfree's the struct work_struct instead of struct __rt6_probe_work.
This works, because struct work_struct is the first element of struct __rt6_probe_work.

Change it to kfree struct __rt6_probe_work to not implicitly depend on
struct work_struct being the first element.

This does not affect the generated code.

Signed-off-by: Michael Buesch <m@bues.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 13:00:43 -08:00
Trond Myklebust
bc3203cdca NFS: RDMA Client Sparse Fixes
This patch fixes a sparse warning in the initial submission.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJU09WdAAoJENfLVL+wpUDr2HMP/2rvjkncrlbumSUlkMMUPenZ
 PUM6dDAnvvhkRj/YrhmJ94v9SPb4BsR0ay4apn5aqmoEHqEy/LfAsoy1TlcspRO6
 DwIKD8wHQ9RkpIVaEt1IGhApWlWK3R5FqoDTSMW9VLq72QuCsnbX3EXWN9B+hgT7
 UVmcsGe0/5zd9v8RXzWJjyabOsx7rx7gNuuAyULO3RudNMzEpldMf7oHq0RwndOt
 akxSSvRSfMFyfJscfI3F2IJsbXBST+ZzzJ0oRXXWn0RfiWR+Z438Nk9HDuOfGBFv
 LACoELeJxtBEFAjBv6GnBILdSxRi6dTZPpAEzvNR45SJAk4c1gGDqgUbyIYXqeYM
 k8alEMCS+eKvDe7DL6N1Nlxq6oc5pAt7BYYpRzCz+2dd7gfHIOwurBnZwFgxN/rK
 GhtXCaBqqVr4lmu/lbgWjVFN/Jt6TGZ6+FqQhDM70CmJGBWbOEJNE6H50R7SiFpv
 7UapXg9iknCpjexkZS3cyAB0Qu31qvIK64ZzsPhVafmSYh9ed5hlT2WSwphfDki5
 ZQOktuPFtVkOI55Rz+2EpNNYe9nRuL5wBqD4Na3uUiEWgbnDLRmx+sDyzcfPujTN
 trFTwUO082SFK0yZVusZ04vlqjfzxSd4aYSeC5Zzgt0CYJCZd7z0tFrdLpyO/fDn
 PLfvcJyFRBBbaPgiwBHg
 =3fRY
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-3.20-part-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: RDMA Client Sparse Fixes

This patch fixes a sparse warning in the initial submission.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

* tag 'nfs-rdma-for-3.20-part-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
  xprtrdma: Address sparse complaint in rpcr_to_rdmar()
2015-02-08 10:37:34 -05:00
Neal Cardwell
4fb17a6091 tcp: mitigate ACK loops for connections as tcp_timewait_sock
Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is
represented by a tcp_timewait_sock, we rate limit dupacks in response
to incoming packets (a) with TCP timestamps that fail PAWS checks, or
(b) with sequence numbers that are out of the acceptable window.

We do not send a dupack in response to out-of-window packets if it has
been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
last sent a dupack in response to an out-of-window packet.

Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 01:03:13 -08:00
Neal Cardwell
f2b2c582e8 tcp: mitigate ACK loops for connections as tcp_sock
Ensure that in state ESTABLISHED, where the connection is represented
by a tcp_sock, we rate limit dupacks in response to incoming packets
(a) with TCP timestamps that fail PAWS checks, or (b) with sequence
numbers or ACK numbers that are out of the acceptable window.

We do not send a dupack in response to out-of-window packets if it has
been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
last sent a dupack in response to an out-of-window packet.

There is already a similar (although global) rate-limiting mechanism
for "challenge ACKs". When deciding whether to send a challence ACK,
we first consult the new per-connection rate limit, and then the
global rate limit.

Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 01:03:12 -08:00
Neal Cardwell
a9b2c06dbe tcp: mitigate ACK loops for connections as tcp_request_sock
In the SYN_RECV state, where the TCP connection is represented by
tcp_request_sock, we now rate-limit SYNACKs in response to a client's
retransmitted SYNs: we do not send a SYNACK in response to client SYN
if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
since we last sent a SYNACK in response to a client's retransmitted
SYN.

This allows the vast majority of legitimate client connections to
proceed unimpeded, even for the most aggressive platforms, iOS and
MacOS, which actually retransmit SYNs 1-second intervals for several
times in a row. They use SYN RTO timeouts following the progression:
1,1,1,1,1,2,4,8,16,32.

Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 01:03:12 -08:00
Neal Cardwell
032ee42369 tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks
Helpers for mitigating ACK loops by rate-limiting dupacks sent in
response to incoming out-of-window packets.

This patch includes:

- rate-limiting logic
- sysctl to control how often we allow dupacks to out-of-window packets
- SNMP counter for cases where we rate-limited our dupack sending

The rate-limiting logic in this patch decides to not send dupacks in
response to out-of-window segments if (a) they are SYNs or pure ACKs
and (b) the remote endpoint is sending them faster than the configured
rate limit.

We rate-limit our responses rather than blocking them entirely or
resetting the connection, because legitimate connections can rely on
dupacks in response to some out-of-window segments. For example, zero
window probes are typically sent with a sequence number that is below
the current window, and ZWPs thus expect to thus elicit a dupack in
response.

We allow dupacks in response to TCP segments with data, because these
may be spurious retransmissions for which the remote endpoint wants to
receive DSACKs. This is safe because segments with data can't
realistically be part of ACK loops, which by their nature consist of
each side sending pure/data-less ACKs to each other.

The dupack interval is controlled by a new sysctl knob,
tcp_invalid_ratelimit, given in milliseconds, in case an administrator
needs to dial this upward in the face of a high-rate DoS attack. The
name and units are chosen to be analogous to the existing analogous
knob for ICMP, icmp_ratelimit.

The default value for tcp_invalid_ratelimit is 500ms, which allows at
most one such dupack per 500ms. This is chosen to be 2x faster than
the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
2.4). We allow the extra 2x factor because network delay variations
can cause packets sent at 1 second intervals to be compressed and
arrive much closer.

Reported-by: Avery Fay <avery@mixpanel.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 01:03:12 -08:00
Pravin B Shelar
ca539345f8 openvswitch: Initialize unmasked key and uid len
Flow alloc needs to initialize unmasked key pointer. Otherwise
it can crash kernel trying to free random unmasked-key pointer.

general protection fault: 0000 [#1] SMP
3.19.0-rc6-net-next+ #457
Hardware name: Supermicro X7DWU/X7DWU, BIOS  1.1 04/30/2008
RIP: 0010:[<ffffffff8111df0e>] [<ffffffff8111df0e>] kfree+0xac/0x196
Call Trace:
 [<ffffffffa060bd87>] flow_free+0x21/0x59 [openvswitch]
 [<ffffffffa060bde0>] ovs_flow_free+0x21/0x23 [openvswitch]
 [<ffffffffa0605b4a>] ovs_packet_cmd_execute+0x2f3/0x35f [openvswitch]
 [<ffffffffa0605995>] ? ovs_packet_cmd_execute+0x13e/0x35f [openvswitch]
 [<ffffffff811fe6fb>] ? nla_parse+0x4f/0xec
 [<ffffffff8139a2fc>] genl_family_rcv_msg+0x26d/0x2c9
 [<ffffffff8107620f>] ? __lock_acquire+0x90e/0x9aa
 [<ffffffff8139a3be>] genl_rcv_msg+0x66/0x89
 [<ffffffff8139a358>] ? genl_family_rcv_msg+0x2c9/0x2c9
 [<ffffffff81399591>] netlink_rcv_skb+0x3e/0x95
 [<ffffffff81399898>] ? genl_rcv+0x18/0x37
 [<ffffffff813998a7>] genl_rcv+0x27/0x37
 [<ffffffff81399033>] netlink_unicast+0x103/0x191
 [<ffffffff81399382>] netlink_sendmsg+0x2c1/0x310
 [<ffffffff811007ad>] ? might_fault+0x50/0xa0
 [<ffffffff8135c773>] do_sock_sendmsg+0x5f/0x7a
 [<ffffffff8135c799>] sock_sendmsg+0xb/0xd
 [<ffffffff8135cacf>] ___sys_sendmsg+0x1a3/0x218
 [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86
 [<ffffffff8115a9d0>] ? fsnotify+0x32c/0x348
 [<ffffffff8115a720>] ? fsnotify+0x7c/0x348
 [<ffffffff8113e5f5>] ? __fget+0xaa/0xbf
 [<ffffffff8113e54b>] ? get_close_on_exec+0x86/0x86
 [<ffffffff8135cccd>] __sys_sendmsg+0x3d/0x5e
 [<ffffffff8135cd02>] SyS_sendmsg+0x14/0x16
 [<ffffffff81411852>] system_call_fastpath+0x12/0x17

Fixes: 74ed7ab9264("openvswitch: Add support for unique flow IDs.")
CC: Joe Stringer <joestringer@nicira.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 00:51:14 -08:00
Roopa Prabhu
1fd0bddb61 bridge: add missing bridge port check for offloads
This patch fixes a missing bridge port check caught by smatch.

setlink/dellink of attributes like vlans can come for a bridge device
and there is no need to offload those today. So, this patch adds a bridge
port check. (In these cases however, the BRIDGE_SELF flags will always be set
and we may not hit a problem with the current code).

smatch complaint:

The patch 68e331c785b8: "bridge: offload bridge port attributes to
switch asic if feature flag set" from Jan 29, 2015, leads to the
following Smatch complaint:

net/bridge/br_netlink.c:552 br_setlink()
	 error: we previously assumed 'p' could be null (see line 518)

net/bridge/br_netlink.c
   517
   518		if (p && protinfo) {
                    ^
Check for NULL.

Reported-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07 22:49:47 -08:00
Eric Dumazet
91e83133e7 net: use netif_rx_ni() from process context
Hotpluging a cpu might be rare, yet we have to use proper
handlers when taking over packets found in backlog queues.

dev_cpu_callback() runs from process context, thus we should
call netif_rx_ni() to properly invoke softirq handler.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07 22:43:52 -08:00
Sowmini Varadhan
d0a47d3272 rds: Make rds_message_copy_from_user() return 0 on success.
Commit 083735f4b01b ("rds: switch rds_message_copy_from_user() to iov_iter")
breaks rds_message_copy_from_user() semantics on success, and causes it
to return nbytes copied, when it should return 0.  This commit fixes that bug.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07 22:41:56 -08:00
Rasmus Villemoes
11ac11999b net: rds: Remove repeated function names from debug output
The macro rdsdebug is defined as

  pr_debug("%s(): " fmt, __func__ , ##args)

Hence it doesn't make sense to include the name of the calling
function explicitly in the format string passed to rdsdebug.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07 22:41:56 -08:00
Jarno Rajahalme
83d2b9ba1a net: openvswitch: Support masked set actions.
OVS userspace already probes the openvswitch kernel module for
OVS_ACTION_ATTR_SET_MASKED support.  This patch adds the kernel module
implementation of masked set actions.

The existing set action sets many fields at once.  When only a subset
of the IP header fields, for example, should be modified, all the IP
fields need to be exact matched so that the other field values can be
copied to the set action.  A masked set action allows modification of
an arbitrary subset of the supported header bits without requiring the
rest to be matched.

Masked set action is now supported for all writeable key types, except
for the tunnel key.  The set tunnel action is an exception as any
input tunnel info is cleared before action processing starts, so there
is no tunnel info to mask.

The kernel module converts all (non-tunnel) set actions to masked set
actions.  This makes action processing more uniform, and results in
less branching and duplicating the action processing code.  When
returning actions to userspace, the fully masked set actions are
converted back to normal set actions.  We use a kernel internal action
code to be able to tell the userspace provided and converted masked
set actions apart.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07 22:40:17 -08:00
David S. Miller
3c09e92fb6 NFC: 3.20 second pull request
This is the second NFC pull request for 3.20.
 
 It brings:
 
 - NCI NFCEE (NFC Execution Environment, typically an embedded or
   external secure element) discovery and enabling/disabling support.
   In order to communicate with an NFCEE, we also added NCI's logical
   connections support to the NCI stack.
 
 - HCI over NCI protocol support. Some secure elements only understand
   HCI and thus we need to send them HCI frames when they're part of
   an NCI chipset.
 
 - NFC_EVT_TRANSACTION userspace API addition. Whenever an application
   running on a secure element needs to notify its host counterpart,
   we send an NFC_EVENT_SE_TRANSACTION event to userspace through the
   NFC netlink socket.
 
 - Secure element and HCI transaction event support for the st21nfcb
   chipset.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJU06ozAAoJEIqAPN1PVmxKdtkP/2CeQ0+xJnJWgyh4xjkVxfPb
 wPYrz3cHeHdnVX36mnIdRoQZRjxh/Ef/vgK8RsXohcldzHA9D8GjxRpj0GrkVGQ9
 xoDkiHsFDdsZHeezSrlnXNnObvZrkeMsmnUDJxfwLtcX1nzFKzBgW2GaDiNjUazC
 r5Ip4YIfoCaVLq5MSqYRqJ6uhplXd/ePdtPoo54L/E81y4ptQi8pUsQBy4K3GrFn
 FbXoemcymhi9AVBGl+OlppZ93t6bdOV1D5tcOwpytAbfa3jp2oUnJPZay7EoWruR
 aj8l5v3P+SSphAJwaGSbny0L83tTS7sq+N4QG+VxxdMkw2YjpmxgN3AGguNBtPGG
 SUThpZV6QhrkAOnwzP2BJnh68WSU1eJrGvk8zUWW0SanA8k2jaHO/VzXCA3/ZkC4
 IFcXl4Jr6R3iUxDFgS/6MB5E9EGedzFTacsWoHVyZPtaDAQp4WVdAIRQ4QKgJCLw
 1yRmCeVunTsrs6O0aenU1xHinPlnx+tkuiNh4H64APQYTz54WwXH1/S04OL1SkqI
 mTzUvFgWqJvSIOuU19g0HBw+xZ/9vvX8LLkm9kSTbe7tcmFH5CI7ZCN/hNDHvMSd
 sjNYOyag/SAHJ01tTKcSPAgWO49bHWPodI04zP0lK0gMLjKvRknVMdTgIDfu49HN
 2da9E23iBmNNgG79iVHN
 =8caD
 -----END PGP SIGNATURE-----

Merge tag 'nfc-next-3.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next

NFC: 3.20 second pull request

This is the second NFC pull request for 3.20.

It brings:

- NCI NFCEE (NFC Execution Environment, typically an embedded or
  external secure element) discovery and enabling/disabling support.
  In order to communicate with an NFCEE, we also added NCI's logical
  connections support to the NCI stack.

- HCI over NCI protocol support. Some secure elements only understand
  HCI and thus we need to send them HCI frames when they're part of
  an NCI chipset.

- NFC_EVT_TRANSACTION userspace API addition. Whenever an application
  running on a secure element needs to notify its host counterpart,
  we send an NFC_EVENT_SE_TRANSACTION event to userspace through the
  NFC netlink socket.

- Secure element and HCI transaction event support for the st21nfcb
  chipset.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07 22:22:25 -08:00
Daniel Borkmann
364d5716a7 rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY
ifla_vf_policy[] is wrong in advertising its individual member types as
NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
len member as *max* attribute length [0, len].

The issue is that when do_setvfinfo() is being called to set up a VF
through ndo handler, we could set corrupted data if the attribute length
is less than the size of the related structure itself.

The intent is exactly the opposite, namely to make sure to pass at least
data of minimum size of len.

Fixes: ebc08a6f47ee ("rtnetlink: Add VF config code to rtnetlink")
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07 22:13:10 -08:00
Tobias Waldekranz
e04449fcf2 dsa: correctly determine the number of switches in a system
The number of connected switches was sourced from the number of
children to the DSA node, change it to the number of available
children, skipping any disabled switches.

Fixes: 5e95329b701c4 ("dsa: add device tree bindings to register DSA switches")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07 22:07:37 -08:00
Daniel Borkmann
11b1f8288d ipv6: addrconf: add missing validate_link_af handler
We still need a validate_link_af() handler with an appropriate nla policy,
similarly as we have in IPv4 case, otherwise size validations are not being
done properly in that case.

Fixes: f53adae4eae5 ("net: ipv6: add tokenized interface identifier support")
Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes")
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06 12:51:33 -08:00
Jon Paul Maloy
cb1b728096 tipc: eliminate race condition at multicast reception
In a previous commit in this series we resolved a race problem during
unicast message reception.

Here, we resolve the same problem at multicast reception. We apply the
same technique: an input queue serializing the delivery of arriving
buffers. The main difference is that here we do it in two steps.
First, the broadcast link feeds arriving buffers into the tail of an
arrival queue, which head is consumed at the socket level, and where
destination lookup is performed. Second, if the lookup is successful,
the resulting buffer clones are fed into a second queue, the input
queue. This queue is consumed at reception in the socket just like
in the unicast case. Both queues are protected by the same lock, -the
one of the input queue.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 16:00:03 -08:00
Jon Paul Maloy
3c724acdd5 tipc: simplify socket multicast reception
The structure 'tipc_port_list' is used to collect port numbers
representing multicast destination socket on a receiving node.
The list is not based on a standard linked list, and is in reality
optimized for the uncommon case that there are more than one
multicast destinations per node. This makes the list handling
unecessarily complex, and as a consequence, even the socket
multicast reception becomes more complex.

In this commit, we replace 'tipc_port_list' with a new 'struct
tipc_plist', which is based on a standard list. We give the new
list stack (push/pop) semantics, someting that simplifies
the implementation of the function tipc_sk_mcast_rcv().

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 16:00:03 -08:00
Jon Paul Maloy
708ac32cb5 tipc: simplify connection abort notifications when links break
The new input message queue in struct tipc_link can be used for
delivering connection abort messages to subscribing sockets. This
makes it possible to simplify the code for such cases.

This commit removes the temporary list in tipc_node_unlock()
used for transforming abort subscriptions to messages. Instead, the
abort messages are now created at the moment of lost contact, and
then added to the last failed link's generic input queue for delivery
to the sockets concerned.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 16:00:02 -08:00
Jon Paul Maloy
c637c10355 tipc: resolve race problem at unicast message reception
TIPC handles message cardinality and sequencing at the link layer,
before passing messages upwards to the destination sockets. During the
upcall from link to socket no locks are held. It is therefore possible,
and we see it happen occasionally, that messages arriving in different
threads and delivered in sequence still bypass each other before they
reach the destination socket. This must not happen, since it violates
the sequentiality guarantee.

We solve this by adding a new input buffer queue to the link structure.
Arriving messages are added safely to the tail of that queue by the
link, while the head of the queue is consumed, also safely, by the
receiving socket. Sequentiality is secured per socket by only allowing
buffers to be dequeued inside the socket lock. Since there may be multiple
simultaneous readers of the queue, we use a 'filter' parameter to reduce
the risk that they peek the same buffer from the queue, hence also
reducing the risk of contention on the receiving socket locks.

This solves the sequentiality problem, and seems to cause no measurable
performance degradation.

A nice side effect of this change is that lock handling in the functions
tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that
will enable future simplifications of those functions.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 16:00:02 -08:00
Jon Paul Maloy
94153e36e7 tipc: use existing sk_write_queue for outgoing packet chain
The list for outgoing traffic buffers from a socket is currently
allocated on the stack. This forces us to initialize the queue for
each sent message, something costing extra CPU cycles in the most
critical data path. Later in this series we will introduce a new
safe input buffer queue, something that would force us to initialize
even the spinlock of the outgoing queue. A closer analysis reveals
that the queue always is filled and emptied within the same lock_sock()
session. It is therefore safe to use a queue aggregated in the socket
itself for this purpose. Since there already exists a queue for this
in struct sock, sk_write_queue, we introduce use of that queue in
this commit.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 16:00:02 -08:00
Jon Paul Maloy
e3a77561e7 tipc: split up function tipc_msg_eval()
The function tipc_msg_eval() is in reality doing two related, but
different tasks. First it tries to find a new destination for named
messages, in case there was no first lookup, or if the first lookup
failed. Second, it does what its name suggests, evaluating the validity
of the message and its destination, and returning an appropriate error
code depending on the result.

This is confusing, and in this commit we choose to break it up into two
functions. A new function, tipc_msg_lookup_dest(), first attempts to find
a new destination, if the message is of the right type. If this lookup
fails, or if the message should not be subject to a second lookup, the
already existing tipc_msg_reverse() is called. This function performs
prepares the message for rejection, if applicable.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 16:00:02 -08:00
Jon Paul Maloy
d570d86497 tipc: enqueue arrived buffers in socket in separate function
The code for enqueuing arriving buffers in the function tipc_sk_rcv()
contains long code lines and currently goes to two indentation levels.
As a cosmetic preparaton for the next commits, we break it out into
a separate function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 16:00:02 -08:00
Jon Paul Maloy
1186adf7df tipc: simplify message forwarding and rejection in socket layer
Despite recent improvements, the handling of error codes and return
values at reception of messages in the socket layer is still confusing.

In this commit, we try to make it more comprehensible. First, we
separate between the return values coming from the functions called
by tipc_sk_rcv(), -those are TIPC specific error codes, and the
return values returned by tipc_sk_rcv() itself. Second, we don't
use the returned TIPC error code as indication for whether a buffer
should be forwarded/rejected or not; instead we use the buffer pointer
passed along with filter_msg(). This separation is necessary because
we sometimes want to forward messages even when there is no error
(i.e., protocol messages and successfully secondary looked up data
messages).

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 16:00:01 -08:00
Jon Paul Maloy
c5898636c4 tipc: reduce usage of context info in socket and link
The most common usage of namespace information is when we fetch the
own node addess from the net structure. This leads to a lot of
passing around of a parameter of type 'struct net *' between
functions just to make them able to obtain this address.

However, in many cases this is unnecessary. The own node address
is readily available as a member of both struct tipc_sock and
tipc_link, and can be fetched from there instead.
The fact that the vast majority of functions in socket.c and link.c
anyway are maintaining a pointer to their respective base structures
makes this option even more compelling.

In this commit, we introduce the inline functions tsk_own_node()
and link_own_node() to make it easy for functions to fetch the node
address from those structs instead of having to pass along and
dereference the namespace struct.

In particular, we make calls to the msg_xx() functions in msg.{h,c}
context independent by directly passing them the own node address
as parameter when needed. Those functions should be regarded as
leaves in the code dependency tree, and it is hence desirable to
keep them namspace unaware.

Apart from a potential positive effect on cache behavior, these
changes make it easier to introduce the changes that will follow
later in this series.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 16:00:01 -08:00
Sabrina Dubroca
7744b5f369 pktgen: fix UDP checksum computation
This patch fixes two issues in UDP checksum computation in pktgen.

First, the pseudo-header uses the source and destination IP
addresses. Currently, the ports are used for IPv4.

Second, the UDP checksum covers both header and data.  So we need to
generate the data earlier (move pktgen_finalize_skb up), and compute
the checksum for UDP header + data.

Fixes: c26bf4a51308c ("pktgen: Add UDPCSUM flag to support UDP checksums")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 15:41:34 -08:00
Erik Kline
c58da4c659 net: ipv6: allow explicitly choosing optimistic addresses
RFC 4429 ("Optimistic DAD") states that optimistic addresses
should be treated as deprecated addresses.  From section 2.1:

   Unless noted otherwise, components of the IPv6 protocol stack
   should treat addresses in the Optimistic state equivalently to
   those in the Deprecated state, indicating that the address is
   available for use but should not be used if another suitable
   address is available.

Optimistic addresses are indeed avoided when other addresses are
available (i.e. at source address selection time), but they have
not heretofore been available for things like explicit bind() and
sendmsg() with struct in6_pktinfo, etc.

This change makes optimistic addresses treated more like
deprecated addresses than tentative ones.

Signed-off-by: Erik Kline <ek@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 15:37:41 -08:00
Miroslav Urbanek
233c96fc07 flowcache: Fix kernel panic in flow_cache_flush_task
flow_cache_flush_task references a structure member flow_cache_gc_work
where it should reference flow_cache_flush_task instead.

Kernel panic occurs on kernels using IPsec during XFRM garbage
collection. The garbage collection interval can be shortened using the
following sysctl settings:

net.ipv4.xfrm4_gc_thresh=4
net.ipv6.xfrm6_gc_thresh=4

With the default settings, our productions servers crash approximately
once a week. With the settings above, they crash immediately.

Fixes: ca925cf1534e ("flowcache: Make flow cache name space aware")
Reported-by: Tomáš Charvát <tc@excello.cz>
Tested-by: Jan Hejl <jh@excello.cz>
Signed-off-by: Miroslav Urbanek <mu@miroslavurbanek.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 14:38:53 -08:00