42367 Commits

Author SHA1 Message Date
Eric Dumazet
b3d7e2b29b net_sched: sch_codel: defer skb freeing in codel_change()
codel_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.

codel_reset() already has support for deferred skb freeing
because it uses qdisc_reset_queue()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:35 -07:00
Eric Dumazet
f9aed311b6 net_sched: sch_choke: defer skb freeing
choke_reset() and choke_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:34 -07:00
Eric Dumazet
1b5c5493e3 net_sched: add the ability to defer skb freeing
qdisc are changed under RTNL protection and often
while blocking BH and root qdisc spinlock.

When lots of skbs need to be dropped, we free
them under these locks causing TX/RX freezes,
and more generally latency spikes.

This commit adds rtnl_kfree_skbs(), used to queue
skbs for deferred freeing.

Actual freeing happens right after RTNL is released,
with appropriate scheduling points.

rtnl_qdisc_drop() can also be used in place
of disc_drop() when RTNL is held.

qdisc_reset_queue() and __qdisc_reset_queue() get
the new behavior, so standard qdiscs like pfifo, pfifo_fast...
have their ->reset() method automatically handled.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:34 -07:00
Jon Paul Maloy
35c55c9877 tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.

This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.

This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:

- Each node makes up a linearly ascending, circular list of all its N
  known neighbors, based on their TIPC node identity. This algorithm
  must be the same on all nodes.

- The node then selects the next M = sqrt(N) - 1 nodes downstream from
  itself in the list, and chooses to actively monitor those. This is
  called its "local monitoring domain".

- It creates a domain record describing the monitoring domain, and
  piggy-backs this in the data area of all neighbor monitoring messages
  (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
  the cluster eventually (default within 400 ms) will learn about
  its monitoring domain.

- Whenever a node discovers a change in its local domain, e.g., a node
  has been added or has gone down, it creates and sends out a new
  version of its node record to inform all neighbors about the change.

- A node receiving a domain record from anybody outside its local domain
  matches this against its own list (which may not look the same), and
  chooses to not actively monitor those members of the received domain
  record that are also present in its own list. Instead, it relies on
  indications from the direct monitoring nodes if an indirectly
  monitored node has gone up or down. If a node is indicated lost, the
  receiving node temporarily activates its own direct monitoring towards
  that node in order to confirm, or not, that it is actually gone.

- Since each node is actively monitoring sqrt(N) downstream neighbors,
  each node is also actively monitored by the same number of upstream
  neighbors. This means that all non-direct monitoring nodes normally
  will receive sqrt(N) indications that a node is gone.

- A major drawback with ring monitoring is how it handles failures that
  cause massive network partitionings. If both a lost node and all its
  direct monitoring neighbors are inside the lost partition, the nodes in
  the remaining partition will never receive indications about the loss.
  To overcome this, each node also chooses to actively monitor some
  nodes outside its local domain. Those nodes are called remote domain
  "heads", and are selected in such a way that no node in the cluster
  will be more than two direct monitoring hops away. Because of this,
  each node, apart from monitoring the member of its local domain, will
  also typically monitor sqrt(N) remote head nodes.

- As an optimization, local list status, domain status and domain
  records are marked with a generation number. This saves senders from
  unnecessarily conveying  unaltered domain records, and receivers from
  performing unneeded re-adaptations of their node monitoring list, such
  as re-assigning domain heads.

- As a measure of caution we have added the possibility to disable the
  new algorithm through configuration. We do this by keeping a threshold
  value for the cluster size; a cluster that grows beyond this value
  will switch from full-mesh to ring monitoring, and vice versa when
  it shrinks below the value. This means that if the threshold is set to
  a value larger than any anticipated cluster size (default size is 32)
  the new algorithm is effectively disabled. A patch set for altering the
  threshold value and for listing the table contents will follow shortly.

- This change is fully backwards compatible.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:06:28 -07:00
WANG Cong
b2313077ed net_sched: make tcf_hash_check() boolean
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:43:35 -07:00
David Ahern
9ff7438460 net: vrf: Handle ipv6 multicast and link-local addresses
IPv6 multicast and link-local addresses require special handling by the
VRF driver:
1. Rather than using the VRF device index and full FIB lookups,
   packets to/from these addresses should use direct FIB lookups based on
   the VRF device table.

2. fail sends/receives on a VRF device to/from a multicast address
   (e.g, make ping6 ff02::1%<vrf> fail)

3. move the setting of the flow oif to the first dst lookup and revert
   the change in icmpv6_echo_reply made in ca254490c8dfd ("net: Add VRF
   support to IPv6 stack"). Linklocal/mcast addresses require use of the
   skb->dev.

With this change connections into and out of a VRF enslaved device work
for multicast and link-local addresses work (icmp, tcp, and udp)
e.g.,

1. packets into VM with VRF config:
    ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
    ping6 -c3 ff02::1%br1

    ssh -6 fe80::e0:f9ff:fe1c:b974%br1

2. packets going out a VRF enslaved device:
    ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
    ping6 -c3 ff02::1%eth1
    ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:34:34 -07:00
David Ahern
ba46ee4c0e net: ipv6: Do not add multicast route for l3 master devices
L3 master devices are virtual devices similar to the loopback
device. Link local and multicast routes for these devices do
not make sense. The ipv6 addrconf code already skips adding a
linklocal address; do the same for the mcast route.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:34:34 -07:00
David Ahern
cd2a9e62c8 net: l3mdev: Remove const from flowi6 arg to get_rt6_dst
Allow drivers to pass flow arg to functions where the arg is not const
and allow the driver to make updates as needed (eg., setting oif).

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:34:34 -07:00
Eugene Crosser
a006353a9a af_iucv: use paged SKBs for big inbound messages
When an inbound message is bigger than a page, allocate a paged SKB,
and subsequently use IUCV receive primitive with IPBUFLST flag.
This relaxes the pressure to allocate big contiguous kernel buffers.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:21:05 -07:00
Eugene Crosser
291759a575 af_iucv: remove fragment_skb() to use paged SKBs
Before introducing paged skbs in the receive path, get rid of the
function `iucv_fragment_skb()` that replaces one large linear skb
with several smaller linear skbs.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:21:04 -07:00
Eugene Crosser
e53743994e af_iucv: use paged SKBs for big outbound messages
When an outbound message is bigger than a page, allocate and fill
a paged SKB, and subsequently use IUCV send primitive with IPBUFLST
flag. This relaxes the pressure to allocate big contiguous kernel
buffers.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 12:21:04 -07:00
David Howells
4f95dd78a7 rxrpc: Rework local endpoint management
Rework the local RxRPC endpoint management.

Local endpoint objects are maintained in a flat list as before.  This
should be okay as there shouldn't be more than one per open AF_RXRPC socket
(there can be fewer as local endpoints can be shared if their local service
ID is 0 and they share the same local transport parameters).

Changes:

 (1) Local endpoints may now only be shared if they have local service ID 0
     (ie. they're not being used for listening).

     This prevents a scenario where process A is listening of the Cache
     Manager port and process B contacts a fileserver - which may then
     attempt to send CM requests back to B.  But if A and B are sharing a
     local endpoint, A will get the CM requests meant for B.

 (2) We use a mutex to handle lookups and don't provide RCU-only lookups
     since we only expect to access the list when opening a socket or
     destroying an endpoint.

     The local endpoint object is pointed to by the transport socket's
     sk_user_data for the life of the transport socket - allowing us to
     refer to it directly from the sk_data_ready and sk_error_report
     callbacks.

 (3) atomic_inc_not_zero() now exists and can be used to only share a local
     endpoint if the last reference hasn't yet gone.

 (4) We can remove rxrpc_local_lock - a spinlock that had to be taken with
     BH processing disabled given that we assume sk_user_data won't change
     under us.

 (5) The transport socket is shut down before we clear the sk_user_data
     pointer so that we can be sure that the transport socket's callbacks
     won't be invoked once the RCU destruction is scheduled.

 (6) Local endpoints have a work item that handles both destruction and
     event processing.  The means that destruction doesn't then need to
     wait for event processing.  The event queues can then be cleared after
     the transport socket is shut down.

 (7) Local endpoints are no longer available for resurrection beyond the
     life of the sockets that had them open.  As soon as their last ref
     goes, they are scheduled for destruction and may not have their usage
     count moved from 0.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-15 15:38:17 +01:00
David Howells
875636163b rxrpc: Separate local endpoint event handling out into its own file
Separate local endpoint event handling out into its own file preparatory to
overhauling the object management aspect (which remains in the original
file).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-15 15:37:12 +01:00
David Howells
f66d749019 rxrpc: Use the peer record to distribute network errors
Use the peer record to distribute network errors rather than the transport
object (which I want to get rid of).  An error from a particular peer
terminates all calls on that peer.

For future consideration:

 (1) For ICMP-induced errors it might be worth trying to extract the RxRPC
     header from the offending packet, if one is returned attached to the
     ICMP packet, to better direct the error.

     This may be overkill, though, since an ICMP packet would be expected
     to be relating to the destination port, machine or network.  RxRPC
     ABORT and BUSY packets give notice at RxRPC level.

 (2) To also abort connection-level communications (such as CHALLENGE
     packets) where indicted by an error - but that requires some revamping
     of the connection event handling first.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-15 10:15:16 +01:00
David Howells
fe77d5fc5a rxrpc: Do a little bit of tidying in the ICMP processing
Do a little bit of tidying in the ICMP processing code.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-15 10:15:09 +01:00
David Howells
1c1df86fad rxrpc: Don't assume anything about the address in an ICMP packet
Don't assume anything about the address in an ICMP packet in
rxrpc_error_report() as the address may not be IPv4 in future, especially
since we're just printing these details.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-15 10:15:08 +01:00
David Howells
1a70c05bad rxrpc: Break MTU determination from ICMP into its own function
Break MTU determination from ICMP out into its own function to reduce the
complexity of the error report handler.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-15 10:15:06 +01:00
David Howells
abe89ef0ed rxrpc: Rename rxrpc_UDP_error_report() to rxrpc_error_report()
Rename rxrpc_UDP_error_report() to rxrpc_error_report() as it might get
called for something other than UDP.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-15 10:14:37 +01:00
David Howells
be6e6707f6 rxrpc: Rework peer object handling to use hash table and RCU
Rework peer object handling to use a hash table instead of a flat list and
to use RCU.  Peer objects are no longer destroyed by passing them to a
workqueue to process, but rather are just passed to the RCU garbage
collector as kfree'able objects.

The hash function uses the local endpoint plus all the components of the
remote address, except for the RxRPC service ID.  Peers thus represent a
UDP port on the remote machine as contacted by a UDP port on this machine.

The RCU read lock is used to handle non-creating lookups so that they can
be called from bottom half context in the sk_error_report handler without
having to lock the hash table against modification.
rxrpc_lookup_peer_rcu() *does* take a reference on the peer object as in
the future, this will be passed to a work item for error distribution in
the error_report path and this function will cease being used in the
data_ready path.

Creating lookups are done under spinlock rather than mutex as they might be
set up due to an external stimulus if the local endpoint is a server.

Captured network error messages (ICMP) are handled with respect to this
struct and MTU size and RTT are cached here.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-15 10:12:33 +01:00
WANG Cong
d9fa17ef9f act_police: rename tcf_act_police_locate() to tcf_act_police_init()
This function is just ->init(), rename it to make it obvious.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 00:05:57 -07:00
WANG Cong
95df1b1607 net_sched: remove internal use of TC_POLICE_*
These should be gone when we removed CONFIG_NET_CLS_POLICE.
We can not totally remove them since they are exposed
to userspace.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 00:05:57 -07:00
Sowmini Varadhan
3ecc5693c0 RDS: Update rds_conn_destroy to be MP capable
Refactor rds_conn_destroy() so that the per-path dismantling
is done in rds_conn_path_destroy, and then iterate as needed
over rds_conn_path_destroy().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:44 -07:00
Sowmini Varadhan
d769ef81d5 RDS: Update rds_conn_shutdown to work with rds_conn_path
This commit changes rds_conn_shutdown to take a rds_conn_path *
argument, allowing it to shutdown paths other than c_path[0] for
MP-capable transports.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:44 -07:00
Sowmini Varadhan
1c5113cf79 RDS: Initialize all RDS_MPATH_WORKERS in __rds_conn_create
Add a for() loop in __rds_conn_create to initialize all the
conn_paths, in preparate for MP capable transports.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:44 -07:00
Sowmini Varadhan
fb1b3dc43d RDS: Add rds_conn_path_error()
rds_conn_path_error() is the MP-aware analog of rds_conn_error,
to be used by multipath-capable callers.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:43 -07:00
Sowmini Varadhan
992c9ec5fe RDS: update rds-info related functions to traverse multiple conn_paths
This commit updates the callbacks related to the rds-info command
so that they walk through all the rds_conn_path structures and
report the requested info.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:43 -07:00
Sowmini Varadhan
3c0a59001a RDS: Add rds_conn_path_connect_if_down() for MP-aware callers
rds_conn_path_connect_if_down() works on the rds_conn_path
that it is passed. Callers who are not t_m_capable may continue
calling rds_conn_connect_if_down, which will invoke
rds_conn_path_connect_if_down() with the default c_path[0].

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:43 -07:00
Sowmini Varadhan
45997e9e2e RDS: Make rds_send_pong() take a rds_conn_path argument
This commit allows rds_send_pong() callers to send back
the rds pong message on some path other than c_path[0] by
passing in a struct rds_conn_path * argument.  It also
removes the last dependency on the #defines in rds_single.h
from send.c

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:43 -07:00
Sowmini Varadhan
01ff34ed44 RDS: Extract rds_conn_path from i_conn_path in rds_send_drop_to() for MP-capable transports
Explicitly set up rds_conn_path, either from i_conn_path (for
MP capable transpots) or as c_path[0], and use this in
rds_send_drop_to()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:43 -07:00
Sowmini Varadhan
1f9ecd7eac RDS: Pass rds_conn_path to rds_send_xmit()
Pass a struct rds_conn_path to rds_send_xmit so that MP capable
transports can transmit packets on something other than c_path[0].
The eventual goal for MP capable transports is to hash the rds
socket to a path based on the bound local address/port, and use
this path as the argument to rds_send_xmit()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:43 -07:00
Sowmini Varadhan
780a6d9e16 RDS: Make rds_send_queue_rm() rds_conn_path aware
Pass the rds_conn_path to rds_send_queue_rm, and use it to initialize
the i_conn_path field in struct rds_incoming. This commit also makes
rds_send_queue_rm() MP capable, because it now takes locks
specific to the rds_conn_path passed in, instead of defaulting to
the c_path[0] based defines from rds_single_path.h

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:42 -07:00
Sowmini Varadhan
7d885d0fc6 RDS: Remove stale function rds_send_get_message()
The only caller of rds_send_get_message() was
rds_iw_send_cq_comp_handler() which was removed as part of
commit dcdede0406d3 ("RDS: Drop stale iWARP RDMA transport"),
so remove rds_send_get_message() for the same reason.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:42 -07:00
Sowmini Varadhan
5c3d274c75 RDS: Add rds_send_path_drop_acked()
rds_send_path_drop_acked() is the path-specific version of
rds_send_drop_acked() to be invoked by MP capable callers.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:42 -07:00
Sowmini Varadhan
4e9b551c14 RDS: Add rds_send_path_reset()
rds_send_path_reset() is the path specific version of rds_send_reset()
intended for MP capable callers.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:42 -07:00
Sowmini Varadhan
5e833e025d RDS: rds_inc_path_init() helper function for MP capable transports
t_mp_capable transports can use rds_inc_path_init to initialize
all fields in struct rds_incoming, including the i_conn_path.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:42 -07:00
Sowmini Varadhan
ef9e62c2e5 RDS: recv path gets the conn_path from rds_incoming for MP capable transports
Transports that are t_mp_capable should set the rds_conn_path
on which the datagram was recived in the ->i_conn_path field
of struct rds_incoming.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:42 -07:00
Sowmini Varadhan
7e8f4413d7 RDS: add t_mp_capable bit to be set by MP capable transports
The t_mp_capable bit will be used in the core rds module
to support multipathing logic when the transport supports it.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:41 -07:00
Sowmini Varadhan
0cb43965d4 RDS: split out connection specific state from rds_connection to rds_conn_path
In preparation for multipath RDS, split the rds_connection
structure into a base structure, and a per-path struct rds_conn_path.
The base structure tracks information and locks common to all
paths. The workqs for send/recv/shutdown etc are tracked per
rds_conn_path. Thus the workq callbacks now work with rds_conn_path.

This commit allows for one rds_conn_path per rds_connection, and will
be extended into multiple conn_paths in  subsequent commits.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:50:41 -07:00
Neal Cardwell
dcf1158b27 tcp: return sizeof tcp_dctcp_info in dctcp_get_info()
Make sure that dctcp_get_info() returns only the size of the
info->dctcp struct that it zeroes out and fills in. Previously it had
been returning the size of the enclosing tcp_cc_info union,
sizeof(*info).  There is no problem yet, but that union that may one
day be larger than struct tcp_dctcp_info, in which case the
TCP_CC_INFO code might accidentally copy uninitialized bytes from the
stack.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:46:30 -07:00
Wei Yongjun
a5e27d18fe sctp: fix error return code in sctp_init()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:45:42 -07:00
David S. Miller
d4c76c1afe RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAV16wsPSw1s6N8H32AQLGqA/+MC6/oufeV5WF/csMucx6I5wauPDQND0b
 E1bOgzceGckwY3NUh14sYo8o/JVxzkKCWrWb0bNBSdqUup6vTMxCBgkSUxvaCbN4
 RIcqu8LDB3vCrzMobMWijZDGqCpF8ZprUcvfCVj1tdrnEbK3njoYJjJ9UZN/lVPw
 Tw+ojKpAty7r/JtZUkWEBhlZZtG26Ko7t+mDbgISOGulPNnw0HSsodMCDb1iSMRS
 MZESOqQ5Kb8vhNOQ0zn+fNWB5d94yLZIxmt/siDXuetopHWuD1u8imqz+lhQ1BTU
 7tOO1PKGxnsalw7FlgNL1cNmOJi/kEKISURG82K1u3pTk2Bf9vtiEPpLnC6p0Qy2
 WwIQVJWewSCYsLiujqhXwmTGHC1m53979VtETHZoKhj/Jm8t3S38XN92Kt/P9y9g
 z1r3EBs6+WzTt4zaw5rjVBY33reeDgEo2RYusLxMOgvPsmX5KMNvpov9R04+JthC
 Qp1zWsE2FerxRJ+CNFBNl0ei4NVMsds3OHuEfgAiEedRowjGokDRQd6stVEev05S
 S2XggUvt011Fvhud0UJWH9koeVcHA85KNCK8V1aj9xH3zsQDizMYhZMhaL34wnu+
 z9XSv6DjZzJVWxC4gyQ/bsvdg34Uerv6g2BwRgqDCfh90Sa5U2n1CL8b3LwTKFU3
 3R3s0l3R5gg=
 =FXID
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160613' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Rename rxrpc source files

Here's the next part of the AF_RXRPC rewrite.  In this set I rename some of
the files in the net/rxrpc/ directory and adjust the Makefile and
ar-internal.h to reflect the changes.

The aim is twofold:

 (1) Remove the "ar-" prefix on those files that have it as it's not really
     useful, especially now that I'm building rxkad in.

 (2) To aid splitting the local, peer, connection and call handling code
     into separate files for object and event handling in future patches by
     making it easier to come up with new filenames.

There are two commits:

 (1) The first commit does a bunch of renames of .c files and alters the
     Makefile.  ar-internal.h isn't renamed at this time to avoid having to
     change the contents of the files being renamed.

 (2) The second commit changes the section label comments in ar-internal.h
     to reflect the changed filenames and reorders the file so that the
     sections are back in filename order.

The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160613
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 23:30:32 -07:00
Amir Vadai
e8eb36cd8c net/sched: flower: Return error when hw can't offload and skip_sw is set
When skip_sw is set and hardware fails to apply filter, return error to
user. This will make error propagation logic similar to the one
currently used in u32 classifier.
Also, changed code to use tc_skip_sw() utility function.

Signed-off-by: Amir Vadai <amirva@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14 22:37:26 -07:00
David Howells
0d81a51ab9 rxrpc: Update the comments in ar-internal.h to reflect renames
Update the section comments in ar-internal.h that indicate the locations of
the referenced items to reflect the renames done to the .c files in
net/rxrpc/.

This also involves some rearrangement to reflect keep the sections in order
of filename.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-06-13 13:38:51 +01:00
David Howells
8c3e34a4ff rxrpc: Rename files matching ar-*.c to git rid of the "ar-" prefix
Rename files matching net/rxrpc/ar-*.c to get rid of the "ar-" prefix.
This will aid splitting those files by making easier to come up with new
names.

Note that the not all files are simply renamed from ar-X.c to X.c.  The
following exceptions are made:

 (*) ar-call.c -> call_object.c
     ar-ack.c -> call_event.c

     call_object.c is going to contain the core of the call object
     handling.  Call event handling is all going to be in call_event.c.

 (*) ar-accept.c -> call_accept.c

     Incoming call handling is going to be here.

 (*) ar-connection.c -> conn_object.c
     ar-connevent.c -> conn_event.c

     The former file is going to have the basic connection object handling,
     but there will likely be some differentiation between client
     connections and service connections in additional files later.  The
     latter file will have all the connection-level event handling.

 (*) ar-local.c -> local_object.c

     This will have the local endpoint object handling code.  The local
     endpoint event handling code will later be split out into
     local_event.c.

 (*) ar-peer.c -> peer_object.c

     This will have the peer endpoint object handling code.  Peer event
     handling code will be placed in peer_event.c (for the moment, there is
     none).

 (*) ar-error.c -> peer_event.c

     This will become the peer event handling code, though for the moment
     it's actually driven from the local endpoint's perspective.

Note that I haven't renamed ar-transport.c to transport_object.c as the
intention is to delete it when the rxrpc_transport struct is excised.

The only file that actually has its contents changed is net/rxrpc/Makefile.

net/rxrpc/ar-internal.h will need its section marker comments updating, but
I'll do that in a separate patch to make it easier for git to follow the
history across the rename.  I may also want to rename ar-internal.h at some
point - but that would mean updating all the #includes and I'd rather do
that in a separate step.

Signed-off-by: David Howells <dhowells@redhat.com.
2016-06-13 12:16:05 +01:00
Florian Westphal
99860208bc sched: remove NET_XMIT_POLICED
sch_atm returns this when TC_ACT_SHOT classification occurs.

But all other schedulers that use tc_classify
(htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS
in this case so just do that in atm.

BATMAN uses it as an intermediate return value to signal
forwarding vs. buffering, but it did not return POLICED to
callers outside of BATMAN.

Reviewed-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-12 22:02:11 -04:00
Hannes Frederic Sowa
38b7097b55 ipv6: use TOS marks from sockets for routing decision
In IPv6 the ToS values are part of the flowlabel in flowi6 and get
extracted during fib rule lookup, but we forgot to correctly initialize
the flowlabel before the routing lookup.

Reported-by: <liam.mcbirnie@boeing.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-11 15:33:26 -07:00
Eric Dumazet
45f50bed1d net_sched: remove generic throttled management
__QDISC_STATE_THROTTLED bit manipulation is rather expensive
for HTB and few others.

I already removed it for sch_fq in commit f2600cf02b5b
("net: sched: avoid costly atomic operation in fq_dequeue()")
and so far nobody complained.

When one ore more packets are stuck in one or more throttled
HTB class, a htb dequeue() performs two atomic operations
to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
lock is held.

Removing this pair of atomic operations bring me a 8 % performance
increase on 200 TCP_RR tests, in presence of throttled classes.

This patch has no side effect, since nothing actually uses
disc_is_throttled() anymore.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:58:21 -07:00
Eric Dumazet
42117927ca net_sched: netem: remove qdisc_is_throttled() use
Looks like it is only there as some optimization attempt.

Since __QDISC_STATE_THROTTLED set/unset is way too expensive,
and netem is the last user, just remove this check.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:58:21 -07:00
Eric Dumazet
cca605dd4b net_sched: cbq: remove a flaky use of qdisc_is_throttled()
So far no qdisc ever unset the throttled bit at enqueue() time,
so CBQ usage of qdisc_is_throttled() was flaky.

Since __QDISC_STATE_THROTTLED set/unset is way too expensive
considering that only CBQ was eventually caring for this status,
it would make sense to implement a Qdisc ops ->is_throttled()
if we find that this is needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:58:20 -07:00
Eric Dumazet
8fe6a79fb8 net_sched: sch_plug: use a private throttled status
We want to get rid of generic qdisc throttled management,
so this qdisc has to use a private flag.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:58:20 -07:00