IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Since batadv_orig_node_new() sets the refcount to two, assuming that
the calling function will use a reference for putting the orig_node into
a hash or similar, both references must be freed if initialization of
the orig_node fails. Otherwise that object may be leaked in that error
case.
Reported-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
When adding a new neighbour it is important to atomically
perform the following:
- check if the neighbour already exists
- append the neighbour to the proper list
If the two operations are not performed in an atomic context
it is possible that two concurrent insertions add the same
neighbour twice.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
pskb_may_pull() returns 1 on success and 0 in case of failure,
therefore checking for the return value being negative does
not make sense at all.
This way if the function fails we will probably read beyond the current
skb data buffer. Fix this by doing the proper check.
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
There is a refcounter unbalance in the CRC checking routine
invoked on OGM reception. A vlan object is retrieved (thus
its refcounter is increased by one) but it is never properly
released. This leads to a memleak because the vlan object
will never be free'd.
Fix this by releasing the vlan object after having read the
CRC.
Reported-by: Russell Senior <russell@personaltelco.net>
Reported-by: Daniel <daniel@makrotopia.org>
Reported-by: cmsv <cmsv@wirelesspt.net>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
When accessing a TT-TVLV container in the OGM RX path
the variable pointing to the list of changes to apply is
altered by mistake.
This makes the TT component read data at the wrong position
in the OGM packet buffer.
Fix it by removing the bogus pointer alteration.
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
The current MTU computation always returns a value
smaller than 1500bytes even if the real interfaces
have an MTU large enough to compensate the batman-adv
overhead.
Fix the computation by properly returning the highest
admitted value.
Introduced by a19d3d85e1
("batman-adv: limit local translation table max size")
Reported-by: Russell Senior <russell@personaltelco.net>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Add a check if payload's length is a power of 2 when selecting ops.
The fast ops were meant for well aligned loads, also this fixes a
small bug when using a length of 3 with some offsets which causes
only 1 byte to be loaded because the fast ops are chosen.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When using nftables with CONFIG_NETFILTER_XT_TARGET_TRACE=n, we get
lots of "TRACE: filter:output:policy:1 IN=..." warnings as several
places will leave skb->nf_trace uninitialised.
Unlike iptables tracing functionality is not conditional in nftables,
so always copy/zero nf_trace setting when nftables is enabled.
Move this into __nf_copy() helper.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The goal of this patch is to allow userland to dump only a part of SA by
specifying a filter during the dump.
The kernel is in charge to filter SA, this avoids to generate useless netlink
traffic (it save also some cpu cycles). This is particularly useful when there
is a big number of SA set on the system.
Note that I removed the union in struct xfrm_state_walk to fix a problem on arm.
struct netlink_callback->args is defined as a array of 6 long and the first long
is used in xfrm code to flag the cb as initialized. Hence, we must have:
sizeof(struct xfrm_state_walk) <= sizeof(long) * 5.
With the union, it was false on arm (sizeof(struct xfrm_state_walk) was
sizeof(long) * 7), due to the padding.
In fact, whatever the arch is, this union seems useless, there will be always
padding after it. Removing it will not increase the size of this struct (and
reduce it on arm).
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Mathias reported that on an AMD Geode LX embedded board (ALiX)
with ath9k driver PACKET_QDISC_BYPASS, introduced in commit
d346a3fae3 ("packet: introduce PACKET_QDISC_BYPASS socket
option"), triggers a WARN_ON() coming from the driver itself
via 066dae93bd ("ath9k: rework tx queue selection and fix
queue stopping/waking").
The reason why this happened is that ndo_select_queue() call
is not invoked from direct xmit path i.e. for ieee80211 subsystem
that sets queue and TID (similar to 802.1d tag) which is being
put into the frame through 802.11e (WMM, QoS). If that is not
set, pending frame counter for e.g. ath9k can get messed up.
So the WARN_ON() in ath9k is absolutely legitimate. Generally,
the hw queue selection in ieee80211 depends on the type of
traffic, and priorities are set according to ieee80211_ac_numbers
mapping; working in a similar way as DiffServ only on a lower
layer, so that the AP can favour frames that have "real-time"
requirements like voice or video data frames.
Therefore, check for presence of ndo_select_queue() in netdev
ops and, if available, invoke it with a fallback handler to
__packet_pick_tx_queue(), so that driver such as bnx2x, ixgbe,
or mlx4 can still select a hw queue for transmission in
relation to the current CPU while e.g. ieee80211 subsystem
can make their own choices.
Reported-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to allow users to invoke netdev_cap_txqueue, it needs to
be moved into netdevice.h header file. While at it, also add kernel
doc header to document the API.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new argument for ndo_select_queue() callback that passes a
fallback handler. This gets invoked through netdev_pick_tx();
fallback handler is currently __netdev_pick_tx() as most drivers
invoke this function within their customized implementation in
case for skbs that don't need any special handling. This fallback
handler can then be replaced on other call-sites with different
queue selection methods (e.g. in packet sockets, pktgen etc).
This also has the nice side-effect that __netdev_pick_tx() is
then only invoked from netdev_pick_tx() and export of that
function to modules can be undone.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I commit e099e86c9e
("tipc: add node_lock protection to link lookup function")
we are calling spin_lock(&node->lock) directly instead of indirectly
via the tipc_node_lock(node) function. However, tipc_node_lock() is
using spin_lock_bh(), not spin_lock(), something leading to
unbalanced usage in one place, and a smatch warning.
We fix this by consistently using tipc_node_lock()/unlock() in
in the places touched by the mentioned commit.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 7d33939f47
("tipc: delay delete of link when failover is needed") we
introduced a loop for finding and removing a link pointer
in an array. The removal is done after we have left the loop,
giving the impression that one may remove the wrong pointer
if no matching element is found.
This is not really a bug, since we know that there will always
be a matching element, but it looks wrong, and causes a smatch
warning.
We fix this loop with this commit.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implementation of (a)rwnd calculation might lead to severe performance issues
and associations completely stalling. These problems are described and solution
is proposed which improves lksctp's robustness in congestion state.
1) Sudden drop of a_rwnd and incomplete window recovery afterwards
Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data),
but size of sk_buff, which is blamed against receiver buffer, is not accounted
in rwnd. Theoretically, this should not be the problem as actual size of buffer
is double the amount requested on the socket (SO_RECVBUF). Problem here is
that this will have bad scaling for data which is less then sizeof sk_buff.
E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion
of traffic of this size (less then 100B).
An example of sudden drop and incomplete window recovery is given below. Node B
exhibits problematic behavior. Node A initiates association and B is configured
to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp
message in 4G (LTE) network). On B data is left in buffer by not reading socket
in userspace.
Lets examine when we will hit pressure state and declare rwnd to be 0 for
scenario with above stated parameters (rwnd == 10000, chunk size == 43, each
chunk is sent in separate sctp packet)
Logic is implemented in sctp_assoc_rwnd_decrease:
socket_buffer (see below) is maximum size which can be held in socket buffer
(sk_rcvbuf). current_alloced is amount of data currently allocated (rx_count)
A simple expression is given for which it will be examined after how many
packets for above stated parameters we enter pressure state:
We start by condition which has to be met in order to enter pressure state:
socket_buffer < currently_alloced;
currently_alloced is represented as size of sctp packets received so far and not
yet delivered to userspace. x is the number of chunks/packets (since there is no
bundling, and each chunk is delivered in separate packet, we can observe each
chunk also as sctp packet, and what is important here, having its own sk_buff):
socket_buffer < x*each_sctp_packet;
each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
twice the amount of initially requested size of socket buffer, which is in case
of sctp, twice the a_rwnd requested:
2*rwnd < x*(payload+sizeof(struc sk_buff));
sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000
and each payload size is 43
20000 < x(43+190);
x > 20000/233;
x ~> 84;
After ~84 messages, pressure state is entered and 0 rwnd is advertised while
received 84*43B ~= 3612B sctp data. This is why external observer notices sudden
drop from 6474 to 0, as it will be now shown in example:
IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
<...>
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]
--> Sudden drop
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
At this point, rwnd_press stores current rwnd value so it can be later restored
in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start
slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This
condition is not met since rwnd, after it hit 0, must first reach rwnd_press by
adding amount which is read from userspace. Let us observe values in above
example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the
amount of actual sctp data currently waiting to be delivered to userspace
is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed
only for sctp data, which is ~3500. Condition is never met, and when userspace
reads all data, rwnd stays on 3569.
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
--> At this point userspace read everything, rwnd recovered only to 3569
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]
Reproduction is straight forward, it is enough for sender to send packets of
size less then sizeof(struct sk_buff) and receiver keeping them in its buffers.
2) Minute size window for associations sharing the same socket buffer
In case multiple associations share the same socket, and same socket buffer
(sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
of the associations can permanently drop rwnd of other association(s).
Situation will be typically observed as one association suddenly having rwnd
dropped to size of last packet received and never recovering beyond that point.
Different scenarios will lead to it, but all have in common that one of the
associations (let it be association from 1)) nearly depleted socket buffer, and
the other association blames socket buffer just for the amount enough to start
the pressure. This association will enter pressure state, set rwnd_press and
announce 0 rwnd.
When data is read by userspace, similar situation as in 1) will occur, rwnd will
increase just for the size read by userspace but rwnd_press will be high enough
so that association doesn't have enough credit to reach rwnd_press and restore
to previous state. This case is special case of 1), being worse as there is, in
the worst case, only one packet in buffer for which size rwnd will be increased.
Consequence is association which has very low maximum rwnd ('minute size', in
our case down to 43B - size of packet which caused pressure) and as such
unusable.
Scenario happened in the field and labs frequently after congestion state (link
breaks, different probabilities of packet drop, packet reordering) and with
scenario 1) preceding. Here is given a deterministic scenario for reproduction:
>From node A establish two associations on the same socket, with rcvbuf_policy
being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
repeat scenario from 1), that is, bring it down to 0 and restore up. Observe
scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered',
bring it down close to 0, as in just one more packet would close it. This has as
a consequence that association number 2 is able to receive (at least) one more
packet which will bring it in pressure state. E.g. if association 2 had rwnd of
10000, packet received was 43, and we enter at this point into pressure,
rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will
increase for 43, but conditions to restore rwnd to original state, just as in
1), will never be satisfied.
--> Association 1, between A.y and B.12345
IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.55915: sctp (1) [COOKIE ACK]
--> Association 2, between A.z and B.12346
IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
IP B.12346 > A.55915: sctp (1) [COOKIE ACK]
--> Deplete socket buffer by sending messages of size 43B over association 1
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
<...>
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]
--> Sudden drop on 1
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
--> Here userspace read, rwnd 'recovered' to 3698, now deplete again using
association 1 so there is place in buffer for only one more packet
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
--> Socket buffer is almost depleted, but there is space for one more packet,
send them over association 2, size 43B
IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
--> Immediate drop
IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]
--> Read everything from the socket, both association recover up to maximum rwnd
they are capable of reaching, note that association 1 recovered up to 3698,
and association 2 recovered only to 43
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]
A careful reader might wonder why it is necessary to reproduce 1) prior
reproduction of 2). It is simply easier to observe when to send packet over
association 2 which will push association into the pressure state.
Proposed solution:
Both problems share the same root cause, and that is improper scaling of socket
buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while
calculating rwnd is not possible due to fact that there is no linear
relationship between amount of data blamed in increase/decrease with IP packet
in which payload arrived. Even in case such solution would be followed,
complexity of the code would increase. Due to nature of current rwnd handling,
slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is
entered is rationale, but it gives false representation to the sender of current
buffer space. Furthermore, it implements additional congestion control mechanism
which is defined on implementation, and not on standard basis.
Proposed solution simplifies whole algorithm having on mind definition from rfc:
o Receiver Window (rwnd): This gives the sender an indication of the space
available in the receiver's inbound buffer.
Core of the proposed solution is given with these lines:
sctp_assoc_rwnd_update:
if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
else
asoc->rwnd = 0;
We advertise to sender (half of) actual space we have. Half is in the braces
depending whether you would like to observe size of socket buffer as SO_RECVBUF
or twice the amount, i.e. size is the one visible from userspace, that is,
from kernelspace.
In this way sender is given with good approximation of our buffer space,
regardless of the buffer policy - we always advertise what we have. Proposed
solution fixes described problems and removes necessity for rwnd restoration
algorithm. Finally, as proposed solution is simplification, some lines of code,
along with some bytes in struct sctp_association are saved.
Version 2 of the patch addressed comments from Vlad. Name of the function is set
to be more descriptive, and two parts of code are changed, in one removing the
superfluous call to sctp_assoc_rwnd_update since call would not result in update
of rwnd, and the other being reordering of the code in a way that call to
sctp_assoc_rwnd_update updates rwnd. Version 3 corrected change introduced in v2
in a way that existing function is not reordered/copied in line, but it is
correctly called. Thanks Vlad for suggesting.
Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently this always returns ENOBUFS, because the return value of
__ip_tunnel_create is discarded.
A more common failure is a duplicate name (EEXIST). Propagate the real
error code so userspace can display a more meaningful error message.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
since commit 251da413("ipv4: Cache ip_error() routes even when not forwarding."),
the counter IPSTATS_MIB_INADDRERRORS can't work correctly, because the value of
err was always set to ENETUNREACH.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dccp tfrc: revert
This reverts 6aee49c558 ("dccp: make local variable static") since
the variable tfrc_debug is referenced by the tfrc_pr_debug(fmt, ...)
macro when TFRC debugging is enabled. If it is enabled, use of the
macro produces a compilation error.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
In gss_alloc_msg(), if the call to gss_encode_v1_msg() fails, we
want to release the reference to the pipe_version that was obtained
earlier in the function.
Fixes: 9d3a2260f0 (SUNRPC: Fix buffer overflow checking in...)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fix a race in which the RPC client is shutting down while the
gss daemon is processing a downcall. If the RPC client manages to
shut down before the gss daemon is done, then the struct gss_auth
used in gss_release_msg() may have already been freed.
Link: http://lkml.kernel.org/r/1392494917.71728.YahooMailNeo@web140002.mail.bf1.yahoo.com
Reported-by: John <da_audiophile@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Openvswitch defines u64_stats_sync as ->sync rather than ->syncp,
so fails to compile with netdev_alloc_pcpu_stats(). So just rename it to ->syncp.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: 1c213bd24a (net: introduce netdev_alloc_pcpu_stats() for drivers)
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
checkpatch error: switch and case should be at the same indent.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix checkpatch errors below:
ERROR: "foo* bar" should be "foo *bar"
ERROR: "foo * bar" should be "foo *bar"
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix checkpatch errors while the space is required or prohibited
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two new fields to struct tcp_info, to report sk_pacing_rate
and sk_max_pacing_rate to monitoring applications, as ss from iproute2.
User exported fields are 64bit, even if kernel is currently using 32bit
fields.
lpaa5:~# ss -i
..
skmem:(r0,rb357120,t0,tb2097152,f1584,w1980880,o0,bl0) ts sack cubic
wscale:6,6 rto:400 rtt:0.875/0.75 mss:1448 cwnd:1 ssthresh:12 send
13.2Mbps pacing_rate 3336.2Mbps unacked:15 retrans:1/5448 lost:15
rcv_space:29200
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are many drivers calling alloc_percpu() to allocate pcpu stats
and then initializing ->syncp. So just introduce a helper function for them.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove one inline keyword, and no need for a loop to find
an index into a table.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The solution was found by Patrick in 2.4 kernel sources.
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are two checks for CONFIG_NET_CLS_ROUTE, but the corresponding
Kconfig symbol was dropped in v2.6.39. Since the code guards access to
dst_entry.tclassid it seems CONFIG_IP_ROUTE_CLASSID should be used
instead.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
For IPv4 packets, we call both IPv4 and IPv6 reject.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Replace some magic numbers which describe states of GE model
loss generator with enumerate.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In netem_change(), we have already get "struct netem_sched_data *q".
Replace params of get_correlation() and other similar functions with
"struct netem_sched_data *q".
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
get_dist_table() and get_loss_clg() may be failed. These
two functions should be called after setting the members
of qdisc_priv(sch), or it will break the old settings while
either of them is failed.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if the 'time_before' macro expand with parentheses, the look is bad.
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packets which have L2 address different from ours should be
already filtered before entering into ip_forward().
Perform that check at the beginning to avoid processing such packets.
Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of my pet coding style peeves is the practice of
adding extra return; at the end of function.
Kill several instances of this in network code.
I suppose some coccinelle wizardy could do this automatically.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix incorrect comment reported by Norbert Kiesel. Edit another comment to add
more details. Also add references to algorithm (IETF draft and paper) to top of
file.
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
CC: Mythili Prabhu <mysuryan@cisco.com>
CC: Norbert Kiesel <nkiesel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 684bad1107 "tcp: use PRR to reduce cwin in CWR state" removed all
calls to min_cwnd, so we can safely remove it.
Also, remove tcp_reno_min_cwnd because it was only used for min_cwnd.
Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In an earlier commit, ("tipc: remove links list from bearer struct")
we described three issues that need to be pre-emptively resolved before
we can remove tipc_net_lock. Here we resolve issue a) described in that
commit:
"a) In access method #2, we access the link before taking the
protecting node_lock. This will not work once net_lock is gone,
so we will have to change the access order. We will deal with
this in a later commit in this series."
Here, we change that access order, by ensuring that the function
link_find_link() returns only a safe reference for finding
the link, i.e., a node pointer and an index into its 'links' array,
not the link pointer itself. We also change all callers of this
function to first take the node lock before they can check if there
still is a valid link pointer at the returned index. Since the
function now returns a node pointer rather than a link pointer,
we rename it to the more appropriate 'tipc_link_find_owner().
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the earlier commits ("tipc: remove 'links' list from
tipc_bearer struct") and ("tipc: introduce new spinlock to protect
struct link_req"), there is no longer any need to protect struct
link_req or or any link list by use of bearer_lock. Furthermore,
we have eliminated the need for using bearer_lock during downcalls
(send) from the link to the bearer, since we have ensured that
bearers always have a longer life cycle that their associated links,
and always contain valid data.
So, the only need now for a lock protecting bearers is for guaranteeing
consistency of the bearer list itself. For this, it is sufficient, at
least for the time being, to continue applying 'net_lock´ in write mode.
By removing bearer_lock we also pre-empt introduction of issue b) descibed
in the previous commit "tipc: remove 'links' list from tipc_bearer struct":
"b) When the outer protection from net_lock is gone, taking
bearer_lock and node_lock in opposite order of method 1) and 2)
will become an obvious deadlock hazard".
Therefore, we now eliminate the bearer_lock spinlock.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a bearer is disabled, all its attached links are deleted.
Ideally, we should do link failover to redundant links on other bearers,
if there are any, in such cases. This would be consistent with current
behavior when a link is reset, but not deleted. However, due to the
complexity involved, and the (wrongly) perceived low demand for this
feature, it was never implemented until now.
We mark the doomed link for deletion with a new flag, but wait until the
failover process is finished before we actually delete it. With the
improved link tunnelling/failover code introduced earlier in this commit
series, it is now easy to identify a spot in the code where the failover
is finished and it is safe to delete the marked link. Moreover, the test
for the flag and the deletion can be done synchronously, and outside the
most time critical data path.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We change the order of checking for destination users when processing
incoming packets. By placing the checks for users that may potentially
replace the processed buffer, i.e., CHANGEOVER_PROTOCOL and
MSG_FRAGMENTER, in a separate step before we check for the true end
users, we get rid of a label and a 'goto', at the same time making the
code more comprehensible and easy to follow.
This commit does not change any functionality, it is just a cosmetic
code reshuffle.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous redesign of the tunnel reception algorithm and
functions, we finalize it by renaming a couple of stack variables
in tipc_tunnel_rcv(). This makes it more consistent with the naming
scheme elsewhere in this part of the code.
This change is purely cosmetic, with no functional changes.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We simplify and slim down the code in function tipc_tunnel_rcv()
No impact on the users of this function.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the earlier commits in this series related to the function
tipc_link_tunnel_rcv(), we can now go further and simplify its
signature.
The function now consumes all DUPLICATE packets, and only returns such
ORIGINAL packets that are ready for immediate delivery, i.e., no
more link level protocol processing needs to be done by the caller.
As a consequence, the the caller, tipc_rcv(), does not access the link
pointer after call return, and it becomes unnecessary to pass a link
pointer reference in the call. Instead, we now only pass it the tunnel
link's owner node, which is sufficient to find the destination link for
the tunnelled packet.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a link is reset, and there is a redundant link available, all
sender sockets will steer their subsequent traffic through the
remaining link. In order to guarantee preserved packet order and
cardinality during the transition, we tunnel the failing link's send
queue through the remaining link before we allow any sockets to use it.
In this commit, we change the algorithm for receiving failover
("ORIGINAL_MSG") packets in tipc_link_tunnel_rcv(), at the same time
delegating it to a new subfuncton, tipc_link_failover_rcv(). Instead
of directly returning an extracted inner packet to the packet reception
loop in tipc_rcv(), we first check if it is a message fragment, in which
case we append it to the reset link's fragment chain. If the fragment
chain is complete, we return the whole chain instead of the individual
buffer, eliminating any need for the tipc_rcv() loop to do reassembly of
tunneled packets.
This change makes it possible to further simplify tipc_link_tunnel_rcv(),
as well as the calling tipc_rcv() loop. We will do that in later
commits. It also makes it possible to identify a single spot in the code
where we can tell that a failover procedure is finished, something that
is useful when we are deleting links after a failover. This will also
be done in a later commit.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a second link to a destination comes up, some sender sockets will
steer their subsequent traffic through the new link. In order to
guarantee preserved packet order and cardinality for those sockets, we
tunnel a duplicate of the old link's send queue through the new link
before we open it for regular traffic. The last arriving packet copy,
on whichever link, will be dropped at the receiving end based on the
original sequence number, to ensure that only one copy is delivered to
the end receiver.
In this commit, we change the algorithm for receiving DUPLICATE_MSG
packets, at the same time delegating it to a new subfunction,
tipc_link_dup_rcv(). Instead of returning an extracted inner packet to
the packet reception loop in tipc_rcv(), we just add it to the receiving
(new) link's deferred packet queue. The packet will then be processed by
that link when it receives its first non-tunneled packet, i.e., at
latest when the changeover procedure is finished.
Because tipc_link_tunnel_rcv()/tipc_link_dup_rcv() now is consuming all
packets of type DUPLICATE_MSG, the calling tipc_rcv() function can omit
testing for this. This in turn means that the current conditional jump
to the label 'protocol_check' becomes redundant, and we can remove that
label.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In our ongoing effort to simplify the TIPC locking structure,
we see a need to remove the linked list for tipc_links
in the bearer. This can be explained as follows.
Currently, we have three different ways to access a link,
via three different lists/tables:
1: Via a node hash table:
Used by the time-critical outgoing/incoming data paths.
(e.g. link_send_sections_fast() and tipc_recv_msg() ):
grab net_lock(read)
find node from node hash table
grab node_lock
select link
grab bearer_lock
send_msg()
release bearer_lock
release node lock
release net_lock
2: Via a global linked list for nodes:
Used by configuration commands (link_cmd_set_value())
grab net_lock(read)
find node and link from global node list (using link name)
grab node_lock
update link
release node lock
release net_lock
(Same locking order as above. No problem.)
3: Via the bearer's linked link list:
Used by notifications from interface (e.g. tipc_disable_bearer() )
grab net_lock(write)
grab bearer_lock
get link ptr from bearer's link list
get node from link
grab node_lock
delete link
release node lock
release bearer_lock
release net_lock
(Different order from above, but works because we grab the
outer net_lock in write mode first, excluding all other access.)
The first major goal in our simplification effort is to get rid
of the "big" net_lock, replacing it with rcu-locks when accessing
the node list and node hash array. This will come in a later patch
series.
But to get there we first need to rewrite access methods ##2 and 3,
since removal of net_lock would introduce three major problems:
a) In access method #2, we access the link before taking the
protecting node_lock. This will not work once net_lock is gone,
so we will have to change the access order. We will deal with
this in a later commit in this series, "tipc: add node lock
protection to link found by link_find_link()".
b) When the outer protection from net_lock is gone, taking
bearer_lock and node_lock in opposite order of method 1) and 2)
will become an obvious deadlock hazard. This is fixed in the
commit ("tipc: remove bearer_lock from tipc_bearer struct")
later in this series.
c) Similar to what is described in problem a), access method #3
starts with using a link pointer that is unprotected by node_lock,
in order to via that pointer find the correct node struct and
lock it. Before we remove net_lock, this access order must be
altered. This is what we do with this commit.
We can avoid introducing problem problem c) by even here using the
global node list to find the node, before accessing its links. When
we loop though the node list we use the own bearer identity as search
criteria, thus easily finding the links that are associated to the
resetting/disabling bearer. It should be noted that although this
method is somewhat slower than the current list traversal, it is in
no way time critical. This is only about resetting or deleting links,
something that must be considered relatively infrequent events.
As a bonus, we can get rid of the mutual pointers between links and
bearers. After this commit, pointer dependency go in one direction
only: from the link to the bearer.
This commit pre-empts introduction of problem c) as described above.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the 'started' field in struct tipc_link represents only a
binary state, 'started' or 'not started'. We need it to represent
more link execution states in the coming commits in this series.
Hence, we rename the field to 'flags', and define the current
started/non-started state to be represented by the LSB bit of
that field.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We break out the code for deleting attached links in the
function bearer_disable(), and define a new function named
tipc_link_delete_list() to do this job.
This commit incurs no functional changes, but makes the code of
function bearer_disable() cleaner. It is also a preparation
for a more important change to the bearer code, in a subsequent
commit in this series.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We break out the code for resetting attached links in the
function tipc_reset_bearer(), and define a new function named
tipc_link_reset_list() to do this job.
This commit incurs no functional changes, but makes the code
of function tipc_reset_bearer() cleaner. It is also a preparation
for a more important change to the bearer code, in a subsequent
commit in this series.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function tipc_link_recv_fragment(struct sk_buff **buf) currently
leaves the value of the input buffer pointer undefined when it returns,
except when the return code indicates that the reassembly is complete.
This despite the fact that it always consumes the input buffer.
Here, we enforce a stricter behavior by this function, ensuring that
the returned buffer pointer is non-NULL if and only if the reassembly
is complete. This makes it possible to test for the buffer pointer as
criteria for successful reassembly.
We also rename the function to tipc_link_frag_rcv(), which is both
shorter and more in line with common naming practice in the network
subsystem.
Apart from the new name, these changes have no impact on current
users of the function, but makes it more practical for use in some
planned future commits.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The inline functions in addr.h uses tipc_own_addr which is exported by
core.h, but addr.h never actually includes it. It works because it is
explicitly included where this is used, but it looks a bit strange.
Include core.h in addr.h explicitly to make the dependency clearer.
Signed-off-by: Andreas Bofjäll <andreas.bofjall@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner reported problems when the forwarding link path
has a lower mtu than the incoming one if the inbound interface supports GRO.
Given:
Host <mtu1500> R1 <mtu1200> R2
Host sends tcp stream which is routed via R1 and R2. R1 performs GRO.
In this case, the kernel will fail to send ICMP fragmentation needed
messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
checks in forward path. Instead, Linux tries to send out packets exceeding
the mtu.
When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
not fragment the packets when forwarding, and again tries to send out
packets exceeding R1-R2 link mtu.
This alters the forwarding dstmtu checks to take the individual gso
segment lengths into account.
For ipv6, we send out pkt too big error for gso if the individual
segments are too big.
For ipv4, we either send icmp fragmentation needed, or, if the DF bit
is not set, perform software segmentation and let the output path
create fragments when the packet is leaving the machine.
It is not 100% correct as the error message will contain the headers of
the GRO skb instead of the original/segmented one, but it seems to
work fine in my (limited) tests.
Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
sofware segmentation.
However it turns out that skb_segment() assumes skb nr_frags is related
to mss size so we would BUG there. I don't want to mess with it considering
Herbert and Eric disagree on what the correct behavior should be.
Hannes Frederic Sowa notes that when we would shrink gso_size
skb_segment would then also need to deal with the case where
SKB_MAX_FRAGS would be exceeded.
This uses sofware segmentation in the forward path when we hit ipv4
non-DF packets and the outgoing link mtu is too small. Its not perfect,
but given the lack of bug reports wrt. GRO fwd being broken this is a
rare case anyway. Also its not like this could not be improved later
once the dust settles.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Will be used by upcoming ipv4 forward path change that needs to
determine feature mask using skb->dst->dev instead of skb->dev.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here, when the net is init_net, we needn't to kmemdup the ctl_table
again. So add a check for net. Also we can save some memory.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As commit 3c68198e75111a90("sctp: Make hmac algorithm selection for
cookie generation dynamic"), we miss the .data initialization.
If we don't use the net_namespace, the problem that parts of the
sysctl configuration won't be isolation and won't occur.
In sctp_sysctl_net_register(), we register the sysctl for each
net, in the for(), we use the 'table[i].data' as check condition, so
when the 'i' is the index of sctp_hmac_alg, the data is NULL, then
break. So add the .data initialization.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I saw the following BUG when ->newlink() fails in rtnl_newlink():
[ 40.240058] kernel BUG at net/core/dev.c:6438!
this is due to free_netdev() is not supposed to be called before
netdev is completely unregistered, therefore it is not correct
to call free_netdev() here, at least for ops->newlink!=NULL case,
many drivers call it in ->destructor so that rtnl_unlock() will
take care of it, we probably don't need to do anything here.
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a packet received on a link is out-of-sequence, it will be
placed on a deferred queue and later reinserted in the receive
path once the preceding packets have been processed. The problem
with this is that it will be subject to the buffer adjustment from
link_recv_buf_validate twice. The second adjustment for 20 bytes
header space will corrupt the packet.
We solve this by tagging the deferred packets and bail out from
receive buffer validation for packets that have already been
subjected to this.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case when KMs have no listeners, km_query() will fail and
temporary SAs are garbage collected immediately after their allocation.
This causes strain on memory allocation, leading even to OOM since
temporary SA alloc/free cycle is performed for every packet
and garbage collection does not keep up the pace.
The sane thing to do is to make sure we have audience before
temporary SA allocation.
Signed-off-by: Horia Geanta <horia.geanta@freescale.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
IPX doesn't implement shutdown, which poses a problem to some users:
https://bugzilla.kernel.org/show_bug.cgi?id=67841
This patch is heavily based on the shutdown implementation for unix
sockets.
Reported-by: Bruno Jesus <00cpxxx@gmail.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We could allocate tc_action on stack in tca_action_flush(),
since it is not large.
Also, we could use create_a() in tcf_action_get_1().
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an action is bonnd to a filter, there is no point to
remove it outside. Currently we just silently decrease the refcnt,
we should reject this explicitly with EPERM.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For bindcnt and refcnt etc., they are common for all actions,
not need to repeat such operations for their own, they can be unified
now. Actions just need to do its specific cleanup if needed.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we can totally hide it from modules. tcf_hash_*() API's
will operate on struct tc_action, modules don't need to care about
the details.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Improves reliability of wifi connections with WPA, since authentication
frames are prioritized over normal traffic and also typically exempt
from aggregation.
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Inserting a entry into flowcache, or flushing flowcache should be based
on per net scope. The reason to do so is flushing operation from fat
netns crammed with flow entries will also making the slim netns with only
a few flow cache entries go away in original implementation.
Since flowcache is tightly coupled with IPsec, so it would be easier to
put flow cache global parameters into xfrm namespace part. And one last
thing needs to do is bumping flow cache genid, and flush flow cache should
also be made in per net style.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Clear checking when user try to use ESN through netlink keymgr for AH.
As only ESP and AH support ESN feature according to RFC.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch add esn support for AH input stage by attaching upper 32bits
sequence number right after packet payload as specified by RFC 4302.
Then the ICV value will guard upper 32bits sequence number as well when
packet going in.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch add esn support for AH output stage by attaching upper 32bits
sequence number right after packet payload as specified by RFC 4302.
Then the ICV value will guard upper 32bits sequence number as well when
packet going out.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch add esn support for AH input stage by attaching upper 32bits
sequence number right after packet payload as specified by RFC 4302.
Then the ICV value will guard upper 32bits sequence number as well when
packet getting in.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch add esn support for AH output stage by attaching upper 32bits
sequence number right after packet payload as specified by RFC 4302.
Then the ICV value will guard upper 32bits sequence number as well when
packet going out.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
As compared with skb_to_sgvec, skb_to_sgvec_nomark only map skb to given sglist
without mark the sg which contain last skb data as the end. So the caller can
mannipulate sg list as will when padding new data after the first call without
calling sg_unmark_end to expend sg list.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pull networking updates from David Miller:
1) Fix flexcan build on big endian, from Arnd Bergmann
2) Correctly attach cpsw to GPIO bitbang MDIO drive, from Stefan Roese
3) udp_add_offload has to use GFP_ATOMIC since it can be invoked from
non-sleepable contexts. From Or Gerlitz
4) vxlan_gro_receive() does not iterate over all possible flows
properly, fix also from Or Gerlitz
5) CAN core doesn't use a proper SKB destructor when it hooks up
sockets to SKBs. Fix from Oliver Hartkopp
6) ip_tunnel_xmit() can use an uninitialized route pointer, fix from
Eric Dumazet
7) Fix address family assignment in IPVS, from Michal Kubecek
8) Fix ath9k build on ARM, from Sujith Manoharan
9) Make sure fail_over_mac only applies for the correct bonding modes,
from Ding Tianhong
10) The udp offload code doesn't use RCU correctly, from Shlomo Pongratz
11) Handle gigabit features properly in generic PHY code, from Florian
Fainelli
12) Don't blindly invoke link operations in
rtnl_link_get_slave_info_data_size, they are optional. Fix from
Fernando Luis Vazquez Cao
13) Add USB IDs for Netgear Aircard 340U, from Bjørn Mork
14) Handle netlink packet padding properly in openvswitch, from Thomas
Graf
15) Fix oops when deleting chains in nf_tables, from Patrick McHardy
16) Fix RX stalls in xen-netback driver, from Zoltan Kiss
17) Fix deadlock in mac80211 stack, from Emmanuel Grumbach
18) inet_nlmsg_size() forgets to consider ifa_cacheinfo, fix from Geert
Uytterhoeven
19) tg3_change_mtu() can deadlock, fix from Nithin Sujir
20) Fix regression in setting SCTP local source addresses on accepted
sockets, caused by some generic ipv6 socket changes. Fix from
Matija Glavinic Pecotic
21) IPPROTO_* must be pure defines, otherwise module aliases don't get
constructed properly. Fix from Jan Moskyto
22) IPV6 netconsole setup doesn't work properly unless an explicit
source address is specified, fix from Sabrina Dubroca
23) Use __GFP_NORETRY for high order skb page allocations in
sock_alloc_send_pskb and skb_page_frag_refill. From Eric Dumazet
24) Fix a regression added in netconsole over bridging, from Cong Wang
25) TCP uses an artificial offset of 1ms for SRTT, but this doesn't jive
well with TCP pacing which needs the SRTT to be accurate. Fix from
Eric Dumazet
26) Several cases of missing header file includes from Rashika Kheria
27) Add ZTE MF667 device ID to qmi_wwan driver, from Raymond Wanyoike
28) TCP Small Queues doesn't handle nonagle properly in some corner
cases, fix from Eric Dumazet
29) Remove extraneous read_unlock in bond_enslave, whoops. From Ding
Tianhong
30) Fix 9p trans_virtio handling of vmalloc buffers, from Richard Yao
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (136 commits)
6lowpan: fix lockdep splats
alx: add missing stats_lock spinlock init
9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers
bonding: remove unwanted bond lock for enslave processing
USB2NET : SR9800 : One chip USB2.0 USB2NET SR9800 Device Driver Support
tcp: tsq: fix nonagle handling
bridge: Prevent possible race condition in br_fdb_change_mac_address
bridge: Properly check if local fdb entry can be deleted when deleting vlan
bridge: Properly check if local fdb entry can be deleted in br_fdb_delete_by_port
bridge: Properly check if local fdb entry can be deleted in br_fdb_change_mac_address
bridge: Fix the way to check if a local fdb entry can be deleted
bridge: Change local fdb entries whenever mac address of bridge device changes
bridge: Fix the way to find old local fdb entries in br_fdb_change_mac_address
bridge: Fix the way to insert new local fdb entries in br_fdb_changeaddr
bridge: Fix the way to find old local fdb entries in br_fdb_changeaddr
tcp: correct code comment stating 3 min timeout for FIN_WAIT2, we only do 1 min
net: vxge: Remove unused device pointer
net: qmi_wwan: add ZTE MF667
3c59x: Remove unused pointer in vortex_eisa_cleanup()
net: fix 'ip rule' iif/oif device rename
...
The call to xprt_free_allocation() will call list_del() on
req->rq_bc_pa_list, which is not attached to a list.
This patch moves the list_del() out of xprt_free_allocation()
and into those callers that need it.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When a send failure occurs due to the socket being out of buffer space,
we call xs_nospace() in order to have the RPC task wait until the
socket has drained enough to make it worth while trying again.
The current patch fixes a race in which the socket is drained before
we get round to setting up the machinery in xs_nospace(), and which
is reported to cause hangs.
Link: http://lkml.kernel.org/r/20140210170315.33dfc621@notabene.brown
Fixes: a9a6b52ee1 (SUNRPC: Don't start the retransmission timer...)
Reported-by: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In case ieee80211_prep_connection() fails to dereference
sdata->vif.chanctx_conf, the function returns and doesn't
free new_sta. fixed.
Signed-off-by: Eytan Lifshitz <eytan.lifshitz@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In case we were not suspended, the reconfig function returns without
configuring the scheduled scan.
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When a device ndo_start_xmit() calls again dev_queue_xmit(),
lockdep can complain because dev_queue_xmit() is re-entered and the
spinlocks protecting tx queues share a common lockdep class.
Same issue was fixed for bonding/l2tp/ppp in commits
0daa230302 ("[PATCH] bonding: lockdep annotation")
49ee49202b ("bonding: set qdisc_tx_busylock to avoid LOCKDEP splat")
23d3b8bfb8 ("net: qdisc busylock needs lockdep annotations ")
303c07db48 ("ppp: set qdisc_tx_busylock to avoid LOCKDEP splat ")
Reported-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 9p-virtio transport does zero copy on things larger than 1024 bytes
in size. It accomplishes this by returning the physical addresses of
pages to the virtio-pci device. At present, the translation is usually a
bit shift.
That approach produces an invalid page address when we read/write to
vmalloc buffers, such as those used for Linux kernel modules. Any
attempt to load a Linux kernel module from 9p-virtio produces the
following stack.
[<ffffffff814878ce>] p9_virtio_zc_request+0x45e/0x510
[<ffffffff814814ed>] p9_client_zc_rpc.constprop.16+0xfd/0x4f0
[<ffffffff814839dd>] p9_client_read+0x15d/0x240
[<ffffffff811c8440>] v9fs_fid_readn+0x50/0xa0
[<ffffffff811c84a0>] v9fs_file_readn+0x10/0x20
[<ffffffff811c84e7>] v9fs_file_read+0x37/0x70
[<ffffffff8114e3fb>] vfs_read+0x9b/0x160
[<ffffffff81153571>] kernel_read+0x41/0x60
[<ffffffff810c83ab>] copy_module_from_fd.isra.34+0xfb/0x180
Subsequently, QEMU will die printing:
qemu-system-x86_64: virtio: trying to map MMIO memory
This patch enables 9p-virtio to correctly handle this case. This not
only enables us to load Linux kernel modules off virtfs, but also
enables ZFS file-based vdevs on virtfs to be used without killing QEMU.
Special thanks to both Avi Kivity and Alexander Graf for their
interpretation of QEMU backtraces. Without their guidence, tracking down
this bug would have taken much longer. Also, special thanks to Linus
Torvalds for his insightful explanation of why this should use
is_vmalloc_addr() instead of is_vmalloc_or_module_addr():
https://lkml.org/lkml/2014/2/8/272
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 46d3ceabd8 ("tcp: TCP Small Queues") introduced a possible
regression for applications using TCP_NODELAY.
If TCP session is throttled because of tsq, we should consult
tp->nonagle when TX completion is done and allow us to send additional
segment, especially if this segment is not a full MSS.
Otherwise this segment is sent after an RTO.
[edumazet] : Cooked the changelog, added another fix about testing
sk_wmem_alloc twice because TX completion can happen right before
setting TSQ_THROTTLED bit.
This problem is particularly visible with recent auto corking,
but might also be triggered with low tcp_limit_output_bytes
values or NIC drivers delaying TX completion by hundred of usec,
and very low rtt.
Thomas Glanzmann for example reported an iscsi regression, caused
by tcp auto corking making this bug quite visible.
Fixes: 46d3ceabd8 ("tcp: TCP Small Queues")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Thomas Glanzmann <thomas@glanzmann.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_change_mac_address() calls fdb_insert()/fdb_delete() without
br->hash_lock.
These hash list updates are racy with br_fdb_update()/br_fdb_cleanup().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlan codes unconditionally delete local fdb entries.
We should consider the possibility that other ports have the same
address and vlan.
Example of problematic case:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
brctl addif br0 eth1 # br0 will have mac address 12:34:56:78:90:ab
bridge vlan add dev eth0 vid 10
bridge vlan add dev eth1 vid 10
bridge vlan add dev br0 vid 10 self
We will have fdb entry such that f->dst == eth0, f->vlan_id == 10 and
f->addr == 12:34:56:78:90:ab at this time.
Next, delete eth0 vlan 10.
bridge vlan del dev eth0 vid 10
In this case, we still need the entry for br0, but it will be deleted.
Note that br0 needs the entry even though its mac address is not set
manually. To delete the entry with proper condition checking,
fdb_delete_local() is suitable to use.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_delete_by_port() doesn't care about vlan and mac address of the
bridge device.
As the check is almost the same as mac address changing, slightly modify
fdb_delete_local() and use it.
Note that we can always set added_by_user to 0 in fdb_delete_local() because
- br_fdb_delete_by_port() calls fdb_delete_local() for local entries
regardless of its added_by_user. In this case, we have to check if another
port has the same address and vlan, and if found, we have to create the
entry (by changing dst). This is kernel-added entry, not user-added.
- br_fdb_changeaddr() doesn't call fdb_delete_local() for user-added entry.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_change_mac_address() doesn't check if the local entry has the
same address as any of bridge ports.
Although I'm not sure when it is beneficial, current implementation allow
the bridge device to receive any mac address of its ports.
To preserve this behavior, we have to check if the mac address of the
entry being deleted is identical to that of any port.
As this check is almost the same as that in br_fdb_changeaddr(), create
a common function fdb_delete_local() and call it from
br_fdb_changeadddr() and br_fdb_change_mac_address().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should take into account the followings when deleting a local fdb
entry.
- nbp_vlan_find() can be used only when vid != 0 to check if an entry is
deletable, because a fdb entry with vid 0 can exist at any time while
nbp_vlan_find() always return false with vid 0.
Example of problematic case:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address 12:34:56:78:90:ab
brctl addif br0 eth0
brctl addif br0 eth1
ip link set eth0 address aa:bb:cc:dd:ee:ff
Then, the fdb entry 12:34:56:78:90:ab will be deleted even though the
bridge port eth1 still has that address.
- The port to which the bridge device is attached might needs a local entry
if its mac address is set manually.
Example of problematic case:
ip link set eth0 address 12:34:56:78:90:ab
brctl addif br0 eth0
ip link set br0 address 12:34:56:78:90:ab
ip link set eth0 address aa:bb:cc:dd:ee:ff
Then, the fdb still must have the entry 12:34:56:78:90:ab, but it will be
deleted.
We can use br->dev->addr_assign_type to check if the address is manually
set or not, but I propose another approach.
Since we delete and insert local entries whenever changing mac address
of the bridge device, we can change dst of the entry to NULL regardless of
addr_assign_type when deleting an entry associated with a certain port,
and if it is found to be unnecessary later, then delete it.
That is, if changing mac address of a port, the entry might be changed
to its dst being NULL first, but is eventually deleted when recalculating
and changing bridge id.
This approach is especially useful when we want to share the code with
deleting vlan in which the bridge device might want such an entry regardless
of addr_assign_type, and makes things easy because we don't have to consider
if mac address of the bridge device will be changed or not at the time we
delete a local entry of a port, which means fdb code will not be bothered
even if the bridge id calculating logic is changed in the future.
Also, this change reduces inconsistent state, where frames whose dst is the
mac address of the bridge, can't reach the bridge because of premature fdb
entry deletion. This change reduces the possibility that the bridge device
replies unreachable mac address to arp requests, which could occur during
the short window between calling del_nbp() and br_stp_recalculate_bridge_id()
in br_del_if(). This will effective after br_fdb_delete_by_port() starts to
use the same code by following patch.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlan code may need fdb change when changing mac address of bridge device
even if it is caused by the mac address changing of a bridge port.
Example configuration:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
brctl addif br0 eth1 # br0 will have mac address 12:34:56:78:90:ab
bridge vlan add dev br0 vid 10 self
bridge vlan add dev eth0 vid 10
We will have fdb entry such that f->dst == NULL, f->vlan_id == 10 and
f->addr == 12:34:56:78:90:ab at this time.
Next, change the mac address of eth0 to greater value.
ip link set eth0 address ee:ff:12:34:56:78
Then, mac address of br0 will be recalculated and set to aa:bb:cc:dd:ee:ff.
However, an entry aa:bb:cc:dd:ee:ff will not be created and we will be not
able to communicate using br0 on vlan 10.
Address this issue by deleting and adding local entries whenever
changing the mac address of the bridge device.
If there already exists an entry that has the same address, for example,
in case that br_fdb_changeaddr() has already inserted it,
br_fdb_change_mac_address() will simply fail to insert it and no
duplicated entry will be made, as it was.
This approach also needs br_add_if() to call br_fdb_insert() before
br_stp_recalculate_bridge_id() so that we don't create an entry whose
dst == NULL in this function to preserve previous behavior.
Note that this is a slight change in behavior where the bridge device can
receive the traffic to the new address before calling
br_stp_recalculate_bridge_id() in br_add_if().
However, it is not a problem because we have already the address on the
new port and such a way to insert new one before recalculating bridge id
is taken in br_device_event() as well.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have been always failed to delete the old entry at
br_fdb_change_mac_address() because br_set_mac_address() updates
dev->dev_addr before calling br_fdb_change_mac_address() and
br_fdb_change_mac_address() uses dev->dev_addr to find the old entry.
That update of dev_addr is completely unnecessary because the same work
is done in br_stp_change_bridge_id() which is called right away after
calling br_fdb_change_mac_address().
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit bc9a25d21e ("bridge: Add vlan support for local fdb entries"),
br_fdb_changeaddr() has inserted a new local fdb entry only if it can
find old one. But if we have two ports where they have the same address
or user has deleted a local entry, there will be no entry for one of the
ports.
Example of problematic case:
ip link set eth0 address aa:bb:cc:dd:ee:ff
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
brctl addif br0 eth1 # eth1 will not have a local entry due to dup.
ip link set eth1 address 12:34:56:78:90:ab
Then, the new entry for the address 12:34:56:78:90:ab will not be
created, and the bridge device will not be able to communicate.
Insert new entries regardless of whether we can find old entries or not.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_changeaddr() assumes that there is at most one local entry per port
per vlan. It used to be true, but since commit 36fd2b63e3 ("bridge: allow
creating/deleting fdb entries via netlink"), it has not been so.
Therefore, the function might fail to search a correct previous address
to be deleted and delete an arbitrary local entry if user has added local
entries manually.
Example of problematic case:
ip link set eth0 address ee:ff:12:34:56:78
brctl addif br0 eth0
bridge fdb add 12:34:56:78:90:ab dev eth0 master
ip link set eth0 address aa:bb:cc:dd:ee:ff
Then, the address 12:34:56:78:90:ab might be deleted instead of
ee:ff:12:34:56:78, the original mac address of eth0.
Address this issue by introducing a new flag, added_by_user, to struct
net_bridge_fdb_entry.
Note that br_fdb_delete_by_port() has to set added_by_user to 0 in cases
like:
ip link set eth0 address 12:34:56:78:90:ab
ip link set eth1 address aa:bb:cc:dd:ee:ff
brctl addif br0 eth0
bridge fdb add aa:bb:cc:dd:ee:ff dev eth0 master
brctl addif br0 eth1
brctl delif br0 eth0
In this case, kernel should delete the user-added entry aa:bb:cc:dd:ee:ff,
but it also should have been added by "brctl addif br0 eth1" originally,
so we don't delete it and treat it a new kernel-created entry.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
An infinite loop is caused when nfs4_establish_lease() fails
with -EACCES. This causes nfs4_handle_reclaim_lease_error()
to sleep a bit and resets the NFS4CLNT_LEASE_EXPIRED bit.
This in turn causes nfs4_state_manager() to try and
reestablished the lease, again, again, again...
The problem is a valid RPCSEC_GSS client is being created when
rpc.gssd is not running.
Link: http://lkml.kernel.org/r/1392066375-16502-1-git-send-email-steved@redhat.com
Fixes: 0ea9de0ea6 (sunrpc: turn warn_gssd() log message into a dprintk())
Reported-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
As far as I can tell we have used a default of 60 seconds for
FIN_WAIT2 timeout for ages (since 2.x times??).
In any case, the timeout these days is 60 seconds, so the 3 min
comment is wrong (and cost me a few minutes of my life when I was
debugging a FIN_WAIT2 related problem in a userspace application and
checked the kernel source for details).
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip rules with iif/oif references do not update:
(detach/attach) across interface renames.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
CC: Willem de Bruijn <willemb@google.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Chris Davis <chrismd@google.com>
CC: Carlo Contavalli <ccontavalli@google.com>
Google-Bug-Id: 12936021
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8df8c56a (6lowpan: Moving generic compression code into 6lowpan_iphc.c)
left pointer 'hdr' unused - remove it.
Detected by Coverity: CID 1164868.
Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch 446fab5933 ("ipv6: enable anycast addresses
as source addresses in ICMPv6 error messages") causes an Oops when pinging a not
set up IPv6 peer on a sit tunnel.
The problem is that ipv6_anycast_destination() uses unconditionally skb_dst(skb),
which is NULL in this case.
The solution is to use instead the ipv6_chk_acast_addr_src() function.
Here are the steps to reproduce it:
modprobe sit
ip link add sit1 type sit remote 10.16.0.121 local 10.16.0.249
ip l s sit1 up
ip -6 a a dev sit1 2001🔢:123 remote 2001🔢:121
ping6 2001🔢:121
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark functions as static in net/sunrpc/svc_xprt.c because they are not
used outside this file.
This eliminates the following warning in net/sunrpc/svc_xprt.c:
net/sunrpc/svc_xprt.c:574:5: warning: no previous prototype for ‘svc_alloc_arg’ [-Wmissing-prototypes]
net/sunrpc/svc_xprt.c:615:18: warning: no previous prototype for ‘svc_get_next_xprt’ [-Wmissing-prototypes]
net/sunrpc/svc_xprt.c:694:6: warning: no previous prototype for ‘svc_add_new_temp_xprt’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Include appropriate header file net/netfilter/nf_tables_core.h in
net/netfilter/nft_lookup.c because it has prototype declaration of
functions defined in net/netfilter/nft_lookup.c.
This eliminates the following warning in net/netfilter/nft_lookup.c:
net/netfilter/nft_lookup.c:133:12: warning: no previous prototype for ‘nft_lookup_module_init’ [-Wmissing-prototypes]
net/netfilter/nft_lookup.c:138:6: warning: no previous prototype for ‘nft_lookup_module_exit’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move prototype declaration of function to header file
include/net/net_namespace.h from net/ipx/af_ipx.c because they are used
by more than one file.
This eliminates the following warning in net/ipx/sysctl_net_ipx.c:
net/ipx/sysctl_net_ipx.c:33:6: warning: no previous prototype for ‘ipx_register_sysctl’ [-Wmissing-prototypes]
net/ipx/sysctl_net_ipx.c:38:6: warning: no previous prototype for ‘ipx_unregister_sysctl’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move prototype declarations of function to header file
include/net/datalink.h from net/ipx/af_ipx.c because they are used by
more than one file.
This eliminates the following warning in net/ipx/pe2.c:
net/ipx/pe2.c:20:24: warning: no previous prototype for ‘make_EII_client’ [-Wmissing-prototypes]
net/ipx/pe2.c:32:6: warning: no previous prototype for ‘destroy_EII_client’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move prototype declaration of functions to header file include/net/ipx.h
from net/ipx/af_ipx.c because they are used by more than one file.
This eliminates the following warning in
net/ipx/ipx_route.c:33:19: warning: no previous prototype for ‘ipxrtr_lookup’ [-Wmissing-prototypes]
net/ipx/ipx_route.c:52:5: warning: no previous prototype for ‘ipxrtr_add_route’ [-Wmissing-prototypes]
net/ipx/ipx_route.c:94:6: warning: no previous prototype for ‘ipxrtr_del_routes’ [-Wmissing-prototypes]
net/ipx/ipx_route.c:149:5: warning: no previous prototype for ‘ipxrtr_route_skb’ [-Wmissing-prototypes]
net/ipx/ipx_route.c:171:5: warning: no previous prototype for ‘ipxrtr_route_packet’ [-Wmissing-prototypes]
net/ipx/ipx_route.c:261:5: warning: no previous prototype for ‘ipxrtr_ioctl’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move prototype definition of function to header file include/net/ipx.h
from net/ipx/ipx_route.c because they are used by more than one file.
This eliminates the following warning from net/ipx/af_ipx.c:
net/ipx/af_ipx.c:193:23: warning: no previous prototype for ‘ipxitf_find_using_net’ [-Wmissing-prototypes]
net/ipx/af_ipx.c:577:5: warning: no previous prototype for ‘ipxitf_send’ [-Wmissing-prototypes]
net/ipx/af_ipx.c:1219:8: warning: no previous prototype for ‘ipx_cksum’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move prototype declaration of functions to header file include/net/dn.h
from net/decnet/af_decnet.c because they are used by more than one file.
This eliminates the following warning in net/decnet/af_decnet.c:
net/decnet/sysctl_net_decnet.c:354:6: warning: no previous prototype for ‘dn_register_sysctl’ [-Wmissing-prototypes]
net/decnet/sysctl_net_decnet.c:359:6: warning: no previous prototype for ‘dn_unregister_sysctl’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move prototype declaration of functions to header file include/net/dn_route.h
from net/decnet/af_decnet.c because it is used by more than one file.
This eliminates the following warning in net/decnet/dn_route.c:
net/decnet/dn_route.c:629:5: warning: no previous prototype for ‘dn_route_rcv’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark functions as static in core/dev.c because they are not used outside
this file.
This eliminates the following warning in core/dev.c:
net/core/dev.c:2806:5: warning: no previous prototype for ‘__dev_queue_xmit’ [-Wmissing-prototypes]
net/core/dev.c:4640:5: warning: no previous prototype for ‘netdev_adjacent_sysfs_add’ [-Wmissing-prototypes]
net/core/dev.c:4650:6: warning: no previous prototype for ‘netdev_adjacent_sysfs_del’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Include appropriate header file net/caif/caif_dev.h in caif/cfsrvl.c
because it has prototype declaration of functions defined in
caif/cfsrvl.c.
This eliminates the following warning in caif/cfsrvl.c:
net/caif/cfsrvl.c:198:6: warning: no previous prototype for ‘caif_free_client’ [-Wmissing-prototypes]
net/caif/cfsrvl.c:208:6: warning: no previous prototype for ‘caif_client_register_refcnt’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Include appropriate header file net/caif/caif_dev.h in caif/caif_dev.c
because it has prototype declarations of function defined in
caif/caif_dev.c.
This eliminates the following file in caif/caif_dev.c:
net/caif/caif_dev.c:303:6: warning: no previous prototype for ‘caif_enroll_dev’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark function as static in net/9p/client.c because it is not used
outside this file.
This eliminates the following warning in net/9p/client.c:
net/9p/client.c:207:18: warning: no previous prototype for ‘p9_fcall_alloc’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville says:
====================
Please pull this batch of fixes intended for the 3.14 stream!
For the mac80211 bits, Johannes says:
"This is just a collection of small fixes, the commit logs explain the
details. The only thing that isn't strictly a fix is the 5/10 MHz
enabling, I had forgotten this and there's little point in waiting
longer. The patch simply removes the force-disable code that I put in
when there was a problem with the userspace API (that has long been
fixed.)"
For the iwlwifi bits, Emmanuel says:
"I have an important fix that disables A band in case the driver thought
it was enabled, and the firmware disagreed. We ended up making the
firmware unhappy. I also fix the station table in AP mode and fix the
scan while we have BT working.
Johannes removes a static variable that could potentially lead to to
issues on multi-device setups and disables scheduled scan to avoid
issues with old versions of wpa_supplicant.
A small fix from David on scan and a few new device IDs for 7265."
On top of that...
Oleksij Rempel adds a USB ID to the ar5523 driver and changes the
default powersave setting for ath9k_htc to "off", due to observed
stability issues (based on an equivalent ath9k patch).
Stanislaw Gruszka similarly disables powersave for a couple of rt2x00
drivers. He also fixes a couple of scheduling while atomic issues
in ath9k_htc.
Sujith Manoharan rounds-out the powersave disables with one for ath9k.
He also fixes a build prolem with ath9k on ARM and fixes an ath9k Tx
power calculation.
Finally, Andrea Merello fixes a couple of lingering DMA mapping
problems in the rtl8180 driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/nftables/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes, mostly nftables
fixes, most relevantly they are:
* Fix a crash in the h323 conntrack NAT helper due to expectation list
corruption, from Alexey Dobriyan.
* A couple of RCU race fixes for conntrack, one manifests by hitting BUG_ON
in nf_nat_setup_info() and the destroy path, patches from Andrey Vagin and
me.
* Dump direction attribute in nft_ct only if it is set, from Arturo
Borrero.
* Fix IPVS bug in its own connection tracking system that may lead to
copying only 4 bytes of the IPv6 address when initializing the
ip_vs_conn object, from Michal Kubecek.
* Fix -EBUSY errors in nftables when deleting the rules, chain and tables
in a row due mixture of asynchronous and synchronous object releasing,
from me.
* Three fixes for the nf_tables set infrastructure when using intervals and
mappings, from me.
* Four patches to fixing the nf_tables log, reject and ct expressions from
the new inet table, from Patrick McHardy.
* Fix memory overrun in the map that is used to dynamically allocate names
from anonymous sets, also from Patrick.
* Fix a potential oops if you dump a set with NFPROTO_UNSPEC and a table
name, from Patrick McHardy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f38a5181d9 ("ceph: Convert to immutable biovecs") introduced
a NULL pointer dereference, which broke rbd in -rc1. Fix it.
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
When an ABORT is sent, aborting a connection, the sender quite reasonably
forgets about the connection. If another frame is received, another ABORT
will be sent. When the receiver gets it, it no longer applies to an extant
connection, so an ABORT is sent, and so on...
Prevent this by never sending a rejection for an ABORT packet.
Signed-off-by: Tim Smith <tim@electronghost.co.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
The UDP checksum was already verified in rxrpc_data_ready() - which calls
skb_checksum_complete() - as the RxRPC packet header contains no checksum of
its own. Subsequent calls to skb_copy_and_csum_datagram_iovec() are thus
redundant and are, in any case, being passed only a subset of the UDP payload -
so the checksum will always fail if that path is taken.
So there is no need to check skb->ip_summed in rxrpc_recvmsg(), and no need for
the csum_copy_error: exit path.
Signed-off-by: Tim Smith <tim@electronghost.co.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Handling redirect replies requires both map_sem and request_mutex.
Taking map_sem unconditionally near the top of handle_reply() avoids
possible race conditions that arise from releasing request_mutex to be
able to acquire map_sem in redirect reply case. (Lock ordering is:
map_sem, request_mutex, crush_mutex.)
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Factor out logic from ceph_osdc_start_request() into a new helper,
__ceph_osdc_start_request(). ceph_osdc_start_request() now amounts to
taking locks and calling __ceph_osdc_start_request().
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
It makes no sense to inline a rarely used function meant for debugging
only that is called a total of five times in the main evaluation loop.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix access to uninitialized data for end interval elements. The
element data part is uninitialized in interval end elements.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch fixes several things which related to the handling of
end interval elements:
* Chain use underflow with intervals and map: If you add a rule
using intervals+map that introduces a loop, the error path of the
rbtree set decrements the chain refcount for each side of the
interval, leading to a chain use counter underflow.
* Don't copy the data part of the end interval element since, this
area is uninitialized and this confuses the loop detection code.
* Don't allocate room for the data part of end interval elements
since this is unused.
So, after this patch the idea is that end interval elements don't
have a data part.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
This combination is not allowed since end interval elements cannot
contain data.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
TCP pacing depends on an accurate srtt estimation.
Current srtt estimation is using jiffie resolution,
and has an artificial offset of at least 1 ms, which can produce
slowdowns when FQ/pacing is used, especially in DC world,
where typical rtt is below 1 ms.
We are planning a switch to usec resolution for linux-3.15,
but in the meantime, this patch removes the 1 ms offset.
All we need is to have tp->srtt minimal value of 1 to differentiate
the case of srtt being initialized or not, not 8.
The problematic behavior was observed on a 40Gbit testbed,
where 32 concurrent netperf were reaching 12Gbps of aggregate
speed, instead of line speed.
This patch also has the effect of reporting more accurate srtt and send
rates to iproute2 ss command as in :
$ ss -i dst cca2
Netid State Recv-Q Send-Q Local Address:Port
Peer Address:Port
tcp ESTAB 0 0 10.244.129.1:56984
10.244.129.2:12865
cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send
463.4Mbps rcv_rtt:1 rcv_space:29200
tcp ESTAB 0 390960 10.244.129.1:60247
10.244.129.2:50204
cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51
send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200
Reported-by: Vytautas Valancius <valas@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 93d8bf9fb8 ("bridge: cleanup netpoll code") introduced
a check in br_netpoll_enable(), but this check is incorrect for
br_netpoll_setup(). This patch moves the code after the check
into __br_netpoll_enable() and calls it in br_netpoll_setup().
For br_add_if(), the check is still needed.
Fixes: 93d8bf9fb8 ("bridge: cleanup netpoll code")
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Tested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_alloc_send_pskb() & sk_page_frag_refill()
have a loop trying high order allocations to prepare
skb with low number of fragments as this increases performance.
Problem is that under memory pressure/fragmentation, this can
trigger OOM while the intent was only to try the high order
allocations, then fallback to order-0 allocations.
We had various reports from unexpected regressions.
According to David, setting __GFP_NORETRY should be fine,
as the asynchronous compaction is still enabled, and this
will prevent OOM from kicking as in :
CFSClientEventm invoked oom-killer: gfp_mask=0x42d0, order=3, oom_adj=0,
oom_score_adj=0, oom_score_badness=2 (enabled),memcg_scoring=disabled
CFSClientEventm
Call Trace:
[<ffffffff8043766c>] dump_header+0xe1/0x23e
[<ffffffff80437a02>] oom_kill_process+0x6a/0x323
[<ffffffff80438443>] out_of_memory+0x4b3/0x50d
[<ffffffff8043a4a6>] __alloc_pages_may_oom+0xa2/0xc7
[<ffffffff80236f42>] __alloc_pages_nodemask+0x1002/0x17f0
[<ffffffff8024bd23>] alloc_pages_current+0x103/0x2b0
[<ffffffff8028567f>] sk_page_frag_refill+0x8f/0x160
[<ffffffff80295fa0>] tcp_sendmsg+0x560/0xee0
[<ffffffff802a5037>] inet_sendmsg+0x67/0x100
[<ffffffff80283c9c>] __sock_sendmsg_nosec+0x6c/0x90
[<ffffffff80283e85>] sock_sendmsg+0xc5/0xf0
[<ffffffff802847b6>] __sys_sendmsg+0x136/0x430
[<ffffffff80284ec8>] sys_sendmsg+0x88/0x110
[<ffffffff80711472>] system_call_fastpath+0x16/0x1b
Out of Memory: Kill process 2856 (bash) score 9999 or sacrifice child
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, to make netconsole start over IPv6, the source address
needs to be specified. Without a source address, netpoll_parse_options
assumes we're setting up over IPv4 and the destination IPv6 address is
rejected.
Check if the IP version has been forced by a source address before
checking for a version mismatch when parsing the destination address.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit efe4208f47:
'ipv6: make lookups simpler and faster' broke initialization of local source
address on accepted ipv6 sockets. Before the mentioned commit receive address
was copied along with the contents of ipv6_pinfo in sctp_v6_create_accept_sk.
Now when it is moved, it has to be copied separately.
This also fixes lksctp's ipv6 regression in a sense that test_getname_v6, TC5 -
'getsockname on a connected server socket' now passes.
Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We may lost race if we flush the rule-set (which happens asynchronously
via call_rcu) and we try to remove the table (that userspace assumes
to be empty).
Fix this by recovering synchronous rule and chain deletion. This was
introduced time ago before we had no batch support, and synchronous
rule deletion performance was not good. Now that we have the batch
support, we can just postpone the purge of old rule in a second step
in the commit phase. All object deletions are synchronous after this
patch.
As a side effect, we save memory as we don't need rcu_head per rule
anymore.
Cc: Patrick McHardy <kaber@trash.net>
Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The log and queue expressions both store the family during ->init() and
use it to deliver packets. This is wrong when used in NFPROTO_INET since
they should both deliver to the actual AF of the packet, not the dummy
NFPROTO_INET.
Use the family from the hook ops to fix this.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
During channel context assignment, the interface should
be found by interface iteration, so we need to assign the
pointer before the channel context.
Reported-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Tested-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The "new" fragmentation code (since my rewrite almost 5 years ago)
erroneously sets skb->len rather than using skb_trim() to adjust
the length of the first fragment after copying out all the others.
This leaves the skb tail pointer pointing to after where the data
originally ended, and thus causes the encryption MIC to be written
at that point, rather than where it belongs: immediately after the
data.
The impact of this is that if software encryption is done, then
a) encryption doesn't work for the first fragment, the connection
becomes unusable as the first fragment will never be properly
verified at the receiver, the MIC is practically guaranteed to
be wrong
b) we leak up to 8 bytes of plaintext (!) of the packet out into
the air
This is only mitigated by the fact that many devices are capable
of doing encryption in hardware, in which case this can't happen
as the tail pointer is irrelevant in that case. Additionally,
fragmentation is not used very frequently and would normally have
to be configured manually.
Fix this by using skb_trim() properly.
Cc: stable@vger.kernel.org
Fixes: 2de8e0d999 ("mac80211: rewrite fragmentation")
Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, when a station leaves an IBSS network, the
corresponding BSS is not dropped from cfg80211 if there are
other active stations in the network. But, the small
window that is present when trying to determine a station's
status based on IEEE80211_IBSS_MERGE_INTERVAL introduces
a race.
Instead of trying to keep the BSS, always remove it when
leaving an IBSS network. There is not much benefit to retain
the BSS entry since it will be added with a subsequent join
operation.
This fixes an issue where a dangling BSS entry causes ath9k
to wait for a beacon indefinitely.
Cc: <stable@vger.kernel.org>
Reported-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Sujith Manoharan <c_manoha@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When the driver cannot start the AP or when the assignement
of the beacon goes wrong, we need to unassign the vif.
Cc: stable@vger.kernel.org
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Due to the previous commit, when a scan finishes, it is in theory
possible to hit the following sequence:
1. interface starts being removed
2. scan is cancelled by driver and cfg80211 is notified
3. scan done work is scheduled
4. interface is removed completely, rdev->scan_req is freed,
event sent to userspace but scan done work remains pending
5. new scan is requested on another virtual interface
6. scan done work runs, freeing the still-running scan
To fix this situation, hang on to the scan done message and block
new scans while that is the case, and only send the message from
the work function, regardless of whether the scan_req is already
freed from interface removal. This makes step 5 above impossible
and changes step 6 to be
5. scan done work runs, sending the scan done message
As this can't work for wext, so we send the message immediately,
but this shouldn't be an issue since we still return -EBUSY.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When an interface/wdev is removed, any ongoing scan should be
cancelled by the driver. This will make it call cfg80211, which
only queues a work struct. If interface/wdev removal is quick
enough, this can leave the scan request pending and processed
only after the interface is gone, causing a use-after-free.
Fix this by making sure the scan request is not pending after
the interface is destroyed. We can't flush or cancel the work
item due to locking concerns, but when it'll run it shouldn't
find anything to do. This leaves a potential issue, if a new
scan gets requested before the work runs, it prematurely stops
the running scan, potentially causing another crash. I'll fix
that in the next patch.
This was particularly observed with P2P_DEVICE wdevs, likely
because freeing them is quicker than freeing netdevs.
Reported-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Fixes: 4a58e7c384 ("cfg80211: don't "leak" uncompleted scans")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Unfortunately I forgot this during the merge window, but the
patch seems small enough to go in as a fix. The userspace API
bug that was the reason for disabling it has long been fixed.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When the netlink skb is exhausted split_start is left set. In the
subsequent retry, with a larger buffer, the dump is continued from the
failing point instead of from the beginning.
This was causing my rt28xx based USB dongle to now show up when
running "iw list" with an old iw version without split dump support.
Cc: stable@vger.kernel.org
Fixes: 3713b4e364 ("nl80211: allow splitting wiphy information in dumps")
Signed-off-by: Pontus Fuchs <pontus.fuchs@gmail.com>
[avoid the entire workaround when state->split is set]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee80211_start_roc_work() might add a new roc
to existing roc, and tell cfg80211 it has already
started.
However, this might happen before the roc cookie
was set, resulting in REMAIN_ON_CHANNEL (started)
event with null cookie. Consequently, it can make
wpa_supplicant go out of sync.
Fix it by setting the roc cookie earlier.
Cc: stable@vger.kernel.org
Signed-off-by: Eliad Peller <eliad@wizery.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a reject module for NFPROTO_INET. It does nothing but dispatch
to the AF-specific modules based on the hook family.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently the nft_reject module depends on symbols from ipv6. This is
wrong since no generic module should force IPv6 support to be loaded.
Split up the module into AF-specific and a generic part.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This capabilities weren't propagated to the radiotap header.
We don't set here the VHT_KNOWN / MCS_HAVE flag because not
all the low level drivers will know how to properly flag
the frames, hence the low level driver will be in charge
of setting IEEE80211_RADIOTAP_MCS_HAVE_FEC,
IEEE80211_RADIOTAP_MCS_HAVE_STBC and / or
IEEE80211_RADIOTAP_VHT_KNOWN_STBC according to its
capabilities.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ieee80211_rx_status.flags is full. Define a new vht_flag
variable to be able to set more VHT related flags and make
room in flags.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Acked-by: Kalle Valo <kvalo@qca.qualcomm.com> [ath10k]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The purpose of this housekeeping is to make some room for
VHT flags. The radiotap vendor fields weren't in use.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For the reject module, we need to add AF-specific implementations to
get rid of incorrect module dependencies. Try to load an AF-specific
module first and fall back to generic modules.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The key was missing in the list of valid keys, add it.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit c9c8e48597 (netfilter: nf_tables: dump sets in all existing families)
changed nft_ctx_init_from_setattr() to only look up the address family if it
is not NFPROTO_UNSPEC. However if it is NFPROTO_UNSPEC and a table attribute
is given, nftables_afinfo_lookup() will dereference the NULL afi pointer.
Fix by checking for non-NULL afi and also move a check added by that commit
to the proper position.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The map that is used to allocate anonymous sets is indeed
BITS_PER_BYTE * PAGE_SIZE long.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With this patch, the conntrack refcount is initially set to zero and
it is bumped once it is added to any of the list, so we fulfill
Eric's golden rule which is that all released objects always have a
refcount that equals zero.
Andrey Vagin reports that nf_conntrack_free can't be called for a
conntrack with non-zero ref-counter, because it can race with
nf_conntrack_find_get().
A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
ref-counter says that this conntrack is used. So when we release
a conntrack with non-zero counter, we break this assumption.
CPU1 CPU2
____nf_conntrack_find()
nf_ct_put()
destroy_conntrack()
...
init_conntrack
__nf_conntrack_alloc (set use = 1)
atomic_inc_not_zero(&ct->use) (use = 2)
if (!l4proto->new(ct, skb, dataoff, timeouts))
nf_conntrack_free(ct); (use = 2 !!!)
...
__nf_conntrack_alloc (set use = 1)
if (!nf_ct_key_equal(h, tuple, zone))
nf_ct_put(ct); (use = 0)
destroy_conntrack()
/* continue to work with CT */
After applying the path "[PATCH] netfilter: nf_conntrack: fix RCU
race in nf_conntrack_find_get" another bug was triggered in
destroy_conntrack():
<4>[67096.759334] ------------[ cut here ]------------
<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
...
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G C --------------- 2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>] [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] Call Trace:
<4>[67096.760255] [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255] [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
<4>[67096.760255] [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
<4>[67096.760255] [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
<4>[67096.760255] [<ffffffff81484419>] nf_iterate+0x69/0xb0
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814845d4>] nf_hook_slow+0x74/0x110
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910
<4>[67096.760255] [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0
<4>[67096.760255] [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140
<4>[67096.760255] [<ffffffff81444f97>] sock_sendmsg+0x117/0x140
<4>[67096.760255] [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255] [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255] [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255] [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814457c9>] sys_sendto+0x139/0x190
<4>[67096.760255] [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255] [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255] [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255] [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
I have reused the original title for the RFC patch that Andrey posted and
most of the original patch description.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Andrew Vagin <avagin@parallels.com>
Cc: Florian Westphal <fw@strlen.de>
Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
In association request frames, there may be IEs passed from
userspace (such as interworking IEs) between HT and VHT, so
add code to insert those inbetween them.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In probe request frames, the VHT IEs should come before any
vendor IEs, but after interworking and similar, so add code
to order them correctly wrt. the IEs passed from userspace.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Update the operating mode field is needed when an association
request contains the operating mode notification element and
it's not just changed later on the fly.
Signed-off-by: Marek Kwaczynski <marek.kwaczynski@tieto.com>
[clarify commit log, comments & fix whitespace]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In case we will get regulatory request with rule
where max_bandwidth_khz is set to 0 handle this
case as a special one.
If max_bandwidth_khz == 0 we should calculate maximum
available bandwidth base on all frequency contiguous rules.
In case we need auto calculation we just have to set:
country PL: DFS-ETSI
(2402 - 2482 @ 40), (N/A, 20)
(5170 - 5250 @ AUTO), (N/A, 20)
(5250 - 5330 @ AUTO), (N/A, 20), DFS
(5490 - 5710 @ 80), (N/A, 27), DFS
This mean we will calculate maximum bw for rules where
AUTO (N/A) were set, 160MHz (5330 - 5170) in example above.
So we will get:
(5170 - 5250 @ 160), (N/A, 20)
(5250 - 5330 @ 160), (N/A, 20), DFS
In other case:
country FR: DFS-ETSI
(2402 - 2482 @ 40), (N/A, 20)
(5170 - 5250 @ AUTO), (N/A, 20)
(5250 - 5330 @ 80), (N/A, 20), DFS
(5490 - 5710 @ 80), (N/A, 27), DFS
We will get 80MHz (5250 - 5170):
(5170 - 5250 @ 80), (N/A, 20)
(5250 - 5330 @ 80), (N/A, 20), DFS
Base on this calculations we will set correct channel
bandwidth flags (eg. IEEE80211_CHAN_NO_80MHZ).
We don't need any changes in CRDA or internal regulatory.
Signed-off-by: Janusz Dziedzic <janusz.dziedzic@tieto.com>
[extend nl80211 description a bit, fix typo]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Lets look at destroy_conntrack:
hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.
The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but in this moment a few readers
still can use the conntrack. Then this conntrack is released and another
thread creates conntrack with the same address and the equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.
But this conntrack is not initialized yet, so it can not be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().
Florian Westphal suggested to check the confirm bit too. I think it's
right.
task 1 task 2 task 3
nf_conntrack_find_get
____nf_conntrack_find
destroy_conntrack
hlist_nulls_del_rcu
nf_conntrack_free
kmem_cache_free
__nf_conntrack_alloc
kmem_cache_alloc
memset(&ct->tuplehash[IP_CT_DIR_MAX],
if (nf_ct_is_dying(ct))
if (!nf_ct_tuple_equal()
I'm not sure, that I have ever seen this race condition in a real life.
Currently we are investigating a bug, which is reproduced on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently,
we don't have any other explanation for this.
<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>] [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622] [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697] [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770] [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843] [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919] [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991] [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063] [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133] [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207] [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277] [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346] [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419] [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491] [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562] [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638] [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712] [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785] [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858] [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936] [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006] [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081] [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151] [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229] [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303] [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378] [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454] [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531] [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607] [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The following commands trigger an oops:
# nft -i
nft> add table filter
nft> add chain filter input { type filter hook input priority 0; }
nft> add chain filter test
nft> add rule filter input jump test
nft> delete chain filter test
We need to check the chain use counter before allowing destruction since
we might have references from sets or jump rules.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=69341
Reported-by: Matthew Ife <deleriux1@gmail.com>
Tested-by: Matthew Ife <deleriux1@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We want to make sure that the information that we get from the kernel can
be reinjected without troubles. The kernel shouldn't return an attribute
that is not required, or even prohibited.
Dumping unconditionally NFTA_CT_DIRECTION could lead an application in
userspace to interpret that the attribute was originally set, while it
was not.
Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With subfacets, we'd expect megaflow updates message to carry
the original micro flow. If not, EINVAL is returned and kernel
logs an error message. Now that the user space subfacet layer is
removed, it is expected that flow updates can arrive with a
micro flow other than the original. Change the return code to
EEXIST and remove the kernel error log message.
Reported-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
ovs_flow_free() is not called under ovs-lock during packet
execute path (ovs_packet_cmd_execute()). Since packet execute
does not touch flow->mask, there is no need to take that
lock either. So move assert in case where flow->mask is checked.
Found by code inspection.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
commit 43d4be9cb5 (openvswitch: Allow user space
to announce ability to accept unaligned Netlink messages) introduced
OVS_DP_ATTR_USER_FEATURES netlink attribute in datapath responses,
but the attribute size was not taken into account in ovs_dp_cmd_msg_size().
Signed-off-by: Daniele Di Proietto <daniele.di.proietto@gmail.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Both mega flow mask's reference counter and per flow table mask list
should only be accessed when holding ovs_mutex() lock. However
this is not true with ovs_flow_table_flush(). The patch fixes this bug.
Reported-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
While the zerocopy method is correctly omitted if user space
does not support unaligned Netlink messages. The attribute is
still not padded correctly as skb_zerocopy() will not ensure
padding and the attribute size is no longer pre calculated
though nla_reserve() which ensured padding previously.
This patch applies appropriate padding if a linear data copy
was performed in skb_zerocopy().
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
RCU writer side should use rcu_dereference_protected() and not
rcu_dereference(), fix that. This also removes the "suspicious RCU usage"
warning seen when running with CONFIG_PROVE_RCU.
Also, don't use rcu_assign_pointer/rcu_dereference for pointers
which are invisible beyond the udp offload code.
Fixes: b582ef0 ('net: Add GRO support for UDP encapsulating protocols')
Reported-by: Eric Dumazet <edumazet@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was possible to break interface combinations in
the following way:
combo 1: iftype = AP, num_ifaces = 2, num_chans = 2,
combo 2: iftype = AP, num_ifaces = 1, num_chans = 1, radar = HT20
With the above interface combinations it was
possible to:
step 1. start AP on DFS channel by matching combo 2
step 2. start AP on non-DFS channel by matching combo 1
This was possible beacuse (step 2) did not consider
if other interfaces require radar detection.
The patch changes how cfg80211 tracks channels -
instead of channel itself now a complete chandef
is stored.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When receiving an IBSS_JOINED event select the BSS object
based on the {bssid, channel} couple rather than the bssid
only.
With the current approach if another cell having the same
BSSID (but using a different channel) exists then cfg80211
picks up the wrong BSS object.
The result is a mismatching channel configuration between
cfg80211 and the driver, that can lead to any sort of
problem.
The issue can be triggered by having an IBSS sitting on
given channel and then asking the driver to create a new
cell using the same BSSID but with a different frequency.
By passing the channel to cfg80211_get_bss() we can solve
this ambiguity and retrieve/create the correct BSS object.
All the users of cfg80211_ibss_joined() have been changed
accordingly.
Moreover WARN when cfg80211_ibss_joined() gets a NULL
channel as argument and remove a bogus call of the same
function in ath6kl (it does not make sense to call
cfg80211_ibss_joined() with a zero BSSID on ibss-leave).
Cc: Kalle Valo <kvalo@qca.qualcomm.com>
Cc: Arend van Spriel <arend@broadcom.com>
Cc: Bing Zhao <bzhao@marvell.com>
Cc: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Cc: libertas-dev@lists.infradead.org
Acked-by: Kalle Valo <kvalo@qca.qualcomm.com>
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
[minor code cleanup in ath6kl]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch moves the rfkill poll_work to the power efficient workqueue.
This work does not have to be bound to the CPU that scheduled it, hence
the selection of CPU that executes it would be left to the scheduler.
Net result is that CPU idle times would be extended, resulting in power
savings.
This behaviour is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
[zoran.markovic@linaro.org: Rebased to latest kernel, added commit message.
Fixed workqueue selection after suspend/resume cycle.]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For better use of CPU idle time, allow the scheduler to select the CPU
on which the timeout work of regulatory settings would be executed.
This extends CPU idle residency time and saves power.
This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
[zoran.markovic@linaro.org: Rebased to latest kernel. Added commit message.]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Action, disassoc and deauth frames are bufferable, and as such don't
have the PM bit in the frame control field reserved which means we
need to react to the bit when receiving in such a frame.
Fix this by introducing a new helper ieee80211_is_bufferable_mmpdu()
and using it for the RX path that currently ignores the PM bit in
any non-data frames for doze->wake transitions, but listens to it in
all frames for wake->doze transitions, both of which are wrong.
Also use the new helper in the TX path to clean up the code.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Compiling with W=1 found a few variables that are set
but not used (-Wunused-but-set-variable), remove them.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add helper function that will return regdomain.
Follow the driver's regulatory domain, if present,
unless a country IE has been processed or a user
wants to help compliance further.
Signed-off-by: Janusz Dziedzic <janusz.dziedzi@tieto.com>
[remove useless reg variable]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Set preset_chandef in channel switch notification.
In other case we will have old preset_chandef.
Signed-off-by: Janusz Dziedzic <janusz.dziedzic@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Responding to probe requests for scanning clients will often create
excessive retries, as it happens quite often that the scanning client
already left the channel. Therefore do it like hostapd and send probe
responses for wildcard SSID only once by using the noack flag.
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
[fix typo & 'wildcard SSID' in commit log]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The scheduled scan matchsets were intended to be a list of filters,
with the found BSS having to pass at least one of them to be passed
to the host. When the RSSI attribute was added, however, this was
broken and currently wpa_supplicant adds that attribute in its own
matchset; however, it doesn't intend that to mean that anything
that passes the RSSI filter should be passed to the host, instead
it wants it to mean that everything needs to also have higher RSSI.
This is semantically problematic because we have a list of filters
like [ SSID1, SSID2, SSID3, RSSI ] with no real indication which
one should be OR'ed and which one AND'ed.
To fix this, move the RSSI filter attribute into each matchset. As
we need to stay backward compatible, treat a matchset with only the
RSSI attribute as a "default RSSI filter" for all other matchsets,
but only if there are other matchsets (an RSSI-only matchset by
itself is still desirable.)
To make driver implementation easier, keep a global min_rssi_thold
for the entire request as well. The only affected driver is ath6kl.
I found this when I looked into the code after Raja Mani submitted
a patch fixing the n_match_sets calculation to disregard the RSSI,
but that patch didn't address the semantic issue.
Reported-by: Raja Mani <rmani@qti.qualcomm.com>
Acked-by: Luciano Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The ieee80211_vif_use_channel() function calls
ieee80211_vif_release_channel(), so there's no need to call it
explicitly in __ieee80211_sta_join_ibss().
Signed-off-by: Luciano Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The patch adds a missing sdata lock and adds a few
lockdeps for easier maintenance.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
radar_required setting wasn't protected by
local->mtx in some places. This should prevent
from scanning/radar detection/roc colliding.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The sdata->vif.csa_active could be left set after,
e.g. channel context constraints check fail in STA
mode leaving the interface in a strange state for
a brief period of time until it is disconnected.
This was harmless but ugly.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Reviewed-by: Luciano Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If CSA for AP interface failed and the interface
was not stopped afterwards another CSA request
would leak sdata->u.ap.next_beacon.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Reviewed-by: Luciano Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Instead of having
ieee80211_bss_info_change_notify() scattered all
over the place just call it once when finalizing
CSA.
As a side effect this patch adds missing error
checking for IBSS CSA beacon update.
Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Reviewed-by: Luciano Coelho <luciano.coelho@intel.com>
[fix err vs. changed variable usage in ieee80211_csa_finalize()]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There are a few cases, e.g. suspend, where an AP interface is
stopped by the kernel rather than by userspace request, most
commonly when suspending. To let userspace know about this,
send the NL80211_CMD_STOP_AP command as an event every time
an AP interface is stopped. This also happens when userspace
did in fact request the AP stop, but that's not a problem.
For full-MAC drivers this may need to be extended to also
cover cases where the device stopped the AP operation for
some reason, this a bit more complicated because then all
cfg80211 state also needs to be reset; such API is not part
of this patch.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A few places weren't checking that the frame passed to the
function actually has enough data even though the function
clearly documents it must have a payload byte. Make this
safer by changing the function to take an skb and checking
the length inside. The old version is preserved for now as
the rtl* drivers use it and don't have a correct skb.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If there's a policy, then nla_parse() return values must be
checked, otherwise the policy is useless and there's nothing
that ensures the attributes are actually what we expect them
to be.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ATM, {ADD,DEL}BA and BAR frames are sent on the AC matching the TID of
the BA parameters. In the discussion [1] about this patch, Johannes
recalled that it fixed some races with the DELBA and indeed this
behavior was introduced in [2].
While [2] is right for the BARs, the part queueing the {ADD,DEL}BAs on
their BA params TID AC violates the spec and is more a workaround for
some drivers. Helmut expressed some concerns wrt such drivers, in
particular DELBAs in rt2x00.
ATM, DELBAs are sent after a driver has called (hence "purposely")
ieee80211_start_tx_ba_cb_irqsafe and Johannes and Emmanuel gave some
details wrt intentions behind the split of the IEEE80211_AMPDU_TX_STOP_*
given to the driver ampdu_action supposed to call this function, which
could prove handy to people trying to do the right thing in faulty
drivers (if their fw/hw don't get in their way).
[1] http://mid.gmane.org/1390391564-18481-1-git-send-email-karl.beldan@gmail.com
[2] Commit: cf6bb79ad8 ("mac80211: Use appropriate TID for sending BAR, ADDBA and DELBA frames")
Signed-off-by: Karl Beldan <karl.beldan@rivierawaves.com>
Cc: Helmut Schaa <helmut.schaa@googlemail.com>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
There's not a single rate control algorithm actually in
a separate module where the module refcount would be
required. Similarly, there's no specific rate control
module.
Therefore, all the module handling code in rate control
is really just dead code, so remove it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The sample table is initialized only once at module start, so
is really __read_mostly. Additionally, the code to init it can
be marked __init since it will never be needed again, it is
likely automatically inlined into the __init function already
by the compiler, so this doesn't really make a difference.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The wiphy privid (to identify wiphys) and the cfg80211
ops should both be const, so change them to be.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Change the code to allow making all the rate control ops
const, nothing ever needs to change them. Also change all
drivers to make use of this and mark the ops const.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Restrict our published beamformee STS capability according
to the AP value.
Some AP show bad behaviour in interoperability testing
when our capabilities are better.
Signed-off-by: Eyal Shapira <eyalx.shapira@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Initialize the buffer to all zeroes, otherwise the stack
data might be interpreted as the TID, which is likely to
fail completely.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In case the given ROC duration is 0, update it to a minimal value before
setting the ieee80211_roc_work parameters, so it also would be valid
for cases where scan is in progress or there are other ROCs queued.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since we copy the mesh_id_len into our own data structures,
use it consistently and don't sometimes use cfg80211's copy.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The mesh_id is an array so can't ever be NULL, it looks
like mesh_id_len check was intended instead. However,
since the previous patch, cfg80211 does the check, so
just remove it here.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Before allowing userspace to initiate a channel switch, check
that it's actually connected in some sense. Also use a more
appropriate error code for the not connected case.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The mesh precedence value in ieee80211_channel_switch
should be incremented or set to 1 only if this is the
initiator of mesh channel switch. For non-initiator,
the precedence value has updated using the Mesh
Channel Switch Parameters element. Fix this.
Signed-off-by: Chun-Yeow Yeoh <yeohchunyeow@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fix the following issues in reg_process_hint():
1. Add verification that wiphy is valid before processing
NL80211_REGDOMAIN_SET_BY_COUNTRY_IE.
2. Free the request in case of invalid initiator.
3. Remove WARN_ON check on reg_request->alpha2 as it is not a
pointer.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>