Commit Graph

32079 Commits

Author SHA1 Message Date
Heiner Kallweit
ecab67015e ipv6: Avoid unnecessary temporary addresses being generated
tmp_prefered_lft is an offset to ifp->tstamp, not now. Therefore
age needs to be added to the condition.

Age calculation in ipv6_create_tempaddr is different from the one
in addrconf_verify and doesn't consider ADDRCONF_TIMER_FUZZ_MINUS.
This can cause age in ipv6_create_tempaddr to be less than the one
in addrconf_verify and therefore unnecessary temporary address to
be generated.
Use age calculation as in addrconf_modify to avoid this.

Signed-off-by: Heiner Kallweit <heiner.kallweit@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-13 15:49:14 -04:00
Yang Yingliang
d59b7d8059 net_sched: return nla_nest_end() instead of skb->len
nla_nest_end() already has return skb->len, so replace
return skb->len with return nla_nest_end instead().

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-13 15:39:20 -04:00
Jan Beulich
f9708b4302 consolidate duplicate code is skb_checksum_setup() helpers
consolidate duplicate code is skb_checksum_setup() helpers

Realizing that the skb_maybe_pull_tail() calls in the IP-protocol
specific portions of both helpers are terminal ones (i.e. no further
pulls are expected), their maximum size to be pulled can be made match
their minimal size needed, thus making the code identical and hence
possible to be moved into another helper.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-13 15:16:29 -04:00
John W. Linville
42775a34d2 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
Conflicts:
	drivers/net/wireless/ath/ath9k/recv.c
2014-03-13 14:21:43 -04:00
Daniel Borkmann
433131ba03 net: sctp: remove NULL check in sctp_assoc_update_retran_path
This is basically just to let Coverity et al shut up. Remove an
unneeded NULL check in sctp_assoc_update_retran_path().

It is safe to remove it, because in sctp_assoc_update_retran_path()
we iterate over the list of transports, our own transport which is
asoc->peer.retran_path included. In the iteration, we skip the
list head element and transports in state SCTP_UNCONFIRMED.

Such transports came from peer addresses received in INIT/INIT-ACK
address parameters. They are not yet confirmed by a heartbeat and
not available for data transfers.

We know however that in the list of transports, even if it contains
such elements, it at least contains our asoc->peer.retran_path as
well, so even if next to that element, we only encounter
SCTP_UNCONFIRMED transports, we are always going to fall back to
asoc->peer.retran_path through sctp_trans_elect_best(), as that is
for sure not SCTP_UNCONFIRMED as per fbdf501c93 ("sctp: Do no
select unconfirmed transports for retransmissions").

Whenever we call sctp_trans_elect_best() it will give us a non-NULL
element back, and therefore when we break out of the loop, we are
guaranteed to have a non-NULL transport pointer, and can remove
the NULL check.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-13 13:23:52 -04:00
Joe Perches
b80edf0b52 netfilter: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite awhile now.

Make these uses consistent with the rest of the kernel.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-13 14:13:19 +01:00
Matthew Leach
dbb490b965 net: socket: error on a negative msg_namelen
When copying in a struct msghdr from the user, if the user has set the
msg_namelen parameter to a negative value it gets clamped to a valid
size due to a comparison between signed and unsigned values.

Ensure the syscall errors when the user passes in a negative value.

Signed-off-by: Matthew Leach <matthew.leach@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 16:29:24 -04:00
Jon Paul Maloy
5c311421a2 tipc: eliminate redundant lookups in registry
As an artefact from the native interface, the message sending functions
in the port takes a port ref as first parameter, and then looks up in
the registry to find the corresponding port pointer. This despite the
fact that the only currently existing caller, tipc_sock, already knows
this pointer.

We change the signature of these functions to take a struct tipc_port*
argument, and remove the redundant lookups.

We also remove an unmotivated extra lookup in the function
socket.c:auto_connect(), and, as the lookup functions tipc_port_deref()
and ref_deref() now become unused, we remove these two functions.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:53:49 -04:00
Jon Paul Maloy
58ed944241 tipc: align usage of variable names and macros in socket
The practice of naming variables in TIPC is inconistent, sometimes
even within the same file.

In this commit we align variable names and declarations within
socket.c, and function and macro names within socket.h. We also
reduce the number of conversion macros to two, in order to make
usage less obsure.

These changes are purely cosmetic.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:53:49 -04:00
Jon Paul Maloy
3b4f302d85 tipc: eliminate redundant locking
The three functions tipc_portimportance(), tipc_portunreliable() and
tipc_portunreturnable() and their corresponding tipc_set* functions,
are all grabbing port_lock when accessing the targeted port. This is
unnecessary in the current code, since these calls only are made from
within socket downcalls, already protected by sock_lock.

We remove the redundant locking. Also, since the functions now become
trivial one-liners, we move them to port.h and make them inline.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:53:49 -04:00
Jon Paul Maloy
24be34b5a0 tipc: eliminate upcall function pointers between port and socket
Due to the original one-to-many relation between port and user API
layers, upcalls to the API have been performed via function pointers,
installed in struct tipc_port at creation. Since this relation now
always is one-to-one, we can instead use ordinary function calls.

We remove the function pointers 'dispatcher' and ´wakeup' from
struct tipc_port, and replace them with calls to the renamed
functions tipc_sk_rcv() and tipc_sk_wakeup().

At the same time we change the name and signature of the functions
tipc_createport() and tipc_deleteport() to reflect their new role
as mere initialization/destruction functions.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:53:49 -04:00
Jon Paul Maloy
8826cde655 tipc: aggregate port structure into socket structure
After the removal of the tipc native API the relation between
a tipc_port and its API types is strictly one-to-one, i.e, the
latter can now only be a socket API. There is therefore no need
to allocate struct tipc_port and struct sock independently.

In this commit, we aggregate struct tipc_port into struct tipc_sock,
hence saving both CPU cycles and structure complexity.

There are no functional changes in this commit, except for the
elimination of the separate allocation/freeing of tipc_port.
All other changes are just adaptatons to the new data structure.

This commit also opens up for further code simplifications and
code volume reduction, something we will do in later commits.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:53:49 -04:00
Jon Paul Maloy
f9fef18c6d tipc: remove redundant 'peer_name' field in struct tipc_sock
The field 'peer_name' in struct tipc_sock is redundant, since
this information already is available from tipc_port, to which
tipc_sock has a reference.

We remove the field, and ensure that peer node and peer port
info instead is fetched via the functions that already exist
for this purpose.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:53:49 -04:00
Jon Paul Maloy
978813ee89 tipc: replace reference table rwlock with spinlock
The lock for protecting the reference table is declared as an
RWLOCK, although it is only used in write mode, never in read
mode.

We redefine it to become a spinlock.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:53:49 -04:00
Steffen Klassert
4a93f5095a flowcache: Fix resource leaks on namespace exit.
We leak an active timer, the hotcpu notifier and all allocated
resources when we exit a namespace. Fix this by introducing a
flow_cache_fini() function where we release the resources before
we exit.

Fixes: ca925cf153 ("flowcache: Make flow cache name space aware")
Reported-by: Jakub Kicinski <moorray3@wp.pl>
Tested-by: Jakub Kicinski <moorray3@wp.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:31:18 -04:00
Joe Perches
184593c734 tipc: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite awhile now.

Make these uses consistent with the rest of the kernel.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:28:06 -04:00
Joe Perches
ec633eb5ff ieee802154: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite awhile now.

Make these uses consistent with the rest of the kernel.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:28:06 -04:00
Joe Perches
2b8837aeaa net: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite awhile now.

Make these uses consistent with the rest of the kernel.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:28:06 -04:00
Joe Perches
f0e78826e4 8021q: Convert uses of __constant_<foo> to <foo>
The use of __constant_<foo> has been unnecessary for quite awhile now.

Make these uses consistent with the rest of the kernel.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 15:28:06 -04:00
Florian Westphal
14e1a97776 netfilter: connlimit: use kmem_cache for conn objects
We might allocate thousands of these (one object per connection).
Use distinct kmem cache to permit simplte tracking on how many
objects are currently used by the connlimit match via the sysfs.

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-12 13:55:03 +01:00
Florian Westphal
3bcc5fdf1b netfilter: connlimit: move insertion of new element out of count function
Allows easier code-reuse in followup patches.

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-12 13:55:02 +01:00
Florian Westphal
d9ec4f1ee2 netfilter: connlimit: improve packet-to-closed-connection logic
Instead of freeing the entry from our list and then adding
it back again in the 'packet to closing connection' case just keep the
matching entry around.  Also drop the found_ct != NULL test as
nf_ct_tuplehash_to_ctrack is just container_of().

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-12 13:55:01 +01:00
Florian Westphal
15cfd52895 netfilter: connlimit: factor hlist search into new function
Simplifies followup patch that introduces separate locks for each of
the hash slots.

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-12 13:55:01 +01:00
Eric Dumazet
fba373d2bb pkt_sched: add cond_resched() to class and qdisc dump
We have seen delays of more than 50ms in class or qdisc dumps, in case
device is under high TX stress, even with the prior 4KB per skb limit.

Add cond_resched() to give a chance to higher prio tasks to get cpu.

Signed-off-by; Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 23:54:23 -04:00
Eric Dumazet
15dc36ebbb pkt_sched: do not use rcu in tc_dump_qdisc()
Like all rtnetlink dump operations, we hold RTNL in tc_dump_qdisc(),
so we do not need to use rcu protection to protect list of netdevices.

This will allow preemption to occur, thus reducing latencies.
Following patch adds explicit cond_resched() calls.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 23:54:23 -04:00
Linus Lüssing
20a599bec9 bridge: multicast: enable snooping on general queries only
Without this check someone could easily create a denial of service
by injecting multicast-specific queries to enable the bridge
snooping part if no real querier issuing periodic general queries
is present on the link which would result in the bridge wrongly
shutting down ports for multicast traffic as the bridge did not learn
about these listeners.

With this patch the snooping code is enabled upon receiving valid,
general queries only.

Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 23:22:10 -04:00
Linus Lüssing
9ed973cc40 bridge: multicast: add sanity check for general query destination
General IGMP and MLD queries are supposed to have the multicast
link-local all-nodes address as their destination according to RFC2236
section 9, RFC3376 section 4.1.12/9.1, RFC2710 section 8 and RFC3810
section 5.1.15.

Without this check, such malformed IGMP/MLD queries can result in a
denial of service: The queries are ignored by most IGMP/MLD listeners
therefore they will not respond with an IGMP/MLD report. However,
without this patch these malformed MLD queries would enable the
snooping part in the bridge code, potentially shutting down the
according ports towards these hosts for multicast traffic as the
bridge did not learn about these listeners.

Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 23:22:10 -04:00
Eric Dumazet
c3f9b01849 tcp: tcp_release_cb() should release socket ownership
Lars Persson reported following deadlock :

-000 |M:0x0:0x802B6AF8(asm) <-- arch_spin_lock
-001 |tcp_v4_rcv(skb = 0x8BD527A0) <-- sk = 0x8BE6B2A0
-002 |ip_local_deliver_finish(skb = 0x8BD527A0)
-003 |__netif_receive_skb_core(skb = 0x8BD527A0, ?)
-004 |netif_receive_skb(skb = 0x8BD527A0)
-005 |elk_poll(napi = 0x8C770500, budget = 64)
-006 |net_rx_action(?)
-007 |__do_softirq()
-008 |do_softirq()
-009 |local_bh_enable()
-010 |tcp_rcv_established(sk = 0x8BE6B2A0, skb = 0x87D3A9E0, th = 0x814EBE14, ?)
-011 |tcp_v4_do_rcv(sk = 0x8BE6B2A0, skb = 0x87D3A9E0)
-012 |tcp_delack_timer_handler(sk = 0x8BE6B2A0)
-013 |tcp_release_cb(sk = 0x8BE6B2A0)
-014 |release_sock(sk = 0x8BE6B2A0)
-015 |tcp_sendmsg(?, sk = 0x8BE6B2A0, ?, ?)
-016 |sock_sendmsg(sock = 0x8518C4C0, msg = 0x87D8DAA8, size = 4096)
-017 |kernel_sendmsg(?, ?, ?, ?, size = 4096)
-018 |smb_send_kvec()
-019 |smb_send_rqst(server = 0x87C4D400, rqst = 0x87D8DBA0)
-020 |cifs_call_async()
-021 |cifs_async_writev(wdata = 0x87FD6580)
-022 |cifs_writepages(mapping = 0x852096E4, wbc = 0x87D8DC88)
-023 |__writeback_single_inode(inode = 0x852095D0, wbc = 0x87D8DC88)
-024 |writeback_sb_inodes(sb = 0x87D6D800, wb = 0x87E4A9C0, work = 0x87D8DD88)
-025 |__writeback_inodes_wb(wb = 0x87E4A9C0, work = 0x87D8DD88)
-026 |wb_writeback(wb = 0x87E4A9C0, work = 0x87D8DD88)
-027 |wb_do_writeback(wb = 0x87E4A9C0, force_wait = 0)
-028 |bdi_writeback_workfn(work = 0x87E4A9CC)
-029 |process_one_work(worker = 0x8B045880, work = 0x87E4A9CC)
-030 |worker_thread(__worker = 0x8B045880)
-031 |kthread(_create = 0x87CADD90)
-032 |ret_from_kernel_thread(asm)

Bug occurs because __tcp_checksum_complete_user() enables BH, assuming
it is running from softirq context.

Lars trace involved a NIC without RX checksum support but other points
are problematic as well, like the prequeue stuff.

Problem is triggered by a timer, that found socket being owned by user.

tcp_release_cb() should call tcp_write_timer_handler() or
tcp_delack_timer_handler() in the appropriate context :

BH disabled and socket lock held, but 'owned' field cleared,
as if they were running from timer handlers.

Fixes: 6f458dfb40 ("tcp: improve latencies of timer triggered events")
Reported-by: Lars Persson <lars.persson@axis.com>
Tested-by: Lars Persson <lars.persson@axis.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 16:45:59 -04:00
Michael S. Tsirkin
1fd819ecb9 skbuff: skb_segment: orphan frags before copying
skb_segment copies frags around, so we need
to copy them carefully to avoid accessing
user memory after reporting completion to userspace
through a callback.

skb_segment doesn't normally happen on datapath:
TSO needs to be disabled - so disabling zero copy
in this case does not look like a big deal.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 16:26:38 -04:00
Michael S. Tsirkin
1a4cedaf65 skbuff: skb_segment: s/fskb/list_skb/
fskb is unrelated to frag: it's coming from
frag_list. Rename it list_skb to avoid confusion.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 16:26:38 -04:00
Michael S. Tsirkin
df5771ffef skbuff: skb_segment: s/skb/head_skb/
rename local variable to make it easier to tell at a glance that we are
dealing with a head skb.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 16:26:38 -04:00
Michael S. Tsirkin
4e1beba12d skbuff: skb_segment: s/skb_frag/frag/
skb_frag can in fact point at either skb
or fskb so rename it generally "frag".

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 16:26:38 -04:00
Michael S. Tsirkin
8cb19905e9 skbuff: skb_segment: s/frag/nskb_frag/
frag points at nskb, so name it appropriately

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 16:26:38 -04:00
Li RongQing
090f1166c6 ipv6: ip6_forward: perform skb->pkt_type check at the beginning
Packets which have L2 address different from ours should be
already filtered before entering into ip6_forward().

Perform that check at the beginning to avoid processing such packets.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-11 00:37:42 -04:00
Peter Boström
dd38743b4c vlan: Set correct source MAC address with TX VLAN offload enabled
With TX VLAN offload enabled the source MAC address for frames sent using the
VLAN interface is currently set to the address of the real interface. This is
wrong since the VLAN interface may be configured with a different address.

The bug was introduced in commit 2205369a31
("vlan: Fix header ops passthru when doing TX VLAN offload.").

This patch sets the source address before calling the create function of the
real interface.

Signed-off-by: Peter Boström <peter.bostrom@netrounds.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10 22:21:51 -04:00
Eric Dumazet
d32d9bb85c flowcache: restore a single flow_cache kmem_cache
It is not legal to create multiple kmem_cache having the same name.

flowcache can use a single kmem_cache, no need for a per netns
one.

Fixes: ca925cf153 ("flowcache: Make flow cache name space aware")
Reported-by: Jakub Kicinski <moorray3@wp.pl>
Tested-by: Jakub Kicinski <moorray3@wp.pl>
Tested-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10 21:45:11 -04:00
Eric Dumazet
2818fa0fa0 pkt_sched: fq: do not hold qdisc lock while allocating memory
Resizing fq hash table allocates memory while holding qdisc spinlock,
with BH disabled.

This is definitely not good, as allocation might sleep.

We can drop the lock and get it when needed, we hold RTNL so no other
changes can happen at the same time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10 16:17:52 -04:00
Eric Dumazet
431a91242d tcp: timestamp SYN+DATA messages
All skb in socket write queue should be properly timestamped.

In case of FastOpen, we special case the SYN+DATA 'message' as we
queue in socket wrote queue the two fallback skbs:

1) SYN message by itself.
2) DATA segment by itself.

We should make sure these skbs have proper timestamps.

Add a WARN_ON_ONCE() to eventually catch future violations.

Fixes: 740b0f1841 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10 16:15:54 -04:00
Alexander Aring
3772ab1d37 6lowpan: reassembly: fix access of ctl table entry
Correct offset is 3 of the 6lowpanfrag_max_datagram_size value in proc
entry ctl table and not 2.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10 16:03:03 -04:00
Linus Torvalds
e6a4b6f5ea Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro.

Clean up file table accesses (get rid of fget_light() in favor of the
fdget() interface), add proper file position locking.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  get rid of fget_light()
  sockfd_lookup_light(): switch to fdget^W^Waway from fget_light
  vfs: atomic f_pos accesses as per POSIX
  ocfs2 syncs the wrong range...
2014-03-10 12:57:26 -07:00
Eric Dumazet
37314363cd pkt_sched: move the sanity test in qdisc_list_add()
The WARN_ON(root == &noop_qdisc)) added in qdisc_list_add()
can trigger in normal conditions when devices are not up.
It should be done only right before the list_add_tail() call.

Fixes: e57a784d8c ("pkt_sched: set root qdisc before change() in attach_default_qdiscs()")
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Mirco Tischler <mt-ml@gmx.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10 15:44:21 -04:00
Eric Dumazet
746e349980 l2tp: fix unused variable warning
net/l2tp/l2tp_core.c:1111:15: warning: unused variable
'sk' [-Wunused-variable]

Fixes: 31c70d5956 ("l2tp: keep original skb ownership")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10 15:32:24 -04:00
Eric Dumazet
9063e21fb0 netlink: autosize skb lengthes
One known problem with netlink is the fact that NLMSG_GOODSIZE is
really small on PAGE_SIZE==4096 architectures, and it is difficult
to know in advance what buffer size is used by the application.

This patch adds an automatic learning of the size.

First netlink message will still be limited to ~4K, but if user used
bigger buffers, then following messages will be able to use up to 16KB.

This speedups dump() operations by a large factor and should be safe
for legacy applications.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-10 13:56:26 -04:00
Al Viro
00e188ef6a sockfd_lookup_light(): switch to fdget^W^Waway from fget_light
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-03-10 11:44:41 -04:00
dingtianhong
be14cc98e9 vlan: use use ether_addr_equal_64bits to instead of ether_addr_equal
Ether_addr_equal_64bits is more efficient than ether_addr_equal, and
can be used when each argument is an array within a structure that
contains at least two bytes of data beyond the array, so it is safe
to use it for vlan, and make sense for fast path.

Cc: Joe Perches <joe@perches.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-09 19:03:51 -04:00
dingtianhong
375f67df28 vlan: slight optimization for vlan_do_receive()
According Joe's suggestion, maybe it'd be faster to add an unlikely to
the test for PCKET_OTHERHOST, so I add it and see whether the performance
could be better, although the differences is so small and negligible, but
it is hard to catch that any lower device would set the skb type to
PACKET_OTHERHOST, so most of time, I think it make sense to add unlikely
for the test.

Cc: Joe Perches <joe@perches.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-09 19:03:51 -04:00
Eric Dumazet
2d8d40afd1 pkt_sched: fq: do not hold qdisc lock while allocating memory
Resizing fq hash table allocates memory while holding qdisc spinlock,
with BH disabled.

This is definitely not good, as allocation might sleep.

We can drop the lock and get it when needed, we hold RTNL so no other
changes can happen at the same time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-08 19:09:10 -05:00
Patrick McHardy
a4c2e8beba netfilter: nft_nat: fix family validation
The family in the NAT expression is basically completely useless since
we have it available during runtime anyway. Nevertheless it is used to
decide the NAT family, so at least validate it properly. As we don't
support cross-family NAT, it needs to match the family of the table the
expression exists in.

Unfortunately we can't remove it completely since we need to dump it for
userspace (*sigh*), so at least reduce the memory waste.

Additionally clean up the module init function by removing useless
temporary variables.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-08 12:35:19 +01:00
Patrick McHardy
d46f2cd260 netfilter: nft_ct: remove family from struct nft_ct
Since we have the context available during destruction again, we can
remove the family from the private structure.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-08 12:35:19 +01:00
Patrick McHardy
ab9da5c19f netfilter: nf_tables: restore notifications for anonymous set destruction
Since we have the context available again, we can restore notifications
for destruction of anonymous sets.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-08 12:35:18 +01:00
Patrick McHardy
62472bcefb netfilter: nf_tables: restore context for expression destructors
In order to fix set destruction notifications and get rid of unnecessary
members in private data structures, pass the context to expressions'
destructor functions again.

In order to do so, replace various members in the nft_rule_trans structure
by the full context.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-08 12:35:17 +01:00
Patrick McHardy
a36e901cf6 netfilter: nf_tables: clean up nf_tables_trans_add() argument order
The context argument logically comes first, and this is what every other
function dealing with contexts does.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-08 12:35:16 +01:00
Alexander Aring
37147652cf 6lowpan: reassembly: fix return of init function
This patch adds a missing return after fragmentation init. Otherwise we
register a sysctl interface and deregister it afterwards which makes no
sense.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-07 17:11:19 -05:00
David S. Miller
3894004289 RxRPC development
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAUxX4KBOxKuMESys7AQJqkRAAmxU1GEGESCE/F/U3Hm9pRFeg9kj6+7BO
 7vPrUzB5KEnu4eXjCc60a+BVmV0fJRiS7zsFWzmcKgjIZzJdobbStD944tdz0kIx
 w7RFKOr+D6+zyo0zr1Fj0DwlhvuOpfBsisYtNC5J7OTWc3whgnIkroLt16FNFasu
 gxe7oo+KHiTXIJuIf6E/p5iK1te7CUsIIc1v01hgVVD91Udk4odWyl288BzzcXuC
 5WFYaMnOG+9ndo+4/mCt5dN0FDYL2a1djyknPdXP6HOSBoqVBjPxFGxuN8O9HFMV
 ghWjNG72YGDk0jy6ghiQijdoxE4l2iy0/gYGzSiCMSp59aHHlkq/XIhOoZWD96Nw
 TAoo0PbTFrqRzfvOQ8lMULDneB0KP20gI/AUbgnU3CGataALntAKiVOhrOtXaG21
 trE77omDA5FtMlQ60DabTl6RveMZXCZ+LwHPuZ0Euxo4uykmDLdOChZADGoqZA+5
 VzI1Qa4YK+9B/fkgUyiIzS4syCHGISFIIQ7/Srr26UMlpuK4WDPA9a11hKxrbrkA
 JxEkAk9i/KD3/XsWWfxEMPgNarwHOyDQQ4/XYKx20O2N9RDrao+cNw6ghlwZS+mo
 bmx/sqWmMDaQoMNqL8Q+bq2W7omG8wNgv1pn6MVnFSfiJHAMDhNZkBzp0uCLGFpA
 8WYj9OBcUgY=
 =PK2i
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-devel-20140304' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
net-next: AF_RXRPC fixes and development

Here are some AF_RXRPC fixes:

 (1) Fix to remove incorrect checksum calculation made during recvmsg().  It's
     unnecessary to try to do this there since we check the checksum before
     reading the RxRPC header from the packet.

 (2) Fix to prevent the sending of an ABORT packet in response to another
     ABORT packet and inducing a storm.

 (3) Fix UDP MTU calculation from parsing ICMP_FRAG_NEEDED packets where we
     don't handle the ICMP packet not specifying an MTU size.

And development patches:

 (4) Add sysctls for configuring RxRPC parameters, specifically various delays
     pertaining to ACK generation, the time before we resend a packet for
     which we don't receive an ACK, the maximum time a call is permitted to
     live and the amount of time transport, connection and dead call
     information is cached.

 (5) Improve ACK packet production by adjusting the handling of ACK_REQUESTED
     packets, ignoring the MORE_PACKETS flag, delaying the production of
     otherwise immediate ACK_IDLE packets and delaying all ACK_IDLE production
     (barring the call termination) to half a second.

 (6) Add more sysctl parameters to expose the Rx window size, the maximum
     packet size that we're willing to receive and the number of jumbo rxrpc
     packets we're willing to handle in a single UDP packet.

 (7) Request ACKs on alternate DATA packets so that the other side doesn't
     wait till we fill up the Tx window.

 (8) Use a RCU hash table to look up the rxrpc_call for an incoming packet
     rather than stepping through a hierarchy involving several spinlocks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-07 16:08:59 -05:00
John W. Linville
97bd5f0054 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem 2014-03-07 15:09:32 -05:00
Eric Dumazet
31c70d5956 l2tp: keep original skb ownership
There is no reason to orphan skb in l2tp.

This breaks things like per socket memory limits, TCP Small queues...

Fix this before more people copy/paste it.

This is very similar to commit 8f646c922d
("vxlan: keep original skb ownership")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-07 14:37:55 -05:00
Eric Dumazet
2196269242 tcp: do not leak non zero tstamp in output packets
Usage of skb->tstamp should remain private to TCP stack
(only set on packets on write queue, not on cloned ones)

Otherwise, packets given to loopback interface with a non null tstamp
can confuse netif_rx() / net_timestamp_check()

Other possibility would be to clear tstamp in loopback_xmit(),
as done in skb_scrub_packet()

Fixes: 740b0f1841 ("tcp: switch rtt estimations to usec resolution")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-07 14:32:57 -05:00
Patrick McHardy
ce6eb0d7c8 netfilter: nft_hash: bug fixes and resizing
The hash set type is very broken and was never meant to be merged in this
state. Missing RCU synchronization on element removal, leaking chain
refcounts when used as a verdict map, races during lookups, a fixed table
size are probably just some of the problems. Luckily it is currently
never chosen by the kernel when the rbtree type is also available.

Rewrite it to be usable.

The new implementation supports automatic hash table resizing using RCU,
based on Paul McKenney's and Josh Triplett's algorithm "Optimized Resizing
For RCU-Protected Hash Tables" described in [1].

Resizing doesn't require a second list head in the elements, it works by
chosing a hash function that remaps elements to a predictable set of buckets,
only resizing by integral factors and

- during expansion: linking new buckets to the old bucket that contains
  elements for any of the new buckets, thereby creating imprecise chains,
  then incrementally seperating the elements until the new buckets only
  contain elements that hash directly to them.

- during shrinking: linking the hash chains of all old buckets that hash
  to the same new bucket to form a single chain.

Expansion requires at most the number of elements in the longest hash chain
grace periods, shrinking requires a single grace period.

Due to the requirement of having hash chains/elements linked to multiple
buckets during resizing, homemade single linked lists are used instead of
the existing list helpers, that don't support this in a clean fashion.
As a side effect, the amount of memory required per element is reduced by
one pointer.

Expansion is triggered when the load factors exceeds 75%, shrinking when
the load factor goes below 30%. Both operations are allowed to fail and
will be retried on the next insertion or removal if their respective
conditions still hold.

[1] http://dl.acm.org/citation.cfm?id=2002181.2002192

Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:42:07 +01:00
Jesper Dangaard Brouer
93bb0ceb75 netfilter: conntrack: remove central spinlock nf_conntrack_lock
nf_conntrack_lock is a monolithic lock and suffers from huge contention
on current generation servers (8 or more core/threads).

Perf locking congestion is clear on base kernel:

-  72.56%  ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock_bh
   - _raw_spin_lock_bh
      + 25.33% init_conntrack
      + 24.86% nf_ct_delete_from_lists
      + 24.62% __nf_conntrack_confirm
      + 24.38% destroy_conntrack
      + 0.70% tcp_packet
+   2.21%  ksoftirqd/6  [kernel.kallsyms]    [k] fib_table_lookup
+   1.15%  ksoftirqd/6  [kernel.kallsyms]    [k] __slab_free
+   0.77%  ksoftirqd/6  [kernel.kallsyms]    [k] inet_getpeer
+   0.70%  ksoftirqd/6  [nf_conntrack]       [k] nf_ct_delete
+   0.55%  ksoftirqd/6  [ip_tables]          [k] ipt_do_table

This patch change conntrack locking and provides a huge performance
improvement.  SYN-flood attack tested on a 24-core E5-2695v2(ES) with
10Gbit/s ixgbe (with tool trafgen):

 Base kernel:   810.405 new conntrack/sec
 After patch: 2.233.876 new conntrack/sec

Notice other floods attack (SYN+ACK or ACK) can easily be deflected using:
 # iptables -A INPUT -m state --state INVALID -j DROP
 # sysctl -w net/netfilter/nf_conntrack_tcp_loose=0

Use an array of hashed spinlocks to protect insertions/deletions of
conntracks into the hash table. 1024 spinlocks seem to give good
results, at minimal cost (4KB memory). Due to lockdep max depth,
1024 becomes 8 if CONFIG_LOCKDEP=y

The hash resize is a bit tricky, because we need to take all locks in
the array. A seqcount_t is used to synchronize the hash table users
with the resizing process.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:41:13 +01:00
Jesper Dangaard Brouer
ca7433df3a netfilter: conntrack: seperate expect locking from nf_conntrack_lock
Netfilter expectations are protected with the same lock as conntrack
entries (nf_conntrack_lock).  This patch split out expectations locking
to use it's own lock (nf_conntrack_expect_lock).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:41:01 +01:00
Jesper Dangaard Brouer
e1b207dac1 netfilter: avoid race with exp->master ct
Preparation for disconnecting the nf_conntrack_lock from the
expectations code.  Once the nf_conntrack_lock is lifted, a race
condition is exposed.

The expectations master conntrack exp->master, can race with
delete operations, as the refcnt increment happens too late in
init_conntrack().  Race is against other CPUs invoking
->destroy() (destroy_conntrack()), or nf_ct_delete() (via timeout
or early_drop()).

Avoid this race in nf_ct_find_expectation() by using atomic_inc_not_zero(),
and checking if nf_ct_is_dying() (path via nf_ct_delete()).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:40:47 +01:00
Jesper Dangaard Brouer
b7779d06f9 netfilter: conntrack: spinlock per cpu to protect special lists.
One spinlock per cpu to protect dying/unconfirmed/template special lists.
(These lists are now per cpu, a bit like the untracked ct)
Add a @cpu field to nf_conn, to make sure we hold the appropriate
spinlock at removal time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:40:38 +01:00
Jesper Dangaard Brouer
b476b72a0f netfilter: trivial code cleanup and doc changes
Changes while reading through the netfilter code.

Added hint about how conntrack nf_conn refcnt is accessed.
And renamed repl_hash to reply_hash for readability

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:40:04 +01:00
Pablo Neira Ayuso
52af2bfcc0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Via Simon Horman:

====================
* Whitespace cleanup spotted by checkpatch.pl from Tingwei Liu.
* Section conflict cleanup, basically removal of one wrong __read_mostly,
  from Andi Kleen.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-03-07 11:37:11 +01:00
Nicolas Dichtel
870a2df4ca xfrm: rename struct xfrm_filter
iproute2 already defines a structure with that name, let's use another one to
avoid any conflict.

CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-03-07 08:12:37 +01:00
Tingwei Liu
411fd527bc ipvs: Reduce checkpatch noise in ip_vs_lblc.c
Add whitespace after operator and put open brace { on the previous line

Cc: Tingwei Liu <liutingwei@hisense.com>
Cc: lvs-devel@vger.kernel.org
Signed-off-by: Tingwei Liu <tingw.liu@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
2014-03-07 12:36:21 +09:00
Andi Kleen
c61b0c1328 sections, ipvs: Remove useless __read_mostly for ipvs genl_ops
const __read_mostly does not make any sense, because const
data is already read-only. Remove the __read_mostly
for the ipvs genl_ops. This avoids a LTO
section conflict compile problem.

Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: Simon Horman <horms@verge.net.au>
Cc: Patrick McHardy <kaber@trash.net>
Cc: lvs-devel@vger.kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
2014-03-07 12:36:21 +09:00
Sabrina Dubroca
c88507fbad ipv6: don't set DST_NOCOUNT for remotely added routes
DST_NOCOUNT should only be used if an authorized user adds routes
locally. In case of routes which are added on behalf of router
advertisments this flag must not get used as it allows an unlimited
number of routes getting added remotely.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 17:29:27 -05:00
Eric Dumazet
6f542efcbc net_sched: htb: do not acquire qdisc lock in dump operations
htb_dump() and htb_dump_class() do not strictly need to acquire
qdisc lock to fetch qdisc and/or class parameters.

We hold RTNL and no changes can occur.

This reduces by 50% qdisc lock pressure while doing tc qdisc|class dump
operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 17:24:54 -05:00
Alexander Aring
cefc8c8a7c 6lowpan: move 6lowpan header to include/net
This header is used by bluetooth and ieee802154 branch. This patch
move this header to the include/net directory to avoid a use of a
relative path in include.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 17:21:38 -05:00
Alexander Aring
91c6922745 6lowpan: add missing include of net/ipv6.h
The 6lowpan.h file contains some static inline function which use
internal ipv6 api structs. Add a include of ipv6.h to be sure that it's
known before.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 17:21:38 -05:00
Anton Nayshtut
d2d273ffab ipv6: Fix exthdrs offload registration.
Without this fix, ipv6_exthdrs_offload_init doesn't register IPPROTO_DSTOPTS
offload, but returns 0 (as the IPPROTO_ROUTING registration actually succeeds).

This then causes the ipv6_gso_segment to drop IPv6 packets with IPPROTO_DSTOPTS
header.

The issue detected and the fix verified by running MS HCK Offload LSO test on
top of QEMU Windows guests, as this test sends IPv6 packets with
IPPROTO_DSTOPTS.

Signed-off-by: Anton Nayshtut <anton@swortex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 16:35:55 -05:00
Anton Blanchard
0a13404dd3 net: unix socket code abuses csum_partial
The unix socket code is using the result of csum_partial to
hash into a lookup table:

	unix_hash_fold(csum_partial(sunaddr, len, 0));

csum_partial is only guaranteed to produce something that can be
folded into a checksum, as its prototype explains:

 * returns a 32-bit number suitable for feeding into itself
 * or csum_tcpudp_magic

The 32bit value should not be used directly.

Depending on the alignment, the ppc64 csum_partial will return
different 32bit partial checksums that will fold into the same
16bit checksum.

This difference causes the following testcase (courtesy of
Gustavo) to sometimes fail:

#include <sys/socket.h>
#include <stdio.h>

int main()
{
	int fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

	int i = 1;
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &i, 4);

	struct sockaddr addr;
	addr.sa_family = AF_LOCAL;
	bind(fd, &addr, 2);

	listen(fd, 128);

	struct sockaddr_storage ss;
	socklen_t sslen = (socklen_t)sizeof(ss);
	getsockname(fd, (struct sockaddr*)&ss, &sslen);

	fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0);

	if (connect(fd, (struct sockaddr*)&ss, sslen) == -1){
		perror(NULL);
		return 1;
	}
	printf("OK\n");
	return 0;
}

As suggested by davem, fix this by using csum_fold to fold the
partial 32bit checksum into a 16bit checksum before using it.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 16:19:33 -05:00
Florian Westphal
e588e2f286 inet: frag: make sure forced eviction removes all frags
Quoting Alexander Aring:
  While fragmentation and unloading of 6lowpan module I got this kernel Oops
  after few seconds:

  BUG: unable to handle kernel paging request at f88bbc30
  [..]
  Modules linked in: ipv6 [last unloaded: 6lowpan]
  Call Trace:
   [<c012af4c>] ? call_timer_fn+0x54/0xb3
   [<c012aef8>] ? process_timeout+0xa/0xa
   [<c012b66b>] run_timer_softirq+0x140/0x15f

Problem is that incomplete frags are still around after unload; when
their frag expire timer fires, we get crash.

When a netns is removed (also done when unloading module), inet_frag
calls the evictor with 'force' argument to purge remaining frags.

The evictor loop terminates when accounted memory ('work') drops to 0
or the lru-list becomes empty.  However, the mem accounting is done
via percpu counters and may not be accurate, i.e. loop may terminate
prematurely.

Alter evictor to only stop once the lru list is empty when force is
requested.

Reported-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Reported-by: Alexander Aring <alex.aring@gmail.com>
Tested-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 15:28:45 -05:00
David S. Miller
f7324acd98 tcp: Use NET_ADD_STATS instead of NET_ADD_STATS_BH in tcp_event_new_data_sent()
Can be invoked from non-BH context.

Based upon a patch by Eric Dumazet.

Fixes: f19c29e3e3 ("tcp: snmp stats for Fast Open, SYN rtx, and data pkts")
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 15:19:43 -05:00
Erik Hugne
2892505ea1 tipc: don't log disabled tasklet handler errors
Failure to schedule a TIPC tasklet with tipc_k_signal because the
tasklet handler is disabled is not an error. It means TIPC is
currently in the process of shutting down. We remove the error
logging in this case.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 14:46:24 -05:00
Erik Hugne
1bb8dce57f tipc: fix memory leak during module removal
When the TIPC module is removed, the tasklet handler is disabled
before all other subsystems. This will cause lingering publications
in the name table because the node_down tasklets responsible to
clean up publications from an unreachable node will never run.
When the name table is shut down, these publications are detected
and an error message is logged:
tipc: nametbl_stop(): orphaned hash chain detected
This is actually a memory leak, introduced with commit
993b858e37 ("tipc: correct the order
of stopping services at rmmod")

Instead of just logging an error and leaking memory, we free
the orphaned entries during nametable shutdown.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 14:46:24 -05:00
Erik Hugne
edcc0511b5 tipc: drop subscriber connection id invalidation
When a topology server subscriber is disconnected, the associated
connection id is set to zero. A check vs zero is then done in the
subscription timeout function to see if the subscriber have been
shut down. This is unnecessary, because all subscription timers
will be cancelled when a subscriber terminates. Setting the
connection id to zero is actually harmful because id zero is the
identity of the topology server listening socket, and can cause a
race that leads to this socket being closed instead.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 14:46:23 -05:00
Ying Xue
fe8e464939 tipc: avoid to unnecessary process switch under non-block mode
When messages are received via tipc socket under non-block mode,
schedule_timeout() is called in tipc_wait_for_rcvmsg(), that is,
the process of receiving messages will be scheduled once although
timeout value passed to schedule_timeout() is 0. The same issue
exists in accept()/wait_for_accept(). To avoid this unnecessary
process switch, we only call schedule_timeout() if the timeout
value is non-zero.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 14:46:23 -05:00
Ying Xue
4652edb70e tipc: fix connection refcount leak
When tipc_conn_sendmsg() calls tipc_conn_lookup() to query a
connection instance, its reference count value is increased if
it's found. But subsequently if it's found that the connection is
closed, the work of sending message is not queued into its server
send workqueue, and the connection reference count is not decreased.
This will cause a reference count leak. To reproduce this problem,
an application would need to open and closes topology server
connections with high intensity.

We fix this by immediately decrementing the connection reference
count if a send fails due to the connection being closed.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 14:46:23 -05:00
Ying Xue
6d4ebeb4df tipc: allow connection shutdown callback to be invoked in advance
Currently connection shutdown callback function is called when
connection instance is released in tipc_conn_kref_release(), and
receiving packets and sending packets are running in different
threads. Even if connection is closed by the thread of receiving
packets, its shutdown callback may not be called immediately as
the connection reference count is non-zero at that moment. So,
although the connection is shut down by the thread of receiving
packets, the thread of sending packets doesn't know it. Before
its shutdown callback is invoked to tell the sending thread its
connection has been closed, the sending thread may deliver
messages by tipc_conn_sendmsg(), this is why the following error
information appears:

"Sending subscription event failed, no memory"

To eliminate it, allow connection shutdown callback function to
be called before connection id is removed in tipc_close_conn(),
which makes the sending thread know the truth in time that its
socket is closed so that it doesn't send message to it. We also
remove the "Sending XXX failed..." error reporting for topology
and config services.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 14:46:23 -05:00
Guillaume Nault
9e9cb6221a l2tp: fix userspace reception on plain L2TP sockets
As pppol2tp_recv() never queues up packets to plain L2TP sockets,
pppol2tp_recvmsg() never returns data to userspace, thus making
the recv*() system calls unusable.

Instead of dropping packets when the L2TP socket isn't bound to a PPP
channel, this patch adds them to its reception queue.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 14:25:39 -05:00
Guillaume Nault
bb5016eac1 l2tp: fix manual sequencing (de)activation in L2TPv2
Commit e0d4435f "l2tp: Update PPP-over-L2TP driver to work over L2TPv3"
broke the PPPOL2TP_SO_SENDSEQ setsockopt. The L2TP header length was
previously computed by pppol2tp_l2t_header_len() before each call to
l2tp_xmit_skb(). Now that header length is retrieved from the hdr_len
session field, this field must be updated every time the L2TP header
format is modified, or l2tp_xmit_skb() won't push the right amount of
data for the L2TP header.

This patch uses l2tp_session_set_header_len() to adjust hdr_len every
time sequencing is (de)activated from userspace (either by the
PPPOL2TP_SO_SENDSEQ setsockopt or the L2TP_ATTR_SEND_SEQ netlink
attribute).

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 14:25:39 -05:00
John W. Linville
73c8daa083 Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 2014-03-06 13:47:16 -05:00
Hannes Frederic Sowa
e90c14835b inet: remove now unused flag DST_NOPEER
Commit e688a60480 ("net: introduce DST_NOPEER dst flag") introduced
DST_NOPEER because because of crashes in ipv6_select_ident called from
udp6_ufo_fragment.

Since commit 916e4cf46d ("ipv6: reuse ip6_frag_id from
ip6_ufo_append_data") we don't call ipv6_select_ident any more from
ip6_ufo_append_data, thus this flag lost its purpose and can be removed.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 13:15:52 -05:00
Josh Hunt
07cf8f5ae2 netfilter: ipset: add forceadd kernel support for hash set types
Adds a new property for hash set types, where if a set is created
with the 'forceadd' option and the set becomes full the next addition
to the set may succeed and evict a random entry from the set.

To keep overhead low eviction is done very simply. It checks to see
which bucket the new entry would be added. If the bucket's pos value
is non-zero (meaning there's at least one entry in the bucket) it
replaces the first entry in the bucket. If pos is zero, then it continues
down the normal add process.

This property is useful if you have a set for 'ban' lists where it may
not matter if you release some entries from the set early.

Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2014-03-06 09:31:43 +01:00
Ilia Mirkin
6843bc3c56 netfilter: ipset: move registration message to init from net_init
Commit 1785e8f473 ("netfiler: ipset: Add net namespace for ipset") moved
the initialization print into net_init, which can get called a lot due
to namespaces. Move it back into init, reduce to pr_info.

Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2014-03-06 09:31:43 +01:00
Vytas Dauksa
4d0e5c076d netfilter: ipset: add markmask for hash:ip,mark data type
Introduce packet mark mask for hash:ip,mark data type. This allows to
set mark bit filter for the ip set.

Change-Id: Id8dd9ca7e64477c4f7b022a1d9c1a5b187f1c96e

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2014-03-06 09:31:42 +01:00
Vytas Dauksa
3b02b56cd5 netfilter: ipset: add hash:ip,mark data type to ipset
Introduce packet mark support with new ip,mark hash set. This includes
userspace and kernelspace code, hash:ip,mark set tests and man page
updates.

The intended use of ip,mark set is similar to the ip:port type, but for
protocols which don't use a predictable port number. Instead of port
number it matches a firewall mark determined by a layer 7 filtering
program like opendpi.

As well as allowing or blocking traffic it will also be used for
accounting packets and bytes sent for each protocol.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2014-03-06 09:31:42 +01:00
Fengguang Wu
9562cf28d1 netfilter: ipset: Add hash: fix coccinelle warnings
net/netfilter/ipset/ip_set_hash_netnet.c:115:8-9: WARNING: return of 0/1 in function 'hash_netnet4_data_list' with return type bool
/c/kernel-tests/src/cocci/net/netfilter/ipset/ip_set_hash_netnet.c:338:8-9: WARNING: return of 0/1 in function 'hash_netnet6_data_list' with return type bool

Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: coccinelle/misc/boolreturn.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2014-03-06 09:31:42 +01:00
Sergey Popovich
35f6e63abe netfilter: ipset: Follow manual page behavior for SET target on list:set
ipset(8) for list:set says:
  The match will try to find a matching entry in the sets and the
  target will try to add an entry to the first set to which it can
  be added.

However real behavior is bit differ from described. Consider example:

 # ipset create test-1-v4 hash:ip family inet
 # ipset create test-1-v6 hash:ip family inet6
 # ipset create test-1 list:set
 # ipset add test-1 test-1-v4
 # ipset add test-1 test-1-v6

 # iptables  -A INPUT -p tcp --destination-port 25 -j SET --add-set test-1 src
 # ip6tables -A INPUT -p tcp --destination-port 25 -j SET --add-set test-1 src

And then when iptables/ip6tables rule matches packet IPSET target
tries to add src from packet to the list:set test-1 where first
entry is test-1-v4 and the second one is test-1-v6.

For IPv4, as it first entry in test-1 src added to test-1-v4
correctly, but for IPv6 src not added!

Placing test-1-v6 to the first element of list:set makes behavior
correct for IPv6, but brokes for IPv4.

This is due to result, returned from ip_set_add() and ip_set_del() from
net/netfilter/ipset/ip_set_core.c when set in list:set equires more
parameters than given or address families do not match (which is this
case).

It seems wrong returning 0 from ip_set_add() and ip_set_del() in
this case, as 0 should be returned only when an element successfuly
added/deleted to/from the set, contrary to ip_set_test() which
returns 0 when no entry exists and >0 when entry found in set.

Signed-off-by: Sergey Popovich <popovich_sergei@mail.ru>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2014-03-06 09:31:41 +01:00
Daniel Borkmann
c485658bae net: sctp: fix skb leakage in COOKIE ECHO path of chunk->auth_chunk
While working on ec0223ec48 ("net: sctp: fix sctp_sf_do_5_1D_ce to
verify if we/peer is AUTH capable"), we noticed that there's a skb
memory leakage in the error path.

Running the same reproducer as in ec0223ec48 and by unconditionally
jumping to the error label (to simulate an error condition) in
sctp_sf_do_5_1D_ce() receive path lets kmemleak detector bark about
the unfreed chunk->auth_chunk skb clone:

Unreferenced object 0xffff8800b8f3a000 (size 256):
  comm "softirq", pid 0, jiffies 4294769856 (age 110.757s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    89 ab 75 5e d4 01 58 13 00 00 00 00 00 00 00 00  ..u^..X.........
  backtrace:
    [<ffffffff816660be>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff8119f328>] kmem_cache_alloc+0xc8/0x210
    [<ffffffff81566929>] skb_clone+0x49/0xb0
    [<ffffffffa0467459>] sctp_endpoint_bh_rcv+0x1d9/0x230 [sctp]
    [<ffffffffa046fdbc>] sctp_inq_push+0x4c/0x70 [sctp]
    [<ffffffffa047e8de>] sctp_rcv+0x82e/0x9a0 [sctp]
    [<ffffffff815abd38>] ip_local_deliver_finish+0xa8/0x210
    [<ffffffff815a64af>] nf_reinject+0xbf/0x180
    [<ffffffffa04b4762>] nfqnl_recv_verdict+0x1d2/0x2b0 [nfnetlink_queue]
    [<ffffffffa04aa40b>] nfnetlink_rcv_msg+0x14b/0x250 [nfnetlink]
    [<ffffffff815a3269>] netlink_rcv_skb+0xa9/0xc0
    [<ffffffffa04aa7cf>] nfnetlink_rcv+0x23f/0x408 [nfnetlink]
    [<ffffffff815a2bd8>] netlink_unicast+0x168/0x250
    [<ffffffff815a2fa1>] netlink_sendmsg+0x2e1/0x3f0
    [<ffffffff8155cc6b>] sock_sendmsg+0x8b/0xc0
    [<ffffffff8155d449>] ___sys_sendmsg+0x369/0x380

What happens is that commit bbd0d59809 clones the skb containing
the AUTH chunk in sctp_endpoint_bh_rcv() when having the edge case
that an endpoint requires COOKIE-ECHO chunks to be authenticated:

  ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ---------->
  <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] ---------
  ------------------ AUTH; COOKIE-ECHO ---------------->
  <-------------------- COOKIE-ACK ---------------------

When we enter sctp_sf_do_5_1D_ce() and before we actually get to
the point where we process (and subsequently free) a non-NULL
chunk->auth_chunk, we could hit the "goto nomem_init" path from
an error condition and thus leave the cloned skb around w/o
freeing it.

The fix is to centrally free such clones in sctp_chunk_destroy()
handler that is invoked from sctp_chunk_free() after all refs have
dropped; and also move both kfree_skb(chunk->auth_chunk) there,
so that chunk->auth_chunk is either NULL (since sctp_chunkify()
allocs new chunks through kmem_cache_zalloc()) or non-NULL with
a valid skb pointer. chunk->skb and chunk->auth_chunk are the
only skbs in the sctp_chunk structure that need to be handeled.

While at it, we should use consume_skb() for both. It is the same
as dev_kfree_skb() but more appropriately named as we are not
a device but a protocol. Also, this effectively replaces the
kfree_skb() from both invocations into consume_skb(). Functions
are the same only that kfree_skb() assumes that the frame was
being dropped after a failure (e.g. for tools like drop monitor),
usage of consume_skb() seems more appropriate in function
sctp_chunk_destroy() though.

Fixes: bbd0d59809 ("[SCTP]: Implement the receive and verification of AUTH chunk")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Vlad Yasevich <yasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05 20:40:25 -05:00
Linus Lüssing
6565b9eeef bridge: multicast: add sanity check for query source addresses
MLD queries are supposed to have an IPv6 link-local source address
according to RFC2710, section 4 and RFC3810, section 5.1.14. This patch
adds a sanity check to ignore such broken MLD queries.

Without this check, such malformed MLD queries can result in a
denial of service: The queries are ignored by any MLD listener
therefore they will not respond with an MLD report. However,
without this patch these malformed MLD queries would enable the
snooping part in the bridge code, potentially shutting down the
according ports towards these hosts for multicast traffic as the
bridge did not learn about these listeners.

Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05 20:40:24 -05:00
David S. Miller
67ddc87f16 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/wireless/ath/ath9k/recv.c
	drivers/net/wireless/mwifiex/pcie.c
	net/ipv6/sit.c

The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.

The two wireless conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05 20:32:02 -05:00
Nikolay Aleksandrov
24b9bf43e9 net: fix for a race condition in the inet frag code
I stumbled upon this very serious bug while hunting for another one,
it's a very subtle race condition between inet_frag_evictor,
inet_frag_intern and the IPv4/6 frag_queue and expire functions
(basically the users of inet_frag_kill/inet_frag_put).

What happens is that after a fragment has been added to the hash chain
but before it's been added to the lru_list (inet_frag_lru_add) in
inet_frag_intern, it may get deleted (either by an expired timer if
the system load is high or the timer sufficiently low, or by the
fraq_queue function for different reasons) before it's added to the
lru_list, then after it gets added it's a matter of time for the
evictor to get to a piece of memory which has been freed leading to a
number of different bugs depending on what's left there.

I've been able to trigger this on both IPv4 and IPv6 (which is normal
as the frag code is the same), but it's been much more difficult to
trigger on IPv4 due to the protocol differences about how fragments
are treated.

The setup I used to reproduce this is: 2 machines with 4 x 10G bonded
in a RR bond, so the same flow can be seen on multiple cards at the
same time. Then I used multiple instances of ping/ping6 to generate
fragmented packets and flood the machines with them while running
other processes to load the attacked machine.

*It is very important to have the _same flow_ coming in on multiple CPUs
concurrently. Usually the attacked machine would die in less than 30
minutes, if configured properly to have many evictor calls and timeouts
it could happen in 10 minutes or so.

An important point to make is that any caller (frag_queue or timer) of
inet_frag_kill will remove both the timer refcount and the
original/guarding refcount thus removing everything that's keeping the
frag from being freed at the next inet_frag_put.  All of this could
happen before the frag was ever added to the LRU list, then it gets
added and the evictor uses a freed fragment.

An example for IPv6 would be if a fragment is being added and is at
the stage of being inserted in the hash after the hash lock is
released, but before inet_frag_lru_add executes (or is able to obtain
the lru lock) another overlapping fragment for the same flow arrives
at a different CPU which finds it in the hash, but since it's
overlapping it drops it invoking inet_frag_kill and thus removing all
guarding refcounts, and afterwards freeing it by invoking
inet_frag_put which removes the last refcount added previously by
inet_frag_find, then inet_frag_lru_add gets executed by
inet_frag_intern and we have a freed fragment in the lru_list.

The fix is simple, just move the lru_add under the hash chain locked
region so when a removing function is called it'll have to wait for
the fragment to be added to the lru_list, and then it'll remove it (it
works because the hash chain removal is done before the lru_list one
and there's no window between the two list adds when the frag can get
dropped). With this fix applied I couldn't kill the same machine in 24
hours with the same setup.

Fixes: 3ef0eb0db4 ("net: frag, move LRU list maintenance outside of
rwlock")

CC: Florian Westphal <fw@strlen.de>
CC: Jesper Dangaard Brouer <brouer@redhat.com>
CC: David S. Miller <davem@davemloft.net>

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05 20:31:42 -05:00
Johannes Berg
864a6040f3 mac80211: clear sequence/fragment number in QoS-null frames
Avoid leaking data by sending uninitialized memory and setting an
invalid (non-zero) fragment number (the sequence number is ignored
anyway) by setting the seq_ctrl field to zero.

Cc: stable@vger.kernel.org
Fixes: 3f52b7e328 ("mac80211: mesh power save basics")
Fixes: ce662b44ce ("mac80211: send (QoS) Null if no buffered frames")
Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2014-03-05 15:49:54 +01:00
Alexander Aring
6092c79fd0 ieee802154: fix whitespace issues in Kconfig
This patch fixes some whitespace issues in Kconfig files of IEEE
802.15.4 subsytem.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-04 20:12:44 -05:00
Simon Wunderlich
960d97f951 cfg80211: add MPLS and 802.21 classification
MPLS labels may contain traffic control information, which should be
evaluated and used by the wireless subsystem if present.

Also check for IEEE 802.21 which is always network control traffic.

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-04 13:51:06 -05:00
John W. Linville
f3b6a488a6 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless
Conflicts:
	drivers/net/wireless/ath/ath9k/recv.c
	drivers/net/wireless/mwifiex/pcie.c
2014-03-04 13:05:12 -05:00
Linus Torvalds
c3bebc71c4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix memory leak in ieee80211_prep_connection(), sta_info leaked on
    error.  From Eytan Lifshitz.

 2) Unintentional switch case fallthrough in nft_reject_inet_eval(),
    from Patrick McHardy.

 3) Must check if payload lenth is a power of 2 in
    nft_payload_select_ops(), from Nikolay Aleksandrov.

 4) Fix mis-checksumming in xen-netfront driver, ip_hdr() is not in the
    correct place when we invoke skb_checksum_setup().  From Wei Liu.

 5) TUN driver should not advertise HW vlan offload features in
    vlan_features.  Fix from Fernando Luis Vazquez Cao.

 6) IPV6_VTI needs to select NET_IPV_TUNNEL to avoid build errors, fix
    from Steffen Klassert.

 7) Add missing locking in xfrm_migrade_state_find(), we must hold the
    per-namespace xfrm_state_lock while traversing the lists.  Fix from
    Steffen Klassert.

 8) Missing locking in ath9k driver, access to tid->sched must be done
    under ath_txq_lock().  Fix from Stanislaw Gruszka.

 9) Fix two bugs in TCP fastopen.  First respect the size argument given
    to tcp_sendmsg() in the fastopen path, and secondly prevent
    tcp_send_syn_data() from potentially using order-5 allocations.
    From Eric Dumazet.

10) Fix handling of default neigh garbage collection params, from Jiri
    Pirko.

11) Fix cwnd bloat and over-inflation of RTT when transmit segmentation
    is in use.  From Eric Dumazet.

12) Missing initialization of Realtek r8169 driver's statistics
    seqlocks.  Fix from Kyle McMartin.

13) Fix RTNL assertion failures in 802.3ad and AB ARP monitor of bonding
    driver, from Ding Tianhong.

14) Bonding slave release race can cause divide by zero, fix from
    Nikolay Aleksandrov.

15) Overzealous return from neigh_periodic_work() causes reachability
    time to not be computed.  Fix from Duain Jiong.

16) Fix regression in ipv6_find_hdr(), it should not return -ENOENT when
    a specific target is specified and found.  From Hans Schillstrom.

17) Fix VLAN tag stripping regression in BNA driver, from Ivan Vecera.

18) Tail loss probe can calculate bogus RTTs due to missing packet
    marking on retransmit.  Fix from Yuchung Cheng.

19) We cannot do skb_dst_drop() in iptunnel_pull_header() because
    multicast loopback detection in later code paths need access to
    skb_rtable().  Fix from Xin Long.

20) The macvlan driver regresses in that it propagates lower device
    offload support disables into itself, causing severe slowdowns when
    running over a bridge.  Provide the software offloads always on
    macvlan devices to deal with this and the regression is gone.  From
    Vlad Yasevich.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits)
  macvlan: Add support for 'always_on' offload features
  net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capable
  ip_tunnel:multicast process cause panic due to skb->_skb_refdst NULL pointer
  net: cpsw: fix cpdma rx descriptor leak on down interface
  be2net: isolate TX workarounds not applicable to Skyhawk-R
  be2net: Fix skb double free in be_xmit_wrokarounds() failure path
  be2net: clear promiscuous bits in adapter->flags while disabling promiscuous mode
  be2net: Fix to reset transparent vlan tagging
  qlcnic: dcb: a couple off by one bugs
  tcp: fix bogus RTT on special retransmission
  hsr: off by one sanity check in hsr_register_frame_in()
  can: remove CAN FD compatibility for CAN 2.0 sockets
  can: flexcan: factor out soft reset into seperate funtion
  can: flexcan: flexcan_remove(): add missing netif_napi_del()
  can: flexcan: fix transition from and to freeze mode in chip_{,un}freeze
  can: flexcan: factor out transceiver {en,dis}able into seperate functions
  can: flexcan: fix transition from and to low power mode in chip_{en,dis}able
  can: flexcan: flexcan_open(): fix error path if flexcan_chip_start() fails
  can: flexcan: fix shutdown: first disable chip, then all interrupts
  USB AX88179/178A: Support D-Link DUB-1312
  ...
2014-03-04 08:44:32 -08:00