37899 Commits

Author SHA1 Message Date
Florian Westphal
48ed7b26fa ipv6: reject locally assigned nexthop addresses
ip -6 addr add dead::1/128 dev eth0
sleep 5
ip -6 route add default via dead::1/128
-> fails
ip -6 addr add dead::1/128 dev eth0
ip -6 route add default via dead::1/128
-> succeeds

reason is that if (nonsensensical) route above is added,
dead::1 is still subject to DAD, so the route lookup will
pick eth0 as outdev due to the prefix route that is added before
DAD work is started.

Add explicit test that checks if nexthop gateway is a local address.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1167969
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 23:23:38 -04:00
Eric Dumazet
946f9eb226 tcp: improve REUSEADDR/NOREUSEADDR cohabitation
inet_csk_get_port() randomization effort tends to spread
sockets on all the available range (ip_local_port_range)

This is unfortunate because SO_REUSEADDR sockets have
less requirements than non SO_REUSEADDR ones.

If an application uses SO_REUSEADDR hint, it is to try to
allow source ports being shared.

So instead of picking a random port number in ip_local_port_range,
lets try first in first half of the range.

This gives more chances to use upper half of the range for the
sockets with strong requirements (not using SO_REUSEADDR)

Note this patch does not add a new sysctl, and only changes
the way we try to pick port number.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Flavio Leitner <fbl@redhat.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 18:55:32 -04:00
Eric Dumazet
f5af1f57a2 inet_hashinfo: remove bsocket counter
We no longer need bsocket atomic counter, as inet_csk_get_port()
calls bind_conflict() regardless of its value, after commit
2b05ad33e1e624e ("tcp: bind() fix autoselection to share ports")

This patch removes overhead of maintaining this counter and
double inet_csk_get_port() calls under pressure.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Flavio Leitner <fbl@redhat.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 18:55:32 -04:00
Jason Baron
ce5ec44099 tcp: ensure epoll edge trigger wakeup when write queue is empty
We currently rely on the setting of SOCK_NOSPACE in the write()
path to ensure that we wake up any epoll edge trigger waiters when
acks return to free space in the write queue. However, if we fail
to allocate even a single skb in the write queue, we could end up
waiting indefinitely.

Fix this by explicitly issuing a wakeup when we detect the condition
of an empty write queue and a return value of -EAGAIN. This allows
userspace to re-try as we expect this to be a temporary failure.

I've tested this approach by artificially making
sk_stream_alloc_skb() return NULL periodically. In that case,
epoll edge trigger waiters will hang indefinitely in epoll_wait()
without this patch.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 18:52:47 -04:00
Daniel Borkmann
c78e1746d3 net: sched: fix call_rcu() race on classifier module unloads
Vijay reported that a loop as simple as ...

  while true; do
    tc qdisc add dev foo root handle 1: prio
    tc filter add dev foo parent 1: u32 match u32 0 0  flowid 1
    tc qdisc del dev foo root
    rmmod cls_u32
  done

... will panic the kernel. Moreover, he bisected the change
apparently introducing it to 78fd1d0ab072 ("netlink: Re-add
locking to netlink_lookup() and seq walker").

The removal of synchronize_net() from the netlink socket
triggering the qdisc to be removed, seems to have uncovered
an RCU resp. module reference count race from the tc API.
Given that RCU conversion was done after e341694e3eb5 ("netlink:
Convert netlink_lookup() to use RCU protected hash table")
which added the synchronize_net() originally, occasion of
hitting the bug was less likely (not impossible though):

When qdiscs that i) support attaching classifiers and,
ii) have at least one of them attached, get deleted, they
invoke tcf_destroy_chain(), and thus call into ->destroy()
handler from a classifier module.

After RCU conversion, all classifier that have an internal
prio list, unlink them and initiate freeing via call_rcu()
deferral.

Meanhile, tcf_destroy() releases already reference to the
tp->ops->owner module before the queued RCU callback handler
has been invoked.

Subsequent rmmod on the classifier module is then not prevented
since all module references are already dropped.

By the time, the kernel invokes the RCU callback handler from
the module, that function address is then invalid.

One way to fix it would be to add an rcu_barrier() to
unregister_tcf_proto_ops() to wait for all pending call_rcu()s
to complete.

synchronize_rcu() is not appropriate as under heavy RCU
callback load, registered call_rcu()s could be deferred
longer than a grace period. In case we don't have any pending
call_rcu()s, the barrier is allowed to return immediately.

Since we came here via unregister_tcf_proto_ops(), there
are no users of a given classifier anymore. Further nested
call_rcu()s pointing into the module space are not being
done anywhere.

Only cls_bpf_delete_prog() may schedule a work item, to
unlock pages eventually, but that is not in the range/context
of cls_bpf anymore.

Fixes: 25d8c0d55f24 ("net: rcu-ify tcf_proto")
Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 18:48:18 -04:00
Alexei Starovoitov
04fd61ab36 bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  bpf_tail_call(ctx, &jmp_table, index);
  ...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
  ...
  if (jmp_table[index])
    return (*jmp_table[index])(ctx);
  ...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.

bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table

Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.

New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.

The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.

Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>

==========
- simplify complex programs by splitting them into a sequence of small programs

- dispatch routine
  For tracing and future seccomp the program may be triggered on all system
  calls, but processing of syscall arguments will be different. It's more
  efficient to implement them as:
  int syscall_entry(struct seccomp_data *ctx)
  {
     bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
     ... default: process unknown syscall ...
  }
  int sys_write_event(struct seccomp_data *ctx) {...}
  int sys_read_event(struct seccomp_data *ctx) {...}
  syscall_jmp_table[__NR_write] = sys_write_event;
  syscall_jmp_table[__NR_read] = sys_read_event;

  For networking the program may call into different parsers depending on
  packet format, like:
  int packet_parser(struct __sk_buff *skb)
  {
     ... parse L2, L3 here ...
     __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
     bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
     ... default: process unknown protocol ...
  }
  int parse_tcp(struct __sk_buff *skb) {...}
  int parse_udp(struct __sk_buff *skb) {...}
  ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
  ipproto_jmp_table[IPPROTO_UDP] = parse_udp;

- for TC use case, bpf_tail_call() allows to implement reclassify-like logic

- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
  are atomic, so user space can build chains of BPF programs on the fly

Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
  It could have been implemented without JIT changes as a wrapper on top of
  BPF_PROG_RUN() macro, but with two downsides:
  . all programs would have to pay performance penalty for this feature and
    tail call itself would be slower, since mandatory stack unwind, return,
    stack allocate would be done for every tailcall.
  . tailcall would be limited to programs running preempt_disabled, since
    generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
    need to be either global per_cpu variable accessed by helper and by wrapper
    or global variable protected by locks.

  In this implementation x64 JIT bypasses stack unwind and jumps into the
  callee program after prologue.

- bpf_prog_array_compatible() ensures that prog_type of callee and caller
  are the same and JITed/non-JITed flag is the same, since calling JITed
  program from non-JITed is invalid, since stack frames are different.
  Similarly calling kprobe type program from socket type program is invalid.

- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
  abstraction, its user space API and all of verifier logic.
  It's in the existing arraymap.c file, since several functions are
  shared with regular array map.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 17:07:59 -04:00
Daniel Borkmann
e7582bab5d net: dev: reduce both ingress hook ifdefs
Reduce ifdef pollution slightly, no functional change. We can simply
remove the extra alternative definition of handle_ing() and nf_ingress().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 16:58:53 -04:00
Eric Dumazet
eb9344781a tcp: add a force_schedule argument to sk_stream_alloc_skb()
In commit 8e4d980ac215 ("tcp: fix behavior for epoll edge trigger")
we fixed a possible hang of TCP sockets under memory pressure,
by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
if no packet is in socket write queue.

It turns out there are other cases where we want to force memory
schedule :

tcp_fragment() & tso_fragment() need to split a big TSO packet into
two smaller ones. If we block here because of TCP memory pressure,
we can effectively block TCP socket from sending new data.
If no further ACK is coming, this hang would be definitive, and socket
has no chance to effectively reduce its memory usage.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 16:56:40 -04:00
Erik Kline
765c9c639f neigh: Better handling of transition to NUD_PROBE state
[1] When entering NUD_PROBE state via neigh_update(), perhaps received
    from userspace, correctly (re)initialize the probes count to zero.

    This is useful for forcing revalidation of a neighbor (for example
    if the host is attempting to do DNA [IPv4 4436, IPv6 6059]).

[2] Notify listeners when a neighbor goes into NUD_PROBE state.

    By sending notifications on entry to NUD_PROBE state listeners get
    more timely warnings of imminent connectivity issues.

    The current notifications on entry to NUD_STALE have somewhat
    limited usefulness: NUD_STALE is a perfectly normal state, as is
    NUD_DELAY, whereas notifications on entry to NUD_FAILURE come after
    a neighbor reachability problem has been confirmed (typically after
    three probes).

Signed-off-by: Erik Kline <ek@google.com>
Acked-By: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 16:52:17 -04:00
Herbert Xu
407d34ef29 xfrm: Always zero high-order sequence number bits
As we're now always including the high bits of the sequence number
in the IV generation process we need to ensure that they don't
contain crap.

This patch ensures that the high sequence bits are always zeroed
so that we don't leak random data into the IV.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-05-21 06:56:23 +02:00
Ilya Dryomov
521a04d06a Revert "libceph: clear r_req_lru_item in __unregister_linger_request()"
This reverts commit ba9d114ec5578e6e99a4dfa37ff8ae688040fd64.

.. which introduced a regression that prevented all lingering requests
requeued in kick_requests() from ever being sent to the OSDs, resulting
in a lot of missed notifies.  In retrospect it's pretty obvious that
r_req_lru_item item in the case of lingering requests can be used not
only for notarget, but also for unsent linkage due to how tightly
actual map and enqueue operations are coupled in __map_request().

The assertion that was being silenced is taken care of in the previous
("libceph: request a new osdmap if lingering request maps to no osd")
commit: by always kicking homeless lingering requests we ensure that
none of them ends up on the notarget list outside of the critical
section guarded by request_mutex.

Cc: stable@vger.kernel.org # 3.18+, needs b0494532214b "libceph: request a new osdmap if lingering request maps to no osd"
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2015-05-20 21:02:46 +03:00
Ilya Dryomov
b049453221 libceph: request a new osdmap if lingering request maps to no osd
This commit does two things.  First, if there are any homeless
lingering requests, we now request a new osdmap even if the osdmap that
is being processed brought no changes, i.e. if a given lingering
request turned homeless in one of the previous epochs and remained
homeless in the current epoch.  Not doing so leaves us with a stale
osdmap and as a result we may miss our window for reestablishing the
watch and lose notifies.

MON=1 OSD=1:

    # cat linger-needmap.sh
    #!/bin/bash
    rbd create --size 1 test
    DEV=$(rbd map test)
    ceph osd out 0
    rbd map dne/dne # obtain a new osdmap as a side effect (!)
    sleep 1
    ceph osd in 0
    rbd resize --size 2 test
    # rbd info test | grep size -> 2M
    # blockdev --getsize $DEV -> 1M

N.B.: Not obtaining a new osdmap in between "osd out" and "osd in"
above is enough to make it miss that resize notify, but that is a
bug^Wlimitation of ceph watch/notify v1.

Second, homeless lingering requests are now kicked just like those
lingering requests whose mapping has changed.  This is mainly to
recognize that a homeless lingering request makes no sense and to
preserve the invariant that a registered lingering request is not
sitting on any of r_req_lru_item lists.  This spares us a WARN_ON,
which commit ba9d114ec557 ("libceph: clear r_req_lru_item in
__unregister_linger_request()") tried to fix the _wrong_ way.

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2015-05-20 21:02:14 +03:00
Michal Kubeček
2759647247 ipv6: fix ECMP route replacement
When replacing an IPv6 multipath route with "ip route replace", i.e.
NLM_F_CREATE | NLM_F_REPLACE, fib6_add_rt2node() replaces only first
matching route without fixing its siblings, resulting in corrupted
siblings linked list; removing one of the siblings can then end in an
infinite loop.

IPv6 ECMP implementation is a bit different from IPv4 so that route
replacement cannot work in exactly the same way. This should be a
reasonable approximation:

1. If the new route is ECMP-able and there is a matching ECMP-able one
already, replace it and all its siblings (if any).

2. If the new route is ECMP-able and no matching ECMP-able route exists,
replace first matching non-ECMP-able (if any) or just add the new one.

3. If the new route is not ECMP-able, replace first matching
non-ECMP-able route (if any) or add the new route.

We also need to remove the NLM_F_REPLACE flag after replacing old
route(s) by first nexthop of an ECMP route so that each subsequent
nexthop does not replace previous one.

Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 12:02:26 -04:00
Michal Kubeček
35f1b4e96b ipv6: do not delete previously existing ECMP routes if add fails
If adding a nexthop of an IPv6 multipath route fails, comment in
ip6_route_multipath() says we are going to delete all nexthops already
added. However, current implementation deletes even the routes it
hasn't even tried to add yet. For example, running

  ip route add 1234:5678::/64 \
      nexthop via fe80::aa dev dummy1 \
      nexthop via fe80::bb dev dummy1 \
      nexthop via fe80::cc dev dummy1

twice results in removing all routes first command added.

Limit the second (delete) run to nexthops that succeeded in the first
(add) run.

Fixes: 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-20 12:02:25 -04:00
Arik Nemtsov
c5a71688e1 mac80211: disconnect TDLS stations on STA CSA
When a station does a channel switch, it's not well defined what its TDLS
peers would do. Avoid a situation when the local side marks a potentially
disconnected peer as a TDLS peer.
Keeping peers connected through CSA is doubly problematic with the upcoming
TDLS WIDER-BW feature which allows peers to widen the BSS channel. The
new channel transitioned-to might not be compatible and would require
a re-negotiation anyway.

Make sure to disallow new TDLS link during CSA.

Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-20 15:14:54 +02:00
Michal Kazior
f9dca80b98 mac80211: fix AP_VLAN crypto tailroom calculation
Some splats I was seeing:

 (a) WARNING: CPU: 1 PID: 0 at /devel/src/linux/net/mac80211/wep.c:102 ieee80211_wep_add_iv
 (b) WARNING: CPU: 1 PID: 0 at /devel/src/linux/net/mac80211/wpa.c:73 ieee80211_tx_h_michael_mic_add
 (c) WARNING: CPU: 3 PID: 0 at /devel/src/linux/net/mac80211/wpa.c:433 ieee80211_crypto_ccmp_encrypt

I've seen (a) and (b) with ath9k hw crypto and (c)
with ath9k sw crypto. All of them were related to
insufficient skb tailroom and I was able to
trigger these with ping6 program.

AP_VLANs may inherit crypto keys from parent AP.
This wasn't considered and yielded problems in
some setups resulting in inability to transmit
data because mac80211 wouldn't resize skbs when
necessary and subsequently drop some packets due
to insufficient tailroom.

For efficiency purposes don't inspect both AP_VLAN
and AP sdata looking for tailroom counter. Instead
update AP_VLAN tailroom counters whenever their
master AP tailroom counter changes.

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-20 15:10:11 +02:00
Johannes Berg
252ec2b3aa mac80211: don't split remain-on-channel for coalescing
Due to remain-on-channel scheduling delays, when we split an ROC
while coalescing, we'll usually get a picture like this:

existing ROC:  |------------------|
current time:              ^
new ROC:                   |------|              |-------|

If the expected response frames are then transmitted by the peer
in the hole between the two fragments of the new ROC, we miss
them and the process (e.g. ANQP query) fails.

mac80211 expects that the window to miss something is small:

existing ROC:  |------------------|
new ROC:                   |------||-------|

but that's normally not the case.

To avoid this problem, coalesce only if the new ROC's duration
is <= the remaining time on the existing one:

existing ROC:  |------------------|
new ROC:                   |-----|

and never split a new one but schedule it afterwards instead:

existing ROC:  |------------------|
new ROC:                                       |-------------|

type=bugfix
bug=not-tracked
fixes=unknown

Reported-by: Matti Gottlieb <matti.gottlieb@intel.com>
Reviewed-by: EliadX Peller <eliad@wizery.com>
Reviewed-by: Matti Gottlieb <matti.gottlieb@intel.com>
Tested-by: Matti Gottlieb <matti.gottlieb@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-20 15:09:22 +02:00
Michal Kazior
464daaf04c mac80211: check fast-xmit on station change
Drivers with fast-xmit (e.g. ath10k) running in
AP_VLAN setups would fail to communicate with
connected 4addr stations.

The reason was when new station associates it
first goes into master AP interface. It is not
until later that a dedicated AP_VLAN is created
for it and the station itself is moved there.
After that Tx directed at the station should use
4addr header. However fast-xmit wasn't
recalculated and 3addr header remained to be used.
This in turn caused the connected 4addr stations
to drop packets coming from the AP until some
other event would cause fast-xmit to recalculate
for that station (which could never come).

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-20 15:08:36 +02:00
Lars-Peter Clausen
262918d847 cfg80211: Switch to PM ops
Use dev_pm_ops instead of the legacy suspend/resume callbacks for the wiphy
class suspend and resume operations.

Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-20 15:00:12 +02:00
Lars-Peter Clausen
28f297a7af net: rfkill: Switch to PM ops
Use dev_pm_ops instead of the legacy suspend/resume callbacks for the
rfkill class suspend and resume operations.

Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-20 15:00:00 +02:00
Florian Westphal
faecbb45eb Revert "netfilter: bridge: query conntrack about skb dnat"
This reverts commit c055d5b03bb4cb69d349d787c9787c0383abd8b2.

There are two issues:
'dnat_took_place' made me think that this is related to
-j DNAT/MASQUERADE.

But thats only one part of the story.  This is also relevant for SNAT
when we undo snat translation in reverse/reply direction.

Furthermore, I originally wanted to do this mainly to avoid
storing ipv6 addresses once we make DNAT/REDIRECT work
for ipv6 on bridges.

However, I forgot about SNPT/DNPT which is stateless.

So we can't escape storing address for ipv6 anyway. Might as
well do it for ipv4 too.

Reported-and-tested-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-20 13:51:25 +02:00
Dave Jones
1086bbe97a netfilter: ensure number of counters is >0 in do_replace()
After improving setsockopt() coverage in trinity, I started triggering
vmalloc failures pretty reliably from this code path:

warn_alloc_failed+0xe9/0x140
__vmalloc_node_range+0x1be/0x270
vzalloc+0x4b/0x50
__do_replace+0x52/0x260 [ip_tables]
do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
nf_setsockopt+0x65/0x90
ip_setsockopt+0x61/0xa0
raw_setsockopt+0x16/0x60
sock_common_setsockopt+0x14/0x20
SyS_setsockopt+0x71/0xd0

It turns out we don't validate that the num_counters field in the
struct we pass in from userspace is initialized.

The same problem also exists in ebtables, arptables, ipv6, and the
compat variants.

Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-20 13:46:49 +02:00
Francesco Ruggeri
3bfe049807 netfilter: nfnetlink_{log,queue}: Register pernet in first place
nfnetlink_{log,queue}_init() register the netlink callback nf*_rcv_nl_event
before registering the pernet_subsys, but the callback relies on data
structures allocated by pernet init functions.

When nfnetlink_{log,queue} is loaded, if a netlink message is received after
the netlink callback is registered but before the pernet_subsys is registered,
the kernel will panic in the sequence

nfulnl_rcv_nl_event
  nfnl_log_pernet
    net_generic
      BUG_ON(id == 0)  where id is nfnl_log_net_id.

The panic can be easily reproduced in 4.0.3 by:

while true ;do modprobe nfnetlink_log ; rmmod nfnetlink_log ; done &
while true ;do ip netns add dummy ; ip netns del dummy ; done &

This patch moves register_pernet_subsys to earlier in nfnetlink_log_init.

Notice that the BUG_ON hit in 4.0.3 was recently removed in 2591ffd308
["netns: remove BUG_ONs from net_generic()"].

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-05-20 13:46:48 +02:00
Johannes Berg
94c78cb452 mac80211: fix memory leak
My recent change here introduced a possible memory leak if the
driver registers an invalid cipher schemes. This won't really
happen in practice, but fix the leak nonetheless.

Fixes: e3a55b5399d55 ("mac80211: validate cipher scheme PN length better")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-20 11:37:38 +02:00
Daniel Borkmann
492135557d tcp: add rfc3168, section 6.1.1.1. fallback
This work as a follow-up of commit f7b3bec6f516 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested from the RFC on the first timeout:

  [...] A host that receives no reply to an ECN-setup SYN within the
  normal SYN retransmission timeout interval MAY resend the SYN and
  any subsequent SYN retransmissions with CWR and ECE cleared. [...]

Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3c9 ("tcp: extend
ECN sysctl to allow server-side only ECN"):

 1) Normal ECN-capable path:

    SYN ECE CWR ----->
                <----- SYN ACK ECE
            ACK ----->

 2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
            SYN ----->
                <----- SYN ACK
            ACK ----->

In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)

In any case, it's rather a smaller percentage of sites where there would
occur such additional setup latency: it was found in end of 2014 that ~56%
of IPv4 and 65% of IPv6 servers of Alexa 1 million list would negotiate
ECN (aka tcp_ecn=2 default), 0.42% of these webservers will fail to connect
when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
fallback would mitigate with a slight latency trade-off. Recent related
paper on this topic:

  Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
  Gorry Fairhurst, and Richard Scheffenegger:
    "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
    Proc. PAM 2015, New York.
  http://ecn.ethz.ch/ecn-pam15.pdf

Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users explicitly not wanting this
which can be in DC use case, we add a net.ipv4.tcp_ecn_fallback knob that
allows for disabling the fallback.

tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.

Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave That <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:53:37 -04:00
David S. Miller
892bd6291a This has just a single fix, for a WEP tailroom check
problem that leads to dropped frames.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJVWuR+AAoJEDBSmw7B7bqrgRYP/1A/g6osMT4DG/WecYleDlip
 1De0c1S4rgHVI+/ZvK4JcyYbjHYwKhbXHBtsNHV+J9GehqXyWn0BJPkSnxZ3HdZV
 M5CHSgtN2OBfHJ03OTpduvdNzKjpVOCf2PWKFnhJhDzYdfa9qh9kKDwRGeDcHvfc
 ++vVs+bMzjhnWj2y0TpEs1fQcd69MrR9Af2ptftOrusuVkDxShKrgY4xj1d+OVyC
 FggUn/oj6/CgGVn8KV1hld+Cb1Tk1/D9uksXYZepHNo4qb0M8T8BBWIQCpdbK4Ge
 qAG8w7/suLGqb8VU5k0jM4Uqbn5l9cm7PX1PQrxCdyFHMf3kojR8LgI33Xqm4d40
 9HxnXLlDoaawTOiAJIG1HMEzawriWfxSly3hS1Q/B/FGo68C2KIg9h5/w98GNfIB
 PNE41GopCwQlmhORGXxpzwf/jJ5mL9V6PjxUnKpsd/BlbUlKLmFnx7JABicLl/Ps
 292l2yZR9Jrzaf8njmGoIyYb+AREvJF4zQu9rduiro56+rCvGvFJZ8xfwGsRvNTH
 f/HILhW+GDPlJp4StCvKQxm0bWJ6feRiPCYr2JRViMQmp6hX6AMYfVcf7jcpIWio
 uTB9FAW6XGPrltm+1IeyWICbiF0VpuqpPl8V3UTiDgIBdk4wqD5kr3HSW6zb1w9L
 KJYFbpC9igqGk+fc3u4b
 =DNh/
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2015-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
This has just a single fix, for a WEP tailroom check
problem that leads to dropped frames.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:44:25 -04:00
David S. Miller
b7a3a8e31f This just has a few fixes:
* LED throughput trigger was crashing
  * fast-xmit wasn't treating QoS changes in IBSS correctly
  * TDLS could use the wrong channel definition
  * using a reserved channel context could use the wrong channel width
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJVWuPvAAoJEDBSmw7B7bqr/A4P/0TqzkCC5L2qJvi3a6QNFxvf
 s3riQMJ8WQeUxCRNgFNeeeNUAgSJn3hhiINGrjRmkwXXxYC4mbCwM0YXNT+WhSRL
 /Kx4mRJr7u5ZU0olW+KRvIV5CyTsbr9zVnaraCh5NV43nT87ZVZRBKC9vz2UkSM5
 AsN6fUvhWMGhhHoGGDqtjRBjve8Xs5iKiEcE1iQTzLOPnFP3dKtB1zKKiA0JCQs4
 OjxkQ7uaF0T1IfkMFr0gyzgQi4A8iPoMKV3qcRIH/QZN5dpJ6DR1dgaU50CrzQ+R
 JD9W09ifF9U8GnvQU/baJHKCxEvnQWO2XwlV4+mV6bXF1j5Ng4LRiXntIeu2d3T3
 5JuvPV9cNJb8dSTzsYw+TRJg73hStlJCAjVMJ7hiOMQc1YCCY9Exrff0pWzJPJfE
 NygIkMHXymcy66yL3b7DIIXro5jHNVGVoHq3vMB+W+/EcEDFN6L9LeCzUVo+oKjl
 Qg4kC7VHDjcdt0f7Vgv2Cal76ZVfCZaq74QZV1cySF2sCiD27LnAAfoHVeMY979K
 qBsCRqhkBlc7ntnstv6tGz9LfG8ro+Fv548HIUDG80capZl6N6FR6g+8hYIuvJwu
 2abJq36bp/NAGDt43UofmtDxyZNyvoKmzcQKSdn2QpryGKQ3uRcDHG/I+WD0yWX9
 4WFNEm86sXmfL/Eyu0lU
 =iUFe
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2015-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
This just has a few fixes:
 * LED throughput trigger was crashing
 * fast-xmit wasn't treating QoS changes in IBSS correctly
 * TDLS could use the wrong channel definition
 * using a reserved channel context could use the wrong channel width
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:43:17 -04:00
Yuchung Cheng
b7b0ed910c tcp: don't over-send F-RTO probes
After sending the new data packets to probe (step 2), F-RTO may
incorrectly send more probes if the next ACK advances SND_UNA and
does not sack new packet. However F-RTO RFC 5682 probes at most
once. This bug may cause sender to always send new data instead of
repairing holes, inducing longer HoL blocking on the receiver for
the application.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:36:57 -04:00
Yuchung Cheng
da34ac7626 tcp: only undo on partial ACKs in CA_Loss
Undo based on TCP timestamps should only happen on ACKs that advance
SND_UNA, according to the Eifel algorithm in RFC 3522:

Section 3.2:

  (4) If the value of the Timestamp Echo Reply field of the
      acceptable ACK's Timestamps option is smaller than the
      value of RetransmitTS, then proceed to step (5),

Section Terminology:
   We use the term 'acceptable ACK' as defined in [RFC793].  That is an
   ACK that acknowledges previously unacknowledged data.

This is because upon receiving an out-of-order packet, the receiver
returns the last timestamp that advances RCV_NXT, not the current
timestamp of the packet in the DUPACK. Without checking the flag,
the DUPACK will cause tcp_packet_delayed() to return true and
tcp_try_undo_loss() will revert cwnd reduction.

Note that we check the condition in CA_Recovery already by only
calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or
tcp_try_undo_recovery() if snd_una crosses high_seq.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:36:57 -04:00
Henning Rogge
33b4b015e1 net/ipv6/udp: Fix ipv6 multicast socket filter regression
Commit <5cf3d46192fc> ("udp: Simplify__udp*_lib_mcast_deliver")
simplified the filter for incoming IPv6 multicast but removed
the check of the local socket address and the UDP destination
address.

This patch restores the filter to prevent sockets bound to a IPv6
multicast IP to receive other UDP traffic link unicast.

Signed-off-by: Henning Rogge <hrogge@gmail.com>
Fixes: 5cf3d46192fc ("udp: Simplify__udp*_lib_mcast_deliver")
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:34:43 -04:00
Eric B Munson
aea0929e51 tcp: Return error instead of partial read for saved syn headers
Currently the getsockopt() requesting the cached contents of the syn
packet headers will fail silently if the caller uses a buffer that is
too small to contain the requested data.  Rather than fail silently and
discard the headers, getsockopt() should return an error and report the
required size to hold the data.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:33:34 -04:00
Johan Hedberg
011c391a09 Bluetooth: Add debug logs for legacy SMP crypto functions
To help debug legacy SMP crypto functions add debug logs of the
various values involved.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 21:07:29 +02:00
Arnd Bergmann
73e85ed36a mac802154: select CRYPTO when needed
The mac802154 subsystem uses functions from the crypto layer and correctly
selects the individual crypto algorithms, but fails to build when the
crypto layer is disabled altogether:

crypto/built-in.o: In function `crypto_ctr_free':
:(.text+0x80): undefined reference to `crypto_drop_spawn'
crypto/built-in.o: In function `crypto_rfc3686_free':
:(.text+0xac): undefined reference to `crypto_drop_spawn'
crypto/built-in.o: In function `crypto_ctr_crypt':
:(.text+0x2f0): undefined reference to `blkcipher_walk_virt_block'
:(.text+0x2f8): undefined reference to `crypto_inc'

To solve that, this patch also selects the core crypto code,
like all other users of that code do.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 19:35:48 +02:00
Johannes Berg
22d3a3c829 mac80211: don't use napi_gro_receive() outside NAPI context
No matter how the driver manages its NAPI context, there's no way
sending frames to it from a timer can be correct, since it would
corrupt the internal GRO lists.

To avoid that, always use the non-NAPI path when releasing frames
from the timer.

Cc: stable@vger.kernel.org
Reported-by: Jean Trivelly <jean.trivelly@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-19 15:46:21 +02:00
Alexander Aring
3862eba691 mac802154: tx: allow xmit complete from hard irq
Replace consume_skb with dev_consume_skb_any in ieee802154_xmit_complete
which can be called in hard irq and other contexts.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:45 +02:00
Alexander Aring
0e66545701 nl802154: add support for dump phy capabilities
This patch add support to nl802154 to dump all phy capabilities which is
inside the wpan_phy_supported struct. Also we introduce a new method to
dumping supported channels. The new method will offer a easier interface
and has lesser netlink traffic.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:44 +02:00
Alexander Aring
65318680c9 ieee802154: add iftypes capability
This patch adds capability flags for supported interface types.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring
edea8f7c75 cfg802154: introduce wpan phy flags
This patch introduce a flag property for the wpan phy structure.
The current flag settings in ieee802154_hw are accessable in mac802154
layer only which is okay for flags which indicates MAC handling which
are done by phy. For real PHY layer settings like cca mode, transmit
power, cca energy detection level.

The difference between these flags are that the MAC handling flags are
only handled in mac802154/HardMac layer e.g. on an interface up. The phy
settings are direct netlink calls from nl802154 into the driver layer
and the nl802154 need to have a chance to check if the driver supports
this handling before sending to the next layer.

We also check now on PHY flags while dumping and setting pib attributes.
In comparing with MIB attributes the 802.15.4 gives us an default value
which we assume when a transceiver implement less functionality. In case
of MIB settings the nl802154 layer doesn't need to check on the
ieee802154_hw flags then.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring
8329fcf11f mac802154: remove check if operation is supported
This patch removes the check if operation is supported by driver layer.
This is done now by capabilities flags, if these are valid then the
driver should support the operation, otherwise a WARN_ON occurs.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring
791021bf13 mac802154: check for really changes
This patch adds check if the value is really changed inside pib/mib.
If a transceiver do support only one value for e.g. max_be then this
will also handle that the driver layer doesn't need to care about
handling to set one value only.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring
fea3318d20 ieee802154: add several phy supported handling
This patch adds support for phy supported handling for all other already
existing handling 802.15.4 functionality. We assume now a fully 802.15.4
complaint transceiver at phy allocation. If a transceiver can support
802.15.4 default values only, then the values should be overwirtten by
values the transceiver supports. If the transceiver doesn't set the
according hardware flags, we assume the 802.15.4 defaults now which
cannot be changed.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Suggested-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring
72f655e44d ieee802154: introduce wpan_phy_supported
This patch introduce the wpan_phy_supported struct for wpan_phy. There
is currently no way to check if a transceiver can handle IEEE 802.15.4
complaint values. With this struct we can check before if the
transceiver supports these values before sending to driver layer.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Suggested-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Acked-by: Varka Bhadram <varkabhadram@gmail.com>
Cc: Alan Ott <alan@signal11.us>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring
32b23550ad ieee802154: change cca ed level to mbm
This patch change the handling of cca energy detection level from dbm to
mbm. This prepares to handle floating point cca energy detection levels
values. The old netlink 802.15.4 will convert the dbm value to mbm for
handling backward compatibility.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:42 +02:00
Alexander Aring
e2eb173aaa ieee802154: change transmit power to mbm
This patch change the handling of transmit power level from dbm to mbm.
This prepares to handle floating point transmit power levels values. The
old netlink 802.15.4 will convert the dbm value to mbm for handling
backward compatibility.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:41 +02:00
Alexander Aring
1a19cb680b ieee802154: change transmit power to s32
This patch change the transmit power from s8 to s32. This prepares to store a
mbm value instead dbm inside the transmit power variable. The old
interface keep the a s8 dbm value, which should be backward compatibility
when assign s8 to s32.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:41 +02:00
Alexander Aring
673692faf3 ieee802154: move validation check out of softmac
This patch moves the value validation out of softmac layer. We need
to be sure now that this value is accepted by the transceiver/mac802154 or
"possible" hardmac drivers before calling rdev-ops.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:41 +02:00
Alexander Aring
0cf0879acd nl802154: cleanup invalid argument handling
This patch cleanups the -EINVAL cases by combining them in one
condition.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-19 11:44:41 +02:00
Andy Zhou
49d16b23cd bridge_netfilter: No ICMP packet on IPv4 fragmentation error
When bridge netfilter re-fragments an IP packet for output, all
packets that can not be re-fragmented to their original input size
should be silently discarded.

However, current bridge netfilter output path generates an ICMP packet
with 'size exceeded MTU' message for such packets, this is a bug.

This patch refactors the ip_fragment() API to allow two separate
use cases. The bridge netfilter user case will not
send ICMP, the routing output will, as before.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 00:15:39 -04:00
Andy Zhou
8bc04864ac IPv4: skip ICMP for bridge contrack users when defrag expires
users in [IP_DEFRAG_CONNTRACK_BRIDGE_IN, __IP_DEFRAG_CONNTRACK_BR_IN]
should not ICMP message also.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 00:15:27 -04:00
Andy Zhou
5cf4228082 ipv4: introduce frag_expire_skip_icmp()
Improve readability of skip ICMP for de-fragmentation expiration logic.
This change will also make the logic easier to maintain when the
following patches in this series are applied.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 00:15:26 -04:00