11305 Commits

Author SHA1 Message Date
Alexey Dobriyan
73d189dce4 netns xfrm: per-netns xfrm_state_bydst hash
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:16:58 -08:00
Alexey Dobriyan
9d4139c769 netns xfrm: per-netns xfrm_state_all list
This is done to get
a) simple "something leaked" check
b) cover possible DoSes when other netns puts many, many xfrm_states
   onto a list.
c) not miss "alien xfrm_state" check in some of list iterators in future.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:16:11 -08:00
Alexey Dobriyan
673c09be45 netns xfrm: add struct xfrm_state::xs_net
To avoid unnecessary complications with passing netns around.

* set once, very early after allocating
* once set, never changes

For a while create every xfrm_state in init_net.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:15:16 -08:00
Alexey Dobriyan
d62ddc21b6 netns xfrm: add netns boilerplate
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:14:31 -08:00
Alexey Dobriyan
c95839693d xfrm: initialise xfrm_policy_gc_work statically
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 17:13:59 -08:00
Ingo Molnar
45555c0ed4 bluetooth: fix warning in net/bluetooth/rfcomm/sock.c
fix this warning:

  net/bluetooth/rfcomm/sock.c: In function ‘rfcomm_sock_ioctl’:
  net/bluetooth/rfcomm/sock.c:795: warning: unused variable ‘sk’

perhaps BT_DEBUG() should be improved to do printf format checking
instead of the #ifdef, but that looks quite intrusive: each bluetooth
.c file undefines the macro.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 16:59:21 -08:00
Ingo Molnar
ff0db0490a sunrpc: fix warning in net/sunrpc/xprtrdma/verbs.c
fix this warning:

  net/sunrpc/xprtrdma/verbs.c: In function ‘rpcrdma_conn_upcall’:
  net/sunrpc/xprtrdma/verbs.c:279: warning: unused variable ‘addr’

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 16:58:42 -08:00
Ingo Molnar
e14bec2e2b ax25: fix warning in net/ax25/sysctl_net_ax25.c
fix this warning:

  net/ax25/sysctl_net_ax25.c:27: warning: ‘min_ds_timeout’ defined but not used
  net/ax25/sysctl_net_ax25.c:27: warning: ‘max_ds_timeout’ defined but not used

These are only used in the CONFIG_AX25_DAMA_SLAVE case.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 16:58:19 -08:00
Ingo Molnar
3ed7cc0f8b dccp: fix warning in net/dccp/options.c
this warning:

  net/dccp/options.c: In function ‘dccp_parse_options’:
  net/dccp/options.c:67: warning: ‘value’ may be used uninitialized in this function

is a bogus GCC warning. The compiler does not recognize the relation
between "value" and "mandatory" variables: the code flow can ever reach
the "out_invalid_option:" label if 'mandatory' is set to 1, and when
'mandatory' is non-zero, we'll always have 'value' initialized.

Help out the compiler by annotating the variable.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 16:57:30 -08:00
Ingo Molnar
d3f644da90 dsa: fix warning in net/dsa/mv88e6060.c
this warning:

  net/dsa/mv88e6060.c: In function ‘mv88e6060_poll_link’:
  net/dsa/mv88e6060.c:225: warning: ‘port_status’ may be used uninitialized in this function

triggers because GCC does not recognize the (correct) error flow
between 'link' and 'port_status'.

Annotate it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 16:51:13 -08:00
Ingo Molnar
2a9e79782d dsa: fix warning in net/dsa/mv88e6xxx.c
this warning:

  net/dsa/mv88e6xxx.c: In function ‘mv88e6xxx_poll_link’:
  net/dsa/mv88e6xxx.c:361: warning: ‘port_status’ may be used uninitialized in this function

triggers because GCC does not recognize the (correct) error flow
between 'link' and 'port_status'.

Annotate it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 16:50:49 -08:00
Ingo Molnar
55205d400e ipv6: fix warning in net/ipv6/ip6_flowlabel.c
this warning:

  net/ipv6/ip6_flowlabel.c: In function ‘ipv6_flowlabel_opt’:
  net/ipv6/ip6_flowlabel.c:467: warning: ‘err’ may be used uninitialized in this function

triggers because GCC does not recognize the (correct) error flow
between fl_create() and 'err'.

Annotate it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 16:50:30 -08:00
Ingo Molnar
dc0a0011cf pkt_sched: fix warning in net/sched/sch_hfsc.c
this warning:

  net/sched/sch_hfsc.c: In function ‘hfsc_enqueue’:
  net/sched/sch_hfsc.c:1577: warning: ‘err’ may be used uninitialized in this function

triggers because GCC does not recognize the (correct) error flow
between hfsc_classify(), 'cl' and 'err'.

Annotate it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 16:50:02 -08:00
Ingo Molnar
ed72b9c6e0 sunrpc: fix warning in net/sunrpc/xprtrdma/svc_rdma_transport.c
this warning:

  net/sunrpc/xprtrdma/svc_rdma_transport.c: In function ‘svc_rdma_accept’:
  net/sunrpc/xprtrdma/svc_rdma_transport.c:830: warning: ‘dma_mr_acc’ may be used uninitialized in this function

triggers because GCC does not recognize the (correct) flow connection
between need_dma_mr and dma_mr_acc.

Annotate it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 16:49:37 -08:00
Daniel Lezcano
09bb52175b netns: filter out uevent not belonging to init_net
This patch will filter out the uevent not related to the init_net.
Without this patch if a network device is created in a network
namespace with the same name as one network device belonging to the
initial network namespace (eg. eth0), when the network namespace
will die and the network device will be destroyed, an event will
be sent and catched by the udevd daemon. That will result to have
the real network device to be shutdown because the udevd/uevent are
not namespace aware.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 16:46:37 -08:00
David S. Miller
d7713ccc7b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 2008-11-25 14:27:58 -08:00
Ilpo Järvinen
9f782db3f5 tcp: skb_shift cannot cache frag ptrs past pskb_expand_head
Since pskb_expand_head creates copy of the shared area we
cannot keep any frag ptr past de-cloning. This fixes the
tcpdump recvfrom -EFAULT problem.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 13:57:01 -08:00
Jarek Poplawski
f6486d40b3 pkt_sched: sch_api: Remove qdisc_list_lock
After implementing qdisc->ops->peek() there is no more calling
qdisc_tree_decrease_qlen() without rtnl_lock(), so qdisc_list_lock
added by commit: f6e0b239a2657ea8cb67f0d83d0bfdbfd19a481b "pkt_sched:
Fix qdisc list locking" can be removed.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 13:56:06 -08:00
Eric Dumazet
723b46108f net: udp_unhash() can test if sk is hashed
Impact: Optimization

Like done in inet_unhash(), we can avoid taking a chain lock if
socket is not hashed in udp_unhash()

Triggered by close(socket(AF_INET, SOCK_DGRAM, 0));

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 13:55:15 -08:00
Eric Dumazet
5bc0b3bfa7 net: Make sure BHs are disabled in sock_prot_inuse_add()
prot->destroy is not called with BH disabled. So we must add
explicit BH disable around call to sock_prot_inuse_add()
in sctp_destroy_sock()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 13:53:27 -08:00
Ilpo Järvinen
8eecaba900 tcp: tcp_limit_reno_sacked can become static
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 13:45:29 -08:00
Luis R. Rodriguez
4ada424db1 mac80211: don't assume driver has been attached on registration
mac80211's ieee80211_register_hw() is often called within the
probe path so it cannot assume the device's driver structure
has been attached yet so to create a workqueue instead of
using driver->name use the wiphy's phy%d name. The name doesn't
really matter anyway.

This should fix sporadic oopses found when we race to beat the
driver pointer setting. Not even sure how this was working properly.

http://www.kerneloops.org/search.php?search=ieee80211_register_hw

Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:41:40 -05:00
Sujith
f16f33df4d mac80211: Use the HT capabilities from the IE instead of the station's caps.
Signed-off-by: Sujith <Sujith.Manoharan@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:41:36 -05:00
Luis R. Rodriguez
14b9815af3 cfg80211: add support for custom firmware regulatory solutions
This adds API to cfg80211 to allow wireless drivers to inform
us if their firmware can handle regulatory considerations *and*
they cannot map these regulatory domains to an ISO / IEC 3166
alpha2. In these cases we skip the first regulatory hint instead
of expecting the driver to build their own regulatory structure,
providing us with an alpha2, or using the reg_notifier().

Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked-by: Zhu Yi <yi.zhu@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:41:27 -05:00
Luis R. Rodriguez
3f2355cb91 cfg80211/mac80211: Add 802.11d support
This adds country IE parsing to mac80211 and enables its usage
within the new regulatory infrastructure in cfg80211. We parse
the country IEs only on management beacons for the BSSID you are
associated to and disregard the IEs when the country and environment
(indoor, outdoor, any) matches the already processed country IE.

To avoid following misinformed or outdated APs we build and use
a regulatory domain out of the intersection between what the AP
provides us on the country IE and what CRDA is aware is allowed
on the same country.

A secondary device is allowed to follow only the same country IE
as it make no sense for two devices on a system to be in two
different countries.

In the case the AP is using country IEs for an incorrect country
the user may help compliance further by setting the regulatory
domain before or after the IE is parsed and in that case another
intersection will be performed.

CONFIG_WIRELESS_OLD_REGULATORY is supported but requires CRDA
present.

Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:41:26 -05:00
Luis R. Rodriguez
88dc1c3f7f cfg80211: mark regdomains with > NL80211_MAX_SUPP_REG_RULES invalid
Lets remain consistent and mark rds with > NL80211_MAX_SUPP_REG_RULES
number of reg rules as invalid in is_valid_rd().

Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:41:25 -05:00
Luis R. Rodriguez
02ba0b3263 cfg80211: call_crda() won't tell us if CRDA was present
kobject_uevent_env() can return an error but it just tells us
if the uvent was built/sent or not, it doesn't tell us anything
about what happened in userspace, whether the udev rule was present
nor does it tell us if CRDA was present or not. So remove
the informative complaint about it assuming it will tell us
such things.

Note that you can determine if CRDA is present after loading cfg80211
by using:

is_old_static_regdom(cfg80211_regdomain)

but this doesn't account for possible user install after initial
boot, and also for when the user uses the static EU regulatory
domain.

Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:41:25 -05:00
Luis R. Rodriguez
a01ddafd43 cfg80211: expect different rd in cfg80211 when intersecting
When intersecting it is possible that set_regdom() was called
with a regulatory domain which we'll only use as an aid to
build a final regulatory domain.

Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:41:24 -05:00
Luis R. Rodriguez
b8295acdc3 cfg80211: separate intersection section in __set_regdom()
So far the __set_regdom() code is pretty generic as the
intersection case is fairly straight forward; this will however
change when 802.11d support is added so lets separate intersection
code for now in preparation for 802.11d support.

This patch only has slight functional changes.

Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:41:23 -05:00
Luis R. Rodriguez
8375af3ba2 cfg80211: remove switch from __set_regdom()
We have control over the REGDOM_SET_BY_* macros passed
so remove the switch.

This patch has no functional changes.

Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:41:22 -05:00
Luis R. Rodriguez
5203cdb6ad cfg80211: remove switch from __regulatory_hint()
We have complete control over REGDOM_SET_BY_* enum passed
down to __regulatory_hint() as such there is no need to
account for unexpected REGDOM_SET_BY_*'s, lets just remove
the switch statement as this code does not change and
won't change even when we add 802.11d support.

This patch has no functional changes.

Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:41:22 -05:00
Luis R. Rodriguez
91e9900418 cfg80211: mark negative frequencies as invalid
Regulatory rules with negative frequencies are now
marked as invalid in is_valid_reg_rule().

Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:41:21 -05:00
Ingo Molnar
020cf6ba7a net/wireless/reg.c: fix bad WARN_ON in if statement
fix:

  net/wireless/reg.c:348:29: error: macro "if" passed 2 arguments, but takes just 1

triggered by the branch-tracer.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:13:09 -05:00
Abhijeet Kolekar
3dd3b79aea mac80211 : Fix setting ad-hoc mode and non-ibss channel
Patch fixes the kernel trace when user tries to set
ad-hoc mode on non IBSS channel.
e.g iwconfig wlan0 chan 36 mode ad-hoc

Signed-off-by: Abhijeet Kolekar <abhijeet.kolekar@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-25 16:13:08 -05:00
Ingo Molnar
d6e8cc6cc7 netfilter: fix warning in net/netfilter/nf_conntrack_ftp.c
this warning:

  net/netfilter/nf_conntrack_ftp.c: In function 'help':
  net/netfilter/nf_conntrack_ftp.c:360: warning: 'matchoff' may be used uninitialized in this function
  net/netfilter/nf_conntrack_ftp.c:360: warning: 'matchlen' may be used uninitialized in this function

triggers because GCC does not recognize the (correct) error flow
between find_pattern(), 'found', 'matchoff' and 'matchlen'.

Annotate it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-11-25 18:23:03 +01:00
Eric Leblond
9f40ac713c netfilter: nfmark IPV6 routing in OUTPUT, mangle, NFQUEUE
This patch let nfmark to be evaluated for routing decision for OUTPUT
packet, in mangle table, when process paquet in NFQUEUE. This patch is
an IPv6 port of Laurent Licour IPv4 one.

Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-11-25 12:18:11 +01:00
Eric Leblond
5f145e44ae netfilter: nfmark routing in OUTPUT, mangle, NFQUEUE
This patch let nfmark to be evaluated for routing decision for OUTPUT
packet, in mangle table, when process paquet in NFQUEUE
Until now, only change (in NFQUEUE process) on fields src_addr,
dest_addr and tos could make netfilter to reevalute the routing.

From: Laurent Licour <laurent@licour.com>
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2008-11-25 12:15:16 +01:00
Alexey Dobriyan
fb7e06748c xfrm: remove useless forward declarations
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 01:05:54 -08:00
Alexey Dobriyan
6daad37230 ah4/ah6: remove useless NULL assignments
struct will be kfreed in a moment, so...

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 01:05:09 -08:00
Jeff Kirsher
7a6b6f515f DCB: fix kconfig option
Since the netlink option for DCB is necessary to actually be useful,
simplified the Kconfig option.  In addition, added useful help text for the
Kconfig option.

Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 01:02:08 -08:00
Bernard Pidoux
244f46ae6e rose: zero length frame filtering in af_rose.c
Since changeset e79ad711a0108475c1b3a03815527e7237020b08 from  mainline,
>From David S. Miller,
empty packet can be transmitted on connected socket for datagram protocols.

However, this patch broke a high level application using ROSE network protocol with connected datagram.

Bulletin Board Stations perform bulletins forwarding between BBS stations via ROSE network using a forward protocol.
Now, if for some reason, a buffer in the application software happens to be empty at a specific moment,
ROSE sends an empty packet via unfiltered packet socket.
When received, this ROSE packet introduces perturbations of data exchange of BBS forwarding,
for the application message forwarding protocol is waiting for something else.
We agree that a more careful programming of the application protocol would avoid this situation and we are
willing to debug it.
But, as an empty frame is no use and does not have any meaning for ROSE protocol,
we may consider filtering zero length data both when sending and receiving socket data.

The proposed patch repaired BBS data exchange through ROSE network that were broken since 2.6.22.11 kernel.

Signed-off-by: Bernard Pidoux <f6bvp@amsat.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-25 00:56:20 -08:00
Ilpo Järvinen
0ace285605 tcp: handle shift/merge of cloned skbs too
This caused me to get repeatably:

  tcpdump: pcap_loop: recvfrom: Bad address

Happens occassionally when I tcpdump my for-looped test xfers:
  while [ : ]; do echo -n "$(date '+%s.%N') "; ./sendfile; sleep 20; done

Rest of the relevant commands:
  ethtool -K eth0 tso off
  tc qdisc add dev eth0 root netem drop 4%
  tcpdump -n -s0 -i eth0 -w sacklog.all

Running net-next under kvm, connection goes to the same host
(basically just out of kvm). The connection itself works ok
and data gets sent without corruption even with a large
number of tests while tcpdump fails usually within less than
5 tests.

Whether it only happens because of this change or not, I
don't know for sure but it's the only thing with which
I've seen that error. The non-cloned variant works w/o it
for much longer time. I'm yet to debug where the error
actually comes from.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:30:21 -08:00
Ilpo Järvinen
111cc8b913 tcp: add some mibs to track collapsing
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:27:22 -08:00
Ilpo Järvinen
92ee76b6d9 tcp: Make shifting not clear the hints
The earlier version was just very basic one which is "playing
safe" by always clearing the hints. However, clearing of a hint
is extremely costly operation with large windows, so it must be
avoided at all cost whenever possible, there is a way with
shifting too achieve not-clearing.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:26:56 -08:00
Ilpo Järvinen
832d11c5cd tcp: Try to restore large SKBs while SACK processing
During SACK processing, most of the benefits of TSO are eaten by
the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
Then we're in problems when cleanup work for them has to be done
when a large cumulative ACK comes. Try to return back to pre-split
state already while more and more SACK info gets discovered by
combining newly discovered SACK areas with the previous skb if
that's SACKed as well.

This approach has a number of benefits:

1) The processing overhead is spread more equally over the RTT
2) Write queue has less skbs to process (affect everything
   which has to walk in the queue past the sacked areas)
3) Write queue is consistent whole the time, so no other parts
   of TCP has to be aware of this (this was not the case with
   some other approach that was, well, quite intrusive all
   around).
4) Clean_rtx_queue can release most of the pages using single
   put_page instead of previous PAGE_SIZE/mss+1 calls

In case a hole is fully filled by the new SACK block, we attempt
to combine the next skb too which allows construction of skbs
that are even larger than what tso split them to and it handles
hole per on every nth patterns that often occur during slow start
overshoot pretty nicely. Though this to be really useful also
a retransmission would have to get lost since cumulative ACKs
advance one hole at a time in the most typical case.

TODO: handle upwards only merging. That should be rather easy
when segment is fully sacked but I'm leaving that as future
work item (it won't make very large difference anyway since
this current approach already covers quite a lot of normal
cases).

I was earlier thinking of some sophisticated way of tracking
timestamps of the first and the last segment but later on
realized that it won't be that necessary at all to store the
timestamp of the last segment. The cases that can occur are
basically either:
  1) ambiguous => no sensible measurement can be taken anyway
  2) non-ambiguous is due to reordering => having the timestamp
     of the last segment there is just skewing things more off
     than does some good since the ack got triggered by one of
     the holes (besides some substle issues that would make
     determining right hole/skb even harder problem). Anyway,
     it has nothing to do with this change then.

I choose to route some abnormal looking cases with goto noop,
some could be handled differently (eg., by stopping the
walking at that skb but again). In general, they either
shouldn't happen at all or are rare enough to make no difference
in practice.

In theory this change (as whole) could cause some macroscale
regression (global) because of cache misses that are taken over
the round-trip time but it gets very likely better because of much
less (local) cache misses per other write queue walkers and the
big recovery clearing cumulative ack.

Worth to note that these benefits would be very easy to get also
without TSO/GSO being on as long as the data is in pages so that
we can merge them. Currently I won't let that happen because
DSACK splitting at fragment that would mess up pcounts due to
sk_can_gso in tcp_set_skb_tso_segs. Once DSACKs fragments gets
avoided, we have some conditions that can be made less strict.

TODO: I will probably have to convert the excessive pointer
passing to struct sacktag_state... :-)

My testing revealed that considerable amount of skbs couldn't
be shifted because they were cloned (most likely still awaiting
tx reclaim)...

[The rest is considering future work instead since I got
repeatably EFAULT to tcpdump's recvfrom when I added
pskb_expand_head to deal with clones, so I separated that
into another, later patch]

...To counter that, I gave up on the fifth advantage:

5) When growing previous SACK block, less allocs for new skbs
   are done, basically a new alloc is needed only when new hole
   is detected and when the previous skb runs out of frags space

...which now only happens of if reclaim is fast enough to dispose
the clone before the SACK block comes in (the window is RTT long),
otherwise we'll have to alloc some.

With clones being handled I got these numbers (will be somewhat
worse without that), taken with fine-grained mibs:

                  TCPSackShifted 398
                   TCPSackMerged 877
            TCPSackShiftFallback 320
      TCPSACKCOLLAPSEFALLBACKGSO 0
  TCPSACKCOLLAPSEFALLBACKSKBBITS 0
  TCPSACKCOLLAPSEFALLBACKSKBDATA 0
    TCPSACKCOLLAPSEFALLBACKBELOW 0
    TCPSACKCOLLAPSEFALLBACKFIRST 1
 TCPSACKCOLLAPSEFALLBACKPREVBITS 318
      TCPSACKCOLLAPSEFALLBACKMSS 1
   TCPSACKCOLLAPSEFALLBACKNOHEAD 0
    TCPSACKCOLLAPSEFALLBACKSHIFT 0
          TCPSACKCOLLAPSENOOPSEQ 0
  TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
     TCPSACKCOLLAPSENOOPSMALLLEN 0
             TCPSACKCOLLAPSEHOLE 12

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:20:15 -08:00
Ilpo Järvinen
f58b22fd3c tcp: make tcp_sacktag_one able to handle partial skb too
This is preparatory work for SACK combiner patch which may
have to count TCP state changes for only a part of the skb
because it will intentionally avoids splitting skb to SACKed
and not sacked parts.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:14:43 -08:00
Ilpo Järvinen
adb92db857 tcp: Make SACK code to split only at mss boundaries
Sadly enough, this adds possible divide though we try to avoid
it by checking one mss as common case.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:13:50 -08:00
Ilpo Järvinen
e8bae275d9 tcp: more aggressive skipping
I knew already when rewriting the sacktag that this condition
was too conservative, change it now since it prevent lot of
useless work (especially in the sack shifter decision code
that is being added by a later patch). This shouldn't change
anything really, just save some processing regardless of the
shifter.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:12:28 -08:00
Ilpo Järvinen
e1aa680fa4 tcp: move tcp_simple_retransmit to tcp_input
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:11:55 -08:00
Ilpo Järvinen
4a17fc3add tcp: collapse more than two on retransmission
I always had thought that collapsing up to two at a time was
intentional decision to avoid excessive processing if 1 byte
sized skbs are to be combined for a full mtu, and consecutive
retransmissions would make the size of the retransmittee
double each round anyway, but some recent discussion made me
to understand that was not the case. Thus make collapse work
more and wait less.

It would be possible to take advantage of the shifting
machinery (added in the later patch) in the case of paged
data but that can be implemented on top of this change.

tcp_skb_is_last check is now provided by the loop.

I tested a bit (ss-after-idle-off, fill 4096x4096B xfer,
10s sleep + 4096 x 1byte writes while dropping them for
some a while with netem):

. 16774097:16775545(1448) ack 1 win 46
. 16775545:16776993(1448) ack 1 win 46
. ack 16759617 win 2399
P 16776993:16777217(224) ack 1 win 46
. ack 16762513 win 2399
. ack 16765409 win 2399
. ack 16768305 win 2399
. ack 16771201 win 2399
. ack 16774097 win 2399
. ack 16776993 win 2399
. ack 16777217 win 2399
P 16777217:16777257(40) ack 1 win 46
. ack 16777257 win 2399
P 16777257:16778705(1448) ack 1 win 46
P 16778705:16780153(1448) ack 1 win 46
FP 16780153:16781313(1160) ack 1 win 46
. ack 16778705 win 2399
. ack 16780153 win 2399
F 1:1(0) ack 16781314 win 2399

While without drop-all period I get this:

. 16773585:16775033(1448) ack 1 win 46
. ack 16764897 win 9367
. ack 16767793 win 9367
. ack 16770689 win 9367
. ack 16773585 win 9367
. 16775033:16776481(1448) ack 1 win 46
P 16776481:16777217(736) ack 1 win 46
. ack 16776481 win 9367
. ack 16777217 win 9367
P 16777217:16777218(1) ack 1 win 46
P 16777218:16777219(1) ack 1 win 46
P 16777219:16777220(1) ack 1 win 46
  ...
P 16777247:16777248(1) ack 1 win 46
. ack 16777218 win 9367
. ack 16777219 win 9367
  ...
. ack 16777233 win 9367
. ack 16777248 win 9367
P 16777248:16778696(1448) ack 1 win 46
P 16778696:16780144(1448) ack 1 win 46
FP 16780144:16781313(1169) ack 1 win 46
. ack 16780144 win 9367
F 1:1(0) ack 16781314 win 9367

The window seems to be 30-40 segments, which were successfully
combined into: P 16777217:16777257(40) ack 1 win 46

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24 21:03:43 -08:00