48957 Commits

Author SHA1 Message Date
Colin Ian King
03ac738d5c rtnetlink: fix missing size for IFLA_IF_NETNSID
The size for IFLA_IF_NETNSID is missing from the size calculation
because the preceding semicolon was not removed. Fix this by removing
the semicolon.
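
The bug class can be sketched in isolation. The sizes and function names
below are hypothetical and are not the kernel's size calculation; they only
show how a stray semicolon turns the last term into dead code.

```c
#include <stddef.h>

/* Hypothetical attribute sizes, for illustration only. */
enum { SIZE_A = 4, SIZE_B = 8, SIZE_C = 4 };

/*
 * Buggy: the ';' after SIZE_B ends the return statement, so "+ SIZE_C;"
 * becomes a separate expression statement that is never reached.
 */
size_t demo_msg_size_buggy(void)
{
	return SIZE_A
	       + SIZE_B;
	       + SIZE_C;
}

/* Fixed: a single expression, so every term contributes to the size. */
size_t demo_msg_size_fixed(void)
{
	return SIZE_A
	       + SIZE_B
	       + SIZE_C;
}
```

Compilers may only warn about the buggy variant (statement with no effect),
which is why a static analyzer report is a typical way to find it.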

Detected by CoverityScan, CID#1461135 ("Structurally dead code")

Fixes: 79e1ad148c84 ("rtnetlink: use netnsid to query interface")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 13:46:25 +09:00
Egil Hjelmeland
92f25cafe8 net: dsa: lan9303: Adjust indenting
Remove scripts/checkpatch.pl CHECKs by adjusting indenting.

Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 13:29:06 +09:00
Nogah Frankel
8521db4c7e net_sch: cbs: Change TC_SETUP_CBS to TC_SETUP_QDISC_CBS
Change TC_SETUP_CBS to TC_SETUP_QDISC_CBS to match the new convention.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 12:23:38 +09:00
Nogah Frankel
575ed7d39e net_sch: mqprio: Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO
Change TC_SETUP_MQPRIO to TC_SETUP_QDISC_MQPRIO to match the new
convention.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 12:23:38 +09:00
Nogah Frankel
602f3baf22 net_sch: red: Add offload ability to RED qdisc
Add the ability to offload RED qdisc by using ndo_setup_tc.
There are four commands for RED offloading:
* TC_RED_SET: handles set and change.
* TC_RED_DESTROY: handles qdisc destroy.
* TC_RED_STATS: updates the qdisc's counters (given as reference).
* TC_RED_XSTAT: returns RED xstats.

Whether RED is offloaded is determined every time the dump action is
called, because a parent change of this qdisc could change its offload
state without requiring any RED function to be called.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 12:23:37 +09:00
Tom Herbert
fddb231ebe ila: Add a hook type for LWT routes
In LWT tunnels both an input and an output route method are defined.
If both of these are executed in the same path then double translation
happens and the effect is not correct.

This patch adds a new attribute that indicates the hook type. Two
values are defined for route input and route output. ILA
translation is only done for the one that is set. The default is
to enable ILA on route output.
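
A minimal sketch of the idea, assuming a per-route hook setting. All names
here (demo_hook, demo_pkt, demo_route_input, demo_route_output) are
hypothetical and are not the kernel's ILA API.

```c
#include <stdint.h>

/* Hypothetical hook identifiers, modeled on the description above. */
enum demo_hook { DEMO_HOOK_ROUTE_OUTPUT, DEMO_HOOK_ROUTE_INPUT };

struct demo_pkt {
	uint64_t locator;          /* upper 64 bits of the address */
	unsigned int translations; /* how often the locator was rewritten */
};

static void demo_translate(struct demo_pkt *p, uint64_t new_locator)
{
	p->locator = new_locator;
	p->translations++;
}

/* Both route methods may run on the same path; only the configured one translates. */
void demo_route_input(struct demo_pkt *p, enum demo_hook configured,
		      uint64_t new_locator)
{
	if (configured == DEMO_HOOK_ROUTE_INPUT)
		demo_translate(p, new_locator);
}

void demo_route_output(struct demo_pkt *p, enum demo_hook configured,
		       uint64_t new_locator)
{
	if (configured == DEMO_HOOK_ROUTE_OUTPUT)
		demo_translate(p, new_locator);
}
```

With the hook set to output, a packet that passes through both methods is
translated exactly once.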

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 11:20:49 +09:00
Tom Herbert
70d5aef48a ila: allow configuration of identifier type
Allow identifier to be explicitly configured for a mapping.
This can either be one of the identifier types specified in the
ILA draft or a value of ILA_ATYPE_USE_FORMAT which means the
identifier type is inferred from the identifier type field.
If a value other than ILA_ATYPE_USE_FORMAT is set for a
mapping then it is assumed that the identifier type field is
not present in an identifier.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 11:20:48 +09:00
Tom Herbert
84287bb328 ila: add checksum neutral map auto
Add checksum neutral auto that performs checksum neutral mapping
without using the C-bit. This is enabled by configuration of
a mapping.

The checksum neutral function has been split into
ila_csum_do_neutral_fmt and ila_csum_do_neutral_nofmt. The former
handles the C-bit and includes it in the adjustment value. The latter
just sets the adjustment value on the locator diff only.
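
As a rough illustration of checksum-neutral mapping on the locator diff
only, here is a user-space sketch. It is not the kernel's
ila_csum_do_neutral_nofmt; the function names, the choice of the last
16-bit word as the adjustment field, and the word layout are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
uint16_t demo_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement sum over n 16-bit words. */
uint16_t demo_csum16(const uint16_t *w, size_t n)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += w[i];
	return demo_fold(sum);
}

/*
 * addr holds an IPv6 address as 8 words: words 0..3 are the locator,
 * words 4..7 the identifier. The locator is replaced and the last
 * identifier word absorbs the difference, so the ones' complement sum
 * over the whole address (and thus a transport checksum) is unchanged.
 */
void demo_rewrite_locator_neutral(uint16_t addr[8], const uint16_t new_loc[4])
{
	uint16_t old_sum = demo_csum16(addr, 4);
	uint16_t new_sum = demo_csum16(new_loc, 4);
	/* old_sum - new_sum in ones' complement arithmetic */
	uint16_t diff = demo_fold((uint32_t)old_sum + (uint16_t)~new_sum);

	for (size_t i = 0; i < 4; i++)
		addr[i] = new_loc[i];
	addr[7] = demo_fold((uint32_t)addr[7] + diff);
}
```

The variant that handles the C-bit would additionally fold that bit into
the adjustment value; that part is left out here.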

Added configuration for checksum neutral map auto in ila_lwt
and ila_xlat.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 11:20:48 +09:00
Tom Herbert
80661e7687 ila: cleanup checksum diff
Consolidate computing checksum diff into one function.

Add get_csum_diff_iaddr that computes the checksum diff between
an address argument and locator being written. get_csum_diff
calls this using the destination address in the IP header as
the argument.

Also moved ila_init_saved_csum to be close to the checksum
diff functions.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-08 11:20:48 +09:00
Trond Myklebust
22700f3c6d SUNRPC: Improve ordering of transport processing
Since it can take a while before a specific thread gets scheduled, it
is better to just implement a first come first served queue mechanism.
That way, if a thread is already scheduled and is idle, it can pick up
the work to do from the queue.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-07 16:44:03 -05:00
Chuck Lever
77a08867a6 svcrdma: Enqueue after setting XPT_CLOSE in completion handlers
I noticed the server was sometimes not closing the connection after
a flushed Send. For example, if the client responds with an RNR NAK
to a Reply from the server, that client might be deadlocked, and
thus wouldn't send any more traffic. Thus the server wouldn't have
any opportunity to notice the XPT_CLOSE bit has been set.

Enqueue the transport so that svcxprt notices the bit even if there
is no more transport activity after a flushed completion, QP access
error, or device removal event.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-By: Devesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-07 16:44:02 -05:00
J. Bruce Fields
1754eb2b27 rpc: remove some BUG()s
It would be kinder to WARN() and recover in several spots here instead
of BUG()ing.

Also, it looks like the read_u32_from_xdr_buf() call could actually
fail, though it might require a broken (or malicious) client, so convert
that to just an error return.
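
The "error return instead of BUG()" pattern can be sketched in user space
as follows. demo_read_u32 is a hypothetical helper, not the kernel's
read_u32_from_xdr_buf(); it only shows a bounds-checked read that reports
failure to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Read a big-endian 32-bit value at offset 'off' of a buffer of 'len'
 * bytes. Returns 0 on success, or -1 if fewer than 4 bytes are available,
 * so that a short (possibly malicious) message leads to an error path
 * rather than a crash.
 */
int demo_read_u32(const uint8_t *buf, size_t len, size_t off, uint32_t *out)
{
	if (off > len || len - off < 4)
		return -1;

	*out = ((uint32_t)buf[off] << 24) |
	       ((uint32_t)buf[off + 1] << 16) |
	       ((uint32_t)buf[off + 2] << 8) |
	       (uint32_t)buf[off + 3];
	return 0;
}
```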

Reported-by: Weston Andros Adamson <dros@monkey.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-07 16:44:01 -05:00
Chuck Lever
0bad47cada svcrdma: Preserve CB send buffer across retransmits
During each NFSv4 callback Call, an RDMA Send completion frees the
page that contains the RPC Call message. If the upper layer
determines that a retransmit is necessary, this is too soon.

One possible symptom: after a GARBAGE_ARGS response to an NFSv4.1
callback request, the following BUG fires on the NFS server:

kernel: BUG: Bad page state in process kworker/0:2H  pfn:7d3ce2
kernel: page:ffffea001f4f3880 count:-2 mapcount:0 mapping:          (null) index:0x0
kernel: flags: 0x2fffff80000000()
kernel: raw: 002fffff80000000 0000000000000000 0000000000000000 fffffffeffffffff
kernel: raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
kernel: page dumped because: nonzero _refcount
kernel: Modules linked in: cts rpcsec_gss_krb5 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
ocfs2_nodemanager ocfs2_stackglue rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc iTCO_wdt
iTCO_vendor_support aesni_intel crypto_simd glue_helper cryptd pcspkr lpc_ich i2c_i801
mei_me mfd_core mei raid0 sg wmi ioatdma ipmi_si ipmi_devintf ipmi_msghandler shpchp
acpi_power_meter acpi_pad nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs
libcrc32c mlx4_en mlx4_ib mlx5_ib ib_core sd_mod sr_mod cdrom ast drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci crc32c_intel libahci drm
mlx5_core igb libata mlx4_core dca i2c_algo_bit i2c_core nvme
kernel: ptp nvme_core pps_core dm_mirror dm_region_hash dm_log dm_mod dax
kernel: CPU: 0 PID: 11495 Comm: kworker/0:2H Not tainted 4.14.0-rc3-00001-g577ce48 #811
kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
kernel: Call Trace:
kernel: dump_stack+0x62/0x80
kernel: bad_page+0xfe/0x11a
kernel: free_pages_check_bad+0x76/0x78
kernel: free_pcppages_bulk+0x364/0x441
kernel: ? ttwu_do_activate.isra.61+0x71/0x78
kernel: free_hot_cold_page+0x1c5/0x202
kernel: __put_page+0x2c/0x36
kernel: svc_rdma_put_context+0xd9/0xe4 [rpcrdma]
kernel: svc_rdma_wc_send+0x50/0x98 [rpcrdma]

This issue exists all the way back to v4.5, but refactoring and code
re-organization prevents this simple patch from applying to kernels
older than v4.12. The fix is the same, however, if someone needs to
backport it.

Reported-by: Ben Coddington <bcodding@redhat.com>
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=314
Fixes: 5d252f90a800 ('svcrdma: Add class for RDMA backwards ... ')
Cc: stable@vger.kernel.org # v4.12
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-07 16:44:00 -05:00
Colin Ian King
da36e6dbf4 sunrpc: make function _svc_create_xprt static
The function _svc_create_xprt is local to the source and
does not need to be in global scope, so make it static.

Cleans up sparse warning:
symbol '_svc_create_xprt' was not declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-07 16:43:57 -05:00
Ingo Molnar
8c5db92a70 Merge branch 'linus' into locking/core, to resolve conflicts
Conflicts:
	include/linux/compiler-clang.h
	include/linux/compiler-gcc.h
	include/linux/compiler-intel.h
	include/uapi/linux/stddef.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-07 10:32:44 +01:00
Eric Dumazet
fffcefe967 ipv6: addrconf: fix a lockdep splat
Fixes a case where GFP_ATOMIC allocation must be used instead of
GFP_KERNEL one.

[   54.891146]  lock_acquire+0xb3/0x2f0
[   54.891153]  ? fs_reclaim_acquire.part.60+0x5/0x30
[   54.891165]  fs_reclaim_acquire.part.60+0x29/0x30
[   54.891170]  ? fs_reclaim_acquire.part.60+0x5/0x30
[   54.891178]  kmem_cache_alloc_trace+0x3f/0x500
[   54.891186]  ? cyc2ns_read_end+0x1e/0x30
[   54.891196]  ipv6_add_addr+0x15a/0xc30
[   54.891217]  ? ipv6_create_tempaddr+0x2ea/0x5d0
[   54.891223]  ipv6_create_tempaddr+0x2ea/0x5d0
[   54.891238]  ? manage_tempaddrs+0x195/0x220
[   54.891249]  ? addrconf_prefix_rcv_add_addr+0x1c0/0x4f0
[   54.891255]  addrconf_prefix_rcv_add_addr+0x1c0/0x4f0
[   54.891268]  addrconf_prefix_rcv+0x2e5/0x9b0
[   54.891279]  ? neigh_update+0x446/0xb90
[   54.891298]  ? ndisc_router_discovery+0x5ab/0xf00
[   54.891303]  ndisc_router_discovery+0x5ab/0xf00
[   54.891311]  ? retint_kernel+0x2d/0x2d
[   54.891331]  ndisc_rcv+0x1b6/0x270
[   54.891340]  icmpv6_rcv+0x6aa/0x9f0
[   54.891345]  ? ipv6_chk_mcast_addr+0x176/0x530
[   54.891351]  ? do_csum+0x17b/0x260
[   54.891360]  ip6_input_finish+0x194/0xb20
[   54.891372]  ip6_input+0x5b/0x2c0
[   54.891380]  ? ip6_rcv_finish+0x320/0x320
[   54.891389]  ip6_mc_input+0x15a/0x250
[   54.891396]  ipv6_rcv+0x772/0x1050
[   54.891403]  ? consume_skb+0xbe/0x2d0
[   54.891412]  ? ip6_make_skb+0x2a0/0x2a0
[   54.891418]  ? ip6_input+0x2c0/0x2c0
[   54.891425]  __netif_receive_skb_core+0xa0f/0x1600
[   54.891436]  ? process_backlog+0xac/0x400
[   54.891441]  process_backlog+0xfa/0x400
[   54.891448]  ? net_rx_action+0x145/0x1130
[   54.891456]  net_rx_action+0x310/0x1130
[   54.891524]  ? RTUSBBulkReceive+0x11d/0x190 [mt7610u_sta]
[   54.891538]  __do_softirq+0x140/0xaba
[   54.891553]  irq_exit+0x10b/0x160
[   54.891561]  do_IRQ+0xbb/0x1b0

Fixes: f3d9832e56c4 ("ipv6: addrconf: cleanup locking in ipv6_add_addr")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-07 10:36:49 +09:00
Pablo Neira Ayuso
ba0e4d9917 netfilter: nf_tables: get set elements via netlink
This patch adds a new get operation to look up specific elements in
a set via netlink interface. You can also use it to check if an interval
already exists.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-07 01:00:31 +01:00
Pablo Neira Ayuso
644e334eee netfilter: nf_tables: performance set policy skips size description in selection
Use the complexity and space notations if the policy is performance. This
results in placing the bitmap set representation over the hashtable for
key <= 16 for better performance, as we discussed during the last NFWS in
Faro, Portugal.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-07 01:00:30 +01:00
Vincent Guittot
0984d427c1 netfilter: conntrack: use power efficient workqueue
conntrack uses the bound system_long_wq workqueue for work items that
don't have to run on the CPU they were queued on.
Using a bound workqueue prevents the scheduler from making smart decisions
about the best place to schedule the work.

This patch replaces system_long_wq with system_power_efficient_wq. The work
stays bound to a CPU by default unless CONFIG_WQ_POWER_EFFICIENT is
enabled. In the latter case, the work can be scheduled on the best CPU from
a power or a performance point of view.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-07 00:59:10 +01:00
Pablo Neira Ayuso
7e35ec0e80 netfilter: conntrack: move nf_ct_netns_{get,put}() to core
So we can call this from other expressions that need conntrack in place
to work.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
2017-11-06 16:48:39 +01:00
Florian Westphal
5caaed151a netfilter: conntrack: don't cache nlattr_tuple_size result in nla_size
We currently call ->nlattr_tuple_size() once at register time and
cache result in l4proto->nla_size.

nla_size is the only member that is written to; avoiding this would
allow making l4proto trackers const.

We can use ->nlattr_tuple_size() at run time, and cache result in
the individual trackers instead.

This is an intermediate step; the next patch removes the nlattr_size()
callback and computes size at compile time, then removes nla_size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 16:48:38 +01:00
Florian Westphal
7f4dae2d7f netfilter: nft_hash: fix nft_hash_deactivate
Jindřich Makovička says:
  The logical OR looks fishy to me. Shouldn't be && there instead?
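
The difference between the two operators can be sketched with a small
lookup. The exact condition in nft_hash_deactivate is not quoted in this
message, so the struct and function names below are hypothetical and only
illustrate the bug class.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_elem {
	int key;
	bool active;
};

/* Buggy: with ||, the first active element matches regardless of its key. */
const struct demo_elem *demo_find_buggy(const struct demo_elem *e, size_t n,
					int key)
{
	for (size_t i = 0; i < n; i++)
		if (e[i].key == key || e[i].active)
			return &e[i];
	return NULL;
}

/* Fixed: an element matches only if the key is equal AND it is active. */
const struct demo_elem *demo_find_fixed(const struct demo_elem *e, size_t n,
					int key)
{
	for (size_t i = 0; i < n; i++)
		if (e[i].key == key && e[i].active)
			return &e[i];
	return NULL;
}
```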

Link: https://bugzilla.netfilter.org/show_bug.cgi?id=1199
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 16:48:37 +01:00
Florian Westphal
b1fc1372c4 netfilter: xt_connlimit: remove mask argument
Instead of passing mask to all the helpers, just fixup the search key
early.

After rbtree conversion, each rbtree node stores connections of same
'addr & mask', so no need to pass the mask too.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 14:47:30 +01:00
Colin Ian King
9912156c2e netfilter: ebtables: clean up initialization of buf
buf is initialized to buf_start and then set on the next statement
to buf_start + offsets[i].  Clean this up to just initialize buf
to buf_start + offsets[i] to clean up the clang build warning:
"Value stored to 'buf' during its initialization is never read"

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 14:47:29 +01:00
KUWAZAWA Takuya
c5504f724c netfilter: ipvs: Fix inappropriate output of procfs
Information about ipvs in a different network namespace can be seen via procfs.

How to reproduce:

  # ip netns add ns01
  # ip netns add ns02
  # ip netns exec ns01 ip a add dev lo 127.0.0.1/8
  # ip netns exec ns02 ip a add dev lo 127.0.0.1/8
  # ip netns exec ns01 ipvsadm -A -t 10.1.1.1:80
  # ip netns exec ns02 ipvsadm -A -t 10.1.1.2:80

ipvsadm displays information about its own network namespace only.

  # ip netns exec ns01 ipvsadm -Ln
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
  TCP  10.1.1.1:80 wlc

  # ip netns exec ns02 ipvsadm -Ln
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
  TCP  10.1.1.2:80 wlc

But I can see information about other network namespaces via procfs.

  # ip netns exec ns01 cat /proc/net/ip_vs
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port Forward Weight ActiveConn InActConn
  TCP  0A010101:0050 wlc
  TCP  0A010102:0050 wlc

  # ip netns exec ns02 cat /proc/net/ip_vs
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port Forward Weight ActiveConn InActConn
  TCP  0A010102:0050 wlc
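
For readers comparing the two listings: the procfs lines show address and
port in hexadecimal, so 0A010101:0050 corresponds to 10.1.1.1:80. A small
sketch of that formatting follows; demo_format_service is a hypothetical
helper, not the ipvs code.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format an IPv4 address (given in host byte order) and a port the way
 * the /proc/net/ip_vs lines above appear: 8 hex digits, ':', 4 hex digits.
 * Returns the number of characters snprintf() would have written.
 */
int demo_format_service(char *out, size_t outlen, uint32_t addr, uint16_t port)
{
	return snprintf(out, outlen, "%08X:%04X",
			(unsigned int)addr, (unsigned int)port);
}
```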

Signed-off-by: KUWAZAWA Takuya <albatross0@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 14:47:22 +01:00
Helge Deller
c5cc0c6971 netfilter: ipvs: Use %pS printk format for direct addresses
The debug and error printk functions in ipvs wrongly use %pF instead of
the %pS printk format specifier for printing symbols for the address returned
by __builtin_return_address(0). Fix it for the ia64, ppc64 and parisc64
architectures.

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Cc: lvs-devel@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-11-06 14:44:20 +01:00
Allen Pais
4b519bb493 NFC: Convert timers to use timer_setup()
Switch to using the new timer_setup() and from_timer()
for net/nfc/*

Signed-off-by: Allen Pais <allen.pais@oracle.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-11-06 01:12:10 +01:00
Johan Hovold
c45e3e4c5b NFC: fix device-allocation error return
A recent change fixing NFC device allocation itself introduced an
error-handling bug by returning an error pointer in case device-id
allocation failed. This is clearly broken as the callers still expected
NULL to be returned on errors as detected by Dan's static checker.

Fix this up by returning NULL in the event that we've run out of memory
when allocating a new device id.
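
Why an error pointer slips past a NULL check can be sketched in user
space. demo_err_ptr is a minimal stand-in for the kernel's ERR_PTR(), and
the allocator and caller names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ENOMEM 12

static int demo_device; /* stands in for a successfully allocated device */

/* Encode a negative errno value in a pointer, like the kernel's ERR_PTR(). */
static void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Buggy: reports the id allocation failure as an error pointer. */
void *demo_alloc_buggy(int id_alloc_fails)
{
	if (id_alloc_fails)
		return demo_err_ptr(-DEMO_ENOMEM);
	return &demo_device;
}

/* Fixed: reports the failure as NULL, which is what callers test for. */
void *demo_alloc_fixed(int id_alloc_fails)
{
	if (id_alloc_fails)
		return NULL;
	return &demo_device;
}

/* Callers only check for NULL, so an error pointer looks like success. */
int demo_caller_sees_error(const void *dev)
{
	return dev == NULL;
}
```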

Note that the offending commit is marked for stable (3.8) so this fix
needs to be backported along with it.

Fixes: 20777bc57c34 ("NFC: fix broken device allocation")
Cc: stable <stable@vger.kernel.org>	# 3.8
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2017-11-06 00:53:39 +01:00
Priyaranjan Jha
d09b9e60e0 tcp: fix DSACK-based undo on non-duplicate ACK
Fixes DSACK-based undo when sender is in Open State and
an ACK advances snd_una.

Example scenario:
- Sender goes into recovery and makes some spurious rtx.
- It comes out of recovery and enters into open state.
- It sends some more packets, let's say 4.
- The receiver sends an ACK for the first two, but this ACK is lost.
- The sender receives ack for first two, and DSACK for previous
  spurious rtx.

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 23:16:50 +09:00
Priyaranjan Jha
1f2556916d tcp: higher throughput under reordering with adaptive RACK reordering wnd
Currently TCP RACK loss detection does not work well if packets are
being reordered beyond its static reordering window (min_rtt/4). Under
such reordering it may falsely trigger loss recoveries and reduce TCP
throughput significantly.

This patch improves that by increasing and reducing the reordering
window based on DSACK, which is now supported in major TCP implementations.
It makes RACK's reo_wnd adaptive based on DSACK and no. of recoveries.

- If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
  by srtt), since there is a possibility that the spurious retransmission
  was due to a reordering delay longer than reo_wnd.

- Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
  no. of successful recoveries (accounts for full DSACK-based loss
  recovery undo). After that, reset it to default (min_rtt/4).

- At most, reo_wnd is incremented once per RTT, so that the new
  DSACK we are reacting to is (approximately) due to a spurious retx
  sent after reo_wnd was last updated.

- reo_wnd is tracked in terms of steps (of min_rtt/4), rather than
  absolute value to account for change in rtt.

In our internal testing, we observed significant increase in throughput,
in scenarios where reordering exceeds min_rtt/4 (previous static value).
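
The adaptation rules listed above can be sketched as a small state
machine. This is a simplified user-space illustration, not the kernel
implementation: the struct and function names are hypothetical, and the
once-per-RTT gating of DSACK reactions is left out.

```c
#include <stdint.h>

/* Mirrors TCP_RACK_RECOVERY_THRESH (16) from the description above. */
#define DEMO_RECOVERY_THRESH 16

struct demo_rack {
	uint32_t reo_wnd_steps; /* window in units of min_rtt/4, default 1 */
	uint32_t persist;       /* recoveries left before resetting to default */
};

void demo_rack_init(struct demo_rack *r)
{
	r->reo_wnd_steps = 1;
	r->persist = 0;
}

/* A DSACK hints at a spurious retransmit: widen by one step and re-arm. */
void demo_rack_on_dsack(struct demo_rack *r)
{
	r->reo_wnd_steps++;
	r->persist = DEMO_RECOVERY_THRESH;
}

/* After each successful recovery count down; at zero, reset to default. */
void demo_rack_on_recovery_done(struct demo_rack *r)
{
	if (r->persist > 0 && --r->persist == 0)
		r->reo_wnd_steps = 1;
}

/* Steps are converted using the current min_rtt and capped at srtt. */
uint32_t demo_rack_reo_wnd(const struct demo_rack *r, uint32_t min_rtt_us,
			   uint32_t srtt_us)
{
	uint64_t wnd = (uint64_t)(min_rtt_us / 4) * r->reo_wnd_steps;

	return wnd < srtt_us ? (uint32_t)wnd : srtt_us;
}
```

Tracking steps rather than an absolute value means the window follows
min_rtt automatically when the RTT estimate changes.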

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 23:15:42 +09:00
Vivien Didelot
7354fcb0a3 net: dsa: resolve tagging protocol at parse time
Extend the dsa_port_parse_cpu() function to resolve the tagging protocol
at port parsing time, instead of waiting for the whole tree to be
complete.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:39 +09:00
Vivien Didelot
06e24d0868 net: dsa: add one port parsing function per type
Add dsa_port_parse_user, dsa_port_parse_dsa and dsa_port_parse_cpu
functions to factorize the code shared by both OF and pdata parsing.

They don't do much for the moment but will be extended later to support
tagging protocol resolution for example.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:39 +09:00
Vivien Didelot
54df6fa954 net: dsa: only check presence of link property
When parsing a port, simply use of_property_read_bool which checks the
presence of a given property, instead of parsing the link phandle.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:39 +09:00
Vivien Didelot
975e6e3221 net: dsa: rework switch parsing
When parsing a switch, we have to identify to which tree it belongs and
parse its ports. Provide two functions to separate the OF and platform
data specific paths.

Also use the of_property_read_variable_u32_array function to parse the
OF member array instead of calling of_property_read_u32_index twice.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:39 +09:00
Vivien Didelot
0eefe2c173 net: dsa: get tree before parsing ports
We will need a reference to the dsa_switch_tree when parsing a CPU port,
so fetch it right after parsing the member and before parsing ports.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:39 +09:00
Vivien Didelot
6da2a940ac net: dsa: rework switch addition and removal
This patch removes the unnecessary index argument from the
dsa_dst_add_ds and dsa_dst_del_ds functions and renames them to
dsa_tree_add_switch and dsa_tree_remove_switch respectively.

In addition to a more explicit scope, we now check the presence of an
existing switch with the same index directly within dsa_tree_add_switch.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:39 +09:00
Vivien Didelot
1ca28ec9ab net: dsa: provide a find or new tree helper
Rename dsa_get_dst to dsa_tree_find since it doesn't increment the
reference counter, rename dsa_add_dst to dsa_tree_alloc for symmetry
with dsa_tree_free, and provide a convenient dsa_tree_touch function to
find or allocate a new tree.
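
The find-or-allocate pattern can be sketched in user space as below. The
demo_ names are hypothetical and only loosely modeled on the helpers named
above; locking and freeing are omitted.

```c
#include <stdlib.h>

struct demo_tree {
	unsigned int index;
	struct demo_tree *next;
};

static struct demo_tree *demo_tree_list;

/* Look up a tree by index without taking a reference. */
struct demo_tree *demo_tree_find(unsigned int index)
{
	for (struct demo_tree *t = demo_tree_list; t; t = t->next)
		if (t->index == index)
			return t;
	return NULL;
}

/* Allocate a new tree and link it into the list; NULL on allocation failure. */
struct demo_tree *demo_tree_alloc(unsigned int index)
{
	struct demo_tree *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	t->index = index;
	t->next = demo_tree_list;
	demo_tree_list = t;
	return t;
}

/* Return the existing tree with this index, or allocate a new one. */
struct demo_tree *demo_tree_touch(unsigned int index)
{
	struct demo_tree *t = demo_tree_find(index);

	return t ? t : demo_tree_alloc(index);
}
```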

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:39 +09:00
Vivien Didelot
65254108b4 net: dsa: get and put tree reference counting
Provide convenient dsa_tree_get and dsa_tree_put functions scoping a DSA
tree used to increment and decrement its reference counter, instead of
poking its kref structure directly.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:38 +09:00
Vivien Didelot
8e5bf9759a net: dsa: simplify tree reference counting
DSA trees have a refcount used to automatically free the dsa_switch_tree
structure once there are no switch devices inside of it.

The refcount is incremented when a switch is added to the tree, and
decremented when it is removed from it.

But because of kref_init, the refcount is also incremented at
initialization, and when looking up the tree from the list for symmetry.

Thus the current code stores the number of switches plus one, and makes
the switch registration more complex.

To simplify the switch registration function, we reset the refcount to
zero after initialization and don't increment it when looking up a tree.
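
The resulting invariant (refcount equals the number of switches in the
tree) can be sketched with a plain counter. This is a user-space
illustration with hypothetical names, not the kref-based kernel code.

```c
#include <stdbool.h>

struct demo_reftree {
	unsigned int refcount; /* number of switches currently in the tree */
	bool freed;            /* set when the last switch has been removed */
};

/* The count starts at zero: an empty tree holds no switch. */
void demo_reftree_init(struct demo_reftree *t)
{
	t->refcount = 0;
	t->freed = false;
}

void demo_reftree_get(struct demo_reftree *t)
{
	t->refcount++;
}

/* Dropping the last reference releases the tree. */
void demo_reftree_put(struct demo_reftree *t)
{
	if (--t->refcount == 0)
		t->freed = true;
}

/* Adding a switch takes one reference, removing it drops one. */
void demo_reftree_add_switch(struct demo_reftree *t)
{
	demo_reftree_get(t);
}

void demo_reftree_remove_switch(struct demo_reftree *t)
{
	demo_reftree_put(t);
}
```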

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:38 +09:00
Vivien Didelot
49463b7f2d net: dsa: make tree index unsigned
Similarly to a DSA switch and port, rename the tree index from "tree" to
"index" and make it an unsigned int because it isn't supposed to be less
than 0.

u32 is an OF-specific type used to retrieve the value and has no need to
be propagated up to the tree index.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:31:38 +09:00
Jakub Kicinski
b37a530613 bpf: remove old offload/analyzer
Thanks to the ability to load a program for a specific device,
running verifier twice is no longer needed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:20 +09:00
Jakub Kicinski
6c8dfe21c4 cls_bpf: allow attaching programs loaded for specific device
If TC program is loaded with skip_sw flag, we should allow
the device-specific programs to be accepted.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:19 +09:00
Jakub Kicinski
248f346ffe xdp: allow attaching programs loaded for specific device
Pass the netdev pointer to bpf_prog_get_type().  This way
BPF code can decide whether the device matches what the
code was loaded/translated for.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:19 +09:00
Jakub Kicinski
f4e63525ee net: bpf: rename ndo_xdp to ndo_bpf
ndo_xdp is a control path callback for setting up XDP in the
driver.  We can reuse it for other forms of communication
between the eBPF stack and the drivers.  Rename the callback
and associated structures and definitions.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:18 +09:00
Guillaume Nault
8f7dc9ae4a l2tp: don't use l2tp_tunnel_find() in l2tp_ip and l2tp_ip6
Using l2tp_tunnel_find() in l2tp_ip_recv() is wrong for two reasons:

  * It doesn't take a reference on the returned tunnel, which makes the
    call racy wrt. concurrent tunnel deletion.

  * The lookup is only based on the tunnel identifier, so it can return
    a tunnel that doesn't match the packet's addresses or protocol.

For example, a packet sent to an L2TPv3 over IPv6 tunnel can be
delivered to an L2TPv2 over UDPv4 tunnel. This is worse than a simple
cross-talk: when delivering the packet to an L2TP over UDP tunnel, the
corresponding socket is UDP, where ->sk_backlog_rcv() is NULL. Calling
sk_receive_skb() will then crash the kernel by trying to execute this
callback.

And l2tp_tunnel_find() isn't even needed here. __l2tp_ip_bind_lookup()
properly checks the socket binding and connection settings. It was used
as a fallback mechanism for finding tunnels that didn't have their data
path registered yet. But it's not limited to this case and can be used
to replace l2tp_tunnel_find() in the general case.

Fix l2tp_ip6 in the same way.

Fixes: 0d76751fad77 ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:22:15 +09:00
Eric Dumazet
35e00da36c tcp: do not clear again skb->csum in tcp_init_nondata_skb()
tcp_init_nondata_skb() is fed with freshly allocated skbs.
They already have a cleared csum field, no need to clear it again.

This is based on Neal's review of commit 3b11775033dc ("tcp: do not mangle
skb->cb[] in tcp_make_synack()"), noticing I did not clear skb->csum.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:14:54 +09:00
Eric Dumazet
d0f3684701 tcp: tcp_mtu_probing() cleanup
Reduce one indentation level to make code more readable.
tcp_sync_mss() can be factorized.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:14:23 +09:00
Jiri Benc
79e1ad148c rtnetlink: use netnsid to query interface
Currently, when an application gets netnsid from the kernel (for example as
the result of an RTM_GETLINK call on one end of a veth pair), it's not very
useful. There's no reliable way to get to the netns fd from the netnsid, nor
does any kernel API accept netnsid.

Extend the RTM_GETLINK call to also accept netnsid. It will operate on the
netns with the given netnsid in such case. Of course, the calling process
needs to have enough capabilities in the target name space; for now, require
CAP_NET_ADMIN. This can be relaxed in the future.

To signal to the calling process that the kernel understood the new
IFLA_IF_NETNSID attribute in the query, it will include it in the response.
This is needed to detect older kernels, as they will just ignore
IFLA_IF_NETNSID and query in the current name space.

This patch implements IFLA_IF_NETNSID only for get and dump. For set
operations, this can be extended later.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 21:49:17 +09:00
Jiri Benc
9354d45203 openvswitch: reliable interface identification in port dumps
This patch allows reliable identification of netdevice interfaces connected
to openvswitch bridges. In particular, user space queries the netdev
interfaces belonging to the ports for statistics, up/down state, etc.
Datapath dump needs to provide enough information for the user space to be
able to do that.

Currently, only interface names are returned. This is not sufficient, as
openvswitch allows its ports to be in different name spaces and the
interface name is valid only in its name space. What is needed and generally
used in other netlink APIs, is the pair ifindex+netnsid.

The solution is addition of the ifindex+netnsid pair (or only ifindex if in
the same name space) to vport get/dump operation.

On request side, ideally the ifindex+netnsid pair could be used to
get/set/del the corresponding vport. This is not implemented by this patch
and can be added later if needed.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 21:49:17 +09:00
Jiri Benc
7cbebc8a14 net: export peernet2id_alloc
It will be used by openvswitch.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 21:49:17 +09:00