45516 Commits

Author SHA1 Message Date
Ralf Baechle
4872e57c81 NET: Fix /proc/net/arp for AX.25
When sending ARP requests over AX.25 links the hwaddress in the neighbour
cache are not getting initialized.  For such an incomplete arp entry
ax2asc2 will generate an empty string resulting in /proc/net/arp output
like the following:

$ cat /proc/net/arp
IP address       HW type     Flags       HW address            Mask     Device
192.168.122.1    0x1         0x2         52:54:00:00:5d:5f     *        ens3
172.20.1.99      0x3         0x0              *        bpq0

The missing field will confuse the procfs parsing of arp(8) resulting in
incorrect output for the device such as the following:

$ arp
Address                  HWtype  HWaddress           Flags Mask            Iface
gateway                  ether   52:54:00:00:5d:5f   C                     ens3
172.20.1.99                      (incomplete)                              ens3

This changes the content of /proc/net/arp to:

$ cat /proc/net/arp
IP address       HW type     Flags       HW address            Mask     Device
172.20.1.99      0x3         0x0         *                     *        bpq0
192.168.122.1    0x1         0x2         52:54:00:00:5d:5f     *        ens3

To do so it change ax2asc to put the string "*" in buf for a NULL address
argument.  Finally the HW address field is left aligned in a 17 character
field (the length of an ethernet HW address in the usual hex notation) for
readability.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-13 22:15:03 -05:00
Matthew Wilcox
d3e709e63e idr: Return the deleted entry from idr_remove
It is a relatively common idiom (8 instances) to first look up an IDR
entry, and then remove it from the tree if it is found, possibly doing
further operations upon the entry afterwards.  If we change idr_remove()
to return the removed object, all of these users can save themselves a
walk of the IDR tree.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
2017-02-13 21:44:03 -05:00
Eric Dumazet
8b74d439e1 net/llc: avoid BUG_ON() in skb_orphan()
It seems nobody used LLC since linux-3.12.

Fortunately fuzzers like syzkaller still know how to run this code,
otherwise it would be no fun.

Setting skb->sk without skb->destructor leads to all kinds of
bugs, we now prefer to be very strict about it.

Ideally here we would use skb_set_owner() but this helper does not exist yet,
only CAN seems to have a private helper for that.

Fixes: 376c7311bdb6 ("net: add a temporary sanity check in skb_orphan()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-12 22:14:49 -05:00
David S. Miller
7c92d61eca Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree, most relevantly they are:

1) Extend nft_exthdr to allow to match TCP options bitfields, from
   Manuel Messner.

2) Allow to check if IPv6 extension header is present in nf_tables,
   from Phil Sutter.

3) Allow to set and match conntrack zone in nf_tables, patches from
   Florian Westphal.

4) Several patches for the nf_tables set infrastructure, this includes
   cleanup and preparatory patches to add the new bitmap set type.

5) Add optional ruleset generation ID check to nf_tables and allow to
   delete rules that got no public handle yet via NFTA_RULE_ID. These
   patches add the missing kernel infrastructure to support rule
   deletion by description from userspace.

6) Missing NFT_SET_OBJECT flag to select the right backend when sets
   stores an object map.

7) A couple of cleanups for the expectation and SIP helper, from Gao
   feng.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-12 22:11:43 -05:00
Pablo Neira Ayuso
7286ff7fde netfilter: nf_tables: honor NFT_SET_OBJECT in set backend selection
Check for NFT_SET_OBJECT feature flag, otherwise we may end up selecting
the wrong set backend.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-12 14:45:14 +01:00
Pablo Neira Ayuso
1a94e38d25 netfilter: nf_tables: add NFTA_RULE_ID attribute
This new attribute allows us to uniquely identify a rule in transaction.
Robots may trigger an insertion followed by deletion in a batch, in that
scenario we still don't have a public rule handle that we can use to
delete the rule. This is similar to the NFTA_SET_ID attribute that
allows us to refer to an anonymous set from a batch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-12 14:45:13 +01:00
Pablo Neira Ayuso
74e8bcd21c netfilter: nf_tables: add check_genid to the nfnetlink subsystem
This patch implements the check generation id as provided by nfnetlink.
This allows us to reject ruleset updates against stale baseline, so
userspace can retry update with a fresh ruleset cache.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-12 14:45:12 +01:00
Pablo Neira Ayuso
8c4d4e8b56 netfilter: nfnetlink: allow to check for generation ID
This patch allows userspace to specify the generation ID that has been
used to build an incremental batch update.

If userspace specifies the generation ID in the batch message as
attribute, then nfnetlink compares it to the current generation ID so
you make sure that you work against the right baseline. Otherwise, bail
out with ERESTART so userspace knows that its changeset is stale and
needs to respin. Userspace can do this transparently at the cost of
taking slightly more time to refresh caches and rework the changeset.

This check is optional, if there is no NFNL_BATCH_GENID attribute in the
batch begin message, then no check is performed.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-12 14:45:11 +01:00
Pablo Neira Ayuso
48656835c0 netfilter: nfnetlink: add nfnetlink_rcv_skb_batch()
Add new nfnetlink_rcv_skb_batch() to wrap initial nfnetlink batch
handling.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-12 14:45:10 +01:00
Pablo Neira Ayuso
b745d0358d netfilter: nfnetlink: get rid of u_intX_t types
Use uX types instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-12 14:45:09 +01:00
Gao Feng
4dee62b1b9 netfilter: nf_ct_expect: nf_ct_expect_insert() returns void
Because nf_ct_expect_insert() always succeeds now, its return value can
be just void instead of int. And remove code that checks for its return
value.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-12 14:44:08 +01:00
Gao Feng
a96e66e702 netfilter: nf_ct_sip: Use mod_timer_pending()
timer_del() followed by timer_add() can be replaced by
mod_timer_pending().

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-02-12 14:39:06 +01:00
Eric Dumazet
87b60cfacf net_sched: fix error recovery at qdisc creation
Dmitry reported uses after free in qdisc code [1]

The problem here is that ops->init() can return an error.

qdisc_create_dflt() then call ops->destroy(),
while qdisc_create() does _not_ call it.

Four qdisc chose to call their own ops->destroy(), assuming their caller
would not.

This patch makes sure qdisc_create() calls ops->destroy()
and fixes the four qdisc to avoid double free.

[1]
BUG: KASAN: use-after-free in mq_destroy+0x242/0x290 net/sched/sch_mq.c:33 at addr ffff8801d415d440
Read of size 8 by task syz-executor2/5030
CPU: 0 PID: 5030 Comm: syz-executor2 Not tainted 4.3.5-smp-DEV #119
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
 0000000000000046 ffff8801b435b870 ffffffff81bbbed4 ffff8801db000400
 ffff8801d415d440 ffff8801d415dc40 ffff8801c4988510 ffff8801b435b898
 ffffffff816682b1 ffff8801b435b928 ffff8801d415d440 ffff8801c49880c0
Call Trace:
 [<ffffffff81bbbed4>] __dump_stack lib/dump_stack.c:15 [inline]
 [<ffffffff81bbbed4>] dump_stack+0x6c/0x98 lib/dump_stack.c:51
 [<ffffffff816682b1>] kasan_object_err+0x21/0x70 mm/kasan/report.c:158
 [<ffffffff81668524>] print_address_description mm/kasan/report.c:196 [inline]
 [<ffffffff81668524>] kasan_report_error+0x1b4/0x4b0 mm/kasan/report.c:285
 [<ffffffff81668953>] kasan_report mm/kasan/report.c:305 [inline]
 [<ffffffff81668953>] __asan_report_load8_noabort+0x43/0x50 mm/kasan/report.c:326
 [<ffffffff82527b02>] mq_destroy+0x242/0x290 net/sched/sch_mq.c:33
 [<ffffffff82524bdd>] qdisc_destroy+0x12d/0x290 net/sched/sch_generic.c:953
 [<ffffffff82524e30>] qdisc_create_dflt+0xf0/0x120 net/sched/sch_generic.c:848
 [<ffffffff8252550d>] attach_default_qdiscs net/sched/sch_generic.c:1029 [inline]
 [<ffffffff8252550d>] dev_activate+0x6ad/0x880 net/sched/sch_generic.c:1064
 [<ffffffff824b1db1>] __dev_open+0x221/0x320 net/core/dev.c:1403
 [<ffffffff824b24ce>] __dev_change_flags+0x15e/0x3e0 net/core/dev.c:6858
 [<ffffffff824b27de>] dev_change_flags+0x8e/0x140 net/core/dev.c:6926
 [<ffffffff824f5bf6>] dev_ifsioc+0x446/0x890 net/core/dev_ioctl.c:260
 [<ffffffff824f61fa>] dev_ioctl+0x1ba/0xb80 net/core/dev_ioctl.c:546
 [<ffffffff82430509>] sock_do_ioctl+0x99/0xb0 net/socket.c:879
 [<ffffffff82430d30>] sock_ioctl+0x2a0/0x390 net/socket.c:958
 [<ffffffff816f3b68>] vfs_ioctl fs/ioctl.c:44 [inline]
 [<ffffffff816f3b68>] do_vfs_ioctl+0x8a8/0xe50 fs/ioctl.c:611
 [<ffffffff816f41a4>] SYSC_ioctl fs/ioctl.c:626 [inline]
 [<ffffffff816f41a4>] SyS_ioctl+0x94/0xc0 fs/ioctl.c:617
 [<ffffffff8123e357>] entry_SYSCALL_64_fastpath+0x12/0x17

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-11 21:38:58 -05:00
Julian Anastasov
c16ec18599 net: rename dst_neigh_output back to neigh_output
After the dst->pending_confirm flag was removed, we do not
need anymore to provide dst arg to dst_neigh_output.
So, rename it to neigh_output as before commit 5110effee8fd
("net: Do delayed neigh confirmation.").

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-11 21:25:18 -05:00
David S. Miller
35eeacf182 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-11 02:31:11 -05:00
Eric Dumazet
72fb96e7bd l2tp: do not use udp_ioctl()
udp_ioctl(), as its name suggests, is used by UDP protocols,
but is also used by L2TP :(

L2TP should use its own handler, because it really does not
look the same.

SIOCINQ for instance should not assume UDP checksum or headers.

Thanks to Andrey and syzkaller team for providing the report
and a nice reproducer.

While crashes only happen on recent kernels (after commit
7c13f97ffde6 ("udp: do fwd memory scheduling on dequeue")), this
probably needs to be backported to older kernels.

Fixes: 7c13f97ffde6 ("udp: do fwd memory scheduling on dequeue")
Fixes: 85584672012e ("udp: Fix udp_poll() and ioctl()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 15:57:34 -05:00
Jiri Pirko
4456f61cfd devlink: allow to fillup eswitch attrs even if mode_get op does not exist
Even when mode_get op is not present, other eswitch attrs need to be
filled-up.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 14:43:00 -05:00
Jiri Pirko
1a6aa36b6f devlink: use nla_put_failure goto label instead of out
Be aligned with the rest of the code and use label named nla_put_failure.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 14:43:00 -05:00
Jiri Pirko
21e3d2dd4a devlink: rename devlink_eswitch_fill to devlink_nl_eswitch_fill
Be aligned with the rest of the file and name the helper function
accordingly.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 14:43:00 -05:00
Jiri Pirko
adf200f31c devlink: fix the name of eswitch commands
The eswitch_[gs]et command is supposed to be similar to port_[gs]et
command - for multiple eswitch attributes. However, when it was introduced
by 08f4b5918b2d ("net/devlink: Add E-Switch mode control") it was wrongly
named with the word "mode" in it. So fix this now, make the oririnal
enum value existing but obsolete.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 14:43:00 -05:00
David S. Miller
0d2164af26 Some more updates:
* use shash in mac80211 crypto code where applicable
  * some documentation fixes
  * pass RSSI levels up in change notifications
  * remove unused rfkill-regulator
  * various other cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYnHuZAAoJEGt7eEactAAdorYP/iIfEZjLblLZNui+93OHpmsq
 QgXeQ9Vl/Xgq8g6vEb6jpHE6pvT/xoW/jt0s5rpYpPDzP7WRgkFQI144Jflx9gE+
 7KRHYc7k9XqugbFGFakjT+DyduxrGkEhZ7Vpd39D+zlPaf+p/31Jk6C4MNwk2oG3
 AA7ARUJfOKGpct3+l+cpnWQPUYQQKrSnjnMIwHL3Tu5pvGPDapdDWXbfn6T75Aof
 3oVQSNtbEtdzW/Ty+HgDn/boOCRXzXlOCxFlE7OiH2AXfe2mqGU/hkcm/wZ0QxOf
 9m3CxVjKH10tY6nsjDiXTOzFn+Zkzuum9/gcKtsgx+6BZ4qod2XuhtzmTeXggI8F
 b7nGeadSfSS6/unUu5Fibdn+X4Cw8+Yu5Qyiwo+jLL6yyTLUg6aKN6N9HWddhBfa
 pIjWAZVA1iB2m2XqUqRB/asEhslndSSfDmZK8nruYJSZWtQBNkNUdHXqqbqbBSHv
 KtKbHMOKoLU4zKmV2vMWGy0qypZCxtZkNF6GURhMh2m89qBcIAApFwQMzK09MwmP
 d5TcMwfi8YcKRb1Gw6n7gnJsC8e+tFDGXMi9w6z0FDGZvMbOlnO3ctSd/BY3B01H
 DWimkX8Ev3kt6KKTgbJ/n0lR/vEmDGdKo2ahH1uHOTPudrjpOHUg0cdDzS/VPZqi
 Qw4+FbVArISNrI2skTrA
 =sqkq
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2017-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Some more updates:
 * use shash in mac80211 crypto code where applicable
 * some documentation fixes
 * pass RSSI levels up in change notifications
 * remove unused rfkill-regulator
 * various other cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 14:31:51 -05:00
Chuck Lever
b977b644cc sunrpc: Allow xprt->ops->timer method to sleep
The transport lock is needed to protect the xprt_adjust_cwnd() call
in xs_udp_timer, but it is not necessary for accessing the
rq_reply_bytes_recvd or tk_status fields. It is correct to sublimate
the lock into UDP's xs_udp_timer method, where it is required.

The ->timer method has to take the transport lock if needed, but it
can now sleep safely, or even call back into the RPC scheduler.

This is more a clean-up than a fix, but the "issue" was introduced
by my transport switch patches back in 2005.

Fixes: 46c0ee8bc4ad ("RPC: separate xprt_timer implementations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:37 -05:00
Chuck Lever
9a5c63e9c4 xprtrdma: Refactor management of mw_list field
Clean up some duplicate code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:37 -05:00
Chuck Lever
0a90487bf7 xprtrdma: Handle stale connection rejection
A server rejects a connection attempt with STALE_CONNECTION when a
client attempts to connect to a working remote service, but uses a
QPN and GUID that corresponds to an old connection that was
abandoned. This might occur after a client crashes and restarts.

Fix rpcrdma_conn_upcall() to distinguish between a normal rejection
and rejection of stale connection parameters.

As an additional clean-up, remove the code that retries the
connection attempt with different ORD/IRD values. Code audit of
other ULP initiators shows no similar special case handling of
initiator_depth or responder_resources.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
18c0fb31a0 xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs
Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional
double DMA unmap of an FRWR MR when a connection is lost. I see one
way this can happen.

When a request requires more than one segment or chunk,
rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment
(MR) in each chunk. Each call posts a FASTREG Work Request to
register one MR.

Now suppose that the transport connection is lost part-way through
marshaling this request. As part of recovering and resetting that
req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands
all the req's registered FRWRs to the MR recovery thread.

But note: FRWR registration is asynchronous. So it's possible that
some of these "already registered" FRWRs are fully registered, and
some are still waiting for their FASTREG WR to complete.

When the connection is lost, the "already registered" frmrs are
marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then
frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.

But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing
an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs
FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.

- If the recovery thread runs last, then the frmr is marked
FRMR_IS_INVALID, and life continues.

- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR,
but the recovery thread has already DMA unmapped that MR. When
->frwr_op_map later re-uses this frmr, it sees it is not marked
FRMR_IS_INVALID, and tries to recover it before using it, resulting
in a second DMA unmap of the same MR.

The fix is to guarantee in-flight FASTREG WRs have flushed before MR
recovery runs on those FRWRs. Thus we depend on ro_unmap_safe
(called from xprt_rdma_send_request on retransmit, or from
xprt_rdma_free) to clean up old registrations as needed.

Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
c6f5b47f9f xprtrdma: Shrink send SGEs array
We no longer need to accommodate an xdr_buf whose pages start at an
offset and cross extra page boundaries. If there are more partial or
whole pages to send than there are available SGEs, the marshaling
logic is now smart enough to use a Read chunk instead of failing.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
16f906d66c xprtrdma: Reduce required number of send SGEs
The MAX_SEND_SGES check introduced in commit 655fec6987be
("xprtrdma: Use gathered Send for large inline messages") fails
for devices that have a small max_sge.

Instead of checking for a large fixed maximum number of SGEs,
check for a minimum small number. RPC-over-RDMA will switch to
using a Read chunk if an xdr_buf has more pages than can fit in
the device's max_sge limit. This is considerably better than
failing all together to mount the server.

This fix supports devices that have as few as three send SGEs
available.

Reported-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reported-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reported-by: Honggang Li <honli@redhat.com>
Reported-by: Ram Amrani <Ram.Amrani@cavium.com>
Fixes: 655fec6987be ("xprtrdma: Use gathered Send for large ...")
Cc: stable@vger.kernel.org # v4.9+
Tested-by: Honggang Li <honli@redhat.com>
Tested-by: Ram Amrani <Ram.Amrani@cavium.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
c95a3c6b88 xprtrdma: Disable pad optimization by default
Commit d5440e27d3e5 ("xprtrdma: Enable pad optimization") made the
Linux client omit XDR round-up padding in normal Read and Write
chunks so that the client doesn't have to register and invalidate
3-byte memory regions that contain no real data.

Unfortunately, my cheery 2014 assessment that this optimization "is
supported now by both Linux and Solaris servers" was premature.
We've found bugs in Solaris in this area since commit d5440e27d3e5
("xprtrdma: Enable pad optimization") was merged (SYMLINK is the
main offender).

So for maximum interoperability, I'm disabling this optimization
again. If a CM private message is exchanged when connecting, the
client recognizes that the server is Linux, and enables the
optimization for that connection.

Until now the Solaris server bugs did not impact common operations,
and were thus largely benign. Soon, less capable devices on Linux
NFS/RDMA clients will make use of Read chunks more often, and these
Solaris bugs will prevent interoperation in more cases.

Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
b5f0afbea4 xprtrdma: Per-connection pad optimization
Pad optimization is changed by echoing into
/proc/sys/sunrpc/rdma_pad_optimize. This is a global setting,
affecting all RPC-over-RDMA connections to all servers.

The marshaling code picks up that value and uses it for decisions
about how to construct each RPC-over-RDMA frame. Having it change
suddenly in mid-operation can result in unexpected failures. And
some servers a client mounts might need chunk round-up, while
others don't.

So instead, copy the pad_optimize setting into each connection's
rpcrdma_ia when the transport is created, and use the copy, which
can't change during the life of the connection, instead.

This also removes a hack: rpcrdma_convert_iovs was using
the remote-invalidation-expected flag to predict when it could leave
out Write chunk padding. This is because the Linux server handles
implicit XDR padding on Write chunks correctly, and only Linux
servers can set the connection's remote-invalidation-expected flag.

It's more sensible to use the pad optimization setting instead.

Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Chuck Lever
24abdf1be1 xprtrdma: Fix Read chunk padding
When pad optimization is disabled, rpcrdma_convert_iovs still
does not add explicit XDR round-up padding to a Read chunk.

Commit 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
incorrectly short-circuited the test for whether round-up padding
is needed that appears later in rpcrdma_convert_iovs.

However, if this is indeed a regular Read chunk (and not a
Position-Zero Read chunk), the tail iovec _always_ contains the
chunk's padding, and never anything else.

So, it's easy to just skip the tail when padding optimization is
enabled, and add the tail in a subsequent Read chunk segment, if
disabled.

Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 14:02:36 -05:00
Russell King
43cc277a93 net: cgroups: fix build errors when linux/phy*.h is removed from net/dsa.h
net/core/netprio_cgroup.c:303:16: error: expected declaration specifiers or '...' before string constant
    MODULE_LICENSE("GPL v2");
                   ^~~~~~~~

Add linux/module.h to fix this.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 13:51:01 -05:00
Russell King
f39b2dde48 net: sunrpc: fix build errors when linux/phy*.h is removed from net/dsa.h
Removing linux/phy.h from net/dsa.h reveals a build error in the sunrpc
code:

net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function 'xprt_rdma_bc_put':
net/sunrpc/xprtrdma/svc_rdma_backchannel.c:277:2: error: implicit declaration of function 'module_put' [-Werror=implicit-function-declaration]
net/sunrpc/xprtrdma/svc_rdma_backchannel.c: In function 'xprt_setup_rdma_bc':
net/sunrpc/xprtrdma/svc_rdma_backchannel.c:348:7: error: implicit declaration of function 'try_module_get' [-Werror=implicit-function-declaration]

Fix this by adding linux/module.h to svc_rdma_backchannel.c

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 13:51:01 -05:00
tcharding
f4563a75fb net: Fix checkpatch, Missing a blank line after declarations
This patch fixes multiple occurrences of checkpatch WARNING: Missing
a blank line after declarations.

Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 13:37:49 -05:00
tcharding
eb13da1a10 net: Fix checkpatch block comments warnings
Fix multiple occurrences of checkpatch warning. WARNING: Block
comments use * on subsequent lines. Also make comment blocks
more uniform.

Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 13:37:48 -05:00
tcharding
643aa9cba0 net: Fix checkpatch whitespace errors
This patch fixes two trivial whitespace errors. Brace should be
on the previous line and trailing statements should be on next line.

Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 13:37:48 -05:00
tcharding
722c9a0ceb net: Fix checkpatch WARNING: please, no space before tabs
This patch fixes multiple occurrences of space before tabs warnings.
More lines of code were moved than required to keep kernel-doc
comments uniform.

Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 13:37:48 -05:00
Amir Vadai
853a14ba46 net/act_pedit: Introduce 'add' operation
This command could be useful to inc/dec fields.

For example, to forward any TCP packet and decrease its TTL:
$ tc filter add dev enp0s9 protocol ip parent ffff: \
    flower ip_proto tcp \
    action pedit munge ip ttl add 0xff pipe \
    action mirred egress redirect dev veth0

In the example above, adding 0xff to this u8 field is actually
decreasing it by one, since the operation is masked.

Signed-off-by: Amir Vadai <amir@vadai.me>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 13:18:33 -05:00
Amir Vadai
71d0ed7079 net/act_pedit: Support using offset relative to the conventional network headers
Extend pedit to enable the user setting offset relative to network
headers. This change would enable to work with more complex header
schemes (vs the simple IPv4 case) where setting a fixed offset relative
to the network header is not enough.

After this patch, the action has information about the exact header type
and field inside this header. This information could be used later on
for hardware offloading of pedit.

Backward compatibility was being kept:
1. Old kernel <-> new userspace
2. New kernel <-> old userspace
3. add rule using new userspace <-> dump using old userspace
4. add rule using old userspace <-> dump using new userspace

When using the extended api, new netlink attributes are being used. This
way, operation will fail in (1) and (3) - and no malformed rule be added
or dumped. Of course, new user space that doesn't need the new
functionality can use the old netlink attributes and operation will
succeed.
Since action can support both api's, (2) should work, and it is easy to
write the new user space to have (4) work.

The action is having a strict check that only header types and commands
it can handle are accepted. This way future additions will be much
easier.

Usage example:
$ tc filter add dev enp0s9 protocol ip parent ffff: \
  flower \
    ip_proto tcp \
    dst_port 80 \
  action pedit munge tcp dport set 8080 pipe \
  action mirred egress redirect dev veth0

Will forward tcp port whose original dest port is 80, while modifying
the destination port to 8080.

Signed-off-by: Amir Vadai <amir@vadai.me>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 13:18:33 -05:00
Nogah Frankel
6d5496483f switchdev: bridge: Offload mc router ports
Offload the mc router ports list, whenever it is being changed.
It is done because in some cases mc packets needs to be flooded to all
the ports in this list.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:46:39 -05:00
Nogah Frankel
f12e7d95d1 bridge: mcast: Merge the mc router ports deletions to one function
There are three places where a port gets deleted from the mc router port
list. This patch join the actual deletion to one function.
It will be helpful for later patch that will offload changes in the mc
router ports list.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:46:38 -05:00
Nogah Frankel
147c1e9b90 switchdev: bridge: Offload multicast disabled
Offload multicast disabled flag, for more accurate mc flood behavior:
When it is on, the mdb should be ignored.
When it is off, unregistered mc packets should be flooded to mc router
ports.

Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:46:38 -05:00
Jiri Pirko
40c81b25b1 sched: check negative err value to safe one level of indent
As it is more common, check err for !0. That allows to safe one level of
indentation and makes the code easier to read. Also, make 'next' variable
global in function as it is used twice.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:38:09 -05:00
Jiri Pirko
7215032ced sched: add missing curly braces in else branch in tc_ctl_tfilter
Curly braces need to be there, for stylistic reasons.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:38:09 -05:00
Jiri Pirko
6bb16e7ae2 sched: move err set right before goto errout in tc_ctl_tfilter
This makes the reader to know right away what is the error value.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:38:09 -05:00
Jiri Pirko
33a48927c1 sched: push TC filter protocol creation into a separate function
Make the long function tc_ctl_tfilter a little bit shorter and easier to
read. Also make the creation of filter proto symmetric to destruction.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:38:08 -05:00
Jiri Pirko
cf1facda2f sched: move tcf_proto_destroy and tcf_destroy_chain helpers into cls_api
Creation is done in this file, move destruction to be at the same place.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:38:08 -05:00
Jiri Pirko
79112c26f1 sched: rename tcf_destroy to tcf_destroy_proto
This function destroys TC filter protocol, not TC filter. So name it
accordingly.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:38:08 -05:00
Ido Schimmel
2f3a5272e5 ipv4: fib: Add events for FIB replace and append
The FIB notification chain currently uses the NLM_F_{REPLACE,APPEND}
flags to signal routes being replaced or appended.

Instead of using netlink flags for in-kernel notifications we can simply
introduce two new events in the FIB notification chain. This has the
added advantage of making the API cleaner, thereby making it clear that
these events should be supported by listeners of the notification chain.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:32:13 -05:00
Ido Schimmel
5b7d616dbc ipv4: fib: Send notification before deleting FIB alias
When a FIB alias is replaced following NLM_F_REPLACE, the ENTRY_ADD
notification is sent after the reference on the previous FIB info was
dropped. This is problematic as potential listeners might need to access
it in their notification blocks.

Solve this by sending the notification prior to the deletion of the
replaced FIB alias. This is consistent with ENTRY_DEL notifications.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:32:12 -05:00
Ido Schimmel
42d5aa76ec ipv4: fib: Send deletion notification with actual FIB alias type
When a FIB alias is removed, a notification is sent using the type
passed from user space - can be RTN_UNSPEC - instead of the actual type
of the removed alias. This is problematic for listeners of the FIB
notification chain, as several FIB aliases can exist with matching
parameters, but the type.

Solve this by passing the actual type of the removed FIB alias.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:32:12 -05:00