linux

iv/linux

Author	SHA1	Message	Date
Ross Zwisler	965d004af5	dax: fix deadlock with DAX 4k holes Currently in DAX if we have three read faults on the same hole address we can end up with the following: Thread 0 Thread 1 Thread 2 -------- -------- -------- dax_iomap_fault grab_mapping_entry lock_slot <locks empty DAX entry> dax_iomap_fault grab_mapping_entry get_unlocked_mapping_entry <sleeps on empty DAX entry> dax_iomap_fault grab_mapping_entry get_unlocked_mapping_entry <sleeps on empty DAX entry> dax_load_hole find_or_create_page ... page_cache_tree_insert dax_wake_mapping_entry_waiter <wakes one sleeper> __radix_tree_replace <swaps empty DAX entry with 4k zero page> <wakes> get_page lock_page ... put_locked_mapping_entry unlock_page put_page <sleeps forever on the DAX wait queue> The crux of the problem is that once we insert a 4k zero page, all locking from then on is done in terms of that 4k zero page and any additional threads sleeping on the empty DAX entry will never be woken. Fix this by waking all sleepers when we replace the DAX radix tree entry with a 4k zero page. This will allow all sleeping threads to successfully transition from locking based on the DAX empty entry to locking on the 4k zero page. With the test case reported by Xiong this happens very regularly in my test setup, with some runs resulting in 9+ threads in this deadlocked state. With this fix I've been able to run that same test dozens of times in a loop without issue. Fixes: ac401cc78242 ("dax: New fault locking") Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Xiong Zhou <xzhou@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-01-10 18:31:54 -08:00
Vlastimil Babka	5771f6ea8d	MAINTAINERS: remove duplicate bug filling description I have noticed that two different descriptions for B: entries in MAINTAINERS were merged: commit 686564434e88 ("MAINTAINERS: Add bug tracking system location entry type") and 2de2bd95f456 ("MAINTAINERS: add "B:" for URI where to file bugs"). This patch keeps the description from 2de2bd95f456. There has been a discussion [1] about whether this more detailed description is useful and what it exactly implies. I find it more useful and general, and the author of 686564434e88 agreed in the end that either is fine. [1] https://lkml.org/lkml/2016/12/8/71 Link: http://lkml.kernel.org/r/20161219085158.12114-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-01-10 18:31:54 -08:00
Herbert Xu	57ea52a865	gro: Disable frag0 optimization on IPv6 ext headers The GRO fast path caches the frag0 address. This address becomes invalid if frag0 is modified by pskb_may_pull or its variants. So whenever that happens we must disable the frag0 optimization. This is usually done through the combination of gro_header_hard and gro_header_slow, however, the IPv6 extension header path did the pulling directly and would continue to use the GRO fast path incorrectly. This patch fixes it by disabling the fast path when we enter the IPv6 extension header path. Fixes: 78a478d0efd9 ("gro: Inline skb_gro_header and cache frag0 virtual address") Reported-by: Slava Shwartsman <slavash@mellanox.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 21:30:33 -05:00
Herbert Xu	1272ce87fa	gro: Enter slow-path if there is no tailroom The GRO path has a fast-path where we avoid calling pskb_may_pull and pskb_expand by directly accessing frag0. However, this should only be done if we have enough tailroom in the skb as otherwise we'll have to expand it later anyway. This patch adds the check by capping frag0_len with the skb tailroom. Fixes: cb18978cbf45 ("gro: Open-code final pskb_may_pull") Reported-by: Slava Shwartsman <slavash@mellanox.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 21:26:12 -05:00
Martin KaFai Lau	9f9b74ef89	mlx4: Return EOPNOTSUPP instead of ENOTSUPP In commit b45f0674b997 ("mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs"), it changed EOPNOTSUPP to ENOTSUPP by mistake. This patch fixes it. Fixes: b45f0674b997 ("mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 21:16:43 -05:00
Julian Wiedmann	dc5367bcc5	net/af_iucv: don't use paged skbs for TX on HiperSockets With commit e53743994e21 ("af_iucv: use paged SKBs for big outbound messages"), we transmit paged skbs for both of AF_IUCV's transport modes (IUCV or HiperSockets). The qeth driver for Layer 3 HiperSockets currently doesn't support NETIF_F_SG, so these skbs would just be linearized again by the stack. Avoid that overhead by using paged skbs only for IUCV transport. cc stable, since this also circumvents a significant skb leak when sending large messages (where the skb then needs to be linearized). Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> # v4.8+ Fixes: e53743994e21 ("af_iucv: use paged SKBs for big outbound messages") Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 21:08:29 -05:00
Sowmini Varadhan	a505e58252	packet: pdiag_put_ring() should return TX_RING info for TPACKET_V3 Commit 7f953ab2ba46 ("af_packet: TX_RING support for TPACKET_V3") now makes it possible to use TX_RING with TPACKET_V3, so make the the relevant information available via 'ss -e -a --packet' Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 21:02:42 -05:00
Tobias Klauser	3bf003335b	bpf: Make unnecessarily global functions static Make the functions __local_list_pop_free(), __local_list_pop_pending(), bpf_common_lru_populate() and bpf_percpu_lru_populate() static as they are not used outide of bpf_lru_list.c This fixes the following GCC warnings when building with 'W=1': kernel/bpf/bpf_lru_list.c:363:22: warning: no previous prototype for ‘__local_list_pop_free’ [-Wmissing-prototypes] kernel/bpf/bpf_lru_list.c:376:22: warning: no previous prototype for ‘__local_list_pop_pending’ [-Wmissing-prototypes] kernel/bpf/bpf_lru_list.c:560:6: warning: no previous prototype for ‘bpf_common_lru_populate’ [-Wmissing-prototypes] kernel/bpf/bpf_lru_list.c:577:6: warning: no previous prototype for ‘bpf_percpu_lru_populate’ [-Wmissing-prototypes] Cc: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 21:00:59 -05:00
Tobias Klauser	a5ef01aaac	bpf: Remove unused but set variable in __bpf_lru_list_shrink_inactive() Remove the unused but set variable 'first_node' in __bpf_lru_list_shrink_inactive() to fix the following GCC warning when building with 'W=1': kernel/bpf/bpf_lru_list.c:216:41: warning: variable ‘first_node’ set but not used [-Wunused-but-set-variable] Cc: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 21:00:59 -05:00
Anna, Suman	5d722b3024	net: add the AF_QIPCRTR entries to family name tables Commit bdabad3e363d ("net: Add Qualcomm IPC router") introduced a new address family. Update the family name tables accordingly so that the lockdep initialization can use the proper names for this family. Cc: Courtney Cavin <courtney.cavin@sonymobile.com> Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Suman Anna <s-anna@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 20:50:59 -05:00
Mahesh Bandewar	da36e13cf6	ipvlan: improvise dev_id generation logic in IPvlan The patch 009146d117b ("ipvlan: assign unique dev-id for each slave device.") used ida_simple_get() to generate dev_ids assigned to the slave devices. However (Eric has pointed out that) there is a shortcoming with that approach as it always uses the first available ID. This becomes a problem when a slave gets deleted and a new slave gets added. The ID gets reassigned causing the new slave to get the same link-local address. This side-effect is undesirable. This patch adds a per-port variable that keeps track of the IDs assigned and used as the stat-base for the IDR api. This base will be wrapped around when it reaches the MAX (0xFFFE) value possibly on a busy system where slaves are added and deleted routinely. Fixes: 009146d117b ("ipvlan: assign unique dev-id for each slave device.") Signed-off-by: Mahesh Bandewar <maheshb@google.com> CC: Eric Dumazet <edumazet@google.com> CC: David Miller <davem@davemloft.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 20:47:12 -05:00
Stephen Boyd	3512a1ad56	net: qrtr: Mark 'buf' as little endian Failure to mark this pointer as __le32 causes checkers like sparse to complain: net/qrtr/qrtr.c:274:16: warning: incorrect type in assignment (different base types) net/qrtr/qrtr.c:274:16: expected unsigned int [unsigned] [usertype] <noident> net/qrtr/qrtr.c:274:16: got restricted __le32 [usertype] <noident> net/qrtr/qrtr.c:275:16: warning: incorrect type in assignment (different base types) net/qrtr/qrtr.c:275:16: expected unsigned int [unsigned] [usertype] <noident> net/qrtr/qrtr.c:275:16: got restricted __le32 [usertype] <noident> net/qrtr/qrtr.c:276:16: warning: incorrect type in assignment (different base types) net/qrtr/qrtr.c:276:16: expected unsigned int [unsigned] [usertype] <noident> net/qrtr/qrtr.c:276:16: got restricted __le32 [usertype] <noident> Silence it. Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 20:45:04 -05:00
Florian Fainelli	faf3a932fb	net: dsa: Ensure validity of dst->ds[0] It is perfectly possible to have non zero indexed switches being present in a DSA switch tree, in such a case, we will be deferencing a NULL pointer while dsa_cpu_port_ethtool_{setup,restore}. Be more defensive and ensure that dst->ds[0] is valid before doing anything with it. Fixes: 0c73c523cf73 ("net: dsa: Initialize CPU port ethtool ops per tree") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 20:18:52 -05:00
Linus Torvalds	807b93e995	linux-kselftest-4.10-rc4-fixes This update consists of fixes to use shell instead of bash to run tests in embedded devices where the only shell available is the busybox ash. Also included is a typo fix to a test result message. -----BEGIN PGP SIGNATURE----- iQIcBAABCAAGBQJYdQ7rAAoJEAsCRMQNDUMc3kUP/itK7FDFt4DMoMipiFJ8eI03 qgrAYrSEcNbwZvtorgBpkJZpaFp86y+AkjjLyChWQuudqM0aGSD3mBsp5+8fJp15 ZP6ZzYMLTS+rpn0OcYZWVvAp2OvoLTSOkMy5mJtMAWFZURRYhv4lPoQ5tW6R0i1C IrCz0EEkRWSGMJBLz0Y4h72boDAC/IUj6HxU+5rVpxAYuW6P3ux8ltO53lwR97ud QHqozR6uvd8kYhXjRqy8pi8hgFk83YTaVwnJrSO9/kM1yYxjVlSLXTqd9LZtApeC Rw21zEQ2LimXMGX1WKBWfzeawEjF15W3FIIz3q5FZdnY4EDT/GFj6qJt8vc9XNrW LcxgBiuP6jpeSQf46gS51ubO8obreWyc9TjgrmAk1OSjiQmYtjAMlmUhLG9FiWjx wyGQcZH5opJqDVOfS/o8JTugkl0bMVH6ql7jQsjVYOFImi/nAc+uSaqST0TmuGUy FbbFknqo7e1CASTKki5BKVHjbDNQmYaAI2xqXg36SvJeZZkMCKJLI1NF6W6LLRS7 sbzMkBCybdX6yvoezEfAz0uK0z0dB6i2Nn1IL1E8UjLXZ7JXcACwF4dTdnel5r9Q 7eEONFz0Alyr/rWz2UF7l3KLY15JjbopXrMK+JqGLfZJIvdRqo35y7BjdUFm4NvA 45z0ohs1YlQG+X1U1ZQp =GjQN -----END PGP SIGNATURE----- Merge tag 'linux-kselftest-4.10-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest fixes from Shuah Khan: "This update consists of fixes to use shell instead of bash to run tests in embedded devices where the only shell available is the busybox ash. Also included is a typo fix to a test result message" * tag 'linux-kselftest-4.10-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests: x86/pkeys: fix spelling mistake: "itertation" -> "iteration" selftests: do not require bash to run netsocktests testcase selftests: do not require bash to run bpf tests selftests: do not require bash for the generated test	2017-01-10 13:51:54 -08:00
Prasad Kanneganti	de28c99d71	liquidio: store the L4 hash of rx packets in skb Store the L4 hash of received packets in the skb; the hash is computed in the NIC firmware. Signed-off-by: Prasad Kanneganti <prasad.kanneganti@cavium.com> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: Derek Chickles <derek.chickles@cavium.com> Signed-off-by: Satanand Burla <satananda.burla@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 14:22:34 -05:00
David S. Miller	5bd36a6f02	Merge branch 'sfc-physical-port-ids' Edward Cree says: ==================== sfc: physical port ids This series brings our handling of ndo_get_phys_port_id and related interfaces into line with the behaviour of other drivers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 14:16:18 -05:00
Bert Kenward	0d71a84c74	sfc: stop setting dev_port Setting dev_port changes the device names allocated by systemd. Any devices with a dev_port >0 will (in default distro configurations) have a suffix of "d<port-number>" appended. This is not something done by other drivers, and causes confusion for users. Fixes: 8be41320f346 ("sfc: Add code to export port_num in netdev->dev_port") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 14:16:17 -05:00
Bert Kenward	ac019f0895	sfc: implement ndo_get_phys_port_name Output is of the form p<port-number>. Note that the port numbers don't necessarily map one-to-one to physical cages, partly because of 4x10G port modes on QSFP+ and partly because of hw/fw implementation details. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 14:16:17 -05:00
Bert Kenward	08a7b29be9	sfc: support ndo_get_phys_port_id even when !CONFIG_SFC_SRIOV There's no good reason why this should be an SRIOV-only thing. Thus, also move it out of SRIOV-specific files. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 14:16:17 -05:00
Eric Dumazet	d9584d8ccc	net: skb_flow_get_be16() can be static Removes following sparse complain : net/core/flow_dissector.c:70:8: warning: symbol 'skb_flow_get_be16' was not declared. Should it be static? Fixes: 972d3876faa8 ("flow dissector: ICMP support") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 13:30:13 -05:00
Timur Tabi	79f664edc1	net: qcom/emac: add ethtool support Add support for some ethtool methods: get/set link settings, get/set message level, get statistics, get link status, get ring params, get pause params, and restart autonegotiation. The code to collect the hardware statistics is moved into its own function so that it can be used by "get statistics" method. Signed-off-by: Timur Tabi <timur@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 13:26:25 -05:00
David S. Miller	2f8225e834	Merge branch 'r8152-fix-autosuspend' Hayes Wang says: ==================== r8152: fix autosuspend issue Avoid rx is split into two parts when runtime suspend occurs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 11:46:47 -05:00
hayeswang	75dc692eda	r8152: fix rx issue for runtime suspend Pause the rx and make sure the rx fifo is empty when the autosuspend occurs. If the rx data comes when the driver is canceling the rx urb, the host controller would stop getting the data from the device and continue it after next rx urb is submitted. That is, one continuing data is split into two different urb buffers. That let the driver take the data as a rx descriptor, and unexpected behavior happens. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 11:46:46 -05:00
hayeswang	8fb2806168	r8152: split rtl8152_suspend function Split rtl8152_suspend() into rtl8152_system_suspend() and rtl8152_rumtime_suspend(). Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 11:46:46 -05:00
Tobias Klauser	dc647ec88e	net: socket: Make unnecessarily global sockfs_setattr() static Make sockfs_setattr() static as it is not used outside of net/socket.c This fixes the following GCC warning: net/socket.c:534:5: warning: no previous prototype for ‘sockfs_setattr’ [-Wmissing-prototypes] Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.") Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 11:29:50 -05:00
David S. Miller	2bc979f273	wireless-drivers fixes for 4.10 Only two fixes at this time. The rtlwifi fix is an important one as it fixes a reported oops and Linus was already asking about it. The orinoco fix is not tested on a real device, because it's old legacy hardware and hardly no-one use it, but it should fix a (theoretical) issue with VMAP_STACK. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJYdG+sAAoJEG4XJFUm622bGo0IAJCx6Ibu+EjbfL1KDDiR4TSS ZuWHkjVHtbSE0itUCws9RXWCBStUsDV/eFlVax7zBk4XiLtmDKHG0Kvb6K+Bm1qy rOEgmj8Y0mYzuAHR5efgC5LyyxRqKxGeff20TUZYHQ2KcWmfe5hZi0FJpo0gFjp0 1CUZiSAFuiQmaM6mGleSRZndN65m+bSRjM8yfxTC24XqjdfTVzSXo68Cp4gYk/XE M62gAg0fyk7UPBwBDztmCtARNW2Zr/OGolQp9ohyOxj3iAuazvTpohywA7Fa0vZu piBwZJGMAy1ZXHgK58qvoq1ej51RNq+ttjQKiXt9YinL/YBIcs+AkvvVzDFjzZk= =5g7Z -----END PGP SIGNATURE----- Merge tag 'wireless-drivers-for-davem-2017-01-10' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers Kalle Valo says: ==================== wireless-drivers fixes for 4.10 Only two fixes at this time. The rtlwifi fix is an important one as it fixes a reported oops and Linus was already asking about it. The orinoco fix is not tested on a real device, because it's old legacy hardware and hardly no-one use it, but it should fix a (theoretical) issue with VMAP_STACK. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-10 11:27:46 -05:00
Vivien Didelot	3a89eaa65d	net: dsa: select NET_SWITCHDEV The support for DSA Ethernet switch chips depends on TCP/IP networking, thus explicit that HAVE_NET_DSA depends on INET. DSA uses SWITCHDEV, thus select it instead of depending on it. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 17:17:30 -05:00
David S. Miller	bda65b4255	mlx5 4K UAR The following series of patches optimizes the usage of the UAR area which is contained within the BAR 0-1. Previous versions of the firmware and the driver assumed each system page contains a single UAR. This patch set will query the firmware for a new capability that if published, means that the firmware can support UARs of fixed 4K regardless of system page size. In the case of powerpc, where page size equals 64KB, this means we can utilize 16 UARs per system page. Since user space processes by default consume eight UARs per context this means that with this change a process will need a single system page to fulfill that requirement and in fact make use of more UARs which is better in terms of performance. In addition to optimizing user-space processes, we introduce an allocator that can be used by kernel consumers to allocate blue flame registers (which are areas within a UAR that are used to write doorbells). This provides further optimization on using the UAR area since the Ethernet driver makes use of a single blue flame register per system page and now it will use two blue flame registers per 4K. The series also makes changes to naming conventions and now the terms used in the driver code match the terms used in the PRM (programmers reference manual). Thus, what used to be called UUAR (micro UAR) is now called BFREG (blue flame register). In order to support compatibility between different versions of library/driver/firmware, the library has now means to notify the kernel driver that it supports the new scheme and the kernel can notify the library if it supports this extension. So mixed versions of libraries can run concurrently without any issues. Thanks, Eli and Matan -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJYc9kSAAoJEEg/ir3gV/o+a0EH/jEGiopH7CHc4T4nXT1I4kQa TicrkMNV3Sr9MBWwn8TLOyx+Fi1dex4cumrJI/BNVjC6h/nS6JHbslYoZxTkX9lT L0vRsHJBVr/PODqimIGNnlJFBPhNJSGiHG4JHlJHlpvcGNahitN3gXmUjcRNju+V ExnvgwWzAXM0qg1qWf5A/3HmqbtYES1rJXQUsimtc2QAif/SIayBD4fEA8x5zNBA i0p8xcDrzUqmeblkpnsJA3w40s1rsuqvJnvLPDpbpKENtHfw1UFZ2987P7LvOrIv NF/mZBkStC0gOZX6dLEAdoZXL1gTsJX19hTkUMfYH4BHqHARa2/oCS3wcCf1Giw= =C+cp -----END PGP SIGNATURE----- Merge tag 'mlx5-4kuar-for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== mlx5 4K UAR The following series of patches optimizes the usage of the UAR area which is contained within the BAR 0-1. Previous versions of the firmware and the driver assumed each system page contains a single UAR. This patch set will query the firmware for a new capability that if published, means that the firmware can support UARs of fixed 4K regardless of system page size. In the case of powerpc, where page size equals 64KB, this means we can utilize 16 UARs per system page. Since user space processes by default consume eight UARs per context this means that with this change a process will need a single system page to fulfill that requirement and in fact make use of more UARs which is better in terms of performance. In addition to optimizing user-space processes, we introduce an allocator that can be used by kernel consumers to allocate blue flame registers (which are areas within a UAR that are used to write doorbells). This provides further optimization on using the UAR area since the Ethernet driver makes use of a single blue flame register per system page and now it will use two blue flame registers per 4K. The series also makes changes to naming conventions and now the terms used in the driver code match the terms used in the PRM (programmers reference manual). Thus, what used to be called UUAR (micro UAR) is now called BFREG (blue flame register). In order to support compatibility between different versions of library/driver/firmware, the library has now means to notify the kernel driver that it supports the new scheme and the kernel can notify the library if it supports this extension. So mixed versions of libraries can run concurrently without any issues. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 17:09:31 -05:00
Eric Dumazet	b369e7fd41	tcp: make TCP_INFO more consistent tcp_get_info() has to lock the socket, so lets lock it for an extended critical section, so that various fields have consistent values. This solves an annoying issue that some applications reported when multiple counters are updated during one particular rx/rx event, and TCP_INFO was called from another cpu. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 17:07:54 -05:00
David S. Miller	c22e5c125b	Merge branch 'bpf-verifier-improvements' Alexei Starovoitov says: ==================== bpf: verifier improvements A number of bpf verifier improvements from Gianluca. See individual patches for details. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:56:28 -05:00
Alexei Starovoitov	39f19ebbf5	bpf: rename ARG_PTR_TO_STACK since ARG_PTR_TO_STACK is no longer just pointer to stack rename it to ARG_PTR_TO_MEM and adjust comment. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:56:27 -05:00
Gianluca Borello	06c1c04972	bpf: allow helpers access to variable memory Currently, helpers that read and write from/to the stack can do so using a pair of arguments of type ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE. ARG_CONST_STACK_SIZE accepts a constant register of type CONST_IMM, so that the verifier can safely check the memory access. However, requiring the argument to be a constant can be limiting in some circumstances. Since the current logic keeps track of the minimum and maximum value of a register throughout the simulated execution, ARG_CONST_STACK_SIZE can be changed to also accept an UNKNOWN_VALUE register in case its boundaries have been set and the range doesn't cause invalid memory accesses. One common situation when this is useful: int len; char buf[BUFSIZE]; /* BUFSIZE is 128 */ if (some_condition) len = 42; else len = 84; some_helper(..., buf, len & (BUFSIZE - 1)); The compiler can often decide to assign the constant values 42 or 48 into a variable on the stack, instead of keeping it in a register. When the variable is then read back from stack into the register in order to be passed to the helper, the verifier will not be able to recognize the register as constant (the verifier is not currently tracking all constant writes into memory), and the program won't be valid. However, by allowing the helper to accept an UNKNOWN_VALUE register, this program will work because the bitwise AND operation will set the range of possible values for the UNKNOWN_VALUE register to [0, BUFSIZE), so the verifier can guarantee the helper call will be safe (assuming the argument is of type ARG_CONST_STACK_SIZE_OR_ZERO, otherwise one more check against 0 would be needed). Custom ranges can be set not only with ALU operations, but also by explicitly comparing the UNKNOWN_VALUE register with constants. Another very common example happens when intercepting system call arguments and accessing user-provided data of variable size using bpf_probe_read(). One can load at runtime the user-provided length in an UNKNOWN_VALUE register, and then read that exact amount of data up to a compile-time determined limit in order to fit into the proper local storage allocated on the stack, without having to guess a suboptimal access size at compile time. Also, in case the helpers accepting the UNKNOWN_VALUE register operate in raw mode, disable the raw mode so that the program is required to initialize all memory, since there is no guarantee the helper will fill it completely, leaving possibilities for data leak (just relevant when the memory used by the helper is the stack, not when using a pointer to map element value or packet). In other words, ARG_PTR_TO_RAW_STACK will be treated as ARG_PTR_TO_STACK. Signed-off-by: Gianluca Borello <g.borello@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:56:27 -05:00
Gianluca Borello	f0318d01b6	bpf: allow adjusted map element values to spill commit 484611357c19 ("bpf: allow access into map value arrays") introduces the ability to do pointer math inside a map element value via the PTR_TO_MAP_VALUE_ADJ register type. The current support doesn't handle the case where a PTR_TO_MAP_VALUE_ADJ is spilled into the stack, limiting several use cases, especially when generating bpf code from a compiler. Handle this case by explicitly enabling the register type PTR_TO_MAP_VALUE_ADJ to be spilled. Also, make sure that min_value and max_value are reset just for BPF_LDX operations that don't result in a restore of a spilled register from stack. Signed-off-by: Gianluca Borello <g.borello@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:56:27 -05:00
Gianluca Borello	5722569bb9	bpf: allow helpers access to map element values Enable helpers to directly access a map element value by passing a register type PTR_TO_MAP_VALUE (or PTR_TO_MAP_VALUE_ADJ) to helper arguments ARG_PTR_TO_STACK or ARG_PTR_TO_RAW_STACK. This enables several use cases. For example, a typical tracing program might want to capture pathnames passed to sys_open() with: struct trace_data { char pathname[PATHLEN]; }; SEC("kprobe/sys_open") void bpf_sys_open(struct pt_regs ctx) { struct trace_data data; bpf_probe_read(data.pathname, sizeof(data.pathname), ctx->di); / consume data.pathname, for example via * bpf_trace_printk() or bpf_perf_event_output() / } Such a program could easily hit the stack limit in case PATHLEN needs to be large or more local variables need to exist, both of which are quite common scenarios. Allowing direct helper access to map element values, one could do: struct bpf_map_def SEC("maps") scratch_map = { .type = BPF_MAP_TYPE_PERCPU_ARRAY, .key_size = sizeof(u32), .value_size = sizeof(struct trace_data), .max_entries = 1, }; SEC("kprobe/sys_open") int bpf_sys_open(struct pt_regs ctx) { int id = 0; struct trace_data p = bpf_map_lookup_elem(&scratch_map, &id); if (!p) return; bpf_probe_read(p->pathname, sizeof(p->pathname), ctx->di); / consume p->pathname, for example via * bpf_trace_printk() or bpf_perf_event_output() */ } And wouldn't risk exhausting the stack. Code changes are loosely modeled after commit 6841de8b0d03 ("bpf: allow helpers access the packet directly"). Unlike with PTR_TO_PACKET, these changes just work with ARG_PTR_TO_STACK and ARG_PTR_TO_RAW_STACK (not ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE, ...): adding those would be trivial, but since there is not currently a use case for that, it's reasonable to limit the set of changes. Also, add new tests to make sure accesses to map element values from helpers never go out of boundary, even when adjusted. Signed-off-by: Gianluca Borello <g.borello@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:56:26 -05:00
Gianluca Borello	dbcfe5f76d	bpf: split check_mem_access logic for map values Move the logic to check memory accesses to a PTR_TO_MAP_VALUE_ADJ from check_mem_access() to a separate helper check_map_access_adj(). This enables to use those checks in other parts of the verifier as well, where boundaries on PTR_TO_MAP_VALUE_ADJ might need to be checked, for example when checking helper function arguments. The same thing is already happening for other types such as PTR_TO_PACKET and its check_packet_access() helper. The code has been copied verbatim, with the only difference of removing the "off += reg->max_value" statement and moving the sum into the call statement to check_map_access(), as that was only needed due to the earlier common check_map_access() call. Signed-off-by: Gianluca Borello <g.borello@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:56:26 -05:00
Eric Dumazet	6bb629db5e	tcp: do not export tcp_peer_is_proven() After commit 1fb6f159fd21 ("tcp: add tcp_conn_request"), tcp_peer_is_proven() no longer needs to be exported. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:34:39 -05:00
Jean Delvare	2ebae8bd60	net: phy: Add Meson GXL PHY hardware dependency As I understand it the Meson GXL PHY driver is only useful on one architecture so only make it visible on that architecture. Signed-off-by: Jean Delvare <jdelvare@suse.de> Fixes: 7334b3e47aee ("net: phy: Add Meson GXL Internal PHY driver") Cc: Neil Armstrong <narmstrong@baylibre.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:34:39 -05:00
Vlad Tsyrklevich	ce7e40c432	net/appletalk: Fix kernel memory disclosure ipddp_route structs contain alignment padding so kernel heap memory is leaked when they are copied to user space in ipddp_ioctl(SIOCFINDIPDDPRT). Change kmalloc() to kzalloc() to clear that memory. Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:34:39 -05:00
Pavel Tikhomirov	b007f09072	ipv4: make tcp_notsent_lowat sysctl knob behave as true unsigned int > cat /proc/sys/net/ipv4/tcp_notsent_lowat -1 > echo 4294967295 > /proc/sys/net/ipv4/tcp_notsent_lowat -bash: echo: write error: Invalid argument > echo -2147483648 > /proc/sys/net/ipv4/tcp_notsent_lowat > cat /proc/sys/net/ipv4/tcp_notsent_lowat -2147483648 but in documentation we have "tcp_notsent_lowat - UNSIGNED INTEGER" v2: simplify to just proc_douintvec Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:34:38 -05:00
Alexander Alemayhu	67c408cfa8	ipv6: fix typos o s/approriate/appropriate o s/discouvery/discovery Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:34:15 -05:00
David S. Miller	f3a3e248f3	Merge branch 'net-smc' Ursula Braun says: ==================== net/smc: Shared Memory Communications - RDMA here is now V4 of the SMC-R patches having processed your feedback from end of November. The most important change is the replacement of sysfs by a generic netlink solution in patch 04. And I tried to get rid of the __packed attributes. There are still a few usages left due to SMC-R protocol defined structures. V4 changes: The order of patches 03 and 04 for pnet table management and SMC IB-client establishing has been exchanged, since pnet table management is now built on top of smc_ib_devices. Patch 01: Use EXPORT_SYMBOL_GPL(). Patch 02: Define "use_fallback" as bool. Get rid of useless smc_sock fields clearing in smc_sock_alloc(), since sk_alloc() clears out the memory. Patch 03: Postpone smc_ib_remember_port_attr() call till ib_device is mentioned in the pnet table. Patch 04: Replace sysfs-usage by a generic netlink approach for pnet table configuration. Change layout of pnet table entries to reference net_device and ib_device instead of dealing with names of net_devices and ib_devices. Patch 05: Adapt "use_fallback" usages to new type bool. Get rid of useless smc_sock fields clearing in smc_sock_alloc() Avoid __packed where possible. Check if clc responses are not too big. Patch 09: Postpone smc_setup_per_ibdev till the first connection with this ib_device is really created. Patch 11: Get rid of __packed usage. V3 changes: Patch 05: Remove unneeded DEFINE_WAIT Patch 06: Improve synchronization of link group creation Patch 07: Rename peer_rmbe_len into peer_rmbe_size to be more consistent Patch 09: Avoid calls of ib_get_memory_region with IB_ACCESS_LOCAL_WRITE, use new default local_dma_lkey from protection domain as lkey instead. Remove no longer needed function smc_ib_dereg_memory_region(). Patch 14: Switch to state ACTIVE only if still in state INIT. Return 0 for recvmsg invoked in a socket closing state. Allow getname call in state APPCLOSEWAIT1 Do not trigger destruction of a socket-in-error queued in accept queue. During cleanup of accept queue, make sure sockets are destructed, and sockets in fallback mode are handled appropriately. When freeing sndbufs/rmbs, remove them from their list and free the entry. Use add_wait_queue() and remove_wait_queue() in close wait functions. If actively closing a socket in state for PEERFINCLOSEWAIT, keep this state. If passively closing a socket while bytes are to be received, move to state APPCLOSEWAIT1. If actively aborting a socket, skip sending the close_abort flag, since RDMA communication is no longer possible. When terminating a link group, do not schedule link group freeing a 2nd time, since already done when unregistering the last remaining connection. Patch 15: Introduce smc_diag module for monitoring SMC protocol sockets. This replaces the old patch 0015 dealing with procfs. V2 changes: Patch 0002: Add SMC versions for family key strings in net/core/sock.c. Patch 0006: initialize rb_tree. Patch 0007: Get rid of unneeded use of xchg() in smc_sndbuf_unuse() and smc_rmb_unuse(). Patch 0008: Correct error checking logic for ib_function calls. Define struct smc_link field wr_tx_id as atomic_long_t. Use "do_div" instead of "%" to be architecture-independent. Patch 0009: Correct error checking logic for ib_function calls. Patch 0011: Remove xchg() calls in cursor handling. Use atomic64_t for cursor overlays on 64-bit architectures. If not available, use plain u64 and add locking for cursor reading and writing. Implement smc_curs_add() without modulo operator "%". Patch 0012: Remove xchg() calls in cursor handling. Implement smc_tx_rdma_writes() without module operator "%". Patch 0013: Remove xchg() calls in cursor handling. Patch 0014: Return type bool in smc_wr_tx_has_pending(). Remove unneeded semicolon in smc_close_shutdown_write(). Call smc_close_active() in non-fallback case only. Get rid of duplicate schedule of sock_put_work(). Take nested sock_lock in smc_listen_work(). Start close stream_wait in case of prepared sends only. Patch 0015: Remove unneeded socket ref_count in smc_proc_seq_show(). Take lock before list_empty check in smc_proc_sock_list_del(). These patches are the initial part of the implementation of the "Shared Memory Communications-RDMA" (SMC-R) protocol as defined in RFC7609 [1]. While SMC-R does not aim to replace TCP, it taps a wealth of existing data center TCP socket applications to become more efficient without the need for rewriting them. SMC-R uses RDMA over Converged Ethernet (RoCE) to save CPU consumption. For instance, when running 10 parallel connections with uperf, we measured a decrease of 60% in CPU consumption with SMC-R compared to TCP/IP (with throughput and latency comparable; measured on x86_64 with the same RoCE card and port). SMC-R does not require an RDMA communication manager (RDMA CM). SMC-R inherits TCP qualities such as reliable connections, host-based firewall packet filtering (on connection establishment) and unmodified application of communication encryption such as TLS (transport layer security) or SSL (secure sockets layer). Since original TCP is used to establish SMC-R connections, load balancers and packet inspection based on TCP/IP connection establishment continue to work for SMC-R. On the other hand, using SMC-R implies: - either involving a preload library when invoking the unchanged TCP-application or slightly modifying the source by simply changing the socket family in the socket() call - accepting extra overhead and latency in connection establishment due to SMC Connection Layer Control (CLC) handshake - explicit coupling of RoCE ports with Ethernet ports - not routable as currently built on RoCE V1 - bypassing of packet-based networking features - filtering (netfilter) - sniffing (libpcap, packet sockets, (E)BPF) - traffic control (scheduling, shaping) - bypassing of IP-header based socket options - bypassing of memory buffer (pressure) management - unusable together with IPsec Overview of the SMC-R Protocol described in informational RFC 7609 SMC-R is an open protocol that provides RDMA capabilities over RoCE transparently for applications exploiting TCP sockets. A new socket protocol family PF_SMC is introduced. There are no changes required to applications using the sockets API for TCP stream sockets other than the specification of the new socket family AF_SMC. Unmodified applications can be used by means of a dynamic preload shared library which rewrites the socket API call socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) into socket(AF_SMC, SOCK_STREAM, IPPROTO_TCP). SMC-R re-uses the address family AF_INET for all addressing purposes around struct sockaddr. SMC-R system architecture layers: +=============================================================================+ \| \| unmodified TCP application \| \| native SMC application +--------------------------------------+ \| \| dynamic preload shared library \| +=============================================================================+ \| SMC socket \| +-----------------------------------------------------------------------------+ \| \| TCP socket (for connection establishment and fallback) \| \| IB verbs +--------------------------------------------------------+ \| \| IP \| +--------------------+--------------------------------------------------------+ \| RoCE device driver \| some network device driver \| +=============================================================================+ Terms: A link group is determined by an ordered peer pair of TCP client and TCP server (IP addresses and subnet). Reversed client server roles cause an own link group. A link is a logical point-to-point connection based on an infiniband reliable connected queue pair (RC-QP) between two RoCE ports (MACs and GIDs) of a peer pair. A link group can have 1..8 links for failover and load balancing. This initial Linux implementation always has 1 link per link group. Each link group on a peer can have 1..255 remote memory buffers (RMBs). If more RMBs are needed, a peer can open another link group (this initial Linux implementation) or fall back to TCP. Each RMB has its own particular size and its own (R)DMA mapping and credentials (rtoken consisting of rkey and RDMA "virtual address"). This initial Linux implementation uses physically contiguous memory for RMBs but we are working towards scattered memory because of memory fragmentation. Each RMB has 1..255 RMB elements (RMBEs) of equal size to provide multiplexing of connections within an RMB. An RMBE is the RDMA Write destination organized as wrapping ring buffer for data transmit of a particular connection in one direction (duplex by means of mirror symmetry as with TCP). This initial Linux implementation always has 1 RMBE per RMB and thus an individual RMB for each connection. SMC-R connection establishment with subsequent data transfer: CLIENT SERVER TCP three-way handshake: regular TCP SYN --------------------------------------------------------> regular TCP SYN ACK <-------------------------------------------------------- regular TCP ACK --------------------------------------------------------> SMC Connection Layer Control (CLC) handshake exchanges RDMA credentials between peers: via above TCP connection: SMC CLC Proposal --------------------------------------------------------> via above TCP connection: SMC CLC Accept <-------------------------------------------------------- via above TCP connection: SMC CLC Confirm --------------------------------------------------------> SMC Link Layer Control (LLC) (only once per link, i.e. 1st conn. of link group): RoCE RC-QP: SMC LLC Confirm Link <======================================================== RoCE RC-QP: SMC LLC Confirm Link response ========================================================> SMC data transmission (incl. SMC Connection Data Control (CDC) message): RoCE RC-QP: RDMA Write ========================================================> RoCE RC-QP: SMC CDC message (flow control) ========================================================> ... RoCE RC-QP: RDMA Write <======================================================== RoCE RC-QP: SMC CDC message (flow control) <======================================================== ... Data flow within an established connection: +---------------------------------------------------------------------------- \| SENDER \| sendmsg() \| \| \| \| produces into sndbuf [sender's process context] \| v \| +--------+ \| \| sndbuf \| [ring buffer] \| +--------+ \| \| \| \| consumes from sndbuf and produces into receiver's RMBE [any context] \| \| by sending RDMA Write followed by SMC CDC message over RoCE RC-QP \| \| +----\|----------------------------------------------------------------------- \| +----\|----------------------------------------------------------------------- \| v RECEIVER \| +------+ \| \| RMBE \| [ring buffer, can have size different from sender's sndbuf] \| \| \| [RMBE represents rcvbuf, no further de-coupling as on sender side] \| +------+ \| \| \| \| consumes from RMBE [receiver's process context] \| v \| recvmsg() +---------------------------------------------------------------------------- Flow control ("cursor" updates) by means of SMC CDC messages: SENDER RECEIVER sends updates via CDC-------------+ sends updates via CDC on consuming from sndbuf \| on consuming from RMBE and producing into RMBE \| by means of recvmsg() \| \| \| \| +-----------------------------------\|------------+ \| \| +--v-------------------------+ +--v-----------------------+ \| receiver's consumer cursor \| \| sender's producer cursor----+ +----------------\|-----------+ +--------------------------+ \| \| \| \| receiver's RMBE \| \| +--------------------------+ \| \| \| \| \| +--------------------------------+ \| \| \| \| \| \| \| v \| \| \| +------------\| \| \|-------------+////////////\| \| \|//RDMA data written by////\| \| \|////sender that is////////\| \| \|/available to be consumed/\| \| \|///////// +---------------\| \| \|----------+^ \| \| \| \| \| \| \| +-----------------+ \| \| +--------------------------+ Sending updates of the producer cursor is immediate for low latency; something like Nagle's algorithm (absence of TCP_NODELAY) is optional and currently not part of this initial Linux implementation. Sending updates of the consumer cursor is conditional to avoid the silly window syndrome. Normal connection termination: Normal connection termination starts transitioning from socket state ACTIVE via either "Active Close" or "Passive Close". shutdown rdwr +-----------------+ or close, +-------------->\| INIT / CLOSED \|<-------------+ send PeerCon\|nClosed +-----------------+ \| PeerConnClosed \| \| \| received \| connection \| established \| \| V \| +----------------+ +-----------------+ +----------------+ \|AppFinCloseWait \| \| ACTIVE \| \|PeerFinCloseWait\| +----------------+ +-----------------+ +----------------+ \| \| \| \| \| Active Close: \| \|Passive Close: \| \| close or \| \|PeerConnClosed or \| \| shutdown wr or\| \|PeerDoneWriting \| \| shutdown rdwr \| \|received \| \| V V \| PeerConnClo\|sed +--------------+ +-------------+ \| close or received +--<----\|PeerCloseWait1\| \|AppCloseWait1\|--->----+ shutdown rdwr, \| +--------------+ +-------------+ \| send \| PeerDoneWri\|ting \| shutdown wr, \| PeerConnClosed \| received \| send Pee\|rDoneWriting \| \| V V \| \| +--------------+ +-------------+ \| +--<----\|PeerCloseWait2\| \|AppCloseWait2\|--->----+ +--------------+ +-------------+ In state CLOSED, the socket can be destructed only, once the application has issued a close(). Abnormal connection termination: +-----------------+ +-------------->\| INIT / CLOSED \|<-------------+ \| +-----------------+ \| \| \| \| +-----------------------+ \| \| \| Any state \| \| PeerConnAbo\|rt \| (before setting \| \| send received \| \| PeerConnClosed \| \| PeerConnAbort \| \| indicator in \| \| \| \| peer's RMBE) \| \| \| +-----------------------+ \| \| \| \| \| \| Active Abort: \| \| Passive Abort: \| \| problem, \| \| PeerConnAbort \| \| send \| \| received, \| \| PeerConnAbort,\| \| ECONNRESET \| \| ECONNABORTED \| \| \| \| V V \| \| +--------------+ +--------------+ \| +-------\|PeerAbortWait \| \| ProcessAbort \|------+ +--------------+ +--------------+ Implementation notes beyond RFC 7609: A PNET table in sysfs provides the mapping between network device names and RoCE Infiniband device names for the transparent switch of data communication. A PNET table can contain an arbitrary number of PNETIDs. Each PNETID contains exactly one (Ethernet) network device name and one or more RoCE Infiniband device names. Each device name can only exist in at most one PNETID (no overlapping). This initial Linux implementation allows at most one RoCE Infiniband device name per PNETID. After a new TCP connection is established, the network device name used for egress traffic with the TCP connection's local source IP address is used as key to lookup the unique PNETID, and the RoCE Infiniband device of this PNETID is used to switch data communication from TCP to RDMA during SMC CLC handshake. Problem determination: A protocol dissector is available with upstream wireshark for formatting SMC-R related RoCE LAN traffic. [https://code.wireshark.org/review/gitweb?p=wireshark.git;a=blob;f=epan/dissectors/packet-smcr.c] We are working on enhancing the Linux implementation to cover: - Improve default socket closing asynchronicity - Address corner cases with many parallel connections - Tracing - Integrated load balancing and fail-over within a link group - Splice and sendpage support - IPv6 addressing support - Keepalive, Cork - Namespaces support - Urgent data - More socket options - Diagnostics - Statistics support - SNMP support References: [1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:07:41 -05:00
Ursula Braun	f16a7dd5cf	smc: netlink interface for SMC sockets Support for SMC socket monitoring via netlink sockets of protocol NETLINK_SOCK_DIAG. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:07:41 -05:00
Ursula Braun	b38d732477	smc: socket closing and linkgroup cleanup smc_shutdown() and smc_release() handling delayed linkgroup cleanup for linkgroups without connections Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:07:40 -05:00
Ursula Braun	952310ccf2	smc: receive data from RMBE move RMBE data into user space buffer and update managing cursors Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:07:40 -05:00
Ursula Braun	e6727f3900	smc: send data (through RDMA) copy data to kernel send buffer, and trigger RDMA write Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:07:40 -05:00
Ursula Braun	5f08318f61	smc: connection data control (CDC) send and receive CDC messages (via IB message send and CQE) Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:07:40 -05:00
Ursula Braun	9bf9abead2	smc: link layer control (LLC) send and receive LLC messages CONFIRM_LINK (via IB message send and CQE) Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:07:40 -05:00
Ursula Braun	bd4ad57718	smc: initialize IB transport incl. PD, MR, QP, CQ, event, WR Prepare the link for RDMA transport: Create a queue pair (QP) and move it into the state Ready-To-Receive (RTR). Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:07:39 -05:00
Ursula Braun	f38ba179c6	smc: work request (WR) base for use by LLC and CDC The base containers for RDMA transport are work requests and completion queue entries processed through Infiniband verbs: * allocate and initialize these areas * map these areas to DMA * implement the basic communication consisting of work request posting and receival of completion queue events Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:07:39 -05:00
Ursula Braun	cd6851f303	smc: remote memory buffers (RMBs) * allocate data RMB memory for sending and receiving * size depends on the maximum socket send and receive buffers * allocated RMBs are kept during life time of the owning link group * map the allocated RMBs to DMA Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-01-09 16:07:39 -05:00

1 2 3 4 5 ...

648813 Commits