422 Commits

Author SHA1 Message Date
Sowmini Varadhan
b7bc600b05 RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock
commit eb192840266fab3e3da644018121eed30153355d upstream.

There is a race condition between rds_send_xmit -> rds_tcp_xmit
and the code that deals with resolution of duelling syns added
by commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()").

Specifically, we may end up derefencing a null pointer in rds_send_xmit
if we have the interleaving sequence:
           rds_tcp_accept_one                  rds_send_xmit

                                             conn is RDS_CONN_UP, so
    					 invoke rds_tcp_xmit

                                             tc = conn->c_transport_data
        rds_tcp_restore_callbacks
            /* reset t_sock */
    					 null ptr deref from tc->t_sock

The race condition can be avoided without adding the overhead of
additional locking in the xmit path: have rds_tcp_accept_one wait
for rds_tcp_xmit threads to complete before resetting callbacks.
The synchronization can be done in the same manner as rds_conn_shutdown().
First set the rds_conn_state to something other than RDS_CONN_UP
(so that new threads cannot get into rds_tcp_xmit()), then wait for
RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
threads in rds_tcp_xmit are done.

Fixes: 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-10 10:26:01 +02:00
Gerd Rausch
48d127dbee net/rds: Fix 'ib_evt_handler_call' element in 'rds_ib_stat_names'
[ Upstream commit 05a82481a3024b94db00b8c816bb3d526b5209e0 ]

All entries in 'rds_ib_stat_names' are stringified versions
of the corresponding "struct rds_ib_statistics" element
without the "s_"-prefix.

Fix entry 'ib_evt_handler_call' to do the same.

Fixes: f4f943c958a2 ("RDS: IB: ack more receive completions to improve performance")
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-29 10:21:48 +01:00
Dotan Barak
fc9752bcea net/rds: Fix error handling in rds_ib_add_one()
[ Upstream commit d64bf89a75b65f83f06be9fb8f978e60d53752db ]

rds_ibdev:ipaddr_list and rds_ibdev:conn_list are initialized
after allocation some resources such as protection domain.
If allocation of such resources fail, then these uninitialized
variables are accessed in rds_ib_dev_free() in failure path. This
can potentially crash the system. The code has been updated to
initialize these variables very early in the function.

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-07 21:01:06 +02:00
Mao Wenan
91573ae4ae net: rds: Fix NULL ptr use in rds_tcp_kill_sock
After the commit c4e97b06cfdc ("net: rds: force to destroy
connection if t_sock is NULL in rds_tcp_kill_sock()."),
it introduced null-ptr-deref in rds_tcp_kill_sock as below:

BUG: KASAN: null-ptr-deref on address 0000000000000020
Read of size 8 by task kworker/u16:10/910
CPU: 3 PID: 910 Comm: kworker/u16:10 Not tainted 4.4.178+ #3
Hardware name: linux,dummy-virt (DT)
Workqueue: netns cleanup_net
Call trace:
[<ffffff90080abb50>] dump_backtrace+0x0/0x618
[<ffffff90080ac1a0>] show_stack+0x38/0x60
[<ffffff9008c42b78>] dump_stack+0x1a8/0x230
[<ffffff90085d469c>] kasan_report_error+0xc8c/0xfc0
[<ffffff90085d54a4>] kasan_report+0x94/0xd8
[<ffffff90085d1b28>] __asan_load8+0x88/0x150
[<ffffff9009c9cc2c>] rds_tcp_dev_event+0x734/0xb48
[<ffffff90081eacb0>] raw_notifier_call_chain+0x150/0x1e8
[<ffffff900973fec0>] call_netdevice_notifiers_info+0x90/0x110
[<ffffff9009764874>] netdev_run_todo+0x2f4/0xb08
[<ffffff9009796d34>] rtnl_unlock+0x2c/0x48
[<ffffff9009756484>] default_device_exit_batch+0x444/0x528
[<ffffff9009720498>] ops_exit_list+0x1c0/0x240
[<ffffff9009724a80>] cleanup_net+0x738/0xbf8
[<ffffff90081ca6cc>] process_one_work+0x96c/0x13e0
[<ffffff90081cf370>] worker_thread+0x7e0/0x1910
[<ffffff90081e7174>] kthread+0x304/0x390
[<ffffff9008094280>] ret_from_fork+0x10/0x50

If the first loop add the tc->t_sock = NULL to the tmp_list,
1). list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node)

then the second loop is to find connections to destroy, tc->t_sock
might equal NULL, and tc->t_sock->sk happens null-ptr-deref.
2). list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)

Fixes: c4e97b06cfdc ("net: rds: force to destroy connection if t_sock is NULL in rds_tcp_kill_sock().")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-10-05 12:27:38 +02:00
Zhu Yanjun
eded0b11c7 net: rds: fix memory leak in rds_ib_flush_mr_pool
[ Upstream commit 85cb928787eab6a2f4ca9d2a798b6f3bed53ced1 ]

When the following tests last for several hours, the problem will occur.

Server:
    rds-stress -r 1.1.1.16 -D 1M
Client:
    rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30

The following will occur.

"
Starting up....
tsks   tx/s   rx/s  tx+rx K/s    mbi K/s    mbo K/s tx us/c   rtt us cpu
%
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
  1      0      0       0.00       0.00       0.00    0.00 0.00 -1.00
"
>From vmcore, we can find that clean_list is NULL.

>From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker.
Then rds_ib_mr_pool_flush_worker calls
"
 rds_ib_flush_mr_pool(pool, 0, NULL);
"
Then in function
"
int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
                         int free_all, struct rds_ib_mr **ibmr_ret)
"
ibmr_ret is NULL.

In the source code,
"
...
list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
if (ibmr_ret)
        *ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);

/* more than one entry in llist nodes */
if (clean_nodes->next)
        llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list);
...
"
When ibmr_ret is NULL, llist_entry is not executed. clean_nodes->next
instead of clean_nodes is added in clean_list.
So clean_nodes is discarded. It can not be used again.
The workqueue is executed periodically. So more and more clean_nodes are
discarded. Finally the clean_list is NULL.
Then this problem will occur.

Fixes: 1bc144b62524 ("net, rds, Replace xlist in net/rds/xlist.h with llist")
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11 12:24:12 +02:00
Mao Wenan
c4e97b06cf net: rds: force to destroy connection if t_sock is NULL in rds_tcp_kill_sock().
[ Upstream commit cb66ddd156203daefb8d71158036b27b0e2caf63 ]

When it is to cleanup net namespace, rds_tcp_exit_net() will call
rds_tcp_kill_sock(), if t_sock is NULL, it will not call
rds_conn_destroy(), rds_conn_path_destroy() and rds_tcp_conn_free() to free
connection, and the worker cp_conn_w is not stopped, afterwards the net is freed in
net_drop_ns(); While cp_conn_w rds_connect_worker() will call rds_tcp_conn_path_connect()
and reference 'net' which has already been freed.

In rds_tcp_conn_path_connect(), rds_tcp_set_callbacks() will set t_sock = sock before
sock->ops->connect, but if connect() is failed, it will call
rds_tcp_restore_callbacks() and set t_sock = NULL, if connect is always
failed, rds_connect_worker() will try to reconnect all the time, so
rds_tcp_kill_sock() will never to cancel worker cp_conn_w and free the
connections.

Therefore, the condition !tc->t_sock is not needed if it is going to do
cleanup_net->rds_tcp_exit_net->rds_tcp_kill_sock, because tc->t_sock is always
NULL, and there is on other path to cancel cp_conn_w and free
connection. So this patch is to fix this.

rds_tcp_kill_sock():
...
if (net != c_net || !tc->t_sock)
...
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

==================================================================
BUG: KASAN: use-after-free in inet_create+0xbcc/0xd28
net/ipv4/af_inet.c:340
Read of size 4 at addr ffff8003496a4684 by task kworker/u8:4/3721

CPU: 3 PID: 3721 Comm: kworker/u8:4 Not tainted 5.1.0 #11
Hardware name: linux,dummy-virt (DT)
Workqueue: krdsd rds_connect_worker
Call trace:
 dump_backtrace+0x0/0x3c0 arch/arm64/kernel/time.c:53
 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:152
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x120/0x188 lib/dump_stack.c:113
 print_address_description+0x68/0x278 mm/kasan/report.c:253
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x21c/0x348 mm/kasan/report.c:409
 __asan_report_load4_noabort+0x30/0x40 mm/kasan/report.c:429
 inet_create+0xbcc/0xd28 net/ipv4/af_inet.c:340
 __sock_create+0x4f8/0x770 net/socket.c:1276
 sock_create_kern+0x50/0x68 net/socket.c:1322
 rds_tcp_conn_path_connect+0x2b4/0x690 net/rds/tcp_connect.c:114
 rds_connect_worker+0x108/0x1d0 net/rds/threads.c:175
 process_one_work+0x6e8/0x1700 kernel/workqueue.c:2153
 worker_thread+0x3b0/0xdd0 kernel/workqueue.c:2296
 kthread+0x2f0/0x378 kernel/kthread.c:255
 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1117

Allocated by task 687:
 save_stack mm/kasan/kasan.c:448 [inline]
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xd4/0x180 mm/kasan/kasan.c:553
 kasan_slab_alloc+0x14/0x20 mm/kasan/kasan.c:490
 slab_post_alloc_hook mm/slab.h:444 [inline]
 slab_alloc_node mm/slub.c:2705 [inline]
 slab_alloc mm/slub.c:2713 [inline]
 kmem_cache_alloc+0x14c/0x388 mm/slub.c:2718
 kmem_cache_zalloc include/linux/slab.h:697 [inline]
 net_alloc net/core/net_namespace.c:384 [inline]
 copy_net_ns+0xc4/0x2d0 net/core/net_namespace.c:424
 create_new_namespaces+0x300/0x658 kernel/nsproxy.c:107
 unshare_nsproxy_namespaces+0xa0/0x198 kernel/nsproxy.c:206
 ksys_unshare+0x340/0x628 kernel/fork.c:2577
 __do_sys_unshare kernel/fork.c:2645 [inline]
 __se_sys_unshare kernel/fork.c:2643 [inline]
 __arm64_sys_unshare+0x38/0x58 kernel/fork.c:2643
 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
 invoke_syscall arch/arm64/kernel/syscall.c:47 [inline]
 el0_svc_common+0x168/0x390 arch/arm64/kernel/syscall.c:83
 el0_svc_handler+0x60/0xd0 arch/arm64/kernel/syscall.c:129
 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:960

Freed by task 264:
 save_stack mm/kasan/kasan.c:448 [inline]
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:521
 kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:528
 slab_free_hook mm/slub.c:1370 [inline]
 slab_free_freelist_hook mm/slub.c:1397 [inline]
 slab_free mm/slub.c:2952 [inline]
 kmem_cache_free+0xb8/0x3a8 mm/slub.c:2968
 net_free net/core/net_namespace.c:400 [inline]
 net_drop_ns.part.6+0x78/0x90 net/core/net_namespace.c:407
 net_drop_ns net/core/net_namespace.c:406 [inline]
 cleanup_net+0x53c/0x6d8 net/core/net_namespace.c:569
 process_one_work+0x6e8/0x1700 kernel/workqueue.c:2153
 worker_thread+0x3b0/0xdd0 kernel/workqueue.c:2296
 kthread+0x2f0/0x378 kernel/kthread.c:255
 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1117

The buggy address belongs to the object at ffff8003496a3f80
 which belongs to the cache net_namespace of size 7872
The buggy address is located 1796 bytes inside of
 7872-byte region [ffff8003496a3f80, ffff8003496a5e40)
The buggy address belongs to the page:
page:ffff7e000d25a800 count:1 mapcount:0 mapping:ffff80036ce4b000
index:0x0 compound_mapcount: 0
flags: 0xffffe0000008100(slab|head)
raw: 0ffffe0000008100 dead000000000100 dead000000000200 ffff80036ce4b000
raw: 0000000000000000 0000000080040004 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8003496a4580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8003496a4600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8003496a4680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff8003496a4700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8003496a4780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Fixes: 467fa15356ac("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27 09:33:54 +02:00
Santosh Shilimkar
cfc501dfae rds: avoid unenecessary cong_update in loop transport
commit f1693c63ab133d16994cc50f773982b5905af264 upstream.

Loop transport which is self loopback, remote port congestion
update isn't relevant. Infact the xmit path already ignores it.
Receive path needs to do the same.

Reported-by: syzbot+4c20b3866171ce8441d2@syzkaller.appspotmail.com
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-22 14:25:54 +02:00
Guanglei Li
c9eb334707 RDS: IB: Fix null pointer issue
[ Upstream commit 2c0aa08631b86a4678dbc93b9caa5248014b4458 ]

Scenario:
1. Port down and do fail over
2. Ap do rds_bind syscall

PID: 47039  TASK: ffff89887e2fe640  CPU: 47  COMMAND: "kworker/u:6"
 #0 [ffff898e35f159f0] machine_kexec at ffffffff8103abf9
 #1 [ffff898e35f15a60] crash_kexec at ffffffff810b96e3
 #2 [ffff898e35f15b30] oops_end at ffffffff8150f518
 #3 [ffff898e35f15b60] no_context at ffffffff8104854c
 #4 [ffff898e35f15ba0] __bad_area_nosemaphore at ffffffff81048675
 #5 [ffff898e35f15bf0] bad_area_nosemaphore at ffffffff810487d3
 #6 [ffff898e35f15c00] do_page_fault at ffffffff815120b8
 #7 [ffff898e35f15d10] page_fault at ffffffff8150ea95
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: ffff898e35f15dc8  RFLAGS: 00010282
    RAX: 00000000fffffffe  RBX: ffff889b77f6fc00  RCX:ffffffff81c99d88
    RDX: 0000000000000000  RSI: ffff896019ee08e8  RDI:ffff889b77f6fc00
    RBP: ffff898e35f15df0   R8: ffff896019ee08c8  R9:0000000000000000
    R10: 0000000000000400  R11: 0000000000000000  R12:ffff896019ee08c0
    R13: ffff889b77f6fe68  R14: ffffffff81c99d80  R15: ffffffffa022a1e0
    ORIG_RAX: ffffffffffffffff  CS: 0010 SS: 0018
 #8 [ffff898e35f15dc8] cma_ndev_work_handler at ffffffffa022a228 [rdma_cm]
 #9 [ffff898e35f15df8] process_one_work at ffffffff8108a7c6
 #10 [ffff898e35f15e58] worker_thread at ffffffff8108bda0
 #11 [ffff898e35f15ee8] kthread at ffffffff81090fe6

PID: 45659  TASK: ffff880d313d2500  CPU: 31  COMMAND: "oracle_45659_ap"
 #0 [ffff881024ccfc98] __schedule at ffffffff8150bac4
 #1 [ffff881024ccfd40] schedule at ffffffff8150c2cf
 #2 [ffff881024ccfd50] __mutex_lock_slowpath at ffffffff8150cee7
 #3 [ffff881024ccfdc0] mutex_lock at ffffffff8150cdeb
 #4 [ffff881024ccfde0] rdma_destroy_id at ffffffffa022a027 [rdma_cm]
 #5 [ffff881024ccfe10] rds_ib_laddr_check at ffffffffa0357857 [rds_rdma]
 #6 [ffff881024ccfe50] rds_trans_get_preferred at ffffffffa0324c2a [rds]
 #7 [ffff881024ccfe80] rds_bind at ffffffffa031d690 [rds]
 #8 [ffff881024ccfeb0] sys_bind at ffffffff8142a670

PID: 45659                          PID: 47039
rds_ib_laddr_check
  /* create id_priv with a null event_handler */
  rdma_create_id
  rdma_bind_addr
    cma_acquire_dev
      /* add id_priv to cma_dev->id_list */
      cma_attach_to_dev
                                    cma_ndev_work_handler
                                      /* event_hanlder is null */
                                      id_priv->id.event_handler

Signed-off-by: Guanglei Li <guanglei.li@oracle.com>
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:48:56 +02:00
Sowmini Varadhan
f69275de10 rds; Reset rs->rs_bound_addr in rds_add_bound() failure path
[ Upstream commit 7ae0c649c47f1c5d2db8cee6dd75855970af1669 ]

If the rds_sock is not added to the bind_hash_table, we must
reset rs_bound_addr so that rds_remove_bound will not trip on
this rds_sock.

rds_add_bound() does a rds_sock_put() in this failure path, so
failing to reset rs_bound_addr will result in a socket refcount
bug, and will trigger a WARN_ON with the stack shown below when
the application subsequently tries to close the PF_RDS socket.

     WARNING: CPU: 20 PID: 19499 at net/rds/af_rds.c:496 \
		rds_sock_destruct+0x15/0x30 [rds]
       :
     __sk_destruct+0x21/0x190
     rds_remove_bound.part.13+0xb6/0x140 [rds]
     rds_release+0x71/0x120 [rds]
     sock_release+0x1a/0x70
     sock_close+0xe/0x20
     __fput+0xd5/0x210
     task_work_run+0x82/0xa0
     do_exit+0x2ce/0xb30
     ? syscall_trace_enter+0x1cc/0x2b0
     do_group_exit+0x39/0xa0
     SyS_exit_group+0x10/0x10
     do_syscall_64+0x61/0x1a0

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13 19:50:13 +02:00
Mohamed Ghannam
44496521c6 RDS: null pointer dereference in rds_atomic_free_op
[ Upstream commit 7d11f77f84b27cef452cee332f4e469503084737 ]

set rm->atomic.op_active to 0 when rds_pin_pages() fails
or the user supplied address is invalid,
this prevents a NULL pointer usage in rds_atomic_free_op()

Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:35:29 +01:00
Mohamed Ghannam
10d06ed9a1 RDS: Heap OOB write in rds_message_alloc_sgs()
[ Upstream commit c095508770aebf1b9218e77026e48345d719b17c ]

When args->nr_local is 0, nr_pages gets also 0 due some size
calculation via rds_rm_size(), which is later used to allocate
pages for DMA, this bug produces a heap Out-Of-Bound write access
to a specific memory region.

Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:35:29 +01:00
Håkon Bugge
6c154d536d rds: Fix NULL pointer dereference in __rds_rdma_map
[ Upstream commit f3069c6d33f6ae63a1668737bc78aaaa51bff7ca ]

This is a fix for syzkaller719569, where memory registration was
attempted without any underlying transport being loaded.

Analysis of the case reveals that it is the setsockopt() RDS_GET_MR
(2) and RDS_GET_MR_FOR_DEST (7) that are vulnerable.

Here is an example stack trace when the bug is hit:

BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
IP: __rds_rdma_map+0x36/0x440 [rds]
PGD 2f93d03067 P4D 2f93d03067 PUD 2f93d02067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: bridge stp llc tun rpcsec_gss_krb5 nfsv4
dns_resolver nfs fscache rds binfmt_misc sb_edac intel_powerclamp
coretemp kvm_intel kvm irqbypass crct10dif_pclmul c rc32_pclmul
ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd
iTCO_wdt mei_me sg iTCO_vendor_support ipmi_si mei ipmi_devintf nfsd
shpchp pcspkr i2c_i801 ioatd ma ipmi_msghandler wmi lpc_ich mfd_core
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2
mgag200 i2c_algo_bit drm_kms_helper ixgbe syscopyarea ahci sysfillrect
sysimgblt libahci mdio fb_sys_fops ttm ptp libata sd_mod mlx4_core drm
crc32c_intel pps_core megaraid_sas i2c_core dca dm_mirror
dm_region_hash dm_log dm_mod
CPU: 48 PID: 45787 Comm: repro_set2 Not tainted 4.14.2-3.el7uek.x86_64 #2
Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 31110000 03/03/2017
task: ffff882f9190db00 task.stack: ffffc9002b994000
RIP: 0010:__rds_rdma_map+0x36/0x440 [rds]
RSP: 0018:ffffc9002b997df0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff882fa2182580 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffc9002b997e40 RDI: ffff882fa2182580
RBP: ffffc9002b997e30 R08: 0000000000000000 R09: 0000000000000002
R10: ffff885fb29e3838 R11: 0000000000000000 R12: ffff882fa2182580
R13: ffff882fa2182580 R14: 0000000000000002 R15: 0000000020000ffc
FS:  00007fbffa20b700(0000) GS:ffff882fbfb80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c0 CR3: 0000002f98a66006 CR4: 00000000001606e0
Call Trace:
 rds_get_mr+0x56/0x80 [rds]
 rds_setsockopt+0x172/0x340 [rds]
 ? __fget_light+0x25/0x60
 ? __fdget+0x13/0x20
 SyS_setsockopt+0x80/0xe0
 do_syscall_64+0x67/0x1b0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7fbff9b117f9
RSP: 002b:00007fbffa20aed8 EFLAGS: 00000293 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00000000000c84a4 RCX: 00007fbff9b117f9
RDX: 0000000000000002 RSI: 0000400000000114 RDI: 000000000000109b
RBP: 00007fbffa20af10 R08: 0000000000000020 R09: 00007fbff9dd7860
R10: 0000000020000ffc R11: 0000000000000293 R12: 0000000000000000
R13: 00007fbffa20b9c0 R14: 00007fbffa20b700 R15: 0000000000000021

Code: 41 56 41 55 49 89 fd 41 54 53 48 83 ec 18 8b 87 f0 02 00 00 48
89 55 d0 48 89 4d c8 85 c0 0f 84 2d 03 00 00 48 8b 87 00 03 00 00 <48>
83 b8 c0 00 00 00 00 0f 84 25 03 00 0 0 48 8b 06 48 8b 56 08

The fix is to check the existence of an underlying transport in
__rds_rdma_map().

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-16 10:33:56 +01:00
Santosh Shilimkar
57d2ce1603 RDS: RDMA: return appropriate error on rdma map failures
[ Upstream commit 584a8279a44a800dea5a5c1e9d53a002e03016b4 ]

The first message to a remote node should prompt a new
connection even if it is RDMA operation. For RDMA operation
the MR mapping can fail because connections is not yet up.

Since the connection establishment is asynchronous,
we make sure the map failure because of unavailable
connection reach to the user by appropriate error code.
Before returning to the user, lets trigger the connection
so that its ready for the next retry.

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:37:25 +00:00
Zhu Yanjun
13099ee9c7 rds: ib: add error handle
[ Upstream commit 3b12f73a5c2977153f28a224392fd4729b50d1dc ]

In the function rds_ib_setup_qp, the error handle is missing. When some
error occurs, it is possible that memory leak occurs. As such, error
handle is added.

Cc: Joe Jin <joe.jin@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Guanglei Li <guanglei.li@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-08 10:14:19 +02:00
Santosh Shilimkar
6a501bddeb RDS: RDMA: Fix the composite message user notification
[ Upstream commit 941f8d55f6d613a460a5e080d25a38509f45eb75 ]

When application sends an RDS RDMA composite message consist of
RDMA transfer to be followed up by non RDMA payload, it expect to
be notified *only* when the full message gets delivered. RDS RDMA
notification doesn't behave this way though.

Thanks to Venkat for debug and root casuing the issue
where only first part of the message(RDMA) was
successfully delivered but remainder payload delivery failed.
In that case, application should not be notified with
a false positive of message delivery success.

Fix this case by making sure the user gets notified only after
the full message delivery.

Reviewed-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-08 10:14:16 +02:00
Sowmini Varadhan
9618eb4af3 rds: tcp: use sock_create_lite() to create the accept socket
commit 0933a578cd55b02dc80f219dc8f2efb17ec61c9a upstream.

There are two problems with calling sock_create_kern() from
rds_tcp_accept_one()
1. it sets up a new_sock->sk that is wasteful, because this ->sk
   is going to get replaced by inet_accept() in the subsequent ->accept()
2. The new_sock->sk is a leaked reference in sock_graft() which
   expects to find a null parent->sk

Avoid these problems by calling sock_create_lite().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:55 +02:00
santosh.shilimkar@oracle.com
804605eae4 RDS: Fix the atomicity for congestion map update
commit e47db94e10447fc467777a40302f2b393e9af2fa upstream.

Two different threads with different rds sockets may be in
rds_recv_rcvbuf_delta() via receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
be incorrect. Lets use atomics to avoid such an issue.

Full credit to Wengang <wen.gang.wang@oracle.com> for
finding the issue, analysing it and also pointing out
to offending code with spin lock based fix.

Reviewed-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-05-02 21:19:49 -07:00
Kangjie Lu
ffd5ce2ad5 rds: fix an infoleak in rds_inc_info_copy
commit 4116def2337991b39919f3b448326e21c40e0dbb upstream.

The last field "flags" of object "minfo" is not initialized.
Copying this object out may leak kernel stack data.
Assign 0 to it to avoid leak.

Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-09-15 08:27:51 +02:00
Vegard Nossum
b3a061d1d8 RDS: fix rds_tcp_init() error path
commit 3dad5424adfb346c871847d467f97dcdca64ea97 upstream.

If register_pernet_subsys() fails, we shouldn't try to call
unregister_pernet_subsys().

Fixes: 467fa15356 ("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.")
Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-10 11:49:29 +02:00
Quentin Casasnovas
8c7188b234 RDS: fix race condition when sending a message on unbound socket
Sasha's found a NULL pointer dereference in the RDS connection code when
sending a message to an apparently unbound socket.  The problem is caused
by the code checking if the socket is bound in rds_sendmsg(), which checks
the rs_bound_addr field without taking a lock on the socket.  This opens a
race where rs_bound_addr is temporarily set but where the transport is not
in rds_bind(), leading to a NULL pointer dereference when trying to
dereference 'trans' in __rds_conn_create().

Vegard wrote a reproducer for this issue, so kindly ask him to share if
you're interested.

I cannot reproduce the NULL pointer dereference using Vegard's reproducer
with this patch, whereas I could without.

Complete earlier incomplete fix to CVE-2015-6937:

  74e98eb08588 ("RDS: verify the underlying transport exists before creating a connection")

Cc: David S. Miller <davem@davemloft.net>
Cc: stable@vger.kernel.org

Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Reviewed-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24 17:20:09 -05:00
Linus Torvalds
ad804a0b2a Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton:

 - most of the rest of MM

 - procfs

 - lib/ updates

 - printk updates

 - bitops infrastructure tweaks

 - checkpatch updates

 - nilfs2 update

 - signals

 - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
   dma-debug, dma-mapping, ...

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
  ipc,msg: drop dst nil validation in copy_msg
  include/linux/zutil.h: fix usage example of zlib_adler32()
  panic: release stale console lock to always get the logbuf printed out
  dma-debug: check nents in dma_sync_sg*
  dma-mapping: tidy up dma_parms default handling
  pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
  kexec: use file name as the output message prefix
  fs, seqfile: always allow oom killer
  seq_file: reuse string_escape_str()
  fs/seq_file: use seq_* helpers in seq_hex_dump()
  coredump: change zap_threads() and zap_process() to use for_each_thread()
  coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
  signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
  signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
  signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
  signals: kill block_all_signals() and unblock_all_signals()
  nilfs2: fix gcc uninitialized-variable warnings in powerpc build
  nilfs2: fix gcc unused-but-set-variable warnings
  MAINTAINERS: nilfs2: add header file for tracing
  nilfs2: add tracepoints for analyzing reading and writing metadata files
  ...
2015-11-07 14:32:45 -08:00
Linus Torvalds
ab9f2faf8f Initial 4.4 merge window submission
- "Checksum offload support in user space" enablement
 - Misc cxgb4 fixes, add T6 support
 - Misc usnic fixes
 - 32 bit build warning fixes
 - Misc ocrdma fixes
 - Multicast loopback prevention extension
 - Extend the GID cache to store and return attributes of GIDs
 - Misc iSER updates
 - iSER clustering update
 - Network NameSpace support for rdma CM
 - Work Request cleanup series
 - New Memory Registration API
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWPO5UAAoJELgmozMOVy/dSCQP/iX2ImMZOS3VkOYKhLR3dSv8
 4vTEiYIoAT1JEXiPpiabuuACwotcZcMRk9kZ0dcWmBoFusTzKJmoDOkgAYd95XqY
 EsAyjqtzUGNNMjH5u5W+kdbaFdH9Ktq7IJvspRlJuvzC47Srax+qBxX01jrAkDgh
 4PoA3hEa2KkvkDjY2Mhvk9EWd/uflO9Ky6o0D8jUQkWtEvKBRyDjQLk30oW6wHX9
 pTWqww3dD0EXTrR+PDA88v2saKH1kZFU1Nt2eU8Bw+zlJM8hcX6U7PfRX0g3HT/J
 o+7ejTdLPWFDH35gJOU+KE519f1JbwfRjPJCqbOC9IttBB7iHSbhcpQLpWv4JV1x
 agdBeDA3TGQj3dHb2SkYMlWXCBp7q8UCbVGvvirTFzGSGU73sc6hhP+vCKvPQIlE
 Ah5tUqD7Y3mOBjvuDeIzKMLXILd5d3cH+m7Laytrf5e7fJPmBRZyOkcMh0QVElyl
 mKo+PFjghgeTFb405J7SDDw/vThVyN9HyIt7AGEzObaajzOOk9R1hkQr46XVy9TK
 yi58fl85yQ2n6TWV6NRnvkQoMy/N2HAEuXk/7HtO0PabV5w3Lo0zvXB9SnVrrVEm
 58FWRBYCWorVSdSacuDnPm0iz45WSRIb9G9sBlhEC93eXRq2rSBoy4RvyLeliHFH
 hllyhNNolI6FJ64j07Xm
 =bBIY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "This is my initial round of 4.4 merge window patches.  There are a few
  other things I wish to get in for 4.4 that aren't in this pull, as
  this represents what has gone through merge/build/run testing and not
  what is the last few items for which testing is not yet complete.

   - "Checksum offload support in user space" enablement
   - Misc cxgb4 fixes, add T6 support
   - Misc usnic fixes
   - 32 bit build warning fixes
   - Misc ocrdma fixes
   - Multicast loopback prevention extension
   - Extend the GID cache to store and return attributes of GIDs
   - Misc iSER updates
   - iSER clustering update
   - Network NameSpace support for rdma CM
   - Work Request cleanup series
   - New Memory Registration API"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (76 commits)
  IB/core, cma: Make __attribute_const__ declarations sparse-friendly
  IB/core: Remove old fast registration API
  IB/ipath: Remove fast registration from the code
  IB/hfi1: Remove fast registration from the code
  RDMA/nes: Remove old FRWR API
  IB/qib: Remove old FRWR API
  iw_cxgb4: Remove old FRWR API
  RDMA/cxgb3: Remove old FRWR API
  RDMA/ocrdma: Remove old FRWR API
  IB/mlx4: Remove old FRWR API support
  IB/mlx5: Remove old FRWR API support
  IB/srp: Dont allocate a page vector when using fast_reg
  IB/srp: Remove srp_finish_mapping
  IB/srp: Convert to new registration API
  IB/srp: Split srp_map_sg
  RDS/IW: Convert to new memory registration API
  svcrdma: Port to new memory registration API
  xprtrdma: Port to new memory registration API
  iser-target: Port to new memory registration API
  IB/iser: Port to new fast registration API
  ...
2015-11-07 13:33:07 -08:00
Mel Gorman
d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
santosh.shilimkar@oracle.com
7b5654349e RDS: convert bind hash table to re-sizable hashtable
To further improve the RDS connection scalabilty on massive systems
where number of sockets grows into tens of thousands  of sockets, there
is a need of larger bind hashtable. Pre-allocated 8K or 16K table is
not very flexible in terms of memory utilisation. The rhashtable
infrastructure gives us the flexibility to grow the hashtbable based
on use and also comes up with inbuilt efficient bucket(chain) handling.

Reviewed-by: David Miller <davem@davemloft.net>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 15:36:23 -05:00
Saurabh Sengar
d3ffaefa1b net: rds: changing the return type from int to void
as result of function rds_iw_flush_mr_pool is nowhere checked,
changing its return type from int to void.
also removing the unused variable rc as there is nothing to return

Signed-off-by: Saurabh Sengar <saurabh.truth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 15:35:19 -05:00
David S. Miller
b75ec3af27 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-11-01 00:15:30 -04:00
Sagi Grimberg
9ddc87374a RDS/IW: Convert to new memory registration API
Get rid of fast_reg page list and its construction.
Instead, just pass the RDS sg list to ib_map_mr_sg
and post the new ib_reg_wr.

This is done both for server IW RDMA_READ registration
and the client remote key registration.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-28 22:27:18 -04:00
Doug Ledford
63e8790d39 Merge branch 'wr-cleanup' into k.o/for-4.4 2015-10-28 22:23:34 -04:00
Guy Shapiro
fa20105e09 IB/cma: Add support for network namespaces
Add support for network namespaces in the ib_cma module. This is
accomplished by:

1. Adding network namespace parameter for rdma_create_id. This parameter is
   used to populate the network namespace field in rdma_id_private.
   rdma_create_id keeps a reference on the network namespace.
2. Using the network namespace from the rdma_id instead of init_net inside
   of ib_cma, when listening on an ID and when looking for an ID for an
   incoming request.
3. Decrementing the reference count for the appropriate network namespace
   when calling rdma_destroy_id.

In order to preserve the current behavior init_net is passed when calling
from other modules.

Signed-off-by: Guy Shapiro <guysh@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-28 12:32:48 -04:00
Sowmini Varadhan
8ce675ff39 RDS-TCP: Recover correctly from pskb_pull()/pksb_trim() failure in rds_tcp_data_recv
Either of pskb_pull() or pskb_trim() may fail under low memory conditions.
If rds_tcp_data_recv() ignores such failures, the application will
receive corrupted data because the skb has not been correctly
carved to the RDS datagram size.

Avoid this by handling pskb_pull/pskb_trim failure in the same
manner as the skb_clone failure: bail out of rds_tcp_data_recv(), and
retry via the deferred call to rds_send_worker() that gets set up on
ENOMEM from rds_tcp_read_sock()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-27 19:46:34 -07:00
santosh.shilimkar@oracle.com
7b4b000951 RDS: fix rds-ping deadlock over TCP transport
Sowmini found hang with rds-ping while testing RDS over TCP. Its
a corner case and doesn't happen always. The issue is not reproducible
with IB transport. Its clear from below dump why we see it with RDS TCP.

 [<ffffffff8153b7e5>] do_tcp_setsockopt+0xb5/0x740
 [<ffffffff8153bec4>] tcp_setsockopt+0x24/0x30
 [<ffffffff814d57d4>] sock_common_setsockopt+0x14/0x20
 [<ffffffffa096071d>] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp]
 [<ffffffffa093b5f7>] rds_send_xmit+0xd7/0x740 [rds]
 [<ffffffffa093bda2>] rds_send_pong+0x142/0x180 [rds]
 [<ffffffffa0939d34>] rds_recv_incoming+0x274/0x330 [rds]
 [<ffffffff810815ae>] ? ttwu_queue+0x11e/0x130
 [<ffffffff814dcacd>] ? skb_copy_bits+0x6d/0x2c0
 [<ffffffffa0960350>] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp]
 [<ffffffff8153d836>] tcp_read_sock+0x96/0x1c0
 [<ffffffffa0960060>] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp]
 [<ffffffff814d6a90>] ? sock_def_write_space+0xa0/0xa0
 [<ffffffffa09604d1>] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp]
 [<ffffffff81545249>] tcp_data_queue+0x379/0x5b0
 [<ffffffffa0960cdb>] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp]
 [<ffffffff81547fd2>] tcp_rcv_established+0x2e2/0x6e0
 [<ffffffff81552602>] tcp_v4_do_rcv+0x122/0x220
 [<ffffffff81553627>] tcp_v4_rcv+0x867/0x880
 [<ffffffff8152e0b3>] ip_local_deliver_finish+0xa3/0x220

This happens because rds_send_xmit() chain wants to take
sock_lock which is already taken by tcp_v4_rcv() on its
way to rds_tcp_data_ready(). Commit db6526dcb51b ("RDS: use
rds_send_xmit() state instead of RDS_LL_SEND_FULL") which
was trying to opportunistically finish the send request
in same thread context.

But because of above recursive lock hang with RDS TCP,
the send work from rds_send_pong() needs to deferred to
worker to avoid lock up. Given RDS ping is more of connectivity
test than performance critical path, its should be ok even
for transport like IB.

Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by:  Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-18 22:45:55 -07:00
Sowmini Varadhan
241b271952 RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()
Consider the following "duelling syn" sequence between two peers A and B:
        	A		B
        	SYN1     -->
        	    	<--	SYN2
        	SYN2ACK  -->

Note that the SYN/ACK has already been sent out by TCP before
rds_tcp_accept_one() gets invoked as part of callbacks.

If the inet_addr(A) is numerically less than inet_addr(B),
the arbitration scheme in rds_tcp_accept_one() will prefer the
TCP connection triggered by SYN1, and will send a CLOSE for the
SYN2 (just after the SYN2ACK was sent).

Since B also follows the same arbitration scheme, it will send the SYN-ACK
for SYN1 that will set up a healthy ESTABLISHED connection on both sides.
B will also get a  CLOSE for SYN2, which should result in the cleanup
of the TCP state machine for SYN2, but it should not trigger any
stale RDS-TCP callbacks (such as ->writespace, ->state_change etc),
that would disrupt the progress of the SYN2 based RDS-TCP  connection.

Thus the arbitration scheme in rds_tcp_accept_one() should restore
rds_tcp callbacks for the winner before setting them up for the
new accept socket, and also make sure that conn->c_outgoing
is set to 0 so that we do not trigger any reconnect attempts on the
passive side of the tcp socket in the future, in conformance with
commit c82ac7e69efe ("net/rds: RDS-TCP: only initiate reconnect attempt
on outgoing TCP socket.")

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13 04:22:41 -07:00
Sowmini Varadhan
486798001b RDS: Invoke ->laddr_check() in rds_bind() for explicitly bound transports.
The IP address passed to rds_bind() should be vetted by the
transport's ->laddr_check() for a previously bound transport.
This needs to be done to avoid cases where, for example,
the application has asked for an IB transport,
but the IP address passed to bind is only usable on
ethernet interfaces.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13 04:22:40 -07:00
David S. Miller
91d2f14bc3 Merge branch 'net/rds/4.3-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux
Santosh Shilimkar says:

====================
RDS: connection scalability and performance improvements

[v4]
Re-sending the same patches from v3 again since my repost of
patch 05/14 from v3 was whitespace damaged.

[v3]
Updated patch "[PATCH v2 05/14] RDS: defer the over_batch work to
send worker" as per David Miller's comment [4] to avoid the magic
value usage. Patch now makes use of already available but unused
send_batch_count module parameter. Rest of the patches are same as
earlier version v2 [3]

[v2]:
Dropped "[PATCH 05/15] RDS: increase size of hash-table to 8K" from
earlier version [1]. I plan to address the hash table scalability using
re-sizable hash tables as suggested by David Laight and David Miller [2]

This series addresses RDS connection bottlenecks on massive workloads and
improve the RDMA performance almost by 3X. RDS TCP also gets a small gain
of about 12%.

RDS is being used in massive systems with high scalability where several
hundred thousand end points and tens of thousands of local processes
are operating in tens of thousand sockets. Being RC(reliable connection),
socket bind and release happens very often and any inefficiencies in
bind hash look ups hurts the overall system performance. RDS bin hash-table
uses global spin-lock which is the biggest bottleneck. To make matter worst,
it uses rcu inside global lock for hash buckets.
This is being addressed by simply using per bucket rw lock which makes the
locking simple and very efficient. The hash table size is still an issue and
I plan to address it by using re-sizable hash tables as suggested on the list.

For RDS RDMA improvement, the completion handling is revamped so that we
can do batch completions. Both send and receive completion handlers are
split logically to achieve the same. RDS 8K messages being one of the
key usecase, mr pool is adapted to have the 8K mrs along with default 1M
mrs. And while doing this, few fixes and couple of bottlenecks seen with
rds_sendmsg() are addressed.

Series applies against 4.3-rc1 as well net-next. Its tested on Oracle
hardware with IB fabric for both bcopy as well as RDMA mode. RDS TCP is
tested with iXGB NIC. Like last time, iWARP transport is untested with
these changes. The patchset is also available at below git repo:

git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git net/rds/4.3-v3

As a side note, the IB HCA driver I used for testing misses at least 3
important patches in upstream to see the full blown IB performance and
am hoping to get that in mainline with help of them.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:38:37 -07:00
Christoph Hellwig
e622f2f4ad IB: split struct ib_send_wr
This patch split up struct ib_send_wr so that all non-trivial verbs
use their own structure which embedds struct ib_send_wr.  This dramaticly
shrinks the size of a WR for most common operations:

sizeof(struct ib_send_wr) (old):	96

sizeof(struct ib_send_wr):		48
sizeof(struct ib_rdma_wr):		64
sizeof(struct ib_atomic_wr):		96
sizeof(struct ib_ud_wr):		88
sizeof(struct ib_fast_reg_wr):		88
sizeof(struct ib_bind_mw_wr):		96
sizeof(struct ib_sig_handover_wr):	80

And with Sagi's pending MR rework the fast registration WR will also be
down to a reasonable size:

sizeof(struct ib_fastreg_wr):		64

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt]
Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc]
Tested-by: Haggai Eran <haggaie@mellanox.com>
Tested-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
2015-10-08 11:09:10 +01:00
Santosh Shilimkar
0676651323 RDS: IB: split mr pool to improve 8K messages performance
8K message sizes are pretty important usecase for RDS current
workloads so we make provison to have 8K mrs available from the pool.
Based on number of SG's in the RDS message, we pick a pool to use.

Also to make sure that we don't under utlise mrs when say 8k messages
are dominating which could lead to 8k pull being exhausted, we fall-back
to 1m pool till 8k pool recovers for use.

This helps to at least push ~55 kB/s bidirectional data which
is a nice improvement.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:02 -07:00
Santosh Shilimkar
41a4e96462 RDS: IB: use max_mr from HCA caps than max_fmr
All HCA drivers seems to popullate max_mr caps and few of
them do both max_mr and max_fmr.

Hence update RDS code to make use of max_mr.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:02 -07:00
Santosh Shilimkar
67161e250a RDS: IB: mark rds_ib_fmr_wq static
Fix below warning by marking rds_ib_fmr_wq static

net/rds/ib_rdma.c:87:25: warning: symbol 'rds_ib_fmr_wq' was not declared. Should it be static?

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:02 -07:00
Santosh Shilimkar
26139dc1db RDS: IB: use already available pool handle from ibmr
rds_ib_mr already keeps the pool handle which it associates
with. Lets use that instead of round about way of fetching
it from rds_ib_device.

No functional change.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:02 -07:00
Santosh Shilimkar
2e1d6b813a RDS: IB: fix the rds_ib_fmr_wq kick call
RDS IB mr pool has its own workqueue 'rds_ib_fmr_wq', so we need
to use queue_delayed_work() to kick the work. This was hurting
the performance since pool maintenance was less often triggered
from other path.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:01 -07:00
Santosh Shilimkar
9441c973e1 RDS: IB: handle rds_ibdev release case instead of crashing the kernel
Just in case we are still handling the QP receive completion while the
rds_ibdev is released, drop the connection instead of crashing the kernel.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:01 -07:00
Santosh Shilimkar
0c28c04500 RDS: IB: split send completion handling and do batch ack
Similar to what we did with receive CQ completion handling, we split
the transmit completion handler so that it lets us implement batched
work completion handling.

We re-use the cq_poll routine and makes use of RDS_IB_SEND_OP to
identify the send vs receive completion event handler invocation.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:01 -07:00
Santosh Shilimkar
f4f943c958 RDS: IB: ack more receive completions to improve performance
For better performance, we split the receive completion IRQ handler. That
lets us acknowledge several WCE events in one call. We also limit the WC
to max 32 to avoid latency. Acknowledging several completions in one call
instead of several calls each time will provide better performance since
less mutual exclusion locks are being performed.

In next patch, send completion is also split which re-uses the poll_cq()
and hence the code is moved to ib_cm.c

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:01 -07:00
Santosh Shilimkar
db6526dcb5 RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL
In Transport indepedent rds_sendmsg(), we shouldn't make decisions based
on RDS_LL_SEND_FULL which is used to manage the ring for RDMA based
transports. We can safely issue rds_send_xmit() and the using its
return value take decision on deferred work. This will also fix
the scenario where at times we are seeing connections stuck with
the LL_SEND_FULL bit getting set and never cleared.

We kick krdsd after any time we see -ENOMEM or -EAGAIN from the
ring allocation code.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:19:01 -07:00
Santosh Shilimkar
4bebdd7a4d RDS: defer the over_batch work to send worker
Current process gives up if its send work over the batch limit.
The work queue will get  kicked to finish off any other requests.
This fixes remainder condition from commit 443be0e5affe ("RDS: make
sure not to loop forever inside rds_send_xmit").

The restart condition is only for the case where we reached to
over_batch code for some other reason so just retrying again
before giving up.

While at it, make sure we use already available 'send_batch_count'
parameter instead of magic value. The batch count threshold value
of 1024 came via commit 443be0e5affe ("RDS: make sure not to loop
forever inside rds_send_xmit"). The idea is to process as big a
batch as we can but at the same time we don't hold other waiting
processes for send. Hence back-off after the send_batch_count
limit (1024) to avoid soft-lock ups.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-10-05 11:18:45 -07:00
Sowmini Varadhan
76b29ef120 RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit
For the same reasons as commit 2f5338442425 ("tcp: allow splice() to
build full TSO packets") and commit 35f9c09fe9c7 ("tcp: tcp_sendpages()
should call tcp_push() once"), rds_tcp_xmit may have multiple pages to
send, so use the MSG_MORE and MSG_SENDPAGE_NOTLAST as hints to
tcp_sendpage()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 03:34:53 -07:00
Sowmini Varadhan
1edd6a14d2 RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune
Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K)
clobbers efficient use of TSO because it inflates the size_goal
that is computed in tcp_sendmsg/tcp_sendpage and skews packet
latency, and the default values for these parameters actually
results in significantly better performance.

In request-response tests using rds-stress with a packet size of
100K with 16 threads (test parameters -q 100000 -a 256 -t16 -d16)
between a single pair of IP addresses achieves a throughput of
6-8 Gbps. Without this patch, throughput maxes at 2-3 Gbps under
equivalent conditions on these platforms.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 03:34:53 -07:00
Sowmini Varadhan
3b20fc3897 RDS: Use a single TCP socket for both send and receive.
Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
for an incoming connection.") modified rds-tcp so that an incoming SYN
would ignore an existing "client" TCP connection which had the local
port set to the transient port.  The motivation for ignoring the existing
"client" connection in f711a6ae was to avoid race conditions and an
endless duel of reconnect attempts triggered by a restart/abort of one
of the nodes in the TCP connection.

However, having separate sockets for active and passive sides
is avoidable, and the simpler model of a single TCP socket for
both send and receives of all RDS connections associated with
that tcp socket makes for easier observability. We avoid the race
conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
The c_outgoing bit is initialized in __rds_conn_create().

A side-effect of re-using the client rds_connection for an incoming
SYN is the potential of encountering duelling SYNs, i.e., we
have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
SYN. The logic to arbitrate this criss-crossing SYN exchange in
rds_tcp_accept_one() has been modified to emulate the BGP state
machine: the smaller IP address should back off from the connection attempt.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 03:34:51 -07:00
Santosh Shilimkar
9b9acde7e8 RDS: Use per-bucket rw lock for bind hash-table
One global lock protecting hash-tables with 1024 buckets isn't
efficient and it shows up in a massive systems with truck
loads of RDS sockets serving multiple databases. The
perf data clearly highlights the contention on the rw
lock in these massive workloads.

When the contention gets worse, the code gets into a state where
it decides to back off on the lock. So while it has disabled interrupts,
it sits and backs off on this lock get. This causes the system to
become sluggish and eventually all sorts of bad things happen.

The simple fix is to move the lock into the hash bucket and
use per-bucket lock to improve the scalability.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-09-30 12:43:25 -04:00
Santosh Shilimkar
2812695988 RDS: fix rds_sock reference bug while doing bind
One need to take rds socket reference while using it and release it
once done with it. rds_add_bind() code path does not do that so
lets fix it.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2015-09-30 12:43:25 -04:00