3031 Commits

Author SHA1 Message Date
David Ruth
0d8a2d287c net/sched: act_api: fix possible infinite loop in tcf_idr_check_alloc()
[ Upstream commit d864319871b05fadd153e0aede4811ca7008f5d6 ]

syzbot found hanging tasks waiting on rtnl_lock [1]

A reproducer is available in the syzbot bug.

When a request to add multiple actions with the same index is sent, the
second request will block forever on the first request. This holds
rtnl_lock, and causes tasks to hang.

Return -EAGAIN to prevent infinite looping, while keeping documented
behavior.

[1]

INFO: task kworker/1:0:5088 blocked for more than 143 seconds.
Not tainted 6.9.0-rc4-syzkaller-00173-g3cdb45594619 #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/1:0 state:D stack:23744 pid:5088 tgid:5088 ppid:2 flags:0x00004000
Workqueue: events_power_efficient reg_check_chans_work
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5409 [inline]
__schedule+0xf15/0x5d00 kernel/sched/core.c:6746
__schedule_loop kernel/sched/core.c:6823 [inline]
schedule+0xe7/0x350 kernel/sched/core.c:6838
schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:6895
__mutex_lock_common kernel/locking/mutex.c:684 [inline]
__mutex_lock+0x5b8/0x9c0 kernel/locking/mutex.c:752
wiphy_lock include/net/cfg80211.h:5953 [inline]
reg_leave_invalid_chans net/wireless/reg.c:2466 [inline]
reg_check_chans_work+0x10a/0x10e0 net/wireless/reg.c:2481

Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action")
Reported-by: syzbot+b87c222546179f4513a7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b87c222546179f4513a7
Signed-off-by: David Ruth <druth@chromium.org>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20240614190326.1349786-1-druth@chromium.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-05 09:08:23 +02:00
Pedro Tammela
6b02df925e net/sched: act_api: rely on rcu in tcf_idr_check_alloc
[ Upstream commit 4b55e86736d5b492cf689125da2600f59c7d2c39 ]

Instead of relying only on the idrinfo->lock mutex for
bind/alloc logic, rely on a combination of rcu + mutex + atomics
to better scale the case where multiple rtnl-less filters are
binding to the same action object.

Action binding happens when an action index is specified explicitly and
an action exists which such index exists. Example:
  tc actions add action drop index 1
  tc filter add ... matchall action drop index 1
  tc filter add ... matchall action drop index 1
  tc filter add ... matchall action drop index 1
  tc filter ls ...
     filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1
     not_in_hw
           action order 1: gact action drop
            random type none pass val 0
            index 1 ref 4 bind 3

   filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1
     not_in_hw
           action order 1: gact action drop
            random type none pass val 0
            index 1 ref 4 bind 3

   filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1
     not_in_hw
           action order 1: gact action drop
            random type none pass val 0
            index 1 ref 4 bind 3

When no index is specified, as before, grab the mutex and allocate
in the idr the next available id. In this version, as opposed to before,
it's simplified to store the -EBUSY pointer instead of the previous
alloc + replace combination.

When an index is specified, rely on rcu to find if there's an object in
such index. If there's none, fallback to the above, serializing on the
mutex and reserving the specified id. If there's one, it can be an -EBUSY
pointer, in which case we just try again until it's an action, or an action.
Given the rcu guarantees, the action found could be dead and therefore
we need to bump the refcount if it's not 0, handling the case it's
in fact 0.

As bind and the action refcount are already atomics, these increments can
happen without the mutex protection while many tcf_idr_check_alloc race
to bind to the same action instance.

In case binding encounters a parallel delete or add, it will return
-EAGAIN in order to try again. Both filter and action apis already
have the retry machinery in-place. In case it's an unlocked filter it
retries under the rtnl lock.

Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Link: https://lore.kernel.org/r/20231211181807.96028-2-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: d864319871b0 ("net/sched: act_api: fix possible infinite loop in tcf_idr_check_alloc()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-05 09:08:23 +02:00
Eric Dumazet
c6041e7124 net/sched: taprio: always validate TCA_TAPRIO_ATTR_PRIOMAP
[ Upstream commit f921a58ae20852d188f70842431ce6519c4fdc36 ]

If one TCA_TAPRIO_ATTR_PRIOMAP attribute has been provided,
taprio_parse_mqprio_opt() must validate it, or userspace
can inject arbitrary data to the kernel, the second time
taprio_change() is called.

First call (with valid attributes) sets dev->num_tc
to a non zero value.

Second call (with arbitrary mqprio attributes)
returns early from taprio_parse_mqprio_opt()
and bad things can happen.

Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule")
Reported-by: Noam Rathaus <noamr@ssd-disclosure.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20240604181511.769870-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-05 09:08:11 +02:00
Hangyu Hua
d5d9d24178 net: sched: sch_multiq: fix possible OOB write in multiq_tune()
[ Upstream commit affc18fdc694190ca7575b9a86632a73b9fe043d ]

q->bands will be assigned to qopt->bands to execute subsequent code logic
after kmalloc. So the old q->bands should not be used in kmalloc.
Otherwise, an out-of-bounds write will occur.

Fixes: c2999f7fb05b ("net: sched: multiq: don't call qdisc_put() while holding tree lock")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-05 09:08:11 +02:00
Eric Dumazet
f356eb2fb5 net/sched: act_skbmod: prevent kernel-infoleak
commit d313eb8b77557a6d5855f42d2234bd592c7b50dd upstream.

syzbot found that tcf_skbmod_dump() was copying four bytes
from kernel stack to user space [1].

The issue here is that 'struct tc_skbmod' has a four bytes hole.

We need to clear the structure before filling fields.

[1]
BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline]
 BUG: KMSAN: kernel-infoleak in copy_to_user_iter lib/iov_iter.c:24 [inline]
 BUG: KMSAN: kernel-infoleak in iterate_ubuf include/linux/iov_iter.h:29 [inline]
 BUG: KMSAN: kernel-infoleak in iterate_and_advance2 include/linux/iov_iter.h:245 [inline]
 BUG: KMSAN: kernel-infoleak in iterate_and_advance include/linux/iov_iter.h:271 [inline]
 BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x366/0x2520 lib/iov_iter.c:185
  instrument_copy_to_user include/linux/instrumented.h:114 [inline]
  copy_to_user_iter lib/iov_iter.c:24 [inline]
  iterate_ubuf include/linux/iov_iter.h:29 [inline]
  iterate_and_advance2 include/linux/iov_iter.h:245 [inline]
  iterate_and_advance include/linux/iov_iter.h:271 [inline]
  _copy_to_iter+0x366/0x2520 lib/iov_iter.c:185
  copy_to_iter include/linux/uio.h:196 [inline]
  simple_copy_to_iter net/core/datagram.c:532 [inline]
  __skb_datagram_iter+0x185/0x1000 net/core/datagram.c:420
  skb_copy_datagram_iter+0x5c/0x200 net/core/datagram.c:546
  skb_copy_datagram_msg include/linux/skbuff.h:4050 [inline]
  netlink_recvmsg+0x432/0x1610 net/netlink/af_netlink.c:1962
  sock_recvmsg_nosec net/socket.c:1046 [inline]
  sock_recvmsg+0x2c4/0x340 net/socket.c:1068
  __sys_recvfrom+0x35a/0x5f0 net/socket.c:2242
  __do_sys_recvfrom net/socket.c:2260 [inline]
  __se_sys_recvfrom net/socket.c:2256 [inline]
  __x64_sys_recvfrom+0x126/0x1d0 net/socket.c:2256
 do_syscall_64+0xd5/0x1f0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

Uninit was stored to memory at:
  pskb_expand_head+0x30f/0x19d0 net/core/skbuff.c:2253
  netlink_trim+0x2c2/0x330 net/netlink/af_netlink.c:1317
  netlink_unicast+0x9f/0x1260 net/netlink/af_netlink.c:1351
  nlmsg_unicast include/net/netlink.h:1144 [inline]
  nlmsg_notify+0x21d/0x2f0 net/netlink/af_netlink.c:2610
  rtnetlink_send+0x73/0x90 net/core/rtnetlink.c:741
  rtnetlink_maybe_send include/linux/rtnetlink.h:17 [inline]
  tcf_add_notify net/sched/act_api.c:2048 [inline]
  tcf_action_add net/sched/act_api.c:2071 [inline]
  tc_ctl_action+0x146e/0x19d0 net/sched/act_api.c:2119
  rtnetlink_rcv_msg+0x1737/0x1900 net/core/rtnetlink.c:6595
  netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2559
  rtnetlink_rcv+0x34/0x40 net/core/rtnetlink.c:6613
  netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
  netlink_unicast+0xf4c/0x1260 net/netlink/af_netlink.c:1361
  netlink_sendmsg+0x10df/0x11f0 net/netlink/af_netlink.c:1905
  sock_sendmsg_nosec net/socket.c:730 [inline]
  __sock_sendmsg+0x30f/0x380 net/socket.c:745
  ____sys_sendmsg+0x877/0xb60 net/socket.c:2584
  ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2638
  __sys_sendmsg net/socket.c:2667 [inline]
  __do_sys_sendmsg net/socket.c:2676 [inline]
  __se_sys_sendmsg net/socket.c:2674 [inline]
  __x64_sys_sendmsg+0x307/0x4a0 net/socket.c:2674
 do_syscall_64+0xd5/0x1f0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

Uninit was stored to memory at:
  __nla_put lib/nlattr.c:1041 [inline]
  nla_put+0x1c6/0x230 lib/nlattr.c:1099
  tcf_skbmod_dump+0x23f/0xc20 net/sched/act_skbmod.c:256
  tcf_action_dump_old net/sched/act_api.c:1191 [inline]
  tcf_action_dump_1+0x85e/0x970 net/sched/act_api.c:1227
  tcf_action_dump+0x1fd/0x460 net/sched/act_api.c:1251
  tca_get_fill+0x519/0x7a0 net/sched/act_api.c:1628
  tcf_add_notify_msg net/sched/act_api.c:2023 [inline]
  tcf_add_notify net/sched/act_api.c:2042 [inline]
  tcf_action_add net/sched/act_api.c:2071 [inline]
  tc_ctl_action+0x1365/0x19d0 net/sched/act_api.c:2119
  rtnetlink_rcv_msg+0x1737/0x1900 net/core/rtnetlink.c:6595
  netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2559
  rtnetlink_rcv+0x34/0x40 net/core/rtnetlink.c:6613
  netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
  netlink_unicast+0xf4c/0x1260 net/netlink/af_netlink.c:1361
  netlink_sendmsg+0x10df/0x11f0 net/netlink/af_netlink.c:1905
  sock_sendmsg_nosec net/socket.c:730 [inline]
  __sock_sendmsg+0x30f/0x380 net/socket.c:745
  ____sys_sendmsg+0x877/0xb60 net/socket.c:2584
  ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2638
  __sys_sendmsg net/socket.c:2667 [inline]
  __do_sys_sendmsg net/socket.c:2676 [inline]
  __se_sys_sendmsg net/socket.c:2674 [inline]
  __x64_sys_sendmsg+0x307/0x4a0 net/socket.c:2674
 do_syscall_64+0xd5/0x1f0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75

Local variable opt created at:
  tcf_skbmod_dump+0x9d/0xc20 net/sched/act_skbmod.c:244
  tcf_action_dump_old net/sched/act_api.c:1191 [inline]
  tcf_action_dump_1+0x85e/0x970 net/sched/act_api.c:1227

Bytes 188-191 of 248 are uninitialized
Memory access of size 248 starts at ffff888117697680
Data copied to user address 00007ffe56d855f0

Fixes: 86da71b57383 ("net_sched: Introduce skbmod action")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20240403130908.93421-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-13 12:51:34 +02:00
Jamal Hadi Salim
010dc505ea net/sched: Retire dsmark qdisc
commit bbe77c14ee6185a61ba6d5e435c1cbb489d2a9ed upstream.

The dsmark qdisc has served us well over the years for diffserv but has not
been getting much attention due to other more popular approaches to do diffserv
services. Most recently it has become a shooting target for syzkaller. For this
reason, we are retiring it.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-01 13:13:32 +01:00
Jamal Hadi Salim
40e8abb86d net/sched: Retire ATM qdisc
commit fb38306ceb9e770adfb5ffa6e3c64047b55f7a07 upstream.

The ATM qdisc has served us well over the years but has not been getting much
TLC due to lack of known users. Most recently it has become a shooting target
for syzkaller. For this reason, we are retiring it.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-01 13:13:32 +01:00
Jamal Hadi Salim
493685f3dd net/sched: Retire CBQ qdisc
commit 051d442098421c28c7951625652f61b1e15c4bd5 upstream.

While this amazing qdisc has served us well over the years it has not been
getting any tender love and care and has bitrotted over time.
It has become mostly a shooting target for syzkaller lately.
For this reason, we are retiring it. Goodbye CBQ - we loved you.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-01 13:13:32 +01:00
Hangyu Hua
b1719cbb73 net: sched: em_text: fix possible memory leak in em_text_destroy()
[ Upstream commit 8fcb0382af6f1ef50936f1be05b8149eb2f88496 ]

m->data needs to be freed when em_text_destroy is called.

Fixes: d675c989ed2d ("[PKT_SCHED]: Packet classification based on textsearch (ematch)")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-15 18:25:25 +01:00
Pedro Tammela
f34916502d net/sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curve
commit a13b67c9a015c4e21601ef9aa4ec9c5d972df1b4 upstream.

Christian Theune says:
   I upgraded from 6.1.38 to 6.1.55 this morning and it broke my traffic shaping script,
   leaving me with a non-functional uplink on a remote router.

A 'rt' curve cannot be used as a inner curve (parent class), but we were
allowing such configurations since the qdisc was introduced. Such
configurations would trigger a UAF as Budimir explains:
   The parent will have vttree_insert() called on it in init_vf(),
   but will not have vttree_remove() called on it in update_vf()
   because it does not have the HFSC_FSC flag set.

The qdisc always assumes that inner classes have the HFSC_FSC flag set.
This is by design as it doesn't make sense 'qdisc wise' for an 'rt'
curve to be an inner curve.

Budimir's original patch disallows users to add classes with a 'rt'
parent, but this is too strict as it breaks users that have been using
'rt' as a inner class. Another approach, taken by this patch, is to
upgrade the inner 'rt' into a 'sc', warning the user in the process.
It avoids the UAF reported by Budimir while also being more permissive
to bad scripts/users/code using 'rt' as a inner class.

Users checking the `tc class ls [...]` or `tc class get [...]` dumps would
observe the curve change and are potentially breaking with this change.

v1->v2: https://lore.kernel.org/all/20231013151057.2611860-1-pctammela@mojatatu.com/
- Correct 'Fixes' tag and merge with revert (Jakub)

Cc: Christian Theune <ct@flyingcircus.io>
Cc: Budimir Markovic <markovicbudimir@gmail.com>
Fixes: b3d26c5702c7 ("net/sched: sch_hfsc: Ensure inner classes have fsc curve")
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20231017143602.3191556-1-pctammela@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25 11:53:21 +02:00
Jamal Hadi Salim
42900fd140 net/sched: Retire rsvp classifier
commit 265b4da82dbf5df04bee5a5d46b7474b1aaf326a upstream.

The rsvp classifier has served us well for about a quarter of a century but has
has not been getting much maintenance attention due to lack of known users.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-23 11:00:07 +02:00
valis
a6d11571b9 net: sched: sch_qfq: Fix UAF in qfq_dequeue()
[ Upstream commit 8fc134fee27f2263988ae38920bc03da416b03d8 ]

When the plug qdisc is used as a class of the qfq qdisc it could trigger a
UAF. This issue can be reproduced with following commands:

  tc qdisc add dev lo root handle 1: qfq
  tc class add dev lo parent 1: classid 1:1 qfq weight 1 maxpkt 512
  tc qdisc add dev lo parent 1:1 handle 2: plug
  tc filter add dev lo parent 1: basic classid 1:1
  ping -c1 127.0.0.1

and boom:

[  285.353793] BUG: KASAN: slab-use-after-free in qfq_dequeue+0xa7/0x7f0
[  285.354910] Read of size 4 at addr ffff8880bad312a8 by task ping/144
[  285.355903]
[  285.356165] CPU: 1 PID: 144 Comm: ping Not tainted 6.5.0-rc3+ #4
[  285.357112] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[  285.358376] Call Trace:
[  285.358773]  <IRQ>
[  285.359109]  dump_stack_lvl+0x44/0x60
[  285.359708]  print_address_description.constprop.0+0x2c/0x3c0
[  285.360611]  kasan_report+0x10c/0x120
[  285.361195]  ? qfq_dequeue+0xa7/0x7f0
[  285.361780]  qfq_dequeue+0xa7/0x7f0
[  285.362342]  __qdisc_run+0xf1/0x970
[  285.362903]  net_tx_action+0x28e/0x460
[  285.363502]  __do_softirq+0x11b/0x3de
[  285.364097]  do_softirq.part.0+0x72/0x90
[  285.364721]  </IRQ>
[  285.365072]  <TASK>
[  285.365422]  __local_bh_enable_ip+0x77/0x90
[  285.366079]  __dev_queue_xmit+0x95f/0x1550
[  285.366732]  ? __pfx_csum_and_copy_from_iter+0x10/0x10
[  285.367526]  ? __pfx___dev_queue_xmit+0x10/0x10
[  285.368259]  ? __build_skb_around+0x129/0x190
[  285.368960]  ? ip_generic_getfrag+0x12c/0x170
[  285.369653]  ? __pfx_ip_generic_getfrag+0x10/0x10
[  285.370390]  ? csum_partial+0x8/0x20
[  285.370961]  ? raw_getfrag+0xe5/0x140
[  285.371559]  ip_finish_output2+0x539/0xa40
[  285.372222]  ? __pfx_ip_finish_output2+0x10/0x10
[  285.372954]  ip_output+0x113/0x1e0
[  285.373512]  ? __pfx_ip_output+0x10/0x10
[  285.374130]  ? icmp_out_count+0x49/0x60
[  285.374739]  ? __pfx_ip_finish_output+0x10/0x10
[  285.375457]  ip_push_pending_frames+0xf3/0x100
[  285.376173]  raw_sendmsg+0xef5/0x12d0
[  285.376760]  ? do_syscall_64+0x40/0x90
[  285.377359]  ? __static_call_text_end+0x136578/0x136578
[  285.378173]  ? do_syscall_64+0x40/0x90
[  285.378772]  ? kasan_enable_current+0x11/0x20
[  285.379469]  ? __pfx_raw_sendmsg+0x10/0x10
[  285.380137]  ? __sock_create+0x13e/0x270
[  285.380673]  ? __sys_socket+0xf3/0x180
[  285.381174]  ? __x64_sys_socket+0x3d/0x50
[  285.381725]  ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.382425]  ? __rcu_read_unlock+0x48/0x70
[  285.382975]  ? ip4_datagram_release_cb+0xd8/0x380
[  285.383608]  ? __pfx_ip4_datagram_release_cb+0x10/0x10
[  285.384295]  ? preempt_count_sub+0x14/0xc0
[  285.384844]  ? __list_del_entry_valid+0x76/0x140
[  285.385467]  ? _raw_spin_lock_bh+0x87/0xe0
[  285.386014]  ? __pfx__raw_spin_lock_bh+0x10/0x10
[  285.386645]  ? release_sock+0xa0/0xd0
[  285.387148]  ? preempt_count_sub+0x14/0xc0
[  285.387712]  ? freeze_secondary_cpus+0x348/0x3c0
[  285.388341]  ? aa_sk_perm+0x177/0x390
[  285.388856]  ? __pfx_aa_sk_perm+0x10/0x10
[  285.389441]  ? check_stack_object+0x22/0x70
[  285.390032]  ? inet_send_prepare+0x2f/0x120
[  285.390603]  ? __pfx_inet_sendmsg+0x10/0x10
[  285.391172]  sock_sendmsg+0xcc/0xe0
[  285.391667]  __sys_sendto+0x190/0x230
[  285.392168]  ? __pfx___sys_sendto+0x10/0x10
[  285.392727]  ? kvm_clock_get_cycles+0x14/0x30
[  285.393328]  ? set_normalized_timespec64+0x57/0x70
[  285.393980]  ? _raw_spin_unlock_irq+0x1b/0x40
[  285.394578]  ? __x64_sys_clock_gettime+0x11c/0x160
[  285.395225]  ? __pfx___x64_sys_clock_gettime+0x10/0x10
[  285.395908]  ? _copy_to_user+0x3e/0x60
[  285.396432]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.397086]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.397734]  ? do_syscall_64+0x71/0x90
[  285.398258]  __x64_sys_sendto+0x74/0x90
[  285.398786]  do_syscall_64+0x64/0x90
[  285.399273]  ? exit_to_user_mode_prepare+0x1a/0x120
[  285.399949]  ? syscall_exit_to_user_mode+0x22/0x50
[  285.400605]  ? do_syscall_64+0x71/0x90
[  285.401124]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.401807] RIP: 0033:0x495726
[  285.402233] Code: ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 09
[  285.404683] RSP: 002b:00007ffcc25fb618 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  285.405677] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 0000000000495726
[  285.406628] RDX: 0000000000000040 RSI: 0000000002518750 RDI: 0000000000000000
[  285.407565] RBP: 00000000005205ef R08: 00000000005f8838 R09: 000000000000001c
[  285.408523] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000002517634
[  285.409460] R13: 00007ffcc25fb6f0 R14: 0000000000000003 R15: 0000000000000000
[  285.410403]  </TASK>
[  285.410704]
[  285.410929] Allocated by task 144:
[  285.411402]  kasan_save_stack+0x1e/0x40
[  285.411926]  kasan_set_track+0x21/0x30
[  285.412442]  __kasan_slab_alloc+0x55/0x70
[  285.412973]  kmem_cache_alloc_node+0x187/0x3d0
[  285.413567]  __alloc_skb+0x1b4/0x230
[  285.414060]  __ip_append_data+0x17f7/0x1b60
[  285.414633]  ip_append_data+0x97/0xf0
[  285.415144]  raw_sendmsg+0x5a8/0x12d0
[  285.415640]  sock_sendmsg+0xcc/0xe0
[  285.416117]  __sys_sendto+0x190/0x230
[  285.416626]  __x64_sys_sendto+0x74/0x90
[  285.417145]  do_syscall_64+0x64/0x90
[  285.417624]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  285.418306]
[  285.418531] Freed by task 144:
[  285.418960]  kasan_save_stack+0x1e/0x40
[  285.419469]  kasan_set_track+0x21/0x30
[  285.419988]  kasan_save_free_info+0x27/0x40
[  285.420556]  ____kasan_slab_free+0x109/0x1a0
[  285.421146]  kmem_cache_free+0x1c2/0x450
[  285.421680]  __netif_receive_skb_core+0x2ce/0x1870
[  285.422333]  __netif_receive_skb_one_core+0x97/0x140
[  285.423003]  process_backlog+0x100/0x2f0
[  285.423537]  __napi_poll+0x5c/0x2d0
[  285.424023]  net_rx_action+0x2be/0x560
[  285.424510]  __do_softirq+0x11b/0x3de
[  285.425034]
[  285.425254] The buggy address belongs to the object at ffff8880bad31280
[  285.425254]  which belongs to the cache skbuff_head_cache of size 224
[  285.426993] The buggy address is located 40 bytes inside of
[  285.426993]  freed 224-byte region [ffff8880bad31280, ffff8880bad31360)
[  285.428572]
[  285.428798] The buggy address belongs to the physical page:
[  285.429540] page:00000000f4b77674 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbad31
[  285.430758] flags: 0x100000000000200(slab|node=0|zone=1)
[  285.431447] page_type: 0xffffffff()
[  285.431934] raw: 0100000000000200 ffff88810094a8c0 dead000000000122 0000000000000000
[  285.432757] raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
[  285.433562] page dumped because: kasan: bad access detected
[  285.434144]
[  285.434320] Memory state around the buggy address:
[  285.434828]  ffff8880bad31180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.435580]  ffff8880bad31200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.436264] >ffff8880bad31280: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  285.436777]                                   ^
[  285.437106]  ffff8880bad31300: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[  285.437616]  ffff8880bad31380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  285.438126] ==================================================================
[  285.438662] Disabling lock debugging due to kernel taint

Fix this by:
1. Changing sch_plug's .peek handler to qdisc_peek_dequeued(), a
function compatible with non-work-conserving qdiscs
2. Checking the return value of qdisc_dequeue_peeked() in sch_qfq.

Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
Reported-by: valis <sec@valis.email>
Signed-off-by: valis <sec@valis.email>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230901162237.11525-1-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23 10:59:59 +02:00
Budimir Markovic
da13749d5f net/sched: sch_hfsc: Ensure inner classes have fsc curve
[ Upstream commit b3d26c5702c7d6c45456326e56d2ccf3f103e60f ]

HFSC assumes that inner classes have an fsc curve, but it is currently
possible for classes without an fsc curve to become parents. This leads
to bugs including a use-after-free.

Don't allow non-root classes without HFSC_FSC to become parents.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Budimir Markovic <markovicbudimir@gmail.com>
Signed-off-by: Budimir Markovic <markovicbudimir@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230824084905.422-1-markovicbudimir@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23 10:59:43 +02:00
Jamal Hadi Salim
eebd074af2 net/sched: fix a qdisc modification with ambiguous command request
[ Upstream commit da71714e359b64bd7aab3bd56ec53f307f058133 ]

When replacing an existing root qdisc, with one that is of the same kind, the
request boils down to essentially a parameterization change  i.e not one that
requires allocation and grafting of a new qdisc. syzbot was able to create a
scenario which resulted in a taprio qdisc replacing an existing taprio qdisc
with a combination of NLM_F_CREATE, NLM_F_REPLACE and NLM_F_EXCL leading to
create and graft scenario.
The fix ensures that only when the qdisc kinds are different that we should
allow a create and graft, otherwise it goes into the "change" codepath.

While at it, fix the code and comments to improve readability.

While syzbot was able to create the issue, it did not zone on the root cause.
Analysis from Vladimir Oltean <vladimir.oltean@nxp.com> helped narrow it down.

v1->V2 changes:
- remove "inline" function definition (Vladmir)
- remove extrenous braces in branches (Vladmir)
- change inline function names (Pedro)
- Run tdc tests (Victor)
v2->v3 changes:
- dont break else/if (Simon)

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+a3618a167af2021433cd@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/20230816225759.g25x76kmgzya2gei@skbuf/T/
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-30 16:27:24 +02:00
Eric Dumazet
e7711f15ed sch_netem: fix issues in netem_change() vs get_dist_table()
commit 11b73313c12403f617b47752db0ab3deef201af7 upstream.

In blamed commit, I missed that get_dist_table() was allocating
memory using GFP_KERNEL, and acquiring qdisc lock to perform
the swap of newly allocated table with current one.

In this patch, get_dist_table() is allocating memory and
copy user data before we acquire the qdisc lock.

Then we perform swap operations while being protected by the lock.

Note that after this patch netem_change() no longer can do partial changes.
If an error is returned, qdisc conf is left unchanged.

Fixes: 2174a08db80d ("sch_netem: acquire qdisc lock in netem_change()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230622181503.2327695-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-16 18:19:25 +02:00
valis
1c8262f31f net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free
[ Upstream commit b80b829e9e2c1b3f7aae34855e04d8f6ecaf13c8 ]

When route4_change() is called on an existing filter, the whole
tcf_result struct is always copied into the new instance of the filter.

This causes a problem when updating a filter bound to a class,
as tcf_unbind_filter() is always called on the old instance in the
success path, decreasing filter_cnt of the still referenced class
and allowing it to be deleted, leading to a use-after-free.

Fix this by no longer copying the tcf_result struct from the old filter.

Fixes: 1109c00547fc ("net: sched: RCU cls_route")
Reported-by: valis <sec@valis.email>
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: valis <sec@valis.email>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg>
Link: https://lore.kernel.org/r/20230729123202.72406-4-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11 11:53:56 +02:00
valis
83e3d4b0ae net/sched: cls_fw: No longer copy tcf_result on update to avoid use-after-free
[ Upstream commit 76e42ae831991c828cffa8c37736ebfb831ad5ec ]

When fw_change() is called on an existing filter, the whole
tcf_result struct is always copied into the new instance of the filter.

This causes a problem when updating a filter bound to a class,
as tcf_unbind_filter() is always called on the old instance in the
success path, decreasing filter_cnt of the still referenced class
and allowing it to be deleted, leading to a use-after-free.

Fix this by no longer copying the tcf_result struct from the old filter.

Fixes: e35a8ee5993b ("net: sched: fw use RCU")
Reported-by: valis <sec@valis.email>
Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg>
Signed-off-by: valis <sec@valis.email>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg>
Link: https://lore.kernel.org/r/20230729123202.72406-3-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11 11:53:56 +02:00
valis
be785808db net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free
[ Upstream commit 3044b16e7c6fe5d24b1cdbcf1bd0a9d92d1ebd81 ]

When u32_change() is called on an existing filter, the whole
tcf_result struct is always copied into the new instance of the filter.

This causes a problem when updating a filter bound to a class,
as tcf_unbind_filter() is always called on the old instance in the
success path, decreasing filter_cnt of the still referenced class
and allowing it to be deleted, leading to a use-after-free.

Fix this by no longer copying the tcf_result struct from the old filter.

Fixes: de5df63228fc ("net: sched: cls_u32 changes to knode must appear atomic to readers")
Reported-by: valis <sec@valis.email>
Reported-by: M A Ramdhan <ramdhan@starlabs.sg>
Signed-off-by: valis <sec@valis.email>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg>
Link: https://lore.kernel.org/r/20230729123202.72406-2-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11 11:53:56 +02:00
Jamal Hadi Salim
42b2880807 net: sched: cls_u32: Fix match key mis-addressing
[ Upstream commit e68409db995380d1badacba41ff24996bd396171 ]

A match entry is uniquely identified with an "address" or "path" in the
form of: hashtable ID(12b):bucketid(8b):nodeid(12b).

When creating table match entries all of hash table id, bucket id and
node (match entry id) are needed to be either specified by the user or
reasonable in-kernel defaults are used. The in-kernel default for a table id is
0x800(omnipresent root table); for bucketid it is 0x0. Prior to this fix there
was none for a nodeid i.e. the code assumed that the user passed the correct
nodeid and if the user passes a nodeid of 0 (as Mingi Cho did) then that is what
was used. But nodeid of 0 is reserved for identifying the table. This is not
a problem until we dump. The dump code notices that the nodeid is zero and
assumes it is referencing a table and therefore references table struct
tc_u_hnode instead of what was created i.e match entry struct tc_u_knode.

Ming does an equivalent of:
tc filter add dev dummy0 parent 10: prio 1 handle 0x1000 \
protocol ip u32 match ip src 10.0.0.1/32 classid 10:1 action ok

Essentially specifying a table id 0, bucketid 1 and nodeid of zero
Tableid 0 is remapped to the default of 0x800.
Bucketid 1 is ignored and defaults to 0x00.
Nodeid was assumed to be what Ming passed - 0x000

dumping before fix shows:
~$ tc filter ls dev dummy0 parent 10:
filter protocol ip pref 1 u32 chain 0
filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1
filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor -30591

Note that the last line reports a table instead of a match entry
(you can tell this because it says "ht divisor...").
As a result of reporting the wrong data type (misinterpretting of struct
tc_u_knode as being struct tc_u_hnode) the divisor is reported with value
of -30591. Ming identified this as part of the heap address
(physmap_base is 0xffff8880 (-30591 - 1)).

The fix is to ensure that when table entry matches are added and no
nodeid is specified (i.e nodeid == 0) then we get the next available
nodeid from the table's pool.

After the fix, this is what the dump shows:
$ tc filter ls dev dummy0 parent 10:
filter protocol ip pref 1 u32 chain 0
filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1
filter protocol ip pref 1 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 flowid 10:1 not_in_hw
  match 0a000001/ffffffff at 12
	action order 1: gact action pass
	 random type none pass val 0
	 index 1 ref 1 bind 1

Reported-by: Mingi Cho <mgcho.minic@gmail.com>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230726135151.416917-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11 11:53:55 +02:00
Pedro Tammela
cf8ecd6ea6 net/sched: sch_qfq: account for stab overhead in qfq_enqueue
commit 3e337087c3b5805fe0b8a46ba622a962880b5d64 upstream.

Lion says:
-------
In the QFQ scheduler a similar issue to CVE-2023-31436
persists.

Consider the following code in net/sched/sch_qfq.c:

static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                struct sk_buff **to_free)
{
     unsigned int len = qdisc_pkt_len(skb), gso_segs;

    // ...

     if (unlikely(cl->agg->lmax < len)) {
         pr_debug("qfq: increasing maxpkt from %u to %u for class %u",
              cl->agg->lmax, len, cl->common.classid);
         err = qfq_change_agg(sch, cl, cl->agg->class_weight, len);
         if (err) {
             cl->qstats.drops++;
             return qdisc_drop(skb, sch, to_free);
         }

    // ...

     }

Similarly to CVE-2023-31436, "lmax" is increased without any bounds
checks according to the packet length "len". Usually this would not
impose a problem because packet sizes are naturally limited.

This is however not the actual packet length, rather the
"qdisc_pkt_len(skb)" which might apply size transformations according to
"struct qdisc_size_table" as created by "qdisc_get_stab()" in
net/sched/sch_api.c if the TCA_STAB option was set when modifying the qdisc.

A user may choose virtually any size using such a table.

As a result the same issue as in CVE-2023-31436 can occur, allowing heap
out-of-bounds read / writes in the kmalloc-8192 cache.
-------

We can create the issue with the following commands:

tc qdisc add dev $DEV root handle 1: stab mtu 2048 tsize 512 mpu 0 \
overhead 999999999 linklayer ethernet qfq
tc class add dev $DEV parent 1: classid 1:1 htb rate 6mbit burst 15k
tc filter add dev $DEV parent 1: matchall classid 1:1
ping -I $DEV 1.1.1.2

This is caused by incorrectly assuming that qdisc_pkt_len() returns a
length within the QFQ_MIN_LMAX < len < QFQ_MAX_LMAX.

Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
Reported-by: Lion <nnamrec@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11 11:53:54 +02:00
Lin Ma
07d9723cef net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64
[ Upstream commit 6c58c8816abb7b93b21fa3b1d0c1726402e5e568 ]

The nla_for_each_nested parsing in function mqprio_parse_nlattr() does
not check the length of the nested attribute. This can lead to an
out-of-attribute read and allow a malformed nlattr (e.g., length 0) to
be viewed as 8 byte integer and passed to priv->max_rate/min_rate.

This patch adds the check based on nla_len() when check the nla_type(),
which ensures that the length of these two attribute must equals
sizeof(u64).

Fixes: 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and shaper in mqprio")
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/20230725024227.426561-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11 11:53:48 +02:00
Vladimir Oltean
2eb6175294 net/sched: mqprio: add extack to mqprio_parse_nlattr()
[ Upstream commit 57f21bf85400abadac0cb2a4db5de1d663f8863f ]

Netlink attribute parsing in mqprio is a minesweeper game, with many
options having the possibility of being passed incorrectly and the user
being none the wiser.

Try to make errors less sour by giving user space some information
regarding what went wrong.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 6c58c8816abb ("net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11 11:53:48 +02:00
Vladimir Oltean
29c5eb0ffa net/sched: mqprio: refactor nlattr parsing to a separate function
[ Upstream commit feb2cf3dcfb930aec2ca65c66d1365543d5ba943 ]

mqprio_init() is quite large and unwieldy to add more code to.
Split the netlink attribute parsing to a dedicated function.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 6c58c8816abb ("net/sched: mqprio: Add length check for TCA_MQPRIO_{MAX/MIN}_RATE64")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-11 11:53:48 +02:00
Ido Schimmel
96391959a9 net/sched: flower: Ensure both minimum and maximum ports are specified
[ Upstream commit d3f87278bcb80bd7f9519669d928b43320363d4f ]

The kernel does not currently validate that both the minimum and maximum
ports of a port range are specified. This can lead user space to think
that a filter matching on a port range was successfully added, when in
fact it was not. For example, with a patched (buggy) iproute2 that only
sends the minimum port, the following commands do not return an error:

 # tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp src_port 100-200 action pass

 # tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp dst_port 100-200 action pass

 # tc filter show dev swp1 ingress
 filter protocol ip pref 1 flower chain 0
 filter protocol ip pref 1 flower chain 0 handle 0x1
   eth_type ipv4
   ip_proto udp
   not_in_hw
         action order 1: gact action pass
          random type none pass val 0
          index 1 ref 1 bind 1

 filter protocol ip pref 1 flower chain 0 handle 0x2
   eth_type ipv4
   ip_proto udp
   not_in_hw
         action order 1: gact action pass
          random type none pass val 0
          index 2 ref 1 bind 1

Fix by returning an error unless both ports are specified:

 # tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp src_port 100-200 action pass
 Error: Both min and max source ports must be specified.
 We have an error talking to the kernel

 # tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp dst_port 100-200 action pass
 Error: Both min and max destination ports must be specified.
 We have an error talking to the kernel

Fixes: 5c72299fba9d ("net: sched: cls_flower: Classify packets using port ranges")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-27 08:37:33 +02:00
Guillaume Nault
166fa538e0 cls_flower: Add extack support for src and dst port range options
[ Upstream commit bd7d4c12819b60b161939bc2f43053955d24d0df ]

Pass extack down to fl_set_key_port_range() and set message on error.

Both the min and max ports would qualify as invalid attributes here.
Report the min one as invalid, as it's probably what makes the most
sense from a user point of view.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: d3f87278bcb8 ("net/sched: flower: Ensure both minimum and maximum ports are specified")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-27 08:37:33 +02:00
M A Ramdhan
808211a8d4 net/sched: cls_fw: Fix improper refcount update leads to use-after-free
[ Upstream commit 0323bce598eea038714f941ce2b22541c46d488f ]

In the event of a failure in tcf_change_indev(), fw_set_parms() will
immediately return an error after incrementing or decrementing
reference counter in tcf_bind_filter().  If attacker can control
reference counter to zero and make reference freed, leading to
use after free.

In order to prevent this, move the point of possible failure above the
point where the TC_FW_CLASSID is handled.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: M A Ramdhan <ramdhan@starlabs.sg>
Signed-off-by: M A Ramdhan <ramdhan@starlabs.sg>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Message-ID: <20230705161530.52003-1-ramdhan@starlabs.sg>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-27 08:37:30 +02:00
Lin Ma
67a67e2584 net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX
[ Upstream commit 30c45b5361d39b4b793780ffac5538090b9e2eb1 ]

The attribute TCA_PEDIT_PARMS_EX is not be included in pedit_policy and
one malicious user could fake a TCA_PEDIT_PARMS_EX whose length is
smaller than the intended sizeof(struct tc_pedit). Hence, the
dereference in tcf_pedit_init() could access dirty heap data.

static int tcf_pedit_init(...)
{
  // ...
  pattr = tb[TCA_PEDIT_PARMS]; // TCA_PEDIT_PARMS is included
  if (!pattr)
    pattr = tb[TCA_PEDIT_PARMS_EX]; // but this is not

  // ...
  parm = nla_data(pattr);

  index = parm->index; // parm is able to be smaller than 4 bytes
                       // and this dereference gets dirty skb_buff
                       // data created in netlink_sendmsg
}

This commit adds TCA_PEDIT_PARMS_EX length in pedit_policy which avoid
the above case, just like the TCA_PEDIT_PARMS.

Fixes: 71d0ed7079df ("net/act_pedit: Support using offset relative to the conventional network headers")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Link: https://lore.kernel.org/r/20230703110842.590282-1-linma@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-27 08:37:24 +02:00
Eric Dumazet
321a81d26c sch_netem: acquire qdisc lock in netem_change()
[ Upstream commit 2174a08db80d1efeea382e25ac41c4e7511eb6d6 ]

syzbot managed to trigger a divide error [1] in netem.

It could happen if q->rate changes while netem_enqueue()
is running, since q->rate is read twice.

It turns out netem_change() always lacked proper synchronization.

[1]
divide error: 0000 [#1] SMP KASAN
CPU: 1 PID: 7867 Comm: syz-executor.1 Not tainted 6.1.30-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/25/2023
RIP: 0010:div64_u64 include/linux/math64.h:69 [inline]
RIP: 0010:packet_time_ns net/sched/sch_netem.c:357 [inline]
RIP: 0010:netem_enqueue+0x2067/0x36d0 net/sched/sch_netem.c:576
Code: 89 e2 48 69 da 00 ca 9a 3b 42 80 3c 28 00 4c 8b a4 24 88 00 00 00 74 0d 4c 89 e7 e8 c3 4f 3b fd 48 8b 4c 24 18 48 89 d8 31 d2 <49> f7 34 24 49 01 c7 4c 8b 64 24 48 4d 01 f7 4c 89 e3 48 c1 eb 03
RSP: 0018:ffffc9000dccea60 EFLAGS: 00010246
RAX: 000001a442624200 RBX: 000001a442624200 RCX: ffff888108a4f000
RDX: 0000000000000000 RSI: 000000000000070d RDI: 000000000000070d
RBP: ffffc9000dcceb90 R08: ffffffff849c5e26 R09: fffffbfff10e1297
R10: 0000000000000000 R11: dffffc0000000001 R12: ffff888108a4f358
R13: dffffc0000000000 R14: 0000001a8cd9a7ec R15: 0000000000000000
FS: 00007fa73fe18700(0000) GS:ffff8881f6b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa73fdf7718 CR3: 000000011d36e000 CR4: 0000000000350ee0
Call Trace:
<TASK>
[<ffffffff84714385>] __dev_xmit_skb net/core/dev.c:3931 [inline]
[<ffffffff84714385>] __dev_queue_xmit+0xcf5/0x3370 net/core/dev.c:4290
[<ffffffff84d22df2>] dev_queue_xmit include/linux/netdevice.h:3030 [inline]
[<ffffffff84d22df2>] neigh_hh_output include/net/neighbour.h:531 [inline]
[<ffffffff84d22df2>] neigh_output include/net/neighbour.h:545 [inline]
[<ffffffff84d22df2>] ip_finish_output2+0xb92/0x10d0 net/ipv4/ip_output.c:235
[<ffffffff84d21e63>] __ip_finish_output+0xc3/0x2b0
[<ffffffff84d10a81>] ip_finish_output+0x31/0x2a0 net/ipv4/ip_output.c:323
[<ffffffff84d10f14>] NF_HOOK_COND include/linux/netfilter.h:298 [inline]
[<ffffffff84d10f14>] ip_output+0x224/0x2a0 net/ipv4/ip_output.c:437
[<ffffffff84d123b5>] dst_output include/net/dst.h:444 [inline]
[<ffffffff84d123b5>] ip_local_out net/ipv4/ip_output.c:127 [inline]
[<ffffffff84d123b5>] __ip_queue_xmit+0x1425/0x2000 net/ipv4/ip_output.c:542
[<ffffffff84d12fdc>] ip_queue_xmit+0x4c/0x70 net/ipv4/ip_output.c:556

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230620184425.1179809-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-28 10:18:40 +02:00
Vlad Buslov
ec694ad393 net/sched: cls_api: Fix lockup on flushing explicitly created chain
[ Upstream commit c9a82bec02c339cdda99b37c5e62b3b71fc4209c ]

Mingshuai Ren reports:

When a new chain is added by using tc, one soft lockup alarm will be
 generated after delete the prio 0 filter of the chain. To reproduce
 the problem, perform the following steps:
(1) tc qdisc add dev eth0 root handle 1: htb default 1
(2) tc chain add dev eth0
(3) tc filter del dev eth0 chain 0 parent 1: prio 0
(4) tc filter add dev eth0 chain 0 parent 1:

Fix the issue by accounting for additional reference to chains that are
explicitly created by RTM_NEWCHAIN message as opposed to implicitly by
RTM_NEWTFILTER message.

Fixes: 726d061286ce ("net: sched: prevent insertion of new classifiers during chain flush")
Reported-by: Mingshuai Ren <renmingshuai@huawei.com>
Closes: https://lore.kernel.org/lkml/87legswvi3.fsf@nvidia.com/T/
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Link: https://lore.kernel.org/r/20230612093426.2867183-1-vladbu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-21 15:44:12 +02:00
Lee Jones
46305daf80 net/sched: cls_u32: Fix reference counter leak leading to overflow
[ Upstream commit 04c55383fa5689357bcdd2c8036725a55ed632bc ]

In the event of a failure in tcf_change_indev(), u32_set_parms() will
immediately return without decrementing the recently incremented
reference counter.  If this happens enough times, the counter will
rollover and the reference freed, leading to a double free which can be
used to do 'bad things'.

In order to prevent this, move the point of possible failure above the
point where the reference counter is incremented.  Also save any
meaningful return values to be applied to the return data at the
appropriate point in time.

This issue was caught with KASAN.

Fixes: 705c7091262d ("net: sched: cls_u32: no need to call tcf_exts_change for newly allocated struct")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Lee Jones <lee@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-21 15:44:11 +02:00
Hangyu Hua
198da74a4e net: sched: fix possible refcount leak in tc_chain_tmplt_add()
[ Upstream commit 44f8baaf230c655c249467ca415b570deca8df77 ]

try_module_get will be called in tcf_proto_lookup_ops. So module_put needs
to be called to drop the refcount if ops don't implement the required
function.

Fixes: 9f407f1768d3 ("net: sched: introduce chain templates")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-14 10:59:56 +02:00
Eric Dumazet
8f7cbd6d5e net: sched: move rtm_tca_policy declaration to include file
[ Upstream commit 886bc7d6ed3357975c5f1d3c784da96000d4bbb4 ]

rtm_tca_policy is used from net/sched/sch_api.c and net/sched/cls_api.c,
thus should be declared in an include file.

This fixes the following sparse warning:
net/sched/sch_api.c:1434:25: warning: symbol 'rtm_tca_policy' was not declared. Should it be static?

Fixes: e331473fee3d ("net/sched: cls_api: add missing validation of netlink attributes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-14 10:59:56 +02:00
Kees Cook
0638dcc7e7 treewide: Remove uninitialized_var() usage
commit 3f649ab728cda8038259d8f14492fe400fbab911 upstream.

Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
	xargs perl -pi -e \
		's/\buninitialized_var\(([^\)]+)\)/\1/g;
		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-09 10:29:01 +02:00
Arnd Bergmann
f0bb513555 flow_dissector: work around stack frame size warning
commit 0af413bd3e2de73bcf0742ed556be4af83c71964 upstream.

The fl_flow_key structure is around 500 bytes, so having two of them
on the stack in one function now exceeds the warning limit after an
otherwise correct change:

net/sched/cls_flower.c:298:12: error: stack frame size of 1056 bytes in function 'fl_classify' [-Werror,-Wframe-larger-than=]

I suspect the fl_classify function could be reworked to only have one
of them on the stack and modify it in place, but I could not work out
how to do that.

As a somewhat hacky workaround, move one of them into an out-of-line
function to reduce its scope. This does not necessarily reduce the stack
usage of the outer function, but at least the second copy is removed
from the stack during most of it and does not add up to whatever is
called from there.

I now see 552 bytes of stack usage for fl_classify(), plus 528 bytes
for fl_mask_lookup().

Fixes: 58cff782cc55 ("flow_dissector: Parse multiple MPLS Label Stack Entries")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-09 10:29:00 +02:00
Hangyu Hua
94a00f1142 net/sched: flower: fix possible OOB write in fl_set_geneve_opt()
[ Upstream commit 4d56304e5827c8cc8cc18c75343d283af7c4825c ]

If we send two TCA_FLOWER_KEY_ENC_OPTS_GENEVE packets and their total
size is 252 bytes(key->enc_opts.len = 252) then
key->enc_opts.len = opt->length = data_len / 4 = 0 when the third
TCA_FLOWER_KEY_ENC_OPTS_GENEVE packet enters fl_set_geneve_opt. This
bypasses the next bounds check and results in an out-of-bounds.

Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Link: https://lore.kernel.org/r/20230531102805.27090-1-hbh25y@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-09 10:28:54 +02:00
Zhengchao Shao
9e6bb63e5e net: sched: fix NULL pointer dereference in mq_attach
[ Upstream commit 36eec020fab668719b541f34d97f44e232ffa165 ]

When use the following command to test:
1)ip link add bond0 type bond
2)ip link set bond0 up
3)tc qdisc add dev bond0 root handle ffff: mq
4)tc qdisc replace dev bond0 parent ffff:fff1 handle ffff: mq

The kernel reports NULL pointer dereference issue. The stack information
is as follows:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Internal error: Oops: 0000000096000006 [#1] SMP
Modules linked in:
pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : mq_attach+0x44/0xa0
lr : qdisc_graft+0x20c/0x5cc
sp : ffff80000e2236a0
x29: ffff80000e2236a0 x28: ffff0000c0e59d80 x27: ffff0000c0be19c0
x26: ffff0000cae3e800 x25: 0000000000000010 x24: 00000000fffffff1
x23: 0000000000000000 x22: ffff0000cae3e800 x21: ffff0000c9df4000
x20: ffff0000c9df4000 x19: 0000000000000000 x18: ffff80000a934000
x17: ffff8000f5b56000 x16: ffff80000bb08000 x15: 0000000000000000
x14: 0000000000000000 x13: 6b6b6b6b6b6b6b6b x12: 6b6b6b6b00000001
x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
x8 : ffff0000c0be0730 x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008
x5 : ffff0000cae3e864 x4 : 0000000000000000 x3 : 0000000000000001
x2 : 0000000000000001 x1 : ffff8000090bc23c x0 : 0000000000000000
Call trace:
mq_attach+0x44/0xa0
qdisc_graft+0x20c/0x5cc
tc_modify_qdisc+0x1c4/0x664
rtnetlink_rcv_msg+0x354/0x440
netlink_rcv_skb+0x64/0x144
rtnetlink_rcv+0x28/0x34
netlink_unicast+0x1e8/0x2a4
netlink_sendmsg+0x308/0x4a0
sock_sendmsg+0x64/0xac
____sys_sendmsg+0x29c/0x358
___sys_sendmsg+0x90/0xd0
__sys_sendmsg+0x7c/0xd0
__arm64_sys_sendmsg+0x2c/0x38
invoke_syscall+0x54/0x114
el0_svc_common.constprop.1+0x90/0x174
do_el0_svc+0x3c/0xb0
el0_svc+0x24/0xec
el0t_64_sync_handler+0x90/0xb4
el0t_64_sync+0x174/0x178

This is because when mq is added for the first time, qdiscs in mq is set
to NULL in mq_attach(). Therefore, when replacing mq after adding mq, we
need to initialize qdiscs in the mq before continuing to graft. Otherwise,
it will couse NULL pointer dereference issue in mq_attach(). And the same
issue will occur in the attach functions of mqprio, taprio and htb.
ffff:fff1 means that the repalce qdisc is ingress. Ingress does not allow
any qdisc to be attached. Therefore, ffff:fff1 is incorrectly used, and
the command should be dropped.

Fixes: 6ec1c69a8f64 ("net_sched: add classful multiqueue dummy scheduler")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Tested-by: Peilin Ye <peilin.ye@bytedance.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20230527093747.3583502-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-09 10:28:54 +02:00
Peilin Ye
2188c0f095 net/sched: Prohibit regrafting ingress or clsact Qdiscs
[ Upstream commit 9de95df5d15baa956c2b70b9e794842e790a8a13 ]

Currently, after creating an ingress (or clsact) Qdisc and grafting it
under TC_H_INGRESS (TC_H_CLSACT), it is possible to graft it again under
e.g. a TBF Qdisc:

  $ ip link add ifb0 type ifb
  $ tc qdisc add dev ifb0 handle 1: root tbf rate 20kbit buffer 1600 limit 3000
  $ tc qdisc add dev ifb0 clsact
  $ tc qdisc link dev ifb0 handle ffff: parent 1:1
  $ tc qdisc show dev ifb0
  qdisc tbf 1: root refcnt 2 rate 20Kbit burst 1600b lat 560.0ms
  qdisc clsact ffff: parent ffff:fff1 refcnt 2
                                      ^^^^^^^^

clsact's refcount has increased: it is now grafted under both
TC_H_CLSACT and 1:1.

ingress and clsact Qdiscs should only be used under TC_H_INGRESS
(TC_H_CLSACT).  Prohibit regrafting them.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-09 10:28:54 +02:00
Peilin Ye
80b20d528a net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs
[ Upstream commit f85fa45d4a9408d98c46c8fa45ba2e3b2f4bf219 ]

Currently it is possible to add e.g. an HTB Qdisc under ffff:fff1
(TC_H_INGRESS, TC_H_CLSACT):

  $ ip link add name ifb0 type ifb
  $ tc qdisc add dev ifb0 parent ffff:fff1 htb
  $ tc qdisc add dev ifb0 clsact
  Error: Exclusivity flag on, cannot modify.
  $ drgn
  ...
  >>> ifb0 = netdev_get_by_name(prog, "ifb0")
  >>> qdisc = ifb0.ingress_queue.qdisc_sleeping
  >>> print(qdisc.ops.id.string_().decode())
  htb
  >>> qdisc.flags.value_() # TCQ_F_INGRESS
  2

Only allow ingress and clsact Qdiscs under ffff:fff1.  Return -EINVAL
for everything else.  Make TCQ_F_INGRESS a static flag of ingress and
clsact Qdiscs.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-09 10:28:54 +02:00
Peilin Ye
321f383755 net/sched: sch_clsact: Only create under TC_H_CLSACT
[ Upstream commit 5eeebfe6c493192b10d516abfd72742900f2a162 ]

clsact Qdiscs are only supposed to be created under TC_H_CLSACT (which
equals TC_H_INGRESS).  Return -EOPNOTSUPP if 'parent' is not
TC_H_CLSACT.

Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-09 10:28:53 +02:00
Peilin Ye
5f67d33c01 net/sched: sch_ingress: Only create under TC_H_INGRESS
[ Upstream commit c7cfbd115001f94de9e4053657946a383147e803 ]

ingress Qdiscs are only supposed to be created under TC_H_INGRESS.
Return -EOPNOTSUPP if 'parent' is not TC_H_INGRESS, similar to
mq_init().

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-09 10:28:53 +02:00
Victor Nogueira
dfa36eb380 net/sched: act_mirred: Add carrier check
[ Upstream commit 526f28bd0fbdc699cda31426928802650c1528e5 ]

There are cases where the device is adminstratively UP, but operationally
down. For example, we have a physical device (Nvidia ConnectX-6 Dx, 25Gbps)
who's cable was pulled out, here is its ip link output:

5: ens2f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether b8:ce:f6:4b:68:35 brd ff:ff:ff:ff:ff:ff
    altname enp179s0f1np1

As you can see, it's administratively UP but operationally down.
In this case, sending a packet to this port caused a nasty kernel hang (so
nasty that we were unable to capture it). Aborting a transmit based on
operational status (in addition to administrative status) fixes the issue.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
v1->v2: Add fixes tag
v2->v3: Remove blank line between tags + add change log, suggested by Leon
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-17 11:35:59 +02:00
Vlad Buslov
cc5fe387c6 net/sched: cls_api: remove block_cb from driver_list before freeing
[ Upstream commit da94a7781fc3c92e7df7832bc2746f4d39bc624e ]

Error handler of tcf_block_bind() frees the whole bo->cb_list on error.
However, by that time the flow_block_cb instances are already in the driver
list because driver ndo_setup_tc() callback is called before that up the
call chain in tcf_block_offload_cmd(). This leaves dangling pointers to
freed objects in the list and causes use-after-free[0]. Fix it by also
removing flow_block_cb instances from driver_list before deallocating them.

[0]:
[  279.868433] ==================================================================
[  279.869964] BUG: KASAN: slab-use-after-free in flow_block_cb_setup_simple+0x631/0x7c0
[  279.871527] Read of size 8 at addr ffff888147e2bf20 by task tc/2963

[  279.873151] CPU: 6 PID: 2963 Comm: tc Not tainted 6.3.0-rc6+ #4
[  279.874273] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  279.876295] Call Trace:
[  279.876882]  <TASK>
[  279.877413]  dump_stack_lvl+0x33/0x50
[  279.878198]  print_report+0xc2/0x610
[  279.878987]  ? flow_block_cb_setup_simple+0x631/0x7c0
[  279.879994]  kasan_report+0xae/0xe0
[  279.880750]  ? flow_block_cb_setup_simple+0x631/0x7c0
[  279.881744]  ? mlx5e_tc_reoffload_flows_work+0x240/0x240 [mlx5_core]
[  279.883047]  flow_block_cb_setup_simple+0x631/0x7c0
[  279.884027]  tcf_block_offload_cmd.isra.0+0x189/0x2d0
[  279.885037]  ? tcf_block_setup+0x6b0/0x6b0
[  279.885901]  ? mutex_lock+0x7d/0xd0
[  279.886669]  ? __mutex_unlock_slowpath.constprop.0+0x2d0/0x2d0
[  279.887844]  ? ingress_init+0x1c0/0x1c0 [sch_ingress]
[  279.888846]  tcf_block_get_ext+0x61c/0x1200
[  279.889711]  ingress_init+0x112/0x1c0 [sch_ingress]
[  279.890682]  ? clsact_init+0x2b0/0x2b0 [sch_ingress]
[  279.891701]  qdisc_create+0x401/0xea0
[  279.892485]  ? qdisc_tree_reduce_backlog+0x470/0x470
[  279.893473]  tc_modify_qdisc+0x6f7/0x16d0
[  279.894344]  ? tc_get_qdisc+0xac0/0xac0
[  279.895213]  ? mutex_lock+0x7d/0xd0
[  279.896005]  ? __mutex_lock_slowpath+0x10/0x10
[  279.896910]  rtnetlink_rcv_msg+0x5fe/0x9d0
[  279.897770]  ? rtnl_calcit.isra.0+0x2b0/0x2b0
[  279.898672]  ? __sys_sendmsg+0xb5/0x140
[  279.899494]  ? do_syscall_64+0x3d/0x90
[  279.900302]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  279.901337]  ? kasan_save_stack+0x2e/0x40
[  279.902177]  ? kasan_save_stack+0x1e/0x40
[  279.903058]  ? kasan_set_track+0x21/0x30
[  279.903913]  ? kasan_save_free_info+0x2a/0x40
[  279.904836]  ? ____kasan_slab_free+0x11a/0x1b0
[  279.905741]  ? kmem_cache_free+0x179/0x400
[  279.906599]  netlink_rcv_skb+0x12c/0x360
[  279.907450]  ? rtnl_calcit.isra.0+0x2b0/0x2b0
[  279.908360]  ? netlink_ack+0x1550/0x1550
[  279.909192]  ? rhashtable_walk_peek+0x170/0x170
[  279.910135]  ? kmem_cache_alloc_node+0x1af/0x390
[  279.911086]  ? _copy_from_iter+0x3d6/0xc70
[  279.912031]  netlink_unicast+0x553/0x790
[  279.912864]  ? netlink_attachskb+0x6a0/0x6a0
[  279.913763]  ? netlink_recvmsg+0x416/0xb50
[  279.914627]  netlink_sendmsg+0x7a1/0xcb0
[  279.915473]  ? netlink_unicast+0x790/0x790
[  279.916334]  ? iovec_from_user.part.0+0x4d/0x220
[  279.917293]  ? netlink_unicast+0x790/0x790
[  279.918159]  sock_sendmsg+0xc5/0x190
[  279.918938]  ____sys_sendmsg+0x535/0x6b0
[  279.919813]  ? import_iovec+0x7/0x10
[  279.920601]  ? kernel_sendmsg+0x30/0x30
[  279.921423]  ? __copy_msghdr+0x3c0/0x3c0
[  279.922254]  ? import_iovec+0x7/0x10
[  279.923041]  ___sys_sendmsg+0xeb/0x170
[  279.923854]  ? copy_msghdr_from_user+0x110/0x110
[  279.924797]  ? ___sys_recvmsg+0xd9/0x130
[  279.925630]  ? __perf_event_task_sched_in+0x183/0x470
[  279.926656]  ? ___sys_sendmsg+0x170/0x170
[  279.927529]  ? ctx_sched_in+0x530/0x530
[  279.928369]  ? update_curr+0x283/0x4f0
[  279.929185]  ? perf_event_update_userpage+0x570/0x570
[  279.930201]  ? __fget_light+0x57/0x520
[  279.931023]  ? __switch_to+0x53d/0xe70
[  279.931846]  ? sockfd_lookup_light+0x1a/0x140
[  279.932761]  __sys_sendmsg+0xb5/0x140
[  279.933560]  ? __sys_sendmsg_sock+0x20/0x20
[  279.934436]  ? fpregs_assert_state_consistent+0x1d/0xa0
[  279.935490]  do_syscall_64+0x3d/0x90
[  279.936300]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  279.937311] RIP: 0033:0x7f21c814f887
[  279.938085] Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[  279.941448] RSP: 002b:00007fff11efd478 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  279.942964] RAX: ffffffffffffffda RBX: 0000000064401979 RCX: 00007f21c814f887
[  279.944337] RDX: 0000000000000000 RSI: 00007fff11efd4e0 RDI: 0000000000000003
[  279.945660] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[  279.947003] R10: 00007f21c8008708 R11: 0000000000000246 R12: 0000000000000001
[  279.948345] R13: 0000000000409980 R14: 000000000047e538 R15: 0000000000485400
[  279.949690]  </TASK>

[  279.950706] Allocated by task 2960:
[  279.951471]  kasan_save_stack+0x1e/0x40
[  279.952338]  kasan_set_track+0x21/0x30
[  279.953165]  __kasan_kmalloc+0x77/0x90
[  279.954006]  flow_block_cb_setup_simple+0x3dd/0x7c0
[  279.955001]  tcf_block_offload_cmd.isra.0+0x189/0x2d0
[  279.956020]  tcf_block_get_ext+0x61c/0x1200
[  279.956881]  ingress_init+0x112/0x1c0 [sch_ingress]
[  279.957873]  qdisc_create+0x401/0xea0
[  279.958656]  tc_modify_qdisc+0x6f7/0x16d0
[  279.959506]  rtnetlink_rcv_msg+0x5fe/0x9d0
[  279.960392]  netlink_rcv_skb+0x12c/0x360
[  279.961216]  netlink_unicast+0x553/0x790
[  279.962044]  netlink_sendmsg+0x7a1/0xcb0
[  279.962906]  sock_sendmsg+0xc5/0x190
[  279.963702]  ____sys_sendmsg+0x535/0x6b0
[  279.964534]  ___sys_sendmsg+0xeb/0x170
[  279.965343]  __sys_sendmsg+0xb5/0x140
[  279.966132]  do_syscall_64+0x3d/0x90
[  279.966908]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

[  279.968407] Freed by task 2960:
[  279.969114]  kasan_save_stack+0x1e/0x40
[  279.969929]  kasan_set_track+0x21/0x30
[  279.970729]  kasan_save_free_info+0x2a/0x40
[  279.971603]  ____kasan_slab_free+0x11a/0x1b0
[  279.972483]  __kmem_cache_free+0x14d/0x280
[  279.973337]  tcf_block_setup+0x29d/0x6b0
[  279.974173]  tcf_block_offload_cmd.isra.0+0x226/0x2d0
[  279.975186]  tcf_block_get_ext+0x61c/0x1200
[  279.976080]  ingress_init+0x112/0x1c0 [sch_ingress]
[  279.977065]  qdisc_create+0x401/0xea0
[  279.977857]  tc_modify_qdisc+0x6f7/0x16d0
[  279.978695]  rtnetlink_rcv_msg+0x5fe/0x9d0
[  279.979562]  netlink_rcv_skb+0x12c/0x360
[  279.980388]  netlink_unicast+0x553/0x790
[  279.981214]  netlink_sendmsg+0x7a1/0xcb0
[  279.982043]  sock_sendmsg+0xc5/0x190
[  279.982827]  ____sys_sendmsg+0x535/0x6b0
[  279.983703]  ___sys_sendmsg+0xeb/0x170
[  279.984510]  __sys_sendmsg+0xb5/0x140
[  279.985298]  do_syscall_64+0x3d/0x90
[  279.986076]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

[  279.987532] The buggy address belongs to the object at ffff888147e2bf00
                which belongs to the cache kmalloc-192 of size 192
[  279.989747] The buggy address is located 32 bytes inside of
                freed 192-byte region [ffff888147e2bf00, ffff888147e2bfc0)

[  279.992367] The buggy address belongs to the physical page:
[  279.993430] page:00000000550f405c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x147e2a
[  279.995182] head:00000000550f405c order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  279.996713] anon flags: 0x200000000010200(slab|head|node=0|zone=2)
[  279.997878] raw: 0200000000010200 ffff888100042a00 0000000000000000 dead000000000001
[  279.999384] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
[  280.000894] page dumped because: kasan: bad access detected

[  280.002386] Memory state around the buggy address:
[  280.003338]  ffff888147e2be00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  280.004781]  ffff888147e2be80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[  280.006224] >ffff888147e2bf00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  280.007700]                                ^
[  280.008592]  ffff888147e2bf80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[  280.010035]  ffff888147e2c000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  280.011564] ==================================================================

Fixes: 59094b1e5094 ("net: sched: use flow block API")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-17 11:35:58 +02:00
Gwangun Jung
35dceaeab9 net: sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg
[ Upstream commit 3037933448f60f9acb705997eae62013ecb81e0d ]

If the TCA_QFQ_LMAX value is not offered through nlattr, lmax is determined by the MTU value of the network device.
The MTU of the loopback device can be set up to 2^31-1.
As a result, it is possible to have an lmax value that exceeds QFQ_MIN_LMAX.

Due to the invalid lmax value, an index is generated that exceeds the QFQ_MAX_INDEX(=24) value, causing out-of-bounds read/write errors.

The following reports a oob access:

[   84.582666] BUG: KASAN: slab-out-of-bounds in qfq_activate_agg.constprop.0 (net/sched/sch_qfq.c:1027 net/sched/sch_qfq.c:1060 net/sched/sch_qfq.c:1313)
[   84.583267] Read of size 4 at addr ffff88810f676948 by task ping/301
[   84.583686]
[   84.583797] CPU: 3 PID: 301 Comm: ping Not tainted 6.3.0-rc5 #1
[   84.584164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   84.584644] Call Trace:
[   84.584787]  <TASK>
[   84.584906] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
[   84.585108] print_report (mm/kasan/report.c:320 mm/kasan/report.c:430)
[   84.585570] kasan_report (mm/kasan/report.c:538)
[   84.585988] qfq_activate_agg.constprop.0 (net/sched/sch_qfq.c:1027 net/sched/sch_qfq.c:1060 net/sched/sch_qfq.c:1313)
[   84.586599] qfq_enqueue (net/sched/sch_qfq.c:1255)
[   84.587607] dev_qdisc_enqueue (net/core/dev.c:3776)
[   84.587749] __dev_queue_xmit (./include/net/sch_generic.h:186 net/core/dev.c:3865 net/core/dev.c:4212)
[   84.588763] ip_finish_output2 (./include/net/neighbour.h:546 net/ipv4/ip_output.c:228)
[   84.589460] ip_output (net/ipv4/ip_output.c:430)
[   84.590132] ip_push_pending_frames (./include/net/dst.h:444 net/ipv4/ip_output.c:126 net/ipv4/ip_output.c:1586 net/ipv4/ip_output.c:1606)
[   84.590285] raw_sendmsg (net/ipv4/raw.c:649)
[   84.591960] sock_sendmsg (net/socket.c:724 net/socket.c:747)
[   84.592084] __sys_sendto (net/socket.c:2142)
[   84.593306] __x64_sys_sendto (net/socket.c:2150)
[   84.593779] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[   84.593902] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[   84.594070] RIP: 0033:0x7fe568032066
[   84.594192] Code: 0e 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c09[ 84.594796] RSP: 002b:00007ffce388b4e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c

Code starting with the faulting instruction
===========================================
[   84.595047] RAX: ffffffffffffffda RBX: 00007ffce388cc70 RCX: 00007fe568032066
[   84.595281] RDX: 0000000000000040 RSI: 00005605fdad6d10 RDI: 0000000000000003
[   84.595515] RBP: 00005605fdad6d10 R08: 00007ffce388eeec R09: 0000000000000010
[   84.595749] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
[   84.595984] R13: 00007ffce388cc30 R14: 00007ffce388b4f0 R15: 0000001d00000001
[   84.596218]  </TASK>
[   84.596295]
[   84.596351] Allocated by task 291:
[   84.596467] kasan_save_stack (mm/kasan/common.c:46)
[   84.596597] kasan_set_track (mm/kasan/common.c:52)
[   84.596725] __kasan_kmalloc (mm/kasan/common.c:384)
[   84.596852] __kmalloc_node (./include/linux/kasan.h:196 mm/slab_common.c:967 mm/slab_common.c:974)
[   84.596979] qdisc_alloc (./include/linux/slab.h:610 ./include/linux/slab.h:731 net/sched/sch_generic.c:938)
[   84.597100] qdisc_create (net/sched/sch_api.c:1244)
[   84.597222] tc_modify_qdisc (net/sched/sch_api.c:1680)
[   84.597357] rtnetlink_rcv_msg (net/core/rtnetlink.c:6174)
[   84.597495] netlink_rcv_skb (net/netlink/af_netlink.c:2574)
[   84.597627] netlink_unicast (net/netlink/af_netlink.c:1340 net/netlink/af_netlink.c:1365)
[   84.597759] netlink_sendmsg (net/netlink/af_netlink.c:1942)
[   84.597891] sock_sendmsg (net/socket.c:724 net/socket.c:747)
[   84.598016] ____sys_sendmsg (net/socket.c:2501)
[   84.598147] ___sys_sendmsg (net/socket.c:2557)
[   84.598275] __sys_sendmsg (./include/linux/file.h:31 net/socket.c:2586)
[   84.598399] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[   84.598520] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[   84.598688]
[   84.598744] The buggy address belongs to the object at ffff88810f674000
[   84.598744]  which belongs to the cache kmalloc-8k of size 8192
[   84.599135] The buggy address is located 2664 bytes to the right of
[   84.599135]  allocated 7904-byte region [ffff88810f674000, ffff88810f675ee0)
[   84.599544]
[   84.599598] The buggy address belongs to the physical page:
[   84.599777] page:00000000e638567f refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10f670
[   84.600074] head:00000000e638567f order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[   84.600330] flags: 0x200000000010200(slab|head|node=0|zone=2)
[   84.600517] raw: 0200000000010200 ffff888100043180 dead000000000122 0000000000000000
[   84.600764] raw: 0000000000000000 0000000080020002 00000001ffffffff 0000000000000000
[   84.601009] page dumped because: kasan: bad access detected
[   84.601187]
[   84.601241] Memory state around the buggy address:
[   84.601396]  ffff88810f676800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   84.601620]  ffff88810f676880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   84.601845] >ffff88810f676900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   84.602069]                                               ^
[   84.602243]  ffff88810f676980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   84.602468]  ffff88810f676a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   84.602693] ==================================================================
[   84.602924] Disabling lock debugging due to kernel taint

Fixes: 3015f3d2a3cd ("pkt_sched: enable QFQ to support TSO/GSO")
Reported-by: Gwangun Jung <exsociety@gmail.com>
Signed-off-by: Gwangun Jung <exsociety@gmail.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-26 11:24:01 +02:00
Pratyush Yadav
c381527918 net_sched: prevent NULL dereference if default qdisc setup failed
If qdisc_create_dflt() fails, it returns NULL. With CONFIG_NET_SCHED
enabled, the check qdisc != &noop_qdisc passes and qdisc will be passed
to qdisc_hash_add(), which dereferences it.

This assignment was present in the upstream commit 5891cd5ec46c2
("net_sched: add __rcu annotation to netdev->qdisc") but was missed in
the backport 22d95b5449249 ("net_sched: add __rcu annotation to
netdev->qdisc"), perhaps due to merge conflicts.  dev->qdisc is
&noop_qdisc by default and if qdisc_create_dflt() fails, this assignment
will make sure qdisc == &noop_qdisc and no NULL dereference will take
place.

This bug was discovered and resolved using Coverity Static Analysis
Security Testing (SAST) by Synopsys, Inc.

Fixes: 22d95b5449249 ("net_sched: add __rcu annotation to netdev->qdisc")
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20 12:07:34 +02:00
Eric Dumazet
0f5c0e0a4c net: sched: fix race condition in qdisc_graft()
commit ebda44da44f6f309d302522b049f43d6f829f7aa upstream.

We had one syzbot report [1] in syzbot queue for a while.
I was waiting for more occurrences and/or a repro but
Dmitry Vyukov spotted the issue right away.

<quoting Dmitry>
qdisc_graft() drops reference to qdisc in notify_and_destroy
while it's still assigned to dev->qdisc
</quoting>

Indeed, RCU rules are clear when replacing a data structure.
The visible pointer (dev->qdisc in this case) must be updated
to the new object _before_ RCU grace period is started
(qdisc_put(old) in this case).

[1]
BUG: KASAN: use-after-free in __tcf_qdisc_find.part.0+0xa3a/0xac0 net/sched/cls_api.c:1066
Read of size 4 at addr ffff88802065e038 by task syz-executor.4/21027

CPU: 0 PID: 21027 Comm: syz-executor.4 Not tainted 6.0.0-rc3-syzkaller-00363-g7726d4c3e60b #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:317 [inline]
print_report.cold+0x2ba/0x719 mm/kasan/report.c:433
kasan_report+0xb1/0x1e0 mm/kasan/report.c:495
__tcf_qdisc_find.part.0+0xa3a/0xac0 net/sched/cls_api.c:1066
__tcf_qdisc_find net/sched/cls_api.c:1051 [inline]
tc_new_tfilter+0x34f/0x2200 net/sched/cls_api.c:2018
rtnetlink_rcv_msg+0x955/0xca0 net/core/rtnetlink.c:6081
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2501
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x917/0xe10 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:734
____sys_sendmsg+0x6eb/0x810 net/socket.c:2482
___sys_sendmsg+0x110/0x1b0 net/socket.c:2536
__sys_sendmsg+0xf3/0x1c0 net/socket.c:2565
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f5efaa89279
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f5efbc31168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f5efab9bf80 RCX: 00007f5efaa89279
RDX: 0000000000000000 RSI: 0000000020000140 RDI: 0000000000000005
RBP: 00007f5efaae32e9 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f5efb0cfb1f R14: 00007f5efbc31300 R15: 0000000000022000
</TASK>

Allocated by task 21027:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:45 [inline]
set_alloc_info mm/kasan/common.c:437 [inline]
____kasan_kmalloc mm/kasan/common.c:516 [inline]
____kasan_kmalloc mm/kasan/common.c:475 [inline]
__kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:525
kmalloc_node include/linux/slab.h:623 [inline]
kzalloc_node include/linux/slab.h:744 [inline]
qdisc_alloc+0xb0/0xc50 net/sched/sch_generic.c:938
qdisc_create_dflt+0x71/0x4a0 net/sched/sch_generic.c:997
attach_one_default_qdisc net/sched/sch_generic.c:1152 [inline]
netdev_for_each_tx_queue include/linux/netdevice.h:2437 [inline]
attach_default_qdiscs net/sched/sch_generic.c:1170 [inline]
dev_activate+0x760/0xcd0 net/sched/sch_generic.c:1229
__dev_open+0x393/0x4d0 net/core/dev.c:1441
__dev_change_flags+0x583/0x750 net/core/dev.c:8556
rtnl_configure_link+0xee/0x240 net/core/rtnetlink.c:3189
rtnl_newlink_create net/core/rtnetlink.c:3371 [inline]
__rtnl_newlink+0x10b8/0x17e0 net/core/rtnetlink.c:3580
rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3593
rtnetlink_rcv_msg+0x43a/0xca0 net/core/rtnetlink.c:6090
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2501
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x917/0xe10 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:734
____sys_sendmsg+0x6eb/0x810 net/socket.c:2482
___sys_sendmsg+0x110/0x1b0 net/socket.c:2536
__sys_sendmsg+0xf3/0x1c0 net/socket.c:2565
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Freed by task 21020:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:45
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:367 [inline]
____kasan_slab_free+0x166/0x1c0 mm/kasan/common.c:329
kasan_slab_free include/linux/kasan.h:200 [inline]
slab_free_hook mm/slub.c:1754 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1780
slab_free mm/slub.c:3534 [inline]
kfree+0xe2/0x580 mm/slub.c:4562
rcu_do_batch kernel/rcu/tree.c:2245 [inline]
rcu_core+0x7b5/0x1890 kernel/rcu/tree.c:2505
__do_softirq+0x1d3/0x9c6 kernel/softirq.c:571

Last potentially related work creation:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
__kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348
call_rcu+0x99/0x790 kernel/rcu/tree.c:2793
qdisc_put+0xcd/0xe0 net/sched/sch_generic.c:1083
notify_and_destroy net/sched/sch_api.c:1012 [inline]
qdisc_graft+0xeb1/0x1270 net/sched/sch_api.c:1084
tc_modify_qdisc+0xbb7/0x1a00 net/sched/sch_api.c:1671
rtnetlink_rcv_msg+0x43a/0xca0 net/core/rtnetlink.c:6090
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2501
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x917/0xe10 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:734
____sys_sendmsg+0x6eb/0x810 net/socket.c:2482
___sys_sendmsg+0x110/0x1b0 net/socket.c:2536
__sys_sendmsg+0xf3/0x1c0 net/socket.c:2565
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Second to last potentially related work creation:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
__kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348
kvfree_call_rcu+0x74/0x940 kernel/rcu/tree.c:3322
neigh_destroy+0x431/0x630 net/core/neighbour.c:912
neigh_release include/net/neighbour.h:454 [inline]
neigh_cleanup_and_release+0x1f8/0x330 net/core/neighbour.c:103
neigh_del net/core/neighbour.c:225 [inline]
neigh_remove_one+0x37d/0x460 net/core/neighbour.c:246
neigh_forced_gc net/core/neighbour.c:276 [inline]
neigh_alloc net/core/neighbour.c:447 [inline]
___neigh_create+0x18b5/0x29a0 net/core/neighbour.c:642
ip6_finish_output2+0xfb8/0x1520 net/ipv6/ip6_output.c:125
__ip6_finish_output net/ipv6/ip6_output.c:195 [inline]
ip6_finish_output+0x690/0x1160 net/ipv6/ip6_output.c:206
NF_HOOK_COND include/linux/netfilter.h:296 [inline]
ip6_output+0x1ed/0x540 net/ipv6/ip6_output.c:227
dst_output include/net/dst.h:451 [inline]
NF_HOOK include/linux/netfilter.h:307 [inline]
NF_HOOK include/linux/netfilter.h:301 [inline]
mld_sendpack+0xa09/0xe70 net/ipv6/mcast.c:1820
mld_send_cr net/ipv6/mcast.c:2121 [inline]
mld_ifc_work+0x71c/0xdc0 net/ipv6/mcast.c:2653
process_one_work+0x991/0x1610 kernel/workqueue.c:2289
worker_thread+0x665/0x1080 kernel/workqueue.c:2436
kthread+0x2e4/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306

The buggy address belongs to the object at ffff88802065e000
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 56 bytes inside of
1024-byte region [ffff88802065e000, ffff88802065e400)

The buggy address belongs to the physical page:
page:ffffea0000819600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x20658
head:ffffea0000819600 order:3 compound_mapcount:0 compound_pincount:0
flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000010200 0000000000000000 dead000000000001 ffff888011841dc0
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3523, tgid 3523 (sshd), ts 41495190986, free_ts 41417713212
prep_new_page mm/page_alloc.c:2532 [inline]
get_page_from_freelist+0x109b/0x2ce0 mm/page_alloc.c:4283
__alloc_pages+0x1c7/0x510 mm/page_alloc.c:5515
alloc_pages+0x1a6/0x270 mm/mempolicy.c:2270
alloc_slab_page mm/slub.c:1824 [inline]
allocate_slab+0x27e/0x3d0 mm/slub.c:1969
new_slab mm/slub.c:2029 [inline]
___slab_alloc+0x7f1/0xe10 mm/slub.c:3031
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3118
slab_alloc_node mm/slub.c:3209 [inline]
__kmalloc_node_track_caller+0x2f2/0x380 mm/slub.c:4955
kmalloc_reserve net/core/skbuff.c:358 [inline]
__alloc_skb+0xd9/0x2f0 net/core/skbuff.c:430
alloc_skb_fclone include/linux/skbuff.h:1307 [inline]
tcp_stream_alloc_skb+0x38/0x580 net/ipv4/tcp.c:861
tcp_sendmsg_locked+0xc36/0x2f80 net/ipv4/tcp.c:1325
tcp_sendmsg+0x2b/0x40 net/ipv4/tcp.c:1483
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:734
sock_write_iter+0x291/0x3d0 net/socket.c:1108
call_write_iter include/linux/fs.h:2187 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x9e9/0xdd0 fs/read_write.c:578
ksys_write+0x1e8/0x250 fs/read_write.c:631
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1449 [inline]
free_pcp_prepare+0x5e4/0xd20 mm/page_alloc.c:1499
free_unref_page_prepare mm/page_alloc.c:3380 [inline]
free_unref_page+0x19/0x4d0 mm/page_alloc.c:3476
__unfreeze_partials+0x17c/0x1a0 mm/slub.c:2548
qlink_free mm/kasan/quarantine.c:168 [inline]
qlist_free_all+0x6a/0x170 mm/kasan/quarantine.c:187
kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:294
__kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:447
kasan_slab_alloc include/linux/kasan.h:224 [inline]
slab_post_alloc_hook mm/slab.h:727 [inline]
slab_alloc_node mm/slub.c:3243 [inline]
slab_alloc mm/slub.c:3251 [inline]
__kmem_cache_alloc_lru mm/slub.c:3258 [inline]
kmem_cache_alloc+0x267/0x3b0 mm/slub.c:3268
kmem_cache_zalloc include/linux/slab.h:723 [inline]
alloc_buffer_head+0x20/0x140 fs/buffer.c:2974
alloc_page_buffers+0x280/0x790 fs/buffer.c:829
create_empty_buffers+0x2c/0xee0 fs/buffer.c:1558
ext4_block_write_begin+0x1004/0x1530 fs/ext4/inode.c:1074
ext4_da_write_begin+0x422/0xae0 fs/ext4/inode.c:2996
generic_perform_write+0x246/0x560 mm/filemap.c:3738
ext4_buffered_write_iter+0x15b/0x460 fs/ext4/file.c:270
ext4_file_write_iter+0x44a/0x1660 fs/ext4/file.c:679
call_write_iter include/linux/fs.h:2187 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x9e9/0xdd0 fs/read_write.c:578

Fixes: af356afa010f ("net_sched: reintroduce dev->qdisc for use by sch_api")
Reported-by: syzbot <syzkaller@googlegroups.com>
Diagnosed-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20221018203258.2793282-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Zubin Mithra <zsm@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-05 11:16:46 +02:00
Eric Dumazet
22d95b5449 net_sched: add __rcu annotation to netdev->qdisc
commit 5891cd5ec46c2c2eb6427cb54d214b149635dd0e upstream.

syzbot found a data-race [1] which lead me to add __rcu
annotations to netdev->qdisc, and proper accessors
to get LOCKDEP support.

[1]
BUG: KCSAN: data-race in dev_activate / qdisc_lookup_rcu

write to 0xffff888168ad6410 of 8 bytes by task 13559 on cpu 1:
 attach_default_qdiscs net/sched/sch_generic.c:1167 [inline]
 dev_activate+0x2ed/0x8f0 net/sched/sch_generic.c:1221
 __dev_open+0x2e9/0x3a0 net/core/dev.c:1416
 __dev_change_flags+0x167/0x3f0 net/core/dev.c:8139
 rtnl_configure_link+0xc2/0x150 net/core/rtnetlink.c:3150
 __rtnl_newlink net/core/rtnetlink.c:3489 [inline]
 rtnl_newlink+0xf4d/0x13e0 net/core/rtnetlink.c:3529
 rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5594
 netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
 netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
 netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg net/socket.c:725 [inline]
 ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
 ___sys_sendmsg net/socket.c:2467 [inline]
 __sys_sendmsg+0x195/0x230 net/socket.c:2496
 __do_sys_sendmsg net/socket.c:2505 [inline]
 __se_sys_sendmsg net/socket.c:2503 [inline]
 __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff888168ad6410 of 8 bytes by task 13560 on cpu 0:
 qdisc_lookup_rcu+0x30/0x2e0 net/sched/sch_api.c:323
 __tcf_qdisc_find+0x74/0x3a0 net/sched/cls_api.c:1050
 tc_del_tfilter+0x1c7/0x1350 net/sched/cls_api.c:2211
 rtnetlink_rcv_msg+0x5ba/0x7e0 net/core/rtnetlink.c:5585
 netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
 netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
 netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg net/socket.c:725 [inline]
 ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
 ___sys_sendmsg net/socket.c:2467 [inline]
 __sys_sendmsg+0x195/0x230 net/socket.c:2496
 __do_sys_sendmsg net/socket.c:2505 [inline]
 __se_sys_sendmsg net/socket.c:2503 [inline]
 __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0xffffffff85dee080 -> 0xffff88815d96ec00

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 13560 Comm: syz-executor.2 Not tainted 5.17.0-rc3-syzkaller-00116-gf1baf68e1383-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: 470502de5bdb ("net: sched: unlock rules update API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Zubin Mithra <zsm@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-05 11:16:46 +02:00
Pedro Tammela
dda4f0a424 net/sched: act_sample: fix action bind logic
[ Upstream commit 4a20056a49a1854966562241922f68197f950539 ]

The TC architecture allows filters and actions to be created independently.
In filters the user can reference action objects using:
tc action add action sample ... index 1
tc filter add ... action pedit index 1

In the current code for act_sample this is broken as it checks netlink
attributes for create/update before actually checking if we are binding to an
existing action.

tdc results:
1..29
ok 1 9784 - Add valid sample action with mandatory arguments
ok 2 5c91 - Add valid sample action with mandatory arguments and continue control action
ok 3 334b - Add valid sample action with mandatory arguments and drop control action
ok 4 da69 - Add valid sample action with mandatory arguments and reclassify control action
ok 5 13ce - Add valid sample action with mandatory arguments and pipe control action
ok 6 1886 - Add valid sample action with mandatory arguments and jump control action
ok 7 7571 - Add sample action with invalid rate
ok 8 b6d4 - Add sample action with mandatory arguments and invalid control action
ok 9 a874 - Add invalid sample action without mandatory arguments
ok 10 ac01 - Add invalid sample action without mandatory argument rate
ok 11 4203 - Add invalid sample action without mandatory argument group
ok 12 14a7 - Add invalid sample action without mandatory argument group
ok 13 8f2e - Add valid sample action with trunc argument
ok 14 45f8 - Add sample action with maximum rate argument
ok 15 ad0c - Add sample action with maximum trunc argument
ok 16 83a9 - Add sample action with maximum group argument
ok 17 ed27 - Add sample action with invalid rate argument
ok 18 2eae - Add sample action with invalid group argument
ok 19 6ff3 - Add sample action with invalid trunc size
ok 20 2b2a - Add sample action with invalid index
ok 21 dee2 - Add sample action with maximum allowed index
ok 22 560e - Add sample action with cookie
ok 23 704a - Replace existing sample action with new rate argument
ok 24 60eb - Replace existing sample action with new group argument
ok 25 2cce - Replace existing sample action with new trunc argument
ok 26 59d1 - Replace existing sample action with new control argument
ok 27 0a6e - Replace sample action with invalid goto chain control
ok 28 3872 - Delete sample action with valid index
ok 29 a394 - Delete sample action with invalid index

Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-03-11 16:44:12 +01:00
Jamal Hadi Salim
7a6fb69bbc net/sched: Retire tcindex classifier
commit 8c710f75256bb3cf05ac7b1672c82b92c43f3d28 upstream.

The tcindex classifier has served us well for about a quarter of a century
but has not been getting much TLC due to lack of known users. Most recently
it has become easy prey to syzkaller. For this reason, we are retiring it.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11 16:44:07 +01:00
Vladimir Oltean
5d873a6c65 Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"
commit af7b29b1deaac6da3bb7637f0e263dfab7bfc7a3 upstream.

taprio_attach() has this logic at the end, which should have been
removed with the blamed patch (which is now being reverted):

	/* access to the child qdiscs is not needed in offload mode */
	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
		kfree(q->qdiscs);
		q->qdiscs = NULL;
	}

because otherwise, we make use of q->qdiscs[] even after this array was
deallocated, namely in taprio_leaf(). Therefore, whenever one would try
to attach a valid child qdisc to a fully offloaded taprio root, one
would immediately dereference a NULL pointer.

$ tc qdisc replace dev eno0 handle 8001: parent root taprio \
	num_tc 8 \
	map 0 1 2 3 4 5 6 7 \
	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
	max-sdu 0 0 0 0 0 200 0 0 \
	base-time 200 \
	sched-entry S 80 20000 \
	sched-entry S a0 20000 \
	sched-entry S 5f 60000 \
	flags 2
$ max_frame_size=1500
$ data_rate_kbps=20000
$ port_transmit_rate_kbps=1000000
$ idleslope=$data_rate_kbps
$ sendslope=$(($idleslope - $port_transmit_rate_kbps))
$ locredit=$(($max_frame_size * $sendslope / $port_transmit_rate_kbps))
$ hicredit=$(($max_frame_size * $idleslope / $port_transmit_rate_kbps))
$ tc qdisc replace dev eno0 parent 8001:7 cbs \
	idleslope $idleslope \
	sendslope $sendslope \
	hicredit $hicredit \
	locredit $locredit \
	offload 0

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000030
pc : taprio_leaf+0x28/0x40
lr : qdisc_leaf+0x3c/0x60
Call trace:
 taprio_leaf+0x28/0x40
 tc_modify_qdisc+0xf0/0x72c
 rtnetlink_rcv_msg+0x12c/0x390
 netlink_rcv_skb+0x5c/0x130
 rtnetlink_rcv+0x1c/0x2c

The solution is not as obvious as the problem. The code which deallocates
q->qdiscs[] is in fact copied and pasted from mqprio, which also
deallocates the array in mqprio_attach() and never uses it afterwards.

Therefore, the identical cleanup logic of priv->qdiscs[] that
mqprio_destroy() has is deceptive because it will never take place at
qdisc_destroy() time, but just at raw ops->destroy() time (otherwise
said, priv->qdiscs[] do not last for the entire lifetime of the mqprio
root), but rather, this is just the twisted way in which the Qdisc API
understands error path cleanup should be done (Qdisc_ops :: destroy() is
called even when Qdisc_ops :: init() never succeeded).

Side note, in fact this is also what the comment in mqprio_init() says:

	/* pre-allocate qdisc, attachment can't fail */

Or reworded, mqprio's priv->qdiscs[] scheme is only meant to serve as
data passing between Qdisc_ops :: init() and Qdisc_ops :: attach().

[ this comment was also copied and pasted into the initial taprio
  commit, even though taprio_attach() came way later ]

The problem is that taprio also makes extensive use of the q->qdiscs[]
array in the software fast path (taprio_enqueue() and taprio_dequeue()),
but it does not keep a reference of its own on q->qdiscs[i] (you'd think
that since it creates these Qdiscs, it holds the reference, but nope,
this is not completely true).

To understand the difference between taprio_destroy() and mqprio_destroy()
one must look before commit 13511704f8d7 ("net: taprio offload: enforce
qdisc to netdev queue mapping"), because that just muddied the waters.

In the "original" taprio design, taprio always attached itself (the root
Qdisc) to all netdev TX queues, so that dev_qdisc_enqueue() would go
through taprio_enqueue().

It also called qdisc_refcount_inc() on itself for as many times as there
were netdev TX queues, in order to counter-balance what tc_get_qdisc()
does when destroying a Qdisc (simplified for brevity below):

	if (n->nlmsg_type == RTM_DELQDISC)
		err = qdisc_graft(dev, parent=NULL, new=NULL, q, extack);

qdisc_graft(where "new" is NULL so this deletes the Qdisc):

	for (i = 0; i < num_q; i++) {
		struct netdev_queue *dev_queue;

		dev_queue = netdev_get_tx_queue(dev, i);

		old = dev_graft_qdisc(dev_queue, new);
		if (new && i > 0)
			qdisc_refcount_inc(new);

		qdisc_put(old);
		~~~~~~~~~~~~~~
		this decrements taprio's refcount once for each TX queue
	}

	notify_and_destroy(net, skb, n, classid,
			   rtnl_dereference(dev->qdisc), new);
			   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
			   and this finally decrements it to zero,
			   making qdisc_put() call qdisc_destroy()

The q->qdiscs[] created using qdisc_create_dflt() (or their
replacements, if taprio_graft() was ever to get called) were then
privately freed by taprio_destroy().

This is still what is happening after commit 13511704f8d7 ("net: taprio
offload: enforce qdisc to netdev queue mapping"), but only for software
mode.

In full offload mode, the per-txq "qdisc_put(old)" calls from
qdisc_graft() now deallocate the child Qdiscs rather than decrement
taprio's refcount. So when notify_and_destroy(taprio) finally calls
taprio_destroy(), the difference is that the child Qdiscs were already
deallocated.

And this is exactly why the taprio_attach() comment "access to the child
qdiscs is not needed in offload mode" is deceptive too. Not only the
q->qdiscs[] array is not needed, but it is also necessary to get rid of
it as soon as possible, because otherwise, we will also call qdisc_put()
on the child Qdiscs in qdisc_destroy() -> taprio_destroy(), and this
will cause a nasty use-after-free/refcount-saturate/whatever.

In short, the problem is that since the blamed commit, taprio_leaf()
needs q->qdiscs[] to not be freed by taprio_attach(), while qdisc_destroy()
-> taprio_destroy() does need q->qdiscs[] to be freed by taprio_attach()
for full offload. Fixing one problem triggers the other.

All of this can be solved by making taprio keep its q->qdiscs[i] with a
refcount elevated at 2 (in offloaded mode where they are attached to the
netdev TX queues), both in taprio_attach() and in taprio_graft(). The
generic qdisc_graft() would just decrement the child qdiscs' refcounts
to 1, and taprio_destroy() would give them the final coup de grace.

However the rabbit hole of changes is getting quite deep, and the
complexity increases. The blamed commit was supposed to be a bug fix in
the first place, and the bug it addressed is not so significant so as to
justify further rework in stable trees. So I'd rather just revert it.
I don't know enough about multi-queue Qdisc design to make a proper
judgement right now regarding what is/isn't idiomatic use of Qdisc
concepts in taprio. I will try to study the problem more and come with a
different solution in net-next.

Fixes: 1461d212ab27 ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs")
Reported-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>
Reported-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/20221004220100.1650558-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-25 11:53:27 +01:00
Dan Carpenter
7519069f1f net: sched: sch: Fix off by one in htb_activate_prios()
commit 9cec2aaffe969f2a3e18b5ec105fc20bb908e475 upstream.

The > needs be >= to prevent an out of bounds access.

Fixes: de5ca4c3852f ("net: sched: sch: Bounds check priority")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/Y+D+KN18FQI2DKLq@kili
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22 12:50:42 +01:00