Bitao Hu d7037381d0 watchdog/softlockup: Low-overhead detection of interrupt storm
The following softlockup is caused by an interrupt storm, but this cannot
be identified from the call tree, because the call tree is just a snapshot
and doesn't fully capture the behavior of the CPU during the soft lockup.
  watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
  ...
  Call trace:
    __do_softirq+0xa0/0x37c
    __irq_exit_rcu+0x108/0x140
    irq_exit+0x14/0x20
    __handle_domain_irq+0x84/0xe0
    gic_handle_irq+0x80/0x108
    el0_irq_naked+0x50/0x58

Therefore, it is necessary to report CPU utilization during the
softlockup_threshold period (one report every sample_period, for a total
of 5 reports), like this:
  watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
  CPU#28 Utilization every 4s during lockup:
    #1: 0% system, 0% softirq, 100% hardirq, 0% idle
    #2: 0% system, 0% softirq, 100% hardirq, 0% idle
    #3: 0% system, 0% softirq, 100% hardirq, 0% idle
    #4: 0% system, 0% softirq, 100% hardirq, 0% idle
    #5: 0% system, 0% softirq, 100% hardirq, 0% idle
  ...
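
The sampling described above can be sketched in user-space C. This is an
illustrative sketch only, not the kernel's implementation: the names
util_ring, util_ring_sample and util_ring_print are hypothetical, and the
cumulative per-category CPU times are assumed to be supplied by the caller
in nanoseconds. Each sample converts the time accumulated since the
previous sample into a percentage of sample_period and stores it in a
5-slot ring, so the last softlockup_threshold worth of history can be
printed when a lockup is reported.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: 5 samples cover one softlockup_threshold period. */
#define NUM_SAMPLES 5

enum stat_kind { STAT_SYSTEM, STAT_SOFTIRQ, STAT_HARDIRQ, STAT_IDLE, NUM_STATS };

/* Hypothetical per-CPU state; the kernel's real data layout may differ. */
struct util_ring {
	uint64_t last[NUM_STATS];             /* cumulative ns at the previous sample */
	uint8_t util[NUM_SAMPLES][NUM_STATS]; /* percent per category, per sample */
	unsigned int tail;                    /* next slot to overwrite (the oldest) */
};

/* Record one sample: turn per-category time deltas into percent of the period. */
void util_ring_sample(struct util_ring *r, const uint64_t now[NUM_STATS],
		      uint64_t period_ns)
{
	assert(period_ns > 0);
	for (int i = 0; i < NUM_STATS; i++) {
		uint64_t delta = now[i] - r->last[i];
		uint64_t pct = (100 * delta + period_ns - 1) / period_ns; /* round up */

		r->util[r->tail][i] = pct > 100 ? 100 : (uint8_t)pct;
		r->last[i] = now[i];
	}
	r->tail = (r->tail + 1) % NUM_SAMPLES;
}

/* Print the stored samples, oldest first. */
void util_ring_print(const struct util_ring *r, int cpu, unsigned int period_s)
{
	printf("CPU#%d Utilization every %us during lockup:\n", cpu, period_s);
	for (int n = 0; n < NUM_SAMPLES; n++) {
		const uint8_t *u = r->util[(r->tail + n) % NUM_SAMPLES];

		printf("  #%d: %u%% system, %u%% softirq, %u%% hardirq, %u%% idle\n",
		       n + 1, (unsigned)u[STAT_SYSTEM], (unsigned)u[STAT_SOFTIRQ],
		       (unsigned)u[STAT_HARDIRQ], (unsigned)u[STAT_IDLE]);
	}
}
```

Storing one byte per category keeps the per-CPU footprint small, which
matters because the buffers exist for every possible CPU.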

This is helpful in determining whether an interrupt storm has occurred or
in identifying the cause of the softlockup. The criteria for determination
are as follows:

  a. If the hardirq utilization is high, then an interrupt storm should be
     considered and the root cause cannot be determined from the call tree.
  b. If the softirq utilization is high, then the call tree might not
     necessarily point at the root cause.
  c. If the system utilization is high, then analyzing the root
     cause from the call tree is possible in most cases.
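
The three criteria above can be written as a small triage helper. This is
a sketch for someone reading the report, not code from this change: the
function and enum names are hypothetical, and the 50% cutoff for "high" is
an assumption, since the criteria do not define a threshold.

```c
#include <stdio.h>

/* Assumption: "high" means at least this percentage of one sample period. */
#define HIGH_UTIL_PCT 50

enum lockup_hint {
	HINT_INTERRUPT_STORM, /* criterion a: call tree cannot show the root cause */
	HINT_SOFTIRQ,         /* criterion b: call tree may not show the root cause */
	HINT_CALL_TREE,       /* criterion c: call tree is usually sufficient */
	HINT_UNKNOWN          /* none of the categories dominates */
};

/* Apply criteria a, b and c in order to one utilization sample. */
enum lockup_hint classify_sample(unsigned int system_pct,
				 unsigned int softirq_pct,
				 unsigned int hardirq_pct)
{
	if (hardirq_pct >= HIGH_UTIL_PCT)
		return HINT_INTERRUPT_STORM;
	if (softirq_pct >= HIGH_UTIL_PCT)
		return HINT_SOFTIRQ;
	if (system_pct >= HIGH_UTIL_PCT)
		return HINT_CALL_TREE;
	return HINT_UNKNOWN;
}

/* Human-readable summary of a hint. */
const char *hint_text(enum lockup_hint h)
{
	switch (h) {
	case HINT_INTERRUPT_STORM:
		return "consider an interrupt storm; do not rely on the call tree";
	case HINT_SOFTIRQ:
		return "softirq load is high; call tree may not show the root cause";
	case HINT_CALL_TREE:
		return "system time is high; analyze the call tree";
	default:
		return "no dominant category";
	}
}
```

For the sample report shown earlier (0% system, 0% softirq, 100% hardirq),
classify_sample(0, 0, 100) returns HINT_INTERRUPT_STORM.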

The mechanism requires a considerable amount of global storage space
when configured for the maximum number of CPUs. Therefore, add a
SOFTLOCKUP_DETECTOR_INTR_STORM Kconfig knob that defaults to "yes"
if the maximum number of CPUs is <= 128.

Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240411074134.30922-5-yaoma@linux.alibaba.com
2024-04-12 17:08:05 +02:00