Re: Autobuilder "rcu stall" issue summary


Richard Purdie
 

On Thu, 2021-06-24 at 17:31 +0100, Richard Purdie via lists.yoctoproject.org wrote:
I'll stop here but will follow up on the mail if I remember more info. Anyone else
feel free to add and if anyone has any insight into what is happening or how to better
debug, I'm very open to it!
I formed a new plan and set the stall detector to 3 seconds instead of 
21 seconds with this hack in master-next:

http://git.yoctoproject.org/cgit.cgi/poky/commit/?h=master-next&id=71b6bc157d39e09e8f76a15a049168eb72bbd3d9

i.e. adding CONFIG_RCU_CPU_STALL_TIMEOUT=3

I ran two builds on the autobuilder, one of which was a heavy rebuild build
so should trigger a lot of IO. I've been looking over the build results and
we have a number of builds which failed or are in the process of doing so,
I've dumped 4 below (three arm, one x86-64).

My thinking is that if we reduced the stall detector limit and only saw 
occasional hangs at the same rate as before, it was likely qemu. If we saw
an increase in hangs (which we definitely have even on the preliminary 
incomplete results), it is more likely something in the kernel RCU stall
code is taking out the system. The BUG: in occasional logs also hints at
the latter.

My own next step is probably to hack a "stall report" trigger into sysrq
and try and in some of my own images locally, see if I can break this at
will.

Cheers,

Richard


qemuarm:

https://autobuilder.yoctoproject.org/typhoon/#/builders/53/builds/3608

/home/pokybuild/yocto-worker/qemuarm/build/build/tmp/work/qemuarm-poky-linux-gnueabi/core-image-sato-sdk/1.0-
r0/testimage/qemu_boot_log.20210625121601

** (matchbox-desktop:715): [1;33mWARNING[0m **: [34m12:17:35.454[0m: Error loading icon: Icon 'terminal' not
present in theme Sato
[ 2173.684752] hellomod: loading out-of-tree module taints kernel.
[ 2173.726821] Hello world!
[ 2178.269807] Cleaning up hellomod.
[ 2426.747471] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2426.753269] rcu: 2-....: (32 ticks this GP) idle=21e/1/0x40000000 softirq=95871/95871 fqs=19
[ 2426.823824] (detected by 0, t=302 jiffies, g=214353, q=107)


qemuarm-oecore:

https://autobuilder.yoctoproject.org/typhoon/#/builders/47/builds/3601


OpenEmbedded nodistro.0 qemuarm /dev/hvc0

/home/pokybuild/yocto-worker/qemuarm-oecore/build/build/tmp-glibc/work/qemuarm-oe-linux-gnueabi/core-image-
sato/1.0-r0/testimage/qemu_boot_log.20210625114847

qemuarm login: [ 53.248661] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 53.250587] rcu: 0-...!: (1 ticks this GP) idle=4aa/0/0x1 softirq=1385/1385 fqs=0
[ 53.251571] (t=1019 jiffies g=1293 q=4)
[ 53.252192] rcu: rcu_preempt kthread starved for 1019 jiffies! g1293 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 -
cpu=1
[ 53.253026] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 53.253697] rcu: RCU grace-period kthread stack dump:
[ 53.254348] task:rcu_preempt state:R running task stack: 0 pid: 13 ppid: 2
flags:0x00000000
[ 53.258129] [<c0a73808>] (__schedule) from [<c0a73a74>] (schedule+0xc8/0x118)
[ 53.258925] [<c0a73a74>] (schedule) from [<c0a75ef8>] (schedule_timeout+0xcc/0xfc)
[ 53.259247] [<c0a75ef8>] (schedule_timeout) from [<c028b8b0>] (rcu_gp_kthread+0x2ac/0x92c)
[ 53.259619] [<c028b8b0>] (rcu_gp_kthread) from [<c024f314>] (kthread+0x158/0x164)
[ 53.259865] [<c024f314>] (kthread) from [<c0200138>] (ret_from_fork+0x14/0x3c)
[ 53.260207] Exception stack(0xc1539fb0 to 0xc1539ff8)
[ 53.260691] 9fa0: 00000000 00000000 00000000 00000000
[ 53.260973] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 53.261236] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 53.261718] NMI backtrace for cpu 0
[ 53.262040] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.43-yocto-standard #1
[ 53.262324] Hardware name: Generic DT based system
[ 53.262584] [<c020fe28>] (unwind_backtrace) from [<c020ab54>] (show_stack+0x18/0x1c)
[ 53.262847] [<c020ab54>] (show_stack) from [<c0a6e304>] (dump_stack+0x88/0xa4)
[ 53.263081] [<c0a6e304>] (dump_stack) from [<c06dc0e8>] (nmi_cpu_backtrace+0xb8/0xcc)
[ 53.263340] [<c06dc0e8>] (nmi_cpu_backtrace) from [<c06dc1b8>] (nmi_trigger_cpumask_backtrace+0xbc/0x164)
[ 53.263615] [<c06dc1b8>] (nmi_trigger_cpumask_backtrace) from [<c0a69248>] (rcu_dump_cpu_stacks+0xe8/0x114)
[ 53.264190] [<c0a69248>] (rcu_dump_cpu_stacks) from [<c028d9d8>] (rcu_sched_clock_irq+0x35c/0xa78)
[ 53.264463] [<c028d9d8>] (rcu_sched_clock_irq) from [<c0296f8c>] (update_process_times+0x5c/0x80)
[ 53.264757] [<c0296f8c>] (update_process_times) from [<c02a8220>] (tick_sched_timer+0x17c/0x1fc)
[ 53.265047] [<c02a8220>] (tick_sched_timer) from [<c0297fc8>] (__hrtimer_run_queues+0x19c/0x2c4)
[ 53.265317] [<c0297fc8>] (__hrtimer_run_queues) from [<c0298528>] (hrtimer_interrupt+0xc8/0x204)
[ 53.265559] [<c0298528>] (hrtimer_interrupt) from [<c08a112c>] (arch_timer_handler_virt+0x30/0x38)
[ 53.265802] [<c08a112c>] (arch_timer_handler_virt) from [<c027f844>] (handle_percpu_devid_irq+0x60/0x118)
[ 53.266079] [<c027f844>] (handle_percpu_devid_irq) from [<c0279bf0>] (generic_handle_irq+0x30/0x40)
[ 53.266320] [<c0279bf0>] (generic_handle_irq) from [<c027a218>] (__handle_domain_irq+0x98/0xa8)
[ 53.266554] [<c027a218>] (__handle_domain_irq) from [<c06eb518>] (gic_handle_irq+0x74/0x84)
[ 53.266778] [<c06eb518>] (gic_handle_irq) from [<c0201278>] (__irq_svc+0x58/0x8c)
[ 53.266998] Exception stack(0xc1201f20 to 0xc1201f68)
[ 53.267171] 1f20: 0000a4a8 cfd6de84 00000000 c0218860 c1200000 00000000 c1200000 c1201f80
[ 53.267390] 1f40: c1204f94 00000001 c1204fd4 00000000 c1684040 c1201f70 c0207930 c0207920
[ 53.267611] 1f60: 60010113 ffffffff
[ 53.267744] [<c0201278>] (__irq_svc) from [<c0207920>] (arch_cpu_idle+0x24/0x40)
[ 53.267946] [<c0207920>] (arch_cpu_idle) from [<c0a76898>] (default_idle_call+0x40/0x5c)
[ 53.268167] [<c0a76898>] (default_idle_call) from [<c025e110>] (do_idle+0x108/0x254)
[ 53.268384] [<c025e110>] (do_idle) from [<c025e534>] (cpu_startup_entry+0x20/0x24)
[ 53.268624] [<c025e534>] (cpu_startup_entry) from [<c1000e80>] (start_kernel+0x4a0/0x57c)
[ 73.311040] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 73.313047] (detected by 2, t=428 jiffies, g=1965, q=113)
[ 73.313886] rcu: All QSes seen, last rcu_preempt kthread activity 425 (-22941--23366),
jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 73.315413] rcu: rcu_preempt kthread starved for 425 jiffies! g1965 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 -
cpu=2
[ 73.316823] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 73.317775] rcu: RCU grace-period kthread stack dump:
[ 73.318304] task:rcu_preempt state:R stack: 0 pid: 13 ppid: 2 flags:0x00000000
[ 73.318759] [<c0a73808>] (__schedule) from [<c0a73a74>] (schedule+0xc8/0x118)
[ 73.319123] [<c0a73a74>] (schedule) from [<c0a75ef8>] (schedule_timeout+0xcc/0xfc)
[ 73.319500] [<c0a75ef8>] (schedule_timeout) from [<c028b8b0>] (rcu_gp_kthread+0x2ac/0x92c)
[ 73.319886] [<c028b8b0>] (rcu_gp_kthread) from [<c024f314>] (kthread+0x158/0x164)
[ 73.320201] [<c024f314>] (kthread) from [<c0200138>] (ret_from_fork+0x14/0x3c)
[ 73.320776] Exception stack(0xc1539fb0 to 0xc1539ff8)
[ 73.321088] 9fa0: 00000000 00000000 00000000 00000000
[ 73.321403] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 73.321698] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 73.322376] BUG: scheduling while atomic: swapper/2/0/0x00000002
[ 73.322697] Modules linked in:
[ 73.323092] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.10.43-yocto-standard #1
[ 73.323383] Hardware name: Generic DT based system
[ 73.323621] [<c020fe28>] (unwind_backtrace) from [<c020ab54>] (show_stack+0x18/0x1c)
[ 73.323908] [<c020ab54>] (show_stack) from [<c0a6e304>] (dump_stack+0x88/0xa4)
[ 73.324184] [<c0a6e304>] (dump_stack) from [<c02575e4>] (__schedule_bug+0x68/0x8c)
[ 73.324454] [<c02575e4>] (__schedule_bug) from [<c0a73578>] (__schedule+0x58/0x48c)
[ 73.324741] [<c0a73578>] (__schedule) from [<c0a73e00>] (schedule_idle+0x68/0x80)
[ 73.325046] [<c0a73e00>] (schedule_idle) from [<c025e534>] (cpu_startup_entry+0x20/0x24)
[ 73.325576] [<c025e534>] (cpu_startup_entry) from [<40201c2c>] (0x40201c2c)

qemuarm64-ltp (on arm64 host ubuntu1804-arm-1)

https://autobuilder.yoctoproject.org/typhoon/#/builders/96/builds/1888

/home/pokybuild/yocto-worker/qemuarm64-ltp/build/build/tmp/work/qemuarm64-poky-linux/core-image-sato/1.0-
r0/testimage/qemu_boot_log.20210625121516

qemuarm64 login: [ 21.445581] Process accounting resumed
[ 22.705496] Process accounting resumed
[ 331.583598] capability: warning: `capget01' uses 32-bit capabilities (legacy support in use)
[ 331.585764] capability: warning: `capget01' uses deprecated v2 capabilities in a way that may be insecure
[ 345.996555] Clock: inserting leap second 23:59:60 UTC
[ 713.924549] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 714.133632] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P9691/1:b..l
[ 714.255745] (detected by 1, t=822 jiffies, g=195937, q=3443)
[ 714.256974] task:msgstress03 state:R running task stack: 0 pid: 9589 ppid: 9543
flags:0x00000000
[ 714.258971] Call trace:
[ 714.259520] __switch_to+0x114/0x19c
[ 714.260274] __schedule+0x274/0x610
[ 714.261008] preempt_schedule_common+0x28/0x50
[ 714.261928] preempt_schedule+0x30/0x40
[ 714.262711] _raw_spin_unlock+0x38/0x50
[ 714.263507] do_msgrcv+0x284/0x4f0
[ 714.264230] __arm64_sys_msgrcv+0x38/0x44
[ 714.265042] el0_svc_common.constprop.0+0x9c/0x1c0
[ 714.266014] do_el0_svc+0x7c/0xa0
[ 714.266718] el0_svc+0x20/0x30
[ 714.267391] el0_sync_handler+0xa4/0x130
[ 714.268193] el0_sync+0x180/0x1c0
[ 714.268900] task:msgstress03 state:R running task stack: 0 pid: 9691 ppid: 32178
flags:0x00000000
[ 714.270830] Call trace:
[ 714.271369] __switch_to+0x114/0x19c
[ 714.313241] __schedule+0x274/0x610
[ 714.333040] preempt_schedule_irq+0x68/0x100
[ 714.333944] arm64_preempt_schedule_irq+0x2c/0x40
[ 714.334910] el1_irq+0xe4/0x180
[ 714.335606] wake_up_q+0xa8/0xd0
[ 714.336308] do_msgsnd+0x3f8/0x4e0
[ 714.337029] ksys_msgsnd+0x90/0xb0
[ 714.337750] __arm64_sys_msgsnd+0x2c/0x40
[ 714.338564] el0_svc_common.constprop.0+0x9c/0x1c0
[ 714.339552] do_el0_svc+0x7c/0xa0
[ 714.340266] el0_svc+0x20/0x30
[ 714.340936] el0_sync_handler+0xa4/0x130
[ 714.341735] el0_sync+0x180/0x1c0
[ 720.932547] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 721.016226] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3):
[ 721.072646] (detected by 0, t=752 jiffies, g=195941, q=7890)
[ 721.073841] rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295072518-4295072517),
jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 750.072043] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 750.073887] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P1738/1:b..l
[ 750.116802] (detected by 0, t=773 jiffies, g=196493, q=15176)
[ 750.118095] task:msgstress04 state:R running task stack: 0 pid: 1733 ppid: 1482
flags:0x00000000
[ 750.120144] Call trace:
[ 750.120730] __switch_to+0x114/0x19c
[ 750.144388] __schedule+0x274/0x610
[ 750.145215] preempt_schedule_common+0x28/0x50
[ 750.146187] preempt_schedule+0x30/0x40
[ 750.147020] _raw_spin_unlock+0x38/0x50
[ 750.147853] do_msgsnd+0x3ec/0x4e0
[ 750.148610] ksys_msgsnd+0x90/0xb0
[ 750.149365] __arm64_sys_msgsnd+0x2c/0x40
[ 750.150219] el0_svc_common.constprop.0+0x9c/0x1c0
[ 750.151244] do_el0_svc+0x7c/0xa0
[ 750.151995] el0_svc+0x20/0x30
[ 750.152698] el0_sync_handler+0xa4/0x130
[ 750.153532] el0_sync+0x180/0x1c0
[ 750.154273] task:msgstress04 state:R running task stack: 0 pid: 1738 ppid: 1482
flags:0x00000000
[ 750.156271] Call trace:
[ 750.156842] __switch_to+0x114/0x19c
[ 750.157626] __schedule+0x274/0x610
[ 750.158395] preempt_schedule_common+0x28/0x50
[ 750.159363] preempt_schedule+0x30/0x40
[ 750.160196] _raw_spin_unlock+0x38/0x50
[ 750.161020] do_msgrcv+0x284/0x4f0
[ 750.161774] __arm64_sys_msgrcv+0x38/0x44
[ 750.162620] el0_svc_common.constprop.0+0x9c/0x1c0
[ 750.163643] do_el0_svc+0x7c/0xa0
[ 750.164388] el0_svc+0x20/0x30
[ 750.165086] el0_sync_handler+0xa4/0x130
[ 750.165918] el0_sync+0x180/0x1c0
[ 763.280549] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 763.282336] (detected by 0, t=752 jiffies, g=196741, q=7494)
[ 763.283694] rcu: All QSes seen, last rcu_preempt kthread activity 703 (4295083105-4295082402),
jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 763.286260] rcu: rcu_preempt kthread starved for 703 jiffies! g196741 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 -
cpu=3
[ 763.288363] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 763.290242] rcu: RCU grace-period kthread stack dump:
[ 763.291334] task:rcu_preempt state:R running task stack: 0 pid: 13 ppid: 2
flags:0x00000028
[ 763.293387] Call trace:
[ 763.293976] __switch_to+0x114/0x19c
[ 763.294779] __schedule+0x274/0x610
[ 763.295582] schedule+0x7c/0x110
[ 763.296343] schedule_timeout+0xb0/0x1c0
[ 763.297216] rcu_gp_kthread+0x4e0/0xa90
[ 763.298073] kthread+0x168/0x16c
[ 763.298831] ret_from_fork+0x10/0x3c
[ 763.299741] BUG: scheduling while atomic: msgstress04/18860/0x00000002
[ 763.301093] Modules linked in:
[ 763.301808] CPU: 0 PID: 18860 Comm: msgstress04 Not tainted 5.10.43-yocto-standard #1
[ 763.303502] Hardware name: linux,dummy-virt (DT)
[ 763.304491] Call trace:
[ 763.305056] dump_backtrace+0x0/0x1a0
[ 763.305852] show_stack+0x24/0x30
[ 763.306595] dump_stack+0xd4/0x110
[ 763.307357] __schedule_bug+0x68/0x80
[ 763.308160] __schedule+0x454/0x610
[ 763.308925] schedule+0x7c/0x110
[ 763.309655] do_msgsnd+0x294/0x4e0
[ 763.310406] ksys_msgsnd+0x90/0xb0
[ 763.311167] __arm64_sys_msgsnd+0x2c/0x40
[ 763.312024] el0_svc_common.constprop.0+0x9c/0x1c0
[ 763.313039] do_el0_svc+0x7c/0xa0
[ 763.313780] el0_svc+0x20/0x30
[ 763.314479] el0_sync_handler+0xa4/0x130
[ 763.315326] el0_sync+0x180/0x1c0

multilib core-image-sato ubuntu1804-ty-1

https://autobuilder.yoctoproject.org/typhoon/#/builders/44/builds/3622

/home/pokybuild/yocto-worker/multilib/build/build/tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-
r0/testimage/qemu_boot_log.20210625121139

libGL error: MESA-LOADER: failed to open bochs-drm: /usr/lib/dri/bochs-drm_dri.so: cannot open shared object
file: No such file or directory (search paths /usr/lib/dri)
libGL error: failed to load driver: bochs-drm
libGL error: pci id for fd 4: 1234:1111, driver (null)
libGL error: MESA-LOADER: failed to open bochs-drm: /usr/lib64/dri/bochs-drm_dri.so: cannot open shared object
file: No such file or directory (search paths /usr/lib64/dri)
libGL error: failed to load driver: bochs-drm
[settings daemon] Forking. run with -n to prevent fork
dbus-daemon[652]: Activating service name='org.a11y.atspi.Registry' requested by ':1.1' (uid=0 pid=669
comm="matchbox-panel --start-applets showdesktop,windows")
dbus-daemon[652]: Successfully activated service 'org.a11y.atspi.Registry'
SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry

** (matchbox-desktop:668): [1;33mWARNING[0m **: [34m12:12:24.699[0m: Error loading icon: Icon 'terminal' not
present in theme Sato

** (matchbox-desktop:668): [1;33mWARNING[0m **: [34m12:12:24.713[0m: Error loading icon: Icon 'net-48d24' not
present in theme Sato
[ 21.640654] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 21.642263] rcu: 3-...0: (0 ticks this GP) idle=18a/1/0x4000000000000000 softirq=1618/1618 fqs=425
[ 21.644170] (detected by 1, t=3155 jiffies, g=2653, q=16)
[ 30.951654] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

Join swat@lists.yoctoproject.org to automatically receive all group messages.