Autobuilder "rcu stall" issue summary
Richard Purdie
We have info all over the place relating to our "rcu stall" issue on the
autobuilder. This is an attempt to document what we know and what we don't.

We see builds fail periodically where there is an "rcu stall" mentioned by the kernel and the image "locks up". An easier-to-read example is:

https://autobuilder.yoctoproject.org/typhoon/#/builders/56/builds/1488/steps/14/logs/stdio

A core-image-minimal in testimage sitting at 394% CPU having run for 97 minutes of CPU time. A minimal image doesn't execute much at all and should be done in a minute at most.

An example we were able to "catch" and debug a bit, at the expense of breaking the output, is:

https://autobuilder.yoctoproject.org/typhoon/#/builders/45/builds/3576/steps/12/logs/stdio

where a musl qemux86-64 core-image-sato-sdk stopped mid "configure" on target during testimage. We were able to look at the kernel log, the tail end of which showed:

[settings daemon] Forking. run with -n to prevent fork
[ 102.082744] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 102.093213] rcu: 1-...!: (1 GPs behind) idle=3f6/0/0x1 softirq=1374/1375 fqs=0
[ 102.096448] rcu: 3-...!: (12 ticks this GP) idle=e34/0/0x0 softirq=1658/1658 fqs=0
[ 102.099326] (detected by 1, t=21002 jiffies, g=1937, q=278)

The kernel was:

[ 0.000000] Linux version 5.10.43-yocto-standard (oe-user@oe-host) (x86_64-poky-linux-musl-gcc (GCC) 11.1.1 20210523, GNU ld (GNU Binutils) 2.36.1.20210209) #1 SMP PREEMPT Wed Jun 16 16:00:47 UTC 2021
[ 0.000000] Command line: root=/dev/vda rw mem=512M ip=192.168.7.4::192.168.7.3:255.255.255.0 console=ttyS0 console=ttyS1 oprofile.timer=1 tsc=reliable no_timer_check rcupdate.rcu_expedited=1 printk.time=1

Full log: https://www.rpsys.net/wp/rp/qemuboot-autobuild.log

In the above state, qemu was seen at 303% and 403% CPU usage. I do have strace output from the hung qemu:

https://www.rpsys.net/wp/rp/qemu-strace.log

It was interesting to note that pinging the image did work, and you can see that in the strace log, so something was still alive. You couldn't ssh in though, and existing ssh connections failed with timeouts.

We do also have the VM core dump for this, which Alexandre is looking into (obtained over qmp with:

{"execute": "qmp_capabilities"}
{"execute":"dump-guest-memory","arguments":{"paging":false,"protocol":"file:/tmp/vmcore.img"}}

). I couldn't see any extra kernel log messages in the vmcore file.

We've seen this before with core-image-sato, core-image-sato-sdk and also with qemuarm; the latter seems to imply that KVM isn't the issue and that it isn't arch specific, implying it is qemu related. The load itself doesn't appear to matter; above it was a configure script, but we've seen it on images without toolchains in a variety of states/loads. We've seen it on a variety of host workers, ubuntu1804 being the most common, but that is also our most common worker OS.

I suspect that this is the same issue that causes images to lock up in the firmware boot stage as in:

https://bugzilla.yoctoproject.org/show_bug.cgi?id=14018

There may also be other related bugs; without looking at the kernel traces it is hard to say for sure. https://bugzilla.yoctoproject.org/show_bug.cgi?id=14273 lists a load more.

We've not been able to replicate the issue at will. I have tried various stress-ng --all loads locally with no success at replicating it. I did compare the boot output from qemu on my local system with the above kernel boot and it matches apart from timings and the kvm-clock msr value.

The suspicion is that some high IO load event somehow triggers qemu to fail. I also suspect that the "rcu stall" is just a symptom of a wider issue.
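For reference, this is roughly how the dump was gathered and how it can be poked at. The QMP socket path, the use of socat, and the crash utility step below are illustrative assumptions on my part rather than the exact commands run on the worker; the two QMP commands are the ones quoted above.

    # Assumes qemu was started with a QMP socket available, e.g.
    #   -qmp unix:/tmp/qmp.sock,server,nowait
    # Connect to the monitor and paste the dump commands quoted above:
    socat - UNIX-CONNECT:/tmp/qmp.sock
    {"execute": "qmp_capabilities"}
    {"execute":"dump-guest-memory","arguments":{"paging":false,"protocol":"file:/tmp/vmcore.img"}}

    # Attach strace to the hung qemu process to capture syscall activity
    # (the strace log above is this kind of output):
    strace -f -p <qemu_pid> -o qemu-strace.log

    # One way to look for kernel log messages in the guest dump, assuming a
    # vmlinux with debug info from the matching kernel build is available:
    crash /path/to/vmlinux /tmp/vmcore.img
    # then at the crash prompt, "log" dumps the guest kernel ring buffer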
One other note: the stall report above is suspicious as it looks incomplete. I'll stop here but will follow up on the mail if I remember more info. Anyone else, feel free to add to this, and if anyone has any insight into what is happening or how to better debug it, I'm very open to it!

Cheers,

Richard