Re: Autobuilder "rcu stall" issue summary


Richard Purdie
 

On Fri, 2021-06-25 at 16:34 +0100, Richard Purdie via lists.yoctoproject.org wrote:
On Thu, 2021-06-24 at 17:31 +0100, Richard Purdie via lists.yoctoproject.org wrote:
I'll stop here but will follow up on the mail if I remember more info. Anyone else
feel free to add and if anyone has any insight into what is happening or how to better
debug, I'm very open to it!
I formed a new plan and set the stall detector to 3 seconds instead of 
21 seconds with this hack in master-next:

http://git.yoctoproject.org/cgit.cgi/poky/commit/?h=master-next&id=71b6bc157d39e09e8f76a15a049168eb72bbd3d9

i.e. adding CONFIG_RCU_CPU_STALL_TIMEOUT=3
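
For anyone wanting to confirm that a given image really was built with the
shorter timeout, the following is a minimal sketch, assuming you have the
text of the generated kernel .config to hand. The function name
stall_timeout_from_config is made up for illustration and is not part of
any Yocto or kernel tooling.

```python
import re


def stall_timeout_from_config(config_text):
    """Return CONFIG_RCU_CPU_STALL_TIMEOUT as an int, or None if it is not set.

    config_text is the contents of a kernel .config file (hypothetical input;
    how you obtain it depends on your build setup).
    """
    match = re.search(
        r"^CONFIG_RCU_CPU_STALL_TIMEOUT=(\d+)\s*$",
        config_text,
        re.MULTILINE,
    )
    return int(match.group(1)) if match else None


# Illustrative input only, not taken from a real build.
example = "CONFIG_RCU_STALL_COMMON=y\nCONFIG_RCU_CPU_STALL_TIMEOUT=3\n"
print(stall_timeout_from_config(example))
```

On a running system the same value should also be readable from
/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout.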

I ran two builds on the autobuilder, one of which was a heavy rebuild and
so should trigger a lot of IO. I've been looking over the build results and
we have a number of builds which failed or are in the process of doing so,
I've dumped 4 below (three arm, one x86-64).

My thinking is that if we reduced the stall detector limit and only saw 
occasional hangs at the same rate as before, it was likely qemu. If we saw
an increase in hangs (which we definitely have even on the preliminary 
incomplete results), it is more likely something in the kernel RCU stall
code is taking out the system. The BUG: in occasional logs also hints at
the latter.
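
The comparison above comes down to counting stall reports and BUG: lines
per build log at each timeout setting. A rough sketch of that counting
follows, assuming the logs contain the usual "detected stall" wording from
the kernel's RCU stall warnings. The patterns and the summarize_log name
are assumptions for illustration and may need adjusting to match the real
autobuilder logs.

```python
import re

# Assumed patterns: kernel stall reports normally contain "detected stall"
# (e.g. "self-detected stall on CPU"); adjust to match the actual logs.
STALL_RE = re.compile(r"rcu.*detected stall", re.IGNORECASE)
BUG_RE = re.compile(r"\bBUG:")


def summarize_log(text):
    """Count lines that look like RCU stall reports or kernel BUG: entries."""
    lines = text.splitlines()
    return {
        "stalls": sum(1 for line in lines if STALL_RE.search(line)),
        "bugs": sum(1 for line in lines if BUG_RE.search(line)),
    }


# Illustrative log lines only, not copied from a real failure.
sample = (
    "[   10.0] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:\n"
    "[   10.1] BUG: example entry\n"
    "[   11.0] rcu: INFO: rcu_sched self-detected stall on CPU\n"
    "[   12.0] unrelated line\n"
)
print(summarize_log(sample))
```

Running this over the logs from builds at 21 seconds and at 3 seconds would
give per-build counts to compare, which is all the reasoning above needs.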

My own next step is probably to hack a "stall report" trigger into sysrq
and then, in some of my own images locally, see if I can break this at
will.
I realised there was a potential lock issue in the rcu stall code in the
kernel. In looking into it with Paul we found that upstream have a fix
queued for the same issue:

https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?id=406a2f008f2e

This would account for the BUG: entries in the previous email. I plan to
test this with the short rcu stall timeout to see if we still get the hangs.
Adding this fix to linux-yocto would seem helpful, if only to rule it out,
but the bug it fixes does look like the kind of thing that would cause the
issues we've been tracking.

Cheers,

Richard
