Further rcu stall on autobuilder


Richard Purdie
 

We've got yet another rcu stall failure on the autobuilder:

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/2123/steps/15/logs/stdio

and looking at the dmesg in the qemu log:

[ 20.424033] Freeing unused kernel image (rodata/data gap) memory: 652K
[ 20.425229] Run /sbin/init as init process
INIT: version 2.99 booting
FBIOPUT_VSCREENINFO failed, double buffering disabledStarting udev
[ 20.547298] udevd[161]: starting version 3.2.10
[ 20.553329] udevd[162]: starting eudev-3.2.10
[ 20.751260] EXT4-fs (vda): re-mounted. Opts: (null)
[ 20.752548] ext4 filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
INIT: Entering runlevel: 5
Configuring network interfaces... RTNETLINK answers: File exists
Starting random number generator daemon.
Starting OpenBSD Secure Shell server: sshd
done.
Starting rpcbind daemon...done.
starting statd: done
Starting atd: OK
[ 21.921925] Installing knfsd (copyright (C) 1996 okir@...).
starting 8 nfsd kernel threads: [ 23.066283] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 23.068096] NFSD: Using legacy client tracking operations.
[ 23.069086] NFSD: starting 90-second grace period (net f0000098)
done
starting mountd: [ 45.272151] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 45.273423] rcu: 1-...0: (10 ticks this GP) idle=7ba/1/0x4000000000000000 softirq=598/612 fqs=5249
[ 45.274951] (detected by 2, t=21002 jiffies, g=-195, q=13)
[ 138.202149] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 332.762209] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

This is with the kvm clock source disabled (in master-next) and with Bruce's 
5.10.38 upgrade so that kind of rules out either of those two things for this
issue. It also can't be the qemu platform or cpu emulation used since we've
changed that.

What is really odd is that it never actually prints the stalled tasks. That
seems really strange. It is obviously alive enough to print a stall message
later but stalls out and is terminated after 1500s.

Really open to ideas at this point. Should we try a newer kernel version
for testing in -next, see if we can isolate this to 5.10?

Cheers,

Richard


Bruce Ashfield <bruce.ashfield@...>
 

On Sun, May 23, 2021 at 12:47 PM Richard Purdie
<richard.purdie@...> wrote:

We've got yet another rcu stall failure on the autobuilder:

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/2123/steps/15/logs/stdio

and looking at the dmesg in the qemu log:

[ 20.424033] Freeing unused kernel image (rodata/data gap) memory: 652K
[ 20.425229] Run /sbin/init as init process
INIT: version 2.99 booting
FBIOPUT_VSCREENINFO failed, double buffering disabledStarting udev
[ 20.547298] udevd[161]: starting version 3.2.10
[ 20.553329] udevd[162]: starting eudev-3.2.10
[ 20.751260] EXT4-fs (vda): re-mounted. Opts: (null)
[ 20.752548] ext4 filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
INIT: Entering runlevel: 5
Configuring network interfaces... RTNETLINK answers: File exists
Starting random number generator daemon.
Starting OpenBSD Secure Shell server: sshd
done.
Starting rpcbind daemon...done.
starting statd: done
Starting atd: OK
[ 21.921925] Installing knfsd (copyright (C) 1996 okir@...).
starting 8 nfsd kernel threads: [ 23.066283] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 23.068096] NFSD: Using legacy client tracking operations.
[ 23.069086] NFSD: starting 90-second grace period (net f0000098)
done
starting mountd: [ 45.272151] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 45.273423] rcu: 1-...0: (10 ticks this GP) idle=7ba/1/0x4000000000000000 softirq=598/612 fqs=5249
[ 45.274951] (detected by 2, t=21002 jiffies, g=-195, q=13)
[ 138.202149] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 332.762209] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

This is with the kvm clock source disabled (in master-next) and with Bruce's
5.10.38 upgrade so that kind of rules out either of those two things for this
issue. It also can't be the qemu platform or cpu emulation used since we've
changed that.

What is really odd is that it never actually prints the stalled tasks. That
seems really strange. It is obviously alive enough to print a stall message
later but stalls out and is terminated after 1500s.

Really open to ideas at this point. Should we try a newer kernel version
for testing in -next, see if we can isolate this to 5.10?
If you want to switch to linux-yocto-dev, it is on 5.12.x, and I have
a local 5.13-rcX version of -dev.

We could whip together a SRCREV recipe for it, if you don't want to
use the AUTOREV.

I'm not going to do a full versioned linux-yocto for 5.12, but we can
special case this if we want to go that route.

Bruce



Cheers,

Richard

--
- Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end
- "Use the force Harry" - Gandalf, Star Trek II


Richard Purdie
 

On Sun, 2021-05-23 at 12:51 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:47 PM Richard Purdie
<richard.purdie@...> wrote:

We've got yet another rcu stall failure on the autobuilder:

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/2123/steps/15/logs/stdio

and looking at the dmesg in the qemu log:

[ 20.424033] Freeing unused kernel image (rodata/data gap) memory: 652K
[ 20.425229] Run /sbin/init as init process
INIT: version 2.99 booting
FBIOPUT_VSCREENINFO failed, double buffering disabledStarting udev
[ 20.547298] udevd[161]: starting version 3.2.10
[ 20.553329] udevd[162]: starting eudev-3.2.10
[ 20.751260] EXT4-fs (vda): re-mounted. Opts: (null)
[ 20.752548] ext4 filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
INIT: Entering runlevel: 5
Configuring network interfaces... RTNETLINK answers: File exists
Starting random number generator daemon.
Starting OpenBSD Secure Shell server: sshd
done.
Starting rpcbind daemon...done.
starting statd: done
Starting atd: OK
[ 21.921925] Installing knfsd (copyright (C) 1996 okir@...).
starting 8 nfsd kernel threads: [ 23.066283] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 23.068096] NFSD: Using legacy client tracking operations.
[ 23.069086] NFSD: starting 90-second grace period (net f0000098)
done
starting mountd: [ 45.272151] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 45.273423] rcu: 1-...0: (10 ticks this GP) idle=7ba/1/0x4000000000000000 softirq=598/612 fqs=5249
[ 45.274951] (detected by 2, t=21002 jiffies, g=-195, q=13)
[ 138.202149] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 332.762209] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

This is with the kvm clock source disabled (in master-next) and with Bruce's
5.10.38 upgrade so that kind of rules out either of those two things for this
issue. It also can't be the qemu platform or cpu emulation used since we've
changed that.

What is really odd is that it never actually prints the stalled tasks. That
seems really strange. It is obviously alive enough to print a stall message
later but stalls out and is terminated after 1500s.

Really open to ideas at this point. Should we try a newer kernel version
for testing in -next, see if we can isolate this to 5.10?
If you want to switch to linux-yocto-dev, it is on 5.12.x, and I have
a local 5.13-rcX version of -dev.

We could whip together a SRCREV recipe for it, if you don't want to
use the AUTOREV.

I'm not going to do a full versioned linux-yocto for 5.12, but we can
special case this if we want to go that route.
A set of SRCREVs sounds like the best plan, I think it might be worth testing
to see if things improve or not.

What is also odd is that in that in that same build, another qemu instance
hung in syslinux loading bzImage. We've seen this before occasionally and
it seems to keep happening periodically. That would seem more like a qemu
bug yet we're on the latest qemu release :/.

In neither case did Randy's stall detector trigger as far as I can tell.

Cheers,

Richard


Bruce Ashfield <bruce.ashfield@...>
 

On Sun, May 23, 2021 at 12:56 PM Richard Purdie
<richard.purdie@...> wrote:

On Sun, 2021-05-23 at 12:51 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:47 PM Richard Purdie
<richard.purdie@...> wrote:

We've got yet another rcu stall failure on the autobuilder:

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/2123/steps/15/logs/stdio

and looking at the dmesg in the qemu log:

[ 20.424033] Freeing unused kernel image (rodata/data gap) memory: 652K
[ 20.425229] Run /sbin/init as init process
INIT: version 2.99 booting
FBIOPUT_VSCREENINFO failed, double buffering disabledStarting udev
[ 20.547298] udevd[161]: starting version 3.2.10
[ 20.553329] udevd[162]: starting eudev-3.2.10
[ 20.751260] EXT4-fs (vda): re-mounted. Opts: (null)
[ 20.752548] ext4 filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
INIT: Entering runlevel: 5
Configuring network interfaces... RTNETLINK answers: File exists
Starting random number generator daemon.
Starting OpenBSD Secure Shell server: sshd
done.
Starting rpcbind daemon...done.
starting statd: done
Starting atd: OK
[ 21.921925] Installing knfsd (copyright (C) 1996 okir@...).
starting 8 nfsd kernel threads: [ 23.066283] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 23.068096] NFSD: Using legacy client tracking operations.
[ 23.069086] NFSD: starting 90-second grace period (net f0000098)
done
starting mountd: [ 45.272151] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 45.273423] rcu: 1-...0: (10 ticks this GP) idle=7ba/1/0x4000000000000000 softirq=598/612 fqs=5249
[ 45.274951] (detected by 2, t=21002 jiffies, g=-195, q=13)
[ 138.202149] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 332.762209] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

This is with the kvm clock source disabled (in master-next) and with Bruce's
5.10.38 upgrade so that kind of rules out either of those two things for this
issue. It also can't be the qemu platform or cpu emulation used since we've
changed that.

What is really odd is that it never actually prints the stalled tasks. That
seems really strange. It is obviously alive enough to print a stall message
later but stalls out and is terminated after 1500s.

Really open to ideas at this point. Should we try a newer kernel version
for testing in -next, see if we can isolate this to 5.10?
If you want to switch to linux-yocto-dev, it is on 5.12.x, and I have
a local 5.13-rcX version of -dev.

We could whip together a SRCREV recipe for it, if you don't want to
use the AUTOREV.

I'm not going to do a full versioned linux-yocto for 5.12, but we can
special case this if we want to go that route.
A set of SRCREVs sounds like the best plan, I think it might be worth testing
to see if things improve or not.
I created the attached recipes. Built and booted on qemux86-64 with no
issues.

I assume you'll do the appropriate preferred version in the test
branches to make
sure they are used instead of 5.10 ?

Bruce


What is also odd is that in that in that same build, another qemu instance
hung in syslinux loading bzImage. We've seen this before occasionally and
it seems to keep happening periodically. That would seem more like a qemu
bug yet we're on the latest qemu release :/.

In neither case did Randy's stall detector trigger as far as I can tell.

Cheers,

Richard

--
- Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end
- "Use the force Harry" - Gandalf, Star Trek II


Richard Purdie
 

On Mon, 2021-05-24 at 09:21 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:56 PM Richard Purdie
<richard.purdie@...> wrote:

On Sun, 2021-05-23 at 12:51 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:47 PM Richard Purdie
<richard.purdie@...> wrote:
A set of SRCREVs sounds like the best plan, I think it might be worth testing
to see if things improve or not.
I created the attached recipes. Built and booted on qemux86-64 with no
issues.

I assume you'll do the appropriate preferred version in the test
branches to make
sure they are used instead of 5.10 ?
About the time you were writing this, I'd hacked up:

http://git.yoctoproject.org/cgit.cgi/poky/commit/?h=master-next&id=de3e2253482b6d9df1137128a9fde35dec8fd915

and put it into a build on the autobuilder. It caused meta-arm to blow up
and I suspect there may be other fallout but we'll see...

FWIW, I checked with Alexandre and it seems all the rcu failure issues
are on qemuXXX builds but not qemuXXX-alt. The former is 5.10, the latter 
5.4.

I'm starting to strongly suspect there is some issue with 5.10 as we don't
see this with dunfell or with poky-alt :/. I'd wonder why nobody else has
noticed though...

Cheers,

Richard


Richard Purdie
 

On Mon, 2021-05-24 at 15:29 +0100, Richard Purdie via lists.yoctoproject.org wrote:
On Mon, 2021-05-24 at 09:21 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:56 PM Richard Purdie
<richard.purdie@...> wrote:

On Sun, 2021-05-23 at 12:51 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:47 PM Richard Purdie
<richard.purdie@...> wrote:
A set of SRCREVs sounds like the best plan, I think it might be worth testing
to see if things improve or not.
I created the attached recipes. Built and booted on qemux86-64 with no
issues.

I assume you'll do the appropriate preferred version in the test
branches to make
sure they are used instead of 5.10 ?
About the time you were writing this, I'd hacked up:

http://git.yoctoproject.org/cgit.cgi/poky/commit/?h=master-next&id=de3e2253482b6d9df1137128a9fde35dec8fd915

and put it into a build on the autobuilder. It caused meta-arm to blow up
and I suspect there may be other fallout but we'll see...

FWIW, I checked with Alexandre and it seems all the rcu failure issues
are on qemuXXX builds but not qemuXXX-alt. The former is 5.10, the latter 
5.4.

I'm starting to strongly suspect there is some issue with 5.10 as we don't
see this with dunfell or with poky-alt :/. I'd wonder why nobody else has
noticed though...
I switched to Bruce's 5.12 patches. Unfortunately even with 5.12:

https://autobuilder.yoctoproject.org/typhoon/#/builders/81/builds/2118/steps/12/logs/stdio

:(

Also,
https://autobuilder.yoctoproject.org/typhoon/#/builders/110/builds/2362
and the corresponding:
https://autobuilder.yocto.io/pub/non-release/20210523-10/testresults/qemuarm-alt/2021-05-24--01-52/host_stats_1_top.txt
is interesting. That was a qemuarm-alt image (5.4 kernel) which could be a genuine load 
issue. It is getting 300% cpu though so hardly resource starved.

Ideas welcome at this point.

Cheers,

Richard