
VMs hanging with rcu stall problems

Richard Purdie
 

Not cross-posted but mentioned here for info; seeing if the qemu devs have any ideas.

Cheers,

Richard


qa-extras2 hang rcu/scheduling while atomic

Richard Purdie
 

https://autobuilder.yoctoproject.org/typhoon/#/builders/72/builds/3538

failure during execution of dnf --help

Traceback from the qemu logs for the 5.10 kernel (full log attached):

qemux86-64 login: [ 133.333475] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 133.337109] (detected by 2, t=25864 jiffies, g=1529, q=10)
[ 133.339025] rcu: All QSes seen, last rcu_preempt kthread activity 4865 (4294800423-4294795558), jiffies_till_next_fqs=3, root ->qsmask 0x0
[ 133.343445] rcu: rcu_preempt kthread starved for 4870 jiffies! g1529 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=2
[ 133.346976] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 133.350262] rcu: RCU grace-period kthread stack dump:
[ 133.352704] task:rcu_preempt state:R stack: 0 pid: 13 ppid: 2 flags:0x00004000
[ 133.355581] Call Trace:
[ 133.356488] __schedule+0x1dc/0x570
[ 133.357693] ? __mod_timer+0x220/0x3c0
[ 133.359018] schedule+0x68/0xe0
[ 133.360000] schedule_timeout+0x8f/0x160
[ 133.361267] ? force_qs_rnp+0x8d/0x1c0
[ 133.362515] ? __next_timer_interrupt+0x100/0x100
[ 133.364264] rcu_gp_kthread+0x55f/0xba0
[ 133.365701] ? note_gp_changes+0x70/0x70
[ 133.367356] kthread+0x145/0x170
[ 133.368597] ? kthread_associate_blkcg+0xc0/0xc0
[ 133.370686] ret_from_fork+0x22/0x30
[ 133.371976] BUG: scheduling while atomic: swapper/2/0/0x00000002
[ 133.374066] Modules linked in: bnep
[ 133.375324] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.10.41-yocto-standard #1
[ 133.377813] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 133.381882] Call Trace:
[ 133.382744] dump_stack+0x5e/0x74
[ 133.384027] __schedule_bug.cold+0x4b/0x59
[ 133.385362] __schedule+0x3f6/0x570
[ 133.386655] schedule_idle+0x2c/0x40
[ 133.388033] do_idle+0x15a/0x250
[ 133.389257] ? complete+0x3f/0x50
[ 133.390406] cpu_startup_entry+0x20/0x30
[ 133.391827] start_secondary+0xf1/0x100
[ 133.393143] secondary_startup_64_no_verify+0xc2/0xcb
[ 191.482302] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 255.155323] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

We shouldn't be seeing "scheduling while atomic", and the stall detector
also isn't listing the stalled tasks.

Cheers,

Richard


Re: [PATCH 0/4] Re-implement prserv on top of asyncrpc

Paul Barker <pbarker@...>
 

On Mon, 31 May 2021 at 12:25, Richard Purdie
<richard.purdie@...> wrote:

Hi Paul,

On Fri, 2021-05-28 at 09:42 +0100, Paul Barker wrote:
These changes replace the old XML-based RPC system in prserv with the
new asyncrpc implementation originally used by hashserv. A couple of
improvements are required in asyncrpc to support this.

I finally stumbled across the issue which led to the hanging builds
seen on the autobuilder when testing the initial RFC series.
It was a fairly dumb mistake on my behalf and I'm not sure how it
didn't trigger in my initial testing! The
`PRServerClient.handle_export()` function was missing a call to
`self.write_message()` so the client just ended up stuck waiting for a
response that was never to come. This issue is fixed here.

I've run these changes through both `bitbake-selftest` and
`oe-selftest -a` and all looks good on my end. A couple of failures
were seen in oe-selftest but these are related to my host system
configuration (socat not installed, firewall blocking ports, etc) so
I'm fairly confident they aren't caused by this patch series.
Thanks for these. Unfortunately I think there is still a gremlin somewhere
as this was included in an autobuilder test build that is showing as this:

https://autobuilder.yoctoproject.org/typhoon/#/builders/83/builds/2203

i.e. all four selftests have not finished and I'd have expected them to
by now.
(╯°□°)╯︵ ┻━┻


I'm trying not to work today so I haven't debugged them or confirmed where
they are hanging but it seems likely related.
If you're planning to take the day off don't worry about investigating
these. I'll take a look at the patches again on Wednesday. I think the
best approach may be to add some timeouts and maybe more error
handling to the asyncrpc code I extracted from hashserv - if we can
turn these hangs into a proper error then we can reduce the amount of
autobuilder time they take to test and hopefully we'll get a better
insight into what is actually going wrong. My guess is that there's
something in the autobuilder config or just the level of load on the
machines which is aggravating this as the tests finish successfully on
my build machine (with a few expected test failures as noted
previously).

Thanks,

--
Paul Barker
Konsulko Group


Re: [PATCH 0/4] Re-implement prserv on top of asyncrpc

Richard Purdie
 

Hi Paul,

On Fri, 2021-05-28 at 09:42 +0100, Paul Barker wrote:
These changes replace the old XML-based RPC system in prserv with the
new asyncrpc implementation originally used by hashserv. A couple of
improvements are required in asyncrpc to support this.

I finally stumbled across the issue which led to the hanging builds
seen on the autobuilder when testing the initial RFC series.
It was a fairly dumb mistake on my behalf and I'm not sure how it
didn't trigger in my initial testing! The
`PRServerClient.handle_export()` function was missing a call to
`self.write_message()` so the client just ended up stuck waiting for a
response that was never to come. This issue is fixed here.

I've run these changes through both `bitbake-selftest` and
`oe-selftest -a` and all looks good on my end. A couple of failures
were seen in oe-selftest but these are related to my host system
configuration (socat not installed, firewall blocking ports, etc) so
I'm fairly confident they aren't caused by this patch series.
Thanks for these. Unfortunately I think there is still a gremlin somewhere
as this was included in an autobuilder test build that is showing as this:

https://autobuilder.yoctoproject.org/typhoon/#/builders/83/builds/2203

i.e. all four selftests have not finished and I'd have expected them to 
by now.

I'm trying not to work today so I haven't debugged them or confirmed where
they are hanging but it seems likely related.

Cheers,

Richard


SWAT Rotation

Alexandre Belloni
 

Hello 민재,

As discussed last week, you are on SWAT duty this week, could you
confirm you'll be able to work on the topic?

I hope you moved without any issue.

Regards,

--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


qemuarm failure on autobuilder - analysis

Richard Purdie
 

Today's failure for analysis is a qemuarm failure where all the test images 
failed at the same time:

https://autobuilder.yoctoproject.org/typhoon/#/builders/53/builds/3493

We have Randy and team's logging for this here:

https://autobuilder.yocto.io/pub/non-release/20210526-2/testresults/qemuarm/2021-05-26--08-58/host_stats_1_top.txt

What is interesting is that the load average peaked at 300+ (compared to
the usual 50-80) at the point the three qemu-system-arm instances were running.

What is it doing at the time? In parallel to qemuarm there appears to 
be:

reproducible-fedora
(in stage B, non-sstate, i.e. from scratch)
building llvm, webkitgtk, kernel-devsrc, qemu, kea
musl-x86-64
building webkitgtk, piglit, cmake, kernel-devsrc, llvm, stress-ng

which is a heavy workload, as those are all demanding targets.

I question whether cmake's rpmbuild really should be using 7g of RES 
memory (15g VIRT).

The python bitbake-worker processes at near 100% cpu are interesting;
I suspect those are python tasks for recipes. The code is meant to
rename the process so we can identify it better, but that is a tricky
thing under Linux and hints it may not be working.

Bottom line for this one suggests it was load related.

Cheers,

Richard


Re: SWAT statistics for week 19

Alexandre Belloni
 

Hi,

On 25/05/2021 19:33:53-0400, Randy MacLeod wrote:
On 2021-05-18 6:26 p.m., Alexandre Belloni wrote:
Hi,

On 18/05/2021 23:21:53+0100, Ross Burton wrote:
Quick idea for swatbot: a top ten list of open bugs which have the
highest number of instances.
I'm maintaining a spreadsheet that goes a bit beyond that. I'm also
tracking the frequency of the bugs over the last few months, and we have
started to close a few of the older AB-INT issues. I'll share that publicly soon.
Hi Alexandre,

Any update on the list/spreadsheet?
Ah sorry, I forgot to send it. It is up to date with what was triaged
last week:

https://docs.google.com/spreadsheets/d/1bviDvW1SRwflofKLx9SwPUTWE3sBkvL3eb1PKnmZJcM/edit?usp=sharing

Tony is getting going on valgrind and he'll start with:
   https://bugzilla.yoctoproject.org/show_bug.cgi?id=14294

   [Bug 14294] valgrind memcheck/tests/linux/timerfd-syscall ptest
intermittent failure

unless there's another ptest issue that is more urgent.
This is probably the best one to start with, the following one would be
14311 which has more occurrences but is about multiple (related) ptests.

Regards,

--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Re: SWAT statistics for week 19

Randy MacLeod
 

On 2021-05-18 6:26 p.m., Alexandre Belloni wrote:
Hi,

On 18/05/2021 23:21:53+0100, Ross Burton wrote:
Quick idea for swatbot: a top ten list of open bugs which have the
highest number of instances.
I'm maintaining a spreadsheet that goes a bit beyond that. I'm also
tracking the frequency of the bugs over the last few months, and we have
started to close a few of the older AB-INT issues. I'll share that publicly soon.
Hi Alexandre,

Any update on the list/spreadsheet?

Tony is getting going on valgrind and he'll start with:
   https://bugzilla.yoctoproject.org/show_bug.cgi?id=14294

   [Bug 14294] valgrind memcheck/tests/linux/timerfd-syscall ptest intermittent failure

unless there's another ptest issue that is more urgent.


../Randy


Ross

On Tue, 18 May 2021 at 23:19, Alexandre Belloni
<alexandre.belloni@...> wrote:
Hello,

Here are the statistics for last week. Chee Yang was on SWAT duty.

160 failures were triaged:

* 119 by Chee Yang
- 38 for meson changes
- 24 for an issue in meta-arm after an upgrade of u-boot
- 11 for the btrfs-tools upgrade
- 6 for ovmf reproducibility issues
- 2 for meta-oe YP compatibility issues
- 4 new occurrences of bug 14310
- 4 new occurrences of bug 14251
- 3 new occurrences of bug 13802
- 3 new occurrences of bug 14273
- 2 new occurrences of bug 14208
- 2 new occurrences of bug 14381
- 1 new occurrence of bug 14145
- 1 new occurrence of bug 14163
- 1 new occurrence of bug 14165
- 1 new occurrence of bug 14177
- 1 new occurrence of bug 14197
- 1 new occurrence of bug 14201
- 1 new occurrence of bug 14250
- 1 new occurrence of bug 14294
- 1 new occurrence of bug 14296
- 1 new occurrence of bug 14311
- 4 occurrences of new bug 14388
- 2 occurrences of new bug 14393
- 1 occurrence of new bug 14389
- 1 occurrence of new bug 14390
- 1 occurrence of new bug 14391

* 41 by Richard
- 20 for an issue in meta-arm after an upgrade of u-boot
- 10 for issues he fixed
- 4 for the libepoxy upgrade
- 2 for YP compatibility issues in meta-AGL
- 2 for patches merged out of order
- 2 for branch names changed upstream
- 1 because gitlab was down

Regards,

--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com






--
# Randy MacLeod
# Wind River Linux


ubuntu2004-arm-1 load increase and possible instability

Michael Halstead <mhalstead@...>
 

The ubuntu2004-arm-1 worker has been unstable in the past, so we reduced the number of simultaneous builds from 3 to 1 to see if that would stop the crashes. It didn't at first, but the crashes have now stopped, perhaps due to kernel updates. I'm planning to increase the simultaneous builds back to 3 when the controller is next idle. This may cause the crashes to begin again, so I want the SWAT team to be aware of the change.

--
Michael Halstead
Linux Foundation / Yocto Project
Systems Operations Engineer


Re: Further rcu stall on autobuilder

Richard Purdie
 

On Mon, 2021-05-24 at 15:29 +0100, Richard Purdie via lists.yoctoproject.org wrote:
On Mon, 2021-05-24 at 09:21 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:56 PM Richard Purdie
<richard.purdie@...> wrote:

On Sun, 2021-05-23 at 12:51 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:47 PM Richard Purdie
<richard.purdie@...> wrote:
A set of SRCREVs sounds like the best plan, I think it might be worth testing
to see if things improve or not.
I created the attached recipes. Built and booted on qemux86-64 with no
issues.

I assume you'll do the appropriate preferred version in the test
branches to make
sure they are used instead of 5.10 ?
About the time you were writing this, I'd hacked up:

http://git.yoctoproject.org/cgit.cgi/poky/commit/?h=master-next&id=de3e2253482b6d9df1137128a9fde35dec8fd915

and put it into a build on the autobuilder. It caused meta-arm to blow up
and I suspect there may be other fallout but we'll see...

FWIW, I checked with Alexandre and it seems all the rcu failure issues
are on qemuXXX builds but not qemuXXX-alt. The former is 5.10, the latter 
5.4.

I'm starting to strongly suspect there is some issue with 5.10 as we don't
see this with dunfell or with poky-alt :/. I'd wonder why nobody else has
noticed though...
I switched to Bruce's 5.12 patches. Unfortunately even with 5.12:

https://autobuilder.yoctoproject.org/typhoon/#/builders/81/builds/2118/steps/12/logs/stdio

:(

Also,
https://autobuilder.yoctoproject.org/typhoon/#/builders/110/builds/2362
and the corresponding:
https://autobuilder.yocto.io/pub/non-release/20210523-10/testresults/qemuarm-alt/2021-05-24--01-52/host_stats_1_top.txt
is interesting. That was a qemuarm-alt image (5.4 kernel) which could be a genuine load 
issue. It is getting 300% cpu though so hardly resource starved.

Ideas welcome at this point.

Cheers,

Richard


Re: Further ltp hang - kernel issue?

Bruce Ashfield <bruce.ashfield@...>
 

On Mon, May 24, 2021 at 11:31 AM Richard Purdie
<richard.purdie@...> wrote:

On Sun, 2021-05-23 at 07:42 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 6:36 AM Richard Purdie
<richard.purdie@...> wrote:

On Sun, 2021-05-23 at 11:33 +0100, Richard Purdie via lists.yoctoproject.org wrote:
https://autobuilder.yoctoproject.org/typhoon/#/builders/95/builds/1932
Oddly enough,
https://autobuilder.yoctoproject.org/typhoon/#/builders/95/builds/1933
on centos7-ty-4 (master build) is locked up with pretty much exactly
the same issue/ps output/tests/dmesg.

The first one above was debian10-ty-1 with master-next.

Recent kernel version bump?
I can't think of anything specific that would cause those issues, but the
Wind River guys did report some bad iommu patches that were part of
5.10.37

I've merged .38, which has the fixes, but I haven't sent the bumps yet.
It is worth trying the attached SRCREV patch, to see if there's any
change in behaviour.
Thanks for the patch, I ran with it for a number of runs. I have not seen .38
break in the way master or master-next with .37 did. I've run several and 50%
of the time .37 would hang in ltp.

Can we upgrade to .38 ASAP please? :)
Sent. I cherry-picked it from my queue and sent it individually.

I'll continue testing the rest of my updates.

Bruce

This is obviously a separate issue to the rcu stalls but I also think that
is 5.10 related.

Cheers,

Richard


--
- Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end
- "Use the force Harry" - Gandalf, Star Trek II


Re: Further ltp hang - kernel issue?

Richard Purdie
 

On Sun, 2021-05-23 at 07:42 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 6:36 AM Richard Purdie
<richard.purdie@...> wrote:

On Sun, 2021-05-23 at 11:33 +0100, Richard Purdie via lists.yoctoproject.org wrote:
https://autobuilder.yoctoproject.org/typhoon/#/builders/95/builds/1932
Oddly enough,
https://autobuilder.yoctoproject.org/typhoon/#/builders/95/builds/1933
on centos7-ty-4 (master build) is locked up with pretty much exactly
the same issue/ps output/tests/dmesg.

The first one above was debian10-ty-1 with master-next.

Recent kernel version bump?
I can't think of anything specific that would cause those issues, but the
Wind River guys did report some bad iommu patches that were part of
5.10.37

I've merged .38, which has the fixes, but I haven't sent the bumps yet.
It is worth trying the attached SRCREV patch, to see if there's any
change in behaviour.
Thanks for the patch, I ran with it for a number of runs. I have not seen .38
break in the way master or master-next with .37 did. I've run several and 50%
of the time .37 would hang in ltp.

Can we upgrade to .38 ASAP please? :)

This is obviously a separate issue to the rcu stalls but I also think that
is 5.10 related.

Cheers,

Richard


Re: Further rcu stall on autobuilder

Richard Purdie
 

On Mon, 2021-05-24 at 09:21 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:56 PM Richard Purdie
<richard.purdie@...> wrote:

On Sun, 2021-05-23 at 12:51 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:47 PM Richard Purdie
<richard.purdie@...> wrote:
A set of SRCREVs sounds like the best plan, I think it might be worth testing
to see if things improve or not.
I created the attached recipes. Built and booted on qemux86-64 with no
issues.

I assume you'll do the appropriate preferred version in the test
branches to make
sure they are used instead of 5.10 ?
About the time you were writing this, I'd hacked up:

http://git.yoctoproject.org/cgit.cgi/poky/commit/?h=master-next&id=de3e2253482b6d9df1137128a9fde35dec8fd915

and put it into a build on the autobuilder. It caused meta-arm to blow up
and I suspect there may be other fallout but we'll see...

FWIW, I checked with Alexandre and it seems all the rcu failure issues
are on qemuXXX builds but not qemuXXX-alt. The former is 5.10, the latter 
5.4.

I'm starting to strongly suspect there is some issue with 5.10 as we don't
see this with dunfell or with poky-alt :/. I'd wonder why nobody else has
noticed though...

Cheers,

Richard


Re: Further rcu stall on autobuilder

Bruce Ashfield <bruce.ashfield@...>
 

On Sun, May 23, 2021 at 12:56 PM Richard Purdie
<richard.purdie@...> wrote:

On Sun, 2021-05-23 at 12:51 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:47 PM Richard Purdie
<richard.purdie@...> wrote:

We've got yet another rcu stall failure on the autobuilder:

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/2123/steps/15/logs/stdio

and looking at the dmesg in the qemu log:

[ 20.424033] Freeing unused kernel image (rodata/data gap) memory: 652K
[ 20.425229] Run /sbin/init as init process
INIT: version 2.99 booting
FBIOPUT_VSCREENINFO failed, double buffering disabledStarting udev
[ 20.547298] udevd[161]: starting version 3.2.10
[ 20.553329] udevd[162]: starting eudev-3.2.10
[ 20.751260] EXT4-fs (vda): re-mounted. Opts: (null)
[ 20.752548] ext4 filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
INIT: Entering runlevel: 5
Configuring network interfaces... RTNETLINK answers: File exists
Starting random number generator daemon.
Starting OpenBSD Secure Shell server: sshd
done.
Starting rpcbind daemon...done.
starting statd: done
Starting atd: OK
[ 21.921925] Installing knfsd (copyright (C) 1996 okir@...).
starting 8 nfsd kernel threads: [ 23.066283] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 23.068096] NFSD: Using legacy client tracking operations.
[ 23.069086] NFSD: starting 90-second grace period (net f0000098)
done
starting mountd: [ 45.272151] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 45.273423] rcu: 1-...0: (10 ticks this GP) idle=7ba/1/0x4000000000000000 softirq=598/612 fqs=5249
[ 45.274951] (detected by 2, t=21002 jiffies, g=-195, q=13)
[ 138.202149] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 332.762209] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

This is with the kvm clock source disabled (in master-next) and with Bruce's
5.10.38 upgrade so that kind of rules out either of those two things for this
issue. It also can't be the qemu platform or cpu emulation used since we've
changed that.

What is really odd is that it never actually prints the stalled tasks. That
seems really strange. It is obviously alive enough to print a stall message
later but stalls out and is terminated after 1500s.

Really open to ideas at this point. Should we try a newer kernel version
for testing in -next, see if we can isolate this to 5.10?
If you want to switch to linux-yocto-dev, it is on 5.12.x, and I have
a local 5.13-rcX version of -dev.

We could whip together a SRCREV recipe for it, if you don't want to
use the AUTOREV.

I'm not going to do a full versioned linux-yocto for 5.12, but we can
special case this if we want to go that route.
A set of SRCREVs sounds like the best plan, I think it might be worth testing
to see if things improve or not.
I created the attached recipes. Built and booted on qemux86-64 with no
issues.

I assume you'll do the appropriate preferred version in the test
branches to make
sure they are used instead of 5.10 ?
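For what it's worth, pinning the test branches to the 5.12 recipes could look something like the following local.conf fragment; the version string is an assumption based on this discussion, not the actual patch:

```conf
# Hypothetical local.conf fragment: prefer the 5.12 SRCREV recipes
# over the default 5.10 linux-yocto for testing.
PREFERRED_VERSION_linux-yocto = "5.12%"
```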

Bruce


What is also odd is that in that same build, another qemu instance
hung in syslinux loading bzImage. We've seen this before occasionally and
it seems to keep happening periodically. That would seem more like a qemu
bug yet we're on the latest qemu release :/.

In neither case did Randy's stall detector trigger as far as I can tell.

Cheers,

Richard

--
- Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end
- "Use the force Harry" - Gandalf, Star Trek II


Re: SWAT Rotation

Alexandre Belloni
 

Hello Jaga,

Are you available for SWAT this week?

I'll be looking at some of the failures today.

On 22/05/2021 11:44:56+0900, 김민재 wrote:
Hi Alexandre


I am sorry, I can't work next week because I am moving house this
weekend. So I can do SWAT work on June 1st.

Can I delay my rotation by just one week?
Sure, no problem


--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Re: Further rcu stall on autobuilder

Richard Purdie
 

On Sun, 2021-05-23 at 12:51 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:47 PM Richard Purdie
<richard.purdie@...> wrote:

We've got yet another rcu stall failure on the autobuilder:

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/2123/steps/15/logs/stdio

and looking at the dmesg in the qemu log:

[ 20.424033] Freeing unused kernel image (rodata/data gap) memory: 652K
[ 20.425229] Run /sbin/init as init process
INIT: version 2.99 booting
FBIOPUT_VSCREENINFO failed, double buffering disabledStarting udev
[ 20.547298] udevd[161]: starting version 3.2.10
[ 20.553329] udevd[162]: starting eudev-3.2.10
[ 20.751260] EXT4-fs (vda): re-mounted. Opts: (null)
[ 20.752548] ext4 filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
INIT: Entering runlevel: 5
Configuring network interfaces... RTNETLINK answers: File exists
Starting random number generator daemon.
Starting OpenBSD Secure Shell server: sshd
done.
Starting rpcbind daemon...done.
starting statd: done
Starting atd: OK
[ 21.921925] Installing knfsd (copyright (C) 1996 okir@...).
starting 8 nfsd kernel threads: [ 23.066283] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 23.068096] NFSD: Using legacy client tracking operations.
[ 23.069086] NFSD: starting 90-second grace period (net f0000098)
done
starting mountd: [ 45.272151] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 45.273423] rcu: 1-...0: (10 ticks this GP) idle=7ba/1/0x4000000000000000 softirq=598/612 fqs=5249
[ 45.274951] (detected by 2, t=21002 jiffies, g=-195, q=13)
[ 138.202149] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 332.762209] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

This is with the kvm clock source disabled (in master-next) and with Bruce's
5.10.38 upgrade so that kind of rules out either of those two things for this
issue. It also can't be the qemu platform or cpu emulation used since we've
changed that.

What is really odd is that it never actually prints the stalled tasks. That
seems really strange. It is obviously alive enough to print a stall message
later but stalls out and is terminated after 1500s.

Really open to ideas at this point. Should we try a newer kernel version
for testing in -next, see if we can isolate this to 5.10?
If you want to switch to linux-yocto-dev, it is on 5.12.x, and I have
a local 5.13-rcX version of -dev.

We could whip together a SRCREV recipe for it, if you don't want to
use the AUTOREV.

I'm not going to do a full versioned linux-yocto for 5.12, but we can
special case this if we want to go that route.
A set of SRCREVs sounds like the best plan, I think it might be worth testing
to see if things improve or not.

What is also odd is that in that same build, another qemu instance
hung in syslinux loading bzImage. We've seen this before occasionally and
it seems to keep happening periodically. That would seem more like a qemu
bug yet we're on the latest qemu release :/.

In neither case did Randy's stall detector trigger as far as I can tell.

Cheers,

Richard


Re: Further rcu stall on autobuilder

Bruce Ashfield <bruce.ashfield@...>
 

On Sun, May 23, 2021 at 12:47 PM Richard Purdie
<richard.purdie@...> wrote:

We've got yet another rcu stall failure on the autobuilder:

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/2123/steps/15/logs/stdio

and looking at the dmesg in the qemu log:

[ 20.424033] Freeing unused kernel image (rodata/data gap) memory: 652K
[ 20.425229] Run /sbin/init as init process
INIT: version 2.99 booting
FBIOPUT_VSCREENINFO failed, double buffering disabledStarting udev
[ 20.547298] udevd[161]: starting version 3.2.10
[ 20.553329] udevd[162]: starting eudev-3.2.10
[ 20.751260] EXT4-fs (vda): re-mounted. Opts: (null)
[ 20.752548] ext4 filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
INIT: Entering runlevel: 5
Configuring network interfaces... RTNETLINK answers: File exists
Starting random number generator daemon.
Starting OpenBSD Secure Shell server: sshd
done.
Starting rpcbind daemon...done.
starting statd: done
Starting atd: OK
[ 21.921925] Installing knfsd (copyright (C) 1996 okir@...).
starting 8 nfsd kernel threads: [ 23.066283] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 23.068096] NFSD: Using legacy client tracking operations.
[ 23.069086] NFSD: starting 90-second grace period (net f0000098)
done
starting mountd: [ 45.272151] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 45.273423] rcu: 1-...0: (10 ticks this GP) idle=7ba/1/0x4000000000000000 softirq=598/612 fqs=5249
[ 45.274951] (detected by 2, t=21002 jiffies, g=-195, q=13)
[ 138.202149] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 332.762209] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

This is with the kvm clock source disabled (in master-next) and with Bruce's
5.10.38 upgrade so that kind of rules out either of those two things for this
issue. It also can't be the qemu platform or cpu emulation used since we've
changed that.

What is really odd is that it never actually prints the stalled tasks. That
seems really strange. It is obviously alive enough to print a stall message
later but stalls out and is terminated after 1500s.

Really open to ideas at this point. Should we try a newer kernel version
for testing in -next, see if we can isolate this to 5.10?
If you want to switch to linux-yocto-dev, it is on 5.12.x, and I have
a local 5.13-rcX version of -dev.

We could whip together a SRCREV recipe for it, if you don't want to
use the AUTOREV.

I'm not going to do a full versioned linux-yocto for 5.12, but we can
special case this if we want to go that route.

Bruce



Cheers,

Richard

--
- Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end
- "Use the force Harry" - Gandalf, Star Trek II


Further rcu stall on autobuilder

Richard Purdie
 

We've got yet another rcu stall failure on the autobuilder:

https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/2123/steps/15/logs/stdio

and looking at the dmesg in the qemu log:

[ 20.424033] Freeing unused kernel image (rodata/data gap) memory: 652K
[ 20.425229] Run /sbin/init as init process
INIT: version 2.99 booting
FBIOPUT_VSCREENINFO failed, double buffering disabledStarting udev
[ 20.547298] udevd[161]: starting version 3.2.10
[ 20.553329] udevd[162]: starting eudev-3.2.10
[ 20.751260] EXT4-fs (vda): re-mounted. Opts: (null)
[ 20.752548] ext4 filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
INIT: Entering runlevel: 5
Configuring network interfaces... RTNETLINK answers: File exists
Starting random number generator daemon.
Starting OpenBSD Secure Shell server: sshd
done.
Starting rpcbind daemon...done.
starting statd: done
Starting atd: OK
[ 21.921925] Installing knfsd (copyright (C) 1996 okir@...).
starting 8 nfsd kernel threads: [ 23.066283] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 23.068096] NFSD: Using legacy client tracking operations.
[ 23.069086] NFSD: starting 90-second grace period (net f0000098)
done
starting mountd: [ 45.272151] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 45.273423] rcu: 1-...0: (10 ticks this GP) idle=7ba/1/0x4000000000000000 softirq=598/612 fqs=5249
[ 45.274951] (detected by 2, t=21002 jiffies, g=-195, q=13)
[ 138.202149] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 332.762209] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

This is with the kvm clock source disabled (in master-next) and with Bruce's 
5.10.38 upgrade so that kind of rules out either of those two things for this
issue. It also can't be the qemu platform or cpu emulation used since we've
changed that.

What is really odd is that it never actually prints the stalled tasks. That
seems really strange. It is obviously alive enough to print a stall message
later but stalls out and is terminated after 1500s.

Really open to ideas at this point. Should we try a newer kernel version
for testing in -next, see if we can isolate this to 5.10?
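As a diagnostic aside, the stall detector's reporting threshold is tunable. These are standard kernel knobs (shown only as a debugging aid while we isolate this, not as a fix) and would at least tell us whether the stalls eventually clear on a loaded host:

```shell
# On the kernel command line: report RCU stalls after 60s instead of
# the default 21s (which matches the t=21002 jiffies in the log above)
rcupdate.rcu_cpu_stall_timeout=60

# Or adjusted at runtime:
echo 60 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
```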

Cheers,

Richard


Re: Further ltp hang - kernel issue?

Bruce Ashfield <bruce.ashfield@...>
 

On Sun, May 23, 2021 at 6:36 AM Richard Purdie
<richard.purdie@...> wrote:

On Sun, 2021-05-23 at 11:33 +0100, Richard Purdie via lists.yoctoproject.org wrote:
https://autobuilder.yoctoproject.org/typhoon/#/builders/95/builds/1932
Oddly enough,
https://autobuilder.yoctoproject.org/typhoon/#/builders/95/builds/1933
on centos7-ty-4 (master build) is locked up with pretty much exactly
the same issue/ps output/tests/dmesg.

The first one above was debian10-ty-1 with master-next.

Recent kernel version bump?
I can't think of anything specific that would cause those issues, but the
Wind River guys did report some bad iommu patches that were part of
5.10.37

I've merged .38, which has the fixes, but I haven't sent the bumps yet.
It is worth trying the attached SRCREV patch, to see if there's any
change in behaviour.

Bruce


Cheers,

Richard

--
- Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end
- "Use the force Harry" - Gandalf, Star Trek II


Re: Further ltp hang - kernel issue?

Richard Purdie
 

On Sun, 2021-05-23 at 11:33 +0100, Richard Purdie via lists.yoctoproject.org wrote:
https://autobuilder.yoctoproject.org/typhoon/#/builders/95/builds/1932
Oddly enough, 
https://autobuilder.yoctoproject.org/typhoon/#/builders/95/builds/1933 
on centos7-ty-4 (master build) is locked up with pretty much exactly 
the same issue/ps output/tests/dmesg.

The first one above was debian10-ty-1 with master-next.

Recent kernel version bump?

Cheers,

Richard
