Yocto Autobuilder: Latency Monitor and AB-INT - Meeting notes: June 17, 2021


Randy MacLeod
 

Join Zoom Meeting - 9 AM ET
https://windriver.zoom.us/j/3696693975

Attendees: Alex, Richard, Saul, Randy, Tony, Trevor, Sakib


Summary: Things are improving somewhat on the autobuilder,
RCU stalls are the top problem now.


1.
LTP kernel BUG:
Many thanks to Paul Gortmaker for his work on this!


2. The most common problem now is the qemu RCU hang.

For example these builds:

https://autobuilder.yoctoproject.org/typhoon/#/builders/73/builds/3541/steps/13/logs/stdio

https://autobuilder.yoctoproject.org/typhoon/#/builders/73/builds/3541/steps/13/logs/stdio

Richards links on RCU stall detection, and tuning parameters:
https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt
https://lwn.net/Articles/777214/

Next:
- Ask around for advice on qemu debugging.
- RP thinks that the underlying system has a problem:
CPU or other overload.
We do see that there are two qemus that are using lots of CPU in
the links above.
Richard says that the likely activity is:
- core-image-sato-sdk, compiler tests
- core-image-sato lighter general tests
Alex thinks that the particular workload is not significant.
- run two qemu in a controlled env, with stress-ng.
- iostat will help - Sakib.

3. Valgrind ptest results are getting better.

4. ptest issues are coming along, with util-linux being the next
thing to be merged today likely.

5. On the ubuntu-18.04 builders, we seem to see issues there,
we dont' know why, maybe only that we have more of those workers...
Alex, could you possibly get failures per worker statistics?


6. discussed Sakib's summary script. It's coming along.
TO DO:
- special activities: rm (of trash), tar, qemu*
- report all zombies
(The current hoard are due to Paul Barker's patch)
7.
make: job server
- the fifo was being re-created by the wrapper on each call
so Trevor will fix that.



8. From last week, I don't think we've increased the timeouts:

- qemu-runner? timeout increase 120 -> 240
- ptest timeouts 300 -> 450?




8.
Plans for the week:

Richard: RCU stall
Alex:
Sakib: task summary
Trevor: make job server
Tony: ptests and work with upstream valgrind on fixing bugs.
Saul: (1 week) have QMP deal with sigusr1 to close the QMP socket
Randy: coffee, herd cats!!


../Randy


Alexandre Belloni
 

On 17/06/2021 10:05:48-0400, Randy MacLeod wrote:
5. On the ubuntu-18.04 builders, we seem to see issues there,
we dont' know why, maybe only that we have more of those workers...
Alex, could you possibly get failures per worker statistics?
Specifically for bug 14273 (the rcu stall is seen on the logs):

ubuntu1804-ty-1 6 23.1%
ubuntu1804-ty-2 5 19.2%
ubuntu1804-ty-3 12 38.5%
fedora31-ty-1 2 7.7%
debian8-ty-1 3 11.5%


6. discussed Sakib's summary script. It's coming along.
TO DO:
- special activities: rm (of trash), tar, qemu*
- report all zombies
(The current hoard are due to Paul Barker's patch)
7.
make: job server
- the fifo was being re-created by the wrapper on each call
so Trevor will fix that.



8. From last week, I don't think we've increased the timeouts:

- qemu-runner? timeout increase 120 -> 240
- ptest timeouts 300 -> 450?




8.
Plans for the week:

Richard: RCU stall
Alex:
Sakib: task summary
Trevor: make job server
Tony: ptests and work with upstream valgrind on fixing bugs.
Saul: (1 week) have QMP deal with sigusr1 to close the QMP socket
Randy: coffee, herd cats!!


../Randy
--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com