Yocto Autobuilder: Latency Monitor and AB-INT - Meeting notes: Dec 9, 2021


Randy MacLeod
 

YP AB Intermittent failures meeting
===================================
https://windriver.zoom.us/j/3696693975

Attendees: Richard, Trevor, Randy, Saul, AlexB


Summary:
========

--------------------------------------------------------------
People have been busy with other work so this is mostly a duplicate
of the previous minutes with some cyptic? IRC chat logs added below.
--------------------------------------------------------------

Ptest results continue to improve yet again but there's still room
for even more improvement.

Alex made a graph of the number of AB INT issues per week:
https://bootlin.com/~alexandre/SWAT_stats.png
We assume that week 15, 16 was when the RCU bug in he kernel
started being a problem and week 29 was when it go fixed but
more careful analysis is required.

The make/ninja load average limit is in but it's not clear
if it's effective yet and it breaks dunfell.
Trevor has a build of dunfell that with some patches appears to work.


If anyone wants to help, we could use more eyes on the logs,
particularly the summary logs and understanding iostat #
when the dd test times out.



Plans for the week:
===================


Richard: M1
Alex: look into Rest API for BZ as part of Triage.
Sakib: hook more responsive load average in to latency test. (v3)
: Add PSI (/proc/pressure/*) when available
Trevor: No AB work
Saul: No AB work
Randy: PSI, simple experiments to learn what's 'normal',
make some PSI graphs!

Idea: bitbake/make/ninja use /proc/pressure to throttle builds?

../Randy

Meeting Notes:
==============


Cryptic notes this week since I don't have time for a proper summary!


[09:01] <vmeson> Agenda Items?
[09:01] <vmeson> 1. PSI /proc/pressure
see: https://www.kernel.org/doc/html/latest/accounting/psi.html

[09:01] <vmeson> 2. Status of AB
[09:02] <vmeson> 3. Randy promises valgrind patches (again!).

Discussion:

Randy explained what the PSI proc data was and
how overloaded ubuntu2004-ty-2.yocto.io was for most of a 25 hour
logging window. Typical high load data is:

pressure.cpu

some avg10=0.00 avg60=0.00 avg300=1.12 total=8891584119

some avg10=0.00 avg60=0.00 avg300=1.00 total=8891596517

some avg10=0.00 avg60=0.00 avg300=0.90 total=8891613258

some avg10=0.00 avg60=0.00 avg300=0.81 total=8891631326

pressure.memory

some avg10=0.00 avg60=0.00 avg300=0.00 total=237120503

full avg10=0.00 avg60=0.00 avg300=0.00 total=194500170

some avg10=0.00 avg60=0.00 avg300=0.00 total=237121370

full avg10=0.00 avg60=0.00 avg300=0.00 total=194500170

pressure.io

some avg10=65.22 avg60=56.24 avg300=40.82 total=142092410628

full avg10=65.22 avg60=56.23 avg300=40.28 total=134090291475

some avg10=48.50 avg60=52.56 avg300=41.42 total=142106526703

full avg10=48.50 avg60=52.56 avg300=40.93 total=134104406782


This is just:
for i in pressure.cpu pressure.memory pressure.io; do \
echo $i; tail -4 $i; \
done


There are two lines per call for some of the /proc/pressure files:
$ wc -l pressure*

3000 pressure.cpu

6000 pressure.io

6000 pressure.memory



Looking at the max io load for the times when the 'full' system
is overloaded:

$ grep full pressure.io | cat -n > pressure.io-full.numbered
$ sed -e 's/=/ /g' pressure.io-full.numbered | sort -k 4 -n | tail -2
1710 full avg10 97.52 avg60 93.85 avg300 92.29 total 124322928244

1699 full avg10 97.57 avg60 93.16 avg300 91.36 total 124015261496


# make a link so we have *full.numbered files for each subsystem:
$ ln -s pressure.cpu.numbered pressure.cpu-ful.numbered

$ for i in pressure.cpu pressure.memory pressure.io; do \
echo $i; egrep ' 1710| 1699' $i-full.numbered; \
done

pressure.cpu

1699 some avg10=0.00 avg60=0.00 avg300=0.00 total=7340981927

1710 some avg10=0.00 avg60=0.00 avg300=0.00 total=7341522636

pressure.memory

1699 full avg10=0.00 avg60=0.00 avg300=0.00 total=131800420

1710 full avg10=0.00 avg60=0.00 avg300=0.00 total=131800502

pressure.io

1699 full avg10=97.57 avg60=93.16 avg300=91.36 total=124015261496

1710 full avg10=97.52 avg60=93.85 avg300=92.29 total=124322928244




Conclusion, no cpu, memory contention, lots of IO -- clean-up or ???





[09:22] <vmeson> read-only YP BZ database query to find "AB-INT" build failures, find build and highlight in "non-release" AB build summary.
[09:23] <vmeson> add /proc/pressure info to test results logs.

Sakib and I may do this, else Alex in January

[09:24] <vmeson> network access: weird AB bugs: DNS or connextion failure, at times even qemu net connection fails!
[09:26] <vmeson> systemd-networkd at times monitors and maybe even changes (if up/down?) tap interfaces

This seems to be hard-coded, mandatory behaviour in systemd managed
distros.

[09:26] <vmeson> Test system, start / stop qemu instances in a loop.
[09:29] <vmeson> say: 16 qemus booting core-image-minimal and doing a simple test such as a network connection (download curl), along with stress --mumble-args to load the host sytem to simulate a build.
[09:30] <abelloni> vmeson: https://bugzilla.yoctoproject.org/show_bug.cgi?id=14467
[09:32] <vmeson> use rest API to query the bugzilla -- where's the API?
[09:33] <abelloni> https://wiki.mozilla.org/Bugzilla:REST_API
[09:33] <abelloni> I'll have a look at that
[09:37] <RP> "BzAPI is an alternate, deprecated REST API" - that will be ours as we're not on bz5 yet



Notes from previous meeting which are largely still relevant.

1. job server

- ninja could be patched with make's more responsive algorithm
next or is this good enough?

Aug 26:
Randy made some graphs that show that the -l NUM results
in the number of compile jobs oscillates *wildly* between 0 and 200
on a 192 core builder compiling chromium. What I did was:
$ bitbake -c cleansstate chromium-x11
$ bitbake -c configure chromium-x11
$ bitbake -c compile chromium-x11
and while that compile was running:
$ while [ ! -f /tmp/compiling-chromium-is-done ]; do \
cat /proc/loadavg >> procs-load.log ; sleep 0.5 ;
done
Results so far:
https://postimg.cc/gallery/3hjfYfG/f8f46c97
Next step is either:
a. collect data as above for an image build and see if the sub-optimal
ninja behaviour makes a difference
and/or
b. patch ninja with make's more responsive load avg
algorithm:
https://git.savannah.gnu.org/cgit/make.git/commit/?id=d8728efc8


- Richard suggested that we extract make's code for measuring the load
average to a separate binary and run it in the periodic io latency
test. Also can we translate it to python?
- Trevor is working on this and had some problems so next week.
(Aug 19 - Trevor is back from vaction so maybe next week.)

- Trevor to see if the load average change really did reduce load
on WR build systems. (Aug 19)

2. AB status

Trevor is learning about buildbot and working on a scheduling bug
(CentOS worker?)

bitbake layer setup tool should allow multiple backends:
eg: kas, a y-a-helper.

ptest cases are improving, we may be close to done!
Let's wait a week to see how things go.
(July29, Aug 5, Aug 19, we're not done...)

- lttng-tools ptest is failing. RP is working on it with upstream.
The timeout (done on Aug 5) increase hasn't helped.


3. Sakib's improvements to the logging are merged.

Sakib generated a summary of all high latency 'top' logs from
~July 23->July 29 by just running his summary script on the
merged raw top logs.

More analysis required....


Still relevant parts of
Previous Meeting Notes:
=======================


4. bitbake server timeout ( no change july 29, Aug 19, Oct 7)

"Timeout while waiting for a reply from the bitbake server (60s)"

5. io stalls (no update: July 29, Oct 7)

Richard said that it would make sense to write an ftrace utility
/ script to monitor io latency and we could install it with sudo
Ch^W mentioned ftrace on IRC.
Sakib and Randy will work on that but not for a week or two
or longer! (Aug 19).

Randy collected iostat data on 3 build server:
https://postimg.cc/gallery/8cN6LYB
We agreed that having -ty-2 be ~ 100 utilization for many hours
in a row is not acceptable and that a threshold of ~ 10 minutes
at 100% utilization may be a reasonable limt. I need to figure out
if I can get data on the fraction of IO done per IO clas since
we do use ionice to do clean-up and other activities.


../Randy

Join yocto@lists.yoctoproject.org to automatically receive all group messages.