Yocto Autobuilder: Latency Monitor and AB-INT - Meeting notes: July 29, 2021

Randy MacLeod

YP AB Intermittent failures meeting
July 29, 2021, 9 AM ET

Attendees: Tony, Richard, Trevor, Randy, Sakib!


ptest failure rates have improved again, but there's still room
for improvement.

The make/ninja load average limit is in but it's not clear
if it's effective yet.

We tried mechanically summarizing all the high latency top logs
for the last week. No firm conclusions but initial thoughts below.

If anyone wants to help, we could use more eyes on the logs,
particularly the summary logs, and help understanding the iostat
numbers when the dd test times out.

Plans for the week:

All: Wait and see if the ptest failure rate continues to be lower
than previous weeks.

Sakib: hook the more responsive load average into the latency test. (v2)
Trevor: patch to set PARALLEL_MAKE : -l 50
-> dunfell, gatesgarth, hardknott
Investigate dunfell which failed with this change.
Saul: on vacation
Randy: Look at performance data

Meeting Notes:

1. job server

- ninja could be patched next with make's more responsive algorithm,
or is this good enough?

- Richard suggested that we extract make's code for measuring the load
average to a separate binary and run it in the periodic io latency
test. Also can we translate it to python?
- Trevor is working on this but ran into some problems, so it moves to next week.
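
As a possible starting point for the Python translation Richard asked
about, here is a minimal sketch. It is NOT make's code. It assumes that a
more responsive signal than the 1-minute average can be taken from the
fourth field of Linux's /proc/loadavg (runnable/total scheduling
entities). The names LoadSample, parse_loadavg, read_loadavg and
may_start_job are hypothetical. The start decision follows the documented
behaviour of make's -l option: no new job is started while others are
running and the load is at least the limit.

```python
from typing import NamedTuple


class LoadSample(NamedTuple):
    load1: float    # 1-minute load average
    load5: float    # 5-minute load average
    load15: float   # 15-minute load average
    runnable: int   # currently runnable scheduling entities
    total: int      # total scheduling entities


def parse_loadavg(text: str) -> LoadSample:
    """Parse one line in /proc/loadavg format, e.g. '0.20 0.18 0.12 1/80 11206'."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return LoadSample(float(fields[0]), float(fields[1]), float(fields[2]),
                      int(runnable), int(total))


def read_loadavg(path: str = "/proc/loadavg") -> LoadSample:
    """Read the current sample (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())


def may_start_job(running_jobs: int, load: float, limit: float) -> bool:
    """Mirror the documented 'make -l' rule: always allow the first job,
    otherwise start a new one only while the load is below the limit."""
    return running_jobs == 0 or load < limit
```

In the periodic io latency test, read_loadavg() could be logged on each
interval, so that load1 and the runnable count can be compared against
the limit when a high latency event is recorded.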

2. AB status

ptest cases are improving, we may be close to done!
Let's wait a week to see how things go. (July 29: we're not done...)

- development week with lots of failures and a-quick builds
so it's hard to say.

3. Nothing new on this item this week (July 29):
Richard reported
- something really flaky going on with serial ports.
- particularly bad on qemuppc but also x86.
- related to Saul's QMP data dump?
- July 22/29: We didn't talk about this issue this week.

4. Sakib's improvements to the logging are merged.
We think Michael needs to update the script that generates the
web page. Randy/Sakib to talk with Michael.
-- Done.

Sakib generated a summary of all high latency 'top' logs from
~July 23->July 29 by just running his summary script on the
merged raw top logs. See all_summary attached.

You can see what compilation jobs are most frequently associated
with high latency events by:
$ grep GCC ~/Downloads/all_summary.txt | less

They are: linux-yocto, llvm, qemu and gtk+.

# quick top 10 list:
$ grep GCC ~/Downloads/all_summary.txt | grep cc | head -10 | \
cut -d"/" -f1,4,5,8

104 ~/genericx86_64-poky-linux/linux-yocto/x86_64-poky-linux
94 ~/core2-32-poky-linux-musl/llvm/i686-poky-linux-musl
89 ~/i686-nativesdk-pokysdk-linux/nativesdk-qemu/i686-pokysdk-linux
74 ~/core2-32-poky-linux-musl/gtk+3/i686-poky-linux-musl
64 ~/build-st/reproducibleB/core2-64-poky-linux
59 ~/cortexa8hf-neon-poky-linux-gnueabi/qemu/arm-poky-linux-gnueabi
53 ~/qemux86-poky-linux/perf/i686-poky-linux
40 ~/core2-64-poky-linux/glibc/x86_64-poky-linux
39 ~/build-st/reproducibleA/core2-64-poky-linux
38 ~/ppc7400-poky-linux/ofono/powerpc-poky-linux
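
For anyone who would rather post-process the summary in Python, the
cut -d"/" -f1,4,5,8 step can be sketched as follows. The helper name
cut_fields and the sample input line are hypothetical (the raw
all_summary.txt lines are not reproduced in these notes). Field numbers
are 1-based, as in cut.

```python
def cut_fields(line: str, fields=(1, 4, 5, 8), delim: str = "/") -> str:
    """Keep only the selected 1-based fields, like `cut -d"/" -f1,4,5,8`."""
    parts = line.rstrip("\n").split(delim)
    if len(parts) == 1:
        # cut prints lines without the delimiter unchanged (unless -s is given).
        return parts[0]
    kept = [parts[i - 1] for i in sorted(fields) if i <= len(parts)]
    return delim.join(kept)


# Hypothetical input line: a count followed by a compiler path.
sample = "    104 ~/a/b/genericx86_64-poky-linux/linux-yocto/v/w/x86_64-poky-linux/gcc"
print(cut_fields(sample))
```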

If you look at the non-GCC activities that are not part of the
base OS activities you see processes such as:
make, mv, perl, tar, pseudo, rm, ninja

$ grep -v GCC ~/Downloads/all_summary.txt | grep -A 33 "Userspace Process Summary:"

Userspace Process Summary:
12326 bitbake-server
12145 python3
8112 /bin/sh
7213 /bin/bash
5207 make
1329 /usr/bin/python3
826 mv
758 (sd-pam)
715 perl
694 x86_64-poky-linux-gcc
620 top
587 sshd:
580 bash
566 /lib/systemd/systemd
561 -bash
476 tar
398 sh
397 arm-poky-linux-gnueabi-gcc
386 /usr/bin/dbus-daemon
382 gcc
379 /usr/sbin/irqbalance
379 /sbin/agetty
373 ~/pkgman-rpm-non-rpm/build/build/tmp/sysroots-components/x86_64/pseudo-native/usr/bin/pseudo
360 qmgr
360 dpkg-deb
358 pickup
356 rm
351 as
335 /usr/sbin/cron
333 /usr/sbin/rsyslogd
326 /usr/sbin/atd
314 /usr/sbin/sshd
296 ninja

mv is likely blocked on IO (Sakib, please confirm from the logs).
Since make is around more than ninja, we may be able to better
control the load using the 'load average' limit and not have to patch
ninja (with make's enhancement) to be more responsive.
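
To make the make-versus-ninja comparison concrete, here is a small sketch
that parses "<count> <name>" lines in the format of the Userspace Process
Summary and drops base OS entries. The function name parse_summary and
the BASE_OS set are assumptions for illustration, not part of Sakib's
summary script. The counts are copied from the listing in these notes.

```python
def parse_summary(text: str) -> dict:
    """Map process name -> count from lines of the form '<count> <name>'."""
    counts = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            counts[parts[1].strip()] = int(parts[0])
    return counts


# A few rows copied from the summary in these notes.
SAMPLE = """\
12326 bitbake-server
5207 make
826 mv
715 perl
620 top
476 tar
356 rm
296 ninja
"""

# Assumed list of names to ignore (base OS / build-system overhead).
BASE_OS = {"bitbake-server", "top"}

counts = parse_summary(SAMPLE)
build_tools = {name: n for name, n in counts.items() if name not in BASE_OS}
for name, n in sorted(build_tools.items(), key=lambda kv: -kv[1]):
    print(f"{n:6d} {name}")
print(f"make/ninja ratio: {counts['make'] / counts['ninja']:.1f}")
```

With these counts, make appears roughly 17 to 18 times as often as ninja
in the high latency logs.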

a. script to gather the file: sum_sum.py
b. ./summarize_to_outup.py all <directory w/ all the interval files>

More analysis required....

5. (From July 8)
Richard says that we may need to redesign the data collection system
that Sakib's AB INT tests are based on.
He was worried the current approach does NOT cover oe-selftests but
we do see it when we see the AB-INT trigger from builds. Not sure
if we need to change anything yet. Everything goes through
run-command in yocto-ab-helper.

Still relevant parts of
Previous Meeting Notes:

4. bitbake server timeout (no change: July 29)

"Timeout while waiting for a reply from the bitbake server (60s)"

Randy mentioned that the bitbake server timeouts seen in the
Wind River build cluster have gone away after upgrading to
a newer version of docker.

Old: Docker Version: Docker version 18.09.4, build d14af54266
New: Docker Version: Docker version 20.10.7, build f0df350

Clearly the YP ABs aren't running in docker, but what
about firmware and kernel tunings?


Is the BIOS/firmware kept up to date on most nodes?
- July 22: This was done.

For the performance builder trend see:



- CentOS-7 seems to take less time (~ 1 min),
- Ubuntu-16.04 seems to take more (~ 5 min)
That's a bit surprising!
Randy to look at commit number 62659 in poky.

5. io stalls (no update: July 29)

Richard said that it would make sense to write an ftrace utility
/ script to monitor io latency, and we could install it with sudo.
Ch^W mentioned ftrace on IRC.
Sakib and Randy will work on that but not for a week or two.

