Yocto Autobuilder: Latency Monitor and AB-INT - Meeting notes: Aug 5, 2021

Randy MacLeod

YP AB Intermittent failures meeting
Aug 5, 2021, 9 AM ET

Attendees: Tony, Richard, Trevor, Randy, Sakib!


ptest failures again are better but there's still room
for improvement.

The make/ninja load average limit is in but it's not clear
if it's effective yet and it breaks dunfell. Trevor investigating.

There's not much new this week, I've commented on a few existing
activities below and added "Aug 5" in most cases.

We did talk about the YP SWAT process and trying to get people
to all follow the same workflow and for the people who are working
on reporting and analysis tools to understand what SWAT does.
Alex is going to think about it and come up with a plan.

If anyone wants to help, we could use more eyes on the logs,
particularly the summary logs and understanding iostat #
when the dd test times out.

I moved Michael to BCC here and
I'll drop him next week unless asked to do otherwise.

Plans for the week:

All: Wait and see if the ptest failure rate continues to be lower
than previous weeks.

Alex: SWAT plans.
Sakib: hook more responsive load average in to latency test. (v3)
Trevor: patch to set PARALLEL_MAKE : -l 50
-> dunfell, gatesgarth, hardknott (Aug 5 - it's a priority)
Investigate dunfell which failed with this change.
Randy: Look at performance data

Meeting Notes:

1. job server

- ninja could be patched with make's more responsive algorithm
next or is this good enough?

- Richard suggested that we extract make's code for measuring the load
average to a separate binary and run it in the periodic io latency
test. Also can we translate it to python?
- Trevor is working on this and had some problems so next week.

2. AB status

Trevor is learning about buildbot and working on a scheduling bug
(CentOS worker?)

bitbake layer setup tool should allow multiple backends:
eg: kas, a y-a-helper.

ptest cases are improving, we may be close to done!
Let's wait a week to see how things go.
(July29, Aug 5, we're not done...)

- development week with lots of failures and a-quick builds
so it's hard to say.

- lttng timeouts are still happening so RP is going to increase
timeout for all ptests from 300, 450. (Aug 5, timeout bumped)

3. Sakib's improvements to the logging are merged.

Sakib generated a summary of all high latency 'top' logs from
~July 23->July 29 by just running his summary script on the
merged raw top logs.

<snip last week's summary of summaries text>

More analysis required....

Still relevant parts of
Previous Meeting Notes:

4. bitbake server timeout ( no change july 29)

"Timeout while waiting for a reply from the bitbake server (60s)"

Clearly the YP ABs aren't running in docker but what
about firmware and kernel tunings.

5. io stalls (no update: July 29)

Richard said that it would make sense to write an ftrace utility
/ script to monitor io latency and we could install it with sudo
Ch^W mentioned ftrace on IRC.
Sakib and Randy will work on that but not for a week or two.