[PATCH yocto-autobuilder-helper 1/4] config.json: add "collect-data" template -- Build: 20210420-1


Randy MacLeod
 

Summary:

Build: 20210420-1 had 23 triggers and isn't really worth
analyzing. Stay tuned for what is hopefully better data
with a higher timeout threshold.

There is a nice looking pair of graph attached though! :)

../Randy

On 2021-04-15 4:48 p.m., Randy MacLeod wrote:
On 2021-04-15 1:55 p.m., Randy MacLeod wrote:

Should we  increase the interval to 30, 60, ore more seconds?
I've bumped the interval to 60 seconds.


I spent some time looking at the first bit of data along with
Sakib and Saul from time to time.
General conclusions:
1. It seems like ALL triggers involve oe-selftest being active.
Still true.
2. xz might be a problem but we're not sure yet.
Added to graph attached.

3. We need more data and tools and time to think about it.


To Do:
1. increase top cmdline length from 512 to 16K
Still to do, maybe. It's not clear that it's useful yet.

2. sometimes we see:
Command '['oe-time-dd-test.sh', '100']' timed out after 10.0 seconds
That should not happen so we should understand why and either increase
the time between runs or fix the tooling. This seems to happen under load
so it's hiding the interesting data that we are looking for!
With a 60 second interval, this does NOT happen.

3. tail the cooker console in addition to top. Present that before top.
It would be nice to have a top equivalent for bitbake.
Still to do.
This can be done via the cooker log.
We can parse the file and do:
1. list recent tasks regardless of whether they have completed:
tail -20?
2. list tasks that have started but not completed.



This email is about:
https://autobuilder.yocto.io/pub/non-release/20210420-1/
It was a 'quick' build. There was only 1 log file produced.
It used the master so the timeout was still 5 seconds.

There were 23 (!!) times that the dd time exceeded the
15 (ACTUALLY 5) second limit out of a total of 1504 invocations
and those triggers were captured by 1 log file:
testresults/qemux86-world/2021-04-20--03-43/host_stats_0_top.txt


Splitting each top output into a separate file as before,
now we have 23 log files:

How big are these files, ie how many process/kernel threads were running
when top ran?

$ wc -l host-stats-0* | sort -n
699 host-stats-03:42:40--23.log
704 host-stats-02:20:41--5.log
720 host-stats-03:33:41--21.log
733 host-stats-02:19:40--4.log
743 host-stats-03:32:41--20.log
752 host-stats-02:25:45--6.log
753 host-stats-02:02:46--3.log
760 host-stats-03:36:07--22.log
784 host-stats-02:01:48--2.log
784 host-stats-03:30:43--18.log
802 host-stats-02:45:55--10.log
807 host-stats-03:31:42--19.log
816 host-stats-03:29:49--17.log
829 host-stats-03:19:50--15.log
845 host-stats-02:41:40--8.log
851 host-stats-02:00:57--1.log
899 host-stats-02:32:47--7.log
906 host-stats-03:18:41--14.log
925 host-stats-03:28:24--16.log
947 host-stats-02:42:41--9.log
1084 host-stats-03:16:50--13.log
1204 host-stats-02:54:43--11.log
1314 host-stats-03:03:41--12.log
19661 total



I noticed that several but not all log files were running xz:

$ for i in `ls host-stats-*`; do echo -n $i ": "; grep "xz " $i | wc -l; done
host-stats-02:00:57--1.log : 0
host-stats-02:01:48--2.log : 0
host-stats-02:02:46--3.log : 0
host-stats-02:19:40--4.log : 2
host-stats-02:20:41--5.log : 0
host-stats-02:25:45--6.log : 2
host-stats-02:32:47--7.log : 0
host-stats-02:41:40--8.log : 1
host-stats-02:42:41--9.log : 2
host-stats-02:45:55--10.log : 1
host-stats-02:54:43--11.log : 0
host-stats-03:03:41--12.log : 0
host-stats-03:16:50--13.log : 47
host-stats-03:18:41--14.log : 0
host-stats-03:19:50--15.log : 9
host-stats-03:28:24--16.log : 15
host-stats-03:29:49--17.log : 0
host-stats-03:30:43--18.log : 0
host-stats-03:31:42--19.log : 7
host-stats-03:32:41--20.log : 7
host-stats-03:33:41--21.log : 11
host-stats-03:36:07--22.log : 15
host-stats-03:42:40--23.log : 9


I've plotted the number of xz processes along with the load average
and the dd time in the attached graph.


All of the top output logs seems to be running oe-selftest:

for i in host-stats-0*; do grep -H -c "DISPLAY.*oe-selftest " $i ; done
host-stats-02:00:57--1.log:1
host-stats-02:01:48--2.log:1
host-stats-02:02:46--3.log:1
host-stats-02:19:40--4.log:1
host-stats-02:20:41--5.log:1
host-stats-02:25:45--6.log:1
host-stats-02:32:47--7.log:1
host-stats-02:41:40--8.log:2
host-stats-02:42:41--9.log:2
host-stats-02:45:55--10.log:2
host-stats-02:54:43--11.log:2
host-stats-03:03:41--12.log:2
host-stats-03:16:50--13.log:2
host-stats-03:18:41--14.log:2
host-stats-03:19:50--15.log:2
host-stats-03:28:24--16.log:2
host-stats-03:29:49--17.log:2
host-stats-03:30:43--18.log:2
host-stats-03:31:42--19.log:2
host-stats-03:32:41--20.log:2
host-stats-03:33:41--21.log:2
host-stats-03:36:07--22.log:2
host-stats-03:42:40--23.log:2



The logs DO seem to be clustered in that there are several 1 minute adjacent intervals where the dd time exceeded the threshold.

Data here but you can see this in the attached graph.

$ for i in host-stats-*; do echo -n $i ": "; head -1 $i | cut -c -15; done
host-stats-02:00:57--1.log : top - 02:00:57
host-stats-02:01:48--2.log : top - 02:01:48
host-stats-02:02:46--3.log : top - 02:02:46
host-stats-02:19:40--4.log : top - 02:19:40
host-stats-02:20:41--5.log : top - 02:20:41
host-stats-02:25:45--6.log : top - 02:25:45
host-stats-02:32:47--7.log : top - 02:32:47
host-stats-02:41:40--8.log : top - 02:41:40
host-stats-02:42:41--9.log : top - 02:42:41
host-stats-02:45:55--10.log : top - 02:45:55
host-stats-02:54:43--11.log : top - 02:54:43
host-stats-03:03:41--12.log : top - 03:03:41
host-stats-03:16:50--13.log : top - 03:16:50
host-stats-03:18:41--14.log : top - 03:18:41
host-stats-03:19:50--15.log : top - 03:19:50
host-stats-03:28:24--16.log : top - 03:28:24
host-stats-03:29:49--17.log : top - 03:29:49
host-stats-03:30:43--18.log : top - 03:30:43
host-stats-03:31:42--19.log : top - 03:31:42
host-stats-03:32:41--20.log : top - 03:32:41
host-stats-03:33:41--21.log : top - 03:33:41
host-stats-03:36:07--22.log : top - 03:36:07
host-stats-03:42:40--23.log : top - 03:42:40



*** Add time distribution of dd.
Oddly enough, the output of dd does not seem to be consistent:

102400 bytes (102 kB) copied, 0.0163975 s, 6.2 MB/s
102400 bytes (102 kB) copied, 0.0054732 s, 18.7 MB/s
102400 bytes (102 kB, 100 KiB) copied, 0.0984687 s, 1.0 MB/s
102400 bytes (102 kB, 100 KiB) copied, 0.304233 s, 337 kB/s

I'm not sure what's going on there, yet. Sakib?


I think the cooker output will be quite useful.
I'm not going to put any more time into this data set
since the time threshold is still 5 seconds.

All for now.

../Randy




../Randy

Cheers,

Richard
The template is not used anywhere, yet, the initial patchset enables the
data collection by default.

I have left the template in case the data collection is removed from
defaults and need to be used on a case by case basis.

I am not entirely sure if nesting templates work. I have not seen any
examples of it, neither did i try it myself. If nesting does work, the
template should be useful.
I had a quick look at the code and sadly, it doesn't appear I implemented
nesting so this wouldn't be that useful as things stand.

Cheers,

Richard




--
# Randy MacLeod
# Wind River Linux