Info for swat for hung build


Richard Purdie
 

I noticed:

https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/2779

has hung. Since this will be gone by the time anyone looks, here is what I found:
inspection on the system suggests it is hung in perl do_compile while building
core-image-minimal:

$ pstree -p 3934003
run.do_compile.(3934003)───make(3934023)───make(4035804)───make(909354)───true(909946)

909354 ? SN 0:00 make -C ext/XS-Typemap/ all PERL_CORE=1 LIBPERL=libperl.so.5.34.0 LINKTYPE=dynamic
909946 ? ZN 0:00 [true] <defunct>
3933716 ? SNs 0:02 python3 /home/pokybuild/yocto-worker/oe-selftest-centos/build/bitbake/bin/bitbake-worker decafbad
3934003 ? SN 0:00 /bin/sh /home/pokybuild/yocto-worker/oe-selftest-centos/build/build-st-3687560/tmp/work/core2-64-poky-linux/perl/5.34.0-r0/temp/run.do_compile.3933716
3934023 ? SN 0:00 make -j 16 -l 52
4035804 ? SN 0:00 make perl nonxs_ext utilities extensions pods

so make never reaped the true process or looked at its exit code?
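
A defunct child like the one above can also be found by scanning /proc for processes in state Z under a given parent. The following is a minimal Python sketch, assuming a Linux /proc layout; the helper names parse_stat and zombie_children are made up for illustration and are not part of any autobuilder or bitbake tooling.

```python
import os

def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    return int(pid_str), comm, fields[0], int(fields[1])

def zombie_children(parent_pid, proc_root="/proc"):
    """Return [(pid, comm)] for children of parent_pid that are in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except OSError:
            continue  # the process went away while we were scanning
        if ppid == parent_pid and state == "Z":
            zombies.append((pid, comm))
    return zombies
```

Run on the worker as zombie_children(909354), this would be expected to report the defunct true process, assuming the process tree is still as shown.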

I think it was just starting to run wic tests (wic.Wic, not wic.Wic2).

Cheers,

Richard


Richard Purdie
 

On Sat, 2021-10-30 at 14:04 +0100, Richard Purdie via lists.yoctoproject.org
wrote:
I think it was just starting to run wic tests (wic.Wic, not wic.Wic2).
This was on centos8-ty-2.

$ gdb -p 909354
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-16.el8
Attaching to process 909354
Reading symbols from /usr/bin/make...Reading symbols from .gnu_debugdata for
/usr/bin/make...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
found)...done.
0x00007f79a1bc146d in pselect () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install make-4.2.1-11.el8.x86_64
(gdb) bt
#0 0x00007f79a1bc146d in pselect () from /lib64/libc.so.6
#1 0x000055851c9f700d in jobserver_acquire ()
#2 0x000055851c9f3935 in new_job ()
#3 0x000055851c9ffc07 in update_file ()
#4 0x000055851ca00055 in check_dep ()
#5 0x000055851c9fefe6 in update_file ()
#6 0x000055851ca00055 in check_dep ()
#7 0x000055851c9fefe6 in update_file ()
#8 0x000055851ca00055 in check_dep ()
#9 0x000055851c9fefe6 in update_file ()
#10 0x000055851ca00055 in check_dep ()
#11 0x000055851c9fefe6 in update_file ()
#12 0x000055851ca00055 in check_dep ()
#13 0x000055851c9fefe6 in update_file ()
#14 0x000055851ca004cf in update_goal_chain ()
#15 0x000055851c9e3f56 in main ()
(gdb)

CentOS 8 has make 4.2.1, which made me wonder whether this upstream make commit
is relevant:
https://git.savannah.gnu.org/cgit/make.git/commit/src/posixos.c?id=d79fe162c009788888faaf0317253b6f0cac7092

Cheers,

Richard


Richard Purdie
 

On Sat, 2021-10-30 at 14:14 +0100, Richard Purdie via lists.yoctoproject.org
wrote:
CentOS 8 has make 4.2.1, which made me wonder whether this upstream make commit
is relevant:
https://git.savannah.gnu.org/cgit/make.git/commit/src/posixos.c?id=d79fe162c009788888faaf0317253b6f0cac7092

Sending a SIGCHLD (signal 17) to the process made it exit, and the build started
moving again.
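
From a shell that nudge is simply kill -CHLD <pid>. The sketch below shows the same idea in Python, assuming Linux and using a throwaway child process as a stand-in for the stuck make. The real make was blocked in pselect(), not sigwait(), and CHILD, nudge and demo are illustrative names only.

```python
import os
import signal
import subprocess
import sys

# Stand-in for the stuck process: block SIGCHLD, announce readiness,
# then sleep in sigwait() until a SIGCHLD is delivered.
CHILD = r"""
import signal
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
print("ready", flush=True)
signal.sigwait({signal.SIGCHLD})
print("woken", flush=True)
"""

def nudge(pid):
    """Send SIGCHLD to pid, like `kill -CHLD <pid>` from a shell."""
    os.kill(pid, signal.SIGCHLD)

def demo():
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    # Wait for the handshake so the signal is not sent before it is blocked.
    assert proc.stdout.readline().strip() == "ready"
    nudge(proc.pid)
    woken = proc.stdout.readline().strip()
    returncode = proc.wait(timeout=10)
    proc.stdout.close()
    return woken, returncode
```

The handshake on "ready" matters: SIGCHLD is ignored by default, so a signal sent before the child has blocked it would simply be dropped.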

Logs show it was stuck in wic.Wic.test_bootloader_config.

That means it could be the above bug, with something getting messed up in the
job counting? Or the SIGCHLD from the exiting true process was somehow missed and
there is a race in make somewhere, but that seems less likely.
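
To make the job counting theory concrete: GNU make's POSIX jobserver is, in outline, a pipe preloaded with one byte per spare job slot. A make instance reads a byte before starting an extra job and writes it back when that job finishes. If a token is never returned, later callers end up waiting on an empty pipe, which is consistent with a process sitting in jobserver_acquire(). Below is a toy Python model of that accounting, not make's actual code; ToyJobserver and its methods are hypothetical names.

```python
import os

class ToyJobserver:
    """Toy model of a pipe-based jobserver with `slots` total job slots."""

    def __init__(self, slots):
        self.read_fd, self.write_fd = os.pipe()
        # Non-blocking reads let acquire() report "would block" instead of hanging.
        os.set_blocking(self.read_fd, False)
        # The caller owns one implicit slot, so only slots - 1 tokens go in the pipe.
        os.write(self.write_fd, b"+" * (slots - 1))

    def acquire(self):
        """Take one token, or return None where a real client would block."""
        try:
            return os.read(self.read_fd, 1)
        except BlockingIOError:
            return None

    def release(self, token):
        """Hand a token back so another job can start."""
        os.write(self.write_fd, token)

js = ToyJobserver(slots=3)
first = js.acquire()
second = js.acquire()
starved = js.acquire()     # None: both tokens are out
js.release(first)
recovered = js.acquire()   # succeeds again once a token comes back
```

In this model, a job that exits without calling release() permanently shrinks the pool, and once every token is lost each acquire() would block forever in a real client.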

Cheers,

Richard